Wednesday, October 1, 2014

Unsafe Fences

The mysterious sun.misc.Unsafe class in JDK 1.8 has some recently added methods: loadFence(), storeFence() and fullFence(). The documentation is sparse, telling us only that they "ensure lack of re-ordering of" loads, stores and loads/stores before the fence "with loads or stores after the fence."
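For readers who have not touched Unsafe before, here is a minimal sketch of how application code can reach these three methods. The class name UnsafeFences and the helper getUnsafe() are made up for illustration; reading the private theUnsafe field reflectively is the usual workaround, and it assumes a JDK that still ships sun.misc.Unsafe with these methods (newer JDKs may emit deprecation warnings).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeFences {

    // Hypothetical helper: Unsafe.getUnsafe() is not meant for ordinary
    // application classes, so we read the private static singleton instead.
    static Unsafe getUnsafe() throws ReflectiveOperationException {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe unsafe = getUnsafe();

        unsafe.loadFence();   // loads before the fence vs. loads or stores after it
        unsafe.storeFence();  // stores before the fence vs. loads or stores after it
        unsafe.fullFence();   // loads and stores before the fence vs. loads or stores after it

        System.out.println("fences executed");
    }
}
```

The calls have no visible effect in a single-threaded program like this one; the point is only to show where the methods live and how they are invoked.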

These load-, store- and full-fence native methods are defined in unsafe.cpp in the OpenJDK source code and simply defer to OrderAccess::acquire(), ::release() and ::fence() respectively.

OrderAccess::release() doesn't do a great deal:

inline void OrderAccess::release() {
  // Avoid hitting the same cache-line from
  // different threads.
  volatile jint local_dummy = 0;
}

It is inlined, so it is as if it were copied and pasted into every piece of code that calls it. When I put something similar into a (non-inlined) piece of C++ and disassemble it on x86_64 with objdump -d, it looks like this:

00000000000003dd <_Z19OrderAccess_releasev>:
 3dd: 55                   push   %rbp
 3de: 48 89 e5             mov    %rsp,%rbp
 3e1: c7 45 fc 00 00 00 00 movl   $0x0,-0x4(%rbp)
 3e8: 5d                   pop    %rbp
 3e9: c3                   retq   

The movl line is what stores 0 into a slot on the stack (all other lines are setting up and tearing down the stack frame). So, I can only conclude that the volatile in the release() method is there to stop the compiler from re-ordering things, as nothing special is happening at the hardware level. This appears to be what the OpenJDK source code suggests:

"Note: as of 6973570, we have replaced the originally static "dummy" field (see above) by a volatile store to the stack. All of the versions of the compilers that we currently use (SunStudio, gcc and VC++) respect the semantics of volatile here. If you build HotSpot using other compilers, you may need to verify that no compiler reordering occurs across the sequence point represented by the volatile access."

The only place in the JDK's source where I can see a call to loadFence() is in StampedLock. What this call appears to avoid is a write to a volatile variable made purely to obtain a LoadLoad memory barrier (that is, the data loaded in the second read is at least as new as the data loaded in the first). Why this LoadLoad is necessary is explained at this link here.

The LoadLoad is desired after reading a non-volatile variable but before reading a volatile variable. But no memory barrier is otherwise issued in a Normal Store/Volatile Load combination. So, one way to provide it is to write to a volatile variable. Unfortunately, this is expensive.
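To make the StampedLock case concrete, here is a minimal sketch of an optimistic read, modelled on the usual pattern for this class. The class OptimisticPoint and its fields are invented for illustration. The plain reads of x and y are the non-volatile loads, and validate() is where JDK 8 calls loadFence() before re-reading the lock's volatile state.

```java
import java.util.concurrent.locks.StampedLock;

public class OptimisticPoint {
    private final StampedLock lock = new StampedLock();
    private double x, y; // plain (non-volatile) fields guarded by the lock

    void move(double dx, double dy) {
        long stamp = lock.writeLock();
        try {
            x += dx;
            y += dy;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    double distanceFromOrigin() {
        long stamp = lock.tryOptimisticRead();
        double curX = x, curY = y;          // normal loads, no lock held
        if (!lock.validate(stamp)) {        // fence, then volatile load of the lock state
            stamp = lock.readLock();        // a writer intervened: fall back to a real read lock
            try {
                curX = x;
                curY = y;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return Math.sqrt(curX * curX + curY * curY);
    }

    public static void main(String[] args) {
        OptimisticPoint p = new OptimisticPoint();
        p.move(3.0, 4.0);
        System.out.println(p.distanceFromOrigin()); // prints 5.0
    }
}
```

Without the fence, the reads of x and y could in principle be ordered after the check of the lock state, so a stale value could pass validation.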

[Aside: the semantics of primitive memory fences operate between instructions. It does not make sense to associate a fence with a single instruction.]

An acquire fence is a LoadLoad and a LoadStore (see here for a good explanation). From OpenJDK's orderAccess.hpp, the comments tell us what this acquire does:

"Execution by a processor of acquire makes the effect of all memory accesses issued by it subsequent to the acquire visible to all processors *after* the acquire completes.  The effect of prior memory accesses issued by it *may* be made visible *after* the acquire.  I.e., prior memory accesses may float below the acquire, but subsequent ones may not float above it."

So, acquire issues the LoadLoad we are looking for (plus a LoadStore, which appears not to be needed in this case) and avoids an expensive write to a volatile.

A note on x86 architecture

LoadLoad, LoadStore and StoreStore seem to be no-ops on x86. This is what Doug Lea's JMM Cookbook says and, looking at the OpenJDK code, acquire() and release() do nothing beyond a dummy access to a local on the stack (they do, however, use volatile so the compiler should not do any re-ordering). So, calls to these methods should have no effect on this architecture even if they're needed on others.

The method fence() does, however, have an effect on x86, as it uses the lock assembly instruction.

Empirically, this seems to be the case. Using JMH, I put a simple class together that benchmarks the times for these three calls. Each method is called and nothing more:

Benchmark                               Mode  Samples  Score  Score error  Units
c.p.m.m.FenceBenchmarks.acquireFence    avgt        5  0.338        0.021  ns/op
c.p.m.m.FenceBenchmarks.fence           avgt        5  6.343        0.090  ns/op
c.p.m.m.FenceBenchmarks.releaseFence    avgt        5  0.340        0.013  ns/op

The results show fence() to be an order of magnitude slower than the other two methods.
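For anyone who wants a quick sanity check without setting up JMH, below is a rough hand-rolled timing loop. It is not the JMH benchmark that produced the numbers above, and the class FenceTiming and the method nanosPerCall() are invented for this sketch. It calls the Unsafe methods directly, assumes a JDK that still ships them, and includes the cost of the Runnable call itself, so treat the absolute numbers with suspicion and look only at the relative difference.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FenceTiming {
    static final Unsafe U = loadUnsafe();

    // Hypothetical helper: reads the private singleton reflectively.
    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Average wall-clock nanoseconds per call over the given number of calls.
    static double nanosPerCall(Runnable fence, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            fence.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        int n = 20_000_000;
        Runnable load = U::loadFence;
        Runnable store = U::storeFence;
        Runnable full = U::fullFence;

        // Warm-up pass so the JIT has compiled the loop before we measure.
        nanosPerCall(load, n);
        nanosPerCall(store, n);
        nanosPerCall(full, n);

        System.out.printf("loadFence  %.3f ns/op%n", nanosPerCall(load, n));
        System.out.printf("storeFence %.3f ns/op%n", nanosPerCall(store, n));
        System.out.printf("fullFence  %.3f ns/op%n", nanosPerCall(full, n));
    }
}
```

A harness like JMH remains the better tool, since it handles warm-up, forking and dead-code elimination far more carefully than a loop like this.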
