I mentioned previously that AtomicIntegerArray and its kin can be slow. I'm now a little closer to finding out why.
I dumped the assembly that HotSpot generated for my code, which has one thread writing to and one thread reading from an AtomicIntegerArray, using the JVM arguments:
-Xmx32g -server -XX:+PrintOptoAssembly -XX:CompileCommand=print,*Runnable -XX:CompileThreshold=10000
Using Java version:
$ /usr/java/jdk1.7.0_10/bin/java -version
java version "1.7.0_10"
Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
My code is testing reading and writing to an array structure. Initially, all values are 0.
The writer thread goes through the array setting the values to 1.
The reader thread follows the writer as best it can. If an element is 0 (that is, uninitialized) it spins waiting for the value to change to 1.
The Java code for reading looks something like this in the super class:
public abstract class AbstractReaderRunnable implements Runnable {
    protected int index = 0;
    protected int lap = 0;

    @Override
    public void run() {
        while (index < ARRAY_SIZE) {
            lap = elementAtIndex();
            while (lap == 0) {
                lap = elementAtIndex();
            }
            index++;
        }
    }

    protected abstract int elementAtIndex();
}
and in my particular subclass (that is, a class for which the array structure is an AtomicIntegerArray) looks like this:
class AtomicReaderRunnable extends AbstractReaderRunnable {
    private final AtomicIntegerArray array;

    @Override
    protected int elementAtIndex() {
        return array.get(index);
    }
.
.
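For readers who want to reproduce the set-up, the test described above can be sketched as one self-contained class. This is only an approximation of my harness: the class name AtomicArrayChase, the ARRAY_SIZE value and the run() helper are made up for illustration, and the writer and reader are folded into anonymous Runnables rather than the class hierarchy shown above.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch only: names and sizes are assumptions, not the original test code.
final class AtomicArrayChase {
    static final int ARRAY_SIZE = 1 << 16; // assumed size, chosen arbitrarily

    // Starts one writer and one reader, waits for both, and returns how many
    // elements the reader observed as 1.
    static int run() {
        final AtomicIntegerArray array = new AtomicIntegerArray(ARRAY_SIZE); // all 0
        final int[] seen = new int[1];

        Thread writer = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int index = 0; index < ARRAY_SIZE; index++) {
                    array.set(index, 1); // volatile write
                }
            }
        });
        Thread reader = new Thread(new Runnable() {
            @Override
            public void run() {
                int index = 0;
                while (index < ARRAY_SIZE) {
                    int lap = array.get(index); // volatile read
                    while (lap == 0) {          // spin until the writer gets here
                        lap = array.get(index);
                    }
                    seen[0] += lap;
                    index++;
                }
            }
        });

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join(); // join() makes the reader's update to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("elements read: " + run());
    }
}
```

The reader can never overtake the writer because it only advances once it has seen a non-zero value, so it should end up counting every element exactly once.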
Fairly simple. Now comes the hard part: the machine code.
Note: "In Intel syntax the first operand is the destination, and the second operand is the source whereas in AT&T syntax the first operand is the source and the second operand is the destination." [1]
HotSpot appears to print the assembly in Intel format by default, although you can change this [2].
069 movl R10, [R13 + #24 (8-bit)] # int ! Field com/phenry/concurrent/lockless/AbstractReaderRunnable.index
06d movq R8, [R11 + #16 (8-bit)] # ptr ! Field java/util/concurrent/atomic/AtomicIntegerArray.array
.
.
So, the R10 register holds our index field and R8 points to the int[] array in the Java AtomicIntegerArray class.
.
.
094 movl R11, [R8 + #24 + R10 << #2] # int
099 MEMBAR-acquire ! (empty encoding)
099
099 movl [R13 + #28 (8-bit)], R11 # int ! Field com/phenry/concurrent/lockless/AbstractReaderRunnable.lap
In English: here we multiply R10 by 4 by shifting it two bits to the left (<< #2), because ints are 4 bytes wide, and add this to the start of the array's data (R8 plus 24 bytes, presumably the array header) to find our memory address. The contents of this address we put into the R11 register. In turn, this is written to the memory address where we store our lap field in the Java world.
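The address arithmetic can be written out in Java as a small worked example. The names here are hypothetical, and the 24 is simply the "#24" from the listing, which I am assuming is the offset of element 0 from the start of the int[] object on this particular JVM configuration. It is not a value the Java specification guarantees.

```java
// Worked example of the effective-address calculation [R8 + #24 + R10 << #2].
// All names are illustrative; the constants are read off the listing above.
final class ElementAddress {
    // Assumption: offset of element 0 within the int[] object, taken from "#24".
    static final long ARRAY_BASE_OFFSET = 24;
    // ints are 4 bytes wide, so shifting the index left by 2 multiplies it by 4.
    static final int INDEX_SHIFT = 2;

    // arrayStart plays the role of R8, index the role of R10.
    static long addressOf(long arrayStart, int index) {
        return arrayStart + ARRAY_BASE_OFFSET + ((long) index << INDEX_SHIFT);
    }

    public static void main(String[] args) {
        // Element 3 of an array object that (hypothetically) starts at 0x1000.
        System.out.println(Long.toHexString(addressOf(0x1000L, 3)));
    }
}
```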
Note that HotSpot appears to have inlined methods [6]. The code generated by HotSpot is not necessarily isomorphic to the original Java code. Also note that it might compile the same code to machine code more than once, apparently depending on the usage profile.
Now, the thing of interest is the MEMBAR-acquire ! (empty encoding) line. This is not an x86 instruction and doesn't even take up an address of its own: the instruction after it has the same address (the left-most column). So, why is it there?
"Each CPU has its own peculiar memory-barrier instructions, which can make portability a challenge" [3]. On x86, HotSpot treats most membar operations as no-ops:
void LIR_Assembler::membar() {
  // QQQ sparc TSO uses this,
  __ membar( Assembler::Membar_mask_bits(Assembler::StoreLoad));
}
void LIR_Assembler::membar_acquire() {
  // No x86 machines currently require load fences
  // __ load_fence();
}
void LIR_Assembler::membar_release() {
  // No x86 machines currently require store fences
  // __ store_fence();
}
void LIR_Assembler::membar_loadload() {
  // no-op
  //__ membar(Assembler::Membar_mask_bits(Assembler::loadload));
}
void LIR_Assembler::membar_storestore() {
  // no-op
  //__ membar(Assembler::Membar_mask_bits(Assembler::storestore));
}
void LIR_Assembler::membar_loadstore() {
  // no-op
  //__ membar(Assembler::Membar_mask_bits(Assembler::loadstore));
}
void LIR_Assembler::membar_storeload() {
  __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
}
(from HotSpot's source code, which can be found in hotspot/src/cpu/x86/vm/c1_LIRAssembler_x86.cpp).
The membar method itself looks like this:
// Serializes memory and blows flags
void membar(Membar_mask_bits order_constraint) {
  if (os::is_MP()) {
    // We only have to handle StoreLoad
    if (order_constraint & StoreLoad) {
      // All usable chips support "locked" instructions which suffice
      // as barriers, and are much faster than the alternative of
      // using cpuid instruction. We use here a locked add [esp],0.
      // This is conveniently otherwise a no-op except for blowing
      // flags.
      // Any change to this code may need to revisit other places in
      // the code where this idiom is used, in particular the
      // orderAccess code.
      lock();
      addl(Address(rsp, 0), 0);// Assert the lock# signal here
    }
  }
}
(from assembler_x86.hpp in the HotSpot source code).
And true to the comment, we do see lines output like:
lock addl [rsp + #0], 0 ! membar_volatile
[1] http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
but it's in the writer code! The writer code looks like this:
070 movl R8, [RCX + #28 (8-bit)] # int ! Field com/phenry/concurrent/lockless/AbstractWriterRunnable.lap
.
.
078 movl R11, [RCX + #24 (8-bit)] # int ! Field com/phenry/concurrent/lockless/AbstractWriterRunnable.index
07c movq R10, [R10 + #16 (8-bit)] # ptr ! Field java/util/concurrent/atomic/AtomicIntegerArray.array
.
.
Similar to before we have a register that's our index field (R11) and the int[] array in the Java AtomicIntegerArray class (R10). Also similar to before, we calculate the address that this index points to in the array but this time we populate its value with R8 (our lap field in our Java code).
.
.
08e B8: # B4 B9 <- B7 Freq: 287336
08e MEMBAR-release ! (empty encoding)
08e
08e movslq R11, R11 # i2l
091 movl [R10 + #24 + R11 << #2], R8 # int
096 lock addl [rsp + #0], 0 ! membar_volatile
As before we have the (redundant) MEMBAR pseudo instruction but we also have the very significant lock prefix.
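At the Java level, these barriers are what give AtomicIntegerArray.set() and get() their volatile-style semantics: anything written before the set() is visible to a thread that sees the new value through get(). A minimal sketch of that guarantee follows. The class and variable names are made up, and the plain payload array is my addition for illustration; it is not part of the test above.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch: publishing a plain (non-volatile) write via set()/get().
final class PublishViaAtomicArray {
    static int demo() {
        final int[] payload = new int[1];                  // plain memory
        final AtomicIntegerArray flags = new AtomicIntegerArray(1);
        final int[] observed = new int[1];

        Thread reader = new Thread(new Runnable() {
            @Override
            public void run() {
                while (flags.get(0) == 0) {
                    // spin on the volatile read until the writer raises the flag
                }
                observed[0] = payload[0]; // ordered after the get() that saw 1
            }
        });
        reader.start();

        payload[0] = 42;  // plain store
        flags.set(0, 1);  // volatile store: may not be reordered before the plain store

        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return observed[0];
    }

    public static void main(String[] args) {
        System.out.println("reader observed " + demo());
    }
}
```

Under the Java Memory Model the reader is guaranteed to observe 42 here. On x86 the release and acquire halves of that guarantee cost nothing extra, which matches the empty encodings in the listings; only the StoreLoad fence after the store needs a real instruction.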
The reader code does not need anything comparable, since the writer's membar has flushed its store buffers. (Store buffers "enable our fast processors to run without blocking while data is transferred to and from the cache sub-system" [7]. These fences have more to do with instruction ordering, another side effect of the synchronization semantics, than with the cache subsystem.) Furthermore:
"If a core wants to read some memory, and it does not have it ... then it must make a read on the ring bus. It will then either be read from main-memory if not in the cache sub-systems, or read from L3 if clean, or snooped from another core if Modified. In any case the read will never return a stale copy from the cache sub-system, it is guaranteed to be coherent." [7]
Using a lock-prefixed instruction as a fence seems to be a standard way of synchronizing outside the JVM too [5]. So, why is it so slow? The JVM has some very clever code to avoid this heavy-weight approach that I will go into in a future post.
[2] http://stackoverflow.com/questions/9337670/hotspot7-hsdis-printassembly-intel-syntax
[7] "CPU Cache Flushing Fallacy" - Martin Thompson's blog.