Thursday, October 27, 2011

Java Disassembled

In my last post, I talked about how false sharing due to objects being on the same cache line can cause code to run much slower. Now, I'm going to show why

matrix[fixed_index][i] = value + 1;

is slower than

matrix[i][fixed_index] = value + 1;

This slowness is a property of your CPU rather than anything innate to Java, but we'll use the JDK's -XX:+PrintOptoAssembly flag and examine the generated assembler to show the problem.
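For the record, as far as I know -XX:+PrintOptoAssembly needs a debug build of HotSpot; the invocation is something like:

    java -XX:+PrintOptoAssembly com.henryp.lang.SimpleClassForDisassembly

(On a product JVM you'd instead use -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly together with the hsdis disassembler plugin.)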

(Please note that I am not a professional x86 assembler programmer so any input would be gratefully received.)

First, some preliminaries. Most of the following was gleaned from using JDK 1.7.0 on Linux:

1. Java multi-dimensional arrays are actually arrays of arrays. The inner arrays are independent objects that need not even have the same length, as this little snippet (mine, for illustration) shows:
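    int[][] jagged = new int[3][];          // an array of references to int arrays
    jagged[0] = new int[10];
    jagged[1] = new int[2];                 // the inner arrays can differ in length...
    jagged[2] = null;                       // ...or not exist at all
    System.out.println(jagged[1].length);   // prints 2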

2. The second location in the memory representing an array (array reference + 1) yields information as to its type. We can show this by looking at the assembly generated by this Java:
    private void myInstanceOf(Object[] array) {
        if (array instanceof Integer[]) {
            aPublicInt++;
        }
    }
[Aside: the field aPublicInt is exactly what its name suggests: a publicly accessible int. We need it because HotSpot may optimize our code away entirely if it decides that nothing observable changes.]

Anyway, the assembler it generates looks a little like this:
#r000 ecx   : parm 0: com/henryp/lang/SimpleClassForDisassembly:NotNull *
#r005 edx : parm 1: java/lang/Object *[int:>=0] *
# -- Old esp -- Framesize: 16 --
#r045 esp+12: return address
#r044 esp+ 8: pad2, in_preserve
#r043 esp+ 4: pad2, in_preserve
#r042 esp+ 0: Fixed slot 0
#
000 N45: # B1 <- BLOCK HEAD IS JUNK Freq: 1
000 CMP EAX,[ECX+4] # Inline cache check
JNE SharedRuntime::handle_ic_miss_stub
NOP
NOP
NOP

000
00c B1: # B5 B2 <- BLOCK HEAD IS JUNK Freq: 1
00c # stack bang
PUSHL EBP
SUB ESP,8 # Create frame
01a MOV EBX,[EDX + #4]
01d NullCheck EDX
01d
01d B2: # B4 B3 <- B1 Freq: 0.999999
01d CMPu EBX,precise klass [Ljava/lang/Integer;: 0xb76eeb68:Constant:exact *
023 Jne,us B4 P=0.100000 C=-1.000000
023
025 B3: # B4 <- B2 Freq: 0.899999
025 INC [ECX + #16] ! Field com/henryp/lang/SimpleClassForDisassembly.aPublicInt
028
028 B4: # N45 <- B3 B2 Freq: 0.999999
028 ADD ESP,8 # Destroy frame
POPL EBP
TEST PollPage,EAX ! Poll Safepoint

032 RET
032
033 B5: # N45 <- B1 Freq: 1.01328e-06
033 MOV EBP,ECX
035 MOV ECX,#-12
03a NOP # 1 bytes pad for loops and calls
03b CALL,static wrapper for: uncommon_trap(reason='null_check' action='make_not_entrant')
# com.henryp.lang.SimpleClassForDisassembly::myInstanceOf @ bci:1 L[0]=EBP L[1]=_ STK[0]=#NULL
# OopMap{ebp=Oop off=64}
040 INT3 ; ShouldNotReachHere
040

Verbose, isn't it?

(Note that this assembler was generated when the Java method was alternately passed an Integer[] and a String[] for many iterations. The JDK will generate different assembler depending on which paths through the method are most commonly taken.)
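For what it's worth, the driver was along these lines - a sketch from memory rather than the original code, with an arbitrary loop count:

    // Inside another method of SimpleClassForDisassembly: call myInstanceOf
    // often enough for HotSpot to compile it, alternating the argument type
    // so that both branches get profiled.
    Object[] integers = new Integer[1];
    Object[] strings  = new String[1];
    for (int i = 0 ; i < 1000000 ; i++) {
        myInstanceOf(integers);
        myInstanceOf(strings);
    }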

So what does this assembly code mean? The first line says that the ecx register holds a reference to this. The second line tells us that the edx register points to the argument passed to this method - the Object[].

So much for the references. The interesting code starts at label B1, where we create the stack frame for this method. EBP and ESP are special "32-bit registers that are used as pointers[:] the extended base pointer (EBP) and the extended stack pointer (ESP)...

"The EBP register (sometimes called the frame pointer (FP) or local base pointer (LB)) is used to reference variables in the current stack frame. Each stack frame contains the parameters to the function, its local variables, and two pointers that are necessary to put things back the way they were: the saved frame pointer (SFP) and the return address. The stack frame pointer is used to restore EBP to its previous value, and the return address is used to restore EIP [the address of the code currently executing] to the next instruction found after the function call." [1]

Breaking this code down, we see:
    PUSHL  EBP
(Push the old frame pointer value onto the stack).
    SUB    ESP,8    # Create frame
(Allocate 8 bytes on the stack. The stack starts high in the memory space and moves towards zero as more stack space is allocated so we SUBtract 8 bytes from the ESP).
01a       MOV    EBX,[EDX + #4]
01d NullCheck EDX
(Move the contents of the array's second word - remember, EDX points to our array - into the EBX register. The NullCheck is not an assembly instruction; it appears to be pseudo-code, since it has the same address - 01d - as the next line.)
01d       CMPu   EBX,precise klass [Ljava/lang/Integer;: 0xb76eeb68:Constant:exact *
023 Jne,us B4 P=0.100000 C=-1.000000
(Is EBX the same as the constant that represents an Integer array? If the unsigned (,us) comparison concludes they are not equal, go to B4. Otherwise...)
025       INC    [ECX + #16] ! Field com/henryp/lang/SimpleClassForDisassembly.aPublicInt
(Add one to this.aPublicInt, which occupies the memory address 16 bytes after that of this. Remember, the first line tells us that ECX points to this.)

Either way, we come to B4 where we undo all the work of allocating stack space and RETurn.

3. The third location in the memory holding an array (array reference + 2) yields information as to its length. We can show this by looking at the assembly generated by this Java:
    private static int myArraySize(Object[] array) {
        return array.length;
    }
The assembler generated here looks like this (ignoring all the stack frame manipulation etc that we saw in the previous example):
#r000 ecx   : parm 0: int[int:>=0]:exact *
.
.
.
00e MOV EAX,[ECX + #8]
011 NullCheck ECX
Here, ecx is the register that represents the argument (there is no reference to this since, as you'll notice, the Java method is static). We put the contents of the memory two slots (2 * 4 bytes = 8) after the array reference itself into EAX - the register that by convention holds the return value of any call.
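Putting these three points together, an array on this 32-bit JVM is a 12-byte header followed by its elements. A hypothetical helper (mine, not anything in the JDK) mirrors the arithmetic the compiled code performs:

    // Array layout on this 32-bit HotSpot JVM, as deduced in points 1-3:
    //   +0   mark word (the object header)
    //   +4   klass word (the array's type)
    //   +8   length
    //   +12  element 0, +16 element 1, ...
    private static int byteOffsetOfIntElement(int index) {
        return 12 + (index << 2);   // 12-byte header, then 4 bytes per int
    }

For example, byteOffsetOfIntElement(7) is 40 - an offset we'll meet again below.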

Returning to our matrix navigation code, let's start with this method:
    private int[][] myMatrixNavigation() {
        int size = 19;
        int matrix[][] = new int[size][27];
        for (int i = 0 ; i < size ; i++) {
            matrix[i][7] = 11;
        }
        return matrix;
    }

We run it enough times for HotSpot's JIT to compile it. The assembler it generates looks like this:
       SUB    ESP,24   # Create frame
XOR EBP,EBP
MOV ECX,precise klass [[I: 0xa062f798:Constant:exact *
MOV EDX,#19
MOV EDI,#27
MOV [ESP + #0],EDI
NOP # 1 bytes pad for loops and calls
CALL,static wrapper for: _multianewarray2_Java
# com.henryp.lang.SimpleClassForDisassembly::myMatrixNavigation @ bci:6 L[0]=_ L[1]=#19 L[2]=_ L[3]=_
# OopMap{off=52}
#checkcastPP of EAX
MOV EDX,#19

B2: # B27 B3 <- B1 B5 Loop: B2-B5 inner stride: not constant pre of N159
CMPu EBP,#19
Jnb,u B27

B3: # B28 B4 <- B2
MOV EDI,[EAX + #12 + EBP << #2]
MOV ECX,[EDI + #8]
NullCheck EDI

B4: # B26 B5 <- B3
CMPu ECX,#7
Jbe,u B26

B5: # B2 B6 <- B4
MOV [EDI + #40],#11
INC EBP
CMP EBP,#1
Jl,s B2 # Loop end

B6: # B16 B7 <- B5
SUB EDX,EBP
AND EDX,#-4
ADD EDX,EBP
CMP EBP,EDX
Jge,s B16
NOP # 6 bytes pad for loops and calls

B7: # B28 B8 <- B6 B15 Loop: B7-B15 inner stride: not constant main of N96
MOV ECX,[EAX + #12 + EBP << #2]
MOV EDI,[ECX + #8]
NullCheck ECX

B8: # B23 B9 <- B7
CMPu EDI,#7
Jbe,u B23

B9: # B28 B10 <- B8
MOV EDI,[EAX + #16 + EBP << #2]
MOV [ECX + #40],#11
.
.
.
Hugely abbreviated, it does something like this:
        XOR    EBP,EBP
(Set EBP to 0 - the exclusive-or of anything with itself is 0, a common optimisation. EBP will be our loop counter, the equivalent of i in the Java code above.)
        CALL,static  wrapper for: _multianewarray2_Java
(Call JVM code to create our array. Remember that the register EAX is used to return a value from a call so it's this that points to our array).
        MOV    EDI,[EAX + #12 + EBP << #2]
(Move the ith element of the outer array into EDI. Remember that the second word of an array (+4) is its type and the third (+8) its length, so +12 is the first element.)
        MOV    [EDI + #40],#11
(This element is itself an array. So, bearing in mind that array reference + 4 is the type, + 8 the length, + 12 the first element and so on, array reference + 40 must be the element with index 7 ((7 * 4) + 12 = 40). We set the contents of this memory address to 11, as our Java code says we should.)

Note that this memory address is likely to be far away from the reference of the first array (stored in the EAX register). Compare that to the assembly code generated by this Java method:
    private int[][] myMatrixNavigation2ndIndice() {
        int size = 19;
        int matrix[][] = new int[27][size];
        for (int i = 0 ; i < size ; i++) {
            matrix[7][i] = 11;
        }
        return matrix;
    }
This is very similar to before, but this time we're fixing the first index of the matrix and letting the second vary. I'll spare you all the gory assembler, but the relevant code boils down to this:

        MOV    EDI,[EAX + #40]
(Once more, EAX holds the reference to our outer array, and +40 is used to access the element with index 7.)
        MOV    [EDI + #12 + EBP << #2],#11
(That element is itself an array. So we traverse it with EBP as our index, setting each element's contents to 11, just as our Java code says we should.)

Now, the thing to note here is that if this code immediately above is multi-threaded, all the threads will traverse one contiguous chunk of memory, that is, [EDI + #12 + EBP << #2], since EBP is just our integer index.

In the first method we disassembled (myMatrixNavigation, where the second index is fixed), a thread will be writing to an element of whichever inner array's reference is found at [EAX + #12 + EBP << #2], not to that location itself. It is much less likely that this memory will be close to the memory written for any other value of EBP (in fact, for a sufficiently large array, I think it might be impossible). Since cache lines are typically only 64 bytes, contention for the same cache line is much less likely and so that first piece of Java code is faster.

[1] Jon Erickson, Hacking: The Art of Exploitation.

Wednesday, October 26, 2011

False Sharing

If multiple threads concurrently change the values in a matrix, which is faster: working through the matrix column-by-column or row-by-row? And why is there a difference?

Let's say we have a square, 2-dimensional matrix of ints that we model like so:
int SIZE    = 1000;
int ROWS = SIZE;
int COLUMNS = SIZE;

int[][] matrix = new int[ROWS][COLUMNS];

Technically, Java doesn't have multi-dimensional arrays. It has arrays of arrays. But for our purposes, let's define the first index as the row and the second as the column.

Now, let's say that at the same time our many threads change all the values in a given column with the following code:
    // change all the values in a fixed column of our matrix
    for (int i = 0 ; i < ROWS ; i++) {
        int value = matrix[i][fixed_index];
        matrix[i][fixed_index] = value + 1;
    }
and after this code has finished running, we change it slightly so that now the many threads change all the values in a given row.
    // change all the values in a fixed row of our matrix
    for (int i = 0 ; i < COLUMNS ; i++) {
        int value = matrix[fixed_index][i];
        matrix[fixed_index][i] = value + 1;
    }
This time, we've fixed the "row" index and let the "column" index vary so all values in a given row change.
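For the curious, my harness boiled down to something like this sketch - reconstructed, not the original code. It treats matrix, COLUMNS and fixed_index as fields of the test class, and assumes the enclosing method throws InterruptedException:

    final int THREADS = 100;
    final int ITERATIONS = 1000;
    Thread[] threads = new Thread[THREADS];
    for (int t = 0 ; t < THREADS ; t++) {
        threads[t] = new Thread(new Runnable() {
            public void run() {
                for (int n = 0 ; n < ITERATIONS ; n++) {
                    // the "fixed row" case; swap the indices for "fixed column"
                    for (int i = 0 ; i < COLUMNS ; i++) {
                        int value = matrix[fixed_index][i];
                        matrix[fixed_index][i] = value + 1;
                    }
                }
            }
        });
    }
    long start = System.currentTimeMillis();
    for (Thread thread : threads) thread.start();
    for (Thread thread : threads) thread.join();
    System.out.println((System.currentTimeMillis() - start) + "ms");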

The results on my MacBook Pro (Intel Core 2 Duo) running 100 threads each performing 1000 iterations look like this:

Run    Fixed column    Fixed row
1      209             91
2      194             78
3      201             92
4      199             60
5      228             63

(Mean times in ms.)

As you can see, changing all the values in one of our rows is much faster than changing all the values in one of our columns.

Remembering that there are really no rows and columns in Java (we're actually modeling our matrix using arrays of arrays), the easy way to remember this is the 3 Fs: when it comes to changing the values along a given index of a 2-D array, Fixing the First is Fastest.

This phenomenon is due to a mechanism called "false sharing" (there is a good podcast by Scott Meyers on the matter here).

False sharing slows concurrent writes to RAM. It occurs when one thread is writing to memory that is close to where another thread is writing. Robert Love defines it as "two or more objects mapping to the same cache line despite existing at different addresses in memory" (Linux Kernel Development, 3rd Edition).

The span of memory within which these two or more addresses must fall for this to happen is a cache line. Its exact size depends on the architecture, but it is typically 64 bytes.
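You don't need a matrix to see this. In the hypothetical sketch below (my code, not from the tests above), two threads increment two logically independent counters which, being adjacent fields, almost certainly share a cache line; pad them apart (say, with seven unused longs between them) and the run should get noticeably faster:

    public class FalseSharingDemo {
        static class Counters {
            volatile long a;   // two independent counters, but adjacent in memory -
            volatile long b;   // they will very likely land on the same cache line
        }

        public static void main(String[] args) throws InterruptedException {
            final Counters counters = new Counters();
            Thread one = new Thread(new Runnable() {
                public void run() { for (int i = 0; i < 100000000; i++) counters.a++; }
            });
            Thread two = new Thread(new Runnable() {
                public void run() { for (int i = 0; i < 100000000; i++) counters.b++; }
            });
            long start = System.currentTimeMillis();
            one.start(); two.start();
            one.join(); two.join();
            // Neither thread ever reads the other's counter, yet each write
            // invalidates the other core's copy of the shared line: false sharing.
            System.out.println((System.currentTimeMillis() - start) + "ms");
        }
    }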

What's happening in the above example is that the ints in a column of our matrix are evidently stored close enough together that cache line collisions occur, and the threads contend to write to that piece of RAM.

I'll write another post as to how we can prove this.

Tuesday, October 4, 2011

Synchronicity

I had a debate over a beer with a friend about volatile arrays. I forget who was trying to prove what, but I did come across an interesting quirk.

We were talking about how adding the volatile keyword to a reference means that all threads will see the same reference once it has been set. But what if that reference were to an array? Would the threads also agree on the elements of the array? (The answer is "not necessarily", as pointed out in my blog last year.)

But I had trouble demonstrating this with some simple, multi-threaded code. Annoyingly, my threads kept agreeing on the elements of the array. Maybe it was the chip architecture, I lazily thought.

Then it occurred to me: I had put some System.out.println statements in my code to help me investigate, and this, of course, was the problem. By introducing the logging, the threads were synchronizing on the same object - and as I mentioned in another blog post, this ensures "all threads see the most up-to-date values of shared mutable variables [when] the reading and writing threads ... synchronize on a common lock" (Java Concurrency in Practice, p37).

So, what was introducing this mysterious common lock? Why, System.out.println itself! The println method synchronizes on the PrintStream instance, so both threads were synchronizing on a common lock: the System.out object.
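Here's a stripped-down reconstruction of the quirk (my code and names, not the original experiment):

    public class VolatileArrayDemo {
        static volatile boolean[] flags = new boolean[1];   // the REFERENCE is volatile;
                                                            // the elements are not

        public static void main(String[] args) throws InterruptedException {
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    while (!flags[0]) {
                        // In theory, nothing guarantees this loop ever sees the write
                        // below. In practice it usually does - and all the more reliably
                        // with a println here, since both threads then synchronize on
                        // the common lock of System.out.
                        // System.out.println("still waiting");
                    }
                    System.out.println("saw the write");
                }
            });
            reader.start();
            Thread.sleep(100);   // let the reader start spinning
            flags[0] = true;     // an element write, made with no lock held
        }
    }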

This reminded me of a time when a colleague was wondering why adding System.out.println appeared to be altering the behaviour of a piece of code. Stumped, I just thought it might have introduced a timing issue to his multi-threaded code. Now I know better.

But it raises the question: how much code is out there that is not properly thread-safe, with only some ad hoc logging holding it together?