Thursday, October 27, 2011

Java Disassembled

In my last post, I talked about how false sharing due to objects being on the same cache line can cause code to run much slower. Now, I'm going to show why

matrix[fixed_index][i] = value + 1;

is slower than

matrix[i][fixed_index] = value + 1;

This slowness is a property of your CPU rather than something innate to Java but we'll use the JDK with the -XX:+PrintOptoAssembly flag and examine the generated assembler code to show the problem.

(Please note that I am not a professional x86 assembler programmer so any input would be gratefully received.)

First, some preliminaries. Most of the following was gleaned from using JDK 1.7.0 on Linux:

1. Java multi-dimensional arrays are actually arrays of arrays.

2. The second location in the memory representing an array (array reference + 1) yields information as to its type. We can show this by looking at the assembly generated by this Java:
       private void myInstanceOf(Object[] array) {
if (array instanceof Integer[]) {
[Aside: the field aPublicInt is exactly that: an int that is publicly accessible. The reason we may need it is that Hotspot may optimize away our code if it considers that nothing noticeable changes.]

Anyway, the assembler it generates looks a little like this:
#r000 ecx   : parm 0: com/henryp/lang/SimpleClassForDisassembly:NotNull *
#r005 edx : parm 1: java/lang/Object *[int:>=0] *
# -- Old esp -- Framesize: 16 --
#r045 esp+12: return address
#r044 esp+ 8: pad2, in_preserve
#r043 esp+ 4: pad2, in_preserve
#r042 esp+ 0: Fixed slot 0
000 N45: # B1 <- BLOCK HEAD IS JUNK Freq: 1
000 CMP EAX,[ECX+4] # Inline cache check
JNE SharedRuntime::handle_ic_miss_stub

00c B1: # B5 B2 <- BLOCK HEAD IS JUNK Freq: 1
00c # stack bang
SUB ESP,8 # Create frame
01a MOV EBX,[EDX + #4]
01d NullCheck EDX
01d B2: # B4 B3 <- B1 Freq: 0.999999
01d CMPu EBX,precise klass [Ljava/lang/Integer;: 0xb76eeb68:Constant:exact *
023 Jne,us B4 P=0.100000 C=-1.000000
025 B3: # B4 <- B2 Freq: 0.899999
025 INC [ECX + #16] ! Field com/henryp/lang/SimpleClassForDisassembly.aPublicInt
028 B4: # N45 <- B3 B2 Freq: 0.999999
028 ADD ESP,8 # Destroy frame
TEST PollPage,EAX ! Poll Safepoint

032 RET
033 B5: # N45 <- B1 Freq: 1.01328e-06
035 MOV ECX,#-12
03a NOP # 1 bytes pad for loops and calls
03b CALL,static wrapper for: uncommon_trap(reason='null_check' action='make_not_entrant')
# com.henryp.lang.SimpleClassForDisassembly::myInstanceOf @ bci:1 L[0]=EBP L[1]=_ STK[0]=#NULL
# OopMap{ebp=Oop off=64}
040 INT3 ; ShouldNotReachHere

Verbose, isn't it?

(Note that this assembler was generated when the Java method was alternately passed an Integer[] and a String[] for many iterations. The JDK will generate other assembler depending on the which paths through the method are most commonly used.)

So what does this assembly code mean? The first line says the ecx register holds a reference to this. But the second line tells us the register edx points to an the argument that is passed to this method - the Object[].

So much for the references. The interesting code starts at the label B1. Here we create the stack for this method. EBP and ESP are special "32-bit registers that are used as pointers [,] the extended base pointer (EBP) and the extended stack pointer (ESP)...

"The EBP register (sometimes called the frame pointer (FP) or local base pointer (LB)) is used to reference variables in the current stack frame. Each stack frame contains the parameters to the function, its local variables, and two pointers that are necessary to put things back the way they were: the saved frame pointer (SFP) and the return address. The stack frame pointer is used to restore EBP to its previous value, and the return address is used to restore EIP [the address of the code currently executing] to the next instruction found after the function call." [1]

Breaking this code down, we see:
(Push the old frame pointer value onto the stack).
    SUB    ESP,8    # Create frame
(Allocate 8 bytes on the stack. The stack starts high in the memory space and moves towards zero as more stack space is allocated so we SUBtract 8 bytes from the ESP).
01a       MOV    EBX,[EDX + #4]
01d NullCheck EDX
(Move the contents of the second memory space of the array - remember, EDX points to our array - into the EBX register. The Nullcheck is not an assembly instruction. It appears to be pseudo code as it has the same address - 01d - as the next line.)
01d       CMPu   EBX,precise klass [Ljava/lang/Integer;: 0xb76eeb68:Constant:exact *
023 Jne,us B4 P=0.100000 C=-1.000000
(Is EBX the same as the constant that represent an Integer array? If the unsigned (,us) comparison concludes they are not equal, go to B4. Otherwise...)
025       INC    [ECX + #16] ! Field com/henryp/lang/SimpleClassForDisassembly.aPublicInt
(add one to this.aPublicInt which occupies the memory address 16 bytes after the memory address of this. Remember, the first line tells us that ECX points to this).

Either way, we come to B4 where we undo all the work of allocating stack space and RETurn.

3. The third location in the memory holding an array (array reference + 2) yields information as to its length. We can show this by looking at the assembly generated by this Java:
       private static int myArraySize(Object[] array) {
return array.length;
The assembler generated here looks like this (ignoring all the stack frame manipulation etc that we saw in the previous example):
#r000 ecx   : parm 0: int[int:>=0]:exact *
00e MOV EAX,[ECX + #8]
011 NullCheck ECX
Here, ecx is the register that represents the argument (there is no reference to this in the code as you'll notice the Java method is static). We put the contents of the address space 2 memory slots (2 * 4 bytes = 8) after the array reference itself into EAX. This is the register that by convention holds the return value of any call.

Returning to our matrix navigation code, let's start with this method:
       private int[][] myMatrixNavigation() {
int size = 19;
int matrix[][] = new int[size][27];
for (int i = 0 ; i < size ; i++) {
matrix[i][7] = 11;
return matrix;

And run it enough times for the JDK to generate assembly language. It generates assembler that looks like:
       SUB    ESP,24   # Create frame
MOV ECX,precise klass [[I: 0xa062f798:Constant:exact *
MOV [ESP + #0],EDI
NOP # 1 bytes pad for loops and calls
CALL,static wrapper for: _multianewarray2_Java
# com.henryp.lang.SimpleClassForDisassembly::myMatrixNavigation @ bci:6 L[0]=_ L[1]=#19 L[2]=_ L[3]=_
# OopMap{off=52}
#checkcastPP of EAX

B2: # B27 B3 <- B1 B5 Loop: B2-B5 inner stride: not constant pre of N159
CMPu EBP,#19
Jnb,u B27

B3: # B28 B4 <- B2
MOV EDI,[EAX + #12 + EBP << #2]
MOV ECX,[EDI + #8]
NullCheck EDI

B4: # B26 B5 <- B3
Jbe,u B26

B5: # B2 B6 <- B4
MOV [EDI + #40],#11
Jl,s B2 # Loop end

B6: # B16 B7 <- B5
Jge,s B16
NOP # 6 bytes pad for loops and calls

B7: # B28 B8 <- B6 B15 Loop: B7-B15 inner stride: not constant main of N96
MOV ECX,[EAX + #12 + EBP << #2]
MOV EDI,[ECX + #8]
NullCheck ECX

B8: # B23 B9 <- B7
Jbe,u B23

B9: # B28 B10 <- B8
MOV EDI,[EAX + #16 + EBP << #2]
MOV [ECX + #40],#11
Hugely abbreviated, it does something like this:
        XOR    EBP,EBP
(Set EBP to 0. Exclusive-or of anything on itself will be 0. This is an optimisation. EBP will be our loop counter - equivalent to i in the Java code above).
        CALL,static  wrapper for: _multianewarray2_Java
(Call JVM code to create our array. Remember that the register EAX is used to return a value from a call so it's this that points to our array).
        MOV    EDI,[EAX + #12 + EBP << #2]
(Move the ith element of the outer array into EDI. Remember that the second memory location of an array (+4) is its type, the third (+8) its length so +12 is the first element).
        MOV    [EDI + #40],#11
(This element is itself an array. So, bearing in mind that the array reference + 4 is the type, the array reference + 8 is the length, the array reference + 12 the first element and so on, the array reference + 40 must be the 7th ((7 * 4) + 12) element. We set the contents of this memory address to 11 as our Java code says we should).

Note that this memory address is likely to be far away from the reference of the first array (stored in the EAX register). Compare that to the assembly code generated by this Java method:
       private int[][] myMatrixNavigation2ndIndice() {
int size = 19;
int matrix[][] = new int[27][size];
for (int i = 0 ; i < size ; i++) {
matrix[7][i] = 11;
return matrix;
This is very similar to before but this time we're fixing the 2nd index of the matrix, not the first. I'll spare you all the gory assembler, but the relevant code boils down to this:

        MOV    EDI,[EAX + #40]
(Once more, EAX stores the reference to our array and the +40 is used to access the 7th element).
        MOV    [EDI + #12 + EBP << #2],#11
(The 7th element is itself an array. So, we traverse this array once more with EBP as our index, setting the memory contents to 11, just as our Java code says we should).

Now the thing to note here is that if this code immediately above is multi-threaded, all threads will traverse a contiguous chunk of memory, that is [EDI + 12 + EBP] since EBP is just our integer index.

In the first piece of Java code where the first index is fixed, a thread will be writing to some element in the array that is found at [EDI + 12 + EBP] not the location [EDI + 12 + EBP] itself. It is much less likely that this memory is going to be close to the memory location for any other value of EBP (in fact for a sufficiently large array I think it might be impossible). Since cache lines are typically only 64 bytes, contention for the same cache line is much less likely and so the first piece of Java code is faster.

[1] The Art of Exploitation - Jon Erickson.

No comments:

Post a Comment