Agile Java Man: Tail Recursion Optimization in Hotspot

Recursion can be more expensive than iterating over the same code.

Take this iterative method:

int iterativeCall(int count, int maxCalls) {
    for (int i = 0; i < maxCalls; i++) {
        count ^= count + 1;
    }
    return count;
}

And this recursive method:

int recursiveCall(int count, int countDown) {
    if (countDown <= 0)
        return count;
    count ^= count + 1;
    return recursiveCall(count, countDown - 1);
}

And compare the two. Using Caliper, the difference looks something like this:

To make recursive loops perform more like iterative loops, various tricks can be employed to make the compiled code loop over a block rather than jump recursively.
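One such trick, done by hand, is classic tail-call elimination: because the recursive call is the very last thing recursiveCall does, it can be replaced by updating the parameters and looping. The sketch below only illustrates that idea and is not what HotSpot emits; the class name TailCallByHand and the method name loopedCall are made up for this example.

```java
class TailCallByHand {

    // The recursive method from above, unchanged apart from being static.
    static int recursiveCall(int count, int countDown) {
        if (countDown <= 0)
            return count;
        count ^= count + 1;
        return recursiveCall(count, countDown - 1);
    }

    // Hypothetical hand-written equivalent: the tail call becomes
    // "update the parameters and go round the loop again".
    static int loopedCall(int count, int countDown) {
        while (countDown > 0) {
            count ^= count + 1;
            countDown--;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(recursiveCall(0, 3)); // prints 7
        System.out.println(loopedCall(0, 3));    // prints 7
    }
}
```

No stack frame is created per step in loopedCall, which is the behaviour a full tail-call optimization would give us for free.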

I noticed HotSpot doing something similar to this. It is not a full optimization, because it appears to reduce the number of recursive calls rather than avoid them altogether.

Here is how I observed it. If we now add a static main method to the class containing the code above:

public static void main(String[] args) {
    RecursionVsIterativeBenchmark app = new RecursionVsIterativeBenchmark();
    for (int i = 0; i < 1000; i++) {
        app.recursiveCall(0, 1000);
    }
}

and run the code with:

/usr/java/jdk1.7.0.fastdebug/bin/java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,com.henryp.test.recursion.RecursionVsIterativeBenchmark -server -XX:+PrintAssembly -XX:CompileThreshold=90000 -cp /home/henryp/workspace/myPerformance/target/classes/:/home/henryp/.m2/repository/com/google/caliper/caliper/0.5-rc1/caliper-0.5-rc1.jar:/home/henryp/.m2/repository/com/google/code/findbugs/jsr305/1.3.9/jsr305-1.3.9.jar:/home/henryp/.m2/repository/com/google/code/gson/gson/1.7.1/gson-1.7.1.jar:/home/henryp/.m2/repository/com/google/guava/guava/11.0.1/guava-11.0.1.jar:/home/henryp/.m2/repository/com/google/code/java-allocation-instrumenter/java-allocation-instrumenter/2.0/java-allocation-instrumenter-2.0.jar:/home/henryp/.m2/repository/asm/asm/3.3.1/asm-3.3.1.jar:/home/henryp/.m2/repository/asm/asm-analysis/3.3.1/asm-analysis-3.3.1.jar:/home/henryp/.m2/repository/asm/asm-commons/3.3.1/asm-commons-3.3.1.jar:/home/henryp/.m2/repository/asm/asm-tree/3.3.1/asm-tree-3.3.1.jar:/home/henryp/.m2/repository/asm/asm-util/3.3.1/asm-util-3.3.1.jar:/home/henryp/.m2/repository/asm/asm-xml/3.3.1/asm-xml-3.3.1.jar com.henryp.test.recursion.RecursionVsIterativeBenchmark

to make HotSpot print the assembly code, we see:

{method}
- klass: {other class}
- this oop: 0x000000040d0e7970
- method holder: 'com/henryp/test/recursion/RecursionVsIterativeBenchmark'
- constants: 0x000000040d0e7078 constant pool [42] for 'com/henryp/test/recursion/RecursionVsIterativeBenchmark' cache=0x000000040d1372c0
- access: 0x81000000
- name: 'recursiveCall'
- signature: '(II)I'
- max stack: 4
- max locals: 3
- size of params: 3
- method size: 17
- vtable index: 21
- i2i entry: 0x00007f8d0501d960
- adapter: 0x00007f8d080b8f20
- compiled entry 0x00007f8d050dc877
- code size: 21
- code start: 0x000000040d0e7928
- code end (excl): 0x000000040d0e793d
- method data: 0x000000040d193b08
- checked ex length: 0
- linenumber start: 0x000000040d0e793d
- localvar length: 3
- localvar start: 0x000000040d0e794a
#
# int ( com/henryp/test/recursion/RecursionVsIterativeBenchmark:NotNull *, int, int )
#
#r018 rsi:rsi : parm 0: com/henryp/test/recursion/RecursionVsIterativeBenchmark:NotNull *
#r016 rdx : parm 1: int
#r010 rcx : parm 2: int
# -- Old rsp -- Framesize: 32 --
#r089 rsp+28: pad2, in_preserve
#r088 rsp+24: pad2, in_preserve
#r087 rsp+20: pad2, in_preserve
#r086 rsp+16: pad2, in_preserve
#r085 rsp+12: pad2, in_preserve
#r084 rsp+ 8: return address
#r083 rsp+ 4: Fixed slot 1
#r082 rsp+ 0: Fixed slot 0
#
000 N94: # B1 <- BLOCK HEAD IS JUNK  Freq: 1
000 movl rscratch1, [j_rarg0 + oopDesc::klass_offset_in_bytes()] # compressed klass
decode_heap_oop_not_null rscratch1, rscratch1
cmpq rax, rscratch1 # Inline cache check
jne SharedRuntime::_ic_miss_stub
nop # nops to align entry point

nop # 12 bytes pad for loops and calls

020 B1: # B6 B2 <- BLOCK HEAD IS JUNK  Freq: 1
020 # stack bang
pushq rbp
subq rsp, #16 # Create frame
02c testl RCX, RCX
02e jle,s B6 P=0.001002 C=198529.000000
02e
030 B2: # B7 B3 <- B1  Freq: 0.998998
030 movl R11, RCX # spill
033 decl R11 # int
036 movl R8, RDX # spill
039 incl R8 # int
03c xorl R8, RDX # int
03f testl R11, R11
042 jle,s B7 P=0.001002 C=198529.000000
042
044 B3: # B5 B4 <- B2  Freq: 0.997996
044 movl R11, RCX # spill
047 addl R11, #-2 # int
04b movl RAX, R8 # spill
04e incl RAX # int
050 xorl RAX, R8 # int
053 testl R11, R11
056 jle,s B5 P=0.001002 C=198529.000000
056
058 B4: # B8 B5 <- B3  Freq: 0.996996
058 addl RCX, #-3 # int
05b movl RDX, RAX # spill
05d incl RDX # int
05f xorl RDX, RAX # int
061 nop # 2 bytes pad for loops and calls
063 call,static com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17 L[0]=_ L[1]=_ L[2]=_
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17 L[0]=_ L[1]=_ L[2]=_
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17 L[0]=_ L[1]=_ L[2]=_
# OopMap{off=104}
068
068 B5: # N94 <- B3 B4 B6 B7  Freq: 0.99998
068 addq rsp, 16 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC

073 ret
073
074 B6: # B5 <- B1  Freq: 0.00100237
074 movl RAX, RDX # spill
076 jmp,s B5
076
078 B7: # B5 <- B2  Freq: 0.00100137
078 movl RAX, R8 # spill
07b jmp,s B5
07b
07d B8: # N94 <- B4  Freq: 9.96996e-06
07d # exception oop is in rax; no code emitted
07d movq RSI, RAX # spill
080 addq rsp, 16 # Destroy frame
popq rbp

085 jmp rethrow_stub
085

What does all this mean?

Well, I am no assembly guru, but it seems to me that the recursion is being partially unrolled into a more iterative style.

Let's go through it. Ints count and countDown are initially stored in registers rdx and rcx respectively. We know this from:

#r016 rdx : parm 1: int
#r010 rcx : parm 2: int

B1

(Note the stack bang at B1 every time we enter this method: HotSpot touches memory just beyond the new frame so that a stack overflow is detected up front. That adds some overhead to every call.)

Registers "%rsp and %rbp typically hold the stack pointer and base pointer, as their names imply." [1]. So these lines push the old base pointer onto the stack and then move the stack pointer 16 bytes lower (stacks progress "towards lower memory addresses" [2]), reserving that memory for us.

020    # stack bang

pushq rbp
subq rsp, #16 # Create frame

Next we test whether our second argument (countDown, held in RCX) is 0 or less

02c    testl RCX, RCX

and if so, we jump out of the method by jumping to B6 (which copies count into RAX as the return value and then jumps to B5, which tears down the stack frame and returns).

02e    jle,s B6 P=0.001002 C=198529.000000

B2

Assuming countDown was greater than 0, we continue. This is by far the more common path, as its Freq is almost 1.0 (that is, a near certainty).

030 B2: # B7 B3 <- B1  Freq: 0.998998

030    movl R11, RCX # spill
033    decl R11 # int

Here we put countDown into R11 (remember, RCX was our countDown value) and decrement this temporary store.

Next, we put our first argument, count, into a temporary store, R8. We increment this and XOR it with its old value.

036    movl R8, RDX # spill
039    incl R8 # int
03c    xorl R8, RDX # int
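In Java terms, those three instructions compute count ^ (count + 1): the old value stays in RDX while the incremented copy in R8 is XORed with it. A small sketch (the class and method names are invented for illustration) shows the values this step produces when starting from 0:

```java
class XorStep {

    // One step of the benchmark body: count ^= count + 1
    static int step(int count) {
        int incremented = count + 1;   // incl R8 (R8 started as a copy of count)
        return incremented ^ count;    // xorl R8, RDX
    }

    public static void main(String[] args) {
        int count = 0;
        for (int i = 0; i < 4; i++) {
            count = step(count);
            System.out.println(count); // prints 1, 3, 7, 15 on successive lines
        }
    }
}
```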

Then we go back to the decremented countDown in R11 and check whether it has reached 0, leaving the method (via B7) if it has.

03f    testl R11, R11
042    jle,s B7 P=0.001002 C=198529.000000

If we do leave here, we've just benefited from the optimization: one level of recursion was completed without making a call.

B3

But generally we continue. Inside the same compiled method, we take the original countDown, decrease it by 2 and do the same thing again:

044 B3: # B5 B4 <- B2  Freq: 0.997996
044    movl R11, RCX # spill
047    addl R11, #-2 # int

Exactly the same adding of 1 and XORing of count is done again, but this time with temporary storage in register RAX.

04b    movl RAX, R8 # spill
04e    incl RAX # int
050    xorl RAX, R8 # int

And we again check countDown has not reached 0, ultimately dropping out of the routine if it has.

053    testl R11, R11
056    jle,s B5 P=0.001002 C=198529.000000

B4

I won't go into the details, but the same thing happens again. The original countDown is reduced by 3 this time and count is incremented and XORed with its previous value.

But this time, we do make the recursive call with the new values.
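Putting B1 to B4 together, the compiled method behaves roughly like the Java below. This is a reading of the listing written back as source, not something HotSpot produces; the class name UnrolledShape, the method name unrolledCall and the locals c1, c2 and c3 (standing in for registers R8, RAX and RDX) are hypothetical.

```java
class UnrolledShape {

    // Original recursive method, kept for comparison.
    static int recursiveCall(int count, int countDown) {
        if (countDown <= 0)
            return count;
        count ^= count + 1;
        return recursiveCall(count, countDown - 1);
    }

    // Assumed shape of the compiled code: three steps per frame, then one call.
    static int unrolledCall(int count, int countDown) {
        if (countDown <= 0) return count;        // B1: test RCX, exit via B6
        int c1 = count ^ (count + 1);            // B2: first step, held in R8
        if (countDown - 1 <= 0) return c1;       //     exit via B7
        int c2 = c1 ^ (c1 + 1);                  // B3: second step, held in RAX
        if (countDown - 2 <= 0) return c2;       //     exit straight to B5
        int c3 = c2 ^ (c2 + 1);                  // B4: third step, held in RDX
        return unrolledCall(c3, countDown - 3);  //     the one real call
    }

    public static void main(String[] args) {
        // Both forms should agree for any countDown.
        System.out.println(recursiveCall(0, 1000) == unrolledCall(0, 1000)); // prints true
    }
}
```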

Finally

All the other paths end up at B5 one way or another, apart from the exception path at B8, which jumps to the rethrow stub. (Note that we poll for a garbage collection safepoint just before we leave the method. I spoke more about this here.)

Conclusion

HotSpot is making an attempt to partially tail-optimize the code, making the recursive code more iterative by handling several levels of recursion in a single frame. Indeed, if the recursion was originally only 2 deep, there would be no recursive call at all in the optimized code.

For all recursions deeper than 2, there is still a saving, as a real recursive call is made only on every third calculation of count.
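To put a number on that saving, the sketch below counts method invocations only and ignores the arithmetic. It assumes the shape seen in the listing (three steps per frame, then one call); the class and method names are made up for the example.

```java
class InvocationCount {

    // Invocations made by the plain recursive method: one per countDown value,
    // plus the final one that sees countDown <= 0.
    static int plain(int countDown) {
        return countDown <= 0 ? 1 : 1 + plain(countDown - 1);
    }

    // Invocations under the assumed compiled shape: a frame only calls again
    // when all three of its checks pass, that is when countDown is at least 3.
    static int unrolled(int countDown) {
        return countDown <= 2 ? 1 : 1 + unrolled(countDown - 3);
    }

    public static void main(String[] args) {
        System.out.println(plain(1000));    // prints 1001
        System.out.println(unrolled(1000)); // prints 334
    }
}
```

For the countDown of 1000 used in main above, that is roughly a third of the calls, and so roughly a third of the frame set-up, stack bang and safepoint poll overhead.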

[1] Introduction to x64 Assembly, New York University

[2] Hacking: The Art of Exploitation, Jon Erickson, p. 19.

Sunday, January 27, 2013