## Sunday, January 27, 2013

### Tail Recursion Optimization in Hotspot

Recursion can be more expensive than iterating over the same code.

Take this iterative method:

int iterativeCall(int count, int maxCalls) {
for (int i = 0 ; i < maxCalls ; i++) {
count ^= count +1;
}
return count;
}

And this recursive method:

int recursiveCall(int count, int countDown) {
if (countDown <= 0)
return count;
count ^= count +1;
return recursiveCall(count, countDown - 1);
}

And compare the two. Using Caliper, the difference looks something like this:

To make recursive loops perform more like iterative loops, various tricks can be employed to make the compiled code loop over a block rather than jump recursively.

I noticed Hotspot do something similar to this. It's not fully optimized because it appears to reduce rather than avoid recursions.

This is how I did it. If we now add a static main method to our class containing the code above:

public static void main(String[] args) {
RecursionVsIterativeBenchmark app = new RecursionVsIterativeBenchmark();
for (int i = 0 ; i < 1000 ; i++) {
app.recursiveCall(0, 1000);
}
}

and run the code with:

to make hotspot spit out the assembly code, we see:

{method}
- klass: {other class}
- this oop:          0x000000040d0e7970
- method holder:     'com/henryp/test/recursion/RecursionVsIterativeBenchmark'
- constants:         0x000000040d0e7078 constant pool [42] for 'com/henryp/test/recursion/RecursionVsIterativeBenchmark' cache=0x000000040d1372c0
- access:            0x81000000
- name:              'recursiveCall'
- signature:         '(II)I'
- max stack:         4
- max locals:        3
- size of params:    3
- method size:       17
- vtable index:      21
- i2i entry:         0x00007f8d0501d960
- compiled entry     0x00007f8d050dc877
- code size:         21
- code start:        0x000000040d0e7928
- code end (excl):   0x000000040d0e793d
- method data:       0x000000040d193b08
- checked ex length: 0
- linenumber start:  0x000000040d0e793d
- localvar length:   3
- localvar start:    0x000000040d0e794a
#
#  int ( com/henryp/test/recursion/RecursionVsIterativeBenchmark:NotNull *, int, int )
#
#r018 rsi:rsi   : parm 0: com/henryp/test/recursion/RecursionVsIterativeBenchmark:NotNull *
#r016 rdx   : parm 1: int
#r010 rcx   : parm 2: int
# -- Old rsp -- Framesize: 32 --
#r083 rsp+ 4: Fixed slot 1
#r082 rsp+ 0: Fixed slot 0
#
000   N94: # B1 <- 1="" block="" font="" freq:="" head="" is="" junk="" nbsp="">
000   movl    rscratch1, [j_rarg0 + oopDesc::klass_offset_in_bytes()] # compressed klass
decode_heap_oop_not_null rscratch1, rscratch1
cmpq    rax, rscratch1 # Inline cache check
jne     SharedRuntime::_ic_miss_stub
nop # nops to align entry point

nop # 12 bytes pad for loops and calls

020   B1: # B6 B2 <- 1="" block="" font="" freq:="" head="" is="" junk="" nbsp="">
020   # stack bang
pushq   rbp
subq    rsp, #16 # Create frame
02c   testl   RCX, RCX
02e   jle,s   B6  P=0.001002 C=198529.000000
02e
030   B2: # B7 B3 <- 0.998998="" b1="" font="" nbsp="" req:="">
030   movl    R11, RCX # spill
033   decl    R11 # int
036   movl    R8, RDX # spill
039   incl    R8 # int
03c   xorl    R8, RDX # int
03f   testl   R11, R11
042   jle,s   B7  P=0.001002 C=198529.000000
042
044   B3: # B5 B4 <- 0.997996="" b2="" font="" nbsp="" req:="">
044   movl    R11, RCX # spill
047   addl    R11, #-2 # int
04b   movl    RAX, R8 # spill
04e   incl    RAX # int
050   xorl    RAX, R8 # int
053   testl   R11, R11
056   jle,s   B5  P=0.001002 C=198529.000000
056
058   B4: # B8 B5 <- 0.996996="" b3="" font="" nbsp="" req:="">
058   addl    RCX, #-3 # int
05b   movl    RDX, RAX # spill
05d   incl    RDX # int
05f   xorl    RDX, RAX # int
061   nop # 2 bytes pad for loops and calls
063   call,static  com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17  L[0]=_ L[1]=_ L[2]=_
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17  L[0]=_ L[1]=_ L[2]=_
# com.henryp.test.recursion.RecursionVsIterativeBenchmark::recursiveCall @ bci:17  L[0]=_ L[1]=_ L[2]=_
# OopMap{off=104}
068
068   B5: # N94 <- 0.99998="" b3="" b4="" b6="" b7="" font="" nbsp="" req:="">
068   addq rsp, 16 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC

073   ret
073
074   B6: # B5 <- 0.00100237="" b1="" font="" nbsp="" req:="">
074   movl    RAX, RDX # spill
076   jmp,s   B5
076
078   B7: # B5 <- 0.00100137="" b2="" font="" nbsp="" req:="">
078   movl    RAX, R8 # spill
07b   jmp,s   B5
07b
07d   B8: # N94 <- 9.96996e-06="" b4="" font="" nbsp="" req:="">
07d   # exception oop is in rax; no code emitted
07d   movq    RSI, RAX # spill
080   addq rsp, 16 # Destroy frame
popq rbp

085   jmp     rethrow_stub
085

What does all this mean?

Well, I am not an assembler guru but it seems to me that this code is being partially rolled into a more iterative style

Let's go through it. Ints count and countDown are initially stored in registers rdx and rcx respectively. We know this from:

#r016 rdx   : parm 1: int
#r010 rcx   : parm 2: int

B1

(Note the stack bang - that is, preparing the stack - at B1 every time we enter this method. That can cause overhead.)

Registers "%rsp and %rbp typically hold the stack pointer and base pointer, as their names imply." [1]. So these lines push the old base pointer on the stack and then the stack pointer points 16 memory addresses lower (stacks progress "towards lower memory addresses" [2]), reserving that memory for us.

020    # stack bang

pushq   rbp
subq    rsp, #16 # Create frame

Next we test to see if our second argument (countDown or RCX) is 0

02c    testl   RCX, RCX

and if so, we jump out of the method by jumping to B6 (which then jumps to B5 that removes the stack from and leaves this code).

02e    jle,s   B6  P=0.001002 C=198529.000000

B2

Assuming countDown wasn't 0, we continue. This is the more common path as Freq is almost 1.0 (that is, a certainty).

030   B2: # B7 B3 <- b1="" b="" nbsp="">Freq: 0.998998
030    movl    R11, RCX # spill
033    decl    R11 # int

Here we put countDown into R11 (remember, RCX was our countDown value) and decrement this temporary store.

Next, we put our first argument, count, into a temporary store, R8. We increment this and XOR it with its old value.

036    movl    R8, RDX # spill
039    incl    R8 # int
03c    xorl    R8, RDX # int

And then we go back to countDown into R11 and check it's not 0, leaving the method if it is.

03f    testl   R11, R11
042    jle,s   B7  P=0.001002 C=198529.000000

If it is, we've just benefited from a recursion optimization.

B3

But it generally continues. Inside the same block of code, we're going to decrease countDown by 2 and do the same thing again:

044   B3: # B5 B4 <- 0.997996="" b2="" font="" nbsp="" req:="">
044    movl    R11, RCX # spill
047    addl    R11, #-2 # int

Exactly the same adding of 1 and XORing of count is done again but this time time with temporary storage in register RAX.

04b    movl    RAX, R8 # spill
04e    incl    RAX # int
050    xorl    RAX, R8 # int

And we again check countDown has not reached 0, ultimately dropping out of the routine if it has.

053    testl   R11, R11
056    jle,s   B5  P=0.001002 C=198529.000000

B4

I won't go into the details, but the same thing happens again. The original countDown is reduced by 3 this time and count is incremented and XORed with itself.

But this time, we do recurse with the values.

Finally

The rest of the paths end up at B5 through one path or another. (Note we poll for a garbage collection safepoint just before we leave the method. I spoke more about this here.)

Conclusion

Hotspot is making an attempt to partially tail optimize the code by making recursive code more iterative. Indeed, if the recursion was originally 2 deep, there would have been no recursion at all in the optimized code.

For all other recursions greater than 2, there is still a saving as there would be a recursion only every third calculation of count.

[1] Introduction to x64 Assembly, New York University

[2] Hacking Jon Erickson, p19.