Wednesday, December 31, 2025

Debugging JNI calls to the GPU

I'm playing around with a Java-based LLM (code here). When I ran a JVM that calls the GPU through TornadoVM, it crashed, and in the log I saw:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1302b0]
C  [libcuda.so.1+0x332420]
C  [libtornado-ptx.so+0x64b8]  Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198
j  uk.ac.manchester.tornado.drivers.ptx.PTXStream.cuLaunchKernel([BLjava/lang/String;IIIIIIJ[B[B)[[B+0 tornado.drivers.ptx@2.2.1-dev
...

Now, having found the shared object files (*.so) named in the stack, I ran:

objdump -d /usr/lib/x86_64-linux-gnu/libcuda.so.1 
objdump -d /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so

and looked at the addresses in the stack dump.
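Picking the offsets out of the log by eye is error prone, so here is a minimal Python sketch that pulls the library name and offset out of each native frame. The names FRAME and parse_frame are my own, and it assumes the frame layout is exactly as in the log above. Where the JVM also printed symbol+offset, subtracting the two gives the offset at which that function should start in the library, which is a handy place to begin reading the disassembly.

```python
import re

# Matches native frames from the crash log such as:
#   C  [libtornado-ptx.so+0x64b8]  Java_..._cuLaunchKernel+0x198
# The symbol part is optional (libcuda is stripped, so it has none).
FRAME = re.compile(
    r"^C\s+\[(?P<lib>.+?)\+0x(?P<off>[0-9a-fA-F]+)\]"
    r"(?:\s+(?P<sym>\S+)\+0x(?P<symoff>[0-9a-fA-F]+))?"
)

def parse_frame(line):
    """Return (library, offset, symbol, function_start) or None if not a C frame."""
    m = FRAME.match(line.strip())
    if m is None:
        return None
    off = int(m["off"], 16)
    sym = m["sym"]
    # function_start is only known when the JVM printed symbol+offset
    start = off - int(m["symoff"], 16) if sym else None
    return m["lib"], off, sym, start

frames = [
    "C  [libcuda.so.1+0x1302b0]",
    "C  [libcuda.so.1+0x332420]",
    "C  [libtornado-ptx.so+0x64b8]  Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198",
]
for f in frames:
    lib, off, sym, start = parse_frame(f)
    print(lib, hex(off), sym, hex(start) if start is not None else "-")
```

For the libtornado-ptx.so frame this gives a function start of 0x64b8 - 0x198 = 0x6320.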

First, libtornado-ptx.so. Note that the address (0x64b8) is the return address of a call, that is, the instruction immediately after the call that went Pete Tong.

    64b3:       e8 b8 e1 ff ff          call   4670 <cuLaunchKernel@plt>
    64b8:       48 83 c4 30             add    $0x30,%rsp

So, it's the call to cuLaunchKernel that is interesting. Following it into libcuda.so.1, the next frame up (0x332420) is again a return address:
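As a sanity check on the "return address" reading, the call bytes can be decoded by hand. The sketch below (call_rel32 is my own helper name) assumes the instruction is the 5-byte x86-64 form: opcode e8 followed by a signed, little-endian 32-bit displacement that is relative to the address of the next instruction.

```python
import struct

def call_rel32(addr, raw):
    """Decode a 5-byte `call rel32` located at addr.

    Returns (return_address, call_target)."""
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    # Displacement is a signed 32-bit little-endian integer
    (disp,) = struct.unpack("<i", raw[1:])
    ret = addr + 5            # address of the instruction after the call
    return ret, ret + disp    # target is relative to the return address

# The bytes objdump printed at 64b3 in libtornado-ptx.so
ret, target = call_rel32(0x64b3, bytes.fromhex("e8b8e1ffff"))
print(hex(ret), hex(target))  # 0x64b8 0x4670
```

The return address matches the offset in the stack dump (0x64b8) and the target matches the cuLaunchKernel@plt stub at 4670.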

  33241b:       e8 00 de df ff          call   130220 <exit@plt+0x4e460>
  332420:       5a                      pop    %rdx

and the final (topmost) stack frame, where the address (0x1302b0) is the instruction that actually faulted rather than a return address:

  1302ab:       4d 85 e4                test   %r12,%r12
  1302ae:       74 58                   je     130308 <exit@plt+0x4e548>
  1302b0:       41 8b 04 24             mov    (%r12),%eax

The instruction test %x,%y is a common idiom in null checks. It ANDs x and y, throws the result away and sets the Zero Flag if that result is zero; je then jumps if the Zero Flag is set. When both operands are the same register, as here, the AND is just the register's own value, so the jump is taken exactly when %r12 is zero.
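To make the flag logic concrete, here is a tiny Python model of that test/je pair. It is only a sketch of the Zero Flag behaviour for 64-bit operands (the function names are mine) and ignores the other flags that test also updates.

```python
MASK64 = (1 << 64) - 1  # registers such as %r12 are 64 bits wide

def zero_flag_after_test(x, y):
    """Zero Flag after `test x, y`: set exactly when (x AND y) == 0."""
    return ((x & y) & MASK64) == 0

def je_taken_after_self_test(reg):
    """`test %reg,%reg` followed by `je`: the jump is taken only when reg == 0."""
    return zero_flag_after_test(reg, reg)

print(je_taken_after_self_test(0))               # True: null pointer, jump away
print(je_taken_after_self_test(0x7f0012345678))  # False: fall through to the mov
```

Note that two different non-zero operands can still set the flag (for example 0b01 and 0b10), which is why the null-check idiom always uses the same register twice.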

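So, it looks like we've essentially got the machine-code equivalent of a NullPointerException, or more precisely of a bad pointer: the je skips the mov when %r12 is zero, so a crash on the mov at 1302b0 suggests %r12 was non-zero but pointed at memory that could not be read. Still looking at where that pointer came from... [Solved: I had to use a model that is compatible with GPULlama3.java.]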
