This is a nice Java implementation of various LLMs that can also offload the maths to a GPU via TornadoVM. Looking at the code gives a good idea of the impedance mismatch between CPU and GPU programming, since the Java code covers both. Here are some miscellaneous notes.
GPU Terminology
In TornadoVM's KernelContext you'll see various references to Global, Group and Local IDs.
The global ID is the overall ID of the thread. Note that it can be virtualized - that is, it can be larger than the number of threads your hardware supports.
The group ID refers to a logical grouping of threads. Note that a warp is a physical (i.e. hardware-dependent) grouping of threads; work groups are made of integer multiples of warps. Warps always have 32 threads on NVIDIA hardware and execute in lockstep. Work groups, however, can execute their warps asynchronously.
If the threads of a warp diverge at an if/else statement, the two branches are executed sequentially and you lose parallelism!
Threads in a work group can share memory. The local ID is the ID of a thread within that group.
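Putting those three IDs together: for a one-dimensional kernel the relationship is globalId = groupId * workGroupSize + localId. Here's a tiny sketch (the field names are taken from TornadoVM's KernelContext as I understand it; the class and method names are mine):

import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.IntArray;

public class IdSketch {
    // For a 1D launch: globalIdx == groupIdx * localGroupSizeX + localIdx.
    public static void recordIds(KernelContext context, IntArray out) {
        int globalId = context.globalIdx;   // overall thread ID (may be virtualized)
        int localId = context.localIdx;     // position within the work group
        int groupId = context.groupIdx;     // which work group this thread belongs to
        out.set(globalId, groupId * context.localGroupSizeX + localId); // equals globalId
    }
}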
GPU algorithms
Writing algorithms is different in the world of GPUs. Take this Java/TornadoVM code, which is turned into GPU code:
context.localBarrier();
for (int stride = localWorkGroupSize / 2; stride > 0; stride >>= 1) {
    if (localId < stride) {
        localSum[localId] += localSum[localId + stride];
    }
    context.localBarrier();
}
is actually reducing an array to a single value using the GPU's threads. It starts with half the threads of a 32-thread warp: 16 threads on the first iteration, then 8, then 4 and so on. At each step, each active thread adds two elements of the array together, so only half as many threads are needed on the next iteration. The threads that are no longer needed are "masked", that is, not used.
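For context, here is roughly what a complete reduction kernel looks like with TornadoVM's KernelContext. This is only a sketch based on my reading of the API; the class, the name reduceLocal, the 256-element local buffer and the power-of-two work-group size are my own assumptions, not code from this project.

import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class ReductionSketch {
    // Each work group sums its slice of 'input' and writes one partial sum to
    // 'partialSums[groupIdx]'; the host (or a second kernel) adds those up.
    // Assumes the work-group size is a power of two and no bigger than 256.
    public static void reduceLocal(KernelContext context, FloatArray input, FloatArray partialSums) {
        int globalId = context.globalIdx;
        int localId = context.localIdx;
        int localWorkGroupSize = context.localGroupSizeX;

        // Local (shared) memory, visible to every thread in this work group.
        float[] localSum = context.allocateFloatLocalArray(256);
        localSum[localId] = input.get(globalId);

        // Tree reduction: halve the number of active threads each iteration.
        for (int stride = localWorkGroupSize / 2; stride > 0; stride >>= 1) {
            context.localBarrier();
            if (localId < stride) {
                localSum[localId] += localSum[localId + stride];
            }
        }

        if (localId == 0) {
            partialSums.set(context.groupIdx, localSum[0]);
        }
    }
}

The barrier inside the loop is what makes this safe: every thread in the group must see the previous round's sums before the next halving starts.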
Mapping TornadoVM to GPU concepts
There is a line in fusedQKVMatmulX that basically says:
if (context.globalId < input_array_size) ...
Yeah, but what if the maximum globalId (the actual thread ID) is much lower than the array size? Do we ignore the rest of the array?
The answer is no, because globalId is a virtual ID and does not represent the physical limits of your hardware. As it happens, my RTX 4070 (Laptop) has 4608 CUDA cores whereas the model I am running (Llama-3.2-1B-Instruct-F16) has a hidden size of 4096, so it seems the whole array can be covered without resorting to virtualization tricks.
The functions above don't generally have loops in them. The reason is that the loop is implicit: each GPU thread calls the function for one element of the data.
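As an illustration of that pattern, here is a minimal sketch of a per-element kernel with the guard above. It is not the project's fusedQKVMatmulX; the name scale and its parameters are made up for the example.

import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class GuardSketch {
    // One GPU thread handles one element; the loop over the array is implicit.
    public static void scale(KernelContext context, FloatArray in, FloatArray out, float factor, int size) {
        int i = context.globalIdx;   // (possibly virtualized) global thread ID
        if (i < size) {              // guard: threads past the end do nothing
            out.set(i, in.get(i) * factor);
        }
    }
}

On the CPU you would write for (int i = 0; i < size; i++) around the body; on the GPU that loop disappears and the guard takes its place.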
GraalVM and TornadoVM
Note that TornadoVM relies heavily on GraalVM. If you look at the stack, you'll see the code in PTXCompiler.emitFrontEnd exploiting GraalVM's ability to examine the bytecode of the functions mentioned above. It does this so it can convert them into PTX, the assembly-like code that NVIDIA GPUs run.
Consequently, you'll never see any breakpoints hit in these TransformerComputeKernelsLayered functions: the JVM never executes their bytecode, it only reads it as input to the GPU compiler.
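To make that concrete, here is a rough sketch of the host-side plumbing, based on my understanding of the TornadoVM API (the names HostSketch, scale, the graph and task names, and the work-group size are my own). The kernel is passed to the TaskGraph as a method reference; TornadoVM inspects its bytecode via GraalVM and compiles it for the GPU, but the JVM itself never invokes the method.

import uk.ac.manchester.tornado.api.GridScheduler;
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.WorkerGrid1D;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

public class HostSketch {
    // Same guarded, per-element kernel as sketched earlier.
    public static void scale(KernelContext context, FloatArray in, FloatArray out, float factor, int size) {
        int i = context.globalIdx;
        if (i < size) {
            out.set(i, in.get(i) * factor);
        }
    }

    public static void main(String[] args) {
        int size = 4096;
        FloatArray in = new FloatArray(size);
        FloatArray out = new FloatArray(size);
        in.init(1.0f);

        KernelContext context = new KernelContext();
        WorkerGrid1D worker = new WorkerGrid1D(size);   // number of global threads
        worker.setLocalWork(256, 1, 1);                 // work-group size
        GridScheduler grid = new GridScheduler("graph.scale", worker);

        TaskGraph graph = new TaskGraph("graph")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, in)
                // The method reference is never called on the JVM; TornadoVM
                // reads its bytecode and compiles it to GPU code instead.
                .task("scale", HostSketch::scale, context, in, out, 2.0f, size)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, out);

        ImmutableTaskGraph itg = graph.snapshot();
        new TornadoExecutionPlan(itg).withGridScheduler(grid).execute();
    }
}

A breakpoint set inside scale will hit if you call it yourself from plain Java, but never when it runs through the execution plan, because by then it is PTX running on the GPU.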