Sunday, September 22, 2024

GPU Notes

Minecraft is written in Java and uses JOCL to leverage the GPU [Oracle blog, 2020].
The organization of threads in CUDA terms (a JVM-flavored sketch follows this list):
  1. Thread: a single unit of execution --- each thread has its own private memory, called registers.
  2. Block: a group of threads --- all threads in a block have access to a shared memory [CUDA Terminology].
  3. Grid: a group of blocks --- all threads in a grid have access to [mutable] global memory and [immutable, global] constant memory.
  [Penny Xu's blog]
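
TornadoVM's KernelContext (more on it below) exposes this same hierarchy under OpenCL-flavored names. A minimal Java sketch, assuming KernelContext's documented fields and methods (globalIdx, localIdx, allocateFloatLocalArray, localBarrier); note that kernel parameter types vary by TornadoVM version, with newer releases preferring off-heap types such as FloatArray over plain arrays:

    import uk.ac.manchester.tornado.api.KernelContext;

    public class HierarchyDemo {
        // Each invocation of this method is one thread (a work-item in OpenCL terms).
        public static void copy(KernelContext kc, float[] in, float[] out) {
            int global = kc.globalIdx;  // index within the whole grid
            int local = kc.localIdx;    // index within the block (work-group)

            // Block-level shared memory: visible to all threads in the same block.
            float[] tile = kc.allocateFloatLocalArray(128);
            tile[local] = in[global];
            kc.localBarrier();          // synchronize the block

            // Plain local variables like `global` live in per-thread registers.
            out[global] = tile[local];
        }
    }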
Matrix multiplication is best done using tiles. The reason for this is that when you ask global memory for a single value, you actually get a contiguous chunk of them (a cache line or coalesced transaction). You could throw the extras away. Or, you could employ a clever optimization (described by Horace He here) where we process the surplus and save it for later, staging each tile in fast on-chip memory and reusing it for many output elements.
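
Here is a hedged sketch of the tiling idea using TornadoVM's KernelContext (square n x n matrices, with n assumed to be a multiple of TILE; the local-memory calls follow KernelContext's documented API, but the launch configuration via WorkerGrid2D/GridScheduler is omitted):

    import uk.ac.manchester.tornado.api.KernelContext;

    public class TiledMatMul {
        private static final int TILE = 16;

        // C = A * B, one work-item per output element, row-major storage.
        public static void multiply(KernelContext kc, float[] a, float[] b, float[] c, int n) {
            int row = kc.globalIdy;
            int col = kc.globalIdx;

            // Each work-group stages a TILE x TILE patch of A and B in local
            // (shared) memory, so the values that arrive alongside the one we
            // asked for get used instead of thrown away.
            float[] aTile = kc.allocateFloatLocalArray(TILE * TILE);
            float[] bTile = kc.allocateFloatLocalArray(TILE * TILE);

            float sum = 0.0f;
            for (int t = 0; t < n / TILE; t++) {
                aTile[kc.localIdy * TILE + kc.localIdx] = a[row * n + (t * TILE + kc.localIdx)];
                bTile[kc.localIdy * TILE + kc.localIdx] = b[(t * TILE + kc.localIdy) * n + col];
                kc.localBarrier();      // wait until the whole tile is loaded

                for (int k = 0; k < TILE; k++) {
                    sum += aTile[kc.localIdy * TILE + k] * bTile[k * TILE + kc.localIdx];
                }
                kc.localBarrier();      // don't overwrite a tile still in use
            }
            c[row * n + col] = sum;
        }
    }

Each loaded tile is read TILE times from fast local memory, so the bandwidth spent fetching it from global memory is amortized across many multiply-adds.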

TornadoVM

Oracle's Gary Frost calls TornadoVM "state of the art" [YouTube] when it comes to bringing GPU processing to the JVM.

"There are two ways to express parallelism with TornadoVM" [Juan Fumero] Annotations and "via an explicit parallel kernel API" using KernelContext

Interestingly, by passing the tornado launcher the --printKernel, --printBytecodes, and --fullDebug flags, you can get TornadoVM to print out all sorts of goodies, like the raw backend code it has generated.
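
For instance, something like the following (assuming the tornado launcher script is on your PATH; the exact class/module arguments depend on your version and install):

    tornado --printKernel --printBytecodes uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

--printKernel dumps the generated kernel source (OpenCL, PTX, or SPIR-V depending on the backend), while --printBytecodes shows the TornadoVM bytecodes that orchestrate data transfers and kernel launches.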

OpenCL Terminology

We can use a poker game to convey the idea of what is going on in OpenCL. Imagine a dealer and players who don’t interact with each other but who make requests to the dealer for additional cards.  

"In this analogy, the dealer represents an Open CL host, each player represents a device, the card table represents a context, and each card represents a kernel. Each player’s hand represents a command queue." [1]

"[Nested] Loops like this are common but inefficient. The inefficiency arises because each iteration requires a separate comparison and addition. Comparisons are time-consuming on the best of processors, but they’re especially slow on dedicated number-crunchers like graphic processor units ( GPU s). GPU s excel at performing the same operations over and over again, but they’re not good at making decisions. If a GPU has to check a condition and branch, it may take hundreds of cycles before it can get back to crunching numbers at full speed.

"One fascinating aspect of Open CL is that you don’t have to configure these loops in your kernel. Instead, your kernel only executes code that would lie inside the innermost loop. We call this individual kernel execution a work-item." [1]

"A work-group is a combination of work-items that access the same processing resources... Work-items in a work-group can access the same block of high-speed memory called local memory. Work-items in a work-group can be synchronized using fences and barriers." [1]

"One of the primary advantages of using Open CL is that you can execute applications using thousands and thousands of threads, called work-items." [1]

"Each block of local memory is specific to the work-items in a work-group." [1]

[1] Matthew Scarpino, OpenCL in Action (Manning, 2011)
