Saturday, October 26, 2024

NVIDIA Rapids

You can make Spark go faster by offloading some of the work to the GPU. There is an NVIDIA library (spark-rapids) to do this. 

A quick introduction

There are quite a few references to UCX in Spark Rapids. Assuming you have the hardware, this allows remote direct memory access (RDMA), basically sharing data in memory that circumvents the kernel.

"UVM or universal memory can allow main host memory to act essentially as swap for device(GPU) memory. This allows the GPU to process more data than fits in memory, but  can result in slower processing. This is an experimental feature." (from RapidsConf)

The spillStorageSize is the "Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools."

Old Ubuntu

Unfortunately, when running a test on Ubuntu 18 (I know, I know) I saw:

Caused by: java.lang.UnsatisfiedLinkError: /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so)
...
        at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:246)

After installing the latest Ubuntu on VMware, it appears that you cannot access the GPU using VMware Workstation.

You can however use a Docker image. Just check that your Docker daemon has the NVIDIA runtime installed by running:

henryp@adele:~$ docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc

or 

henryp@adele:~$ grep -A2 -i nvidia /etc/docker/daemon.json
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }

and download an image from here.

Now, I just need to run something like:

docker run --rm --gpus all  --runtime=nvidia  -it   -v /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/:/home/henryp/Code/Scala/SparkEcosystem/spark-rapids/ -v /home/henryp/Tools:/home/henryp/Tools -v /home/henryp/.m2:/.m2  -v /usr/local/bin/Java:/usr/local/bin/Java  --user root  nvidia/cuda:12.6.1-devel-ubi8 /bin/bash

and once I'm in, run:

yum -y install git
yum -y install diffutils
yum -y install rsync
export MAVEN_HOME=/home/henryp/Tools/Build/Maven/Latest
export PATH=$MAVEN_HOME/bin:$PATH
export JAVA_HOME=/usr/local/bin/Java/Latest17
export PATH=$JAVA_HOME/bin:$PATH

Constantly installing some tools for Rapids to build proved a bit tedious, so I extended the NVIDIA docker image with this Dockerfile:

FROM nvidia/cuda:12.6.1-devel-ubi8

RUN yum -y install git
RUN yum -y install diffutils
RUN yum -y install rsync

Also, if I wanted to use mvnDebug [SO], I had to set the network of the container to host using --network host [SO]. Then, it's just a matter of running:

mvnDebug scalatest:test -DforkMode=never

and attaching a debugger from the host machine.

Unfortunately, sometimes when I close and re-open my laptop, the tests start failing with:

...
WindowedBlockIteratorSuite:
terminate called after throwing an instance of 'cudf::jni::jni_exception'
  what():  CUDA ERROR: code 999
/home/henryp/Tools/Build/Maven/Latest/bin/mvnDebug: line 36:   216 Aborted                 (core dumped) env MAVEN_OPTS="$MAVEN_OPTS" MAVEN_DEBUG_OPTS="$MAVEN_DEBUG_OPTS" "`dirname "$0"`/mvn" "$@"

Apparently, I must reboot my machine :(

Some miscellaneous code notes

Finally, I get to look at the code in action. Rapids adds RapidsExecutorPlugin (which extends the Spark ExecutorPlugin interface) that immediately initializes the GPU and memory (see GpuDeviceManager.initializeGpuAndMemory). Note that in setGpuDeviceAndAcquire we see a comment that says: 
"cudaFree(0) to actually allocate the set device - no process exclusive required since we are relying on Spark to schedule it properly and not give it to multiple executors"
This is why the tests (HashAggregatesSuite) are hard-coded to use just one CPU core.

Rapids then has a parallel set of classes that look a lot like the Spark classes that represent linear algebra structures. For example, there is a ColumnVector abstraction in both Spark and Rapids. The interesting Rapids one is GpuColumnVector, which extends Spark's ColumnVector and can be instantiated by a GpuShuffleExchangeExecBase. Amongst other things, objects of these classes contain the address of their off-heap data and a reference counter.
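To make the "address plus reference counter" idea concrete, here is a minimal sketch in plain Java. It is not Rapids or cuDF code: RefCountedBuffer and its methods are hypothetical names, and a direct ByteBuffer stands in for the raw off-heap (or device) address.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; not a class from Spark, spark-rapids or cuDF. */
public class RefCountedBuffer implements AutoCloseable {
    private ByteBuffer data;   // stands in for a raw off-heap address
    private int refCount = 1;  // the creator holds the first reference

    public RefCountedBuffer(int bytes) {
        this.data = ByteBuffer.allocateDirect(bytes);
    }

    /** Called by every additional owner that wants to keep the buffer alive. */
    public synchronized RefCountedBuffer incRefCount() {
        if (refCount == 0) {
            throw new IllegalStateException("buffer already released");
        }
        refCount++;
        return this;
    }

    /** Drops one reference; the memory is only given up when the last owner closes. */
    @Override
    public synchronized void close() {
        if (refCount == 0) {
            throw new IllegalStateException("closed more times than referenced");
        }
        if (--refCount == 0) {
            data = null; // a real implementation would free the native memory here
        }
    }

    public synchronized boolean isReleased() {
        return data == null;
    }
}
```

The point of the counter is that several consumers can share one off-heap allocation without copying it, and the last one to call close() triggers the release.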

Still playing.


Saturday, October 5, 2024

Optimising GPU code

I complained to Juan Fumero that a benchmark indicated that the GPU was not giving much of a performance improvement. JMH reported the GPU being a moderate 20% faster than the CPU:

tornado -jar tornado-benchmarks/target/jmhbenchmarks.jar uk.ac.manchester.tornado.benchmarks.sgemv.JMHSgemV
...
Benchmark              Mode  Cnt         Score         Error  Units
JMHSgemV.sgemVJava     avgt    5  72366270.751 ± 5916807.539  ns/op
JMHSgemV.sgemVTornado  avgt    5  57583087.103 ± 2523449.341  ns/op

(SGEMM is single precision general matrix multiplication. GEMV indicates that we're multiplying a matrix with a vector).
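For reference, a naive single-threaded SGEMV is only a few lines. This is a sketch of the operation itself, not the TornadoVM benchmark code; the class and method names are made up for illustration and the matrix is assumed to be stored row-major in a flat array.

```java
import java.util.Arrays;

public class SgemvSketch {
    /** Computes y = A * x, where A is rows x cols, stored row-major in a flat array. */
    public static float[] sgemv(float[] a, float[] x, int rows, int cols) {
        float[] y = new float[rows];
        for (int i = 0; i < rows; i++) {
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += a[i * cols + j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4}; // the 2x2 matrix [[1, 2], [3, 4]]
        float[] x = {1, 1};
        System.out.println(Arrays.toString(sgemv(a, x, 2, 2))); // prints [3.0, 7.0]
    }
}
```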

Juan replied that I should try TornadoVM's --enableProfiler console switch and see where the time was being spent. Sure enough, COPY_IN_TIME was ~28ms, about the same as TOTAL_KERNEL_TIME.

Note that the total kernel time is the time it takes the GPU to perform the computation and the total kernel dispatch time is the time it takes to schedule the kernel (ie, the function being executed). In this case, dispatch time is ~6us - three orders of magnitude smaller than the execution time.

Juan also said that "Matrix Vector is not as compute intensive as other applications", so instead I tried the matrix/matrix multiplication. Here, the GPU shines:

Benchmark              Mode  Cnt           Score         Error  Units
JMHSgemm.sgemmJava     avgt    5  1773297262.188 ± 4115731.439  ns/op
JMHSgemm.sgemmTornado  avgt    5     8478409.506 ±  246919.368  ns/op

That makes the GPU about 200 times faster than the CPU (roughly 209x going by the mean scores). Now COPY_IN_TIME is about 1ms and TOTAL_KERNEL_TIME is about 5.5ms.
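The arithmetic behind these comparisons is simple enough to check by hand. The sketch below uses the JMH mean scores and the approximate profiler timings quoted in this post (the helper names are made up). It gives a speedup of roughly 1.26x for SGEMV and roughly 209x for SGEMM, with the copy-in taking about half of the combined copy-in plus kernel time for SGEMV but only about 15% for SGEMM. That fits Juan's remark: matrix/vector does on the order of n² operations on n² matrix elements, whereas matrix/matrix does on the order of n³ operations on the same order of data, so the transfer cost is amortised far better.

```java
public class BenchmarkArithmetic {
    /** How many times faster the candidate is than the baseline (same units for both). */
    public static double speedup(double baselineNs, double candidateNs) {
        return baselineNs / candidateNs;
    }

    /** Fraction of (copy-in + kernel) time that is spent copying data to the device. */
    public static double copyShare(double copyInMs, double kernelMs) {
        return copyInMs / (copyInMs + kernelMs);
    }

    public static void main(String[] args) {
        // JMH mean scores (ns/op) from the tables in this post
        System.out.printf("sgemv speedup: %.2f%n", speedup(72366270.751, 57583087.103));
        System.out.printf("sgemm speedup: %.2f%n", speedup(1773297262.188, 8478409.506));
        // Approximate profiler timings (ms) quoted in this post
        System.out.printf("sgemv copy share: %.2f%n", copyShare(28.0, 28.0));
        System.out.printf("sgemm copy share: %.2f%n", copyShare(1.0, 5.5));
    }
}
```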

Now we're talking. But continuing this optimization rampage, it's worth noting that "It has become tribal knowledge that the particular shapes chosen for matmuls has a surprisingly large effect on their performance." [Horace He] TL;DR: Horace He's article explains how the way small memory tiles fit onto a large matrix can hugely change performance. Basically, in a row-major MxN matrix, each row of N elements should occupy a whole number of GPU cache lines (that is, the row size in bytes should be a multiple of the cache line size) for best results.
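One practical reading of that alignment advice is to pad the number of columns so that every row starts on a cache line boundary. The helper below is a sketch of that rounding only; the name is hypothetical and the 128-byte cache line in the example is an assumption, so check the figure for your own GPU.

```java
public class PadColumns {
    /**
     * Rounds n (a column count of 4-byte floats) up to the next count whose
     * row size in bytes is a multiple of cacheLineBytes.
     */
    public static int paddedColumns(int n, int cacheLineBytes) {
        int floatsPerLine = cacheLineBytes / Float.BYTES;
        return ((n + floatsPerLine - 1) / floatsPerLine) * floatsPerLine;
    }

    public static void main(String[] args) {
        int cacheLineBytes = 128; // assumed value, not taken from the article
        System.out.println(paddedColumns(1000, cacheLineBytes)); // prints 1024
        System.out.println(paddedColumns(1024, cacheLineBytes)); // prints 1024
    }
}
```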

Changes in Java's memory

In the old days, we'd use sun.misc.Unsafe.allocateMemory to use off-heap memory. This code goes straight to the OS and asks for memory via os::malloc. But using Unsafe is bad practice. Not only is it specific to a particular flavour of JVM, it allows access to raw memory. The latter is "fine" if that memory is off-heap but if you are using it to access a Java object, the garbage collector can change its memory location without warning.

There are several modern alternatives. Since Java 9, java.lang.invoke.VarHandle has been the recommended alternative. It provides much of the same low-level access as Unsafe but with better safety and explicit control over memory visibility. That is, its access modes offer finer grained control than a simple volatile read or write: eg, opaque and acquire/release accesses impose weaker ordering constraints than volatile.
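As a rough sketch of what those access modes look like in practice, the snippet below views a direct (off-heap) ByteBuffer as longs through a VarHandle. The class name is made up; the VarHandle and MethodHandles calls are standard JDK API since Java 9. Note that the index passed to this kind of handle is a byte offset, and that non-plain access modes need that offset to be suitably aligned.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VarHandleSketch {
    // Views a ByteBuffer as a sequence of longs in the platform's byte order.
    static final VarHandle LONG_VIEW =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    /** Returns {plain value, acquired value, volatile value, 1 if the CAS succeeded}. */
    public static long[] demo() {
        ByteBuffer offHeap = ByteBuffer.allocateDirect(32); // off-heap but bounds-checked

        LONG_VIEW.set(offHeap, 0, 42L);                     // plain write, no ordering guarantees
        LONG_VIEW.setRelease(offHeap, 8, 7L);               // release write
        long acquired = (long) LONG_VIEW.getAcquire(offHeap, 8); // acquire read
        LONG_VIEW.setVolatile(offHeap, 16, 1L);             // full volatile semantics
        boolean swapped = LONG_VIEW.compareAndSet(offHeap, 16, 1L, 2L); // atomic update

        long plain = (long) LONG_VIEW.get(offHeap, 0);
        long vol = (long) LONG_VIEW.getVolatile(offHeap, 16);
        return new long[] {plain, acquired, vol, swapped ? 1L : 0L};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```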

It's interesting to note that the high performing interoperability framework, Apache Arrow, does not use VarHandle. It still uses Unsafe as VarHandle has bounds checking etc that is slower than raw access.

Project Panama's Foreign Function & Memory API first appeared as a preview in Java 19 (JEP 424) and was finalized in Java 22 (JEP 454). It appears Apache Arrow doesn't use it because it's too new. If we run this code:

MemorySegment memorySegment = Arena.global().allocate(1024 * 1024 * 128, 8);
System.out.println(memorySegment.address());

then look for the address while it's still running in /proc/PID/maps (where PID is the ID of the Java process), we can see that the Linux OS now manages a new area of memory. For instance, when I ran it, the output was 0x7fbaccdbe010 and I can see in the maps pseudo file:

7fbaccdbe000-7fbad4dbf000 rw-p 00000000 00:00 0 

This represents the 128 megs of space plus 4096 bytes (presumably a page for metadata).

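Note that an arena in the general sense is a large chunk of memory that is managed in user space rather than the app code constantly calling the kernel requesting memory piecemeal. This is an optimization. In the FFM API, an Arena also controls the lifetime of the segments allocated from it: memory from Arena.global() is never freed, whereas a confined or shared arena frees its segments when it is closed.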
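For comparison with the global arena used above, here is a sketch of a confined arena whose memory is released as soon as the try block exits. It assumes a JDK where java.lang.foreign is final (Java 22 or later); the class and method names are made up for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ArenaSketch {
    /** Copies the values into native memory, sums them from there, then frees the memory. */
    public static float sumOffHeap(float[] values) {
        try (Arena arena = Arena.ofConfined()) {
            // byte size and byte alignment, the same overload used with Arena.global() above
            MemorySegment segment = arena.allocate((long) values.length * Float.BYTES, Float.BYTES);
            for (int i = 0; i < values.length; i++) {
                segment.setAtIndex(ValueLayout.JAVA_FLOAT, i, values[i]);
            }
            float sum = 0f;
            for (int i = 0; i < values.length; i++) {
                sum += segment.getAtIndex(ValueLayout.JAVA_FLOAT, i);
            }
            return sum;
        } // closing the arena frees the segment; using it afterwards would throw
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(new float[] {1f, 2f, 3.5f})); // prints 6.5
    }
}
```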

Now, since Java and (in practice) C/C++ are IEEE 754 compliant, and now they can pass native memory to each other, you can transparently pass floating point numbers between code bases and call C/C++ code from the JVM - no more need for JNI! (Interestingly, note that Python is often IEEE 754 compliant but it is not guaranteed to be.)
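A quick way to convince yourself of the bit-level compatibility is to write a Java float into off-heap memory and read the raw bits back. The sketch below uses a direct ByteBuffer so that it runs on any modern JDK (the class name is made up). The value 1.0f comes back as 0x3f800000, which is the IEEE 754 single-precision encoding that native code on typical platforms would see when reading a float at the same address.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FloatBits {
    /** Stores f in off-heap memory in native byte order and returns the 32 raw bits. */
    public static int nativeBitsOf(float f) {
        ByteBuffer buf = ByteBuffer.allocateDirect(Float.BYTES).order(ByteOrder.nativeOrder());
        buf.putFloat(0, f);
        return buf.getInt(0);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(nativeBitsOf(1.0f))); // prints 3f800000
    }
}
```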

It's interesting to note that the GPU enabled TornadoVM uses the java.lang.foreign package to move data to and from the GPU.