You can make Spark go faster by offloading some of the work to the GPU; NVIDIA's spark-rapids library does exactly this.
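As a taster, and assuming you have a CUDA-capable GPU and the RAPIDS jar that matches your Spark version (the version numbers below are illustrative, not gospel), enabling it is mostly a matter of configuration:

$SPARK_HOME/bin/spark-shell \
  --jars rapids-4-spark_2.12-24.08.1.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true

With that in place, eligible parts of a SQL plan are transparently replaced with GPU implementations.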
A quick introduction
There are quite a few references to UCX in Spark Rapids. Assuming you have the hardware, UCX enables remote direct memory access (RDMA): nodes share data by reading and writing each other's memory directly, bypassing the kernel.
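Wiring it in appears to boil down to swapping in the RAPIDS shuffle manager and selecting UCX mode. A sketch - note that the RapidsShuffleManager package is specific to the Spark version, so the spark341 below is illustrative:

--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager \
--conf spark.rapids.shuffle.mode=UCX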
"UVM or universal memory can allow main host memory to act essentially as swap for device(GPU) memory. This allows the GPU to process more data than fits in memory, but can result in slower processing. This is an experimental feature." (from RapidsConf)
The spillStorageSize is the "Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools."
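The full key appears to be spark.rapids.memory.host.spillStorageSize, so, for example:

--conf spark.rapids.memory.host.spillStorageSize=4g

or =-1 to size it from the memory pools as described.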
Old Ubuntu
Caused by: java.lang.UnsatisfiedLinkError: /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so)
...
at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:246)
...
at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:246)
After installing the latest Ubuntu on VMWare, it appears that you cannot access the GPU using VMWare Workstation.
You can however use a Docker image - just check that your Docker daemon has the NVIDIA runtime installed by running:
henryp@adele:~$ docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
or
henryp@adele:~$ grep -A2 -i nvidia /etc/docker/daemon.json
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
and download a CUDA image from NVIDIA's repository on Docker Hub.
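It's worth a quick sanity check that a container can actually see the GPU before doing anything heavier:

docker run --rm --gpus all --runtime=nvidia nvidia/cuda:12.6.1-devel-ubi8 nvidia-smi

If that prints the familiar nvidia-smi table rather than an error, you're in business.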
Now, I just need to run something like:
docker run --rm --gpus all --runtime=nvidia -it \
  -v /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/:/home/henryp/Code/Scala/SparkEcosystem/spark-rapids/ \
  -v /home/henryp/Tools:/home/henryp/Tools \
  -v /home/henryp/.m2:/.m2 \
  -v /usr/local/bin/Java:/usr/local/bin/Java \
  --user root nvidia/cuda:12.6.1-devel-ubi8 /bin/bash
and once I'm in, run:
yum -y install git
yum -y install diffutils
yum -y install rsync
Constantly installing the tools Rapids needs to build proved a bit tedious, so I extended the NVIDIA Docker image with this Dockerfile:

FROM nvidia/cuda:12.6.1-devel-ubi8
RUN yum -y install git
RUN yum -y install diffutils
RUN yum -y install rsync

Once inside the container, I point the build at my (mounted) Maven and JDK installations:

export MAVEN_HOME=/home/henryp/Tools/Build/Maven/Latest
export PATH=$MAVEN_HOME/bin:$PATH
export JAVA_HOME=/usr/local/bin/Java/Latest17
export PATH=$JAVA_HOME/bin:$PATH
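Building and tagging the extended image is then a one-liner (cudarapids is just my name for it):

docker build -t cudarapids .

and the docker run command above can use cudarapids in place of nvidia/cuda:12.6.1-devel-ubi8.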
Also, if I wanted to use mvnDebug [SO], I had to set the network of the container to host using --network host [SO]. Then, it's just a matter of running:
mvnDebug scalatest:test -DforkMode=never
and attaching a debugger from the host machine.
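mvnDebug listens on port 8000 by default, so from the host it should be a case of pointing an IDE's remote-debug configuration - or plain jdb - at that port:

jdb -attach localhost:8000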
Unfortunately, sometimes when I close and re-open my laptop's lid (that is, suspend and resume it), the tests start failing with:
...
WindowedBlockIteratorSuite:
terminate called after throwing an instance of 'cudf::jni::jni_exception'
what(): CUDA ERROR: code 999
/home/henryp/Tools/Build/Maven/Latest/bin/mvnDebug: line 36: 216 Aborted (core dumped) env MAVEN_OPTS="$MAVEN_OPTS" MAVEN_DEBUG_OPTS="$MAVEN_DEBUG_OPTS" "`dirname "$0"`/mvn" "$@"
Apparently, I must reboot my machine :(
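(Error 999 is CUDA's catch-all "unknown error" and it seems to be common after suspend/resume. Before rebooting, it may be worth trying to reload the UVM kernel module - a workaround I've seen suggested but haven't verified myself:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

If rmmod complains that the module is in use, a reboot really is the only option.)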
Some miscellaneous code notes
Finally, I get to look at the code in action. Rapids adds RapidsExecutorPlugin (which extends the Spark ExecutorPlugin interface) that immediately initializes the GPU and memory (see GpuDeviceManager.initializeGpuAndMemory). Note that in setGpuDeviceAndAcquire we see a comment that says:
"cudaFree(0) to actually allocate the set device - no process exclusive required since we are relying on Spark to schedule it properly and not give it to multiple executors"
This is why the tests (HashAggregatesSuite) are hard-coded to use just one CPU core.
Rapids then has a parallel set of classes that look a lot like the Spark classes representing columnar data. For example, there is a ColumnVector abstraction in both Spark and Rapids. The interesting Rapids one is GpuColumnVector - which implements the Spark interface - and can be instantiated by a GpuShuffleExchangeExecBase. Amongst other things, objects of these classes contain the address of their off-heap data and a reference counter.
Still playing.