Saturday, October 26, 2024

NVIDIA Rapids

You can make Spark go faster by offloading some of the work to the GPU. There is an NVIDIA library (spark-rapids) to do this. 
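To give a flavour before diving in: enabling it is just ordinary Spark configuration. Something like the following sketch (the jar version, master URL and job name here are my own placeholders, not anything canonical):

# A sketch of enabling the RAPIDS plugin via spark-submit.
# The jar version and the job itself are placeholders - substitute your own.
spark-submit \
  --master local[1] \
  --jars rapids-4-spark_2.12-24.08.1.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  my_job.py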

A quick introduction

There are quite a few references to UCX (Unified Communication X) in Spark Rapids. Assuming you have the hardware, UCX enables remote direct memory access (RDMA): machines transfer data directly between each other's memory, bypassing the kernel.
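To actually exercise this, you swap in the RAPIDS shuffle manager. A sketch (the shuffle manager class name is Spark-version specific - the spark341 below assumes Spark 3.4.1 - so check the docs for your release):

# Sketch: pointing Spark at the UCX-backed RAPIDS shuffle manager.
# The spark341 in the class name must match your Spark version.
spark-submit \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark341.RapidsShuffleManager \
  --conf spark.rapids.shuffle.mode=UCX \
  ...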

"UVM or universal memory can allow main host memory to act essentially as swap for device(GPU) memory. This allows the GPU to process more data than fits in memory, but  can result in slower processing. This is an experimental feature." (from RapidsConf)

The spillStorageSize setting (spark.rapids.memory.host.spillStorageSize) is the "Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools."
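Both of these are set like any other Spark conf. A sketch (the 4g is an arbitrary example of mine, and key names can differ across releases, so check RapidsConf for your version):

# Sketch: the RapidsConf keys quoted above, as spark-submit flags.
# UVM is experimental; 4g is arbitrary (or use -1 as described above).
spark-submit \
  --conf spark.rapids.memory.uvm.enabled=true \
  --conf spark.rapids.memory.host.spillStorageSize=4g \
  ...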

Old Ubuntu

Unfortunately, when running a test on Ubuntu 18 (I know, I know), I saw:

Caused by: java.lang.UnsatisfiedLinkError: /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/integration_tests/target/tmp/cudf3561040550923512030.so)
...
        at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:246)

After installing the latest Ubuntu in a VM, it became apparent that you cannot access the GPU from inside VMware Workstation.

You can, however, use a Docker image instead. Just check that your Docker daemon has the NVIDIA runtime installed by running:

henryp@adele: docker info | grep Runtimes 
Runtimes: io.containerd.runc.v2 nvidia runc

or 

henryp@adele:~$ grep -A2 -i nvidia /etc/docker/daemon.json
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }

and pull a CUDA image from NVIDIA (the nvidia/cuda images on Docker Hub).

Now, I just need to run something like:

docker run --rm --gpus all --runtime=nvidia -it \
  -v /home/henryp/Code/Scala/SparkEcosystem/spark-rapids/:/home/henryp/Code/Scala/SparkEcosystem/spark-rapids/ \
  -v /home/henryp/Tools:/home/henryp/Tools \
  -v /home/henryp/.m2:/.m2 \
  -v /usr/local/bin/Java:/usr/local/bin/Java \
  --user root \
  nvidia/cuda:12.6.1-devel-ubi8 /bin/bash

and once I'm in, run:

yum -y install git diffutils rsync
export MAVEN_HOME=/home/henryp/Tools/Build/Maven/Latest
export PATH=$MAVEN_HOME/bin:$PATH
export JAVA_HOME=/usr/local/bin/Java/Latest17
export PATH=$JAVA_HOME/bin:$PATH

Reinstalling these tools every time I started a fresh container to build Rapids proved a bit tedious, so I extended the NVIDIA Docker image with this Dockerfile:

FROM nvidia/cuda:12.6.1-devel-ubi8

RUN yum -y install git diffutils rsync
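Then it's the usual Docker workflow (rapids-dev is just my arbitrary tag):

# Build the extended image and use it instead of the stock nvidia/cuda
# image in the docker run command above.
docker build -t rapids-dev .
docker run --rm --gpus all --runtime=nvidia -it ... rapids-dev /bin/bash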

Also, if I wanted to use mvnDebug [SO], I had to set the container's network to host using --network host [SO]. Then it's just a matter of running:

mvnDebug scalatest:test -DforkMode=never

and attaching a debugger from the host machine.
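mvnDebug listens on port 8000 by default and, because of --network host, that port is visible directly on the host. So, for example, with the JDK's command-line debugger:

# Attach to the JVM that mvnDebug started inside the container.
# Port 8000 is mvnDebug's default; --network host makes it reachable.
jdb -attach 8000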

Unfortunately, sometimes after I suspend and resume my laptop, the tests start failing with:

...
WindowedBlockIteratorSuite:
terminate called after throwing an instance of 'cudf::jni::jni_exception'
  what():  CUDA ERROR: code 999
/home/henryp/Tools/Build/Maven/Latest/bin/mvnDebug: line 36:   216 Aborted                 (core dumped) env MAVEN_OPTS="$MAVEN_OPTS" MAVEN_DEBUG_OPTS="$MAVEN_DEBUG_OPTS" "`dirname "$0"`/mvn" "$@"

Apparently, I must reboot my machine :(
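A lighter-weight workaround often suggested for CUDA error 999 after suspend/resume (an assumption on my part, not something I've confirmed for this setup) is to reload the UVM kernel module:

# Commonly suggested workaround for CUDA error 999 after suspend/resume.
# Requires that no process is still using the GPU.
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm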

Some miscellaneous code notes

Finally, I get to look at the code in action. Rapids adds RapidsExecutorPlugin (which extends the Spark ExecutorPlugin interface) and immediately initializes the GPU and memory (see GpuDeviceManager.initializeGpuAndMemory). Note that in setGpuDeviceAndAcquire we see a comment that says:
"cudaFree(0) to actually allocate the set device - no process exclusive required since we are relying on Spark to schedule it properly and not give it to multiple executors"
This is why the tests (for example, HashAggregatesSuite) are hard-coded to use just one CPU core.

Rapids then has a parallel set of classes that look a lot like the Spark classes representing columnar data. For example, there is a ColumnVector abstraction in both Spark and Rapids. The interesting Rapids one is GpuColumnVector - which implements the Spark interface - and it can be instantiated by a GpuShuffleExchangeExecBase. Amongst other things, objects of these classes contain the address of their off-heap data and a reference counter.

Still playing.

