Monday, January 20, 2025

Notes on GPUs and Spark

There are moves afoot to give Spark GPU access out of the box. From Project Hydrogen for Spark: "Although Spark supports [Kubernetes and YARN], Spark itself is not aware of GPUs exposed by them and hence Spark cannot properly request GPUs and schedule them for users. This leaves a critical gap to unify big data and AI workloads and make life simpler for end users."

To play with it, I tried to build Spark Rapids from NVIDIA. Unfortunately, the tests barfed with:

GpuDeviceManagerSuite:
*** RUN ABORTED ***
  java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-10-cuda11/target/libcudf/cmake-build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:181: Maximum pool size exceeded
  at ai.rapids.cudf.Rmm.allocInternal(Native Method)
  at ai.rapids.cudf.Rmm.alloc(Rmm.java:519)
  at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:147)
  at ai.rapids.cudf.DeviceMemoryBuffer.allocate(DeviceMemoryBuffer.java:137)
  at com.nvidia.spark.rapids.GpuDeviceManagerSuite.$anonfun$new$4(GpuDeviceManagerSuite.scala:57)
  at com.nvidia.spark.rapids.GpuDeviceManagerSuite.$anonfun$new$4$adapted(GpuDeviceManagerSuite.scala:52)
  at com.nvidia.spark.rapids.TestUtils$.withGpuSparkSession(TestUtils.scala:139)
  at com.nvidia.spark.rapids.GpuDeviceManagerSuite.$anonfun$new$3(GpuDeviceManagerSuite.scala:52)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)

OK, so let's use some NVIDIA tools to find out what's going on. Running nvidia-smi yields:

| NVIDIA-SMI 470.256.02   Driver Version: 470.256.02   CUDA Version: 11.4     |

Because that driver is too old, running the NVIDIA profiler, nsight-sys, only lets me profile the CPU, not the GPU (I'm using Ubuntu 20).

It seems [SO] that CUDA toolkit v12 requires a minimum compute capability (CC) of 5.0, and my Quadro T2000 has a compute capability of 7.5 [NVIDIA], so I should be fine once the driver is upgraded. (It seems that v535 of the NVIDIA driver may have some issues, though.)

But it was about this time that I bought a new laptop (a Lenovo ThinkPad P1 Gen 7, an awesome machine), which came with Ubuntu 22 preinstalled along with most of the software I needed. However, nsight-sys was barfing.

The actual error was: Cannot mix incompatible Qt library (5.15.3) with this library (5.15.2)

So, I reinstalled it from the NVIDIA website and now run the command line:

/usr/local/NVIDIA-Nsight-Compute-2024.3/ncu-ui

to see the Nsight Compute GUI. A few people on the forums suggest that installing the NVIDIA tools by hand makes them more reliable.

Anyway, I can now run this handy little script:

$ cat ~/bin/nvidia_prof 
/usr/local/NVIDIA-Nsight-Compute-2024.3/target/linux-desktop-glibc_2_11_3-x64/ncu --verbose --config-file off --export /tmp/nvidia.log --force-overwrite "$@"

by appending the command I want to profile as its arguments. I can then view the output in ncu-ui and (hopefully) see where the problem is.

Another Java GPU library

It's worth noting that there is another Java library that accesses the GPU, but in a different way. DJL appears to use JNA to access the CUDA library. The interface is CudaLibrary, the native implementation of which appears to be a thin wrapper around some CUDA code.

JNA eliminates the boilerplate of JNI. It dynamically maps Java method calls to native functions at runtime. Consequently, JNA has higher runtime overhead compared to JNI because it uses reflection and runtime mapping.
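
To get a feel for where that runtime cost comes from, here is a minimal sketch of the dynamic-dispatch idea using only the JDK's java.lang.reflect.Proxy. This is not JNA and it calls no native code: FakeMath, SYMBOLS and load() are hypothetical names invented for illustration, and a plain Java map stands in for the native symbol table. The point is just that every call goes through a reflective handler that looks the target up by method name, rather than through a binding compiled ahead of time as with JNI.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.function.Function;

public class DynamicBindingSketch {

    // Hypothetical interface playing the role of a JNA-style library mapping.
    public interface FakeMath {
        long square(long x);
        long negate(long x);
    }

    // Assumption: a Java map stands in for the native library's symbol table.
    static final Map<String, Function<Long, Long>> SYMBOLS = Map.of(
        "square", x -> x * x,
        "negate", x -> -x
    );

    // Builds a proxy whose calls are resolved by method name at runtime.
    public static FakeMath load() {
        InvocationHandler handler = (proxy, method, args) -> {
            // The lookup happens on every call; this is the dynamic mapping step.
            Function<Long, Long> fn = SYMBOLS.get(method.getName());
            if (fn == null) {
                throw new UnsupportedOperationException(method.getName());
            }
            // Arguments arrive boxed in an Object[], which adds further overhead.
            return fn.apply((Long) args[0]);
        };
        return (FakeMath) Proxy.newProxyInstance(
            FakeMath.class.getClassLoader(),
            new Class<?>[] { FakeMath.class },
            handler);
    }

    public static void main(String[] args) {
        FakeMath lib = load();
        System.out.println(lib.square(7)); // prints 49
        System.out.println(lib.negate(5)); // prints -5
    }
}
```

Note that nobody writes an implementation of FakeMath by hand, which is the boilerplate saving; the price is the name lookup and argument boxing on each invocation.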
