Wednesday, October 25, 2023

Java and the GPU

Java has always been designed to abstract away the hardware on which it runs, but its approach to the GPU has been somewhat late to the game [Gary Frost, YouTube].

There are projects out there that promise to give Java access to the GPU. I looked at Aparapi, but it appears to be moribund. So, I gravitated to TornadoVM, which Frost describes as "state of the art".

The trouble is that TornadoVM runs everything in a Docker image that has all the shared objects built in. That's fine for a quick demo; this is the result of running it on my Quadro T2000:

docker-tornadovm$ ./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication
Computing MxM of 512x512
CPU Execution: 1.17 GFlops, Total time = 230 ms
GPU Execution: 268.44 GFlops, Total Time = 1 ms
Speedup: 230x

This demonstrates the GPU running a nested for-loop doing matrix multiplication much faster than the same code on the CPU. But it all runs in a Docker container, and I need to package a JAR every time I make a change. How do I run it outside the container?
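
For reference, the kind of code TornadoVM accelerates looks something like this. It's a minimal sketch against the 0.15-era TaskGraph API rather than the exact demo source; the class name, sizes and setup are illustrative. The @Parallel annotations mark the loops that TornadoVM may map onto the GPU's parallel dimensions:

import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class MxM {

    // Classic O(n^3) matrix multiply: plain Java that runs unchanged on CPU or GPU.
    public static void multiply(float[] a, float[] b, float[] c, int n) {
        for (@Parallel int i = 0; i < n; i++) {
            for (@Parallel int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        final int n = 512;
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        float[] c = new float[n * n];
        // ... populate a and b ...

        // Describe what to copy to the device, which method to run, and what to copy back.
        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", MxM::multiply, a, b, c, n)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        new TornadoExecutionPlan(graph.snapshot()).execute();
    }
}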

To work this out, I opened a shell in the Docker image and saw that the TornadoVM build it uses was built from Git branch d3062accc. So, the first thing was to check out that branch of TornadoVM and build it.

I built with:

mvn clean install -Pgraal-jdk-11-plus

using the graalvm-ee-java11-22.3.4 JDK.

Note that you'll need Graal, as the TornadoVM code has dependencies on it. I built my own Graal JDK by following the instructions here, but using a different branch as I couldn't find the download for the graal.version defined in the TornadoVM pom.xml. Note that you'll also need mx and a bootstrapping JDK that has the right compiler interface (JVMCI), in my case labsjdk-ce-21.0.1-jvmci-23.1-b19.

So far, so good. I ran the tornado script, which is just a wrapper around a call to the java executable (don't forget to set your JAVA_HOME environment variable to point at the Graal JDK), but it complained that it could not see a tornado.backend file.

Again, a sneaky look at the Docker container indicated that we have to tell TornadoVM which backend to use. So, I created the file and set tornado.backends=opencl-backend, but then tornado complained that it didn't have the OpenCL drivers. Oops.
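
In my build that amounted to something like this (the file sits in the dist's etc directory, the same folder TORNADO_SDK points at below; your path may differ):

echo "tornado.backends=opencl-backend" > $TORNADO_SDK/etc/tornado.backend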

You have to build the drivers you want separately, it seems. But if you try to build the Tornado drivers without the native OpenCL dev library, you'll see:

TornadoVM/tornado-drivers/opencl-jni$ mvn clean install # yes, Maven drives cmake via the cmake-maven-plugin
....
/usr/bin/ld: cannot find -lOpenCL
...


The Docker image saves you from having to install the OpenCL libraries on your machine. To get it working on bare metal, I played it safe, dug out an old Ubuntu box and installed them there with:

sudo apt install ocl-icd-opencl-dev

I then ran Maven in the opencl* subdirectories and, this time, the build completed successfully.

However, running tornado from the resulting dist folder still puked, this time with something like:

Caused by: uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: OpenCL JNI Library not found
at tornado.drivers.opencl@0.15.1/uk.ac.manchester.tornado.drivers.opencl.OpenCL.<clinit>(OpenCL.java:68)
... 11 more

Not what I was expecting. I found I needed to:

cp ./tornado-drivers/opencl-jni/target/linux-amd64-release/cmake/libtornado-opencl.so $TORNADO_SDK/lib

Where TORNADO_SDK points at the relevant dist folder.

Now, finally, you can run on bare metal:

$ tornado -cp target/classes/ example.MatrixMultiplication
Computing MxM of 512x512
CPU Execution: 1.21 GFlops, Total time = 222 ms
GPU Execution: 17.90 GFlops, Total Time = 15 ms
Speedup: 14x

(Results from an old NVIDIA GeForce GTX 650)

Note, you'll also need to run it with the Graal JVM: set both the PATH and JAVA_HOME environment variables to point to it.
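
For example (the install path here is illustrative; adjust it to wherever your Graal JDK lives):

export JAVA_HOME=/opt/graalvm-ee-java11-22.3.4
export PATH=$JAVA_HOME/bin:$PATH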

Where now?

This is a nice introduction to running Java on the GPU, but it's just the start and there are many caveats. For example: what if your Java code throws an exception? GPUs have no equivalent of exceptions, so what happens then? More to come.

Thursday, October 12, 2023

Dependency hell

In these days of ChatGPT, it's easy to forget that most of the time, a developer isn't actually cutting code at all, but debugging it. This is my own personal hell in getting Spark and Kafka in Docker containers talking to a driver on the host.

Firstly, I was seeing No TypeTag available when my code tried to use the Spark Encoders. This SO answer helped. Basically, my code is Scala 3 and "Encoders.product[classa] is a Scala 2 thing. This method accepts an implicit TypeTag. There are no TypeTags in Scala 3". Yikes. This is probably one reason the upgrade path to Scala 3 in Spark is proving difficult. The solution I used was to create an SBT sub-project that was entirely Scala 2 and to call Spark from there.
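
The wiring looks roughly like this in build.sbt (a sketch only; the project names and version numbers are illustrative, not my actual build):

// The Spark-facing code lives here, compiled with Scala 2.13 so that
// Encoders.product and its TypeTag-based implicits still resolve.
lazy val sparkFacade = (project in file("spark-facade"))
  .settings(
    scalaVersion := "2.13.12",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
  )

// The rest of the codebase stays on Scala 3 and calls into the facade.
lazy val root = (project in file("."))
  .settings(scalaVersion := "3.3.1")
  .dependsOn(sparkFacade)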

The next problem was seeing my Spark jobs fail with:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (172.30.0.7 executor 0): java.lang.ClassCastException: cannot assign instance of scala.collection.generic.DefaultSerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition.inputPartitions of type scala.collection.immutable.Seq in instance of org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition

This is a contender for the error message with the greatest misdirection. You think it's a serialization problem but it isn't directly so.

Although other Spark users have reported it, Ryan Blue mentions that it isn't really a Spark issue but a Scala issue.

Anyway, I tried all sorts of things, like changing my JDK (note that the sun.* packages have been removed in later JDKs, so you need to follow the advice in this SO answer). I tried creating an uber JAR but was thwarted by duplicated dependencies [SO] and Invalid signature file digest errors as some JARs were signed [SO], which forced me to strip the signatures out [SO], only to fall foul of Kafka's DataSourceRegister file being stripped out as well [SO].

The first step in the right direction came from here and another SO question, where it's recommended to build the SparkSession with .config("spark.jars", PATHS), where PATHS is a comma-delimited string of the full paths of all the JARs you want to use. Surprisingly, this turned out to include Spark JARs themselves, in my case spark-sql-kafka-0-10_2.13, which oddly does not come as part of the Spark installation. By adding them as spark.jars, they are uploaded into the work subdirectory of a Spark node.
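
Concretely, the builder ends up looking something like this (the paths, version numbers, app name and master URL are all illustrative):

import org.apache.spark.sql.SparkSession

// Everything the executors will need, including Spark's own Kafka connector,
// listed explicitly so that Spark ships the JARs to the workers.
val extraJars = Seq(
  "/path/to/spark-sql-kafka-0-10_2.13-3.5.0.jar",
  "/path/to/kafka-clients-3.5.0.jar"
).mkString(",")

val spark = SparkSession.builder()
  .appName("kafka-driver")
  .master("spark://localhost:7077")
  .config("spark.jars", extraJars)
  .getOrCreate()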

After this, there were just some minor domain-name mapping issues to clear up on both the host and the container before the whole stack worked without puking any further errors.

Monday, October 9, 2023

My "What data science can learn from software engineering" presentation

Dr Chris Monit and I gave this presentation at the London MLOps meetup last week. TL;DR: maximize your chances of a successful delivery in data science by adopting the best practices that the software industry has established.

"Think of MLOps as the process of automating machine learning using DevOps methodologies" - Practical MLOps (O'Reilly)

Monday, October 2, 2023

Packaging Python

Python build tools are unifying behind a common interface of pyproject.toml.
This and this are great guides. The gist of the former is that you create a TOML file that conforms to a specification, and then any compliant build tool can build it. The gist of the latter is an overview of the whole Python packaging ecosystem.
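
To give a rough idea, a minimal pyproject.toml looks something like this (the metadata and the choice of setuptools as the backend are purely illustrative):

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"
version = "0.1.0"
description = "An example package"
readme = "README.md"
requires-python = ">=3.9"
authors = [{ name = "Jane Doe", email = "jane@example.com" }]
dependencies = ["numpy>=1.24"]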

The salient commands for building and deploying with your TOML file are:

python3 -m build
python3 -m twine upload --repository pypi dist/*


Note, you want to clean your dist directory first.

The Snag

The idea of using any Python build tool is not quite there yet. Poetry only implements a subset of the specification, and the specification itself is a leaky abstraction. On Discord, Prof. Nick Radcliffe explains how the promise of being able to use "any" backend led him to naively choose setuptools.

Nick Radcliffe — 08/21/2023 2:37 PM

Also, in case anyone is interested (related to packaging, above) I'm currently in the process of packaging a fairly large Python codebase using new-style packaging (pyproject.toml rather than setup.py). It wasn't quite my first use of it, but this project is much more complex. Initially, I chose setuptools as the build backend, since (a) it didn't seem like it should matter much and (b) I didn't think I needed anything special. That was a big mistake for me: it turns out the setuptools back-end ignores almost everything except Python code in building your package. Whereas my package (which has over 10k files) also have about 1,000 non-python files (everything from .txt and .json to shape files, CSV files, and HTML and markdown and all sorts). Some of these are needed for testing (which for some reason some people think don't need to be distributed...as if people shouldn't care about whether the installed software works in situ, rather than just on the developer's machine in the CI system), but others are needed just in the ordinary course of using the software.  setuptools has a way to let you include extra stuff, but it's very manual and would be very error-prone for me. Anyway, the TL;DR is that I switched to Flit as the backend and everything "just worked". Not saying Flit will work better for you; but it sure as hell worked better for me!

Also, the reason I chose flit was that the third bullet in "Why use Flit?" is "Data files within a package directory are automatically included. Missing data files has been a common packaging mistake with other tools."

It also says: "The version number is taken from your package’s version attribute, so that always matches the version that tools like pip see." Which also seems extremely sane (and probably means I don't need to do the automatic updating of my pyproject.toml to achieve that).
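
For the record, pointing the build at Flit is just a change to the build-system table; this is the form the Flit documentation gives:

[build-system]
requires = ["flit_core>=3.4,<4"]
build-backend = "flit_core.buildapi"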

Success has many parents...

... but it appears that PyPI packages have only one. Although the authors tag can take a list, additional entries are ignored. The reason is that it's best practice to use a mailing list instead (see here).

And so my package to facilitate the creation of synthetic data now lives on PyPI, much like my Java code is deployed to mvnrepository.