Agile Java Man: Optimising GPU code

Saturday, October 5, 2024

Optimising GPU code

I complained to Juan Fumero that a benchmark showed the GPU giving little performance improvement. JMH reported the GPU version as only moderately faster than the CPU one: about 58 ms per operation versus 72 ms, roughly 20% less time:

tornado -jar tornado-benchmarks/target/jmhbenchmarks.jar uk.ac.manchester.tornado.benchmarks.sgemv.JMHSgemV

...

Benchmark              Mode  Cnt         Score          Error  Units
JMHSgemV.sgemVJava     avgt    5  72366270.751 ± 5916807.539  ns/op
JMHSgemV.sgemVTornado  avgt    5  57583087.103 ± 2523449.341  ns/op

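(The scores are average time per operation in nanoseconds, so lower is better. SGEMV is single-precision general matrix-vector multiplication: the S stands for single precision, and GEMV indicates that we're multiplying a general matrix by a vector. Its matrix-matrix counterpart is SGEMM.)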
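To make the CPU side concrete, here is a minimal sequential SGEMV sketch in plain Java. It assumes a row-major matrix stored in a flat float[]; the class and method names are hypothetical, and this is not the code of the TornadoVM JMHSgemV benchmark.

```java
import java.util.Arrays;

// Hypothetical sketch of y = A * x (single precision), not TornadoVM's benchmark code.
final class SgemvSketch {

    /** Computes y = A * x, where A is m x n and stored row-major in a flat array. */
    static float[] sgemv(float[] a, float[] x, int m, int n) {
        if (a.length != m * n || x.length != n) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float[] y = new float[m];
        for (int i = 0; i < m; i++) {
            float sum = 0.0f;
            // Each matrix element is read once and used in a single multiply-add.
            for (int j = 0; j < n; j++) {
                sum += a[i * n + j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3,
                     4, 5, 6};   // 2 x 3 matrix, row-major
        float[] x = {1, 1, 1};
        System.out.println(Arrays.toString(sgemv(a, x, 2, 3))); // [6.0, 15.0]
    }
}
```

Note that every element of A is touched exactly once, which matters when the whole matrix first has to be copied to the GPU.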

Juan replied that I should try TornadoVM's --enableProfiler console switch to see where the time was being spent. Sure enough, COPY_IN_TIME (the time spent copying input data from the host to the GPU) was ~28ms, about the same as TOTAL_KERNEL_TIME.

Note that the total kernel time is the time it takes the GPU to perform the computation, whereas the total kernel dispatch time is the time it takes to schedule the kernel (i.e., the function being executed on the GPU). In this case, dispatch time is ~6us, more than three orders of magnitude smaller than the execution time.
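Why would copying the data cost as much as computing with it? One way to see it is arithmetic intensity: floating-point operations per byte of data moved. The sketch below is a rough, idealised model, assuming 4-byte floats, square matrices for GEMM, and that every input and output element crosses memory exactly once. The class and method names are hypothetical, and the size 1024 is illustrative only, since the benchmark's actual dimensions aren't stated here.

```java
// Idealised arithmetic-intensity model (flop per byte), not a measurement.
final class ArithmeticIntensity {

    static final int FLOAT_BYTES = 4; // single precision

    /** y = A x with A of size m x n: one multiply and one add per matrix element. */
    static double gemv(long m, long n) {
        double flops = 2.0 * m * n;
        double bytes = FLOAT_BYTES * (double) (m * n + n + m); // A, x and y
        return flops / bytes;
    }

    /** C = A B with all matrices n x n: n multiply-adds per output element. */
    static double gemm(long n) {
        double flops = 2.0 * n * n * n;
        double bytes = FLOAT_BYTES * 3.0 * n * n; // A, B and C
        return flops / bytes;
    }

    public static void main(String[] args) {
        long n = 1024; // hypothetical size chosen for illustration
        System.out.printf("GEMV: %.2f flop/byte%n", gemv(n, n));
        System.out.printf("GEMM: %.2f flop/byte%n", gemm(n));
    }
}
```

Under this model GEMV stays at roughly 0.5 flop per byte whatever its size, so shipping the matrix to the GPU costs about as much as using it. GEMM's intensity grows in proportion to n (about n/6), so there is far more arithmetic per byte transferred.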

Juan also said that "Matrix Vector is not as compute intensive as other applications", so I tried matrix-matrix multiplication (SGEMM) instead. Here, the GPU shines:

Benchmark              Mode  Cnt           Score          Error  Units
JMHSgemm.sgemmJava     avgt    5  1773297262.188 ± 4115731.439  ns/op
JMHSgemm.sgemmTornado  avgt    5     8478409.506 ±  246919.368  ns/op

That makes the GPU roughly 209 times faster than the CPU (about 8.5 ms versus 1.77 s per operation). Now COPY_IN_TIME is about 1ms and TOTAL_KERNEL_TIME is about 5.5ms.
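As a sanity check on the speedups, the JMH scores from the two tables can be turned into ratios. The bounds below naively combine the reported error margins (fastest plausible CPU against slowest plausible GPU, and vice versa). They are a rough range, not a statistically rigorous interval, and the class name is hypothetical.

```java
// Speedup implied by JMH average-time scores (ns/op, lower is better).
final class Speedup {

    static double speedup(double cpuNs, double gpuNs) {
        return cpuNs / gpuNs;
    }

    /** Pessimistic bound: fastest plausible CPU time vs slowest plausible GPU time. */
    static double lower(double cpuNs, double cpuErr, double gpuNs, double gpuErr) {
        return (cpuNs - cpuErr) / (gpuNs + gpuErr);
    }

    /** Optimistic bound: slowest plausible CPU time vs fastest plausible GPU time. */
    static double upper(double cpuNs, double cpuErr, double gpuNs, double gpuErr) {
        return (cpuNs + cpuErr) / (gpuNs - gpuErr);
    }

    static void report(String name, double cpu, double cpuErr, double gpu, double gpuErr) {
        System.out.printf("%s: %.2fx (range %.2fx to %.2fx)%n", name,
                speedup(cpu, gpu), lower(cpu, cpuErr, gpu, gpuErr), upper(cpu, cpuErr, gpu, gpuErr));
    }

    public static void main(String[] args) {
        // Scores and errors copied from the SGEMV and SGEMM JMH runs above.
        report("SGEMV", 72366270.751, 5916807.539, 57583087.103, 2523449.341);
        report("SGEMM", 1773297262.188, 4115731.439, 8478409.506, 246919.368);
    }
}
```

The SGEMV speedup comes out at about 1.26x, and the SGEMM one at about 209x, with the error bars moving it only by a few percent either way.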

Now we're talking. But, continuing this optimisation rampage, it's worth noting that "It has become tribal knowledge that the particular shapes chosen for matmuls has a surprisingly large effect on their performance." [Horace He] TL;DR: Horace He's article explains how well the small memory tiles the GPU works on fit over a large matrix can change performance dramatically. In essence, for a row-major MxN matrix, N should be a multiple of the number of elements in the GPU's cache line, so that every row starts on a cache-line boundary.
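A common workaround when N is awkward is to pad each row up to a whole number of cache lines. The sketch below assumes a 128-byte cache line (typical of NVIDIA GPUs, but check your device) and 4-byte floats, so the row stride is rounded up to a multiple of 32. Alignment here is relative to the start of the buffer and assumes the buffer itself is allocated aligned. The class and its methods are hypothetical, not part of TornadoVM.

```java
// Hypothetical row-major float matrix whose row stride is padded to whole cache lines.
final class PaddedMatrix {

    static final int CACHE_LINE_BYTES = 128; // assumption: varies by device
    static final int FLOAT_BYTES = 4;

    /** Smallest multiple of the floats-per-cache-line count that is >= n. */
    static int paddedStride(int n) {
        int floatsPerLine = CACHE_LINE_BYTES / FLOAT_BYTES; // 32 with the assumed sizes
        return ((n + floatsPerLine - 1) / floatsPerLine) * floatsPerLine;
    }

    final int rows, cols, stride;
    final float[] data;

    PaddedMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.stride = paddedStride(cols);
        this.data = new float[rows * stride]; // padding columns stay zero
    }

    float get(int i, int j) { return data[i * stride + j]; }

    void set(int i, int j, float v) { data[i * stride + j] = v; }

    public static void main(String[] args) {
        System.out.println(paddedStride(2047)); // 2048
        System.out.println(paddedStride(1000)); // 1024
        System.out.println(paddedStride(1024)); // 1024
    }
}
```

The cost is a little wasted memory per row; the benefit is that each row starts on a cache-line boundary instead of straddling one.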
