Wednesday, January 3, 2024

GPU vs CPU vs AVX


Vector databases are all the rage. So, I looked at three different ways of multiplying vectors: on the CPU, on the GPU, and with Advanced Vector Extensions (AVX), which leverage SIMD instructions if your hardware supports them. To access the GPU, I'm using TornadoVM. For AVX, I'm using the JVM's jdk.incubator.vector module, available since JDK 16.
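As a sketch of what the AVX path looks like (class and method names here are mine, not taken from the repo), a dot product written against the jdk.incubator.vector API:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    // The "preferred" species picks the widest SIMD shape the hardware
    // supports (e.g. 256-bit AVX2 = 8 floats per vector).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Vectorized main loop: SPECIES.length() floats per iteration.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the leftover elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Compile and run with `--add-modules jdk.incubator.vector`; the JIT lowers `va.mul(vb)` to AVX instructions where the hardware allows.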

(Code in my GitHub repo here).

The reason we're looking at vector multiplication is that searching for vectors (what a vector DB is all about) usually uses something like an approximate nearest neighbour algorithm. One way to implement it is Ethan Lui's approach, mentioned in a past blog post here. Briefly: it multiplies your vector by a set of random vectors, producing a signature whose bits are on or off depending on the sign of each element in the product.
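A minimal sketch of that sign-hash idea (all names here are mine; this is not Ethan Lui's actual code):

```java
import java.util.Random;

public class SignHash {
    // Project the input onto nBits random hyperplanes and keep only the
    // sign of each dot product, packing the signs into a long.
    static long signature(float[] v, float[][] randomVectors) {
        long bits = 0L;
        for (int i = 0; i < randomVectors.length; i++) {
            float dot = 0f;
            for (int j = 0; j < v.length; j++) {
                dot += v[j] * randomVectors[i][j];
            }
            if (dot >= 0f) {
                bits |= 1L << i;  // bit is "on" when the projection is non-negative
            }
        }
        return bits;
    }

    // Random hyperplanes; a fixed seed keeps signatures reproducible.
    static float[][] randomVectors(int nBits, int dim, long seed) {
        Random rnd = new Random(seed);
        float[][] planes = new float[nBits][dim];
        for (float[] plane : planes) {
            for (int j = 0; j < dim; j++) {
                plane[j] = (float) rnd.nextGaussian();
            }
        }
        return planes;
    }
}
```

Vectors whose signatures agree on most bits are likely to be close in angle, which is what makes this cheap hash usable for approximate nearest-neighbour search.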

The results are as follows (note: the GPU is a Quadro T2000, which apparently has 4 GB of memory, 1024 cores and a bandwidth of 128 gigabits per second).

You can see that there is a huge fixed cost to using the GPU but once you get sufficiently large vectors, it's worth it. But what causes this fixed cost?

On my Intel Xeon E-2286M CPU @ 2.40GHz, kernel calls typically take 17.8ns:

  17.776 ±(99.9%) 0.229 ns/op [Average]
  (min, avg, max) = (17.462, 17.776, 19.040), stdev = 0.306
  CI (99.9%): [17.547, 18.005] (assumes normal distribution)

JNI calls take a little longer at about 21.9ns:

  21.853 ±(99.9%) 0.488 ns/op [Average]
  (min, avg, max) = (21.345, 21.853, 23.254), stdev = 0.651
  CI (99.9%): [21.365, 22.340] (assumes normal distribution)

So, it doesn't seem that the fixed cost incurred in the GPU vector multiplication is due to context switching when calling the kernel, or to the overhead of calls via JNI.

Note the maximum vector size for this test was 8 388 608 floats. 

That's 268 435 456 bits or 0.25 gigabits.

Based on bandwidth alone, ignoring everything else, each call should take about 1.95ms. This matches the average observed time (1.94971ms).
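Spelling out that back-of-the-envelope estimate (treating the quoted bandwidth as binary gigabits per second, which is my assumption):

```java
public class BandwidthEstimate {
    // Transfer time implied by bandwidth alone: bits moved / bits per second.
    static double transferMillis(long nFloats, double bandwidthGibPerSec) {
        double gib = nFloats * 32.0 / (1L << 30); // 32 bits per float
        return gib / bandwidthGibPerSec * 1000.0;
    }
}
```

For 8,388,608 floats over a 128 Gib/s link this gives 0.25 / 128 = 1.953125ms, which lines up with the 1.94971ms observed.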

This suggests the actual calculation is extremely fast and only the low bandwidth is slowing it down. TornadoVM appears to have minimal room for improvement: you really are getting the best you can out of the hardware.
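For reference, the kind of kernel TornadoVM offloads is ordinary Java. A sketch (under TornadoVM the loop index would carry its @Parallel annotation and the method would be registered in a task graph; I've left those out so the snippet runs on a stock JDK, and the names are mine, not from the repo):

```java
public class MultiplyKernel {
    // Element-wise multiply. TornadoVM compiles a method like this to an
    // OpenCL/PTX kernel and runs each iteration on a GPU thread; as plain
    // Java it simply runs sequentially on the CPU.
    static void multiply(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] * b[i];
        }
    }
}
```

The appeal of this model is that the same method is valid Java either way, so the GPU path and the CPU baseline share one implementation.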
