Saturday, March 9, 2024

Big Data and CPU Caches

I'd previously posted about how Spark's data frame schema is an optimization, not an enforcement. If you look at Spark's code, a schema merely saves it from checking whether a value is null. That is all.
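To show where that information lives, here is a minimal sketch (the column names and values are invented for illustration) of building a DataFrame with an explicit schema. The nullable flag on each StructField is what Spark can exploit to skip the check:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// nullable tells Spark whether a null check is needed when reading the column
val schema = StructType(Seq(
  StructField("id",   LongType,   nullable = false),  // no null check needed here
  StructField("name", StringType, nullable = true)    // ...but needed here
))

val rows = spark.sparkContext.parallelize(Seq(Row(1L, "alice"), Row(2L, "bob")))
val df   = spark.createDataFrame(rows, schema)
df.printSchema()   // id: long (nullable = false), name: string (nullable = true)

Note that Spark does not police the promise: put a null into the id column and nothing stops you at this point. Declaring nullable = false is an optimization you offer Spark, not a constraint it enforces.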

Can this really make so much of a difference? Surprisingly, yes: omitting a null check can speed your code up by an order of magnitude.

As ever, the devil is in the detail. A single null check is hardly likely to make a difference to your code. But when you are performing that check billions of times, you need to take it seriously.
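To be concrete about what is being saved, the difference boils down to a branch of roughly this shape. This is a hand-written illustration against Spark's InternalRow interface, not the code Spark actually generates:

import org.apache.spark.sql.catalyst.InternalRow

// Reading column 0 when the schema says it may be null: every row pays for
// the isNullAt branch (-1L stands in for whatever null handling the caller wants).
def readNullable(row: InternalRow): Long =
  if (row.isNullAt(0)) -1L else row.getLong(0)

// Reading the same column when the schema promises it is never null:
// the branch disappears.
def readNonNullable(row: InternalRow): Long =
  row.getLong(0)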

There is another dimension to this problem. If you're checking the same reference (or a small set of them), you're probably going to be OK. But if you are null-checking a large number of different references, that is where you'll see performance degrade.

The reason is that a small number of references can live happily in your CPU cache. As that number grows, the references are less likely to be cached, and your code will be forced to load data from RAM into the CPU.

Modern CPUs cache data to avoid hitting RAM. My 2.40GHz Intel Xeon E-2286M has three levels of cache, each bigger (and slower) than the last:

$ sudo dmidecode -t cache  
Cache Information                       
Socket Designation: L1 Cache                     
Maximum Size: 512 kB                 
...                  
Socket Designation: L2 Cache                   
Maximum Size: 2048 kB                
...                      
Socket Designation: L3 Cache                      
Maximum Size: 16384 kB             
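
As a back-of-the-envelope guide, here is what those sizes mean in terms of 8-byte longs (dmidecode appears to report totals across all cores, so the slice available to any single core is smaller):

// Rough capacity of each cache level in 64-bit longs, using the totals above.
val cacheKb = Seq("L1" -> 512, "L2" -> 2048, "L3" -> 16384)
for ((level, kb) <- cacheKb)
  println(s"$level: ${kb * 1024 / 8} longs")   // 65,536, 262,144 and 2,097,152 respectively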

Consequently, the speed at which we can randomly access an array of 64-bit numbers depends on the size of the array. Below is a sketch of the kind of code that demonstrates it.
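It is plain Scala with a single warm-up pass rather than a proper harness like JMH, so treat any numbers as indicative only; it sums randomly chosen elements of a Long array and prints the average time per access:

import scala.util.Random

object RandomAccessSketch {

  // Randomly read `accesses` elements from a Long array of `size` elements
  // and return the average nanoseconds per read.
  def nsPerAccess(size: Int, accesses: Int): Double = {
    val data    = Array.tabulate(size)(_.toLong)
    val indices = Array.fill(accesses)(Random.nextInt(size))  // pre-computed so we time the reads, not the RNG
    var sum     = 0L
    val start   = System.nanoTime()
    var i = 0
    while (i < accesses) {
      sum += data(indices(i))    // the random 64-bit read being measured
      i += 1
    }
    val elapsed = System.nanoTime() - start
    if (sum == Long.MinValue) println(sum)  // use the result so the JIT cannot discard the loop
    elapsed.toDouble / accesses
  }

  def main(args: Array[String]): Unit = {
    val accesses = 20 * 1000 * 1000
    // Array sizes from well inside the caches to far beyond them:
    // 32 kB, 1 MB, 16 MB and 256 MB of longs.
    for (size <- Seq(1 << 12, 1 << 17, 1 << 21, 1 << 25)) {
      nsPerAccess(size, accesses)              // warm-up pass
      val ns = nsPerAccess(size, accesses)
      println(f"${size.toLong * 8 / 1024}%8d kB: $ns%.2f ns per access")
    }
  }
}

Once the array no longer fits in the L3 cache, most of those random reads go all the way to RAM and the time per access climbs sharply: the same loop gets dramatically slower purely because its working set outgrew the caches.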


Who would have thought that little optimizations on big data could make such a huge difference?
