Thursday, November 30, 2023

Memories are made of these

Some notes on new memory models I've been looking at recently.

Zero copy

"Device controllers cannot do DMA directly into user space, but the same effect is achievable by exploiting ... [the fact] more than one virtual address can refer to the same physical memory location. [Thus] the DMA hardware (which can access only physical memory addresses) can fill a buffer that is simultaneously visible to both the kernel and a user space process." - Java NIO, Ron Hitchens

Virtual memory paging is "often referred to as swapping, though true swapping is done at the process level, not the page level" [ibid].

An excellent visual representation of what's going on during a zero-copy is here from Stanislav Kozlovski (who ends with the knock-out punch that it makes very little difference to Kafka since the costs of network IO and encryption generally cancel any savings). Anyway, the take-away points are below (a short Java sketch follows the list):

  • Zero-copy "doesn’t actually mean you make literally zero copies"; it's just that it "does not make unnecessary copies of the data."
  • Fewer context switches happen.
  • A further optimization to DMA is where the disk "read buffer directly copies data to the NIC buffer - not to the socket buffer.  This is the so-called scatter-gather operation (a.k.a Vectorized I/O).  [It is] the act of only storing read buffer pointers in the socket buffer, and having the DMA engine read those addresses directly from memory."
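
In Java, the classic zero-copy entry point is FileChannel.transferTo, which on Linux typically maps down to the sendfile(2) system call. Here is a minimal sketch; the file name, host and port are invented for the example:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical file and destination.
        try (FileChannel file = FileChannel.open(Path.of("big.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may send fewer bytes than requested, so loop.
            // The data moves kernel-side; it is never copied into this
            // JVM's user-space buffers.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}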

Java's new vector API

A new way of dealing with vectors is outlined at JEP 426 (Vector API). It leverages new CPU features like Advanced Vector Extensions [Wikipedia] that provide machine instructions which execute a Single Instruction on Multiple Data (SIMD).

Martin Stypinski has an interesting article showing that adding two floating-point vectors together gains very little from the new API, but a linear equation like y = mx + c (which has obvious applications to machine learning) can improve performance by an order of magnitude.
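
To give a flavour of the API, here is a sketch of that y = mx + c computation using the incubating vector classes (run with --add-modules jdk.incubator.vector). The class and method names come from JEP 426, but this is my own illustration, not Stypinski's code:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class Saxpy {
    // The widest shape the CPU supports, e.g. 8 floats with AVX2.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // y = m*x + c, computed SPECIES.length() lanes at a time.
    static void saxpy(float m, float[] x, float c, float[] y) {
        int i = 0;
        int upper = SPECIES.loopBound(x.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector vx = FloatVector.fromArray(SPECIES, x, i);
            vx.mul(m).add(c).intoArray(y, i);
        }
        // Scalar tail for any leftover elements.
        for (; i < x.length; i++) {
            y[i] = m * x[i] + c;
        }
    }
}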

Project Panama

Project Panama deals with interconnecting the JVM and native code. Oracle's Gary Frost talks about this in his presentation on accessing the GPU from Java. The difficulty he encountered was allocating heap memory and passing it to the GPU: the garbage collector might reorganise the heap, invalidating the pointer to that memory. With Project Panama this would not happen, as the allocation would be made through the JVM but off the heap.
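
With the Foreign Function & Memory API that came out of Panama, off-heap allocation looks something like the following. This is the API shape as of JDK 21/22; earlier previews used different names:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeap {
    public static void main(String[] args) {
        // A confined arena releases all its segments when closed.
        try (Arena arena = Arena.ofConfined()) {
            // 1024 floats of native, off-heap memory. The GC never moves
            // this, so its address can safely be handed to native code
            // (or a GPU runtime) for the lifetime of the arena.
            MemorySegment floats = arena.allocate(ValueLayout.JAVA_FLOAT, 1024);
            floats.setAtIndex(ValueLayout.JAVA_FLOAT, 0, 42.0f);
            System.out.println(floats.getAtIndex(ValueLayout.JAVA_FLOAT, 0));
        } // memory freed here, deterministically
    }
}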

Apache Arrow

Arrow provides an agreed memory format for data so you can "share data across languages and processes." [docs]

This differs from Google's Protobuf in that "Protobuf is designed to create a common on the wire or disk format for data." [SO] Data deserialized from Protobuf is represented in whatever way the host language ordinarily represents it.

This inter-process ability allows Spark (which runs in the JVM) to use Pandas (which runs in a Python process).

"Perhaps the single biggest memory management problem with pandas is the requirement that data must be loaded completely into RAM to be processed... Arrow serialization design provides a “data header” which describes the exact locations and sizes of all the memory buffers for all the columns in a table. This means you can memory map huge, bigger-than-RAM datasets and evaluate pandas-style algorithms on them in-place without loading them into memory like you have to with pandas now. You could read 1 megabyte from the middle of a 1 terabyte table, and you only pay the cost of performing those random reads totalling 1 megabyte... Arrow’s memory-mapping capability also allows multiple processes to work with the same large dataset without moving it or copying it in any way. "[10 Things I hate about Pandas, by Pandas author, Wes McKinny]

"The ability to memory map files allows you to treat file data on disk as if it was in memory. This exploits the virtual memory capabilities of the operating system to dynamically cache file content without committing memory resources to hold a copy of the file." [NIO - Hitchens].

MySQL vs Postgres

There's a great comparison between the two major open source DBs here at Uber. Amongst the many insights, there is a mention that MySQL uses a cache "logically similar to the Linux page cache but implemented in userspace... It results in fewer context switches. Data accessed via the InnoDB buffer pool doesn’t require any user/kernel context switches. The worst case behavior is the occurrence of a TLB [Translation Lookaside Buffer] miss, which is relatively cheap and can be minimized by using huge pages."

"On systems that have large amounds of memory and where applications require large blocks of memory, using huge pages reduces the number of entries required in the hardware memory management unit's translation look-aside buffer (TLB). This is beneficial because entries in the TLB are usually a scarce resource... For example, x86-32 allows 4mb pages as an alternative to 4kb pages)" [The Linux Programming Interface]

Thursday, November 9, 2023

Z-order

Z-ordering is an optimization technique in big data that allows faster access because similar data lives together. Here we discuss the algorithm that defines what "similar" means.

Imagine a logical grid where all the values of one column run across the top and all the values of another run down the side. If we were to sort this data, every datum could be placed somewhere in that grid.

Now, if the squares of the grid were mapped to files and all the data in each cell were to live in those files, we would have made searching much easier, as we now know the subset of files in which a datum may live. We've essentially sorted in not just one dimension but two (and higher dimensions are possible).

This can be especially useful when we want to sort the data but don't know exactly what to sort on - a common conundrum when dealing with events. Say we have event_time, insert_time and update_time. Which do we choose? We could sort the data three times, each time on one column, but this is impractical with huge data sets. Enter z-order.

Note that the Z-order really needs two or more columns on which to act; a single column is the degenerate case. "Zorder/Hilbert etc on a single dimension are just a hierarchical sort" [Russell Spitzer on Slack].

(This is a good article about z-ordering from the perspective of Apache Iceberg.)

For an example in Delta Lake, we can see this code that creates a data set with columns c1, c2 and c3 whose values are [x, 99-x, (x+50) mod 100] for x ∈ [0, 99]. After z-ordering, these numbers are split into 4 different files. Generating a graphic illustrates how the data points are distributed over those files.
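
For reference, creating and z-ordering such a table can be sketched as follows, assuming Delta Lake 2.x and a Spark session already configured with the Delta extensions; the path is made up for the demo:

import io.delta.tables.DeltaTable;
import org.apache.spark.sql.SparkSession;

public class ZOrderDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();
        String path = "/tmp/zorder_demo";

        // c1 = x, c2 = 99 - x, c3 = (x + 50) mod 100 for x in [0, 99]
        spark.range(100)
             .selectExpr("id AS c1", "99 - id AS c2", "(id + 50) % 100 AS c3")
             .write().format("delta").save(path);

        // Rewrite the table's files clustered by the interleaved z-value.
        DeltaTable.forPath(spark, path)
                  .optimize()
                  .executeZOrderBy("c1", "c2", "c3");
    }
}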

The idea behind how we calculate which cell a datum falls into is best described here on Wikipedia. But, in brief, the binary representations of the data points are interleaved to give a z-value per tuple. In our example, I see [0, 99, 50] mapped to the byte array [0, 0, 0, 0, 0, 0, 0, 0, 0, 9, -112, 26].
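
A toy re-implementation of the interleaving makes that mapping concrete. This is not Delta Lake's actual InterleaveBits code, just an illustration of the idea, though it does produce exactly the 12 bytes above for [0, 99, 50]:

public class ZValue {
    // Interleave the bits of n 32-bit ints, most significant bit first,
    // into an n*4-byte z-value: bit 31 of each input, then bit 30, etc.
    static byte[] interleave(int[] inputs) {
        byte[] out = new byte[inputs.length * 4];
        int outBit = 0;
        for (int bit = 31; bit >= 0; bit--) {      // walk bits MSB -> LSB
            for (int input : inputs) {
                int b = (input >>> bit) & 1;
                out[outBit >> 3] |= (byte) (b << (7 - (outBit & 7)));
                outBit++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [0, 0, 0, 0, 0, 0, 0, 0, 0, 9, -112, 26]
        System.out.println(java.util.Arrays.toString(interleave(new int[]{0, 99, 50})));
    }
}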

I took a look at the Delta Lake code here where a Spark Column object is created that wraps a DL InterleaveBits type which in turn is a subclass of Spark's Catalyst Expression type. This executes on Spark's InternalRow, that is, the raw data on the executors.

The reason the code does this is to add a column by which we can repartition the data, via the SQL CAST(interleavebits(rangepartitionid(c1), rangepartitionid(c2), rangepartitionid(c3)) AS STRING). The rangepartitionid keyword is part of the Delta Lake machinery.

Using this z-order value (plus a random key), the Delta code now calls repartitionByRange on the DataFrame, which samples the data [SO] and breaks it into discrete ranges.

Given the interleaving of the columns c1, c2 and c3, their order has minimal impact on the z-value, so it's no surprise to see nearby data clustering into the same files, as we can see in the graphic. In fact, if you look at the DataFrame during the repartition process:

+---+---+---+-------------------------------------------+
| c1| c2| c3|c7b6b480-c678-4686-aa99-283988606159-rpKey1|
+---+---+---+-------------------------------------------+
|  0| 99| 50|                                       \t�|
|  1| 98| 51|                                       \t�|
|  2| 97| 52|                                       \t�b|
|  3| 96| 53|                                       \t�e|
|  4| 95| 54|                                       \b��|
|  5| 94| 55|                                       \b��|
|  6| 93| 56|                                       \b��|
|  7| 92| 57|                                       \b��|
|  8| 91| 58|                                       \b�|
|  9| 90| 59|                                       \b�|
| 10| 89| 60|                                       \b�b|
| 11| 88| 61|                                       \b�e|
| 12| 87| 62|                                       \b��|
| 13| 86| 63|                                       \b��|
| 14| 85| 64|                                       \f)�|
| 15| 84| 65|                                       \f)�|
| 16| 83| 66|                                       \f`|
| 17| 82| 67|                                       \f`|
| 18| 81| 68|                                       \f`b|
| 19| 80| 69|                                       \f`e|
+---+---+---+-------------------------------------------+


you can see the slowly changing values by which things are partitioned (the column c7b6b480-c678-4686-aa99-283988606159-rpKey1 has a random name so it doesn't clash with other column names; it's dropped immediately after the call to repartitionByRange).