Thursday, November 30, 2023

Memories are made of these

Some notes on new memory models I've been looking at recently.

Zero copy

"Device controllers cannot do DMA directly into user space, but the same effect is achievable by exploiting ... [the fact] more than one virtual address can refer to the same physical memory location. [Thus] the DMA hardware (which can access only physical memory addresses) can fill a buffer that is simultaneously visible to both the kernel and a user space process." - Java NIO, Ron Hitchens

Virtual memory paging is "often referred to as swapping, though true swapping is done at the process level, not the page level" [ibid].

An excellent visual representation of what's going on in during a zero-copy is here from Stanislav Kozlovski (who ends with the knock-out punch that is makes very little difference to Kafka since generally costs of network IO and encryption cancel any savings). Anyway, the take-away points are: 

  • Zero-copy "doesn’t actually mean you make literally zero copies" it's just that it "does not make unnecessary copies of the data."
  • Fewer context switches happen.
  • A further optimization to DMA is where the disk "read buffer directly copies data to the NIC buffer - not to the socket buffer.  This is the so-called scatter-gather operation (a.k.a Vectorized I/O).  [It is] the act of only storing read buffer pointers in the socket buffer, and having the DMA engine read those addresses directly from memory."

Java's new vector API

A new way of dealing with vectors is outlined at JEP426 (Vector API). It leverages new CPU features like Advanced Vector Extensions [Wikipedia] that provide new machine instructions to execute Single Instructions on Multiple Data (SIMD).  

Martin Stypinski has an interseting article that shows adding two floating point vectors together gain very little from the new API but a linear equation like y = mx + c (which has obvious applications to machine learning) can improve performance by an order of magnitude.

Project Panama

Project Panama deals with interconnecting the JVM with native code. Oracle's Gary Frost talks about this in his presentation on accessing the GPU from Java. The difficulty he encountered was allocating heap memory and passing it to the GPU. Unfortunately, the garbage collector might reorganise the heap making the pointer to that memory obsolete. With Project Panama, this would not happen as the allocation would be through the JVM but off the heap. 

Apache Arrow

Arrow provides an agreed memory format for data so you can "share data across languages and processes." [docs]

This differs from Google's Protobuf in that "Protobuf is designed to create a common on the wire or disk format for data." [SO] Any data from Protobuf that is deserialized will be done in the the same way that language always handles it.

This inter-process ability allows Spark (which runs in the JVM) to use Pandas (which runs in a Python process).

"Perhaps the single biggest memory management problem with pandas is the requirement that data must be loaded completely into RAM to be processed... Arrow serialization design provides a “data header” which describes the exact locations and sizes of all the memory buffers for all the columns in a table. This means you can memory map huge, bigger-than-RAM datasets and evaluate pandas-style algorithms on them in-place without loading them into memory like you have to with pandas now. You could read 1 megabyte from the middle of a 1 terabyte table, and you only pay the cost of performing those random reads totalling 1 megabyte... Arrow’s memory-mapping capability also allows multiple processes to work with the same large dataset without moving it or copying it in any way. "[10 Things I hate about Pandas, by Pandas author, Wes McKinny]

"The ability to memory map files allows you to treat file data on disk as if it was in memory. This exploits the virtual memory capabilities of the operating system to dynamically cache file content without committing memory resources to hold a copy of the file." [NIO - Hitchens].

MySQL vs Postgres

There's a great comparison between the two major open source DBs here at Uber. Amongst the many insights, there is a mention that MySQL uses a cache "logically similar to the Linux page cache but implemented in userspace... It results in fewer context switches. Data accessed via the InnoDB buffer pool doesn’t require any user/kernel context switches. The worst case behavior is the occurrence of a TLB [Translation Lookaside Buffer] miss, which is relatively cheap and can be minimized by using huge pages."

"On systems that have large amounds of memory and where applications require large blocks of memory, using huge pages reduces the number of entries required in the hardware memory management unit's translation look-aside buffer (TLB). This is beneficial because entries in the TLB are usually a scarce resource... For example, x86-32 allows 4mb pages as an alternative to 4kb pages)" [The Linux Programming Interface]

No comments:

Post a Comment