Sunday, March 24, 2024

Iceberg locks and catalogs

Although the Hive Metastore (HMS) is used in most Spark implementations, it's not recommended for Iceberg: HMS does not support retries and the deconflicting of commits.

"HadoopCatalog has a number of drawbacks and we strongly discourage it for production use.  There are certain features like rename and drop that may not be safe depending on the storage layer and you may also require a lock manager to get atomic behavior.  JdbcCatalog is a much better alternative for the backing catalog." [Iceberg Slack]

Iceberg comes with a DynamoDB (AWS) implementation of the lock manager. Looking at the code, it appears that acquiring the lock uses an optimistic strategy. You can tell DynamoDB to put a row in the table only if it doesn't exist already. If it does, the underlying AWS library throws a software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException. There's a test for this in the AWS module here; it needs an AWS account to run.
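For illustration, here is a minimal sketch of that conditional-write pattern using the AWS SDK v2 from Scala. This is not Iceberg's actual DynamoDbLockManager code, and the table and attribute names are invented.

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, ConditionalCheckFailedException, PutItemRequest}
import scala.jdk.CollectionConverters._

object OptimisticLockSketch {
  // Try to acquire a lock by writing a row only if no row with this key already exists.
  def tryAcquire(dynamo: DynamoDbClient, lockTable: String, entityId: String, ownerId: String): Boolean = {
    val item = Map(
      "entityId" -> AttributeValue.builder().s(entityId).build(),
      "owner"    -> AttributeValue.builder().s(ownerId).build()
    ).asJava

    val request = PutItemRequest.builder()
      .tableName(lockTable)
      .item(item)
      // The write only succeeds if no item with this partition key is already present.
      .conditionExpression("attribute_not_exists(entityId)")
      .build()

    try {
      dynamo.putItem(request)
      true                                              // we now hold the lock
    } catch {
      case _: ConditionalCheckFailedException => false  // somebody else holds it
    }
  }
}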

"This is necessary for a file system-based catalog to ensure atomic transaction in storages like S3 that do not provide file write mutual exclusion." [Iceberg docs] This is a sentiment echoed in this blog.

The issue is the rename, not the data transfer. "Each object transfer is atomic. That is, either a whole file is transferred, or none of it is. But the directory structure is not atomic and a failure can cause mv to fail mid-way." [AWS Questions]

In the old world of HDFS, Spark would write its output to a temporary directory then atomically rename that directory to the final destination. However, S3 is not a file system but a blob store, and the notion of a directory is just that: notional. When we change a "directory's" name, every file in that directory must be renamed one by one, and renaming all the files Spark outputs is not atomic in S3. Implementations that talk to their own file system must provide Hadoop's OutputCommitter, which Spark calls when setting up, committing and aborting writes.

The only mention of the lock manager in "Apache Iceberg: The Definitive Guide" is:

"If you are using AWS Glue 3.0 with Iceberg 0.13.1, you must also set the additional configurations for using the Amazon DynamoDB lock manager to ensure atomic transactions. AWS Glue 4.0, on the other hand, uses optimistic locking by default."

which is a bit too cryptic for me; apparently, Glue 4.0 ships with a later version of Iceberg that uses optimistic locking by default [Discourse].
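For reference, wiring the DynamoDB lock manager into a Glue-backed catalog looks something like the sketch below. The catalog name, bucket and lock-table name are invented; the property names are the ones documented for Iceberg's AWS integration, but check them against the Iceberg and Glue versions you are actually running.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-glue-with-lock-manager")
  .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
  .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  // The two properties that bring in the DynamoDB lock manager:
  .config("spark.sql.catalog.glue.lock-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager")
  .config("spark.sql.catalog.glue.lock.table", "myIcebergLockTable")
  .getOrCreate()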

Catalogs

A catalog "allows [Iceberg] to ensure consistency with multiple readers and writers and discover what tables are available in the environment... the primary high level requirement for a catalog implementation to work as an Iceberg catalog is to map a table path (e.g., “db1.table1”) to the file path of the metadata file that has the table’s current state."
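To make that table-path-to-metadata mapping concrete, here is a sketch using Iceberg's Java catalog API from Scala. The JDBC connection details are invented, the table is assumed to already exist, and you would need the JDBC driver on the classpath; any catalog implementation exposes the same interface.

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.jdbc.JdbcCatalog
import scala.jdk.CollectionConverters._

val catalog = new JdbcCatalog()
catalog.setConf(new Configuration())
catalog.initialize("demo", Map(
  "uri"       -> "jdbc:postgresql://db-host:5432/iceberg_catalog",  // invented connection details
  "warehouse" -> "s3://my-bucket/warehouse"
).asJava)

// This lookup is the catalog's core job: resolve "db1.table1" to the table's
// current metadata file, from which Iceberg reconstructs the table's state.
val table = catalog.loadTable(TableIdentifier.of("db1", "table1"))
println(table.location())        // the table's root location
println(table.currentSnapshot()) // state recorded by the current metadata file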

The Catalogs are:

  • Hadoop. Note that Hadoop is used loosely. "Note anytime you use a distributed file system (or something that looks like one) to store the current metadata pointer, the catalog used is actually called the 'hadoop' catalog." [1] The most important potential downside of this Catalog is "It requires the file system to provide a file/object rename operation that is atomic to prevent data loss when concurrent writes occur." And there are others. "A Hadoop catalog doesn’t need to connect to a Hive MetaStore, but can only be used with HDFS or similar file systems that support atomic rename. Concurrent writes with a Hadoop catalog are not safe with a local FS or S3." [Iceberg docs]
  • Hive. Apart from running an additional process (unlike the Hadoop catalog), "It requires the file system to provide a file/object rename operation that is atomic to prevent data loss when concurrent writes occur." [1]
  • AWS Glue. "Like the Hive catalog, it does not support multi-table transactions" [1]
  • Nessie gives a Git-like experience for data lakes but the two main disadvantages are that you must run the infrastructure yourself (like Hive) and it's not compatible with all engines.
  • REST is by nature simple and implementation agnostic, and "the REST catalog supports multi-table transactions". "REST Catalog is actually a protocol with a client implementation in the library. There are examples of how to adapt that protocol to different catalog backends (like HMS or JDBC)... The REST protocol allows for advanced features that other catalogs cannot support, but that doesn't mean all of those features will be available for every REST implementation" [Slack]
  • JDBC is near ubiquitous but "it doesn’t support multi-table transactions". "With JDBC the database does the locking, so no external lock manager is required" [Slack]
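Pointing Spark at a JDBC-backed Iceberg catalog looks roughly like the sketch below. The catalog name, JDBC URL, credentials and warehouse path are all placeholders; the property names are those documented for Iceberg's JdbcCatalog, so check them against your version.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-jdbc-catalog")
  .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.lakehouse.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
  .config("spark.sql.catalog.lakehouse.uri", "jdbc:postgresql://db-host:5432/iceberg_catalog")
  .config("spark.sql.catalog.lakehouse.jdbc.user", "iceberg")
  .config("spark.sql.catalog.lakehouse.jdbc.password", "secret")
  .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-bucket/warehouse")
  .getOrCreate()

// Tables in this catalog are then addressed as lakehouse.<database>.<table>:
spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.db1.table1 (id BIGINT, data STRING) USING iceberg")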

So, which should you use? From contributor Daniel Weeks in Slack:

"If you're not using HMS currently, I would suggest going with JdbcCatalog, which you can also use directly or with a REST frontend... I would strongly suggest using JDBC Catalog unless there's something specific you need. HMS is built for hive and iceberg is not hive. There is both a lot of complexity and baggage that comes with hive. For example, if you change the table schema directly in hive, it does not change the schema in your iceberg table. Same with setting table properties. JDBC is super lightweight and native to iceberg, so if you don't have hive, I would avoid using it.

"There are multiple projects that are starting to adopt REST and I expect that only to grow, but that doesn't mean you necessarily need it right now.  The main thing to think about is using multiple catalogs (not limit yourself to a single one). You can use JDBC directly now (most engines support it), but you can always add a REST frontend later.  They can co-exist and REST can even proxy to your JDBC backend"

[1] "Apache Iceberg: The Definitive Guide"

Saturday, March 9, 2024

Big Data and CPU Caches

I'd previously posted about how Spark's data frame schema is an optimization, not an enforcement. If you look at Spark's code, a non-nullable schema merely saves checking whether a value is null. That is all.

Can this really make so much of a difference? Surprisingly, omitting a null check can speed up your code by an order of magnitude.

As ever, the devil is in the detail. A single null check is hardly likely to make a difference to your code. But when you are checking billions of times, you need to take it seriously. 

There is another dimension to this problem. If you're checking the same reference (or a small set of them) then you're probably going to be OK. But if you are null checking large numbers of references, this is where you're going to see performance degradation.

The reason is that a small number of references can live happily in your CPU cache. As this number grows, they're less likely to be cached and your code will force memory to be loaded from RAM into the CPU.
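As a sketch of the kind of hot loop in question (this is illustrative Scala, not Spark's generated code): with boxed, nullable values every element is a reference that must be followed (a potential cache miss) and checked, whereas a primitive array needs neither.

// A sketch of the kind of hot loop in question, not Spark's actual code generation.
def sumNullable(xs: Array[java.lang.Long]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) {
    val x = xs(i)
    if (x != null) total += x   // the per-element check a non-nullable schema lets you skip
    i += 1
  }
  total
}

def sumNonNullable(xs: Array[Long]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) {
    total += xs(i)              // no reference to chase, no null check
    i += 1
  }
  total
}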

Modern CPUs cache data to avoid hitting RAM. My 2.40GHz Intel Xeon E-2286M has three levels of cache, each bigger (and slower) than the last:

$ sudo dmidecode -t cache  
Cache Information                       
Socket Designation: L1 Cache                     
Maximum Size: 512 kB                 
...                  
Socket Designation: L2 Cache                   
Maximum Size: 2048 kB                
...                      
Socket Designation: L3 Cache                      
Maximum Size: 16384 kB             

Consequently, the speed at which we can randomly access an array of 64-bit numbers depends on the size of the array. Here is some code that demonstrates this.
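A minimal sketch of that benchmark follows (the array sizes and lookup count are arbitrary, and a serious measurement would use JMH):

import scala.util.Random

object CacheBenchmark {
  // Randomly read `lookups` elements from a Long array of `size` elements and
  // return the average nanoseconds per access.
  def timeRandomAccess(size: Int, lookups: Int): Double = {
    val data    = Array.fill(size)(Random.nextLong())
    val indices = Array.fill(lookups)(Random.nextInt(size))

    var sink  = 0L                 // accumulate so the JIT cannot eliminate the reads
    val start = System.nanoTime()
    var i = 0
    while (i < lookups) {
      sink += data(indices(i))
      i += 1
    }
    val elapsed = System.nanoTime() - start
    if (sink == 42) println("unlikely")  // keep `sink` observable
    elapsed.toDouble / lookups
  }

  def main(args: Array[String]): Unit = {
    // From 64 kB of longs up to 256 MB: the smallest fit comfortably in cache,
    // the largest spills out of L3 into main memory.
    val sizes = Seq(8 * 1024, 64 * 1024, 512 * 1024, 4 * 1024 * 1024, 32 * 1024 * 1024)
    sizes.foreach(timeRandomAccess(_, 10 * 1000 * 1000))  // warm up the JIT first
    sizes.foreach { n =>
      val ns = timeRandomAccess(n, 10 * 1000 * 1000)
      println(f"${n * 8 / 1024}%8d kB of longs: $ns%.2f ns per access")
    }
  }
}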


The results look like this:

[Chart of access time against array size, not reproduced here.]

Who would have thought little optimizations on big data can make such a huge difference?