Friday, March 28, 2025

Iceberg Tidbits

Some miscellaneous Iceberg notes.

Cleaning up

There is a logic to this. Russell Spitzer of the Apache team says:
run rewrite first
expire second
you don't need to run remove_orphans unless something broke [Discord]
Here is a quick BDD I wrote that illustrates what's going on. Basically: 

  1. rewrite_data_files compacts the data into as few files as possible.
  2. expire_snapshots then deletes any files that are now surplus.
Fewer files means less I/O and more efficient queries.
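
In Spark SQL, the two steps look something like this minimal sketch (the catalog and table names, my_catalog and db.my_table, are made up; rewrite_data_files and expire_snapshots are Iceberg's built-in procedures):

    import org.apache.spark.sql.SparkSession;

    public class TableMaintenance {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();

            // Step 1: compact the data into as few files as possible.
            spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.my_table')");

            // Step 2: expire old snapshots; files referenced only by them become deletable.
            spark.sql("CALL my_catalog.system.expire_snapshots("
                    + "table => 'db.my_table', "
                    + "older_than => TIMESTAMP '2025-03-01 00:00:00')");
        }
    }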

Russell explains the reason we don't routinely run remove_orphan_files:
To clarify, Remove Orphan Files is only for cleaning up files from failed writes or compactions that failed in a way that the process could not clean up.
Inside of Remove Orphan Files are 2 expensive remote steps:
  • Listing everything in the directories owned by the table
  • Join of that list with metadata file paths
Then the list of files to delete is brought back to the driver where deletes are performed [Discord]
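
As a conceptual sketch of those two expensive steps (this is not the real implementation: it only checks data files via the table's .files metadata table, while the real procedure also accounts for manifests and other metadata files, and the paths and names are made up):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.ArrayList;
    import java.util.List;

    public class OrphanSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().getOrCreate();

            // Expensive step 1: list everything under the table's location.
            Path root = new Path("/warehouse/db/my_table");
            FileSystem fs = root.getFileSystem(spark.sparkContext().hadoopConfiguration());
            List<String> listed = new ArrayList<>();
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(root, true);
            while (files.hasNext()) {
                listed.add(files.next().getPath().toString());
            }
            Dataset<Row> actual = spark.createDataset(listed, Encoders.STRING()).toDF("file_path");

            // Expensive step 2: join the listing with the file paths the metadata knows about.
            Dataset<Row> tracked = spark.sql("SELECT file_path FROM db.my_table.files");
            Dataset<Row> orphans = actual.join(
                    tracked, actual.col("file_path").equalTo(tracked.col("file_path")), "left_anti");

            // The candidates come back to the driver, where the deletes happen.
            orphans.show(false);
        }
    }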
This BDD demonstrates removing orphans through the Java API. I wanted to use CALL system.remove_orphan_files but couldn't. Instead, I got the error:

java.lang.IllegalArgumentException: Cannot remove orphan files with an interval less than 24 hours. Executing this procedure with a short interval may corrupt the table if other operations are happening at the same time. If you are absolutely confident that no concurrent operations will be affected by removing orphan files with such a short interval, you can use the Action API to remove orphan files with an arbitrary interval.
at org.apache.iceberg.spark.procedures.RemoveOrphanFilesProcedure.validateInterval(RemoveOrphanFilesProcedure.java:209)

which is a bit much for a simple BDD.
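
The Action API route that the message mentions looks roughly like this (a sketch, assuming a table db.my_table; olderThan takes an arbitrary cutoff in milliseconds, which is exactly what the SQL procedure refuses to accept):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.DeleteOrphanFiles;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class RemoveOrphans {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().getOrCreate();
            Table table = Spark3Util.loadIcebergTable(spark, "db.my_table");

            DeleteOrphanFiles.Result result = SparkActions.get(spark)
                    .deleteOrphanFiles(table)
                    .olderThan(System.currentTimeMillis()) // anything younger is left alone
                    .execute();

            result.orphanFileLocations().forEach(System.out::println);
        }
    }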

I forced Iceberg to break by having two threads try to write to the same table at the same time.
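
The shape of that test was roughly the following (a sketch with made-up names; whether and how a commit fails depends on timing and the catalog):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ConcurrentWrites {
        public static void main(String[] args) throws InterruptedException {
            SparkSession spark = SparkSession.builder().getOrCreate();
            Dataset<Row> df = spark.range(0, 100_000).toDF("id");

            ExecutorService pool = Executors.newFixedThreadPool(2);
            for (int i = 0; i < 2; i++) {
                pool.submit(() -> {
                    try {
                        // Both threads race to commit a snapshot to the same table.
                        df.writeTo("db.my_table").append();
                    } catch (Exception e) {
                        // A failed commit can leave files behind for remove_orphan_files to find.
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }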

Pushdown

This is more Spark than just Iceberg, but it's an interesting use case.

The IN clause is not pushed down, at least not in Spark 3.5.1 - see some nice analysis here. TL;DR: instead of using IN, the most efficient query converts it into a set of ANDs and ORs. Note that pushdown (according to Russell Spitzer) means pushing the actual function down to the data, not mapping it to some other, semantically equivalent function (as we see in that link).

[Figure: this filter was not pushed down]
Again, I've got a BDD to demonstrate this. Note that if the filters element is empty in a Physical Plan's BatchScan, your function has not been pushed down.
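
A quick way to eyeball this, assuming a table db.my_table with a numeric id column: run both forms of the query and compare the physical plans.

    import org.apache.spark.sql.SparkSession;

    public class PushdownCheck {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();

            // The IN form: check whether the predicate reaches the BatchScan's filters.
            spark.sql("SELECT * FROM db.my_table WHERE id IN (1, 2, 3)").explain();

            // The hand-rolled disjunction, for comparison.
            spark.sql("SELECT * FROM db.my_table WHERE id = 1 OR id = 2 OR id = 3").explain();

            // In each plan, find the BatchScan node: if its filters element is empty,
            // the predicate was not pushed down and rows are filtered after the scan.
        }
    }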
