Friday, September 27, 2024

Running Iceberg Catalogs in a test suite


When trying to put some BDDs together here (GitHub) for Iceberg and Spark integration, I hit some snags. I started with the local catalog. When I switched to the Hive catalog, I got this error ("Iceberg does not work with Spark's default hive metastore").
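For context, the set-up was roughly the following sketch (the catalog name and warehouse path are illustrative, not what's in my suite): an Iceberg catalog of type "hive" with no external metastore configured, so Spark falls back to its embedded, Derby-backed one.

    import org.apache.spark.sql.SparkSession

    // A minimal sketch of the kind of configuration that hit the error; names and paths are made up.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("iceberg-bdd")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.my_hive", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.my_hive.type", "hive") // no hive.metastore.uris set, so the embedded Derby metastore is used
      .config("spark.sql.catalog.my_hive.warehouse", "/tmp/warehouse")
      .getOrCreate()

    // Touching the catalog like this is the sort of thing that fails against the Derby-backed metastore.
    spark.sql("CREATE TABLE my_hive.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")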

The reason is that Derby, the database bundled with Spark's built-in Hive metastore, "doesn't support the interfaces (or concurrent access) required for Iceberg to work", according to Russell Spitzer. He suggests using the Hadoop catalog instead.

I didn't fancy setting up a more robust Hive metastore backed by a more "professional" database like Postgres, so I followed Russell's advice to use the Hadoop catalog... and immediately hit this problem, which is discussed more in Slack. Basically, CALLing some procedures via Spark SQL throws an IllegalArgumentException stating "Cannot use non-v1 table". The "v1" here refers to Spark's v1 tables and has nothing to do with Iceberg's v1 table format, it seems.
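For reference, here's roughly what the Hadoop-catalog set-up and a procedure CALL look like. The catalog name, warehouse path and the particular procedure are illustrative; which procedures trip the error will depend on what your tests actually call.

    import org.apache.spark.sql.SparkSession

    // Hadoop catalog: just a filesystem path, no metastore. Names and paths are made up.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.my_hadoop", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.my_hadoop.type", "hadoop")
      .config("spark.sql.catalog.my_hadoop.warehouse", "/tmp/hadoop_warehouse")
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS my_hadoop.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")

    // Iceberg stored procedures are invoked like this via Spark SQL; in my suite,
    // CALLs of this shape were where the IllegalArgumentException surfaced.
    spark.sql("CALL my_hadoop.system.rewrite_data_files(table => 'db.events')")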

OK, so this got me thinking: if there's no easy way to use the Hive and Hadoop catalogs, let's try a REST catalog. Apache Polaris seemed a good candidate. At first, I was hoping to run it in-process, but because of transitive dependencies it wasn't possible to have everything running easily in the same JVM.

Hmm, so we need it in a separate process. OK, we can Docker-ise that. After adjusting the fixtures, I noticed that running an erstwhile passing test a second time leads to a failure. Looking at the stack traces, it appears that, sure, we DROP TABLE IF EXISTS ..., but the underlying files are left behind. Russell Spitzer helped me out (again) by pointing out that the PURGE keyword needs to be added to the SQL.
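Roughly, the two pieces look like this on the Spark side: a catalog of type "rest" pointing at the Dockerised Polaris, and the fixture teardown before and after Russell's fix. The catalog name, URI, warehouse and credential below are placeholders for whatever your Polaris container is configured with.

    import org.apache.spark.sql.SparkSession

    // REST catalog pointing at a Polaris container; all values here are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.polaris.type", "rest")
      .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
      .config("spark.sql.catalog.polaris.warehouse", "my_warehouse")
      .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
      .getOrCreate()

    // Teardown, take one: the table disappears from the catalog, but the files from
    // the previous run linger, so the second run fails.
    spark.sql("DROP TABLE IF EXISTS polaris.db.events")

    // Teardown, take two: PURGE also asks for the underlying files to be deleted.
    spark.sql("DROP TABLE IF EXISTS polaris.db.events PURGE")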

Does the catalog also need to delete metadata files even if the query engine deletes the data files?
Russell Spitzer
This isn't actually defined anywhere. So at the moment we are kind of in a wild west, the Spark implementation deletes absolutely everything. I believe for Trino it just sends the request to the catalog and it's up to the catalog to decide. So for Spark "no", and it explicitly sends a drop table request to the rest catalog without a purge flag.
Russell says Polaris will delete files when this flag is set [caveat], although there is debate about making this service asynchronous ("I think polaris could own this service too but I know some folks (like netflix) have delete services already") and about "the option to hook into an external delete service".

So, there is some latitude in how one implements a catalog, it seems. This would explain why some of my tests are failing: some assumptions I made are not actually guaranteed. I'm still working on making them pass.
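To make that concrete, here's the shape of the assumption that bit me (a sketch; the table name and warehouse path are made up, and the helper is hypothetical). Asserting that the files are gone after the DROP is really an assertion about the catalog's behaviour, not something Iceberg itself guarantees.

    import java.nio.file.{Files, Paths}
    import org.apache.spark.sql.SparkSession

    // 'spark' is a session configured against the REST catalog, as in the earlier sketch.
    def tableFilesRemovedAfterDrop(spark: SparkSession): Boolean = {
      spark.sql("DROP TABLE IF EXISTS polaris.db.events PURGE")
      // Whether the files are actually gone depends on the catalog (and possibly an
      // asynchronous delete service), so this check is an assumption, not a guarantee.
      !Files.exists(Paths.get("/tmp/warehouse/db/events"))
    }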
