Sunday, June 16, 2024

First look at the Unity Catalog

Databricks' Unity Catalog has now been open sourced [Git]. But there's not a huge amount of code there - I counted a mere 133 Java files that were neither tests nor examples.

This is not too surprising since it's little more than a REST API over a (H2) database that stores metadata. It is just a catalog after all.

What is a little more surprising is that "MANAGED table creation is not supported yet." [GitHub] Managed tables are those that Unity, well, manages. That is, the whole life cycle of the table is under its purview. Contrast this with EXTERNAL tables that live elsewhere, who knows where - it's not important. 
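To make the MANAGED/EXTERNAL distinction concrete, here is a minimal sketch in Python. It is a toy in-memory catalog, not Unity Catalog's actual code or API: the names ToyCatalog, TableEntry and demo are made up for illustration, and local directories stand in for table storage. The only point is that dropping a managed table removes its data, while dropping an external table removes only the catalog entry.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TableEntry:
    name: str
    location: Path
    table_type: str  # "MANAGED" or "EXTERNAL"


class ToyCatalog:
    """Hypothetical in-memory catalog; not Unity Catalog's real API."""

    def __init__(self):
        self.tables = {}

    def register(self, name, location, table_type):
        self.tables[name] = TableEntry(name, Path(location), table_type)

    def drop(self, name):
        entry = self.tables.pop(name)
        # The catalog owns the data of a managed table, so the data goes too.
        # For an external table only the metadata entry is forgotten.
        if entry.table_type == "MANAGED":
            shutil.rmtree(entry.location, ignore_errors=True)
        return entry


def demo():
    # Local temp directories stand in for wherever the table data lives.
    root = Path(tempfile.mkdtemp())
    managed_dir = root / "managed_tbl"
    external_dir = root / "external_tbl"
    managed_dir.mkdir()
    external_dir.mkdir()

    catalog = ToyCatalog()
    catalog.register("managed_tbl", managed_dir, "MANAGED")
    catalog.register("external_tbl", external_dir, "EXTERNAL")

    catalog.drop("managed_tbl")
    catalog.drop("external_tbl")

    # Report whether each table's data survived the drop.
    return managed_dir.exists(), external_dir.exists()
```

Calling demo() drops both tables and reports which data directories are still on disk afterwards.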

Remember that Unity is a catalog, not a data store. Confusing the map with the territory [Wikipedia] is a common mistake when first coming across metastores. This is perhaps because the Apache Hive project is somewhat moribund these days while Hive's metastore is very much alive. But think of a metastore like web URLs, where the physical location of the machine hosting the website is irrelevant. Only its domain name is what we care about.
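The URL analogy can be sketched in a few lines of Python. This is an illustration only, not how Unity Catalog is implemented: ToyMetastore, the table name and the bucket paths are all hypothetical. The catalog is just a lookup from a stable name to wherever the data currently lives, much as DNS maps a domain name to a host.

```python
class ToyMetastore:
    """Hypothetical name-to-location lookup; not a real metastore API."""

    def __init__(self):
        self._locations = {}

    def register(self, full_name, location):
        self._locations[full_name] = location

    def relocate(self, full_name, new_location):
        # Only the catalog entry changes; the name clients use stays the same.
        if full_name not in self._locations:
            raise KeyError(full_name)
        self._locations[full_name] = new_location

    def resolve(self, full_name):
        return self._locations[full_name]


metastore = ToyMetastore()
# Made-up three-level name (catalog.schema.table) and made-up storage paths.
metastore.register("unity.default.numbers", "s3://bucket-a/numbers")
before = metastore.resolve("unity.default.numbers")

metastore.relocate("unity.default.numbers", "s3://bucket-b/numbers")
after = metastore.resolve("unity.default.numbers")
```

A client that only ever asks for unity.default.numbers never needs to know the data moved from one bucket to another.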

With Databricks' recent acquisition of Tabular.io (the de facto commercial force behind Apache Iceberg), it will be interesting to see how Iceberg integrates with Unity, if at all. Iceberg does not (as demonstrated here) store its table information in a metastore (unlike Delta). All its information is contained in the metadata files. Whether Databricks will encourage Iceberg to be tied to the metastore remains to be seen.