Continuing my previous post's investigation into a lack of cache coherency in Databricks, I've raised the issue with the people at Azure.
In the document a representative from Microsoft pointed me to, there are two caches at play: the Delta Cache and the Apache Spark cache. "The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage" and is enabled by default on certain cluster types. In effect, you get it for free when you use Databricks.
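For what it's worth, the docs also describe a session-level switch for it. A minimal sketch, assuming session is your SparkSession (the config key is the one the Databricks docs give; whether it is on by default depends on the cluster type):

// Enable the Delta Cache for the current session.
// (Key as per the Databricks docs; defaults vary by cluster type.)
session.conf.set("spark.databricks.io.cache.enabled", "true")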
The Apache Spark cache is the familiar one that is invoked by calling Dataset.cache().
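For reference, that looks something like this (a minimal sketch; session is an active SparkSession and the path is hypothetical):

// Minimal sketch: session is an active SparkSession; the path is hypothetical.
val df = session.read.format("delta").load("/mnt/delta/events")
df.cache()   // lazily marks the Dataset for caching
df.count()   // the first action actually materialises the cache

Note that .cache() is lazy: nothing is actually cached until the first action runs.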
My first complaint is that these two caches behave inconsistently. With the Delta Cache, updates are immediately visible. But if we use the Apache Spark cache in conjunction with Databricks' Delta Lake (not to be confused with the orthogonal Delta Cache), the data is frozen in time.
Now, in the Spark world, Datasets are strictly immutable. But Delta Lake "provides ACID transactions ... on top of your existing data lake and is fully compatible with Apache Spark APIs." Well, suppose I've invoked .cache() and then update the Delta Lake in the standard way:
df.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", CONDITION_STRING)
.save(HDFS_DIRECTORY)
The cached Dataset simply does not see the change; it carries on serving the old, pre-update data. What is perhaps even more bizarre can be seen when I try to read the data afresh. A second, totally independent call (but using the same Spark session) to read the data like this:
session.read.format("delta").load(dir)
also returns the stale, cached data rather than what actually lives in the underlying store.
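To make the failure mode concrete, here is a minimal end-to-end sketch of what I'm seeing (all names, paths and data are hypothetical):

// Repro sketch; all names, paths and data are hypothetical.
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder.getOrCreate()
import session.implicits._

val path = "/mnt/delta/events"

// Read the Delta table, cache it, and materialise the cache with an action.
val df = session.read.format("delta").load(path)
df.cache()
println(df.count())   // say this prints 100

// Overwrite one slice of the table through the standard Delta API.
val newData = Seq(("2020-01-01", 42L)).toDF("date", "value")
newData.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date = '2020-01-01'")
  .save(path)

// Neither the cached Dataset nor a completely fresh read sees the update:
println(df.count())                                        // unchanged
println(session.read.format("delta").load(path).count())   // still the old count!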
This brings me to my next complaint: the leaky abstraction. Databricks is trying to retrofit ACID transactions onto an API that was built with immutability at its core. The claim that it is "fully compatible with Apache Spark APIs" seems not entirely true.
I have a meeting scheduled with the MS representative, who is currently of the opinion that we should simply never call .cache(). On Azure Databricks, this does not seem too bad as the Delta Cache seems pretty fast. It sucks for anybody using Delta Lake on plain HDFS, though.
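In the meantime, if you need to force Spark to drop a stale cache, Spark itself has a couple of escape hatches. A sketch, not something anyone at Microsoft has signed off on:

// Drop the cached data for one Dataset so reads go back to storage.
df.unpersist()

// Or, more bluntly, drop everything cached in this session.
session.catalog.clearCache()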
Summary
If you call .cache() on a Spark Dataset while using the Databricks Delta Lake format, you will not be able to:
- see updates from any other Spark session
- see your own updates to this data
- see the true data even if you re-read it from the underlying data store