Tuesday, December 8, 2020

Cache vs Delta Cache

Continuing the investigation into a lack of cache coherency in Databricks in my previous post,  I've raised the issue with the people at Azure.

In the document a representative from Microsoft pointed me to, there are two caches at play: Delta Cache and Apache Spark cache. "The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage" and is enabled by default on certain clusters. In effect, you get it for free when you use Databricks 

The Apache Spark cache is familiar one that is invoked by calling Dataset.cache().

My first complaint is that these two caches behave inconsistently. In the Delta Cache, updates are immediately available. But if we use the Apache Spark cache in conjunction with the Databrick's Delta Lake (not to be confused with the orthogonal Delta Cache), the data is frozen in time.

Now, in the Spark world, Datasets are strictly immutable. But Delta Lake "provides ACID transactions ... on top of your existing data lake and is fully compatible with Apache Spark APIs." Well, once I've invoked .cache(), and update the Delta Lake in the standard way:

    df.write
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", CONDITION_STRING)
      .save(HDFS_DIRECTORY)

(working example is here) then I cannot see my own updates!

What is, perhaps, even more bizarre can be seen when I try to read the data afresh. A second, totally independent call (but using the same Spark session) to read the data like this:

session.read.format("delta").load(dir)

cannot see the data either! And just to show you that there is nothing up my sleeve, running a completely separate process (code here) to read the data shows that the change has indeed been persisted to HDFS. This appears to be because Spark caches the data plus the Delta Lake meta data too and cares not for Databricks' new semantics.

This brings me to my next complaint - the leaky abstraction. Databricks is trying to retrofit ACID transactions on an API that was built with immutability at its core. The claim that it's "fully compatible with Apache Spark APIs" seems not enitrely true.

I have a meeting scheduled with the MS representative who is currently of the opinion that we should just never call .cache(). On Azure Databrics, this does not seem too bad as the Delta Cache seems pretty fast. It sucks for anybody using Delta Lake on just HDFS.

Summary

If you call .cache() on a Spark Dataset while using Databricks Delta Lake format you will not be able to:

  • see updates from any other Spark session
  • see your own updates to this data
  • see the true data even if you re-read it from the underlying data store
unlesss you unpersist the Dataset.

[Addendum: I spoke to Databricks engineer, Sandeep Katta, on 17 December 2020 and he agreed the documentation is misleading. He says he'll generate a docs PR and submit it to the relevant team to make it clear that one should not use .cache when using Delta Lake]

No comments:

Post a Comment