Apache Iceberg deals with updates in two different ways:
- Merge on read
- Copy on write
What are these strategies? To illustrate, I have some BDD code that writes and then updates 20 rows of data using both MOR and COW. (There's an interesting article at Dremio about this topic, but some of its code is Dremio-specific.)
Simply put, copy-on-write replaces the entire Parquet file if a single row is updated.
In merge-on-read, a file containing only the updated data is written, along with a file saying which rows were affected. It is the reader that needs to reconcile the data.
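To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Iceberg is implemented: `ToyTable`, its method names and the file names are all made up for illustration, and "files" are just lists held in memory. It writes 20 rows and updates one of them with each strategy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a table: file name -> list of (id, value) rows."""
    data_files: dict = field(default_factory=dict)
    # Merge-on-read only: (file name, row position) pairs marking dead rows.
    position_deletes: set = field(default_factory=set)
    rows_written: int = 0  # crude measure of write cost

    def append(self, name, rows):
        self.data_files[name] = list(rows)
        self.rows_written += len(rows)

    def update_cow(self, row_id, new_value):
        """Copy-on-write: rewrite the whole file that holds the row."""
        for name, rows in list(self.data_files.items()):
            if any(rid == row_id for rid, _ in rows):
                rewritten = [(rid, new_value if rid == row_id else val)
                             for rid, val in rows]
                del self.data_files[name]
                self.append(name + "-v2", rewritten)

    def update_mor(self, row_id, new_value):
        """Merge-on-read: keep the old file, add a delete marker and a small file."""
        for name, rows in list(self.data_files.items()):
            for pos, (rid, _) in enumerate(rows):
                if rid == row_id and (name, pos) not in self.position_deletes:
                    self.position_deletes.add((name, pos))
                    new_name = f"{name}-upd{len(self.data_files)}"
                    self.append(new_name, [(row_id, new_value)])
                    return

    def read(self):
        """The reader reconciles: rows marked as deleted are skipped."""
        live = []
        for name, rows in self.data_files.items():
            for pos, row in enumerate(rows):
                if (name, pos) not in self.position_deletes:
                    live.append(row)
        return sorted(live)


rows = [(i, f"v{i}") for i in range(20)]
cow, mor = ToyTable(), ToyTable()
cow.append("data-0", rows)
mor.append("data-0", rows)
cow.update_cow(7, "changed")
mor.update_mor(7, "changed")
```

Both tables return the same rows from `read()`, but in this toy the COW table has rewritten all 20 rows to change one, while the MOR table has written a single row plus a delete marker and pushed the reconciliation work onto the reader.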
The two strategies are for different audiences. COW makes writes slow and reads fast. MOR makes reads slow and writes fast.
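As far as I know, the strategy is chosen per table through the Iceberg table properties `write.delete.mode`, `write.update.mode` and `write.merge.mode`, each set to `copy-on-write` or `merge-on-read`. Here is a small sketch that builds the Spark SQL statement for this. The helper name and the table name `local.db.events` are hypothetical, and actually running the statement needs a Spark session with an Iceberg catalog, which is not shown.

```python
VALID_MODES = {"copy-on-write", "merge-on-read"}


def set_write_mode_sql(table, mode):
    """Build an ALTER TABLE statement that sets all three write modes.

    `table` is whatever identifier your catalog uses (hypothetical here).
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    props = ", ".join(
        f"'write.{op}.mode'='{mode}'" for op in ("delete", "update", "merge")
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


statement = set_write_mode_sql("local.db.events", "merge-on-read")
print(statement)
```

The resulting string could then be passed to `spark.sql(...)`.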
Focussing on the more complex of the two strategies, MOR, we see that the update creates four new files compared to COW's two. This maps to two new Parquet files, since each Parquet file in both strategies is accompanied by a .crc file that holds its checksum. One file contains the updated data; the other holds a textual reference to the original file containing the original data. No files are deleted and the original Parquet file remains the same. No data is redundant.
Iceberg has another way of recording deletes besides the row's position in the file: it can also match rows by equality on column values. However, this does not appear to be supported by Spark at the moment.
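The difference between the two kinds of delete can be sketched in a few lines of Python. Again, this is a toy under my own assumptions rather than Iceberg code: the function names are invented, and rows are plain dicts.

```python
def apply_position_deletes(rows, deleted_positions):
    """Drop rows by their position within one data file."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


def apply_equality_deletes(rows, delete_keys):
    """Drop rows whose columns equal every column of any delete key.

    `delete_keys` is a list of dicts mapping column name -> value.
    """
    def matches(row, key):
        return all(row.get(col) == val for col, val in key.items())

    return [row for row in rows if not any(matches(row, k) for k in delete_keys)]


data = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

by_position = apply_position_deletes(data, {1})          # second row in the file
by_equality = apply_equality_deletes(data, [{"id": 2}])  # any row where id == 2
```

Here both calls remove the same row. The practical difference is that a position delete has to know where the row lives, while an equality delete only has to know what it looks like, which leaves the reader with more matching to do.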
(BTW, the title of this post refers to this SNL sketch).