I watched Ryan Blue's excellent "Parquet Performance Tuning" video (here and slides here). Before watching it, though, it's worth familiarizing yourself with the glossary on the Parquet GitHub page:
Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.
Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.
Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages. The units of parallelization are:
| Job | Granularity |
|----------------------|-----------------|
| MapReduce | File/Row Group |
| IO | Column chunk |
| Encoding/Compression | Page |
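To make that hierarchy concrete, here is a minimal sketch (Scala 2.13, parquet-hadoop) that opens a file's footer and prints its row groups, the column chunks inside each, and their encodings and stats, much like `parquet-tools meta` does. The file path is made up for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.jdk.CollectionConverters._

object InspectFooter {
  def main(args: Array[String]): Unit = {
    // Hypothetical path; point this at any Parquet file you have.
    val file   = HadoopInputFile.fromPath(new Path("/data/events/part-00000.parquet"), new Configuration())
    val reader = ParquetFileReader.open(file)
    try {
      // One BlockMetaData per row group; each row group holds one column chunk per column.
      reader.getFooter.getBlocks.asScala.zipWithIndex.foreach { case (rowGroup, i) =>
        println(s"row group $i: rows=${rowGroup.getRowCount}, bytes=${rowGroup.getTotalByteSize}")
        rowGroup.getColumns.asScala.foreach { chunk =>
          println(s"  column ${chunk.getPath}: size=${chunk.getTotalSize}, " +
            s"encodings=${chunk.getEncodings.asScala.mkString(",")}, stats=${chunk.getStatistics}")
        }
      }
    } finally reader.close()
  }
}
```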
Tuning Parquet
My notes from the video are:
- Parquet builds a dictionary and "just stores pointers into that dictionary ... basically the big trick in Parquet compression".
- Dictionaries not only save space but can be used for filtering (no entry means that data page doesn't have it). It is highly recommended to arrange for data locality, as it leads to smaller dictionaries.
- Be careful of writing too many files (22'18"). Incoming partitions are not the same as outgoing partitions; that is, the partition key of the incoming file need not be that of the outgoing one (see the write-layout sketch after this list).
- Look at CPU usage and bytes read rather than wall time, as wall time can vary (just plain statistics).
- Use parquet-tools to see the stats, demonstrating that your data really is partitioned as you expect and has sensible splits (e.g., a single A-Z split is wrong; A-F, F-M, etc. is better).
- Every row group has its own dictionary (43'10").
- Decreasing row group size increases parallelism.
- "You're going to need a sort anyway if you want to take advantage of stats" (46'10"). See below for a further explanation.
- Brotli compression "does really really well with text data" (47'20").
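Pulling several of those points together, here is a minimal Spark (Scala) sketch of the write-side advice: repartition on the outgoing partition key so you don't write too many small files, and sort within partitions on the filtering column so each row group's dictionary stays small and its min/max stats become useful. The table and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-layout").getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/raw/events")

events
  .repartition($"event_date")            // outgoing partitions need not match incoming ones
  .sortWithinPartitions($"user_id")      // cluster values so each row group's dictionary stays small
  .write
  .partitionBy("event_date")             // directory-level partitioning (not Spark partitioning)
  .option("compression", "snappy")       // or brotli/zstd where your build supports it
  .parquet("/data/curated/events")
```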
To take advantage of the stats embedded in the files, "you need to insert the data sorted on filtering columns. Then you will benefit from min/max indexes and, in the case of ORC, additionally from bloom filters, if you configure them. In any case I recommend also partitioning of files (do not confuse with Spark partitioning)" (see here).
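The read-side counterpart of the sketch above: because the data was written sorted on `user_id`, the min/max stats in each row group let the reader skip chunks that cannot match the filter, while `event_date` is handled by partition pruning. Column names continue the hypothetical example.

```scala
// Assumes the same SparkSession and implicits as the previous sketch.
val oneUser = spark.read
  .parquet("/data/curated/events")
  .filter($"event_date" === "2017-01-01" && $"user_id" === "u-12345")
```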
Interestingly, the Parquet documentation has something to say about HDFS block sizes: "We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file."
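A hedged sketch of wiring those sizes up in Spark: `parquet.block.size` is the Parquet row group size and `dfs.blocksize` the HDFS block size requested for new files. Whether 1 GB is right for you depends on memory and, as noted above, smaller row groups buy more parallelism.

```scala
// Assumes the same SparkSession as the earlier sketches; paths are hypothetical.
val oneGiB = 1024L * 1024 * 1024
spark.sparkContext.hadoopConfiguration.setLong("parquet.block.size", oneGiB) // row group size
spark.sparkContext.hadoopConfiguration.setLong("dfs.blocksize", oneGiB)      // HDFS block size

spark.read.parquet("/data/raw/events")
  .write
  .parquet("/data/curated/events_1gb_rowgroups")
```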