This Spark/Parquet abuse bit us hard this week. "In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory" (from here). This is great for compression as values do not even appear in the file as they are encoded in the directory structure of a Parquet file.
Unfortunately, if you save the Dataset to HDFS, it appears that a new file on HDFS is created for each contiguous block that belongs to the same Parquet directory. For example, if there were only two values for a key, 1 and 2, and the Dataset would map elements to partitions like this:
then this piece of the sequence would result in 5 different files having keys [1,1], , , [2,2,2] and [1,1,1].
So, instead of fewer but larger files, we had many smaller files. This is a scenario that HDFS does not handle very well and it manifested its displeasure by a DDOS on the Name Node.
What's more, resolving the problem by deleting them also caused the Name Node pain. Using the -skipTrash flag made things a little better.
The solution was to sort the Dataset before saving.