The user can fiddle with a few knobs to mitigate this. One is write.distribution-mode. Here are some tests I ran to see how configuration changes affect the number of files written for the same data:
| write.distribution-mode | Number of files | Notes |
| --- | --- | --- |
| "hash" | p | df.writeTo(tableName).append() |
| "hash", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
| "hash", sorted table | p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
| "hash", sorted table but only one value for partitionField | 1 | because p=1; assumes the size of the data to write is < write.spark.advisory-partition-size-bytes. Otherwise multiple files are written (Spark 3.5). |
| "none" | d * p | df.writeTo(tableName).append() |
| "none", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
| "none", sorted table | d * p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
| "none", sorted table but only one value for partitionField | d | because p=1 |

p = number of (logical) partitions in the data, i.e. distinct values of partitionField

d = number of (physical) partitions in the DataFrame
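
To make the pattern in the table easier to see, here is a minimal sketch that encodes the observed file counts as a function. It is only a model of my test results, not Iceberg's actual logic. The function name and signature are made up for illustration, it assumes the data for each partition stays below write.spark.advisory-partition-size-bytes, and the d * p figure assumes every Spark task holds rows for every partition value.

```python
def expected_file_count(mode: str, p: int, d: int, sorted_dataframe: bool = False) -> int:
    """Return the number of files seen in the tests above (hypothetical helper).

    mode             -- value of write.distribution-mode ("hash" or "none")
    p                -- number of logical partitions (distinct partition values)
    d                -- number of physical partitions in the DataFrame
    sorted_dataframe -- True if the DataFrame was sorted by the partition field
                        before the write; a sort order on the table made no
                        difference in the tests, so it is not a parameter
    """
    if mode == "hash":
        # Rows are shuffled by partition value, so each value lands in one task.
        return p
    if mode == "none":
        # Without a shuffle, each task may write a file per partition value,
        # unless sorting the DataFrame has already clustered the values.
        return p if sorted_dataframe else d * p
    raise ValueError(f"mode not covered by these tests: {mode!r}")


print(expected_file_count("none", p=10, d=200))  # 2000
print(expected_file_count("hash", p=10, d=200))  # 10
```

With 200 DataFrame partitions and 10 partition values, "none" gives 2000 files where "hash" gives 10, which is why the distribution mode matters so much for small-file problems.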
Note that these results are for Spark 3.5. With a distribution mode of hash and data larger than write.spark.advisory-partition-size-bytes, multiple threads write multiple files.
In Spark 3.3, by contrast, with a distribution mode of hash and data larger than write.spark.advisory-partition-size-bytes, only one thread writes.
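
For reference, both properties mentioned here can be set on the table itself. The sketch below only builds the Spark SQL statement as a string, so it runs without a Spark session. The helper name, the table name catalog.db.events and the 256 MB value are made up for illustration, not recommendations.

```python
from typing import Mapping


def set_table_properties_sql(table: str, props: Mapping[str, str]) -> str:
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    assignments = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


stmt = set_table_properties_sql(
    "catalog.db.events",  # hypothetical table name
    {
        "write.distribution-mode": "hash",
        # Illustrative value only: 256 MB expressed in bytes.
        "write.spark.advisory-partition-size-bytes": str(256 * 1024 * 1024),
    },
)
print(stmt)
# With a live session this would be executed as: spark.sql(stmt)
```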
Fanout made no difference to the number of files in my tests, but it should still be enabled. Contrary to the documentation, Russell Spitzer says on Discord:
"Fanout writer is better in all cases. We were silly. The memory requirements were tiny IMHO. Without fanout, you need to presort within the task but that ends up being way more expensive (and memory intesive) IMHO. In the latest versions @Anton Okolnychyi removed the local sort requirements if fanout is enabled, so I would recommend fanout always be enabled and especially if you are using distribution mode is none."