If you're having trouble making Spark actually work, you're not alone. I can sympathize with the author of these gotchas. What was killing me for weeks was how slow one Spark job was: tens of hours. Further investigation showed a lot of GC, something like a third of the run time. Isn't Spark supposed to handle huge volumes of data?
This one tip proved to be a quick win:
This meant that the data held in memory was serialized. As a result, the memory footprint was 80% smaller (we use Kryo). In itself, that's cool. But the great thing was that all our data now fitted into the memory of the cluster.
Sure, Spark can process more data than fits in physical memory, but if it's constantly moving data to and from disk, that's not only slow, it's also going to cause a lot of garbage collection.
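For reference, one common way to keep cached data serialized in Spark is to persist it with a serialized storage level and use Kryo as the serializer. The following is a minimal sketch under that assumption, not necessarily the exact setting from the tip above. The app name, the `local[*]` fallback master and the toy `records` RDD are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("serialized-cache-sketch")   // hypothetical name
      .setIfMissing("spark.master", "local[*]") // assumption: run locally unless told otherwise
      // Use Kryo instead of the default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Toy data standing in for the real job's input.
    val records = sc.parallelize(1 to 1000000).map(i => (i, s"record-$i"))

    // Cache each partition as serialized bytes rather than as live Java objects.
    records.persist(StorageLevel.MEMORY_ONLY_SER)

    println(records.count()) // first action computes and caches the RDD
    println(records.count()) // second action reads the serialized cache

    sc.stop()
  }
}
```

With a serialized storage level, each cached partition is stored as a byte array instead of a graph of individual objects, so the garbage collector has far fewer objects to track. How much memory this saves depends on the data; the Storage tab of the Spark UI shows the in-memory size of each cached RDD, which makes it easy to compare against `MEMORY_ONLY`.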
With this setting, full garbage collections took only 30 seconds across hours of processing.
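If you want to check GC time on your own job, the Spark UI reports a "GC Time" figure per task and per executor. Another option is to switch on the JVM's GC logging for the executors. This is a small sketch assuming you build the `SparkConf` yourself; the app name is hypothetical, and `-verbose:gc` is only the most basic JVM flag (more detailed logging flags vary by Java version).

```scala
import org.apache.spark.SparkConf

object GcLoggingSketch {
  // Returns a SparkConf whose executors log each garbage collection to stdout.
  def confWithGcLogging(): SparkConf =
    new SparkConf()
      .setAppName("gc-logging-sketch") // hypothetical name
      .set("spark.executor.extraJavaOptions", "-verbose:gc")
}
```

The GC lines then show up in each executor's stdout log, not in the driver's output.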
The obvious downside is that this requires more CPU, since the data has to be deserialized every time it's read. But not much more, and a lot less than a JVM that's constantly garbage collecting.