Previously, 20k sentences would take a few minutes using Spark's built-in machine learning algorithms. So, I was surprised that 20k larger documents could take far, far longer.
Note: each data-point was reduced to a vector of length 20 irrespective of whether it was a short sentence or a long piece of text. So document size should have been irrelevant.
After over three hours, my One-vs-All, Support Vector Machine still was not finished.
I managed to fix the problem but first, here are the results for the models I used on the larger documents. Again, the figures are indicative as I did not run the models more than once.
|LinearSVC and OneVsRest||74.2|
Despite calling .cache() on every DataFrame in sight and everything apparently fitting into memory with gigabytes to spare, performance was dreadful. (Note that DataSet.cache has a default StorageLevel of MEMORY_AND_DISK whereas RDD.cache defaults to MEMORY_ONLY).
Using jstat, it was clear all the executor threads were involved in some sort of serialization with deep stack-frames in java.io.ObjectInputStream.
So, instead I tried calling checkpoint on the DataFrame before passing it to the machine learning algorithm. This has the effect of removing all lineage and writing to the HDFS directory you must set in sc.setCheckpointDir.
"Broadly speaking, we advise persisting when jobs are slow and checkpointing when they are failing"  say Karau and Warren but since my job was so slow it might as well have failed I had nothing to lose.
Indeed, the SVM was now able to process the data in a few minutes. Quite why cacheing didn't work as expected is something of a mystery that I will look at in another post.
 High Performance Spark