I'd heard Random Forests generally give good results but the accuracy I was seeing was bad. After some head scratching, things improved enormously.
What are Random Forests?
"A random forest is simply a collection of decision trees whose results are aggregated into one final result. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models" (from TowardsDataScience).
The matrix for the 20 Newsgroups data was generated using TF-IDF. Interestingly, if all data is used, the accuracy from Spark's RF implementation is the monkey score (roughly 5%).
However, if we insist that a word must appear in at least 20 documents the accuracy was approximately 25%.
Changing the Random Forest hyperparameters for minDocFreq=1 only achieved a limited increase in accuracy; setting the number of trees hyperparameters to [10, 100, 100] gave an accuracy percentage of [10.3, 27.3, 38.3].
Removing columns with no non-zero values in them (as outlined here) managed to increase the accuracy to between 42% and 48%. But no matter what hyperparameter tuning I did, there it stayed.
What using using minDocFreq=25 and removing empty columns had in common was that they resulted in TF-IDF matrices that were less sparse.
"There is no [Random Forest] implementation for sparse data in R. Partially because RF does not fit very well on this type of problem -- bagging and suboptimal selection of splits may waste most of the model insight on zero-only areas. Try some kernel method or better think of converting your data into some more lush representation with some descriptors (or use some dimensionality reduction method)." (from StackOverflow).
(See On the Surprising Behaviour of Distance Metrics in High Dimensional Space for more information).
So, let's go all the way and using Singular Value Decomposition. This reduced my TF-IDF from having about 10 000 features to a mere 400. This figure of 400 was found by just a manual binary search that attempted to maximize the accuracy. Interestingly, the eigenvalue "elbow" can be seen in this scree plot:
After this (and some more cross validation that said my tree size should be about 190), accuracy was reaching about 81%.
Note, you may need to call setMaxMemoryInMB to something quite high otherwise executors will die for no apparent reason. In fact, you'll see nothing special when you call jstat -gc on the executors.
A Matter of Interpretation
One very nice thing about tree models is that they're easy to interpret compared to say neural nets.
Spark gives a very nice method (featureImportances) that tells you which features were the most important when generating the model.
Gradient Boosted Trees
Note that Spark's facility for boosting trees is currently limited. In Spark, "GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests." (from the docs).