...but is she looking good? (With apologies to Kraftwerk).
"No single algorithm performs best or worst. This is wisdom known to machine learning practitioners, but difficult to grasp for beginners in the field. There is no silver bullet and you must test a suite of algorithms on a given dataset to see what works best." (MachineLearningMastery).
This blog goes on to quote from a report that compares models and data sets: "no one ML algorithm performs best across all 165 datasets. For example, there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting, despite being the overall worst- and best-ranked algorithms, respectively".
Another comparison is made here as a data scientist reproduces part of her PhD using modern tools. Interestingly, Naive Bayes this time is a close second to SVMs (a result echoed here when using Python).
For my part, I took the "20 Newsgroups" data set and fed it into Knime's Document Classification Example. (I like this data set as it's close to my proprietary data set). With almost no change, my results for the "subjects" part of the data was:
Boosting generally reduced all models by 5-10% in accuracy. So did removing just the punctuation rather than the more sophisticated massaging the data as in the example.
Interestingly, the results on the full corpus (not just subject text) was only about half as good.This could be that Knime could not store everything in memory. This appears to be because of how the data is split between train and test. A random split will mean there is a good chance a given record/category mapping is in both the train and test data. One must instead ensure that record/category tuples do not straddle both test and train data sets.
Note that this Knime example only uses the top words in the corpus as the TF-IDF matrix would be far too big to fit into memory otherwise. Here, Spark has the advantage of being able to easily process the full matrix (Naive Bayes for instance scores about 85% there). So, these results should constitute a minimum of what we can achieve in Spark.
In Spark, I just first cleared the text of all punctuation with the code in this StackOverflow suggestion, that is:
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
and then ran the ML pipeline of [Tokenizer, StopWordRemover, NGram, IDF, HashingTF, Normalizer]. The results looked like:
The table is a little unfair as I spent a disproportionate amount of time tuning the models. And, as ever, YMMV (your mileage may vary). These are the results for my data, yours will probably look very different.
Interestingly, creating a hand-made ensemble of NaiveBayes, MultilayerPerceptronClassifier and RandomForestClassifier didn't improve matters. The result on these 3 models trained on the same data and voting on the test data gave an accuracy of 81.0%.
Finally, there were two algorithms that I've mentioned before that were not part of this work but I'll include them for completeness:
Ensemble Cast
So, taking five Spark models (LinearSVC, NaiveBayes, MultilayerPerceptron, RandomForestClassifier and LogisticRegression), we can take the results and using joinWith and map, weave the DataFrames together and let them vote on which category a given subject should be in.
Unfortunately, this roll-your-own bagging did not provide significantly better results. The overall accuracy at 86.029% was a mere 0.022% better than the best stand alone model.
"No single algorithm performs best or worst. This is wisdom known to machine learning practitioners, but difficult to grasp for beginners in the field. There is no silver bullet and you must test a suite of algorithms on a given dataset to see what works best." (MachineLearningMastery).
This blog goes on to quote from a report that compares models and data sets: "no one ML algorithm performs best across all 165 datasets. For example, there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting, despite being the overall worst- and best-ranked algorithms, respectively".
Another comparison is made here as a data scientist reproduces part of her PhD using modern tools. Interestingly, Naive Bayes this time is a close second to SVMs (a result echoed here when using Python).
For my part, I took the "20 Newsgroups" data set and fed it into Knime's Document Classification Example. (I like this data set as it's close to my proprietary data set.) With almost no change, my results for the "subjects" part of the data were:
| Model | Accuracy |
|-------|----------|
| Decision Tree | 78% |
| SVM | 76% |
| KNN | 73% |
| Naive Bayes | 70% |
Boosting generally reduced the accuracy of all models by 5-10%. So did removing just the punctuation rather than applying the more sophisticated massaging of the data in the example.
Interestingly, the results on the full corpus (not just the subject text) were only about half as good. At first I thought Knime could not store everything in memory, but it appears to be down to how the data is split between train and test: a random split means there is a good chance a given record/category mapping ends up in both the train and test data. One must instead ensure that record/category tuples do not straddle both data sets.
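For what it's worth, a minimal sketch of such a split in Spark, assuming a hypothetical DataFrame `docs` with `recordId`, `category` and `text` columns (the names are illustrative):

```scala
import org.apache.spark.sql.DataFrame

// Split the *distinct* record/category tuples first, then recover the full
// rows with semi joins, so no tuple straddles both train and test.
def splitByRecord(docs: DataFrame, trainFraction: Double = 0.8): (DataFrame, DataFrame) = {
  val Array(trainKeys, testKeys) = docs
    .select("recordId", "category")
    .distinct()
    .randomSplit(Array(trainFraction, 1 - trainFraction), seed = 42L)

  val train = docs.join(trainKeys, Seq("recordId", "category"), "left_semi")
  val test  = docs.join(testKeys,  Seq("recordId", "category"), "left_semi")
  (train, test)
}
```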
Note that this Knime example uses only the top words in the corpus, as the TF-IDF matrix would otherwise be far too big to fit into memory. Here, Spark has the advantage of being able to easily process the full matrix (Naive Bayes, for instance, scores about 85% there). So, these results should constitute a minimum of what we can achieve in Spark.
In Spark, I first cleared the text of all punctuation with the code from this StackOverflow suggestion:

s.replaceAll("""[\p{Punct}&&[^.]]""", "")

and then ran the ML pipeline of [Tokenizer, StopWordsRemover, NGram, HashingTF, IDF, Normalizer].
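A rough sketch of how those stages wire together in Spark ML (the column names and n-gram size are illustrative; 262144 is HashingTF's default of 2^18 features, matching the uncompressed TF-IDF vector size quoted in the table below):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, NGram, Normalizer, StopWordsRemover, Tokenizer}

val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("tokens")
val remover    = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val ngrams     = new NGram().setN(2).setInputCol("filtered").setOutputCol("ngrams")
// 2^18 = 262144 hash buckets: the (uncompressed) TF-IDF vector size.
val hashingTF  = new HashingTF().setNumFeatures(262144).setInputCol("ngrams").setOutputCol("rawFeatures")
val idf        = new IDF().setInputCol("rawFeatures").setOutputCol("tfidf")
val normalizer = new Normalizer().setInputCol("tfidf").setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, ngrams, hashingTF, idf, normalizer))

// train/test come from a leakage-free split as sketched above
val features = pipeline.fit(train).transform(test)
```

The results looked like: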
| Model | Accuracy | Comment |
|-------|----------|---------|
| LinearSVC/OneVsRest/SVD | 86.0% | |
| NaiveBayes | 85.1% | Very fast (1 minute or less) |
| MultilayerPerceptronClassifier | 80.6% | Took about 30 minutes with layer sizes (262144, 100, 80, 20); 262144 is the (uncompressed) TF-IDF vector size |
| RandomForestClassifier | 79.6% | numTrees=190, maxDepth=20, after SVD with n=400 |
| LogisticRegression | 76.1% | SVD with n=400 |
| DeepLearning4J on Spark | 72.9% | After 227 epochs taking 6 hours on 15 executors with 1 core each |
| RandomForestClassifier | 53.7% | numTrees=190, maxDepth=30 |
| RandomForestClassifier | 48% | numTrees=1000 |
| GBTRegressor | - | "Note: GBTs do not yet support multiclass classification" |
| LinearSVC | - | "LinearSVC only supports binary classification." |
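Since LinearSVC only supports binary classification (hence the last row), the top-scoring entry wraps it in OneVsRest, which trains one binary SVM per class. A minimal sketch, with illustrative parameter values:

```scala
import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}

// One binary LinearSVC per category; maxIter and regParam are illustrative.
val svc = new LinearSVC().setMaxIter(100).setRegParam(0.01)

val ovr = new OneVsRest()
  .setClassifier(svc)
  .setFeaturesCol("features")   // e.g. the TF-IDF (or SVD-reduced) vectors
  .setLabelCol("label")

val model       = ovr.fit(trainFeatures)
val predictions = model.transform(testFeatures)
```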
The table is a little unfair as I spent a disproportionate amount of time tuning some models over others. And, as ever, YMMV (your mileage may vary): these are the results for my data; yours will probably look very different.
Interestingly, creating a hand-made ensemble of NaiveBayes, MultilayerPerceptronClassifier and RandomForestClassifier didn't improve matters: training these three models on the same data and letting them vote on the test data gave an accuracy of 81.0%.
Finally, two algorithms that I've mentioned before were not part of this work, but I'll include them for completeness:
| Model | Accuracy |
|-------|----------|
| Palladian | 89% |
| Tensorflow | 87% |
Ensemble Cast
So, taking five Spark models (LinearSVC, NaiveBayes, MultilayerPerceptron, RandomForestClassifier and LogisticRegression), we can weave their result DataFrames together using joinWith and map, and let them vote on which category a given subject should be in.
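A minimal sketch of that weaving, assuming each model's output has already been boiled down to (id, label) pairs (the `Pred` case class and `vote` helper are hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical shape of each model's output: a document id and its predicted label.
case class Pred(id: Long, label: Double)

def vote(predictions: Seq[Dataset[Pred]])(implicit spark: SparkSession): Dataset[Pred] = {
  import spark.implicits._

  // joinWith stitches the per-model predictions together on the shared id,
  // and map accumulates each document's labels into one sequence.
  val stacked = predictions
    .map(_.map(p => (p.id, Seq(p.label))))
    .reduce { (left, right) =>
      left.joinWith(right, left("_1") === right("_1"))
          .map { case ((id, labels), (_, more)) => (id, labels ++ more) }
    }

  // The most common label wins; ties are broken arbitrarily.
  stacked.map { case (id, labels) =>
    Pred(id, labels.groupBy(identity).maxBy(_._2.size)._1)
  }
}
```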
Unfortunately, this roll-your-own voting ensemble did not provide significantly better results: the overall accuracy of 86.029% was a mere 0.022% better than the best standalone model.