When it comes to classifying documents, there are many ways to skin a cat. The use case at hand is to classify small amounts of text into 21 pre-defined categories. Since this data is private, for this post I'll use the 20 Newsgroups data. This contains some 18k documents that have already been categorised and each document contains a subject and a body.
First we need to process the data. Spark has some great out-of-the box transformers that can tokenize (Tokenizer) the words, remove stop words (StopWordRemover), create n-grams (NGram), generate the TF-IDF vectors (HashingTF and IDF) and finally normalize them (Normalizer).
Now, a natural choice for then categorising these vectors might be KMeans since we know exactly how many clusters we are looking for. However, I had appalling results - over 99% of the vectors all congregated in one cluster while the other 1% was spread over the others.
This had me scuttling back to the code that created my features and checking everything was OK (punctuation removed, all lower case, etc). This Stackoverflow answer had a nice list of what to check. Unfortunately, it didn't help.
Thinking about it, this was not too surprising. KMeans measures the distance between points. That is calculated as:
Σk(vik - vjk)p
where k is the k-th element of the vectors that represent document i and j. The value p is set to 2 if you want Euclidean distances but that might not always be desirable.
Consequently, “k-means, however, may be a bad fit. Because the means computed will not have a realistic sparsity, but will be much more dense” (from Stackoverflow).
So, an alternative was to use Spark's NaiveBayes implementation. The results here were immediately decent - about 85% success in classifying whole documents and 71% success for just classifying them based on their subject text.
The best explanation of this algorithm I found was here on Stanford University's website. The probability of document d being in category c is:
p(d|c) ∝ p(c) ∏k p(tk|c)
where p(tk|c) is the probability of the k-th term of the bag of words of document d being in category c. This can be calculated from a histogram of words over categories.
The term p(c) is our Bayesian prior and is easily calculated from the distribution of documents over the categories.
We then choose the most likely - that is, highest value of p(d|c) - for all categories.
A non-sparse solution might be to use Spark's Word2Vec implementation. However, since the vector elements can be negative, you can't use NaiveBayes.