We've played around with machine learning models to automatically categorize customer transactions. However, the accuracy of a model is irrelevant if it can't be taken to production. "One of the main reasons most [Data Science] tutorials won’t go beyond this step is because — productionisation is hard" (from Medium).
There are solutions - for instance Open Scoring, which also appears to give you a breakdown of the probabilities for each possible classification. This is great for us because if our top prediction isn't spot on, we want to offer the user the next best suggestion.
Open Scoring lets you drop in your PMML and you're good to go (although note "PMML is a great option if you choose to stick with the standard models and transformations. But if you’re interested in more, don’t worry there are other options" - Medium).
The old, RDD-based MLlib models can be exported to PMML easily (model.toPMML). The newer DataFrame-based models, however, need the JPMML-SparkML library, and this may clash with Spark's own dependencies, as both depend on JPMML. This appears to be fixed in Spark 2.3. See this Spark issue.
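To make the first point concrete, here's a minimal sketch of the old-style export (my own illustration, not code from the project). Note that only models mixing in PMMLExportable - KMeansModel, LinearRegressionModel and friends - offer toPMML, and as far as I can tell NaiveBayesModel is not among them, which is another reason the JPMML route matters:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class PmmlExportSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("pmml-export").setMaster("local[*]"));

        // Toy data: the old API works on RDDs of mllib Vectors.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
                Vectors.dense(8.0, 9.0), Vectors.dense(9.0, 8.0)));

        KMeansModel model = KMeans.train(points.rdd(), 2, 20);

        // One call and the PMML document is on disk, ready to serve.
        model.toPMML("/tmp/kmeans.pmml");

        sc.stop();
    }
}
```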
However, the IT department imposed another restriction: they don't have the resources to maintain Open Scoring servers. Instead, they want a Java library they can package with their product. Oh yes, and they're running out of JVM memory so it can't have too large a footprint, please.
So, one option might be H2O, which saves the model as an actual piece of Java code (a POJO) that you can compile!
But since we were getting good results with Naive Bayes (both in Spark and in scikit-learn), we decided to code our own. Any decent developer can write it in quite a small number of lines.
The truth of the matter is that if you have lots of data and you're not smashing Kaggle, Naive Bayes is generally good enough for most purposes. Working on the principle of a Minimum Viable Product, development went something like this:
- Create the basic NB algorithm (accuracy: 81%). A toy version appears in the first sketch after this list.
- Add smoothing, with alpha = 1e-9 (accuracy: 85%).
- Use the Kullback-Leibler divergence between the word distributions of the correctly and incorrectly classified data to see which features are dissimilar (second sketch below). Eyeballing the largest discrepancies led us to conclude that 3-digit numbers confused matters and that ignoring them completely improved things (accuracy: 89%).
- Try n-grams (with n = 2, i.e. bigrams; accuracy: 91%).
- Try different combinations of features while using bigrams (accuracy: 96%). Note that the best combination of features is not obvious as we're not really processing natural language; we found it by brute force.
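To give a flavour of how little code this takes, here's a toy version that rolls the basic algorithm, the smoothing, the 3-digit filter and the bigrams into a single class. The names and structure are illustrative only - not our production code:

```java
import java.util.*;
import java.util.regex.Pattern;

public class TinyNaiveBayes {
    private static final Pattern THREE_DIGITS = Pattern.compile("\\d{3}");

    private final double alpha;  // smoothing constant, e.g. 1e-9
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    private final Map<String, Integer> totalFeatures = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    public TinyNaiveBayes(double alpha) { this.alpha = alpha; }

    // Unigrams plus bigrams, with the troublesome 3-digit numbers dropped.
    public static List<String> features(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!THREE_DIGITS.matcher(w).matches()) words.add(w);
        }
        List<String> feats = new ArrayList<>(words);
        for (int i = 0; i + 1 < words.size(); i++) {
            feats.add(words.get(i) + "_" + words.get(i + 1));  // bigram
        }
        return feats;
    }

    public void train(String text, String label) {
        classCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                featureCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String f : features(text)) {
            counts.merge(f, 1, Integer::sum);
            totalFeatures.merge(label, 1, Integer::sum);
            vocabulary.add(f);
        }
    }

    // Additively smoothed log P(feature | class).
    public double logLikelihood(String feature, String label) {
        int count = featureCounts.getOrDefault(label, Map.of())
                                 .getOrDefault(feature, 0);
        int total = totalFeatures.getOrDefault(label, 0);
        return Math.log((count + alpha) / (total + alpha * vocabulary.size()));
    }

    // Log-posterior (up to a shared constant) for every class, best first.
    public List<Map.Entry<String, Double>> scores(String text) {
        int n = classCounts.values().stream().mapToInt(Integer::intValue).sum();
        List<Map.Entry<String, Double>> result = new ArrayList<>();
        for (String label : classCounts.keySet()) {
            double score = Math.log((double) classCounts.get(label) / n);
            for (String f : features(text)) score += logLikelihood(f, label);
            result.add(Map.entry(label, score));
        }
        result.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return result;
    }
}
```

The sorted output of scores(...) is also exactly what lets us offer the user the next-best category when the top prediction is wrong.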
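The Kullback-Leibler step looked roughly like this (again, hypothetical names): build a smoothed word distribution for each population, then rank the words by the size of their contribution to D_KL(P||Q):

```java
import java.util.*;

public class KlCheck {
    // Turn raw word counts into a smoothed distribution over a shared vocabulary.
    static Map<String, Double> distribution(Map<String, Integer> counts,
                                            Set<String> vocab) {
        double eps = 1e-12;  // tiny floor so every KL term is defined
        double total = counts.values().stream().mapToInt(Integer::intValue).sum()
                + eps * vocab.size();
        Map<String, Double> p = new HashMap<>();
        for (String w : vocab) p.put(w, (counts.getOrDefault(w, 0) + eps) / total);
        return p;
    }

    // Per-word terms of D_KL(P || Q), biggest discrepancies first. Feeding in
    // word counts from the correctly vs incorrectly classified transactions is
    // what surfaced the 3-digit numbers mentioned in the list above.
    static List<Map.Entry<String, Double>> klContributions(
            Map<String, Integer> correct, Map<String, Integer> wrong) {
        Set<String> vocab = new HashSet<>(correct.keySet());
        vocab.addAll(wrong.keySet());
        Map<String, Double> p = distribution(correct, vocab);
        Map<String, Double> q = distribution(wrong, vocab);
        List<Map.Entry<String, Double>> terms = new ArrayList<>();
        for (String w : vocab) {
            terms.add(Map.entry(w, p.get(w) * Math.log(p.get(w) / q.get(w))));
        }
        terms.sort((a, b) ->
                Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())));
        return terms;
    }
}
```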
The nice thing about building a simple machine learning MVP is that you become familiar with the data very quickly. In my case, what became clear was that a lot of transactions were being flagged as misclassifications when the model wasn't really wrong. This was because the categories were not mutually exclusive (for example: is car tax to be classified as CAR or TAXES? It depends on the user).
Also, a simple model (like Naive Bayes) makes it easy to question the data ("why did you classify this transaction as X???") by simply looking at the likelihood vector for each word. Try doing that with a neural net...
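With the hypothetical TinyNaiveBayes sketch from earlier, that interrogation is a handful of lines - print the smoothed log-likelihood of each feature under the competing classes and see which words decided the call:

```java
public class WhyDidYouSayThat {
    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes(1e-9);
        nb.train("car tax renewal dvla", "TAXES");
        nb.train("car insurance premium", "CAR");
        nb.train("council tax march", "TAXES");

        // Features with the largest gap between the two columns are the ones
        // that dragged the transaction towards one class or the other.
        for (String f : TinyNaiveBayes.features("car tax")) {
            System.out.printf("%-12s CAR=%9.2f TAXES=%9.2f%n",
                    f, nb.logLikelihood(f, "CAR"), nb.logLikelihood(f, "TAXES"));
        }
    }
}
```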
[Edit: there is a good discussion of how other people tackle the problem of productionisation on Reddit].