Wednesday, September 26, 2018

Productionisation


We've played around with machine learning models to automatically categorize customer transactions. However, the accuracy of a model is irrelevant if it can't be taken to production. "One of the main reasons most [Data Science] tutorials won’t go beyond this step is because — productionisation is hard" (from Medium).

There are solutions - for instance, Open Scoring, which also appears to give you the breakdown of the probabilities for a particular classification. This is great for us because, if we don't get the prediction spot on, we want to give the user another suggestion.

Open Scoring offers the chance to drop in your PMML and you're good to go (although note "PMML is a great option if you choose to stick with the standard models and transformations. But if you’re interested in more, don’t worry there are other options" - Medium).

The old MLlib (RDD-based) models could be exported to PMML easily (model.toPMML). But the new ML pipeline models need the JPMML-SparkML library. However, this may clash with Spark's own dependencies as both depend on JPMML. This appears to be fixed in Spark 2.3. See this Spark issue.
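
As a reminder of how simple the old API is, here is a minimal sketch using KMeans, one of the RDD-based models that implements PMMLExportable (the toy data and the output path are just examples):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// a toy data set; train an old-style, RDD-based model
val vectors  = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val clusters = KMeans.train(vectors, k = 2, maxIterations = 10)

// write the model out as a PMML file
clusters.toPMML("/tmp/kmeans.pmml")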

However, the IT department imposed another restriction: they don't have the resources to maintain Open Scoring servers. Instead, they want a Java library they can package with their product. Oh yes, and they're running out of JVM memory so it can't have too large a footprint, please.

So, one option might be H2O, which saves the model as an actual piece of Java code that you can compile!

But since we were getting good results with Naive Bayes (both in Spark and in scikit-learn), we decided to code our own. Any decent developer can write it in quite a small number of lines.
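
To give a feel for just how small, here is a rough sketch of a multinomial Naive Bayes in plain Scala. This is not our production code - the names are invented and real tokenisation is messier - but the shape is right (the alpha parameter anticipates the smoothing in step 2 below):

object TinyNaiveBayes {

  type Category = String
  type Token    = String

  case class Model(priors:      Map[Category, Double],
                   tokenCounts: Map[Category, Map[Token, Long]],
                   totals:      Map[Category, Long],
                   vocabSize:   Int,
                   alpha:       Double)

  // training is just counting: class priors plus per-class token counts
  def train(docs: Seq[(Category, Seq[Token])], alpha: Double): Model = {
    val byCat  = docs.groupBy(_._1)
    val priors = byCat.mapValues(_.size.toDouble / docs.size).toMap
    val counts = byCat.mapValues(_.flatMap(_._2).groupBy(identity).mapValues(_.size.toLong).toMap).toMap
    val totals = counts.mapValues(_.values.sum).toMap
    Model(priors, counts, totals, docs.flatMap(_._2).distinct.size, alpha)
  }

  // log P(token | category) with Laplace smoothing
  def logLikelihood(m: Model, cat: Category, token: Token): Double = {
    val count = m.tokenCounts(cat).getOrElse(token, 0L).toDouble
    math.log((count + m.alpha) / (m.totals(cat) + m.alpha * m.vocabSize))
  }

  // pick the category maximising log(prior) + sum of per-token log-likelihoods
  def classify(m: Model, tokens: Seq[Token]): Category =
    m.priors.keys.maxBy(cat => math.log(m.priors(cat)) + tokens.map(logLikelihood(m, cat, _)).sum)
}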

The truth of the matter is that if you have lots of data and you're not smashing Kaggle, Naive Bayes is generally good enough for most purposes. Working on the principle of a Minimum Viable Product, development went something like this:
  1. Create the basic NB algorithm (accuracy: 81%)
  2. Add smoothing (with a value for alpha of 1E-9, accuracy: 85%)
  3. Use Kullback-Leibler divergence between the correctly and incorrectly classified data to see which features are dissimilar. Eyeballing the largest discrepancies led us to conclude that 3-digit numbers confused matters and that ignoring them completely improved things (accuracy: 89%).
  4. Trying n-grams (with n=2, accuracy: 91%; see the bigram sketch after this list).
  5. Trying different combinations of features while using bigrams (accuracy: 96%). Note that the best combination of features is not obvious as we're not really processing natural language. We found it by brute force.
with the IT team re-deploying our library with each iteration.
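
The bigram generation in step 4 (and the 3-digit filter from step 3) amounts to very little code. A sketch, with a much simpler tokeniser than anything you'd use on real transaction descriptions (the example string is invented):

// step 3: drop 3-digit tokens; step 4: join adjacent tokens into bigrams
def bigrams(description: String): Seq[String] = {
  val tokens = description.toLowerCase.split("\\s+").filterNot(_.matches("\\d{3}")).toSeq
  tokens.sliding(2).filter(_.length == 2).map(_.mkString(" ")).toList
}

bigrams("TESCO STORES 123 LONDON")   // List("tesco stores", "stores london")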

The nice thing about building a simple Machine Learning MVP is that you become familiar with the data very quickly. In my case, what became clear was that a lot of transactions were being flagged as misclassified when the real problem was that the categories were not mutually exclusive (for example: is car tax to be classified as CAR or TAXES? It depends on the user).

Also, a simple model (like Naive Bayes) makes it easy to question the data ("why did you classify this transaction as X???") by simply looking at the likelihood vector for each word. Try doing that with a neural net...
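
For example, with the hypothetical TinyNaiveBayes sketch above, interrogating a prediction is just a matter of printing each token's contribution:

// dump each token's log-likelihood under every category to see which words drove the decision
def explain(m: TinyNaiveBayes.Model, tokens: Seq[String]): Unit =
  for (t <- tokens; cat <- m.priors.keys)
    println(f"$t%-20s $cat%-10s ${TinyNaiveBayes.logLikelihood(m, cat, t)}%.3f")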

[Edit: there is a good discussion of how other people tackle the problem of productionisation at Reddit].


Wednesday, September 19, 2018

Spark and Buckets


Unbalanced Partitions

Deliberately trying to get Spark to have an unbalanced partition

val n  = 10000000
val df = spark.range(1, n + 1).toDF("id")          // ids 1 to n
df.groupByKey(_.getLong(0) / n).count().show()     // nearly every id maps to key 0

seems fine but

import org.apache.spark.sql.functions.max

val allKeysIn1Partition = df.map(r => 1 -> (r.getLong(0), "A" * 10000))   // every row gets key 1
val repartitioned       = allKeysIn1Partition.repartition($"_1")
val pSizes              = repartitioned.mapPartitions { x => Seq(x.size).iterator }
pSizes.agg(max($"value")).show()

shows rapid progress for all but the last partition and, indeed, this query shows that all the data lives in a single partition.

Buckets

"Buckets can help with the predicate pushdown since every value belonging to one value will end up in one bucket. So if you bucket by 31 days and filter for one day Hive will be able to more or less disregard 30 buckets. Obviously this doesn't need to be good since you often WANT parallel execution like aggregations.

"So to summarize buckets are a bit of an older concept and I wouldn't use them unless I have a clear case for it. The join argument is not that applicable anymore..." (from here).

Note that you can't use bucketing for a simple save. It must be a saveAsTable call (see Laskowski's Mastering Spark SQL for more information), which is for Hive interoperability. Otherwise, you'll get a "'save' does not support bucketBy and sortBy right now" error.
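
A sketch of the difference (assuming a DataFrame df with a day column; the table name is invented):

// this throws: "'save' does not support bucketBy and sortBy right now"
df.write.bucketBy(31, "day").sortBy("day").save("/tmp/bucketed")

// whereas writing to a table works - the bucketing metadata lives in the metastore
df.write.bucketBy(31, "day").sortBy("day").saveAsTable("transactions_bucketed")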

Partitioning vs. Bucketing

"Bucketing is another technique for decomposing data sets into more manageable parts" (from here). In Hive, for example, "suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket."

In Spark, "partitionBy ... has limited applicability to columns with high cardinality. In contrast bucketBy distributes data across a fixed number of buckets and can be used when a number of unique values is unbounded" (from the Spark documentation).
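
Putting the two together, a hedged sketch of the rule of thumb (the column and table names are invented): partition by the low-cardinality column and bucket by the high-cardinality one.

df.write
  .partitionBy("date")              // low cardinality: one directory per date
  .bucketBy(16, "employee_id")      // high cardinality: a fixed number of buckets per partition
  .sortBy("employee_id")
  .saveAsTable("employees")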

Saturday, September 1, 2018

Extrapolation of an ML model


Introduction

We have a data set that we need to classify. About 1% has already been classified by our users and our model looks pretty decent when splitting this into testing and training subsets. However, how can we be sure that it will work on the other 99%?

Approaches

One approach might be to see what it is that makes a model assign a particular class to an object. There is a technique for this called Local Interpretable Model-agnostic Explanations (LIME). The authors' original paper makes a very good point that accuracy during the testing/training phase can be misleading. "There are several ways a model or its evaluation can go wrong. Data leakage, for example, defined as the unintentional leakage of signal into the training (and validation) data that would not appear when deployed, potentially increases accuracy" erroneously, they say. They demonstrate this danger by using the 20 Newsgroups data set and observing classifications that were correct but for the wrong reason.

We will probably return to LIME but, for this job, we chose Kullback-Leibler divergence as it was very easy to implement efficiently in Apache Spark.

Kullback-Leibler Divergence

What is it? Well, we want to know how well our 1% data set approximates the other 99%. "It is no accident that Cover & Thomas say that KL-Divergence (or "relative entropy") measures the inefficiency caused by the approximation" (from SO).

KL has a simple equation:

D(P || Q) = Σi pi ln(pi / qi)

where pi is the probability of element i in data set P and qi is the probability of element i in data set Q. You'll quickly notice that it is not a metric as it satisfies neither the triangle inequality nor symmetry. (The probability that the two distributions mimic each other is given by user Did in this SO post.)
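
A quick worked example of the asymmetry (my own numbers, not from our data): take P = (0.5, 0.5) and Q = (0.9, 0.1). Then

D(P || Q) = 0.5 ln(0.5/0.9) + 0.5 ln(0.5/0.1) ≈ 0.51
D(Q || P) = 0.9 ln(0.9/0.5) + 0.1 ln(0.1/0.5) ≈ 0.37

so the two directions clearly disagree.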

You might also quickly see that this equation blows up if there is an element in P that's not in Q or vice-versa. What to do?

You might try "smoothing" by replacing values of 0 with a "very small value" (see here). Just how small this value should be can be quantified here.
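
A sketch of how this might look in Scala - not our actual Spark job, just the formula over word-probability maps, with the "very small value" standing in for missing elements:

// D(P || Q) over word-probability maps; missing or zero q values are replaced by epsilon
def klDivergence(p: Map[String, Double], q: Map[String, Double], epsilon: Double = 1e-10): Double =
  p.collect { case (word, pi) if pi > 0 =>
    pi * math.log(pi / math.max(q.getOrElse(word, 0.0), epsilon))
  }.sum

val p = Map("coffee" -> 0.5, "tax" -> 0.5)
val q = Map("coffee" -> 0.9, "tax" -> 0.1)
klDivergence(p, q)   // ~0.51, matching the worked example above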

[Aside: this SO post mentions "the number of atoms in the discrete distribution", which was a term I was not familiar with and prompted me to ask my own question on SO. In the context of a normal word count, an atom would be a word. But in other contexts we might note that a word is not atomic. For instance, the English word "other" has the English word "the" inside it. What counts as an atom appears to be use-case dependent. I also liked this SO post, which compares the support of a distribution (ie, where P(X) > 0) to the set of atoms. Its example looks contrived but is quite neat as it's saying the rational numbers make up the set of atoms but the irrational numbers in [0,1] make up the support.]

Or, you might ignore items for which q or p is zero, that is "employ heuristics throwing out all the values that do not make sense... While I acknowledge that [this] is a convention, it doesn't really fit with the nature of these distributions... the invalidity of the formula in the presence of zeros isn't just some unfortunate hack, it is a deep issue intimately tied to how these distributions behave."

Finally, you might solve the problem by using a slightly different variant of KL like Jensen-Shannon divergence. This is simply "the sum of the KL-divergences between the two distributions and their mean distribution" (Reddit).
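
Building on the klDivergence sketch above, Jensen-Shannon is only a few more lines (note that some definitions take the sum of the two KL terms, others the average; the sketch below averages):

// KL of each distribution against their pointwise mean
def jensenShannon(p: Map[String, Double], q: Map[String, Double]): Double = {
  val keys = p.keySet ++ q.keySet
  val m    = keys.map(k => k -> (p.getOrElse(k, 0.0) + q.getOrElse(k, 0.0)) / 2).toMap
  (klDivergence(p, m) + klDivergence(q, m)) / 2
}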

Having coded the algorithm, I then ran it on our data. The methodology we chose was to take 10 samples of 1% from the 99% that is unclassified and calculate their divergence scores against the remainder. Then, the human-classified 1% was compared to this remainder. Interestingly (and annoyingly), the samples had an average Jensen-Shannon value of 0.02 (with a standard deviation of about 10^-5) while the score for the human-classified data was 0.12. So, quite a difference.

The differences seem to stem not from the data missing from each data set but from the differences in the probabilities of the data they share. Quite why requires further investigation.