Introduction
We have a data set that we need to classify. About 1% of it has already been classified by our users, and our model looks pretty decent when we split that labelled slice into training and testing subsets. However, how can we be sure it will work on the other 99%?
Approaches
One approach might be to look at what makes a model decide which class to assign to an object. There is a technique for this called Local Interpretable Model-agnostic Explanations (LIME). The authors' original paper makes a very good point that accuracy during the training/testing phase can be misleading. "There are several ways a model or its evaluation can go wrong. Data leakage, for example, defined as the unintentional leakage of signal into the training (and validation) data that would not appear when deployed, potentially increases accuracy" erroneously, they say. They demonstrate this danger using the 20 Newsgroups data set, observing classifications that were correct but for the wrong reason.
We will probably return to LIME, but for this job we chose Kullback-Leibler divergence as it was very easy to implement efficiently in Apache Spark.
Kullback-Leibler Divergence
What is it? Well, we want to know how well our 1% data set approximates the other 99%. "It is no accident that Cover & Thomas say that KL-Divergence (or "relative entropy") measures the inefficiency caused by the approximation" (from SO).
KL has a simple equation:
D(P ‖ Q) = Σ_i p_i ln(p_i / q_i)
where p_i is the probability of element i in data set P and q_i is the probability of element i in data set Q. You'll quickly notice that it is not a metric, as it satisfies neither the triangle inequality nor symmetry. (The probability that the two distributions mimic each other is given by user Did in this SO post.)
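To make the formula concrete, here is a minimal Scala sketch of it (not our actual Spark job), assuming each distribution is a Map from element to probability that sums to 1:

```scala
// KL-divergence D(P ‖ Q) over two discrete distributions held in memory.
// Terms with p_i = 0 are taken to contribute nothing (the usual convention);
// a zero q_i against a non-zero p_i yields positive infinity, which is the
// blow-up discussed below.
def klDivergence(p: Map[String, Double], q: Map[String, Double]): Double =
  p.foldLeft(0.0) { case (acc, (element, pi)) =>
    if (pi == 0.0) acc
    else {
      val qi = q.getOrElse(element, 0.0)
      if (qi == 0.0) Double.PositiveInfinity
      else acc + pi * math.log(pi / qi)
    }
  }
```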
You might also quickly see that this equation blows up if there is an element in P that's not in Q (the q_i in the denominator is zero), and it needs care in the opposite direction too. What to do?
You might try "smoothing": replacing values of 0 with a "very small value" (see here). How large this value should be can be quantified here.
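A sketch of that smoothing idea, assuming we first take the union of both vocabularies and then replace every zero probability with a small epsilon before renormalising (the epsilon below is an arbitrary placeholder, not a recommendation; the links above discuss how to choose it):

```scala
// Pad a distribution so that every element of the shared vocabulary has a
// non-zero probability, then renormalise so the values still sum to 1.
def smooth(p: Map[String, Double],
           vocabulary: Set[String],
           epsilon: Double = 1e-9): Map[String, Double] = {
  val padded = vocabulary.map { k =>
    val prob = p.getOrElse(k, 0.0)
    k -> (if (prob == 0.0) epsilon else prob)
  }.toMap
  val total = padded.values.sum
  padded.map { case (k, v) => k -> v / total } // renormalise
}
```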
[Aside: this SO post mentions "the number of atoms in the discrete distribution", which was a term I was not familiar with and which prompted me to ask my own question on SO. In the context of a normal word count, an atom would be a word. But in other contexts we might note that a word is not atomic. For instance, the English word "other" has the English word "the" inside it. What counts as an atom appears to be use-case dependent. I also liked this SO post, which compares the support of a distribution (i.e., where P(X) > 0) to the set of atoms. Its example looks contrived but is quite neat, as it's saying the rational numbers make up the set of atoms while the irrational numbers in [0,1] make up the support.]
Or you might ignore items for which q or p is zero, that is, "employ heuristics throwing out all the values that do not make sense... While I acknowledge that [this] is a convention, it doesn't really fit with the nature of these distributions... the invalidity of the formula in the presence of zeros isn't just some unfortunate hack, it is a deep issue intimately tied to how these distributions behave."
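For comparison, a sketch of that "throw out the zeros" heuristic (with the caveat from the quote that this is a convention rather than a principled fix):

```scala
// Only elements with non-zero probability in both distributions contribute
// to the sum; everything else is silently dropped.
def klIgnoringZeros(p: Map[String, Double], q: Map[String, Double]): Double =
  p.collect {
    case (element, pi) if pi > 0.0 && q.getOrElse(element, 0.0) > 0.0 =>
      pi * math.log(pi / q(element))
  }.sum
```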
Finally, you might sidestep the problem by using a closely related variant of KL such as the Jensen-Shannon divergence. This is simply "the sum of the KL-divergences between the two distributions and their mean distribution" (Reddit).
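A sketch of Jensen-Shannon built on the klDivergence function above: form the mean distribution M = (P + Q) / 2 and average the two KL-divergences against it (some definitions, like the Reddit quote, use the sum rather than the average; the two differ only by a constant factor). Because m_i is non-zero wherever p_i or q_i is, the result stays finite:

```scala
// Jensen-Shannon divergence: symmetric and always finite.
def jensenShannon(p: Map[String, Double], q: Map[String, Double]): Double = {
  val keys = p.keySet ++ q.keySet
  val m = keys.map { k =>
    k -> (p.getOrElse(k, 0.0) + q.getOrElse(k, 0.0)) / 2.0
  }.toMap
  0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m)
}
```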
Having coded the algorithm, I ran it on our data. The methodology we chose was to take 10 samples of 1% from the unclassified 99% and calculate their divergence from the remainder. Then the human-classified 1% was compared to that remainder in the same way. Interestingly (and annoyingly), the samples had an average Jensen-Shannon value of 0.02 (with a standard deviation of about 10^-5) while the score for the human-classified data was 0.12. So, quite a difference.
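For reference, a rough sketch of that sampling step in Spark. The DataFrame name `unclassified` and its single `term` column are placeholders rather than our real schema, and collecting the counts to the driver assumes the vocabulary fits in memory:

```scala
import org.apache.spark.sql.DataFrame

// Turn raw term counts into a probability distribution for jensenShannon above.
def distribution(df: DataFrame): Map[String, Double] = {
  val counts = df.groupBy("term").count()
    .collect()
    .map(row => row.getString(0) -> row.getLong(1).toDouble)
    .toMap
  val total = counts.values.sum
  counts.map { case (term, c) => term -> c / total }
}

// Repeatedly split off roughly 1% of the unclassified data and measure how
// far it diverges from the remaining 99%.
def sampleDivergences(unclassified: DataFrame, runs: Int = 10): Seq[Double] =
  (1 to runs).map { i =>
    val Array(sample, remainder) =
      unclassified.randomSplit(Array(0.01, 0.99), seed = i.toLong)
    jensenShannon(distribution(sample), distribution(remainder))
  }
```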
The differences seem to stem not from the elements missing from each data set but from the differences in probabilities for the elements they share. Quite why requires further investigation.