It's no secret that
most of data science is actually data munging. Indeed, since we were only using a Naive Bayes model, there was only one hyperparameter: that for Laplacian smoothing. Tuning it did indeed make our model better, but that was a fairly trivial exercise. It was the feature engineering that made the greatest difference.
Most online machine learning tutorials focus on the model, not feature engineering (the only book on the subject that I am aware of is being written here). Perhaps that's because it is an art as much as a science. Well, these are some notes I've made over the last four months while trying to bring a bank transaction classification system to market. YMMV.
Data cleaning and domain knowledge
Some machine learning tool vendors say their products can do this for you auto-magically, but the practitioners I meet are skeptical. For instance, a colleague was working on a security system for a mobile banking app. If the phone had been compromised, some software by an Israeli company would flag the transaction as suspect. Yet, in all his training data, not a single transaction was marked as such. Automated machine learning tools would have deemed this field useless and discarded it, yet it was incredibly important.
My client for this, my most recent, project was an Austrian bank. Now, there were some fairly trivial things we could do when feature engineering for German textual data. Replacing all German characters with Latin characters improved things a little, as did exploiting the fact that Germans often avoid umlauts when online, replacing, for instance, "ö" with "oe" (the national rail company is ÖBB but their website is www.oebb.at). So cleaning the data this way gave a slight improvement. But some things were not so straightforward.
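The straightforward part - normalising the German characters - amounted to little more than this (a sketch only; the mapping table and names are mine, not the production code):

```scala
// A sketch of the umlaut/eszett normalisation described above.
// The object and method names are illustrative, not the real library's API.
object GermanText {
  private val replacements = Seq(
    "Ä" -> "Ae", "Ö" -> "Oe", "Ü" -> "Ue",
    "ä" -> "ae", "ö" -> "oe", "ü" -> "ue",
    "ß" -> "ss"
  )

  def normalise(text: String): String =
    replacements.foldLeft(text) { case (acc, (from, to)) => acc.replace(from, to) }
}

// e.g. GermanText.normalise("ÖBB München") == "OeBB Muenchen"
```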
The data
Our data was not much more than what you typically see on your bank statements. That is, some natural language and some other miscellaneous data. This other data included things you won't see on your statements, like "POS K2 14/02/18 11:55", which says it's a point-of-sale transaction (that is, you were in a store) at cashier desk K2 at a given date/time.
Our first instinct was to just treat the transactions as a corpus of text and, for each word, calculate its distribution over all the categories. But why stick with unigrams?
Are bigrams better?
Maybe. They improved our bank transaction classification accuracy by about 4%. There is no a priori reason why bigrams should work better for us. For the most part, we're not dealing with natural language, and there is no reason why "BillaK2" makes more sense than "K2Billa" (order was important - see below).
It did make sense in some fields - for instance, the recipient. "Phillip" and "Henry" are common enough names, but "PhillipHenry" is sufficiently rare (in the German-speaking world, at least) to identify a single individual.
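For what it's worth, word-level bigrams are a one-liner to generate; this is just a sketch with made-up names, not our production code:

```scala
// Word-level bigrams from an already-tokenised transaction field.
def bigrams(tokens: Seq[String]): Seq[String] =
  tokens.sliding(2).collect { case Seq(a, b) => a + b }.toSeq

// e.g. bigrams(Seq("Phillip", "Henry")) == Seq("PhillipHenry")
```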
This unexpected boost in accuracy is not totally unheard of. David Marchette tells how he built a model that used a histogram over the bytes (that is, there were 256 buckets) of a computer virus binary to identify which virus it was. It was surprisingly accurate (86% vs. a monkey-score of about 5%). "Computer scientists laugh when I tell this story," he says, because binary executables work on word- rather than byte-sized units of instructions. So, in theory, it shouldn't work. But in practice, it does.
Field order
This made a remarkable difference. Since we were not dealing with natural language, the order of the text fields was arbitrary. But choosing the wrong permutation could reduce accuracy by a massive 10%. The optimal permutation could only be found by brute force.
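With only a handful of fields, that search is feasible. A minimal sketch, assuming a `trainAndScore` function that builds and scores a model on a given corpus (the case class, function and names here are mine):

```scala
// Brute-force search over field orderings. Feasible only because the number of
// fields (and hence permutations) is small.
case class Transaction(fields: Map[String, String], category: String)

def bestFieldOrder(fieldNames: Seq[String],
                   transactions: Seq[Transaction],
                   trainAndScore: Seq[(String, String)] => Double): Seq[String] =
  fieldNames.permutations.maxBy { order =>
    // Concatenate the field values in the candidate order and score a model
    // trained on the resulting (text, category) pairs.
    val corpus = transactions.map(t => (order.flatMap(t.fields.get).mkString(" "), t.category))
    trainAndScore(corpus)
  }
```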
Filtering out noise
Calculating the
Kullback-Leibler divergence between the distribution of words in the successfully and unsuccessfully classified data sets identified words that appeared to be over-represented. Ignoring them improved accuracy by about 2% for unigrams but made no difference to bigrams.
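A sketch of the check (the epsilon smoothing and the names are my assumptions): build the word distribution for each of the two sets and look at each word's contribution to the divergence.

```scala
// Compare the word distribution of the misclassified transactions (p) with
// that of the correctly classified ones (q). Large positive contributions
// point at over-represented words that may be worth ignoring.
def wordDistribution(words: Seq[String]): Map[String, Double] =
  words.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble / words.size }

def klContributions(p: Map[String, Double],
                    q: Map[String, Double],
                    eps: Double = 1e-9): Seq[(String, Double)] =
  p.map { case (w, pw) => w -> pw * math.log(pw / q.getOrElse(w, eps)) }
    .toSeq
    .sortBy(-_._2)   // most over-represented words first
```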
Similarly, looking at the entropy of the distribution of a word over all the categories and eliminating those with low entropy did not improve things.
Simply eliminating small words (fewer than 3 characters) and all strings that were purely numerical boosted accuracy by a few percent for both bigrams and unigrams.
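That filter is about as simple as it sounds (a sketch; the name is mine):

```scala
// Drop short tokens and purely numerical strings before building n-grams.
def filterNoise(tokens: Seq[String]): Seq[String] =
  tokens.filter(t => t.length >= 3 && !t.forall(_.isDigit))

// e.g. filterNoise(Seq("POS", "K2", "Billa", "1155")) == Seq("POS", "Billa")
```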
Another data massaging technique we used was to bucket transaction amounts rather than use the raw figures. However, this raised the question: how best do we organise these buckets? One trick is to use a Random Forest to help you create features. Random Forests don't work well on high-dimensional data (which ours would have been if we had a field per word - we had a million strings), so we trained one on all the features except the words. Its accuracy was, unsurprisingly, an appalling 35%, but Random Forests are easy to read, and I took the amount buckets from its splits and used them in my model. However, they actually reduced accuracy somewhat, so instead I stuck with my (roughly) logarithmic buckets.
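Those buckets were along these lines (the exact boundaries shown here are illustrative, not the production ones):

```scala
// (Roughly) logarithmic bucketing of transaction amounts, used as a
// categorical feature instead of the raw figure.
def amountBucket(amount: Double): String = {
  val abs = math.abs(amount)
  val bucket = if (abs < 1) 0 else math.floor(math.log10(abs)).toInt + 1
  s"AMOUNT_$bucket"   // 0: <1, 1: 1-10, 2: 10-100, 3: 100-1000, ...
}

// e.g. amountBucket(42.50) == "AMOUNT_2" and amountBucket(1200.0) == "AMOUNT_4"
```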
Splitting the data caveat
Randomly selecting transactions for the training and testing sets consistently resulted in about 95% accuracy. Pretty good. However, this hid a significant problem. If, instead, we randomly assigned users to the training and testing sets and partitioned their transactions accordingly (so all of a user's transactions landed on the same side of the split), accuracy dropped to about 66%.
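To make the difference concrete, here is a sketch of the two splitting strategies (the `Txn` case class and names are mine; the real data model was richer):

```scala
import scala.util.Random

case class Txn(userId: String, text: String, category: String)

// Naive split: shuffle individual transactions. A user's habits leak from
// training into test, which is what inflated our accuracy to ~95%.
def splitByTransaction(txns: Seq[Txn], trainFraction: Double, seed: Long = 42): (Seq[Txn], Seq[Txn]) = {
  val shuffled = new Random(seed).shuffle(txns)
  shuffled.splitAt((shuffled.size * trainFraction).toInt)
}

// Honest split: hold out whole users, so the test set behaves like brand-new
// customers. This is the split that dropped accuracy to ~66%.
def splitByUser(txns: Seq[Txn], trainFraction: Double, seed: Long = 42): (Seq[Txn], Seq[Txn]) = {
  val users      = new Random(seed).shuffle(txns.map(_.userId).distinct)
  val trainUsers = users.take((users.size * trainFraction).toInt).toSet
  txns.partition(t => trainUsers.contains(t.userId))
}
```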
[Aside: I'm clearly not the first to make a mistake like that. See this data scientist fall into the exact trap when using time series that this data scientist warns against].
This was because a lot of transactions were peculiar to the individual. For instance, if a tenant pays his landlord every month using the text "Mr Phillip Henry" and marks it as rent, only one transaction in the training set may be needed to train the model. However, if this user was not in the training data (and consequently none of his/her transactions) then the model would have a hard time concluding that a payment was rent purely on the text "Mr Phillip Henry".
This was something of a problem for the business, as new users experience low accuracy in transaction categorization. We never solved this problem, and the business had to accept that the model had to learn a user's habits before it could be a great performer.
Future work
Alas, data wrangling never comes to an end. If I had more time, I would have investigated:
- Pointwise mutual information (see
this SO answer for more pointers).
- Using n-grams of characters
- Bagging between different n-gram models
- Creating new features from, e.g., the number of words in the text, average word size, etc.
- Text lemmatization and stemming (the difference being that a word that has been lemmatized is always in the dictionary, whereas one that has been stemmed may not be).
Sadly, we ran out of time. But in 3 months we had managed to build a pure Java/Scala library that could be handed to the front-end team and that gave them good enough accuracy that the management team were happy.