Interesting problem: we have 33 thousand 'areas' in the UK for the purposes of health. People would like to compare one health area with another that has similar demographics etc. We'd like to find these peers automatically using machine learning but nobody really knows what makes a neighbour in feature space an actual neighbour.
This is an unsupervised ML task. We cannot use the go-to clustering algorithm KNN (where a class is elected given k nearest neighbours as calculated with some arbitrary distance metric) as this is a supervised learning algorithm and each vector is its own distinct class. That is, there is no class with more than one data point in it.
So, how do we measure how good our algorithm is? We might like to use hypothesis testing where we compare the 10 nearest neighbours (k=10 for us) from our algorithm against 10 randomly chosen feature vectors (null hypothesis). For both these samples, we follow this process: take, say, 200 other vectors randomly chosen (synthetic data) and for each feature for each of the 200, calculate whether it's within the min/max range of the sample (test statistic). With this we can calculate whether we're getting results better than chance (p-test). So, we have fulfilled all of Downey's criteria for a hypothesis test in "There is still only one test" blog post.
Of course, we would jolly well hope that our results are better than chance but this introduces another problem: when every combination of feature engineering scores 100%, which is the best? We could take the difference between the test statistic for the null hypothesis sample and the actual sample. This is what we did in the end but it is one of any number of metrics.
Another metric we tested was the Pearson correlation statistic in measuring each feature vector in a sample with all the others. If the average score for our actual sample is greater than the null hypothesis then we know we're on the right path. Unfortunately, this proved to be not great. Thinking about it, it's not too surprising since the vectors are not random numbers but the demographics within (roughly) equal sized regions in the same country.
The Algorithm
Having established a test statistic, we then need to choose an algorithm and features.
Using the Freedman Diaconis algorithm to generate optimal bucket sizes from which probabilities could be calculated, I then used the Jensen Shannon to calculate nearest neigbours. However, our data was Zipfian so this meant almost everything fell into the same bucket. With some hacking, I resized the buckets so that there were many around the sharp peak and few at the edges. However, the score from using JS was still not great. I'm told because JS is better for finding outliers, which also jives with my experience.
Instead, the algorithm we used was L2 Euclidean distance. Bearing in mind that Euclidean distance is not a good metric in high dimensions [SO], our test statistic looked good when we applied it to bucketed (!) and scaled data. What's more, it was computationally cheap (remember that to compare all nodes is an O(N2) problem) and easy to explain to the client.
Palantir
For this work, I was asked to use Palantir technology which was pretty easy since it seems to be a thin wrapper around Spark. Qualms aside about the selling at high prices open source technology (with some features actually crippled), some best practices I came to acquire:
- Don't have multiple tabs open on the codebase. Apparently, you can download the code into your favourite IDE but you won't have the data. So, at some point, you'll be coding in the browser. If you have multiple tabs open, Palantir gets confused and overwrites versions. Not a great user experience.
- Create a pyspark file for each combination in the pipeline that is only a couple of lines long which delegates to shared code. This way, it's easier to explore lineage with the Monocle tool. This is actually quite a nice feature of Palantir giving you a graphical representation of your pipelines.