Friday, March 26, 2021

Unsupervised nearest neighbours

Interesting problem: we have 33 thousand 'areas' in the UK for the purposes of health. People would like to compare one health area with others that have similar demographics and so on. We'd like to find these peers automatically using machine learning, but nobody really knows what makes a neighbour in feature space an actual neighbour.

This is an unsupervised ML task. We cannot use the go-to classification algorithm, k-nearest neighbours (KNN), where a class is elected by the k nearest neighbours as calculated with some arbitrary distance metric, because it is a supervised learning algorithm and here each vector is its own distinct class. That is, there is no class with more than one data point in it.

So, how do we measure how good our algorithm is? We can use hypothesis testing, comparing the 10 nearest neighbours from our algorithm (k=10 for us) against 10 randomly chosen feature vectors (the null hypothesis). For both samples we follow the same process: take, say, 200 other randomly chosen vectors (the synthetic data) and, for each feature of each of the 200, calculate whether it falls within the min/max range of the sample (the test statistic). With this we can calculate whether we're getting results better than chance (the p-value). So we have fulfilled all of Downey's criteria for a hypothesis test in his "There is still only one test" blog post.

Of course, we would jolly well hope that our results are better than chance, but this introduces another problem: when every combination of feature engineering scores 100%, which is the best? We could take the difference between the test statistic for the null hypothesis sample and that for the actual sample. This is what we did in the end, but it is only one of any number of possible metrics.
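
Sketched out in (toy) code, the test looks something like this; the data, sample sizes and names are made up for illustration and this is not the project's actual code:

    import scala.util.Random

    object RangeTestSketch {
      type Vec = Array[Double]

      // The test statistic: the fraction of (synthetic vector, feature) pairs that
      // fall inside the per-feature [min, max] range of the sample.
      def testStatistic(sample: Seq[Vec], synthetic: Seq[Vec]): Double = {
        val nFeatures = sample.head.length
        val mins      = (0 until nFeatures).map(i => sample.map(_(i)).min)
        val maxs      = (0 until nFeatures).map(i => sample.map(_(i)).max)
        val hits = for {
          v <- synthetic
          i <- 0 until nFeatures
        } yield if (v(i) >= mins(i) && v(i) <= maxs(i)) 1.0 else 0.0
        hits.sum / hits.length
      }

      def main(args: Array[String]): Unit = {
        val rng = new Random(42)
        def randomVec(): Vec = Array.fill(5)(rng.nextDouble())

        val population = Seq.fill(1000)(randomVec())  // stand-in for the 33k areas
        val synthetic  = Seq.fill(200)(randomVec())   // the 200 randomly chosen vectors

        // The 10 neighbours proposed by some algorithm (here just the first 10)...
        val neighbours = population.take(10)
        // ...versus 10 vectors chosen at random: the null hypothesis.
        val nullSample = rng.shuffle(population).take(10)

        val actualStat = testStatistic(neighbours, synthetic)
        val nullStat   = testStatistic(nullSample, synthetic)

        // The metric we settled on: the difference between the two test statistics.
        println(s"actual = $actualStat, null = $nullStat, difference = ${nullStat - actualStat}")
      }
    }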

Another metric we tested was the Pearson correlation statistic, correlating each feature vector in a sample with all the others. If the average score for our actual sample is greater than that for the null hypothesis sample, we know we're on the right path. Unfortunately, this proved to be not great. Thinking about it, that's not too surprising, since the vectors are not random numbers but the demographics of (roughly) equal-sized regions in the same country.
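
For the record, the Pearson-based metric amounted to something like this (again only a sketch; the real features were demographic columns, not toy arrays):

    object PearsonSketch {
      type Vec = Array[Double]

      // Pearson correlation coefficient between two equal-length vectors.
      def pearson(x: Vec, y: Vec): Double = {
        val n     = x.length
        val meanX = x.sum / n
        val meanY = y.sum / n
        val cov   = (x zip y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
        val sdX   = math.sqrt(x.map(a => math.pow(a - meanX, 2)).sum)
        val sdY   = math.sqrt(y.map(b => math.pow(b - meanY, 2)).sum)
        cov / (sdX * sdY)
      }

      // Average correlation of every vector in the sample with every other one.
      def averagePairwiseCorrelation(sample: Seq[Vec]): Double = {
        val pairs = for {
          (a, i) <- sample.zipWithIndex
          (b, j) <- sample.zipWithIndex if i < j
        } yield pearson(a, b)
        pairs.sum / pairs.size
      }

      def main(args: Array[String]): Unit = {
        val sample = Seq(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.1), Array(1.5, 3.1, 4.4))
        println(f"average pairwise correlation: ${averagePairwiseCorrelation(sample)}%.3f")
      }
    }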

The Algorithm

Having established a test statistic, we then need to choose an algorithm and features.

Using the Freedman-Diaconis rule to generate optimal bucket sizes from which probabilities could be calculated, I then used the Jensen-Shannon divergence to calculate nearest neighbours. However, our data was Zipfian, so almost everything fell into the same bucket. With some hacking, I resized the buckets so that there were many around the sharp peak and few at the edges. However, the score from using JS was still not great. I'm told this is because JS is better at finding outliers, which also jibes with my experience.
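
Roughly, the building blocks looked like the following; this is a from-scratch sketch rather than the production code, and the bucket-resizing hack is omitted:

    object JensenShannonSketch {

      // Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
      def freedmanDiaconisWidth(xs: Seq[Double]): Double = {
        val sorted = xs.sorted
        def quantile(q: Double): Double = sorted(((sorted.size - 1) * q).toInt)
        val iqr = quantile(0.75) - quantile(0.25)
        2 * iqr / math.cbrt(xs.size)
      }

      // Histogram over fixed-width buckets, normalized to a probability distribution.
      def toDistribution(xs: Seq[Double], min: Double, width: Double, nBuckets: Int): Array[Double] = {
        val counts = new Array[Double](nBuckets)
        xs.foreach { x =>
          val bucket = math.max(0, math.min(((x - min) / width).toInt, nBuckets - 1))
          counts(bucket) += 1
        }
        counts.map(_ / xs.size)
      }

      private def klDivergence(p: Array[Double], q: Array[Double]): Double =
        (p zip q).collect { case (pi, qi) if pi > 0 && qi > 0 => pi * math.log(pi / qi) }.sum

      // Jensen-Shannon divergence: a symmetric, finite distance between two distributions.
      def jensenShannon(p: Array[Double], q: Array[Double]): Double = {
        val m = (p zip q).map { case (pi, qi) => (pi + qi) / 2 }
        (klDivergence(p, m) + klDivergence(q, m)) / 2
      }

      def main(args: Array[String]): Unit = {
        val a = Seq(1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 10.0)  // a Zipf-ish feature for one area
        val b = Seq(1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0)   // the same feature for another area
        val all    = a ++ b
        val width  = freedmanDiaconisWidth(all)
        val n      = math.ceil((all.max - all.min) / width).toInt + 1
        val (p, q) = (toDistribution(a, all.min, width, n), toDistribution(b, all.min, width, n))
        println(s"JS divergence: ${jensenShannon(p, q)}")
      }
    }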

Instead, the algorithm we used was plain L2 (Euclidean) distance. Bearing in mind that Euclidean distance is not a good metric in high dimensions [SO], our test statistic looked good when we applied it to bucketed (!) and scaled data. What's more, it was computationally cheap (remember that comparing all nodes is an O(N²) problem) and easy to explain to the client.
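
The winning approach is as simple as it sounds. A sketch, assuming the features have already been bucketed and scaled:

    object EuclideanNeighboursSketch {
      type Vec = Array[Double]

      def euclidean(a: Vec, b: Vec): Double =
        math.sqrt((a zip b).map { case (x, y) => (x - y) * (x - y) }.sum)

      // The k vectors in the population closest to the query vector.
      // Comparing every area with every other is O(N^2) overall, which was
      // acceptable at ~33k areas.
      def nearestNeighbours(query: Vec, population: Seq[Vec], k: Int): Seq[Vec] =
        population.sortBy(euclidean(query, _)).take(k)

      def main(args: Array[String]): Unit = {
        val areas = Seq(Array(0.1, 0.2), Array(0.11, 0.19), Array(0.9, 0.8))
        nearestNeighbours(areas.head, areas.tail, k = 1).foreach(v => println(v.mkString(", ")))
      }
    }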

Palantir

For this work, I was asked to use Palantir technology, which was pretty easy since it seems to be a thin wrapper around Spark. Qualms aside about selling open source technology at high prices (with some features actually crippled), here are some best practices I came to acquire:

  • Don't have multiple tabs open on the codebase. Apparently, you can download the code into your favourite IDE but you won't have the data. So, at some point, you'll be coding in the browser. If you have multiple tabs open, Palantir gets confused and overwrites versions. Not a great user experience.
  • Create a pyspark file for each combination in the pipeline that is only a couple of lines long and delegates to shared code (see the sketch below). This way, it's easier to explore lineage with the Monocle tool. This is actually quite a nice feature of Palantir, giving you a graphical representation of your pipelines.
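
Although the Palantir pipelines were PySpark, the idea is language-agnostic. In Scala/Spark terms it might look like the following (the table and object names are hypothetical):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Shared logic lives in one place...
    object SharedCleansing {
      def dedupeAndScale(df: DataFrame): DataFrame =
        df.dropDuplicates().na.fill(0.0) // stand-in for the real transformations
    }

    // ...and each pipeline step is a tiny entry point that merely delegates,
    // so the lineage tool sees one clean node per step.
    object DemographicsStep {
      def run(spark: SparkSession): DataFrame =
        SharedCleansing.dedupeAndScale(spark.table("demographics")) // hypothetical table name
    }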


Monday, March 1, 2021

Spark Code Submissions

Hurray! My Py/Spark pull request to optimize hyperparameters using randomization has been accepted and merged to branch 3.2!
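
For anyone unfamiliar with the idea, random search just samples hyperparameter combinations instead of exhaustively walking a grid. The following is a hand-rolled sketch of the general technique, not the API that was merged into Spark:

    import scala.util.Random

    object RandomSearchSketch {
      final case class Hyperparams(regParam: Double, elasticNetParam: Double)

      def main(args: Array[String]): Unit = {
        val rng = new Random(0)

        // Sample hyperparameters at random rather than from a fixed grid.
        val candidates = Seq.fill(20) {
          Hyperparams(
            regParam        = math.pow(10, -3 + 3 * rng.nextDouble()), // log-uniform in [0.001, 1]
            elasticNetParam = rng.nextDouble()                         // uniform in [0, 1]
          )
        }

        // Stand-in for training a model and scoring it on a validation set.
        def validationScore(hp: Hyperparams): Double =
          -math.abs(hp.regParam - 0.01) - math.abs(hp.elasticNetParam - 0.5)

        val best = candidates.maxBy(validationScore)
        println(s"best hyperparameters found: $best")
      }
    }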

Here are some things I have learned about maintaining the Spark codebase.

  1. ScalaCheck is used extensively. This is a great little library that I should really use more in my work code. It checks corner cases such as: if your function takes an Int, then how does it handle MaxValue? How about MinValue? (See the sketch after this list.)
  2. Unfortunately (in my opinion) the documentation is generated with Jekyll, which runs on Ruby and required me to install all sorts of weird (to me) software on my system. This StackOverflow post helped me avoid doing all sorts of nasty things to my laptop as root. Documentation is generated with bundle exec jekyll build, which takes the .md files and enhances them, so tools like IntelliJ that give you a markdown viewer are not terribly useful. You have to spend minutes building the site each time you change it, it seems. Compare that to mdoc which, via an SBT plugin, allows you to do things like have the documentation rebuilt the moment you change a file with docs/mdoc --watch.
  3. The style checks are draconian, with even the spaces between imports being significant. I guess this makes a lot of sense in that code reviews are now more automated (I was surprised that it was only really the overworked Sean Owen who was reviewing my code), but it was annoying to have builds fail because of very minor infractions. The $SPARK_HOME/dev directory, however, has some scripts that let you run this kind of thing on your own laptop first.
  4. All code must be accessible to Java, Scala and Python. Scala/Java compatibility is not without its wrinkles but PySpark code is a parallel code structure that mostly wraps Spark. Where it doesn't (for example, code on the client side), the Python must have as similar an interface to the JVM classes as possible.
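
Regarding point 1, a property-based test looks something like this (a standalone sketch, not Spark's actual test code):

    import org.scalacheck.Prop.forAll
    import org.scalacheck.Properties

    // ScalaCheck generates many arguments for each property and deliberately
    // includes corner cases such as Int.MaxValue, Int.MinValue and 0.
    object SafeAbsSpec extends Properties("safeAbs") {

      // Widening to Long before taking the absolute value avoids the overflow
      // that makes math.abs(Int.MinValue) negative.
      def safeAbs(i: Int): Long = math.abs(i.toLong)

      property("is never negative, even for Int.MinValue") =
        forAll { (i: Int) => safeAbs(i) >= 0L }
    }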

All in all, quite difficult but I think this is a fixed cost. My next pull request should be much faster.

Exceptions, Exceptions, Exceptions

I asked this question in the Discord channel about ZIO 1.0.3:

If I have a ZIO[Any, Throwable, String] that just throws an Exception then I can handle the exception with ZIO.either

But if I have a ZIO[Any, Throwable, String] that throws an Exception in the release of bracket I get:

    Fiber failed.
    An unchecked error was produced.


Is that expected? [Code to demonstrate here on GitHub]

Adam Fraser responded:

@PhillHenry The release action of bracket should never fail so it has a type of URIO[R, Any]. If there is an error there you need to expose the full cause and either handle it or ignore it. 

You are kind of lying to the compiler by doing URIO(x.close()).

URIO promises that something will never fail but close can fail so that is going to create an unchecked exception. 

If you make that something like Task(x.close()) to reflect the fact that the effect can fail then you are going to get a compilation error when you try to put that in bracket. 

And then you have a choice as the user. If a finalizer fails, which really shouldn't happen, how do you want to treat it? One option is to say just fail at that point. Not successfully running the finalizer indicates a potential resource leak and better to fail fast than die slowly later. So then you would call orDie on the effect, and you could potentially handle the cause at a higher level of your application. Another option is to say we made our best efforts, we just need to move on, and then you would do ignore or maybe log the error somewhere and then ignore it.
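
To make this concrete, here is roughly what the problem and Adam's two options look like in ZIO 1.x (a sketch with a made-up resource; this is not the code in the GitHub link above):

    import zio._

    object BracketSketch extends zio.App {

      // A resource whose close() can throw, like the x in the question above.
      class Flaky {
        def close(): Unit = throw new Exception("boom on close")
      }

      // What the original code did: URIO promises the release cannot fail, so the
      // exception escapes as an unchecked error and the fiber dies.
      val dies: Task[String] =
        Task(new Flaky).bracket(r => URIO(r.close()))(_ => Task("used the resource"))

      // Option 1: admit that close() can fail and fail fast if it does.
      val failsFast: Task[String] =
        Task(new Flaky).bracket(r => Task(r.close()).orDie)(_ => Task("used the resource"))

      // Option 2: best effort -- ignore (or log) a failure in the finalizer.
      val bestEffort: Task[String] =
        Task(new Flaky).bracket(r => Task(r.close()).ignore)(_ => Task("used the resource"))

      override def run(args: List[String]): URIO[ZEnv, ExitCode] =
        bestEffort.fold(_ => ExitCode.failure, _ => ExitCode.success)
    }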

Why couldn't the Exception manifest itself in the error channel of the ZIO?

Because the error channel models the failure of the acquire and use actions of bracket. It is possible that the use action fails and then the release action runs and also fails, so we need different channels for those errors.

Could ZIO not use Throwable.addSuppressed? I notice that this is what Java's try-with-resources does in these situations.

Well we have to be polymorphic. The error of use may not be a Throwable at all so we definitely can't add a suppressed exception to that.

That also makes it really easy to lose failures in finalizers when normally a failure in a finalizer is very bad and something you should explicitly handle.

Well the only way you are cheating the compiler is by doing UIO(somethingThatThrows). I'm not sure how that can be prevented other than by preventing users from constructing UIO values at all, which prevents users from describing effects that really can't fail. 

I think the lesson is just to only use UIO for effects that really don't throw exceptions (or if they do throw exceptions you are comfortable treating those as defects).

Can we not force bracket to be available only if the error channel is a Throwable?

But we want to use bracket in a ton of situations where it is not.

The bracket operator is core to safe resource management. Saying we could only use it when E was Throwable would prevent us from writing resource safe code with a polymorphic error type, which is basically all code we want to write.

Disquieting

If a ZIO can throw any Exception and bring the whole machinery to a crashing halt, why have an error channel in the first place? This seems to make ZIO a leaky abstraction, and this is what bothers me, although colleagues have told me they don't find it a worry at all.

The Cats way

Drew Boardman @drewboardman Feb 23 22:15
Basically I'm trying to see if there exists something like MonadError that signals whether the error has been handled but I think this is only possible with datatypes, not with typeclasses. I was having a discussion about how MonadError doesn't really signal, at the type-level, that anything has been handled. I basically got around to just creating a datatype that signals this - but that effectively just re-invents Either
Adam Rosien @arosien Feb 23 22:18
"whether the error has been handled" - do you mean knowing this statically? MonadError and the like don't distinguish, in the type, error-handling.
Fabio Labella @SystemFw Feb 23 22:18
It's kinda possible with cats-mtl , but in general yeah it's a lot easier with datatypes... So in this case you'd have a MonadError constraint, and handle it (eliminate it) by instantiating to EitherT (locally) and then you handle that. 

To recap, MonadError tells you that the called code has the right to raiseError and if it does, what type this will be. But there is no guarantee that an IO will not throw an Exception that brings the JVM crashing down.

IOs that throw Exceptions and MonadErrors that raiseError can be .attempted to get an IO[Either[Throwable, A]]. That is, the datatype, not the effect, indicates whether the exception has been handled or not.
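
A small sketch of that last point, assuming cats-effect 3 (the failing effects are made up):

    import cats.effect.IO
    import cats.effect.unsafe.implicits.global

    object AttemptSketch {

      // An IO that fails: either by raiseError or by throwing inside the thunk.
      val boom: IO[String]   = IO.raiseError(new Exception("raised"))
      val thrown: IO[String] = IO(throw new Exception("thrown"))

      def main(args: Array[String]): Unit = {
        // .attempt reifies the failure into the datatype: IO[Either[Throwable, String]].
        // The Either in the result type is what tells us the error has been handled.
        val handled: IO[Either[Throwable, String]] = boom.attempt
        println(handled.unsafeRunSync())        // Left(java.lang.Exception: raised)
        println(thrown.attempt.unsafeRunSync()) // Left(java.lang.Exception: thrown)
      }
    }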