Wednesday, January 25, 2017

Poisoned Poisson

I needed to code up a simple implementation of the probability mass function (PMF) for the boring old Poisson distribution. Five minutes work, or so I thought.

My first cut of the code looked like this:

  def poissonWithSmallMean(x: Int, mu: Double): Double = Math.pow(mu, x) * Math.exp(-mu) / factorial(x)

where factorial is a very simple implementation of factorial using tail recursion. I could have implemented in a more efficient manner using caching etc but I wasn't expecting too much trouble.

Going to the excellently named StatTrek to get some test values, my test looked like:

    poissonWithSmallMean(10, 12) shouldEqual 0.104837255883659 +- tolerance

and it passed. Great. 

Things started getting strange with slightly larger numbers. StatTrek was telling me that with x=998 and mu=999, we should get the modestly-sized answer of 0.0126209223360251. But my code was blowing up.

It's not hard to see why. That factorial goes crazy pretty quickly. Given that the expected value is not particularly large and not terribly small, how do we compute it simply? 

The answer is here. It depends on logs. If:

n! = n * (n-1) * (n-2) * ... * 1


log n! = log n + log(n-1) + log(n-2) + ... + log(1)

Now, our equation to calculate the PMF looks like this:

  def poissonWithLargeMean(x: Integer, mu: Double): Double = {
    val logged = (x * Math.log(mu)) - mu - logFactorial(x)

where logFactorial is just a simple implementation of the equation for log n! above. This handles x=998, mu=999 beautifully.

We could start implementing logFactorial in a much more efficient manner by using the Stirling approximation (log n!  n log n). This is a handy equation which incidentally also gives us the limit on comparison sorts. Take a look at John D Cooke's blog in the link above to see how to do this. But this simple solution is fine for me for now.

Friday, January 20, 2017

HBase File Structure

Some notes from a colleague who knows HBase better than I do.

Quick overview of how HBase stores data

When HBase batch writes in-memory data to disk, it may do so in separate files. Each file is sorted by the key.

So, say HBase persists data with keys, A, F, B. It sorts them so the file looks like:

[A, B, F]

Some time later, HBase persists G, P and M. Now there are two files that look like:

[A, B, F],  [G, M, P]

At some later time still, data with keys C, R and D arrive. Key R clearly comes after the last letter already written to the files (P) but D and C comes in the middle of the first file. So, what does it do? It creates a new file with [C, D, R] in it. To summarize, the files look like this:

[A, B, F], [G, M, P], [C, D, R]

Even later, a client wants to find the value for key D. Assuming it's not cached in memory, where does HBase look for it? Well, although all data within a file is ordered (so we can call off the search early if it's not where we would expect it in the file), we don't know which file D is in. HBase must search all files even if it doesn't search through all of a file.


To help matters, HBase can compact these three files into one that looks like:

[A, B, C], [D, F, G], [M, P, R]

Well, that makes searching much easier.

Read-Write Patterns

How can we leverage this? Well, in certain read-write patterns, we could force a compaction at an appropriate point. For instance, I'm using HBase to store a dictionary I build when processing a corpus of text. Later, I refer to that dictionary but I never update or add any more data. So, after I have finished my last write, I can call:


After compacting the table, my read throughput went up by a factor of 4.

For more information on HBase compaction see here.