Saturday, February 18, 2017

Knuth and Poissons


Trying to get my head around how Donald Knuth's method generates samples from a Poisson distribution. The algorithm is simple:

Let L = e^(-λ) and p = 1. Count the number of times you multiply p by a uniformly distributed random number in the range [0, 1] until p drops below L; that count, minus one, is your Poisson sample.
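
In code, Knuth's method looks something like this minimal Python sketch (the function name knuth_poisson and the use of random.random() are my own choices for illustration):

```python
import math
import random

def knuth_poisson(lam):
    """Draw one sample from a Poisson distribution with mean lam using Knuth's method."""
    L = math.exp(-lam)            # the threshold e^(-lambda)
    k = 0
    p = 1.0
    while True:
        k += 1
        p *= random.random()      # multiply by a uniform sample in [0, 1)
        if p <= L:                # stop once the running product has dropped to L or below
            return k - 1          # the number of multiplications, minus one
```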

Simple. But why does it work?

Poisson

The Poisson distribution relies on the Poisson process. That is, the probability that something is about to occur has nothing to do with what has gone on in the past. This is seen throughout nature, which makes it very useful in the sciences. For example, the probability of an atom radioactively decaying has nothing to do with how long it has been sitting around before it decays.

In other words, if X = (X_1, X_2, X_3, ...) is the sequence of inter-arrival times of a memory-less event:
"At each arrival time and at each fixed time, the process must probabilistically restart, independent of the past. The first part of that assumption implies that X is a sequence of independent, identically distributed variables [IIDs] . The second part of the assumption implies that if the first arrival has not occurred by time s, then the time remaining until the arrival occurs must have the same distribution as the first arrival time itself" (from here).
This memory-less property can be expressed mathematically like this:

P(X > t + s | X > s) = P(X > t)
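
Spelling that out: the definition of conditional probability (plus the fact that X > t + s implies X > s) turns the memory-less property into a multiplicative rule for the survival probability, which is exactly the step used just below:

```latex
\[
\begin{aligned}
P(X > t) &= P(X > t + s \mid X > s)
          = \frac{P(X > t + s \,\wedge\, X > s)}{P(X > s)}
          = \frac{P(X > t + s)}{P(X > s)} \\
&\Longrightarrow\; P(X > t + s) = P(X > t)\,P(X > s)
\end{aligned}
\]
```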


Let F(t) = P(X > t) for t > 0. Then:

F(t + s) = F(t) F(s)

since, as shown above, the memory-less property means the probability of surviving past t + s factorises into the probability of surviving past s multiplied by the probability of surviving a further t. (The mathematically inclined may look at this equation and suspect that logarithms are about to make an appearance...)

Now, this is the clever bit. Let a = F(1). Then:

F(n) = F(Σ_{i=1}^{n} 1) = Π_{i=1}^{n} F(1) = a^n

The proof that this is true for all rational n is here but, briefly: the same argument gives F(1/m) = a^(1/m) and hence F(n/m) = a^(n/m), and monotonicity extends this to all real t > 0. So, if we let λ = -ln(a), then:

F(t) = e^(-λt)

which, if you recall, is the probability that the first event happens later than time t. That is, it's a value between 0 and 1.

Now, we use a method called inversion sampling. The basic idea is that, given a plot of y = P(x), taking uniformly random samples on the y-axis and reading off the corresponding values on the x-axis gives you a distribution that conforms to P(x).

Note that y represents a probability so necessarily 0 ≤ y ≤ 1. Since a uniform random number generator comes with pretty much any programming language, the problem is getting simpler.
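
As a concrete illustration, here is a minimal Python sketch (the function name is mine) that applies inversion sampling to the survival function F(t) = e^(-λt) derived above, producing exponentially distributed inter-arrival times:

```python
import math
import random

def exponential_sample(lam):
    """Inversion sampling: treat a uniform sample as F(t) = e^(-lam * t) and solve for t."""
    u = 1.0 - random.random()     # uniform on (0, 1], so the log below is always defined
    return -math.log(u) / lam     # t = -ln(U) / lambda
```

The average of many such samples should come out at roughly 1/λ, the mean of an exponential distribution with rate λ.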

The last piece of the jigsaw is documented here and explains why Knuth's algorithm is so simple. Basically, we rearrange the equation for F(t) by taking natural logs:

t = - ln F(t) / λ

and since we're using inversion sampling, we know that the probability F(t) is going to be simulated by a uniformly distributed random variable between 0 and 1. We call it U.

Now, given any random sample U_i there will be a value t_i. The question Knuth asks is: how many random samples U_i do I need before the t_i add up to a unit of time (i.e., 1)?

More mathematically, what is k for:

Σ_{i=1}^{k} t_i ≤ 1 < Σ_{i=1}^{k+1} t_i

Substituting t_i = -ln(U_i)/λ:

- Σ_{i=1}^{k} ln(U_i) / λ ≤ 1 < - Σ_{i=1}^{k+1} ln(U_i) / λ

Multiplying through by -λ (which flips the inequalities):

Σ_{i=1}^{k} ln(U_i) ≥ -λ > Σ_{i=1}^{k+1} ln(U_i)

ln(Π_{i=1}^{k} U_i) ≥ -λ > ln(Π_{i=1}^{k+1} U_i)

and exponentiating:

Π_{i=1}^{k} U_i ≥ e^(-λ) > Π_{i=1}^{k+1} U_i

which is Knuth's algorithm: keep multiplying uniform samples together until the product drops below e^(-λ); the number of complete multiplications before that happens is the Poisson sample k. Notice that the beauty of this is that it's computationally cheap as long as λ (and therefore the typical k) is relatively small. This should be fine for most applications.
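
As a quick sanity check, assuming the knuth_poisson sketch from earlier: the sample mean and variance of many draws should both come out close to λ, since a Poisson distribution has mean and variance equal to λ.

```python
# Empirical check of the sampler sketched earlier (knuth_poisson).
lam = 4.2
samples = [knuth_poisson(lam) for _ in range(100_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)

print(mean, variance)   # both should be roughly 4.2
```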

Tuesday, February 14, 2017

Tweaking TF-IDF


TF-IDF is a simple statistic (with a few variants) used in information retrieval within a corpus of text. The funny thing is that although it works well, nobody seems to be sure why. Robertson says in On theoretical arguments for IDF:

"The number of papers that start from the premise that IDF is a purely heuristic device and ends with a claim to have provided a theoretical basis for it is quite startling."

The simplest expression for TF-IDF is:

(TF) · (IDF) = f_{t,d} · log(N / |{d ∈ D : t ∈ d}|)

where t is a "term" (word, n-gram, etc.); d is a document in corpus D; N is the total number of documents in corpus D; and f_{t,d} is the frequency of term t in document d. (Often the term frequency component is actually a function of the raw f_{t,d}.) "The base of the logarithm is not in general important" (Robertson).
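
To make the formula concrete, here is a toy Python sketch of this simplest variant (the corpus and function name are invented for illustration):

```python
import math

def tf_idf(term, document, corpus):
    """Simplest TF-IDF: raw term frequency times log(N / document frequency).
    A document is a list of terms; the corpus is a list of such documents."""
    tf = document.count(term)                  # f_{t,d}: raw count of the term in this document
    df = sum(1 for d in corpus if term in d)   # |{d in D : t in d}|: documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)     # the base of the log doesn't matter much

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
    ["the", "cat", "chased", "the", "dog"],
]
print(tf_idf("cat", corpus[0], corpus))   # "cat" is in 2 of 3 documents: 1 * log(3/2)
print(tf_idf("the", corpus[0], corpus))   # "the" is in every document: its IDF is log(1) = 0
```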

It has been mooted that the IDF component looks like the log of a probability. That is, if we define the number of documents with term t_i to be n_i, then we could expect the probability of a given document containing t_i to be:

P(t_i) = P(t_i occurs in d) = n_i / N

and the IDF term looks like:

idf(t_i) = -log P(t_i)

The nice thing about this is that "if we assume the occurrences of different terms in documents are statistically independent, then addition is the correct thing to do with the logs":

idf(t_1 ∧ t_2) = -log P(t_1 ∧ t_2)
               = -log(P(t_1) P(t_2))
               = -(log P(t_1) + log P(t_2))
               = idf(t_1) + idf(t_2)

Thus we can add IDF terms to give us a total score.
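
A quick numerical check of that additivity, with made-up document counts:

```python
import math

N = 1000            # total documents (made-up)
n1, n2 = 10, 100    # documents containing t1 and t2 respectively (made-up)

idf1 = -math.log(n1 / N)    # idf(t1)
idf2 = -math.log(n2 / N)    # idf(t2)

# Under the independence assumption, P(t1 and t2) = P(t1) * P(t2)
idf_both = -math.log((n1 / N) * (n2 / N))

print(idf1 + idf2, idf_both)    # both are about 6.91: the IDFs simply add
```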

Implementations

Lucene used to use this implementation but appears to be moving to something called BM25, with which they're getting good results. I tried BM25 but unfortunately it was prone to over-linking my documents compared to the classic formula.

So, I returned to the classic implementation but thought again about an approximation I had made. I had capped the maximum document frequency of a given term at 1000, simply for practical reasons of storage. Terms above this cap are discarded, as something that occurs too often is not likely to be useful in this particular domain. Note that I could not filter out a list of stopwords as much of my corpus was not English. The hope was that stopwords would naturally drop out as part of this restriction. Imposing this limit still gave me over 90 million terms.

The result was that the difference in weight between a rare term and the most common term we record was not that great. Since I am comparing documents, the minimum document frequency for a term to be useful is 2 (a term that appears in only one document obviously can't link two documents). If the maximum document frequency is 1000, then the ratio of the IDFs would be:

log (N / 2) / log (N / 1000)

which for 90 million documents is only about 1.5. No wonder the results for my document comparisons were not great. Now, if I set N = 1000 + 1 (the +1 to avoid a nasty divide by zero, since log(1000/1000) = 0), the ratio between the weighting for the rarest term and the most common is about 6218, which seems more reasonable. And sure enough, more documents seemed to match. (Caveat: I say "seemed to match" as I only examined a sample and then extrapolated. There are far too many documents to say definitively that the matching is better.)
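
For what it's worth, here is the arithmetic behind those two ratios as a small Python sketch (using the figures quoted above):

```python
import math

N = 90_000_000              # the ~90 million figure from above
min_df, max_df = 2, 1000    # rarest useful document frequency and the cap

# Ratio of IDF weights using the full corpus size
print(math.log(N / min_df) / math.log(N / max_df))      # ~1.5

# Ratio with N replaced by 1000 + 1 (the +1 avoids log(1) = 0 in the denominator)
N_small = 1000 + 1
print(math.log(N_small / min_df) / math.log(N_small / max_df))   # ~6218
```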