Tuesday, February 14, 2017

Tweaking TF-IDF

TF-IDF is a simple statistic (with a few variants) used in information retrieval within a corpus of text. The funny thing is that although it works well, nobody seems to be sure why. Robertson says in On theoretical arguments for IDF:

"The number of papers that start from the premise that IDF is a purely heuristic device and ends with a claim to have provided a theoretical basis for it is quite startling."

The simplest expression for TF-IDF is:

(TF) · (IDF) = f(t,d) · log(N / |{d∈D : t∈d}|)

where t is a "term" (word, n-gram, etc.); d is a document in corpus D; N is the total number of documents in corpus D; and f(t,d) is the frequency of term t in document d. (Often the term frequency component is actually some function of the raw f(t,d).) "The base of the logarithm is not in general important" (Robertson).
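The formula above can be sketched in a few lines of Python. The toy corpus and whitespace tokenisation here are purely illustrative, not from the post:

```python
import math

# Tiny illustrative corpus (made up for this sketch).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # total number of documents in the corpus

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                 # f(t,d): raw count of t in d
    df = sum(1 for d in docs if term in d)      # |{d in D : t in d}|
    return tf * math.log(N / df) if df else 0.0

# "the" appears in two of the three documents, so each occurrence carries
# a low IDF weight; "cat" appears in only one, so it scores higher.
```

For example, `tf_idf("cat", docs[0])` exceeds `tf_idf("the", docs[0])` even though "the" occurs twice in that document, which is exactly the behaviour the weighting is meant to produce.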

It has been mooted that the IDF component looks like the log of a probability. That is, if we define the number of documents with term ti to be ni then we could expect the probability of a given document containing ti to be:

P(ti) = P(ti occurs in d) = ni/N

and the IDF term looks like:

idf(ti) = - log P(ti)

The nice thing about this is that "if we assume the occurrences of different terms in documents are statistically independent, then addition is the correct thing to do with the logs":

idf(t1∧t2) = -log P(t1∧t2)
           = -log(P(t1) P(t2))
           = -(log P(t1) + log P(t2))
           = idf(t1) + idf(t2)

Thus we can add IDF terms to give us a total score.


Lucene used to use this implementation but appears to be moving to something called BM25, with which its developers report good results. I tried BM25 but, compared to the classic formula, it was unfortunately prone to over-linking my documents.

So, I returned to the classic implementation but thought again about an approximation I had made. I had capped the maximum document frequency of a given term at 1000, simply for practical reasons of storage. All other terms are discarded, as anything that occurs too often is unlikely to be useful in this particular domain. Note that I could not filter out a list of stopwords, as much of my corpus was not English. The hope was that stopwords would naturally drop out as part of this restriction. Imposing this limit still gave me over 90 million terms.

The result was that the difference between the weight of a rare term and that of the most common term we record was not that great. Since I am comparing documents, the minimum document frequency for a term to be useful is 2 (a term with a frequency of 1 obviously can't link two documents). If the maximum frequency is 1000, then the ratio of the IDFs is:

log (N / 2) / log (N / 1000)

which for 90 million documents is only about 1.5. No wonder the results for my document comparisons were not great. Now, if I set N = 1000 + 1 (the +1 to avoid a nasty divide by zero), the ratio between the weighting for the rarest term and the most common is about 6218, which seems more reasonable. And sure enough, more documents seemed to match. (Caveat: I say "seemed to match" as I only examined a sample then extrapolated. There are far too many documents to say definitively that the matching is better.)
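These back-of-envelope ratios are easy to reproduce:

```python
import math

def idf_ratio(N):
    """Ratio of the IDF of the rarest useful term (document frequency 2)
    to the IDF of the most common recorded term (document frequency 1000)."""
    return math.log(N / 2) / math.log(N / 1000)

print(idf_ratio(90_000_000))  # roughly 1.5
print(idf_ratio(1000 + 1))    # roughly 6200
```

As N shrinks toward the frequency cap, log(N / 1000) approaches zero, so the denominator collapses and the dynamic range between rare and common terms opens up dramatically.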
