Here's a cool little trick called feature hashing. The idea is that you want to turn some corpus of text into feature vectors without resorting to an external repository for the indices in those vectors. Looking things up in an external repository can be slow and complicates the code base.
Instead, you avoid this through hashing the words to an integer. Spark uses the HashingTF class for this.
However, the Spark code is very dependent on all your words being able to fit into positive Ints. This is probably OK for most corpus of text but when you start using n-grams ("Groups of words in a split are known as n-grams." ), then the numbers start to approach what an Int can hold.
But what's more worrying is that hashing leads to the possibility of hash collisions. How many you expect can be calculated from the formula here. Basically, given M possible slots and N words that need fitting into them, the probability of a word going into a particular slot, z, is 1/M. Therefore, the probability of avoiding a slot is:
p(miss z) = 1 - 1 = M - 1 = Y
The probability that we miss z for all N words as we add them to the slots is YN. So, obviously, the probability that the slot receives at least one word is 1-YN.
This is a formula for the probability of a slot being filled but what is the expected number of collisions? Here, we note a nice trick outlined here.
The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent.Now, we introduce a function that is 1 if slot z contains one or more words and 0 otherwise. Note: we're not interested in the number of words in the slot. We're only interested if the slot contains something.
We call this function for slot z, fz.
So, the expected total number of slots that hold at least one word is:
E(X) = E(f0) + E(f1) + E(f2) + ... + E(fM)
The expected value of anything is the sum of all possible values multiplied by their probability. So, the expected value of an individual f for any of the M slots is:
E(fz) = 0 * p(fz=0) + 1 * p(fz=1) = p(fz=1)
(Note that the expected value need not be an integer nor indeed sensible. The expected number when throwing a 6-sided die is 3.5.)
Given M slots, the expected number of slots that contain something is:
E(X) = M (1-YN)
So the total number of collisions is the difference between this and N as the Quora link says.
For a spike, we used the MurmerHash3 hashing algorithm over 820 million n-grams and the number of slots that had more than one word was 70 million (close to the 74 million this equation predicts when M = 2^32 and N = 820 million).
So, although it's a nice idea, Spark's current implementation is not for us. We'll look at rolling-our-own using Longs for which we could reasonably expect no collisions (using the above equation with M = 2^64).