Since the feature hashing code that comes with Spark is based on a 32-bit hash giving us lots of collisions, we went for a 64-bit implementation. But what makes a good hash algorithm?

Rather than roll my own, I used this code. But how do I make sure it's good? This Stack Exchange answer gave me some clues by using the chi-squared test. Here, we basically compare with some fairly simple maths, our expected answer to our actual answer to indicate whether there is an unusually high number of collisions that our algorithm generates.

"The chi-squared test of goodness of fit is used to test the hypothesis that the distribution of a categorical variable within a population follows a specific pattern of proportions, while the alternative hypothesis is that the distribution of the variable follows some other pattern." [1]

__The Chi-Square Test__

"The values of column and rows totals are called marginals because they are on the margin of the table... The numbers within the table ... are called joint frequencies...

"If the two variables are not related, we would expect that the frequency of each cell would be the product of its marginals, divided by the sample size." [1]

This is because the probabilities of A and B are:

P(A ∩ B) = P(A) P(B) iff A ⊥ B

In other words, given a sample size of N, the probability of A is

P(A) = N

_{A}/ N

and

P(B) = N

_{B}/ N

so

P(A ∩ B) = N

_{B}N

_{A}/ N

^{2}

so the expected frequency would be

N P(A ∩ B) = N

_{B}N

_{A}/ N

exactly as [1] says.

The Chi-Square distribution then looks like this:

χ = Σ

_{i,j=1}(O

_{ij}- E

_{ij})

^{2}/ E

_{ij}

where i and j are being summed over all the rows and columns.

We can then perform the chi-squared test. This involves looking up the critical value in a table (or calculating it) from the degrees of freedom and the observed chi-squared value. "Basically, if the chi-square you calculated was bigger than the critical value in the table, then the data did not fit the model, which means you have to reject the null hypothesis." (from here)

__The Results__

Using this new hashing algorithm for our feature hashing resulted in no collisions while hashing all words in the corpus of text where previously we were getting about 70 million collisions with the 32-bit version.

Consequently, the number of records we then had to compare dropped by about 20 million (or about 2% of the total).

Unfortunately, although I am able to execute a chi-square test on the 32-bit algorithm, the 64-bit algorithm has such a large possible number of hash values, it's not practically possible to test it in such a manner.

[1] Statistics in a Nutshell, Sarah Boslaugh

## No comments:

## Post a Comment