Monday, April 24, 2017

Crash Course in Conditional Random Fields


In an attempt to grok the Conditional Random Fields used in the Mallet library in a hurry, I've made some miscellaneous notes.

The basics


You will have been taught at school that the joint probability of A and B is:

P(A ∩ B) = P(A) P(B)

iff A and B are independent (see here).

Depending on your interpretation of probability theory, it is axiomatic that the relationship between the joint probability and the conditional probability is:

P(A ∩ B) = P(A|B) P(B)

These will come in useful.

CRFs

"A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure." [1]

We consider the distribution over

V = X ∪ Y

where:

V is the set of random variables
X are the observed inputs
Y are the outputs we'd like to predict

"The main idea is to represent a distribution over a large number of random variables by a product of local functions [ΨA] that each depend on only a small number of variables." [1]

The Local Function

The local function has the form:

Ψ_A(x_A, y_A) = exp{ Σ_k θ_{A,k} f_{A,k}(x_A, y_A) }

where k indexes the features (and their weights, θ) of factor A.
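To make that concrete, here is a minimal Python sketch (my own, not Mallet's code) of a single local function, where theta holds the weights θ_{A,k} and f the feature values f_{A,k} for that factor:

import math

def local_function(theta, f):
    # Psi_A(x_A, y_A) = exp(sum_k theta_k * f_k) for one factor A
    return math.exp(sum(t * v for t, v in zip(theta, f)))

# e.g. two features with weights 0.5 and -1.2
print(local_function([0.5, -1.2], [1.0, 0.0]))   # exp(0.5) ~ 1.65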

Graphical Models

"Traditionally, graphical models have been used to represent the joint probability distribution p(y, x) ... But modeling the joint distribution can lead to difficulties ... because it requires modeling the distribution p(x), which can include complex dependencies [that] can lead to intractable models. A solution to this problem is to directly model the conditional distribution p(y|x). " [1]

Undirected Graphical Model

If each local function Ψ_A depends on a subset of variables A ⊆ V, we define an undirected graphical model as the set of all distributions that can be written as:

p(x, y) = (1/Z) ∏_A Ψ_A(x_A, y_A)

for any choice of factors, F = {Ψ_A}, where:

Ψ_A is a function υ^n → ℜ+ where υ is the set of values v can take
Z is the partition function that normalizes the values such that

Z = Σ_{x,y} ∏_A Ψ_A(x_A, y_A)
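As a toy illustration of the product of factors and the partition function (my own example, not from [1]): take two binary variables with a hypothetical unary factor each, plus one pairwise factor.

import itertools, math

def psi_1(v1):      return math.exp(0.5 * v1)        # unary factor on v1
def psi_2(v2):      return math.exp(-0.3 * v2)       # unary factor on v2
def psi_12(v1, v2): return math.exp(1.0 * v1 * v2)   # pairwise factor

def unnormalized(v1, v2):
    return psi_1(v1) * psi_2(v2) * psi_12(v1, v2)

# Z sums the product of factors over every possible assignment
Z = sum(unnormalized(v1, v2) for v1, v2 in itertools.product([0, 1], repeat=2))

def p(v1, v2):
    return unnormalized(v1, v2) / Z

print(sum(p(v1, v2) for v1, v2 in itertools.product([0, 1], repeat=2)))   # 1.0

Enumerating every assignment is only feasible for toy models, of course; in general computing Z is the expensive part.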

Factor Graph

This undirected graphical model can be represented as a factor graph. "A factor graph is a bipartite graph G = (V, F, E)", that is, a graph with two disjoint sets of vertices. One set is the variables, v_s ∈ V; the other is the factors, Ψ_A ∈ F. All edges are between these two sets, and an edge connects v_s and Ψ_A only if v_s is an argument of the factor Ψ_A.

Directed Model (aka Bayesian Network)

This is based on a directed graph G = (V, E) and is a family of distributions (aka "model") that factorize as:

p(x, y) = ∏_{v ∈ V} p(v | π(v))

where π(v) are the parents of v in the graph G.
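For example, a hypothetical three-node chain x → y → z (toy numbers of my own) factorizes as p(x) p(y|x) p(z|y):

# conditional probability tables for binary x, y, z
p_x   = {0: 0.6, 1: 0.4}
p_y_x = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}   # key is (y, x)
p_z_y = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.5, (1, 1): 0.5}   # key is (z, y)

def joint(x, y, z):
    # p(x, y, z) = p(x) p(y | x) p(z | y), i.e. each variable conditioned on its parent
    return p_x[x] * p_y_x[(y, x)] * p_z_y[(z, y)]

# unlike the undirected case, no partition function is needed: the joint already sums to 1
print(sum(joint(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)))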

Naive Bayesian Classifier

As a concrete example, let's look at Mallet's implementation of Naive Bayes.

Firstly, what makes Naive Bayes naive? "Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features." (from here).

We ask ourselves: given the data, what is the probability that the class is C? Or, in math-speak, what is p(C|D)?

Now, the data, D, is made up of lots of little data points, { d1, d2, d3, ... }. And given that little equation at the top of this post, if all these data points are independent then:

p(D|C) = p({ d1, d2, d3, ... } | C) = p(d1|C) p(d2|C) p(d3|C) ...

Mallet's Java code for this is surprisingly easy and there is a JUnit test that demonstrates it. In the test, there is a dictionary of all the words in a corpus. It's a small dictionary of the words { win, puck, team, speech, vote }.

We create two vectors that represent the weightings for these words if a document relates to politics or sport. Not surprisingly, {speech, vote, win} have a higher weighting for politics and {team, puck, win} have a higher weighting for sports but all words have a nominal value of 1 added (before normalization) in each vector. This is Laplace Smoothing and it ensures the maths doesn't blow up when we try to take the log of zero.

Note that these weightings for each category are by definition p(D|C), that is, the probability of the data given a classification.

This being Bayes, we must have a prior. We simply assume the probability of an incoming document to be about sports or politics as equally likely since there is no evidence to the contrary.

Now, a feature vector comes in with {speech, win} equally weighted. To which class should it go?

The code effectively
  1. Calculates an inner product between the feature vector and (the logarithm of) each class's weight vector, then adds (the logarithm of) that class's prior, the bias. This is the weighted sum inside the exponential of the Local Function section above.
  2. Subtracts the maximum result from all the results. This appears to be there for numerical stability (avoiding overflow in the exponential) and drops out of the equation when we normalize (see below).
  3. Takes the exponential of each result.
  4. Normalizes by dividing by the sum of these exponentials. This is the partition function we saw above, Z.
The most probable is the class we're looking for.
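Here is a rough Python sketch of those four steps (not the Mallet code itself; the weights and prior are made up):

import math

# hypothetical per-class word weights (p(d|C)) and a uniform prior
log_prior  = {'politics': math.log(0.5), 'sport': math.log(0.5)}
log_weight = {
    'politics': {'win': math.log(0.2), 'speech': math.log(0.3), 'vote': math.log(0.3),
                 'team': math.log(0.1), 'puck': math.log(0.1)},
    'sport':    {'win': math.log(0.2), 'speech': math.log(0.1), 'vote': math.log(0.1),
                 'team': math.log(0.3), 'puck': math.log(0.3)},
}

features = {'speech': 1.0, 'win': 1.0}   # the incoming document

# 1. inner product with each class's log weights, plus the log of the bias (the prior)
scores = {c: log_prior[c] + sum(v * log_weight[c][w] for w, v in features.items())
          for c in log_prior}

# 2. subtract the maximum score (cancels out when we normalize)
max_score = max(scores.values())

# 3. exponentiate
exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}

# 4. normalize by the partition function, Z
Z = sum(exp_scores.values())
posterior = {c: e / Z for c, e in exp_scores.items()}

print(posterior)   # politics comes out more probable for {speech, win}

With these made-up numbers, the politics class wins with a posterior of 0.75.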

But why does the code do this? Bayes theorem gives us:

p(C|D) = p(D|C) p(C) / p(D)
        = prior × p(D|C) / Z
        = prior × p(d1|C) p(d2|C) p(d3|C) ... / Z

now, if we let:

λ_y     = ln(prior)
λ_{y,j} = ln(p(d_j | C_y))

then

p(C|D) = (e^{λ_y} ∏_{j=1}^{K} e^{λ_{y,j} d_j}) / Z = e^{λ_y + Σ_{j=1}^{K} λ_{y,j} d_j} / Z

which is what the Java code does and is the same as equation 1.6 in [1]. Note that the Java effectively has an e^{-maxScore} in both the numerator and the denominator, which of course cancels out.

Finally, that exponent λ_y + Σ_{j=1}^{K} λ_{y,j} d_j can be more elegantly expressed as a dot product between the class's weight vector, θ (with the bias λ_y subsumed into its 0-th entry), and the feature vector to classify, f (with a constant 1 in its 0-th entry). That gives us the Local Function above, and so we have come full circle.

Aside: One last note on Naive Bayesian Classifiers. Like the TF-IDF statistic, it works far better than it should do. "Statisticians are somewhat disturbed by use of the NBC (which they dub Idiot's Bayes) because the naive assumption of independence is almost always invalid in the real world. However, the method has been shown to perform surprisingly well in a wide variety of contexts. Research continues on why this is." (from here)

[1] Sutton and McCallum, An Introduction to Conditional Random Fields for Relational Learning.

Sunday, April 9, 2017

Generating a Gaussian distribution


Firstly, let's consider what a Gaussian is. It can be derived here. The definition Dr Wilson gives is: "Data are said to be normally distributed if the rate at which the frequencies fall off is proportional to the distance of the score from the mean, and to the frequencies themselves". Or, in maths speak:

df/dx = k (x - μ) f(x)

where k is a constant and μ is the mean. Solving this differential equation (via integration by parts and integration by substitution) and normalizing so that the total probability is 1 (since it's a probability distribution), you'll eventually get:

f(x) = (1 / (σ √(2π))) e^{-(x - μ)²/(2σ²)}

which is the Gaussian, where σ is the standard deviation.

Now, the square of its integral still equals 1 (since obviously 1² = 1) and, written as a double integral, looks like this:

(1/2π) ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^{-x²/2} e^{-y²/2} dx dy

where I've absorbed the constants by standardizing the variables (μ = 0, σ = 1), using x for one copy of the integral and y for the other. The trick is now to use polar co-ordinates, so:

x     = r cos θ
y     = r sin θ
dx dy = r dr dθ

and that integral becomes:

(1/2π) ∫_0^{2π} dθ ∫_0^{∞} r e^{-r²/2} dr

which, with another integration by substitution with ψ = r2/2, becomes:

(1/2π) ∫_0^{2π} dθ ∫_0^{∞} e^{-ψ} dψ

and again with Υ = e^{-ψ}:

(1/2π) ∫_0^{2π} dθ ∫_0^{1} dΥ = (1/2π) × 2π × 1 = 1

Where Transformation Sampling comes in

"We don't care about this integral. What we do is we use our relationship between integration and sampling" [Werner Krauth at Coursera] where θ is a random number between 0 and  and Υ is a random number between 0 and 1.

So, now we work backwards. If ψ = r²/2 and Υ = e^{-ψ}, then r = √(-2 ln Υ), where Υ is sampled over the interval [0, 1]. The easy bit is θ, which is simply sampled in the interval [0, 2π].

Plugging these numbers back into the equations for polar co-ordinates, we can now give samples for x and y. Notice that you actually get 2 values (x and y) for the price of one. So, implementations (like java.lang.Random.nextGaussian) maintain state (the next Gaussian value if it's available) which is calculated for "free" - except that the method is synchronized...
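Putting those pieces together, a minimal sketch of the basic (trigonometric) transform described above:

import math, random

def gaussian_pair():
    # two independent standard normal samples from two uniform samples
    upsilon = 1.0 - random.random()            # uniform on (0, 1], avoids log(0)
    theta   = random.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(-2.0 * math.log(upsilon))    # from Upsilon = e^(-r^2/2)
    return r * math.cos(theta), r * math.sin(theta)

samples = [value for _ in range(100000) for value in gaussian_pair()]
mean = sum(samples) / len(samples)
var  = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # should be close to 0 and 1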

Implementation

What we have above is a decent implementation for generating Gaussian values but it's inefficient. There is a nice math-hack to make it better.

Java Number Cruncher (p367) says:
We begin this algorithm by generating two random values v1 and v2 that are uniformly distributed between -1 and +1. We then compute the value s = v1² + v2². If s > 1, we reject the values of v1 and v2 and generate new ones... With two good values of v1 and v2, we then compute:
x1 = v1 √((-2 ln s) / s)
x2 = v2 √((-2 ln s) / s)
Where this comes from had me scuttling off to here, which describes the Box-Muller method. Basically, we want a number between 0 and 1 for Υ and any value in [0, 2π] for θ. Well, this is just finding uniformly distributed values within the unit circle. Since the process of scattering points uniformly over a box in no way favours nor discriminates between points within a circle inside that box, we can be sure that just taking those inner points is also uniformly distributed.

What we want to do is get rid of those expensive sin and cos functions for x and y. We note that

cos θ = adjacent / hypotenuse = v1 / √s
sin θ = opposite / hypotenuse = v2 / √s

since √s = √(v1² + v2²) is the length of our vector (as defined in the above algorithm from Java Number Cruncher), and s itself, being uniformly distributed on (0, 1), plays the role of Υ. Plugging these into our polar co-ordinate equations above, we get the equations for x1 and x2 we see above. QED.
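A Python sketch of that rejection-based variant (my own, following the algorithm quoted from Java Number Cruncher):

import math, random

def gaussian_pair_polar():
    # two standard normal samples without calling sin or cos
    while True:
        v1 = random.uniform(-1.0, 1.0)
        v2 = random.uniform(-1.0, 1.0)
        s = v1 * v1 + v2 * v2
        if 0.0 < s <= 1.0:                     # keep only points inside the unit circle
            break
    scale = math.sqrt(-2.0 * math.log(s) / s)
    return v1 * scale, v2 * scale

print(gaussian_pair_polar())

The rejection loop throws away roughly 1 - π/4 ≈ 21% of the candidate pairs, but avoiding the trigonometric calls still makes it a win.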

Saturday, April 8, 2017

Video games to statistical mechanics


A friend was writing a computer game in his spare time and wanted to speckle a sphere with texture uniformly over its surface. Idly, he asked our team how to do this. We came up with some naive ideas like taking the x, y and z co-ordinates from a uniform sample and normalizing them to make a vector of length r, the radius of the sphere. In Python, this would be:

import random, math, pylab, mpl_toolkits.mplot3d

x_list, y_list, z_list = [],[],[]
nsamples = 10000
for sample in xrange(nsamples):
    # sample a point uniformly from the cube [-1, 1]^3...
    x, y, z = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
    # ...and project it onto the unit sphere
    radius = math.sqrt(x ** 2 + y ** 2 + z ** 2)
    x_list.append(x / radius)
    y_list.append(y / radius)
    z_list.append(z / radius)
fig = pylab.figure()
ax = fig.gca(projection='3d')
ax.set_aspect('equal')
pylab.plot(x_list, y_list, z_list, '+')
pylab.show()

but this is not quite uniformly distributed over the sphere. It produces a sphere like this:
Sphere with Cartesian co-ordinates taken from a uniform sample and then normalized

Using polar co-ordinates and uniformly sampling over 2π doesn't make things better either. This:

for sample in xrange(nsamples):
    phi, theta = random.uniform(0, 2.0) * math.pi, random.uniform(0, 2.0) * math.pi
    x_list.append(math.cos(phi) * math.cos(theta))
    y_list.append(math.cos(phi) * math.sin(theta))
    z_list.append(math.sin(phi))

gives:
Sphere with polar co-ordinates taken from a uniform sample
which is also clearly not evenly spread.

The solution involves Gaussian (a.k.a 'Normal') distributions, thus:

for sample in xrange(nsamples):
    # sample each co-ordinate from a standard Gaussian...
    x, y, z = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    # ...and normalize: the resulting direction is uniform over the sphere
    radius = math.sqrt(x ** 2 + y ** 2 + z ** 2)
    x_list.append(x / radius)
    y_list.append(y / radius)
    z_list.append(z / radius)

Why this works I'll leave to another post. Suffice to say, this even distribution is used not just in games but also in thermodynamics, where you might want to model the velocities of the molecules of a gas. In that scenario, (x ** 2 + y ** 2 + z ** 2) would model the squared speed of a molecule, useful in calculating its kinetic energy, ½mv².


Saturday, April 1, 2017

Feature Hashing


Here's a cool little trick called feature hashing. The idea is that you want to turn some corpus of text into feature vectors without resorting to an external repository for the indices in those vectors. Looking things up in an external repository can be slow and complicates the code base.

Instead, you avoid this through hashing the words to an integer. Spark uses the HashingTF class for this.
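The idea, stripped down to a minimal Python sketch (this is just the concept, not Spark's HashingTF; the vector size and the use of Python's builtin hash are illustrative choices):

NUM_FEATURES = 1 << 20   # number of slots, M

def hash_features(words, num_features=NUM_FEATURES):
    vector = [0] * num_features
    for word in words:
        index = hash(word) % num_features   # no dictionary lookup, just a hash
        vector[index] += 1                  # colliding words silently share a slot
    return vector

v = hash_features("the cat sat on the mat".split())
print(sum(v))   # 6 words hashed into the vector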

However, the Spark code is very dependent on all your (hashed) words fitting into positive Ints. This is probably OK for most corpora of text, but when you start using n-grams ("Groups of words in a split are known as n-grams." [1]), the numbers start to approach what an Int can hold.

But what's more worrying is that hashing leads to the possibility of hash collisions. How many to expect can be calculated from the formula here. Basically, given M possible slots and N words that need fitting into them, the probability of a word going into a particular slot, z, is 1/M. Therefore, the probability of missing that slot is:

p(miss z) = 1 - 1/M = (M - 1)/M = Y

The probability that we miss z for all N words as we add them to the slots is Y^N. So, obviously, the probability that the slot receives at least one word is 1 - Y^N.

This is a formula for the probability of a slot being filled but what is the expected number of collisions? Here, we note a nice trick outlined here.
The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent.
Now, we introduce a function that is 1 if slot z contains one or more words and 0 otherwise. Note: we're not interested in the number of words in the slot. We're only interested if the slot contains something.

We call this function for slot z, fz.

So, the expected total number of slots that hold at least one word is:

E(X) = E(f_1) + E(f_2) + E(f_3) + ... + E(f_M)

The expected value of anything is the sum of all possible values multiplied by their probability. So, the expected value of an individual f for any of the M slots is:

E(f_z) = 0 * p(f_z=0) + 1 * p(f_z=1) = p(f_z=1)

(Note that the expected value need not be an integer, nor indeed sensible. The expected value when throwing a 6-sided die is 3.5.)

Given M slots, the expected number of slots that contain something is:

E(X) = M (1 - Y^N)

So the expected total number of collisions is the difference between this and N, i.e. N - M(1 - Y^N), as the Quora link says.

For a spike, we used the MurmurHash3 hashing algorithm over 820 million n-grams and the number of slots that had more than one word was 70 million (close to the 74 million this equation predicts when M = 2^32 and N = 820 million).

So, although it's a nice idea, Spark's current implementation is not for us. We'll look at rolling our own using Longs, for which we could reasonably expect no collisions (using the above equation with M = 2^64).
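As a quick sanity check of that formula (math.log1p and math.expm1 keep the arithmetic stable when M is huge):

import math

def expected_collisions(m, n):
    # N - M * (1 - Y^N) where Y = (M - 1) / M
    filled = -m * math.expm1(n * math.log1p(-1.0 / m))
    return n - filled

print(expected_collisions(2 ** 32, 820 * 10 ** 6))   # ~74 million
print(expected_collisions(2 ** 64, 820 * 10 ** 6))   # ~0.02, i.e. effectively none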

[1] Real World Machine Learning.