In an attempt to try to grok Conditional Random Fields used in the Mallet library in a hurry, I've made some miscellaneous notes.
The basics
You will have been taught at school that the joint probability of A and B is:
P(A ∩ B) = P(A) P(B)
iff A and B and independent (see here).
Depending on your interpretation of probability theory, it is axiomatic that the relationship between the joint probability and the conditional probability is:
P(A ∩ B) = P(A|B) P(B)
These will come in useful.
CRFs
"A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure." [1]
We consider the distribution over
V = X ∪ Y
where:
V are random variables
X observed inputs
Y outputs we'd like to predict
"The main idea is to represent a distribution over a large number of random variables by a product of local functions [ΨA] that each depend on only a small number of variables." [1]
The Local Function
The local function has the form:
ΨA (xA, yA) = exp { Σk θA,k fA,k(xA, yA) }
where k is the k-th feature of a feature vector.
Graphical Models
"Traditionally, graphical models have been used to represent the joint probability distribution p(y, x) ... But modeling the joint distribution can lead to difficulties ... because it requires modeling the distribution p(x), which can include complex dependencies [that] can lead to intractable models. A solution to this problem is to directly model the conditional distribution p(y|x). " [1]
Undirected Graphical Model
If these variables are A ⊂ V, we define an undirected graphical model as the set of all distributions that can be written as:
p(x, y) = (1/Z) ∏A ΨA (xA, yA)
for any choice of factors, F = {ΨA}, where:
ΨA is a function υn → ℜ+ where υ is the set of values v can take
Z is the partition function that normalizes the values such that
Z = Σx,y ∏A ΨA (xA, yA)
Factor Graph
This undirected graphical model can be represented as a factor graph. "A factor graph is a bipartite graph G = (V, F, E)" that is, a graph with two disjoint sets of vertices. One set is the variables, vs ∈ V, the other is the factors ΨA ∈ F. All edges are between these two sets. An edge only exists if vertex vs is an argument for vertex ΨA.
Directed Model (aka Bayesian Network)
This is based on a directed graph G = (V, E) and is a family of distributions (aka "model") that factorize as:
p(x,y) = ∏v∈V p(v| π(v))
where π(v) are the parents of v in the graph G.
Naive Bayesian Classifier
As a concrete example, let's look at Mallet's implementation of Naive Bayes.
Firstly, what make Naive Bayes naive? "Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features." (from here).
We ask ourselves: given the data, what is the probability that the classifiers is C? Or, in math-speak, what is p(C|D)?
Now, the data, D, is made up of lots of little data points, { d1, d2, d3, ... }. And given that little equation at the top of this post, if all these data points are independent then:
p(D|C) = p({ d1, d2, d3, ... } | C) = p(d1|C) p(d2,|C) p(d3,|C) ...
Mallet's Java code for this is surprisingly easy and there is a JUnit test that demonstrates it. In the test, there is a dictionary of all the words in a corpus. It's a small dictionary of the words { win, puck, team, speech, vote }.
We create two vectors that represent the weightings for these words if a document relates to politics or sport. Not surprisingly, {speech, vote, win} have a higher weighting for politics and {team, puck, win} have a higher weighting for sports but all words have a nominal value of 1 added (before normalization) in each vector. This is Laplace Smoothing and it ensures the maths doesn't blow up when we try to take the log of zero.
Note that these weightings for each category are by definition p(D|C), that is, the probability of the data given a classification.
This being Bayes, we must have a prior. We simply assume the probability of an incoming document to be about sports or politics as equally likely since there is no evidence to the contrary.
Now, a feature vector comes in with {speech, win} equally weighted. To which class should it go?
The code effectively
- Calculates an inner product between the feature vector and each of (the logarithm of) the class's vector and then adds (the logarithm of) the bias. This is the multiplication in the Local Function section above.
- Deducts the maximum result for all results. This appears to be to handle rounding errors and drops out of the equation when we normalize (see below)
- Takes the exponential of each result.
- Normalizes by dividing by the sum of these exponentials. This is the partition function we saw above, Z.
But why does the code do this? Bayes theorem gives us:
p(C|D) = p(D|C) p(C) / p(D)
= prior x p(D|C) / Z
= prior x p(d1|C) p(d2,|C) p(d3,|C) ... / Z
now, if we let:
λy = ln(prior)
λy,j = ln(p(dj|Cy))
then
p(C|D) = (eλy ∏j=1K eλy,jdj) / Z = (eλy + Σj=1K λy,jdj) / Z
which is what the Java code does and is the same as equation 1.6 in [1]. Note the Java effectively has a e-maxScore in the numerator and denominator which of course cancels out.
Finally, that exponential value λy + Σj=1K λy,jdj can be more elegantly expressed as a matrix multiplication between our dictionary weights for the classifier, f and the vector to classify, θ (where the bias has been subsumed into the 0-th entry of the vector). That then gives us the Local Function above and so we have come full circle.
Aside: One last note on Naive Bayesian Classifiers. Like the TF-IDF statistic, it works far better than it should do. "Statisticians are somewhat disturbed by use of the NBC (which they dub Idiot's Bayes) because the naive assumption of independence is almost always invalid in the real world. However, the method has been shown to perform surprisingly well in a wide variety of contexts.Research continues on why this is." (from here)
[1] An Introduction to Conditional Random Fields for Relational Learning.