Thursday, December 28, 2017

Introduction to the maths of SVMs

The machine learning technique Support Vector Machines get their names from the mathematical concept of supports - that is, the smallest closed set of a topological space that do not map to zero. We'll see this below.

They deal purely with numeric data but nominal values can easily be mapped to numeric values by having a column for all categories and a value 1 for those data that fall into that category and 0 for all others.

They are known to have a "low generalization error, [be] computationally inexpensive, easy to interpret results" [1] but the maths behind them is pretty complicated.

What is the kernel trick?

I found this nice explanation here.
It provides a bridge from linearity to non-linearity to any algorithm that can expressed solely on terms of dot products between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space. Now, the Kernel trick is really interesting because that mapping does not need to be ever computed. If our algorithm can be expressed only in terms of a inner product between two vectors, all we need is replace this inner product with the inner product from some other suitable space. That is where resides the “trick”... This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus unfeasible to compute.

An example

A great example can be found here. Chitta Ranjan takes the classic example of points that are not linearly separable in 2-dimensions and decides that he will use 3-dimensions. The function he uses for this is:

Φ(x) → x1, x2, √2 x1x2

Leading to a similarity measure of:

<Φ(xi), Φ(xj) >

Although this raising the number of dimensions does indeed separate the data, his friend, Sam, notes that you this can equivalently be written as:

<xi, xj > 2

All Sam had done "is realize there is some higher dimensional space which can separate the data. Choose a corresponding Kernel and voila! He is now working in the high-dimension while doing computation in the original low dimensional space."

Which Kernel?

Although lists of kernels exist, "people just go with the infinite dimension by using the Gaussian (RBF) Kernel. RBF is the most commonly used Kernel. It has a tuning hyperparameter σσ that is tuned to choose between a smooth or curvy boundary. In short, as a rule of thumb, once you realize linear boundary is not going to work try a non-linear boundary with an RBF Kernel."


In an attempt to understand things better, I turned to Learing with Kernels (Schölkopf). These are some notes I made:

ℌ, a feature space that has a dot product.
Χ, a set that contains the inputs.
Φ, a map from the inputs to the feature space. This may be trivial or a more exotic, non-linear mapping.
m is the number of inputs and n is the size of a feature vector.
k(x, x') is the kernel function that gives us a similarity measure.

A naive implementation

Given two clusters, we can calculate their centres:

c+ = m+-1 Σ{i|yi=+1} xi

c- = m--1 Σ{i|yi=-1} xi

where we differentiate the two clusters with + and - symbols.

We define w as the vector between the two centres, that is c+-c-.

Half way between these two clusters (c++c-/2) is point c. Which cluster x is closest to can be given by the sign of <x-c, w>. Or, to put it another way, the sign of:

<x-cw> = <x-(c+c-/ 2), c+-c-> = <xc+>-<xc-> - b

where b = (||c+||2- ||c-||2) / 2

The sign of this equation essentially gives us our decision function, y. However, to use the kernel trick in this equation, we need to express everything in terms of x. Since w and b can be expressed in terms of c and c in terms of x, this gives:

y = sgn(m+-1 Σ{i|yi=+1} <x, xi> - m--1 Σ{i|yi=-1} <xxi > + b)
  = sgn(m+-1 Σ{i|yi=+1} k(x, xi) - m--1 Σ{i|yi=-1} k(x, xi) + (m+-2 Σ{i,j|yi=+1} k(xi, xj) - m--2 Σ{i,j|yi=-1} k(xi, xj))/2)

Aside: if we then choose the origin to be equidistant from the two two centres, then b becomes zero. And if the kernel function, k, is normalized such that we can view it as a probability density function (that is, for a given x' the probability ) then:

Χ k(x, x') dx = 1 ∀ x ∈ Χ

This leads to the equation for y taking the form:

y = sgn(Σ αi k(x, xi) + b)

which looks like the Bayes Classifier apparently.

A better implementation

This is just a toy implementation where the hyperplane is just the mean between the two centroids. What we really want is a hyperplane that maximizes the distance between the clusters. That is, given a decision function f,

f(x) = sgn(<xw>+ b)

we want to maximize the minimum distance between the two sets of points either side of the boundary. Let's say that the two points in each set that are the closest are x1 and x2. With the right scaling, we can say:

<x1w> + b = 1
<x2w> + b = -1

Taking these as simultaneous equations, we can eliminate b and get:

<x1 - x2w> = 2

or equivalently:

<x1 - x2/ |w|> = 2 / |w|

The left hand side of this equation is the distance between  x1 and x2 projected on the normal of the hyperplane. If we want to maximize this, we want to maximize (2/|w|), or to put it another way, minimize |w|.

To this end, let's introduce an objective function:

τ(w) = |w|2 / 2

that we wish to maximize (why we square it and divide it by 2 will become apparent later. Suffice to say minimizing is the same as minimizing τ).

Now, one solution would be the trivial w = 0 so we need to add the inequality constraints:

yi (<xiw>+ b) ≥ 1

Now we use Lagrange multipliers. That is, we note at the minimum of τ that dτ/dw is 0 (standard calculus). Let's propose the Lagrangian we want to maximize as:

L(w, b, α) = τ(w) - Σi=1m αi (yi (<xiw>+ b) - 1)

Note that if the inequality constraint is not exact, the term in the summation is negative. Since we we want to maximize the Lagrangian, necessarily its corresponding αi must be zero (note that we are subtracting the summation). So, αi will only be non-zero for points that make the inequality constraint exact. This is the Karush-Kuhn-Tucker condition. It agrees with our intuition that we only care about nearest points. All other points are irrelevant.

Also, these non-zero αis will obviously have their corresponding term equal to 0 as they're exactly 1, and we subtract 1 from such (in)equality constraints. This is enough information to make our Lagrangian.

Differentiating the Lagrangian by w yields:

wΣi=1m αi yi xi

and by b yields:

Σi=1m αi yi = 0

This latter equation compliments the former by being a constraint on the solution.

Substituting this equation for w into that for τ(w), we can now say that its minimum is at:

Σi,j=1m ααj yi yj <xixij>

and the objective function we started with at the top of this section is now:

f(x) = sgn(Σi=1m αi yi <xxi>+ b)

The hard part is calculating the alphas etc in that minimization. This is the dual optimization problem and it beyond the scope of this post.


One last point is how effective the boundary is. "Since the labels are in {±1} there are at most 2m different labelings for m patterns. A very rich function class might be able to realize all 2m separations, in which case it is said to shatter the m points."

[1] Machine Learning in Action published by Manning.

No comments:

Post a Comment