## Sunday, January 17, 2016

### The Maths of LSA

At the heart of Latent Semantic Analysis is a mathematical pattern called Singular Value Decomposition (SVD). This is an appropriate technique when you have sparse data in many dimensions. It reduces it to smaller dimensions with (hopefully) little loss of accuracy.

First, a little mathematical refresher.

Eigenvalues and Eigenvectors

An eigenvector is a vector that when multiplied by a matrix points in the same direction. It's length may have changed by (a non-zero) factor called the eigenvalue.

Only square matrices can have eigenvalues and eigenvectors. Anything non-square will change the dimension of the vector so the result would never be a multiple of itself. SVD is a hack to work with non-square matrices.

What's more, not all matrices will have real valued eigen-vectors/values. Let's say the matrix is a rotation. No vector (apart from the 0-vector) will point in the same direction after all rotations. In Python, this is a 45-degree rotation.

import numpy as np

def no_complex_elements(vec):
np.all(map(lambda arr: not np.iscomplex(arr.flatten), vec))

def print_eigenvectors_and_eigenvalues(matrix):
eigenvalues, eigenvectors = np.linalg.eig(rotation)
print "real eigenvalues \n", filter(lambda scalar: not isinstance(scalar, complex), eigenvalues)
print "real eigenvectors\n", filter(lambda vector: no_complex_elements(vector), eigenvectors)

rotation = np.matrix('0.707106781  -0.707106781  0;'
'0.707106781  0.707106781   0;'
'0            0             1')
print_eigenvectors_and_eigenvalues(rotation)

and it output's an empty list.

The trolley-line-location problem

In "Coding the Matrix", Philip Klein compares SVD to finding the optimal path for a straight tram line to run through a city. I'll skip most the proofs (they're in the book) but the gist goes like this:

Imagine the tram line must go through the city centre (the origin) while getting as close as possible to fixed land marks located with vectors ai.

Make each ai a row in a vector A.

Now, each point ai can be expressed in terms of two vectors, one parallel to the tram line, ai||v, and one perpendicular to it, ai¬v (notice that we're choosing a new co-ordinate system) and that since
ai = ai¬v + ai||v
then, squaring:
||ai||2 = ||ai¬v||2 + ||ai||v ||2
and re-arranging:
||ai¬v||= ||ai||2 - ||ai||v ||2

This, of course, is our distance from the tram line squared. If we want to minimize this value for all landmarks, we can express the problem as finding the minimum of:
 ∑ i (||ai||2 - ||ai||v ||2)
which, when you again re-arrange, will give you:
||A||F2  - ||Av||2
where the F indicates the Frobenius Norm (nothing too scary, just a generalization of a vector multiplied by itself) and where the second term is just us cleaning up all the other terms:
 ∑ i 2
where <aiv>is just the dot-product of ai and the unit vector, v which is of course ai||v.

Seems a lot of work. Don't worry, it's all been high-school maths so far.

Anyway, it's this second term that we want to maximize. As the first term is fixed, maximizing the second will find the overall minimum for the whole equation. We won't worry too much how we do this - there are numerical techniques for doing it - this post is just to give the flavour.

Having chosen a suitable unit vector to maximize the second term, we'll call it v1 to remind us that this optimizing of distances between landmarks and the tram line can be viewed as a 1-dimensional approximation to covering all the landmarks as best we can using just a straight line. Perhaps we want to place a tram-stop at the point closest to each landmark. In which case, we could build a set of vectors similar to A only this time each row is a vector to a tram stop. Let's call this matrix Ã. Furthermore, we call ||Av1|| the first singular value (or σ1) and v1 the first right singular vector.

Then, the error in our approximation is A - Ã and we also know that this represents the distance from the tram line to landmark in each row in resultant matrix. Well, we've already calculated this, so:
||A - Ã||F2  =  ||A||F2  - ||Av||2  = ||A||F2  σ12
Note that the positions of these tram stops can also be calculated as the projection of ai onto v1. So, each row of Ã can be expressed as:
<ai, v1v1
So, taking this equation and re-expressing
Ã = A vv1T  σuv1T
where we define σu1 = A v1

This is starting to look like our familiar equation in SVD. So, it only remains to note that this "trolley-line" example is a specific case. We could expand it to more than a 2-D map.

We do this by finding the unit vector v2 that is orthogonal to v1 and maximises ||Av|| and again for v3v4, etc.

It doesn't take a great deal of effort to now show that σ1σ2σ3 etc can be turned into a diagonal matrix and that would give us the SVD equation.

It's also not too hard to see that the values of this diagonal matrix are in descending order since at each stage we found the maximum before moving on to the next step.