Agile Java Man: The Maths of LSA

At the heart of Latent Semantic Analysis is a mathematical pattern called Singular Value Decomposition (SVD). This is an appropriate technique when you have sparse data in many dimensions. It reduces it to smaller dimensions with (hopefully) little loss of accuracy.

First, a little mathematical refresher.

Eigenvalues and Eigenvectors

An eigenvector is a vector that when multiplied by a matrix points in the same direction. It's length may have changed by (a non-zero) factor called the eigenvalue.

Only square matrices can have eigenvalues and eigenvectors. Anything non-square will change the dimension of the vector so the result would never be a multiple of itself. SVD is a hack to work with non-square matrices.

What's more, not all matrices will have real valued eigen-vectors/values. Let's say the matrix is a rotation. No vector (apart from the 0-vector) will point in the same direction after all rotations. In Python, this is a 45-degree rotation.

import numpy as np

def no_complex_elements(vec):
np.all(map(lambda arr: not np.iscomplex(arr.flatten), vec))

def print_eigenvectors_and_eigenvalues(matrix):
eigenvalues, eigenvectors = np.linalg.eig(rotation)
print "real eigenvalues \n", filter(lambda scalar: not isinstance(scalar, complex), eigenvalues)
print "real eigenvectors\n", filter(lambda vector: no_complex_elements(vector), eigenvectors)

if __name__ == "__main__":
rotation = np.matrix('0.707106781 -0.707106781 0;'
'0.707106781 0.707106781 0;'
'0 0 1')
print_eigenvectors_and_eigenvalues(rotation)

and it output's an empty list.

The trolley-line-location problem

In "Coding the Matrix", Philip Klein compares SVD to finding the optimal path for a straight tram line to run through a city. I'll skip most the proofs (they're in the book) but the gist goes like this:

Imagine the tram line must go through the city centre (the origin) while getting as close as possible to fixed land marks located with vectors a_i.

Make each a_i a row in a vector A.

Now, each point a_i can be expressed in terms of two vectors, one parallel to the tram line, a_i^||v, and one perpendicular to it, a_i^¬v (notice that we're choosing a new co-ordinate system) and that since

a_i = a_i^¬v + a_i^||v

then, squaring:

||a_i||² = ||a_i^¬v||² + ||a_i^||v||²

and re-arranging:

||a_i^¬v||²= ||a_i||² - ||a_i^||v||²

This, of course, is our distance from the tram line squared. If we want to minimize this value for all landmarks, we can express the problem as finding the minimum of:

∑
i

(||a_i||² - ||a_i^||v||²)

which, when you again re-arrange, will give you:

||A||_F²- ||Av||²

where the F indicates the Frobenius Norm (nothing too scary, just a generalization of a vector multiplied by itself) and where the second term is just us cleaning up all the other terms:

∑
i

<a_i, v>²

where <a_i, v>is just the dot-product of a_i and the unit vector, v which is of course a_i^||v.

Seems a lot of work. Don't worry, it's all been high-school maths so far.

Anyway, it's this second term that we want to maximize. As the first term is fixed, maximizing the second will find the overall minimum for the whole equation. We won't worry too much how we do this - there are numerical techniques for doing it - this post is just to give the flavour.

Having chosen a suitable unit vector to maximize the second term, we'll call it v₁ to remind us that this optimizing of distances between landmarks and the tram line can be viewed as a 1-dimensional approximation to covering all the landmarks as best we can using just a straight line. Perhaps we want to place a tram-stop at the point closest to each landmark. In which case, we could build a set of vectors similar to A only this time each row is a vector to a tram stop. Let's call this matrix Ã. Furthermore, we call ||Av₁|| the first singular value (or σ₁) and v₁ the first right singular vector.

Then, the error in our approximation is A - Ã and we also know that this represents the distance from the tram line to landmark in each row in resultant matrix. Well, we've already calculated this, so:

||A - Ã||_F²= ||A||_F²- ||Av||²= ||A||_F²- σ₁²

Note that the positions of these tram stops can also be calculated as the projection of a_i onto v₁. So, each row of Ã can be expressed as:

<a_i, v₁> v₁

So, taking this equation and re-expressing

Ã = A v₁v₁^T= σ₁u₁v₁^T

where we define σ₁u₁ = A v₁

This is starting to look like our familiar equation in SVD. So, it only remains to note that this "trolley-line" example is a specific case. We could expand it to more than a 2-D map.

We do this by finding the unit vector v₂ that is orthogonal to v₁ and maximises ||Av|| and again for v₃, v₄, etc.

It doesn't take a great deal of effort to now show that σ₁, σ₂, σ₃ etc can be turned into a diagonal matrix and that would give us the SVD equation.

It's also not too hard to see that the values of this diagonal matrix are in descending order since at each stage we found the maximum before moving on to the next step.

Agile Java Man

Sunday, January 17, 2016

The Maths of LSA

No comments:

Post a Comment

Blog Archive

About Me