Saturday, January 30, 2016

Cloud Atlas


Note on setting up Spark on AWS

YMMV, but these are some things that were useful to me.

Set up your Amazon account per the instructions page. I put my key in:

~/.ssh/MyTestAwsIreland.id_rsa

I downloaded the latest Spark and ran this from the top level directory:

./spark-ec2 -k MyTestAwsIreland -i ~/.ssh/MyTestAwsIreland.id_rsa -s 4 --instance-type=c4.8xlarge --region=eu-west-1 --vpc-id=MY_VPC_ID --subnet-id=MY_SUBNET_ID --zone=eu-west-1a launch MY_CLUSTER_NAME

(your region and zone may be different. If you see errors, this page may help).

This starts a fairly beefy cluster with one master and 4 slaves, using the same version of Spark as you downloaded. Unfortunately, there wasn't much disk space (I was getting "no space left on device" errors from Spark jobs), so follow the instructions here and, if there are problems, this page will almost certainly solve them.

Also, I was having trouble connecting. You must set up a public DNS if you want your local Spark driver to talk to the cluster (see here for more information).
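As a minimal sketch of the driver side (the DNS names are placeholders, and this assumes the standalone master is on its default port, 7077), the key setting is spark.driver.host:

from pyspark import SparkConf, SparkContext

# MASTER_PUBLIC_DNS / LAPTOP_PUBLIC_DNS are hypothetical placeholders
conf = (SparkConf()
        .setAppName("my-test-job")
        .setMaster("spark://MASTER_PUBLIC_DNS:7077")
        # executors connect *back* to the driver, so advertise an address
        # they can actually reach, not a private/NATed one
        .set("spark.driver.host", "LAPTOP_PUBLIC_DNS"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())  # quick smoke test: should print 499500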

Running Hadoop was/is proving more complicated. I could not use port 9000 as it was being blocked for some reason (my local client socket was left in the SYN_SENT state, indicating a firewall issue) so I changed it to 8020. This remains a puzzle.
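For reference, the NameNode port lives in core-site.xml; a sketch of the change (on the Hadoop 1.x that spark-ec2 installs the property is fs.default.name; on Hadoop 2.x it's fs.defaultFS):

<property>
  <name>fs.default.name</name>
  <value>hdfs://MASTER_PUBLIC_DNS:8020</value>
</property>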

Also, Spark clients initially talk to the NameNode but then start talking directly to the DataNodes. AWS instances have a public domain name/IP pair and a private domain name/IP pair and, unfortunately, the NameNode was sending clients the private IP address. This link forces it to send the domain name instead but, at the time of writing, it's only the private domain name. Using the hostname command on the boxes has not solved the problem.
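One knob that may help here (an assumption on my part rather than something verified on this cluster, and it needs a reasonably recent Hadoop client) is dfs.client.use.datanode.hostname, which tells HDFS clients to connect to DataNodes by hostname rather than by whatever IP the NameNode hands back. In hdfs-site.xml on the client side:

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>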

Anyway, it will also help to get rsync set up so you can develop on your laptop and then sync your code with something as simple as:

rsync -avz -e "ssh -i ~/.ssh/MyTestAwsIreland.id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --exclude '.git' --exclude '.idea' --progress LOCAL_PROJECT_DIRECTORY root@YOUR_MACHINE_NAME:/REMOTE_PROJECT_DIRECTORY

Finally, a simple script to run on all your slave boxes during a job is:

while true ; do
    # old-gen used/capacity and cumulative full-GC time for the executor JVM
    jstat -gc $(jps | grep CoarseGrainedExecutorBackend | awk '{print $1}') 2>/dev/null | awk '{print $8 "/" $7 " " $14}'
    uptime
    vmstat 1 2
    echo
    sleep 10
done

which monitors the executor JVM, printing its old generation heap usage over its capacity and the cumulative time spent in full GCs, plus the load average (from uptime) and various system stats (from vmstat).

Addendum: this seems to install a Spark instance with a really old Hadoop library (see SPARK_HOME/lib/spark-assembly-1.6.0-hadoop1.2.1.jar). I re-installed a version built against a newer Hadoop and had it talking to Hadoop nicely, but not before I wasted time trying to solve this issue.

Sunday, January 17, 2016

The Maths of LSA


At the heart of Latent Semantic Analysis is a mathematical technique called Singular Value Decomposition (SVD). It's an appropriate technique when you have sparse data in many dimensions: it reduces the data to fewer dimensions with (hopefully) little loss of accuracy.
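To make that concrete before the maths, here's a small sketch using numpy (the matrix and the choice of k are made up purely for illustration):

import numpy as np

# a made-up "documents x terms" count matrix: 6 points in 4 dimensions
A = np.array([[1., 0., 1., 0.],
              [1., 1., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 0., 1.],
              [1., 0., 1., 1.],
              [0., 1., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                 # keep only the 2 largest singular values
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # the best rank-k approximation of A

# the squared error is exactly the sum of the squares of the discarded
# singular values, so if those are small, little accuracy is lost
print(np.linalg.norm(A - A_k, 'fro') ** 2, (s[k:] ** 2).sum())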

First, a little mathematical refresher.

Eigenvalues and Eigenvectors

An eigenvector is a vector that, when multiplied by a matrix, points in the same direction. Its length may have changed by a (non-zero) factor called the eigenvalue.

Only square matrices can have eigenvalues and eigenvectors. Anything non-square will change the dimension of the vector so the result would never be a multiple of itself. SVD is a hack to work with non-square matrices.
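numpy makes the contrast easy to see: eig demands a square matrix while svd accepts any shape. A throwaway example:

import numpy as np

M = np.array([[1., 2., 3.],
              [4., 5., 6.]])           # 2x3, not square

try:
    np.linalg.eig(M)                   # eigenvectors only exist for square matrices
except np.linalg.LinAlgError as e:
    print(e)                           # numpy refuses

U, s, Vt = np.linalg.svd(M)            # ...but SVD is happy with any shape
print(U.shape, s.shape, Vt.shape)      # (2, 2) (2,) (3, 3)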

What's more, not all matrices have real-valued eigenvectors and eigenvalues. Let's say the matrix is a rotation in the plane: no vector (apart from the 0-vector) points in the same direction after being rotated, so there can be no real eigenvectors. In Python, here is a 45-degree rotation:

import numpy as np


def real_only(eigenvalues, eigenvectors):
    # keep only the eigenvalues with no imaginary part, together with their
    # eigenvectors (the *columns* of the matrix that np.linalg.eig returns)
    real_indices = [i for i, value in enumerate(eigenvalues) if np.isreal(value)]
    return ([eigenvalues[i].real for i in real_indices],
            [eigenvectors[:, i] for i in real_indices])


def print_eigenvectors_and_eigenvalues(matrix):
    eigenvalues, eigenvectors = np.linalg.eig(matrix)
    real_eigenvalues, real_eigenvectors = real_only(eigenvalues, eigenvectors)
    print("real eigenvalues \n", real_eigenvalues)
    print("real eigenvectors\n", real_eigenvectors)


if __name__ == "__main__":
    # a 45-degree rotation of the plane: cos(45) = sin(45) = 0.707106781...
    rotation = np.array([[0.707106781, -0.707106781],
                         [0.707106781,  0.707106781]])
    print_eigenvectors_and_eigenvalues(rotation)

and it outputs two empty lists: the rotation has no real eigenvalues or eigenvectors.
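For contrast, feeding a matrix that merely stretches the axes into the same function (a made-up example reusing print_eigenvectors_and_eigenvalues from above) gives real results:

scaling = np.array([[2., 0.],
                    [0., 3.]])   # stretch x by 2 and y by 3
print_eigenvectors_and_eigenvalues(scaling)
# the eigenvalues are 2.0 and 3.0 and the eigenvectors are the axes themselves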

The trolley-line-location problem

In "Coding the Matrix", Philip Klein compares SVD to finding the optimal path for a straight tram line to run through a city. I'll skip most the proofs (they're in the book) but the gist goes like this:

Imagine the tram line must go through the city centre (the origin) while getting as close as possible to fixed landmarks located with vectors $a_i$.

Make each $a_i$ a row in a matrix $A$.

Now, each point $a_i$ can be expressed in terms of two vectors, one parallel to the tram line, $a_i^{\parallel v}$, and one perpendicular to it, $a_i^{\perp v}$ (notice that we're choosing a new co-ordinate system). Since

$$a_i = a_i^{\perp v} + a_i^{\parallel v}$$

then, squaring (Pythagoras, as the two components are orthogonal):

$$\|a_i\|^2 = \|a_i^{\perp v}\|^2 + \|a_i^{\parallel v}\|^2$$

and re-arranging:

$$\|a_i^{\perp v}\|^2 = \|a_i\|^2 - \|a_i^{\parallel v}\|^2$$

This, of course, is our distance from the tram line squared. If we want to minimize this value for all landmarks, we can express the problem as finding the minimum of:

$$\sum_i \left( \|a_i\|^2 - \|a_i^{\parallel v}\|^2 \right)$$

which, when you again re-arrange, will give you:

$$\|A\|_F^2 - \|Av\|^2$$

where the $F$ indicates the Frobenius norm (nothing too scary, just the square root of the sum of the squares of all the entries, a generalization of the length of a vector) and where the second term is just us collecting up all the parallel components, since row $i$ of the vector $Av$ is the dot product $\langle a_i, v \rangle$:

$$\|Av\|^2 = \sum_i \langle a_i, v \rangle^2$$

and $\langle a_i, v \rangle v$, the projection of $a_i$ onto the unit vector $v$, is of course $a_i^{\parallel v}$, so each term $\langle a_i, v \rangle^2$ is just $\|a_i^{\parallel v}\|^2$.
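If you want to convince yourself of that identity numerically, here is a quick numpy check (the landmarks and v are made up):

import numpy as np

A = np.array([[3., 1.],              # each row is a landmark
              [2., 2.],
              [-1., 4.]])
v = np.array([1., 1.]) / np.sqrt(2.) # an arbitrary unit vector

lhs = sum(a @ a - (a @ v) ** 2 for a in A)  # sum of squared distances to the line
rhs = np.linalg.norm(A, 'fro') ** 2 - np.linalg.norm(A @ v) ** 2
print(lhs, rhs)                             # the same number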

Seems a lot of work. Don't worry, it's all been high-school maths so far.

Anyway, it's this second term that we want to maximize. As the first term is fixed, maximizing the second will find the overall minimum for the whole equation. We won't worry too much how we do this - there are numerical techniques for doing it - this post is just to give the flavour.

Having chosen a suitable unit vector to maximize the second term, we'll call it $v_1$ to remind us that this optimizing of distances between landmarks and the tram line can be viewed as a 1-dimensional approximation to covering all the landmarks as best we can using just a straight line. Perhaps we want to place a tram stop at the point on the line closest to each landmark. In which case, we could build a matrix similar to $A$, only this time each row is a vector to a tram stop. Let's call this matrix $\tilde{A}$. Furthermore, we call $\|Av_1\|$ the first singular value (written $\sigma_1$) and $v_1$ the first right singular vector.

Then, the error in our approximation is $A - \tilde{A}$, and we also know that each row of this matrix runs from a tram stop to its landmark, whose length is the distance we have already calculated, so:

$$\|A - \tilde{A}\|_F^2 = \|A\|_F^2 - \|Av_1\|^2 = \|A\|_F^2 - \sigma_1^2$$

Note that the positions of these tram stops can also be calculated as the projection of $a_i$ onto $v_1$. So, each row of $\tilde{A}$ can be expressed as:

$$\langle a_i, v_1 \rangle v_1$$

So, stacking these rows and re-expressing:

$$\tilde{A} = A v_1 v_1^T = \sigma_1 u_1 v_1^T$$

where we define $\sigma_1 u_1 = A v_1$ (which makes $u_1$ a unit vector, the first left singular vector).
This is starting to look like our familiar equation in SVD. So, it only remains to note that this "trolley-line" example is a specific case. We could expand it to more than a 2-D map.

We do this by finding the unit vector $v_2$ that is orthogonal to $v_1$ and maximises $\|Av_2\|$, and again for $v_3$, $v_4$, etc.

It doesn't take a great deal of effort to now show that $\sigma_1, \sigma_2, \sigma_3$, etc. can be turned into a diagonal matrix $\Sigma$ and that would give us the SVD equation, $A = U \Sigma V^T$.

It's also not too hard to see that the values of this diagonal matrix are in descending order since at each stage we found the maximum before moving on to the next step.
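And indeed that's exactly what np.linalg.svd hands back; a sketch on an arbitrary matrix:

import numpy as np

A = np.random.randn(5, 3)                    # any made-up matrix will do
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)                                     # the singular values, largest first
print(np.all(np.diff(s) <= 0))               # True: descending order
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: A = U Sigma V^T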