Note on setting up Spark on AWS
YMMV, but these are some things that were useful to me.
Set up your Amazon account per the instructions page. I put my key in:
~/.ssh/MyTestAwsIreland.id_rsa
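Incidentally, the spark-ec2 script used below reads your AWS credentials from the environment, and ssh will complain if the key file is readable by anyone else, so you'll want something along these lines first (the values are obviously your own):

chmod 600 ~/.ssh/MyTestAwsIreland.id_rsa
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY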
I downloaded the latest Spark and ran this from its ec2 directory:
./spark-ec2 -k MyTestAwsIreland -i ~/.ssh/MyTestAwsIreland.id_rsa -s 4 --instance-type=c4.8xlarge --region=eu-west-1 --vpc-id=MY_VPC_ID --subnet-id=MY_SUBNET_ID --zone=eu-west-1a launch MY_CLUSTER_NAME
(Your region and zone may be different. If you see errors, this page may help.)
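The same spark-ec2 script handles the rest of the cluster's lifecycle too; logging in, stopping and tearing down look roughly like this (same key and region as above):

./spark-ec2 -k MyTestAwsIreland -i ~/.ssh/MyTestAwsIreland.id_rsa --region=eu-west-1 login MY_CLUSTER_NAME
./spark-ec2 --region=eu-west-1 stop MY_CLUSTER_NAME
./spark-ec2 --region=eu-west-1 destroy MY_CLUSTER_NAME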
The launch command starts a fairly beefy cluster with one master and four slaves, running the same version of Spark as the one you downloaded. Unfortunately, there wasn't much disk space (I was getting "no space left on device" errors from Spark jobs), so follow the instructions here; if there are problems, this page will almost certainly solve them.
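The quick diagnosis is df -h on the workers plus a look at where Spark is spilling, which is controlled by spark.local.dir. A sketch of one fix, assuming Spark is installed under /root/spark (where spark-ec2 puts it) and you have a bigger volume mounted at /vol (adjust to whatever df actually shows on your boxes):

df -h
mkdir -p /vol/spark
echo "spark.local.dir /vol/spark" >> /root/spark/conf/spark-defaults.conf

Then bounce the cluster (sbin/stop-all.sh and sbin/start-all.sh in SPARK_HOME). Launching with --ebs-vol-size on the spark-ec2 command line is another way of getting more room in the first place.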
Also, I was having trouble connecting. You must set up a public DNS if you want your local Spark driver to talk to the cluster (see here for more information).
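For completeness, submitting from the laptop ends up looking something like this (the class and jar names are just placeholders, and 7077 is the standalone master's default port); spark.driver.host matters because the executors have to be able to open connections back to the driver on your machine:

./bin/spark-submit \
  --master spark://ec2-XX-XX-XX-XX.eu-west-1.compute.amazonaws.com:7077 \
  --conf spark.driver.host=YOUR_LAPTOP_PUBLIC_DNS \
  --class com.example.MyJob \
  my-job.jar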
Running Hadoop was/is proving more complicated. I could not use port 9000 as it was being blocked for some reason (my local client's socket was left in the SYN_SENT state, which suggests a firewall issue), so I changed it to 8020. This remains a puzzle.
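For reference, the NameNode port lives in core-site.xml on the cluster, in fs.default.name (fs.defaultFS in newer Hadoops), so the change amounts to editing that value to something like hdfs://MASTER_PRIVATE_DNS:8020. A quick way to see whether a given port is reachable from your side at all:

nc -zv MASTER_PUBLIC_DNS 9000
nc -zv MASTER_PUBLIC_DNS 8020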
Also, Spark clients initially talk to the NameNode but then start talking directly to the DataNodes. AWS instances have a public domain name/IP pair and a private domain name/IP pair. Unfortunately, the NameNode was handing out the private IP addresses. This link forces it to send the domain name instead but, at the time of writing, it's still only the private domain name. Using the hostname command on the boxes has not solved the problem.
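If memory serves, the settings in question are these two, which go in hdfs-site.xml (the first is read by clients, the second by the DataNodes) and make HDFS hand out hostnames rather than raw IPs; as above, though, those are still only the private hostnames:

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>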
Anyway, it also helps to get rsync set up so you can develop on your laptop and then sync your code with something as simple as
rsync -avz -e "ssh -i ~/.ssh/MyTestAwsIreland.id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --exclude '.git' --exclude '.idea' --progress LOCAL_PROJECT_DIRECTORY root@YOUR_MACHINE_NAME:/REMOTE_PROJECT_DIRECTORY
Finally, a simple script to run on all your slave boxes during a job is:
while true ; do jstat -gc $(jps | grep CoarseGrainedExecutorBackend | awk '{print $1}') 2>/dev/null | awk '{print $8 "/" $7 " " $14}' ; uptime ; vmstat 1 2 ; echo ; sleep 10 ; done
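To save logging into each box by hand, the same sort of thing can be driven from the master using the list of slave hostnames the setup scripts leave behind (e.g. /root/spark/conf/slaves) and the passwordless SSH that spark-ec2 sets up between master and slaves, something like:

for h in $(cat /root/spark/conf/slaves) ; do ssh -o StrictHostKeyChecking=no root@$h uptime ; done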
Addendum: this seems to install a Spark build with a really old Hadoop client library (see SPARK_HOME/lib/spark-assembly-1.6.0-hadoop1.2.1.jar). I re-installed a build against a newer Hadoop and had it talking to Hadoop nicely, but not before I had wasted time trying to solve this issue.
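If you hit the same thing, the least painful routes seem to be either launching with --hadoop-major-version=2 on the spark-ec2 command line, or dropping in a Spark package pre-built against a newer Hadoop, e.g.:

wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar xzf spark-1.6.0-bin-hadoop2.6.tgz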