Here are some notes to help you quickly spin up a Spark cluster on Google Cloud with about 20GB of data to play with, although there is roughly 100 times that available if you have the patience.
Get the data
I was using data from here. There are gigabytes of the stuff. Each day's CSV logs are about 1GB bzipped but balloon to about 7GB uncompressed; the JSON logs go from about 500MB to 14GB. I only used one day's worth to test my cluster.
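The files arrive bzipped, so if you want to poke at them locally before uploading you'll need to decompress them first. A minimal sketch, assuming the compressed file is called netflow_day-02.bz2 (adjust to whatever your download is actually named):
$ bzcat netflow_day-02.bz2 | head    # peek at the first few rows without decompressing
$ bunzip2 -k netflow_day-02.bz2      # -k keeps the compressed copy alongside netflow_day-02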
Upload to BigQuery
You'll notice that you can't upload much to BigQuery through the Google web GUI, but you can through the command line by following the instructions here:
$ bq load Holmes.netflowday02 netflow_day-02 load_time:integer,duration:integer,src_device:string,dst_device:string,protocol:integer,src_port:string,dst_port:string,src_packets:integer,dst_packets:integer,src_bytes:integer,dst_bytes:integer
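To sanity-check that the load worked, you can ask BigQuery for the table's schema and a row count. Just a quick check, assuming the same dataset and table names as above:
$ bq show --schema --format=prettyjson Holmes.netflowday02
$ bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM Holmes.netflowday02'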
JSON can be a little trickier, but this link helped.
$ bq load --source_format=NEWLINE_DELIMITED_JSON Holmes.winday02 wls_day-02 Time:integer,EventID:integer,LogHost,LogonType:integer,LogonTypeDescription,UserName,DomainName,LogonID,SubjectUserName,SubjectDomainName,SubjectLogonID,Status,Source,ServiceName,Destination,AuthenticationPackage,FailureReason,ProcessName,ProcessID,ParentProcessName,ParentProcessID
Or you could put it straight into Google Cloud Storage with:
$ gsutil cp wls_day-02 gs://dataproc-2e2cd005-3309-44f6-9858-33cc5344e856-eu/wls_day-02
These uploads can take quite a while.
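If you want to check how an upload is getting on, or confirm the object sizes once it's done, something like this works (using the same bucket as above):
$ gsutil ls -l gs://dataproc-2e2cd005-3309-44f6-9858-33cc5344e856-eu/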
Create a DataProc cluster
You can do this through the GUI (as documented here) or through the command line on your local computer with something like:
$ gcloud beta dataproc clusters create holmes --enable-component-gateway --bucket mlwstructureddata --region europe-west1 --subnet default --zone "" --master-machine-type n1-standard-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-highmem-4 --worker-boot-disk-size 500 --image-version 1.3-deb9 --optional-components JUPYTER,ZEPPELIN,ANACONDA --scopes 'https://www.googleapis.com/auth/cloud-platform' --project mlwstructureddata
where, of course, you replace any values specific to me.
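Once that returns, it's worth confirming the cluster really is up before going any further; a couple of commands like these (using my cluster name and region) will do it:
$ gcloud dataproc clusters list --region europe-west1
$ gcloud dataproc clusters describe holmes --region europe-west1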
Fix Zeppelin
Annoyingly, if you use Zeppelin, you'll need to edit /usr/lib/zeppelin/conf/interpreter.json, change zeppelin.bigquery.project_id to your project ID, and run sudo systemctl restart zeppelin.service, or else you'll see:
Invalid project ID ' '
when you try to execute a BigQuery statement.
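For what it's worth, one way to make that change (assuming the default paths from the image used above) is to find the offending property, edit it by hand, and bounce the service:
$ grep -n zeppelin.bigquery.project_id /usr/lib/zeppelin/conf/interpreter.json
$ sudo vim /usr/lib/zeppelin/conf/interpreter.json    # set the property to your project ID
$ sudo systemctl restart zeppelin.service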
Copy to Google Cloud Storage
Note that DataProc's Spark cannot directly call BigQuery even though Zeppelin can. Therefore, you need to export the data with something like:
$ bq --location=europe extract --destination_format CSV --print_header=false mlwstructureddata:Holmes.netflowday02 gs://dataproc-2e2cd005-3309-44f6-9858-33cc5344e856-eu/netflowday02/*
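Because of the trailing *, BigQuery writes the export as a series of sharded files (it has to for anything over 1GB), and you can see what it produced with:
$ gsutil ls -l gs://dataproc-2e2cd005-3309-44f6-9858-33cc5344e856-eu/netflowday02/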
Once the export finishes, you'll be able to analyze it with Spark using something like:
val netflow = spark.read.format("csv").load("gs://dataproc-2e2cd005-3309-44f6-9858-33cc5344e856-eu/netflowday02/*").toDF("load_time","duration","src_device","dst_device","protocol","src_port","dst_port","src_packets","dst_packets","src_bytes","dst_bytes")
You'll notice that your notebooks are stored in GCS and can be listed by executing something like:
$ gsutil ls gs://mlwstructureddata/notebooks/zeppelin
or you can view them through a web GUI by following the link here.
Connecting
Use SSH tunneling for security. Set up the tunnel with:
$ gcloud compute ssh "holmes-m" --project=mlwstructureddata --zone="europe-west1-b" -- -D 9888 -N
where holmes-m is my GCP machine (the cluster's master node) and 9888 is the local port used for proxying.
Then, set up a proxy server for Chrome with:
$ /usr/bin/google-chrome --proxy-server="socks5://localhost:9888" --user-data-dir=/tmp/holmes-m
which will open a new Chrome window. In this window you can safely navigate to services on the remote box. For instance, visiting http://holmes-m:19888/ will show Hadoop's "JobHistory" page, and http://holmes-m:8080/#/ will take you to Zeppelin.
That's it!
Congratulations! You should now have a Google Cloud environment all set up to start crunching data. Enjoy!