Tuesday, June 18, 2019

Signal from noise - part 1


PCA

How do you mathematically determine that there is a signal in a scatter plot? A non-maths colleague said "just use principal component analysis", but it's much harder than that.

First, how would you even represent the data? Since it's a 2-D scatter plot, you might represent the data as an Nx2 matrix, where N is the number of points. You could then use singular value decomposition to calculate the principal components, as SVD and PCA are closely related [StackOverflow].

Recall that if our matrix is called X,

U, S, Vᵀ = SVD(X)

Note that the "principal components are given by XV = USVᵀV = US" (ibid).

Now, we can reconstruct our original X from the matrix product USVᵀ. We can see a real example of this in my GitHub repository here. Full reconstruction looks like this:
As you can see, I've added a sinusoidal shape to an otherwise random collection of points.
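
For concreteness, here is a minimal numpy sketch of the whole procedure. It is not the code from the repository - the point counts, ranges and seed are invented - but it shows the SVD, the principal components and the single-PC reconstructions discussed below:

import numpy as np

np.random.seed(42)

# Hypothetical data: random points plus a sinusoidal band (the real code is
# in the GitHub repository; these sizes and ranges are purely illustrative).
N = 1000
noise = np.column_stack([np.random.uniform(0.0, 2.0 * np.pi, N),
                         np.random.uniform(-3.0, 3.0, N)])
t = np.linspace(0.0, 2.0 * np.pi, N)
signal = np.column_stack([t, np.sin(t)])
X = np.vstack([noise, signal])                  # a (2N) x 2 matrix of points

# SVD: X = U S V^T, so the principal components are XV = US
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * S                                     # equivalently X @ Vt.T

# Reconstructions from a single principal component (the two plots below)
first_pc_only = np.outer(U[:, 0] * S[0], Vt[0, :])
second_pc_only = np.outer(U[:, 1] * S[1], Vt[1, :])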

Now, we have only 2 principal components (as it's an Nx2 matrix) so let's see what it looks like when we reconstruct with just one PC:
Only the first Principal Component

Only the second Principal Component

Data Isomorphism

OK, so looking at just one principal component in this situation doesn't tell us much. So, let's rephrase the question. Let's treat the scatter plot as an IxJ matrix that holds either ones or zeros. This representation holds exactly the same information as before, but this time there are min(I, J) principal components rather than just two.

Here, I've changed the visualisation from a scatter plot to a heat map since the reconstructed matrix may not be just ones and zeros if we discard some PCs. Here is a reconstruction with only the top quartile of principal components:
Reconstruction discarding the bottom 75% of Principal Components.
The sinusoidal "signal" is still evident in the reconstruction, but less so. Unfortunately, PCA has not helped us identify what we'd regard as the dominant feature of the data.
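
Continuing the hypothetical sketch above, binarising the points into a grid and keeping only the top quartile of singular values might look like this (the grid size is arbitrary):

# Bin the same points into an I x J grid of ones and zeros, then reconstruct
# keeping only the top quartile of singular values.
I, J = 100, 100                                  # arbitrary grid size
H, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=(I, J))
M = (H > 0).astype(float)                        # 1 where any point landed, else 0

U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = len(S) // 4                                  # keep the top 25% of PCs
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # plot this as a heat map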

Conclusion

The principal components in a mathematical sense are not necessarily the principal features that a human brain would identify.

There may be better ways to identify patterns in data than PCA. In this particular case, I got much more mileage from Convolutional Neural Nets (CNNs). I'll document that in another post.

Friday, June 14, 2019

Docker notes


This is my first play with Docker in anger. This tutorial (by Emanuele Cesena) helped me get Beam up and running with Docker very quickly. The Beam part is a bit too old now but it gives a good introduction to both technologies.


Instances

With Docker Compose, you can have a cluster of Docker containers. In a directory with a suitable docker-compose.yml file, run docker-compose (which "define[s] and run[s] multi-container applications with Docker" according to the man pages):

$ docker-compose up -d
...
190fcbe8a871: Pull complete
22b3697a4a0a: Pull complete
Digest: sha256:3d993a92474808f4920ccd346c0fad3f43b7c5d36539c723ac7947c406974300
Status: Downloaded newer image for dataradiant/beam-flink:latest
Creating dockerbeamflink_jobmanager_1
Creating dockerbeamflink_taskmanager_1
$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                                                             NAMES
87ec0891bc7b        dataradiant/beam-flink   "/usr/local/flink/bi…"   2 hours ago         Up 2 hours          6121-6123/tcp, 0.0.0.0:32768->22/tcp                                              dockerbeamflink_taskmanager_1

a057b8d8ff64        dataradiant/beam-flink   "/usr/local/flink/bi…"   2 hours ago         Up 2 hours          6123/tcp, 0.0.0.0:220->22/tcp, 0.0.0.0:48080->8080/tcp, 0.0.0.0:48081->8081/tcp   dockerbeamflink_jobmanager_1


Filesystems

$ docker run -t -i dataradiant/beam-flink  /bin/bash
root@54592dfb8040:/# ls -ltr
total 64

drwxr-xr-x   2 root root 4096 Apr 10  2014 mnt
...

Hey! That's not my root directory (StackOverflow)!

Note that this is not the same as logging onto one of the running containers - docker run starts a brand new container:

root@54592dfb8040:/# /usr/java/default/bin/jps 
71 Jps

To do that, run this:

$ docker exec -it dockerbeamflink_taskmanager_1 /bin/bash
root@87ec0891bc7b:/# /usr/java/default/bin/jps 
238 TaskManager
323 Jps

Where are the images stored? "By default this will be aufs but can fall back to overlay, overlay2, btrfs, devicemapper or zfs depending on your kernel support."
https://stackoverflow.com/questions/19234831/where-are-docker-images-stored-on-the-host-machine

On my Ubuntu 16.04:

$ sudo du -sh /var/lib/docker/overlay2
11G /var/lib/docker/overlay2

"The main mechanics of OverlayFS relate to the merging of directory access when both filesystems present a directory for the same name." (Wikipedia)


Networking

The Docker daemon appears to act like a software router. That is, I access a URL on http://172.17.0.1:48080 but neither of my running Docker containers has this IP address:

root@87ec0891bc7b:/# ifconfig 
eth0      Link encap:Ethernet  HWaddr 02:42:ac:11:00:03  
          inet addr:172.17.0.3  Bcast:172.17.255.255  Mask:255.255.0.0

root@a057b8d8ff64:/# ifconfig 
eth0      Link encap:Ethernet  HWaddr 02:42:ac:11:00:02  
          inet addr:172.17.0.2  Bcast:172.17.255.255  Mask:255.255.0.0

Instead, it is the host OS that has it:

henryp@corsair:~$ ifconfig 
...
docker0   Link encap:Ethernet  HWaddr 02:42:02:bc:6e:af  
          inet addr:172.17.0.1  Bcast:172.17.255.255  Mask:255.255.0.0
...
henryp@corsair:~$ sudo netstat -nap | grep 48080
tcp        0      0 172.17.0.1:53756        172.17.0.1:48080        ESTABLISHED 5084/chrome --type=
tcp6       0      0 :::48080                :::*                    LISTEN      17660/docker-proxy

and the Docker daemon does some NATing because the instances don't have that port open:

root@a057b8d8ff64:/# netstat -nap | grep 48080
root@a057b8d8ff64:/#

root@87ec0891bc7b:/# netstat -nap | grep 48080
root@87ec0891bc7b:/#


Housekeeping

Stopping a Docker instance is simple: 

henryp@corsair:~$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                                                             NAMES
87ec0891bc7b        dataradiant/beam-flink   "/usr/local/flink/bi…"   20 hours ago        Up 20 hours         6121-6123/tcp, 0.0.0.0:32768->22/tcp                                              dockerbeamflink_taskmanager_1
a057b8d8ff64        dataradiant/beam-flink   "/usr/local/flink/bi…"   20 hours ago        Up 20 hours         6123/tcp, 0.0.0.0:220->22/tcp, 0.0.0.0:48080->8080/tcp, 0.0.0.0:48081->8081/tcp   dockerbeamflink_jobmanager_1
henryp@corsair:~$ docker stop 87ec0891bc7b
87ec0891bc7b
henryp@corsair:~$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                                                             NAMES
a057b8d8ff64        dataradiant/beam-flink   "/usr/local/flink/bi…"   20 hours ago        Up 20 hours         6123/tcp, 0.0.0.0:220->22/tcp, 0.0.0.0:48080->8080/tcp, 0.0.0.0:48081->8081/tcp   dockerbeamflink_jobmanager_1

whereupon the shell I had running in 87ec0891bc7b promptly terminates.

Hints on removing old images (they do take a lot of disk space) can be found at "Learn How To Stop, Kill And Clean Up Docker Containers".


Kubernetes

"Difference between Docker and Kubernetes: "Docker and rkt are two popular container technologies that allow you to easily run containerized applications. Kubernetes is a container orchestration platform that you can use to manage and scale your running containers across multiple instances or within a hybrid-cloud environment." (Google docs)


Thursday, June 13, 2019

The data science of cyber-squatting


The Apache Metron project has an interesting approach to detecting cyber-squatters that can be found here. Briefly, it generates lots of names that are similar to your domain using a tool called DNS Twist.

They then use Bloom Filters to check that each domain name coming through in the stream is not on this black-list. Bloom Filters cannot say that something definitely is in a collection, but they can say when something definitely is not. In exchange for that uncertainty, they consume very little memory. That suits this use case: the vast majority of domains flying past get a definite "not on the black-list" and can be waved through, and only the rare "maybe" needs closer attention.

Bloom Filters do this by using k hash functions, each of which maps a datum to a number between 0 and m-1. When we add a datum, these k numbers become indices into an m-sized bit map and we flip those bits to 1. Now, when some new data comes through and we ask "have we seen this before?", we run the same k hash functions, and if any of them points at a bit that is still 0, we know that this datum is definitely not on our black-list.
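
To make the mechanics concrete, here is a toy Bloom filter in Python. It is only a sketch - the MD5-derived hash functions and the sizes of m and k are arbitrary - but the add/might_contain logic is the standard one:

import hashlib

class BloomFilter:
    """A toy Bloom filter: an m-bit map and k hash functions derived from MD5."""
    def __init__(self, m=2 ** 20, k=5):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _indices(self, item):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True only means "maybe".
        return all(self.bits[idx] for idx in self._indices(item))

blacklist = BloomFilter()
blacklist.add("examp1e.com")                          # a DNS-Twist-style lookalike
print(blacklist.might_contain("example.com"))         # almost certainly False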

My approach was different. Given the domains we own (over 900 of them), I generate character n-grams and build a distribution of their frequencies. For me, n is {2, 3} but your mileage may vary.

The same calculation is performed on each domain as it flies by and compared (using the Jensen-Shannon divergence) with the 900 or so distributions representing our real estate. A score of 0 obviously means it's one of ours, and a score much greater than 0 means it's very different. But a score close to (yet not exactly) 0 is worrying.
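
A rough sketch of this approach is below. The domain names are placeholders for our real estate, and a real implementation would normalise and smooth the distributions more carefully:

import numpy as np
from collections import Counter

def ngram_distribution(domain, ns=(2, 3)):
    """Character n-gram counts for a domain."""
    counts = Counter()
    for n in ns:
        counts.update(domain[i:i + n] for i in range(len(domain) - n + 1))
    return counts

def jensen_shannon(c1, c2):
    """Jensen-Shannon divergence between two n-gram Counters (0 = identical)."""
    keys = sorted(set(c1) | set(c2))
    p = np.array([c1.get(k, 0) for k in keys], dtype=float)
    q = np.array([c2.get(k, 0) for k in keys], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

ours = ngram_distribution("www.example.com")          # stand-in for one of our domains
suspect = ngram_distribution("www-example.com")       # the spear-phishing pattern
print(jensen_shannon(ours, suspect))                  # small but non-zero: worrying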

How close before you raise the alarm depends on your data and your tuning. During testing, the domain with the smallest non-zero score was "www-OUR_DOMAIN" rather than "www.OUR_DOMAIN". This was a known spear-phishing scam, so clearly our algorithm works to some degree. What's more, DNS Twist does not seem to have anticipated this attack.

Friday, June 7, 2019

Architecting the Cloud


We're taking the plunge and moving a SOC to Google's cloud. This is the proposed architecture. With von Moltke's warning of "no plan survives first contact with the enemy" in mind, I write this for posterity and we shall see how well it stands up to reality.

Architecture

Ingestion

Tools: Apache Kafka.

Why? Because Kafka (unlike Google's PubSub) is not tied to a particular cloud provider. What's more, it's the de facto standard for high volume streams of data and finding developers who know it is not hard.

Storage

Tools: Google's BigQuery and DataProc.

Why? Google will worry about the long-term storage of the data in BigQuery and we will pull subsets of it out into DataProc (which is basically Apache Spark, YARN and Hadoop) for analysis. This is Google's recommended way of doing it, although one must bear in mind that they are trying to sell something. This basically makes BigQuery the golden source of data and DataProc the data lake.

Visualisation

Tools: Google's DataLab (aka Jupyter) and some DataProc components (specifically, Apache Zeppelin).

Why? Jupyter and Zeppelin are open source. Zeppelin integrates very nicely with Spark, er, I mean Google DataProc; and Jupyter integrates easily with BigQuery.


Potential Use Cases

The Senior SOC Analyst - Investigation

A sufficiently sophisticated analyst should have no problem writing Jupyter notebooks that access BigQuery directly using plain SQL. They may need some familiarity with Python if they then wish to present this data.
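
For example, a notebook cell might look something like the following sketch. It assumes events have already been parsed into named columns (see the Schema section below); the project, dataset, table and column names are all invented:

from google.cloud import bigquery

client = bigquery.Client(project="my-soc-project")    # hypothetical project ID

sql = """
    SELECT sourceIPAddress, eventName, COUNT(*) AS n
    FROM `my-soc-project.soc.cloudtrail_events`        -- hypothetical table
    WHERE eventTime BETWEEN TIMESTAMP('2019-06-01') AND TIMESTAMP('2019-06-07')
    GROUP BY sourceIPAddress, eventName
    ORDER BY n DESC
    LIMIT 100
"""
df = client.query(sql).to_dataframe()                 # a pandas DataFrame, ready to plot
df.head()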

The Junior SOC Analyst - Investigation

Google does not appear to have a managed Elastic component. However, plain old Linux instances can be spun up and Elastic installed on them manually. This Elastic cluster can then batch load data that was extracted from BigQuery. Alternatively, it could stream from the Kafka topic used in the ingestion layer. Either way, the analyst then can run free-text searches over the event data rather than using the more complicated SQL and Python languages.
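
A hedged sketch of the batch-load path, using the standard BigQuery and Elasticsearch Python clients (the project, table, index and host names are all invented):

from elasticsearch import Elasticsearch, helpers
from google.cloud import bigquery

bq = bigquery.Client(project="my-soc-project")             # hypothetical project
es = Elasticsearch(["http://elastic-node-1:9200"])          # our self-managed cluster

# Pull the last day of raw events out of BigQuery...
rows = bq.query("""
    SELECT IngestTimestamp, Value
    FROM `my-soc-project.soc.raw_events`                     -- hypothetical table
    WHERE IngestTimestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
""").result()

# ...and bulk-index them so the junior analyst gets free-text search.
actions = ({"_index": "soc-events",
            "_source": {"ingest_timestamp": str(row["IngestTimestamp"]),
                        "payload": row["Value"]}}
           for row in rows)
helpers.bulk(es, actions)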

The Data Scientist - Modelling

Given a requirement to build a deliverable, a data scientist can spin up an appropriately sized DataProc cluster and transfer data from BigQuery. BQ will be his data lake.
"Simply bringing data from various parts of a large organization together in one place is valuable because it enables joins across datasets that were previously disparate ... collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a data lake or enterprise data hub... Simply dumping data in its raw form ... has been dubbed the sushi principles: raw data is better". [1]
So, imagine that the data scientist is looking at anomalous employee behaviour. He can pull in large amounts of data from BigQuery in an arbitrary format. He can also pull in the employee database and marry the two together in DataProc. By itself, the BigQuery data won't necessarily tell you that an SSH connection is suspicious. However, coupled with employee data, it would be highly suspicious if that SSH connection came from the machine of a 52-year old HR lady rather than a 25-year old sysadmin.
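
As a sketch of what that join might look like in DataProc (PySpark), assuming both extracts have been landed in Cloud Storage - all paths and column names below are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("anomalous-ssh").getOrCreate()

# Hypothetical extracts, landed in Cloud Storage from BigQuery and from HR.
ssh_events = spark.read.parquet("gs://my-soc-extracts/ssh_events/")
employees = spark.read.parquet("gs://my-soc-extracts/employees/")

# Flag SSH connections originating from machines whose owners are not sysadmins.
suspicious = (ssh_events
              .join(employees, ssh_events.source_host == employees.assigned_host)
              .where(F.col("job_role") != "sysadmin")
              .select("source_host", "dest_host", "owner_name", "job_role"))

suspicious.show(20, truncate=False)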

The Data Scientist - Producing Artefacts

Google's Spark/Hadoop/YARN offering (DataProc) only holds ephemeral data. When the cluster is no longer needed, it can be closed down and the data will disappear into the ether but not before the Data Scientist has finished building his model. This model will be distilled from the petabytes in BigQuery and the terabytes in DataProc and can be brought out of the cloud inexpensively in the form of a moderately sized (megabytes?) artefact.

The Developer - Writing Parsers

Feeds into the ingestion system can be of disparate formats (CEF, Rsyslog etc). Therefore, developers must be able to write new parsers and deploy them. It makes sense to run these parsers on the ingestion layer rather than on all the agents to avoid deployment faff.
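
As an illustration only, a deliberately naive parser for CEF might start like this (real CEF allows escaped pipes, equals signs and multi-word extension values, all of which this sketch ignores):

def parse_cef(line):
    """A naive CEF parser: split the 7 header fields then the key=value extension."""
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF event")
    parts = line.split("|", 7)
    event = {"cef_version": parts[0][len("CEF:"):],
             "vendor": parts[1],
             "product": parts[2],
             "device_version": parts[3],
             "signature_id": parts[4],
             "name": parts[5],
             "severity": parts[6]}
    if len(parts) == 8:
        for pair in parts[7].split(" "):
            key, sep, value = pair.partition("=")
            if sep:
                event[key] = value
    return event

parse_cef("CEF:0|SomeVendor|SomeProduct|1.0|100|Suspicious login|5|src=10.0.0.1 dst=10.0.0.2")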

The Developer - Analysing Streams

Since analysts want to be warned of high-risk events as soon as possible, there will be a need for the developers to write stream monitoring software. This may listen to the same Kafka topic used in ingestion. Perhaps it can use Google's PubSub to send the warnings if it lives in the cloud.
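
A sketch of what such a monitor might look like, consuming the ingestion topic with kafka-python and publishing alerts via PubSub; the topic names, project ID and the "risk rule" are all invented:

import json
from kafka import KafkaConsumer                        # kafka-python
from google.cloud import pubsub_v1

consumer = KafkaConsumer("soc-events",                  # the hypothetical ingestion topic
                         bootstrap_servers="kafka-1:9092",
                         value_deserializer=lambda b: json.loads(b))

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-soc-project", "high-risk-alerts")   # hypothetical

for message in consumer:
    event = message.value
    if event.get("severity", 0) >= 8:                   # a trivial stand-in for a real risk rule
        publisher.publish(topic_path, json.dumps(event).encode("utf-8"))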

The Developer - Using ML Models in Production

It is to be hoped that artefacts generated by the data scientists can be deployed in production. These artefacts should be written in such a way that they do not need to access the data in GCP. For instance, a data scientist might train a neural net in GCP to spot anomalous behaviour then, when happy with it, give it to the developer (preferably as a JAR) to use in their stream monitoring software.


Schema

The ingestion layer can do the following to the data before feeding it into BigQuery:
  1. Do no parsing and put it into BigQuery raw.
  2. Partially parse it.
  3. Parse it completely.
Each has pros and cons.

No Parsing

This is the simplest path initially. It means all data lives in the same table irrespective of which stream it came from. But it also means that BigQuery will have no schema to speak of, the classic pattern of "schema-on-read [where] the structure of the data is implicit, and only interpreted when the data is read" [1].

This will make querying and data extraction extremely difficult as the table will look a little something like this:

IngestTimestamp  Value
Thu  6 Jun 16:44:01.123 BST 2019{ "additionalEventData": { "SSEApplied": "SSE_S3", "x-amz-id-2": "u8yUG+gwxxc3ovA49cFbD9XFLyHRBJAkwnHPeFZVg4AnfGKp30LXiQ2oAY3BUX5Lwflkijiaq9M=" }, "awsRegion": "eu-west-1", "eventID": "e4490a9a-1be4-4585-98ee-9d0a4e59fcc3", "eventName": "PutObject", "eventSource": "s3.amazonaws.com", "eventTime": "2018-12-05T14:47:26Z", "eventType": "AwsApiCall", "eventVersion": "1.05", "readOnly": false, "recipientAccountId": "479626555249", "requestID": "770130FED723D6EA", "requestParameters": { "bucketName": "vsecacloudtraillogs", "key": "AWSLogs/479626555249/CloudTrail/eu-west-1/2018/12/05/479626555249_CloudTrail_eu-west-1_20181205T1450Z_apgVO8lFdNcrVfHn.json.gz", "x-amz-acl": "bucket-owner-full-control", "x-amz-server-side-encryption": "AES256" }, "resources": [ { "ARN": "arn:aws:s3:::vsecacloudtraillogs/AWSLogs/479626555249/CloudTrail/eu-west-1/2018/12/05/479626555249_CloudTrail_eu-west-1_20181205T1450Z_apgVO8lFdNcrVfHn.json.gz", "type": "AWS::S3::Object" }, { "ARN": "arn:aws:s3:::vsecacloudtraillogs", "accountId": "479626555249", "type": "AWS::S3::Bucket" } ], "responseElements": { "x-amz-server-side-encryption": "AES256" }, "sharedEventID": "bfa6ef38-d494-4974-8d76-1c07b305ae90", "sourceIPAddress": "192.168.0.1", "userAgent": "cloudtrail.amazonaws.com", "userIdentity": { "invokedBy": "cloudtrail.amazonaws.com", "type": "AWSService" } }
Thu  6 Jun 16:44:01.124 BST 2019{ "region": "us-east-1", "detail": { "type": "UnauthorizedAccess:EC2/SSHBruteForce", "resource": { "resourceType": "Instance", "instanceDetails": { "instanceId": "i-99999999", "instanceType": "m3.xlarge", "launchTime": "2016-08-02T02:05:06Z", "platform": null, "productCodes": [ { "productCodeId": "GeneratedFindingProductCodeId", "productCodeType": "GeneratedFindingProductCodeType" } ], "iamInstanceProfile": { "arn": "GeneratedFindingInstanceProfileArn", "id": "GeneratedFindingInstanceProfileId" }, "networkInterfaces": [ { "ipv6Addresses": [], "networkInterfaceId": "eni-bfcffe88", "privateDnsName": "GeneratedFindingPrivateDnsName", "privateIpAddress": "10.0.0.1", "privateIpAddresses": [ { "privateDnsName": "GeneratedFindingPrivateName", "privateIpAddress": "10.0.0.1" } ], "subnetId": "Replace with valid SubnetID", "vpcId": "GeneratedFindingVPCId", "securityGroups": [ { "groupName": "GeneratedFindingSecurityGroupName", "groupId": "GeneratedFindingSecurityId" } ], "publicDnsName": "GeneratedFindingPublicDNSName", "publicIp": "127.0.0.1" } ], "tags": [ { "key": "GeneratedFindingInstaceTag1", "value": "GeneratedFindingInstaceValue1" }, { "key": "GeneratedFindingInstaceTag2", "value": "GeneratedFindingInstaceTagValue2" }, { "key": "GeneratedFindingInstaceTag3", "value": "GeneratedFindingInstaceTagValue3" }, { "key": "GeneratedFindingInstaceTag4", "value": "GeneratedFindingInstaceTagValue4" }, { "key": "GeneratedFindingInstaceTag5", "value": "GeneratedFindingInstaceTagValue5" }, { "key": "GeneratedFindingInstaceTag6", "value": "GeneratedFindingInstaceTagValue6" }, { "key": "GeneratedFindingInstaceTag7", "value": "GeneratedFindingInstaceTagValue7" }, { "key": "GeneratedFindingInstaceTag8", "value": "GeneratedFindingInstaceTagValue8" }, { "key": "GeneratedFindingInstaceTag9", "value": "GeneratedFindingInstaceTagValue9" } ], "instanceState": "running", "availabilityZone": "GeneratedFindingInstaceAvailabilityZone", "imageId": "ami-99999999", "imageDescription": "GeneratedFindingInstaceImageDescription" } }, "service": { "serviceName": "guardduty", "action": { "actionType": "NETWORK_CONNECTION", "networkConnectionAction": { "connectionDirection": "INBOUND", "remoteIpDetails": { "ipAddressV4": "127.0.0.1", "organization": { "asn": "-1", "asnOrg": "GeneratedFindingASNOrg", "isp": "GeneratedFindingISP", "org": "GeneratedFindingORG" }, "country": { "countryName": "GeneratedFindingCountryName" }, "city": { "cityName": "GeneratedFindingCityName" }, "geoLocation": { "lat": 0.0, "lon": 0.0 } }, "remotePortDetails": { "port": 32794, "portName": "Unknown" }, "localPortDetails": { "port": 22, "portName": "SSH" }, "protocol": "TCP", "blocked": false } }, "resourceRole": "TARGET", "additionalInfo": { "sample": true }, "eventFirstSeen": "2018-05-11T14:56:39.976Z", "eventLastSeen": "2018-05-11T14:56:39.976Z", "archived": false, "count": 1 }, "severity": 2, "createdAt": "2019-06-06T16:50:11.441Z", "updatedAt": "2018-05-11T14:56:39.976Z", "title": "127.0.0.1 is performing SSH brute force attacks against i-99999999. ", "description": "127.0.0.1 is performing SSH brute force attacks against i-99999999. Brute force attacks are used to gain unauthorized access to your instance by guessing the SSH password." } }

That is, all the data in one, big varchar if we were to use RDBMS parlance. Note how the table does not care whether the event is GuardDuty or a CloudTrail. Also note that although in this example it happens that GuardDuty and CloudTrail are JSON and a schema can be inferred, not all streams are going to be in the JSON format. In any case "if your BigQuery write operation creates a new table, you must provide schema information" (from the docs).

So, how would you write a query to, say, pull out a record with a given invokedBy field? And how would you limit that query to just a subset of the data (to reduce processing costs)?

It appears that BQ allows UDFs (user defined functions) but they're written in JavaScript. If you're going to write a parser, would you rather write and test it locally in a JVM language or in JS running on an opaque technology stack?
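
One pragmatic answer is to pull a time-bounded slice of the raw strings out of BQ and do the parsing client-side, as in the sketch below (the table and column names follow the example above and are hypothetical). For streams that do happen to be JSON, BigQuery's JSON_EXTRACT_SCALAR function is a pure-SQL alternative:

import json
from google.cloud import bigquery

client = bigquery.Client(project="my-soc-project")      # hypothetical project ID

# Limit the scan (and the cost) with a timestamp predicate, then parse locally.
sql = """
    SELECT IngestTimestamp, Value
    FROM `my-soc-project.soc.raw_events`                  -- hypothetical table
    WHERE IngestTimestamp BETWEEN TIMESTAMP('2019-06-06') AND TIMESTAMP('2019-06-07')
"""
matches = []
for row in client.query(sql).result():
    payload = json.loads(row["Value"])                    # assumes this stream is JSON
    if payload.get("userIdentity", {}).get("invokedBy") == "cloudtrail.amazonaws.com":
        matches.append(row)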

Partial Parsing

The idea here is to "lightly" parse the data. The only fields common to all streams are event time, machine name and stream name - that is all (this appears to be what the Apache Metron team have gone with: "We are currently working on expanding the message standardization beyond these fields, but this feature is not yet available" - from here).

These parsed fields can be pulled out into their own columns but the payload must remain a string. So, as with the No Parsing solution above, analysts will still find it hard to grok the data if they access it directly via BQ.

However, things may change upon pulling the data from BQ into, say, DataProc. In this case, only what is needed (given a time range, machine and/or stream type) will be pulled out, and bespoke parsers can be written that extract only the data needed for the query. This will require far less development work than the Complete Parsing solution below but demands an increased level of sophistication from the analysts.

Complete Parsing

This requires writing (or using open source) parsers for each particular feed and then mapping the output to a rich schema in BigQuery. It entails a lot of up-front development work but results in full, rich SQL queries that can be run by the analysts.

This solution would probably lead to a different table per event stream type as they all represent different data. If you were feeling particularly ambitious, you could try to unify all this data into one table. However, the history of attempting to unify all types of security events into one schema has been a sorry tale. Raffael Marty, who has spent his career trying to do exactly this, says:
"For over a decade, we have been trying to standardize log formats. And we are still struggling. I initially wrote the Common Event Format (CEF) at ArcSight. Then I went to Mitre and tried to get the common event expression (CEE) work off the ground to define a vendor neutral standard. Unfortunately, getting agreement between Microsoft, RedHat, Cisco, and all the log management vendors wasn’t easy and we lost the air force funding for the project. In the meantime I went to work for Splunk and started the common information model (CIM). Then came Apache Spot, which has defined yet another standard (yes, I had my fingers in that one too). So the reality is, we have 4 pseudo standards, and none is really what I want. I just redid some major parts over here at Sophos (I hope I can release that at some point)."
So, only the brave, foolhardy or well-funded need apply.

Schema evolution

Note that it seems that the scope for schema evolution is limited in BigQuery which "natively supports [only] the following schema modifications: Adding columns to a schema definition; Relaxing a column's mode from REQUIRED to NULLABLE" (from the docs).

This means that having a rich schema (as in Complete Parsing) that is easy to interrogate is unfortunately very brittle. You must have your schema pretty much nailed down on your first attempt or else you may very well be forced to introduce hacks further down the line as you shoe-horn changes into it.

Which approach is best?

"Enforcement of schemas in database is a contentious topic, and in general there's no right or wrong answer." [1]

Conclusion

I'm firmly in the camp that says the architecture document should come at the end of an MVP not the beginning. It must adapt to the hidden realities that become apparent along the way. I'll revisit this post in the future to see what actually happened in our case.

We started with a military quote and we'll end with one, this time from Eisenhower: "In preparing for battle I have always found that plans are useless, but planning is indispensable."

[1] Designing Data-Intensive Applications, Martin Kleppmann

Wednesday, June 5, 2019

Google Cloud Analytics - Notes


The Google Cloud ecosystem is a sprawling conurbation of tools. These are some notes I made that helped me remember.

One sentence summaries

DataProc - Spark, Hadoop, YARN.

DataProc components - Tools that complement DataProc, including Apache Zeppelin.

DataLab - Jupyter notebooks. Not to be confused with Data Studio, which is for ad campaigns etc. DataLab can integrate with DataProc (see Google's documentation).

BigTable - Basically, HBase.

DataStore - Basically, a BLOB store.

PubSub - Messaging.

BigQuery - Basically, a SQL engine behind a REST API.

Cloud SQL - MySQL or PostgreSQL.

Cloud Dataflow - Apache Beam.

Composer - Apache Airflow.

Google SDK - has emulators to let you run locally. Emulators are limited to BigTable, DataStore, FireStore and PubSub.

Stackdriver - think JConsole mixed with Splunk. "gives you access to logs, metrics, traces, and other signals from your infrastructure platform(s), virtual machines, containers, middleware, and application tier, so that you can track issues all the way from your end user to your backend services and infrastructure" (documentation).

Prepare your Environment

Unlike with Amazon, you can quite easily do a lot of the admin from your local machine. On my home Ubuntu box, I ran:

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
sudo apt-get install google-cloud-sdk-datalab
sudo apt-get install kubectl
gcloud init
sudo apt-get install docker

and had everything I needed to play.

Running DataLab locally

One very nice feature is that you can run pieces of Google's infrastructure locally.

export  IMAGE=gcr.io/cloud-datalab/datalab:latest
export  PROJECT_ID=$(gcloud config get-value project)
if [ "$OSTYPE" == "linux"* ]; then   PORTMAP="127.0.0.1:8081:8080"; else PORTMAP="8081:8080"; fi
docker pull $IMAGE
docker run -it -p $PORTMAP  -v "https://github.com/PhillHenry/googlecloud.git"  -e "PROJECT_ID=philltest"  $IMAGE
...
Open your browser to http://localhost:8081/ to connect to Datalab.

Note that not all components can be run locally. Significantly, BigQuery cannot.