Friday, June 7, 2019

Architecting the Cloud


We're taking the plunge and moving a SOC to Google's cloud. This is the proposed architecture. With von Moltke's warning of "no plan survives first contact with the enemy" in mind, I write this for posterity and we shall see how well it stands up to reality.

Architecture

Ingestion

Tools: Apache Kafka.

Why? Because Kafka (unlike Google's PubSub) is not tied to a particular cloud provider. What's more, it's the de facto standard for high-volume streams of data, and finding developers who know it is not hard.
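
As a rough, hedged illustration of the ingestion side, this is what an agent pushing a raw event onto the topic might look like with the kafka-python client. The broker addresses and the topic name, security-events, are assumptions for illustration only.

import json
from kafka import KafkaProducer

# Hypothetical brokers and topic; events are serialised as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers=["broker-1:9092", "broker-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"stream": "cloudtrail", "machine": "ip-10-0-0-1", "payload": "..."}
producer.send("security-events", value=event)
producer.flush()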

Storage

Tools: Google's BigQuery and DataProc.

Why? Google will worry about the long-term storage of the data in BigQuery and we will pull subsets of it out into DataProc (which is basically Apache Spark, YARN and Hadoop) for analysis. This is Google's recommended way of doing it, although one must bear in mind that they are trying to sell something. This basically makes BigQuery the golden source of data and DataProc the data lake.
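
To make the "pull subsets out" step concrete, here is a sketch of reading a slice of BigQuery into a DataProc (Spark) cluster, assuming the spark-bigquery connector is available on the cluster; the project, dataset and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("soc-extract").getOrCreate()

# Read only the slice we need from BigQuery into Spark for analysis.
events = (spark.read.format("bigquery")
          .option("table", "my-project.soc.raw_events")
          .load()
          .filter("ingest_timestamp > '2019-06-01'"))

events.createOrReplaceTempView("events")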

Visualisation

Tools: Google's DataLab (aka Jupyter) and some DataProc components (specifically, Apache Zeppelin).

Why? Jupyter and Zeppelin are open source. Zeppelin integrates very nicely with Spark, er, I mean Google DataProc; and Jupyter integrates easily with BigQuery.


Potential Use Cases

The Senior SOC Analyst - Investigation

A sufficiently sophisticated analyst should have no problem writing Jupyter notebooks that access BigQuery directly using plain SQL. They may need some familiarity with Python if they then wish to present this data.
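
A hedged sketch of what such a notebook cell might look like, using the google-cloud-bigquery client; the table and column names are made up for illustration.

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT ingest_timestamp, value
FROM `my-project.soc.raw_events`
WHERE ingest_timestamp > TIMESTAMP '2019-06-06'
LIMIT 100
"""
# to_dataframe() hands the result to pandas for presentation in the notebook.
df = client.query(sql).to_dataframe()
df.head()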

The Junior SOC Analyst - Investigation

Google does not appear to have a managed Elastic component. However, plain old Linux instances can be spun up and Elastic installed on them manually. This Elastic cluster can then batch load data that was extracted from BigQuery. Alternatively, it could stream from the Kafka topic used in the ingestion layer. Either way, the analyst can then run free-text searches over the event data rather than using the more complicated SQL and Python languages.
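
A rough sketch of the streaming variant, assuming the kafka-python and elasticsearch clients; the topic, index and host names are invented for illustration.

import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elastic-node-1:9200"])
consumer = KafkaConsumer(
    "security-events",
    bootstrap_servers=["broker-1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Index each event so analysts get free-text search without SQL or Python.
for message in consumer:
    es.index(index="soc-events", body=message.value)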

The Data Scientist - Modelling

Given a requirement to build a deliverable, a data scientist can spin up an appropriately sized DataProc cluster and transfer data from BigQuery. BQ will be his data lake.
"Simply bringing data from various parts of a large organization together in one place is valuable because it enables joins across datasets that were previously disparate ... collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a data lake or enterprise data hub... Simply dumping data in its raw form ... has been dubbed the sushi principles: raw data is better". [1]
So, imagine that the data scientist is looking at anomalous employee behaviour. He can pull in large amounts of data from BigQuery in an arbitrary format. He can also pull in the employee database and marry the two together in DataProc. By itself, the BigQuery data won't necessarily tell you that an SSH connection is suspicious. However, coupled with employee data, it would be highly suspicious if that SSH connection came from the machine of a 52-year-old HR lady rather than a 25-year-old sysadmin.
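
A sketch of that join as it might look in DataProc, again assuming the spark-bigquery connector; the tables, columns and the "sysadmin" rule are purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anomalous-ssh").getOrCreate()

ssh_events = (spark.read.format("bigquery")
              .option("table", "my-project.soc.ssh_events").load())
employees = (spark.read.format("bigquery")
             .option("table", "my-project.hr.employees").load())

# Marry event data to HR data: SSH from a machine not assigned to a sysadmin.
suspicious = (ssh_events
              .join(employees, ssh_events.source_host == employees.assigned_machine)
              .filter(employees.role != "sysadmin"))
suspicious.show()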

The Data Scientist - Producing Artefacts

Google's Spark/Hadoop/YARN offering (DataProc) only holds ephemeral data. When the cluster is no longer needed, it can be closed down and the data will disappear into the ether but not before the Data Scientist has finished building his model. This model will be distilled from the petabytes in BigQuery and the terabytes in DataProc and can be brought out of the cloud inexpensively in the form of a moderately sized (megabytes?) artefact.
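
As a hedged example of what such an artefact might be, the model below is trained on a modest feature matrix sampled from the cluster and serialised to a file of a few megabytes; the feature data and the choice of Isolation Forest are placeholders, not a prescription.

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for a feature matrix distilled from the DataProc working set.
features = np.random.rand(10000, 3)

model = IsolationForest(contamination=0.01).fit(features)
joblib.dump(model, "anomaly_model.joblib")  # the small artefact that leaves the cloud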

The Developer - Writing Parsers

Feeds into the ingestion system can be of disparate formats (CEF, Rsyslog etc). Therefore, developers must be able to write new parsers and deploy them. It makes sense to run these parsers on the ingestion layer rather than on all the agents to avoid deployment faff.
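
For illustration, a deliberately naive CEF header parser that could run on the ingestion layer; real CEF allows escaped pipes and a key=value extension block, both of which this sketch ignores.

def parse_cef(line: str) -> dict:
    # Header: CEF:Version|Vendor|Product|DeviceVersion|SignatureID|Name|Severity|Extension
    body = line.split("CEF:", 1)[1]
    fields = body.split("|", 7)
    keys = ["version", "vendor", "product", "device_version",
            "signature_id", "name", "severity", "extension"]
    return dict(zip(keys, fields))

parse_cef("CEF:0|Vendor|Firewall|1.0|100|Blocked connection|5|src=10.0.0.1 dst=10.0.0.2")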

The Developer - Analysing Streams

Since analysts want to be warned of high-risk events as soon as possible, there will be a need for the developers to write stream monitoring software. This may listen to the same Kafka topic used in ingestion. Perhaps it can use Google's PubSub to send the warnings if it lives in the cloud.
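
A sketch of such a monitor, assuming kafka-python and the google-cloud-pubsub client; the topic names and the severity rule are made up.

import json
from kafka import KafkaConsumer
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
alert_topic = publisher.topic_path("my-project", "soc-alerts")

consumer = KafkaConsumer(
    "security-events",
    bootstrap_servers=["broker-1:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Forward anything that looks high-risk to analysts as soon as it arrives.
for message in consumer:
    event = message.value
    if event.get("severity", 0) >= 7:  # placeholder risk rule
        publisher.publish(alert_topic, data=json.dumps(event).encode("utf-8"))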

The Developer - Using ML Models in Production

It is to be hoped that artefacts generated by the data scientists can be deployed in production. These artefacts should be written in such a way that they do not need to access the data in GCP. For instance, a data scientist might train a neural net in GCP to spot anomalous behaviour then, when happy with it, give it to the developer (preferably as a JAR) to use in their stream monitoring software.
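
Tying the earlier sketches together (and bearing in mind the preference above is for a JAR on the JVM; the Python here is only to stay consistent with the other sketches), the production side might look roughly like this:

import joblib

# The artefact produced by the data scientist; it needs no access to GCP data.
model = joblib.load("anomaly_model.joblib")

def is_anomalous(feature_vector) -> bool:
    # IsolationForest.predict returns -1 for anomalies, 1 for inliers.
    return model.predict([feature_vector])[0] == -1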


Schema

The ingestion layer can do the following to the data before feeding it into BigQuery:
  1. Do no parsing and put it into BigQuery raw.
  2. Partially parse it.
  3. Parse it completely.
Each has pros and cons.

No Parsing

This is the simplest path initially. It means all data lives in the same table irrespective of which stream it came from. But it also means that BigQuery will have no schema to speak of, the classic pattern of "schema-on-read [where] the structure of the data is implicit, and only interpreted when the data is read" [1].

This will make querying and data extraction extremely difficult as the table will look a little something like this:

IngestTimestamp  Value
Thu  6 Jun 16:44:01.123 BST 2019{ "additionalEventData": { "SSEApplied": "SSE_S3", "x-amz-id-2": "u8yUG+gwxxc3ovA49cFbD9XFLyHRBJAkwnHPeFZVg4AnfGKp30LXiQ2oAY3BUX5Lwflkijiaq9M=" }, "awsRegion": "eu-west-1", "eventID": "e4490a9a-1be4-4585-98ee-9d0a4e59fcc3", "eventName": "PutObject", "eventSource": "s3.amazonaws.com", "eventTime": "2018-12-05T14:47:26Z", "eventType": "AwsApiCall", "eventVersion": "1.05", "readOnly": false, "recipientAccountId": "479626555249", "requestID": "770130FED723D6EA", "requestParameters": { "bucketName": "vsecacloudtraillogs", "key": "AWSLogs/479626555249/CloudTrail/eu-west-1/2018/12/05/479626555249_CloudTrail_eu-west-1_20181205T1450Z_apgVO8lFdNcrVfHn.json.gz", "x-amz-acl": "bucket-owner-full-control", "x-amz-server-side-encryption": "AES256" }, "resources": [ { "ARN": "arn:aws:s3:::vsecacloudtraillogs/AWSLogs/479626555249/CloudTrail/eu-west-1/2018/12/05/479626555249_CloudTrail_eu-west-1_20181205T1450Z_apgVO8lFdNcrVfHn.json.gz", "type": "AWS::S3::Object" }, { "ARN": "arn:aws:s3:::vsecacloudtraillogs", "accountId": "479626555249", "type": "AWS::S3::Bucket" } ], "responseElements": { "x-amz-server-side-encryption": "AES256" }, "sharedEventID": "bfa6ef38-d494-4974-8d76-1c07b305ae90", "sourceIPAddress": "192.168.0.1", "userAgent": "cloudtrail.amazonaws.com", "userIdentity": { "invokedBy": "cloudtrail.amazonaws.com", "type": "AWSService" } }
Thu  6 Jun 16:44:01.124 BST 2019{ "region": "us-east-1", "detail": { "type": "UnauthorizedAccess:EC2/SSHBruteForce", "resource": { "resourceType": "Instance", "instanceDetails": { "instanceId": "i-99999999", "instanceType": "m3.xlarge", "launchTime": "2016-08-02T02:05:06Z", "platform": null, "productCodes": [ { "productCodeId": "GeneratedFindingProductCodeId", "productCodeType": "GeneratedFindingProductCodeType" } ], "iamInstanceProfile": { "arn": "GeneratedFindingInstanceProfileArn", "id": "GeneratedFindingInstanceProfileId" }, "networkInterfaces": [ { "ipv6Addresses": [], "networkInterfaceId": "eni-bfcffe88", "privateDnsName": "GeneratedFindingPrivateDnsName", "privateIpAddress": "10.0.0.1", "privateIpAddresses": [ { "privateDnsName": "GeneratedFindingPrivateName", "privateIpAddress": "10.0.0.1" } ], "subnetId": "Replace with valid SubnetID", "vpcId": "GeneratedFindingVPCId", "securityGroups": [ { "groupName": "GeneratedFindingSecurityGroupName", "groupId": "GeneratedFindingSecurityId" } ], "publicDnsName": "GeneratedFindingPublicDNSName", "publicIp": "127.0.0.1" } ], "tags": [ { "key": "GeneratedFindingInstaceTag1", "value": "GeneratedFindingInstaceValue1" }, { "key": "GeneratedFindingInstaceTag2", "value": "GeneratedFindingInstaceTagValue2" }, { "key": "GeneratedFindingInstaceTag3", "value": "GeneratedFindingInstaceTagValue3" }, { "key": "GeneratedFindingInstaceTag4", "value": "GeneratedFindingInstaceTagValue4" }, { "key": "GeneratedFindingInstaceTag5", "value": "GeneratedFindingInstaceTagValue5" }, { "key": "GeneratedFindingInstaceTag6", "value": "GeneratedFindingInstaceTagValue6" }, { "key": "GeneratedFindingInstaceTag7", "value": "GeneratedFindingInstaceTagValue7" }, { "key": "GeneratedFindingInstaceTag8", "value": "GeneratedFindingInstaceTagValue8" }, { "key": "GeneratedFindingInstaceTag9", "value": "GeneratedFindingInstaceTagValue9" } ], "instanceState": "running", "availabilityZone": "GeneratedFindingInstaceAvailabilityZone", "imageId": "ami-99999999", "imageDescription": "GeneratedFindingInstaceImageDescription" } }, "service": { "serviceName": "guardduty", "action": { "actionType": "NETWORK_CONNECTION", "networkConnectionAction": { "connectionDirection": "INBOUND", "remoteIpDetails": { "ipAddressV4": "127.0.0.1", "organization": { "asn": "-1", "asnOrg": "GeneratedFindingASNOrg", "isp": "GeneratedFindingISP", "org": "GeneratedFindingORG" }, "country": { "countryName": "GeneratedFindingCountryName" }, "city": { "cityName": "GeneratedFindingCityName" }, "geoLocation": { "lat": 0.0, "lon": 0.0 } }, "remotePortDetails": { "port": 32794, "portName": "Unknown" }, "localPortDetails": { "port": 22, "portName": "SSH" }, "protocol": "TCP", "blocked": false } }, "resourceRole": "TARGET", "additionalInfo": { "sample": true }, "eventFirstSeen": "2018-05-11T14:56:39.976Z", "eventLastSeen": "2018-05-11T14:56:39.976Z", "archived": false, "count": 1 }, "severity": 2, "createdAt": "2019-06-06T16:50:11.441Z", "updatedAt": "2018-05-11T14:56:39.976Z", "title": "127.0.0.1 is performing SSH brute force attacks against i-99999999. ", "description": "127.0.0.1 is performing SSH brute force attacks against i-99999999. Brute force attacks are used to gain unauthorized access to your instance by guessing the SSH password." } }

That is, all the data in one big varchar, if we were to use RDBMS parlance. Note how the table does not care whether the event is GuardDuty or CloudTrail. Also note that although in this example it happens that GuardDuty and CloudTrail are JSON and a schema can be inferred, not all streams are going to be in the JSON format. In any case, "if your BigQuery write operation creates a new table, you must provide schema information" (from the docs).

So, how would you write a query to, say, pull out a record with a given invokedBy field? And how would you limit that query to just a subset of the data (to reduce processing costs)?
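
For what it's worth, it can be done when the payload happens to be JSON: BigQuery's JSON functions over the varchar, plus a timestamp filter (which only reduces the data scanned if the table is partitioned on that column). It is clunky, and it falls apart entirely for non-JSON streams. A hedged sketch, with table and column names following the toy example above:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT value
FROM `my-project.soc.raw_events`
WHERE ingest_timestamp BETWEEN TIMESTAMP '2019-06-06' AND TIMESTAMP '2019-06-07'
  AND JSON_EXTRACT_SCALAR(value, '$.userIdentity.invokedBy') = 'cloudtrail.amazonaws.com'
"""
rows = client.query(sql).result()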

It appears that BQ allows UDFs (user defined functions) but they're written in JavaScript. If you're going to write a parser, would you rather write and test it locally in a JVM language or in JS running on an opaque technology stack?

Partial Parsing

The idea here is to "lightly" parse the data. The fields that are common across all streams are event time, machine name and stream name - that is all (this appears to be what the Apache Metron team have gone with: "We are currently working on expanding the message standardization beyond these fields, but this feature is not yet available" - from here).

These parsed fields can be pulled out into their own columns but the payload must remain a string. So, like the solution above (No Parsing), analysts will still find it hard to grok the data if they access it directly via BQ.

However, things may change upon pulling the data from BQ to, say, DataProc. In this case, only what is needed (a given time range, machine and/or stream type) is pulled out of BigQuery, and bespoke parsers can then extract just the fields the query requires. This will require far less development work than the Complete Parsing solution below but will require an increased level of sophistication from the analysts.
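
Concretely, an analyst's session on DataProc might look something like this hedged sketch (assuming the spark-bigquery connector; the table name, stream value and JSON paths echo the GuardDuty example above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.appName("bespoke-parse").getOrCreate()

# Pull only one stream for one day out of BigQuery...
guardduty = (spark.read.format("bigquery")
             .option("table", "my-project.soc.partially_parsed_events")
             .load()
             .filter("stream = 'guardduty' AND event_time >= '2019-06-06'"))

# ...then apply a bespoke parser to the still-raw payload column.
findings = guardduty.select(
    "event_time", "machine",
    get_json_object("payload", "$.detail.type").alias("finding_type"),
    get_json_object("payload", "$.detail.severity").alias("severity"),
)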

Complete Parsing

This requires writing (or using open source) parsers for a particular feed and then mapping it to a rich schema in BigQuery. This will require a lot of up-front development work but it results in full, rich SQL queries that can be run by the analysts.

This solution would probably lead to a different table per event stream type as they all represent different data. If you were feeling particularly ambitious, you could try to unify all this data into one table. However, the history of attempting to unify all types of security events into one schema has been a sorry tale. Raffael Marty, who has spent his career trying to do exactly this, says:
"For over a decade, we have been trying to standardize log formats. And we are still struggling. I initially wrote the Common Event Format (CEF) at ArcSight. Then I went to Mitre and tried to get the common event expression (CEE) work off the ground to define a vendor neutral standard. Unfortunately, getting agreement between Microsoft, RedHat, Cisco, and all the log management vendors wasn’t easy and we lost the air force funding for the project. In the meantime I went to work for Splunk and started the common information model (CIM). Then came Apache Spot, which has defined yet another standard (yes, I had my fingers in that one too). So the reality is, we have 4 pseudo standards, and none is really what I want. I just redid some major parts over here at Sophos (I hope I can release that at some point)."
So, only the brave, foolhardy or well-funded need apply.

Schema evolution

Note that the scope for schema evolution in BigQuery seems limited: it "natively supports [only] the following schema modifications: Adding columns to a schema definition; Relaxing a column's mode from REQUIRED to NULLABLE" (from the docs).

This means that having a rich schema (as in Complete Parsing) that is easy to interrogate is unfortunately very brittle. You must have your schema pretty much nailed down on your first attempt or else you may very well be forced to introduce hacks further down the line as you shoe-horn changes into it.
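
The one evolution that is cheap - appending a NULLABLE column - looks roughly like this with the Python client (table and field names hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.soc.parsed_events")

# Appending a NULLABLE column is supported; renaming or retyping one is not.
new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("user_agent", "STRING", mode="NULLABLE"))
table.schema = new_schema
client.update_table(table, ["schema"])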

Which approach is best?

"Enforcement of schemas in database is a contentious topic, and in general there's no right or wrong answer." [1]

Conclusion

I'm firmly in the camp that says the architecture document should come at the end of an MVP, not the beginning. It must adapt to the hidden realities that become apparent along the way. I'll revisit this post in the future to see what actually happened in our case.

We started with a military quote and we'll end with one, this time from Eisenhower: "In preparing for battle I have always found that plans are useless, but planning is indispensable."

[1] Designing Data-Intensive Applications, Martin Kleppmann
