Friday, December 27, 2019

CNNs and network scanning with TensorFlow


Introduction

Port scanning leaves tell-tale signs in flow logs that are easy to spot when we visualize them (here is an example). Wouldn't it be good if we could teach a neural net to check our logs for such activity?

To this end, I've written some Python code using TensorFlow (here) that fakes some port data. Some of the fake data are clearly the result of a port scan; others are just random connections. Can a convolutional neural network (CNN) spot the difference?

Fake data look like this:

Graphical representation of network connection activity
Notice that a port scan shows up as a straight line: a vertical line might represent scanning every port on a single box, while a horizontal line might represent scanning a lot of boxes on just one port.

Realistically, nefarious actors only scan a subset of ports (typically those below 1024). I'll make the fake data more sophisticated later.
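The data-faking code itself lives in the repo linked above; as a rough sketch of the idea (hypothetical names, plain NumPy, not the actual code), it boils down to something like this:

    import numpy as np

    rng = np.random.default_rng(0)
    HOSTS, PORTS = 24, 24                        # matches the 24x24 images used below

    def random_traffic(n_connections=40):
        # benign background noise: a handful of random (port, host) connections
        img = np.zeros((PORTS, HOSTS), dtype=np.float32)
        img[rng.integers(0, PORTS, n_connections),
            rng.integers(0, HOSTS, n_connections)] = 1.0
        return img

    def port_scan():
        # scanning every port on one box shows up as a single vertical line
        img = random_traffic()
        img[:, rng.integers(0, HOSTS)] = 1.0
        return img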

A brief introduction to CNNs

"The big idea behind convolutional neural networks is that a local understanding of an image is good enough. As a consequence, the practical benefit is that fewer parameters greatly improve the time it takes to learn as well as lessens the amount of data required to train the model." [2]

"Pooling layers are commonly inserted between successive convolutional layers. We want to follow convolutional layers with pooling layers to progressively reduce the spatial size (width and height) of the data representation. Pooling layers reduce the data representation progressively over the network and help control overfitting. The pooling layer operates independently on every depth slice of the output." [1]

"The pooling layer uses the max() operation to resize the input data spatially (width, height). This operation is referred to as max pooling. With a 2x2 filter size, the max() operation is taking the largest of four numbers in teh filter area." [1]

Code

I stole the code from Machine Learning with TensorFlow, but while the book reshapes its data into 24x24 images, my images have different dimensions, and when I ran the book's code against my data I got errors. Apparently, this line (p184) was causing me problems:
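(I haven't reproduced the book's code verbatim; from memory the offending line defines the weights of the fully connected layer and looks roughly like this.)

    # fully connected layer weights: flattened 6*6*64 feature map -> 1024 hidden units
    W3 = tf.Variable(tf.random_normal([6*6*64, 1024]))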

Where was it getting 6*6*64 from? The 64 is easy to explain (it's the arbitrary number of convolution filters used in the previous layer) but the 6*6...?

When using MLwTF's code and my data, TensorFlow was complaining that logits_size and labels_size were not the same. What does this mean?

"Logits is an overloaded term which can mean many different things. In Math, logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf)) ... A probability of 0.5 corresponds to a logit of 0. Negative logits correspond to probabilities less than 0.5, positive to > 0.5.

In ML, it can be the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class." (from StackOverflow)
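To make that concrete, a minimal softmax sketch (plain NumPy, my own illustration rather than anything from the book or the repo):

    import numpy as np

    def softmax(logits):
        # shift by the max for numerical stability, then normalise
        e = np.exp(logits - np.max(logits))
        return e / e.sum()

    print(softmax(np.array([2.0, -1.0, 0.5])))  # roughly [0.79 0.04 0.18] - sums to 1

In my case, presumably, the book's hard-coded 6*6*64 meant the reshaped tensor, and therefore the logits, ended up the wrong size for my differently-shaped images and their labels.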

After a lot of trial and error, I figured out that the 6*6 was coming from max pooling. This is where the algorithm "sweeps a window across an image and picks the pixel with the maximum value" [MLwTF]. The ksize in the code is 2 and this is the third layer, so we have already max-pooled the original matrix twice. So the 6 comes from our original size (24 pixels) being max-pooled twice with a window of 2, giving 24/(2*2) = 6.
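A quick way to check this is to max-pool a dummy 24x24 tensor twice and watch the shapes (my own sanity check, assuming 2x2 pooling with stride 2 and 'SAME' padding as in the book's model):

    import tensorflow as tf

    x = tf.zeros([1, 24, 24, 64])      # batch, height, width, channels

    def pool(t):
        # 2x2 max pooling with stride 2
        return tf.nn.max_pool(t, ksize=[1, 2, 2, 1],
                              strides=[1, 2, 2, 1], padding='SAME')

    print(pool(x).shape)               # (1, 12, 12, 64)
    print(pool(pool(x)).shape)         # (1, 6, 6, 64) - hence the 6*6*64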

I noticed that the stride length also plays a part in the calculation of the size of the fully connected layer ("for example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image" [MLwTF]). In MLwTF's example the stride is 1, so it makes no difference in this particular case, but in general we also need to divide the image size at each layer by this value.

So, the calculation of the size of the fully connected layer looks like this (in Python):

        # i, j: spatial dimensions of the input image
        for _ in range(depth):
            # each layer shrinks the image first by the stride, then by the pooling window
            i = math.ceil(i / self.stride_size)
            i = math.ceil(i / self.ksize)
            j = math.ceil(j / self.stride_size)
            j = math.ceil(j / self.ksize)
        # number of units (per filter) feeding the fully connected layer
        return i * j

where i and j are initially set to the height and width of the input data, depth is the layer for which we want to calculate the size, and stride_size and ksize are the convolution stride and pooling window size (math.ceil needs import math).
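Plugging in the book's numbers - a 24x24 image, stride 1, ksize 2, and two convolution/pooling layers before the fully connected layer - gives 6*6 = 36 which, multiplied by the 64 filters in the last convolutional layer, is exactly the 6*6*64 above. For my differently-sized images, the same calculation gives the number of units the fully connected layer actually needs.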

[1] Deep Learning - A Practitioner's Guide.
[2] Machine Learning with TensorFlow.
