Introduction
Port scanning leaves tell-tale signs in flow logs that are easy to spot when we visualize them (here is an example). Wouldn't it be good if we could teach a neural net to check our logs for such activity?
To this end, I've written some Python code using TensorFlow (here) that generates fake port-connection data. Some of the fakes are clearly the result of a port scan; others are just random connections. Can a convolutional neural network (CNN) spot the difference?
Fake data look like this:
[Figure: graphical representation of network connection activity]
Realistically, nefarious actors only scan a subset of ports (typically those below 1024). I'll make the fake data more sophisticated later.
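The real generator is in the code linked above; the sketch below is only meant to give the flavour (the array shape, the names and the scan heuristic here are my own assumptions, not the actual implementation):

import numpy as np

NUM_SOURCES, NUM_PORTS = 64, 256  # assumed dimensions of one "image"

def random_connections():
    # benign traffic: each source talks to a handful of random ports
    image = np.zeros((NUM_SOURCES, NUM_PORTS), dtype=np.float32)
    for src in range(NUM_SOURCES):
        ports = np.random.choice(NUM_PORTS, size=np.random.randint(1, 5), replace=False)
        image[src, ports] = 1.0
    return image

def port_scan():
    # a scan: one source hits a long, contiguous range of ports
    image = random_connections()
    scanner = np.random.randint(NUM_SOURCES)
    image[scanner, :np.random.randint(NUM_PORTS // 2, NUM_PORTS)] = 1.0
    return image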
A brief introduction to CNNs
"The big idea behind convolutional neural networks is that a local understanding of an image is good enough. As a consequence, the practical benefit is that fewer parameters greatly improve the time it takes to learn as well as lessens the amount of data required to train the model." [2]
"Pooling layers are commonly inserted between successive convolutional layers. We want to follow convolutional layers with pooling layers to progressively reduce the spatial size (width and height) of the data representation. Pooling layers reduce the data representation progressively over the network and help control overfitting. The pooling layer operates independently on every depth slice of the output." [1]
"The pooling layer uses the max() operation to resize the input data spatially (width, height). This operation is referred to as max pooling. With a 2x2 filter size, the max() operation is taking the largest of four numbers in teh filter area." [1]
Code
I stole the code from Machine Learning with TensorFlow, but while it reshapes its data into 24x24 images, my images have different dimensions. And when I ran the book's code against my data, I got errors. Apparently, the culprit was this line (p184), which declares the weights of the fully connected layer and looks something like:
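# (approximate reconstruction of the book's line)
# weights of the fully connected layer: 6*6*64 inputs, 1024 outputs
W3 = tf.Variable(tf.random_normal([6 * 6 * 64, 1024]))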
Where was it getting 6*6*64 from? The 64 is easy to explain (it's an arbitrary number of convolutions we use in the previous layer) but the 6x6...?
When using MLwTF's code with my data, TensorFlow complained that logits_size and labels_size were not the same. What does this mean?
"Logits is an overloaded term which can mean many different things. In Math, Logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf)) ... Probability of 0.5 corresponds to a logit of 0. Negative logit correspond to probabilities less than 0.5, positive to > 0.5.
"In ML, it can be the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class."
(from StackOverflow).
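To make the maths half of that concrete, here is a tiny NumPy illustration (not the model's code) of the logit function and of softmax turning raw scores back into probabilities:

import numpy as np

def logit(p):
    # maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))

print(logit(0.5))   # 0.0
print(logit(0.25))  # negative, because p < 0.5
print(logit(0.75))  # positive, because p > 0.5

def softmax(z):
    # turns a vector of raw scores ("logits") into probabilities that sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, -1.0, 0.5])))

In the book's model the logits are simply the output of the final fully connected layer. My guess is that flattening the convolution output to a hard-coded 6*6*64 width silently changed the batch dimension for my differently-sized images, so the logits no longer lined up with the labels.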
I noticed that the stride length also plays a part in the calculation of the size of the fully connected layer ("for example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image" MLwTF). In the MLwTF example the stride is 1, so it makes no difference in this particular case, but in general we also need to divide the image size at each layer by this value.
So, the calculation of the size of the fully connected layer looks like this in Python (the fc_size wrapper and the input-size attribute names are illustrative):
import math

def fc_size(self, depth):
    # i and j start as the height and width of the input data
    i, j = self.input_height, self.input_width  # placeholder attribute names
    for _ in range(depth):
        i = math.ceil(i / self.stride_size)  # shrink by the convolution stride
        i = math.ceil(i / self.ksize)        # then by the pooling window
        j = math.ceil(j / self.stride_size)
        j = math.ceil(j / self.ksize)
    return i * j
where i and j are initially the height and width of the input data, and depth is the number of convolution/pooling layers before the fully connected one.
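Plugging in the book's numbers (a 24x24 input, 2x2 pooling, stride-1 convolutions and two convolution/pooling layers, which is my reading of the MLwTF example) recovers the mysterious constant:

# layer 1: 24 -> ceil(24 / 1) = 24 -> ceil(24 / 2) = 12
# layer 2: 12 -> ceil(12 / 1) = 12 -> ceil(12 / 2) = 6
# the fully connected layer therefore sees 6 * 6 = 36 values per channel,
# and with 64 channels that is the 6*6*64 from p184:
print(6 * 6 * 64)  # 2304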
[1] Deep Learning - A Practitioner's Guide.
[2] Machine Learning with TensorFlow.