Saturday, December 22, 2018

Faux data and DeepLearning4J


Like most people in corporate data science, I'm having a hard time getting my hands on real but commercially sensitive data within the organisation. To this end, I'm writing a library to generate generic data. Sure, ersatz data is rarely decent, but I figured that if my models can't find the signal deliberately planted in the data, they're unlikely to find a signal in the real data.

I'm using the DeepLearning4J (DL4J) framework to check whether the generated data is sensible. DL4J looks like it could be a real game-changer in the near future as it integrates easily with Spark (I ran their example with no major problems but noted it took quite a long time) and can import Keras models.
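
For reference, the Keras import path really is just a couple of lines. Here is a minimal sketch using DL4J's KerasModelImport (the model filename is made up):

import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

public class KerasImportExample {
    public static void main(String[] args) throws Exception {
        // An HDF5 file saved from Keras with model.save(...) -- hypothetical filename
        String kerasFile = "my_keras_model.h5";

        // Import a Keras Sequential model (architecture + weights) as a DL4J MultiLayerNetwork
        MultiLayerNetwork network =
                KerasModelImport.importKerasSequentialModelAndWeights(kerasFile);

        System.out.println(network.summary());
    }
}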

The first thing I did was try to build an Artificial Neural Net to detect unusual behaviour. One of the challenges I have in the day job is to distinguish between a network intrusion and genuine behaviour. Machines being accessed at night are suspicious but not necessarily compromised (some insomniac wants to do some work, for example). So, could an LSTM neural net (specially designed for time series) distinguish between night-time and day-time activity?
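
The shape of network I was playing with looks roughly like the following sketch. The layer sizes and hyperparameters here are purely illustrative, not what I actually settled on:

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class IntrusionLstm {
    public static MultiLayerNetwork build(int nFeatures) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(42)
                .updater(new Adam(1e-3))
                .list()
                // LSTM layer reads the sequence of timestamped feature vectors
                .layer(new LSTM.Builder()
                        .nIn(nFeatures)
                        .nOut(32)
                        .activation(Activation.TANH)
                        .build())
                // Softmax output over two classes: hacked / not hacked
                .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nIn(32)
                        .nOut(2)
                        .build())
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}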

I stole the code for the ANN from here but came across this bug in the DL4J library, which wasted a lot of time. Unfortunately, there has been no release containing the fix, so I am forced to use SNAPSHOTs. Even then, I was unable to detect the signal in the noise and kept seeing "Warning: 1 class was never predicted by the model and was excluded from average precision". Since I only have two classes (hacked/not hacked), this wasn't a great sign. The DL4J guys say that "at this point you are just looking at a plain old tuning problem", so I spent a lot of fruitless time twiddling knobs.
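
(That warning surfaces in the evaluation step. For context, the evaluation looks roughly like this sketch, where the network and test iterator are assumed to exist elsewhere:)

import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class EvaluateModel {
    // 'net' and 'testData' are assumed to come from elsewhere; this just shows the evaluation call
    public static void printStats(MultiLayerNetwork net, DataSetIterator testData) {
        Evaluation eval = net.evaluate(testData);
        // stats() reports accuracy, precision, recall etc.; the warning appears
        // when one of the labels is never predicted by the model
        System.out.println(eval.stats());
    }
}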

The problem was more profound than this. It appears I was asking an LSTM the wrong question. That is, although the data was a time series, each point in time was entirely unrelated to all the others. Changing the data so that one class was (1) spread randomly across a year and the other was (2) tightly clustered around a single date-time gave the neural net a fighting chance of distinguishing the two. In the latter case, the data points bunch around a single date-time (imagine somebody fat-fingering their password a few times per year versus somebody trying to guess it in a small time window). In this case, each timestamp is highly dependent on the others.
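
To make that concrete, the two classes of faux data look something like this. This is a minimal sketch of the idea, not the actual library code; the class and method names are made up:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FauxTimestamps {

    private static final Random RANDOM = new Random(42);

    /** "Not hacked": a handful of failed logins scattered uniformly across a year. */
    public static List<Instant> scatteredAcrossYear(Instant startOfYear, int n) {
        long yearSeconds = Duration.ofDays(365).getSeconds();
        List<Instant> times = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            times.add(startOfYear.plusSeconds((long) (RANDOM.nextDouble() * yearSeconds)));
        }
        return times;
    }

    /** "Hacked": the same number of failed logins crammed into a small window. */
    public static List<Instant> clusteredAround(Instant attackTime, int n, Duration window) {
        List<Instant> times = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            times.add(attackTime.plusSeconds((long) (RANDOM.nextDouble() * window.getSeconds())));
        }
        return times;
    }
}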
