Friday, March 29, 2019

Hands on tuning a neural net


Introduction

Once again, I am using machine learning to find out what a typical domain name looks like. For this, I use the one million domains that Cisco has helpfully made available here. Once I have a model of what is typical, I can then use known malicious domains (provided by Bambenek Consulting here) to see how they deviate from it.

The data

I first encode each domain by creating a histogram of its character pairs and turning this into probabilities. Each data point is then a sparse vector of length 1444 (38 x 38, for the 38 legitimate characters used in domain names).
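Concretely, the encoding is something like the sketch below. The 38-character alphabet and all the names are my assumptions for illustration, not the actual code:

import java.util.HashMap;
import java.util.Map;

public class BigramEncoder {
    // 38 legitimate domain-name characters: a-z, 0-9, '-' and '.'
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-.";
    private static final int N = ALPHABET.length(); // 38
    private static final Map<Character, Integer> INDEX = new HashMap<>();
    static {
        for (int i = 0; i < N; i++) INDEX.put(ALPHABET.charAt(i), i);
    }

    // Turns a domain into a vector of length 38 x 38 = 1444 of bi-gram probabilities.
    public static double[] encode(String domain) {
        double[] counts = new double[N * N];
        String d = domain.toLowerCase();
        int total = 0;
        for (int i = 0; i < d.length() - 1; i++) {
            Integer first  = INDEX.get(d.charAt(i));
            Integer second = INDEX.get(d.charAt(i + 1));
            if (first == null || second == null) continue; // ignore unexpected characters
            counts[first * N + second]++;
            total++;
        }
        if (total > 0) {
            for (int i = 0; i < counts.length; i++) counts[i] /= total; // histogram -> probabilities
        }
        return counts;
    }
}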

Now, I fire up a Variational Autoencoder written using the DL4J framework.

Rig


I have an old GeForce GTX 650 GPU that has only 1GB of memory. "To be blunt, 1GB is really tiny by today's standards, it's not really enough for deep learning especially if you're running a display on it at the same time (OS will use some of the memory)" - Alex Black, DL4J Gitter room (Jan 29 00:46 2019)

So, I ran the neural net on the CPU.

I run my code with:

export D=/media/sdb8/Data/CiscoUmbrella/MyData/Real/Results/`date +%d%m%y%H%M` ; mkdir -p $D ; export MAVEN_OPTS="-Xmx32g" ;  taskset 0xFFFCE mvn clean install -DskipTests exec:java -Dexec.mainClass=XXX  -Dexec.args="YYY" | tee $D/mvn.log ; shutter --window=.*firefox.* -o $D/overview.png -e

Note taskset, which pins the job to all cores but one (15 out of my 16) so the OS is not overwhelmed.

Also note shutter, which takes a screenshot of the DL4J console that I am viewing in Firefox.

Learning Rate

On the "Update: Parameter Ratio Chart" tab, the mean magnitude means "the average of the absolute value of the parameters or updates at the current time step". From the DeepLearning4J visualisation guide:
The most important use of this ratio is in selecting a learning rate. As a rule of thumb: this ratio should be around 1:1000 = 0.001. On the log10 chart, this corresponds to a value of -3. Note that is a rough guide only, and may not be appropriate for all networks. It’s often a good starting point, however. 
If the ratio diverges significantly from this (for example, > -2 (i.e., 10^-2 = 0.01) or < -4 (i.e., 10^-4 = 0.0001)), your parameters may be too unstable to learn useful features, or may change too slowly to learn useful features. To change this ratio, adjust your learning rate (or sometimes, parameter initialization). 
With a learning rate of 1e-2 and the TANH activation function, this chart was very jagged for me. Changing the learning rate (with Adam) to 1e-4 smoothed it out:
Learning rate 1e-4, activation function TANH
The learning rate seems to be quite sensitive. With a value of 1e-3, performance is already degrading; note that the Parameter Ratios are touching -2 and above and becoming volatile:
TANH; learning rate 1e-3
The same is true of 1e-5, where this time the ratios drift below -4:

TANH; learning rate 1e-5
So, it seems at least for TANH that a learning rate of 1e-4 is the sweet spot.
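For context, the sort of DL4J configuration being tuned here looks roughly like the sketch below. The layer sizes, seed and weight initialisation are illustrative rather than the exact values I used; only the updater (learning rate) and activation function are the knobs under discussion.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.variational.GaussianReconstructionDistribution;
import org.deeplearning4j.nn.conf.layers.variational.VariationalAutoencoder;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;

public class VaeConfigSketch {
    public static MultiLayerConfiguration tanhConfig(double learningRate) {
        return new NeuralNetConfiguration.Builder()
                .seed(42)
                .updater(new Adam(learningRate))      // 1e-4 was the sweet spot for TANH
                .weightInit(WeightInit.XAVIER)
                .list()
                .layer(0, new VariationalAutoencoder.Builder()
                        .activation(Activation.TANH) // swap for LEAKYRELU / SOFTPLUS to see the NaNs
                        .encoderLayerSizes(1200, 480)
                        .decoderLayerSizes(480, 1200)
                        .pzxActivationFunction(Activation.IDENTITY)
                        .reconstructionDistribution(
                                new GaussianReconstructionDistribution(Activation.IDENTITY))
                        .nIn(1444)                   // the bi-gram probability vectors
                        .nOut(30)                    // size of the latent space
                        .build())
                .build();
    }
}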

But with LEAKYRELU, Parameter Ratios started touching -4 and then we started seeing NaNs again. Similarly, for SOFTPLUS, even a learning rate of 1e-5 wasn't enough to save us from NaNs:
SOFTPLUS; learning rate = 1e-5
This came as something of a surprise as LEAKYRELU is often said to be the default activation function.

What's more, my score sucked:

VAE's score with vectors of character bi-gram probabilities
We'll return to the choice of activation function later. In the meantime...

The Data (again)

Maybe representing my data as vectors of bi-gram probabilities was not a great idea. So, let's try 1-hot encoding the characters instead:

VAE's score with 1-hot encoded characters
Ah, that's better (if far from optimal). It's also slower to train as the vectors are just as sparse but now have a size of about 4000.
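For what it's worth, the 1-hot encoding is along these lines (a sketch; the maximum length and the names are my assumptions):

public class OneHotEncoder {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-.";
    private static final int N = ALPHABET.length(); // 38
    private static final int MAX_LEN = 79;          // assumed cap; 79 x 38 = 3002, in the "about 4000" ballpark

    // One 1 per character position, zero-padded/truncated to MAX_LEN characters.
    public static double[] encode(String domain) {
        double[] vector = new double[MAX_LEN * N];
        String d = domain.toLowerCase();
        for (int pos = 0; pos < Math.min(d.length(), MAX_LEN); pos++) {
            int idx = ALPHABET.indexOf(d.charAt(pos));
            if (idx >= 0) vector[pos * N + idx] = 1.0;
        }
        return vector;
    }
}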

Reconstruction Distribution

But wait: if the nature of the data has changed, maybe we need to re-assess the neural net. The reconstruction distribution was a Gaussian, which might have made sense when we were dealing with probability vectors. However, we're now 1-hot encoding, so let's change it to a Bernoulli since "The Bernoulli distribution is binary, so it assumes that observations may only have two possible outcomes." (StackOverflow). That sounds reasonable, so let's see:

VAE score with 1-hot encoding, now using Bernoulli reconstruction distribution.
Negative scores? Time to ask the DL4J Gitter group:
PhillHenry @PhillHenry Mar 22 16:25
Silly question but I assume a negative "Model Score vs. Iteration" in the "DL4J Training UI" is a bad sign...?
Alex Black @AlexDBlack 00:32
@PhillHenry as a general rule, yes, negative score is bad. It is typically caused by a mismatch between the data, loss function and output layer activation function - for example, something like using tanh activation function for classification instead of softmax.
PhillHenry @PhillHenry 03:02
Are there any circumstances where it's OK? (BTW I'm trying to run a VAE).
Samuel Audet @saudet 03:34
IIRC, this happens with VAE when it's not able to fit the distribution well.
Alex Black @AlexDBlack 04:14
VAE is an exception, to a degree. You can still have a mismatch between the reconstruction distribution and the actual data, and that will give you rubbish negative results (like incorrectly trying to fit gaussian data with bernoulli reconstruction distribution). However, it is possible (but rare) for negative log likelihood to be negative on very good fit of the data - IIRC we exclude some normalization terms. If that's the case, you should see excellent reconstructions though.
Well, I've already switched to Bernoulli so what's going wrong?
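For the record, the switch itself is just the reconstruction distribution on the VAE layer, roughly as in this sketch (the layer sizes are placeholders carried over from the earlier sketch):

import org.deeplearning4j.nn.conf.layers.variational.BernoulliReconstructionDistribution;
import org.deeplearning4j.nn.conf.layers.variational.VariationalAutoencoder;
import org.nd4j.linalg.activations.Activation;

public class BernoulliVaeSketch {
    // The same VAE layer as before, but with a Bernoulli reconstruction distribution
    // (paired with a sigmoid) to match the 0/1 one-hot data.
    public static VariationalAutoencoder layer(int nIn, int latentSize) {
        return new VariationalAutoencoder.Builder()
                .activation(Activation.TANH)
                .encoderLayerSizes(1200, 480)
                .decoderLayerSizes(480, 1200)
                .pzxActivationFunction(Activation.IDENTITY)
                .reconstructionDistribution(
                        new BernoulliReconstructionDistribution(Activation.SIGMOID))
                .nIn(nIn)          // now ~4000 for the 1-hot vectors
                .nOut(latentSize)
                .build();
    }
}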

The Data (yet again)


Scratching my head over a coffee, it struck me: I was still standardising the data as if it were Gaussian! I'm not sure how a unit test could have saved me here, but OK, let's remove the standardisation; after all, why would you standardise Bernoulli-distributed data?
VAE score with LEAKYRELU
Wow, that looks almost normal. 
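For the record, the offending pre-processing was along these lines (a sketch; the iterator is whatever feeds the training vectors in). These are the lines that had to go:

import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

public class StandardisationBug {
    // With bi-gram probability vectors and a Gaussian reconstruction distribution this
    // was defensible; with 0/1 one-hot vectors and a Bernoulli distribution it just
    // mangles the data.
    public static void standardise(DataSetIterator trainIter) {
        NormalizerStandardize normaliser = new NormalizerStandardize();
        normaliser.fit(trainIter);             // collect mean and standard deviation from the training set
        trainIter.setPreProcessor(normaliser); // ...and apply them to every batch
    }
}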

The Activation Function

Since we've changed an awful lot, now might be a good time to ditch TANH and try LEAKYRELU again. Sure enough, we don't see any NaNs and the score looks good.

Let's see what the ROC curve looks like with my test data:
ROC curve for VAE with LEAKYRELU, 1-hot encoding, Bernoulli reconstruction
OK, not amazing but it's starting to look better than a monkey score. But how do we make things better?
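In case it's useful, one way to produce the scores behind such a ROC curve with DL4J's VAE is sketched below. This follows DL4J's VAE anomaly-detection pattern rather than necessarily my exact code; the variable names are placeholders.

import org.deeplearning4j.nn.layers.variational.VariationalAutoencoder;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;

public class AnomalyScorer {
    // A low reconstruction log-probability means the VAE finds the domain surprising.
    // Ranking known-good vs known-bad domains by this score is what a ROC curve
    // like the one above can be built from.
    public static INDArray score(MultiLayerNetwork net, INDArray features) {
        VariationalAutoencoder vae = (VariationalAutoencoder) net.getLayer(0);
        return vae.reconstructionLogProbability(features, 16); // 16 Monte Carlo samples per example
    }
}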

Architecture


Let's start with the architecture of the net itself.

"Larger networks will always work better than smaller networks, but their higher model capacity must be appropriately addressed with stronger regularization (such as higher weight decay), or they might overfit... The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting." (Andrej Karpathy)

"How many hidden layers are required to see an improvement where a shallow net underperforms is anyone's guess, but in general more depth would be better - you get more abstract, more general solutions. In practice though, optimization isn't quite so neat, and adding capacity adds risks of the process falling into various pitfalls - local minima, overfitting... And of course then there's the added computational cost." [StackOverflow]

So, let's try an architecture with layers of size [3002, 1200, 480, 192, 76, 30]:

VAE with 5 hidden layers


Here (StackOverflow) is an example of where the latent space is just too big causing overfitting. "You could compress the output further... Autoencoders do exactly that, except they get to pick the features themselves...  So what you do in your model is that you're describing your input image using over sixty-five thousand features. And in a variational autoencoder, each feature is actually a sliding scale between two distinct versions of a feature, e.g. male/female for faces...  Can you think of just a hundred ways to describe the differences between two realistic pictures in a meaningful way? Possible, I suppose, but they'll get increasingly forced as you try to go on.  With so much room to spare, the optimizer can comfortably encode each distinct training image's features in a non-overlapping slice of the latent space rather than learning the features of the training data taken globally."

The takeaway point is not to make the latent space layer too big. We deliberately want to squeeze the data through a small gap. So, let's really squeeze this data with an architecture like [2964, 1185, 474, 189, 75, 2]:

VAE with 5 hidden layers and the last layer has 2 units
Hmm, no change, and the same happened with final layers of 5, 10 and 20 units. So, this looks like a dead-end.
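For reference, this kind of stack is expressed through the encoder/decoder layer sizes and the latent size in the VAE builder, roughly as in this sketch. Where the hidden layers stop and the latent layer starts in the list above is my own reading of it.

import org.deeplearning4j.nn.conf.layers.variational.BernoulliReconstructionDistribution;
import org.deeplearning4j.nn.conf.layers.variational.VariationalAutoencoder;
import org.nd4j.linalg.activations.Activation;

public class DeepVaeSketch {
    // My reading of [2964, 1185, 474, 189, 75, 2]: a 2964-wide input, hidden encoder
    // layers of 1185, 474, 189 and 75 units (mirrored in the decoder) and a 2-unit
    // latent space.
    public static VariationalAutoencoder layer() {
        return new VariationalAutoencoder.Builder()
                .activation(Activation.LEAKYRELU)
                .encoderLayerSizes(1185, 474, 189, 75)
                .decoderLayerSizes(75, 189, 474, 1185)
                .reconstructionDistribution(
                        new BernoulliReconstructionDistribution(Activation.SIGMOID))
                .nIn(2964)
                .nOut(2)   // the squeezed latent space
                .build();
    }
}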

Also, those weights are going crazy...

Weights

"For NNs though, what you want out of your deep layers is non-linearity. That's where the magnitudes of weights start to matter.

"In most activation functions, very small weights get more or less ignored or are treated as 'evidence against activation' so to speak.

"On top of that some activations, like sigmoid and tanh, get saturated. A large enough weight will effectively fix the output of the neuron to either the maximum or the minimum value of the activation function, so the higher the relative weights, the less room is left for subtlety in deciding whether to pass on the activation or not based on the inputs." [StackOverflow]

Regularization

"A model with large weights is more complex than a model with smaller weights. It is a sign of a network that may be overly specialized to training data. In practice, we prefer to choose the simpler models to solve a problem (e.g. Occam’s razor). We prefer models with smaller weights... The L2 approach is perhaps the most used and is traditionally referred to as “weight decay” in the field of neural networks." (MachineLearningMastery)

However, massively increasing the L2 penalty only made the score worse, and no amount of tuning seemed to change that. I was also hoping to rein in the excessive updates, but adding RenormalizeL2PerParamType (gradient normalisation) didn't seem to make any difference either.
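For reference, both of those knobs live on the DL4J configuration builder, roughly as in this sketch (the values are illustrative, not the ones I actually tried):

import org.deeplearning4j.nn.conf.GradientNormalization;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.learning.config.Adam;

public class RegularisationSketch {
    public static NeuralNetConfiguration.Builder builder() {
        return new NeuralNetConfiguration.Builder()
                .updater(new Adam(1e-4))
                .l2(1e-3)  // L2 weight decay: penalise large weights
                .gradientNormalization(GradientNormalization.RenormalizeL2PerParamType);
    }
}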

Well, since adding more layers can be responsible for overfitting, let's now take some away. With merely two hidden layers of size [1003, 401], the weights look more respectable:
VAE with 2 hidden layers
The Data (last time)

Finally, I went back to the domain names yet again and stripped them not just of their top-level domains but also of any labels preceding the main one. So, for instance, the domain name javaagile.blogspot.com is mapped to simply blogspot.
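Something along these lines (a sketch of the idea rather than my actual code; note it naively ignores multi-part TLDs such as .co.uk):

public class DomainStripper {
    // Drop the TLD and anything before the registered label,
    // e.g. javaagile.blogspot.com -> blogspot.
    public static String strip(String domain) {
        String[] labels = domain.toLowerCase().split("\\.");
        if (labels.length < 2) return domain;  // nothing to strip
        return labels[labels.length - 2];      // the label immediately to the left of the TLD
    }

    public static void main(String[] args) {
        System.out.println(strip("javaagile.blogspot.com")); // prints: blogspot
    }
}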

Gratifyingly, the score went from lingering at about 100 down to 40, and the ROC curve now looks decent:


Things learned

- Unit test your data munging code.

- Keep your logs. Print out your hyper-parameters so that they are captured in the logs.

- Shuffle your data! I wasted a day testing the neural net on a subset of the data, but this subset was sorted, so only the top 10,000 data points were being used. You might find the Linux command shuf useful for doing this.

- Write tests that check your results, making sure they're sensible.

Goldilocks

Getting the best out of the neural net seems to be a combination of tuning:
  • The data that best represents your problem. This is perhaps the most important.
  • The hyper-parameters.
  • The architecture.
As an aside, take a look at TensorFlow playground. It's a great way to get an intuitive feel for what is going on. It's just a neural net that runs in your browser.
