Agile Java Man: DeepLearning4J


Tuesday, February 26, 2019

Hands-on with Variational Autoencoders

I hid 25 anomalous samples among 10 000 and then used a Variational Autoencoder (VAE) to find them. Here is my account of trying to find the anomalies.

Data

Each sample has 50 points in time (represented by a Long) that may be either bunched together in a few hours or scattered randomly across a calendar year.

By far the single biggest improvement came from normalising this data. Without normalising, the neural net was pretty useless.

So, before normalization, a single sample looks like:

1371877007, 1386677403, 1371954061, 1361641428, 1366151894, 1366819029, 1380334620, 1379574699, 1359865022, 1377141715, 1370407230, 1358989583, 1373813009, 1364038087, 1361247093, 1367920808, 1379825490, 1379755109, 1363559641, 1373945939, ...

and after normalization, it may look something like this:

0.2737, -0.0451, -0.6842, 1.6797, -1.3887, -0.0844, -0.6952, 0.9683, 0.7747, 1.6273, -1.0817, -0.0380, 1.3321, 0.2864, 0.9135, -1.3018, 1.0786, 0.0830, -0.3311, -1.6751, 1.6270, 1.4007, 0.8983, ...

Note that the normalized data is (roughly) zero-centred and (very roughly) in the region of -1 to 1. See below for why this is relevant.
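The post does not say which normalization was used, so the following is only a minimal sketch that assumes per-sample standardization (subtract the mean, divide by the standard deviation). The function name standardise is made up for illustration. Applied to the first few raw timestamps above, it yields zero-centred values of roughly the magnitude shown.

```python
import numpy as np

# First few timestamps (epoch seconds) from the raw sample shown above.
raw = np.array([1371877007, 1386677403, 1371954061, 1361641428,
                1366151894, 1366819029, 1380334620, 1379574699],
               dtype=np.float64)

def standardise(sample):
    """Assumed scheme: subtract the sample mean, divide by its standard deviation."""
    return (sample - sample.mean()) / sample.std()

normalised = standardise(raw)
print(np.round(normalised, 4))
```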

Aside: it's really, really important for the data to be reproducible through runs. That is, although the data is random, it must be reproducibly random. I wasted a lot of time being fooled by randomness in the data.
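A minimal sketch of "reproducibly random" data, assuming the data is generated with NumPy (my own generator is not shown here; make_samples and SEED are hypothetical names). The same seed gives identical data on every run, so any change in results comes from the model and not from the data. The network needs seeding as well; DL4J's configuration builder has a seed setting for that.

```python
import numpy as np

SEED = 42  # hypothetical value; any fixed integer works

def make_samples(seed, n_samples=4, n_points=50):
    """Draw random values from a generator that is seeded explicitly."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, n_points))

first_run = make_samples(SEED)
second_run = make_samples(SEED)
different = make_samples(SEED + 1)

print(np.array_equal(first_run, second_run))   # same seed, same data
print(np.array_equal(first_run, different))    # different seed, different data
```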

What are VAEs?

"It is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a generative model." (Keras Blog)

With vanilla encoders, "If the [latent space] has discontinuities (eg. gaps between clusters) and you sample/generate a variation from there, the decoder will simply generate an unrealistic output, because the decoder has no idea how to deal with that region of the latent space.

"Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation." (TowardsDataScience)

Tuning

According to Andrej Karpathy:

"The most common hyperparameters in context of Neural Networks include:

the initial learning rate
learning rate decay schedule (such as the decay constant)
regularization strength (L2 penalty, dropout strength)"

But first, let's look at:

Activation Functions

"If you know the outputs have certain bounds, it makes sense to use an activation function that constrains you to those bounds." (StackOverflow)

Given our data, one might think that HARDSIGMOID, SIGMOID, SWISH etc or even TANH would yield the best results (SWISH is just x*sigmoid(x)), whereas RELU, ELU etc don't model it at all.
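As a rough illustration, here is a NumPy sketch of a few of these functions and the range each produces over the interval -2 to 2. These are the textbook definitions written by hand, not DL4J's implementations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # SWISH is x * sigmoid(x)
    return x * sigmoid(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-2.0, 2.0, 9)
for name, f in [("sigmoid", sigmoid), ("tanh", np.tanh),
                ("swish", swish), ("relu", relu)]:
    y = f(x)
    print(f"{name:8s} min={y.min():+.3f} max={y.max():+.3f}")
```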

(Graphic of activation functions from Shruti Jadon on Medium.com. EDIT: see a more comprehensive list here.)

But there is an interesting opinion at TowardsDataScience:

"The question was which one is better to use?

"Answer to this question is that nowadays we should use ReLu which should only be applied to the hidden layers. And if your model suffers form dead neurons during training we should use leaky ReLu or Maxout function.

"It’s just that Sigmoid and Tanh should not be used nowadays due to the vanishing Gradient Problem which causes a lots of problems to train,degrades the accuracy and performance of a deep Neural Network Model."

However, I found no difference in accuracy with my VAE using ELU, LEAKYRELU or RELU. In fact, playing with the 21 activation functions that come with DL4J, I did not see any variation when applying them to the hidden layers.

I only saw a big difference when changing the activation function at the bottleneck layer and in the reconstruction distribution (see below).

Regularization

Setting the L2 regularization parameter gave me the following results (the mean is the number of the 25 anomalies found):

L2	Mean found	Accuracy (%)	Standard deviation
10^-5	15.2	60.8	0.837
10^-4	15.2	60.8	0.837
10^-3	15.2	60.8	0.837
10^-2	15.2	60.8	0.837
10^-1	16	64	0.707
10^0	16.2	64.8	0.447
10^1	16	64	0
10^2	16	64	0

All using the SWISH activation function.

Batch Sizes

Accuracy hovered around 16 or 17 up to and including a batch size of 64. After that, it dropped off quickly to an accuracy of 13 (53%) and 6 (24%) for batch sizes of 128 and 256.

Updater

Adam with an initial learning rate of 10^-4 seemed to give better accuracy, at 17.8 (71.2%, sd. 0.422), than RmsProp(10^-3) and AdaDelta, which both yielded an accuracy of 16 (64%) with a standard deviation of 0.

Reconstruction Distribution

Now fiddling with this knob did make quite a difference.

All the results so far were using a BernoulliReconstructionDistribution with a SIGMOID. This was because I had cribbed the code from somewhere else where the Bernoulli distribution was more appropriate as it represents "binary or 0 to 1 data only".

My data was not best approximated by a Bernoulli but a Gaussian. So, using a GaussianReconstructionDistribution with a TANH gave better results.

The DL4J JavaDocs state: "For activation functions, identity and perhaps tanh are typical - though tanh (unlike identity) implies a minimum/maximum possible value for mean and log variance. Asymmetric activation functions such as sigmoid or relu should be avoided". However, I didn't find SIGMOID or RELU made much difference to my data/ANN combination (although using CUBE led to zero anomalies being found).
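To make the difference between the two distributions concrete, here is a minimal NumPy sketch of the negative log-likelihood each one assigns to a reconstruction. It is not DL4J's code; the function names are made up, and the Gaussian is parameterised by a mean and a log variance as in the JavaDoc quote.

```python
import numpy as np

def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of x under independent Gaussians."""
    var = np.exp(log_var)
    return 0.5 * np.sum(np.log(2.0 * np.pi) + log_var + (x - mean) ** 2 / var)

def bernoulli_nll(x, p):
    """Negative log-likelihood under independent Bernoullis; only valid for x in [0, 1]."""
    eps = 1e-12  # guards against log(0)
    return -np.sum(x * np.log(p + eps) + (1.0 - x) * np.log(1.0 - p + eps))

x = np.array([0.2737, -0.0451, -0.6842, 1.6797])  # normalised values from the sample above

# A reconstruction whose mean equals the input scores better than one stuck at zero.
good = gaussian_nll(x, mean=x, log_var=np.zeros_like(x))
bad = gaussian_nll(x, mean=np.zeros_like(x), log_var=np.zeros_like(x))
print(good, bad)

# The Bernoulli form is meant for data such as this, not for negative values.
binary = bernoulli_nll(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
print(binary)
```

The Gaussian version is defined for any real-valued input, which is why it suits zero-centred data that includes negative values.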

This is similar to what I blogged about last year: features should (in a very loose sense) model your output.

Anyway, using a Gaussian reconstruction distribution, accuracy jumped to 18.6 (74.4%) albeit with a large standard deviation of 3.438.

Then, through brute force, I discovered that using SOFTPLUS in both pzxActivationFunction and GaussianReconstructionDistribution gave me an average accuracy of 19.1 (sd. 3.542). This was the high-water mark of my investigation.

Architecture

All the results so far were using just a single encoder and a single decoder layer that was half the size of the input vector. Let's call this value x.

Using hidden layers of size [x, x, 2, 2, x, x] did not change the best accuracy. Neither did [x, x/2, 2, 2, x/2, x] nor [x, x/2, x/4, 2, 2, x/4, x/2, x] nor even [x, x/2, x/4, 1, 1, x/4, x/2, x].

So, this avenue proved fruitless.

Conclusion

I am still something of a neophyte to neural nets but although I can improve the accuracy it still seems more like guesswork than following a process. There was no a priori way I know of that would have indicated that SOFTPLUS was the best activation function to use in the reconstruction, for instance.

It's clear that there are some rules-of-thumb but I wish somebody would publish a full list. Even then, it seems very data-dependent. "For most data sets only a few of the hyper-parameters really matter, but [...] different hyper-parameters are important on different data sets" (Random Search for Hyper-Parameter Optimization).

Monday, December 24, 2018

Unbalanced

... or imbalanced (apparently, imbalance is the noun and unbalance is the verb. Live and learn).

Anyway, in a data scientist job interview I was asked how I would build a network intrusion system using machine learning. I replied that I could build them a system that was 99% accurate. Gasps from my interviewers. I said, I could even give 99.9% accuracy. No! I could give 99.99% accuracy with ease.

I then admitted I could do this by saying that all traffic was fine (since much less than 0.01% was malicious), bill them and then head home. The issue here is imbalanced data.
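The arithmetic behind that trick, as a small sketch. The traffic mix of one malicious event per 100 000 is a made-up figure for illustration only.

```python
def always_benign_metrics(n_benign, n_malicious):
    """Accuracy and recall of a 'model' that labels every event as benign."""
    accuracy = n_benign / (n_benign + n_malicious)
    recall = 0 / n_malicious  # it never flags anything, so it finds no intrusions
    return accuracy, recall

# Hypothetical traffic mix: 1 malicious event per 100 000.
accuracy, recall = always_benign_metrics(n_benign=99_999, n_malicious=1)
print(f"accuracy={accuracy:.5f} recall={recall:.1f}")
```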

This is a great summary of what to do in this situation. To recap:

1. If you have plenty of data, ignore a lot from the major class. If you have too little, duplicate the minor class.

2. Generate new data by "wiggling" the original (eg, the SMOTE algorithm).

3. "Decision trees often perform well on imbalanced datasets" [ibid].

4. Penalize mistakes in misclassifying the minor data.

For this last suggestion, you can use a meta algorithm and any model but in ANNs, you get it out of the box. In DL4J, I added something like this:

.lossFunction(new LossNegativeLogLikelihood(Nd4j.create(Array(0.01f, 1f))))
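To see what those weights do, here is a minimal NumPy sketch of a class-weighted negative log-likelihood. It is not DL4J's implementation: the function name weighted_nll is made up, and I am assuming that index 0 is the majority class and index 1 the minority class.

```python
import numpy as np

def weighted_nll(probs, labels, class_weights):
    """Mean negative log-likelihood with a per-class weight on each sample."""
    true_class_prob = probs[np.arange(len(labels)), labels]
    weights = class_weights[labels]
    return np.mean(-weights * np.log(true_class_prob))

class_weights = np.array([0.01, 1.0])  # same values as the DL4J line above

# The same-sized mistake (only 10% probability on the true class) on each class.
loss_majority_mistake = weighted_nll(np.array([[0.1, 0.9]]), np.array([0]), class_weights)
loss_minority_mistake = weighted_nll(np.array([[0.9, 0.1]]), np.array([1]), class_weights)

print(loss_majority_mistake, loss_minority_mistake)
```

With these weights, getting a minority sample wrong costs 100 times as much as the equivalent mistake on a majority sample.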

"Ideally, this drives the network away from the poor local optima of always predicting the most commonly occurring class(es)." [1]

Did it improve my model? Well, first we have to abandon accuracy as our only metric. We introduce two new chaps, precision and recall.

Precision is the ratio of true positives to all the samples my model declared were positive irrespective of being true or false ("what fraction of alerts were correct?"). Mnemonic: PreCision is the sum of the Positive Column. Precision = TP / (TP + FP).

Recall is the ratio of true positives to all the positives in the data ("what fraction of intrusions did we correctly detect?"). Mnemonic: REcall deals with all the REal positives. Recall = TP / (TP + FN).

(There is also the F1 score which "is the 2*precision*recall/(precision+recall). It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between the precision and the recall" - Dr Jason Brownlee. This equation comes from the harmonic mean of precision and recall).

What counts as a desirable result is very use-case dependent. But with weights 0.005f, 1f, I got:

Accuracy: 0.9760666666666666
Precision: 0.3192534381139489
Recall: 0.9285714285714286
.
.

=========================Confusion Matrix=========================
0 1
-------------
28957 693 | 0 = 0
25 325 | 1 = 1

Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times
==================================================================

using faux data (6000 observations of a time series 50 data points long and a class imbalance of 100:1)
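The reported figures can be recomputed from the confusion matrix with the formulas given earlier. This sketch only uses the four counts printed above; the F1 score is derived from them and was not part of the original output.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted.
tn, fp = 28957, 693   # actual class 0
fn, tp = 25, 325      # actual class 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```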

I got the job to build an intrusion detection system and these results would make the customer very happy - although I can't stress enough that this is a neural net trained on totally fake data. But if we could have 93% of intrusions detected with only one in three investigations yielding a result, the client would be overjoyed. Currently, the false positives are causing them to turn the squelch in their current system too high and consequently they're not spotting real intrusions.

[1] Deep Learning: A Practitioner's Approach

Saturday, December 22, 2018

Faux data and DeepLearning4J

Like most people in corporate data science, I'm having a hard time getting my hands on real but commercially sensitive data within the organisation. To this end, I'm writing a library to generate generic data. Sure, ersatz data is rarely decent but I figured if my models can't find the signal deliberately planted in the data, they're unlikely to find a signal in the real data.

I'm using the DeepLearning4J framework to see if the data is sensible. DL4J looks like it could be a real game-changer in the near future as it integrates easily with Spark (I ran their example with no major problems but noted it took quite a long time) and can import Keras models.

The first thing I did was try to build an Artificial Neural Net to detect unusual behaviour. One of the challenges I have in the day job is to distinguish between a network intrusion and genuine behaviour. Machines being accessed at night are suspicious but not necessarily compromised (some insomniac wants to do some work, for example). So, could an LSTM neural net (specially designed for time series) distinguish between night time and day time activity?

I stole the code for the ANN from here but came across this bug in the DL4J library, wasting a lot of time. Unfortunately, there has been no release with this bug fix so I am forced to use SNAPSHOTs. But even then, I was unable to detect the signal in the noise and kept seeing "Warning: 1 class was never predicted by the model and was excluded from average precision". Since I only have two classes (hacked/not hacked) this wasn't a great sign. Since the DL4J guys say "at this point you are just looking at a plain old tuning problem", I spent a lot of fruitless time twiddling knobs.

The problem was more profound than this. It appears I was asking an LSTM the wrong question. That is, although the data was a time series, each point in time was entirely unrelated to all the others. Changing the data to being either (1) spread randomly across a year or (2) tightly clustered around a date-time gave the neural net a fighting chance of distinguishing the two. In the latter case, the data points bunch around a single date-time (imagine somebody fat-fingering their password a few times per year versus somebody trying to guess it in a small time window). In this case, each timestamp is highly dependent on the others.
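A minimal sketch of how the two kinds of sample could be generated. This is not my data library: the function names, the choice of 2013 as the year, and the three-hour window are all assumptions made for illustration.

```python
import numpy as np

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
YEAR_START = 1356998400  # 2013-01-01 00:00:00 UTC, chosen arbitrarily

def scattered(rng, n_points=50):
    """Timestamps spread uniformly at random across the year."""
    return rng.integers(YEAR_START, YEAR_START + SECONDS_PER_YEAR, size=n_points)

def clustered(rng, n_points=50, window_seconds=3 * 60 * 60):
    """Timestamps bunched inside one short window somewhere in the year."""
    start = rng.integers(YEAR_START, YEAR_START + SECONDS_PER_YEAR - window_seconds)
    return start + rng.integers(0, window_seconds, size=n_points)

rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
normal_sample = scattered(rng)
suspicious_sample = clustered(rng)

normal_spread = normal_sample.max() - normal_sample.min()
suspicious_spread = suspicious_sample.max() - suspicious_sample.min()
print(normal_spread, suspicious_spread)
```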

Monday, March 5, 2018

DeepLearning4J in a Spark cluster

Running DL4J on Spark was not too hard but there were some non-obvious gotchas.

Firstly, my tests ran on a Windows machine but I needed to add dependencies to get it to run on our Linux cluster where I was getting "no openblas in java.library.path" errors. This link helped me and now my dependencies looked like this:

<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>dl4j-spark_${scala.compat.version}</artifactId>
    <version>${deeplearning4j.version}_spark_2</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-core</artifactId>
    <version>${deeplearning4j.version}</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-kryo_${scala.compat.version}</artifactId>
    <version>${nd4j.version}</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>${nd4j.version}</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>${nd4j.version}</version>
</dependency>

with

<deeplearning4j.version>0.9.1</deeplearning4j.version>
<nd4j.version>0.9.1</nd4j.version>

Secondly, you need to add --conf "spark.kryo.registrator=org.nd4j.Nd4jRegistrator" to the CLI when you start a Spark shell.

Thirdly, I randomly grabbed some Recurrent Neural Network from here just to test my code. I found that it was immensely memory-hungry. I needed to give my driver 20gb, my executors 30gb and only 1 core per executor to avoid occasional errors in the Spark stages.

Even then, I couldn't measure the accuracy of my neural net because of this issue. Apparently, it's fixed in the SNAPSHOT but then there are issues with the platform JARs not being up to date. I asked about this on the DeepLearning4J gitter channel archived here. (The team also helpfully told me to use org.deeplearning4j.eval.Evaluation with an argument to avoid the bug).

Finally, I started getting results but not before one last gotcha, this time in Spark: use RDD.sample rather than take to get your hands on test data, as you want a nice distribution over all categories. With this, I started getting more sensible answers when evaluating my results.

Tuesday, February 13, 2018

Neural Network Notes

Some miscellaneous neural network terms. I hope to expand on all of them later.

Deep Learning

"Taking complex raw data and creating higher-order features automatically in order to make a simpler classification (or regression) output is a hallmark of deep learning... The best way to take advantage of this power is to match the input data to the appropriate deep network architecture." [1]

Architecture of ANNs

"We can use an arbitrary number of neurons to define a layer and there are no rules about how big or small this number can be. However, how complex of a problem we can model is directly correlated to how many neurons are in the hidden layers of our networks. This might push you to begin with a large number of neurons from the start but these neurons come with a cost... There are also cases in which a larger model will sometimes converge easier because it will simply 'memorize' the training data." [1]

"Hidden layers are concerned with extracting progressively higher-order features from the raw data."

"A more continuous distribution of input data is generally best modelled with a ReLU activation function. Optionally, we'd suggest using the tanh activation function (if the network isn't very deep) in the event that the ReLU did not achieve good results (with the caveat that there could be other hyperparameter-related issues with the network)." [1]

"Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all…One hidden layer is sufficient for the large majority of problems." (StackOverflow)

In the output layer for binary classification "we'd use a sigmoid output layer with a single neuron to give us a real value in the range of 0.0 to 1.0". [1]

"the best way to build a neural network model: Cause it to overfit and then regularize it to death... Regularization works by adding an extra term to the normal gradient computed."

The right model

"If we have a multiclass modeling problem yet we only care about the best score across these classes, we'd use a softmax output layer with an arg-max() function to get the highest score of all the classes. The softmax output layer gives us a probability distribution over all the classes" [1]

"If we want to get multiple classifications per output (eg person + car), we do not want softmax as an output layer. Instead, we'd use the sigmoid output layer with n number of neurons, giving us a probability distribution (0.0 to 1.0) for every class independently". [1]
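A minimal NumPy sketch of the two output layers described in these quotes. The raw scores are made-up numbers; the point is only that softmax yields one distribution across the classes while sigmoid yields an independent probability per class.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, -1.0])  # hypothetical raw outputs for three classes

single_label = softmax(scores)             # one distribution over all classes
best_class = int(np.argmax(single_label))  # arg-max picks the winning class
multi_label = sigmoid(scores)              # each class judged on its own

print(single_label.sum(), best_class)
print(multi_label, multi_label.sum())
```

The softmax values sum to 1, whereas the sigmoid values need not, which is what allows several classes (person and car, say) to be reported at once.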

"In certain architectures of deep networks, reconstruction loss functions help the network extract features more effectively when paired with the appropriate activation function. An example of this would be using the multiclass cross-entropy as a loss function in a layer with a softmax activation function for classification output." [1] However, note that cross-entropy is not symmetric so might wrongly favour certain values over others even if they're equally wrong [see this StackOverflow answer].

Optimizations

"We define a hyperparameter as any configuration setting that is free to be chosen by the user that might affect [sic] performance." [1] For example layer size, activation functions, loss functions, epochs etc.

"And so on, until we've exhausted the training inputs, which is said to complete an epoch of training."
http://neuralnetworksanddeeplearning.com/chap1.html

"First-order optimization algorithms calculate the Jacobian matrix... Second-order algorithms calculate the derivative of the Jacobian (ie, the derivative of a matrix of derivatives) by approximating the Hessian." [1]

"A major difference in first- and second-order methods is that second-order methods converge in fewer steps yet take more computation per step". [1]

"The 'vanilla' version of SGD uses gradient directly, and this can be problematic because gradient can be nearly zero for any parameter. This causes SGD to take tiny steps in some cases, and steps that are too big for situations in which the gradient is too large. To alleviate these issues, we can use the technique such as:

"AdaGrad is monotonically decreasing and never increases the learning rate" [1]
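A minimal sketch of the standard AdaGrad update on a toy problem, showing why the effective learning rate can only shrink: it is divided by the square root of an ever-growing sum of squared gradients. The function name adagrad_step and the toy cost f(w) = w^2 are illustrative choices, not taken from the book.

```python
import numpy as np

def adagrad_step(param, grad, accum, learning_rate=0.1, eps=1e-8):
    """One AdaGrad update; returns new parameter, accumulator and effective rate."""
    accum = accum + grad ** 2                      # running sum of squared gradients
    effective_rate = learning_rate / (np.sqrt(accum) + eps)
    return param - effective_rate * grad, accum, effective_rate

# Minimise f(w) = w^2, whose gradient is 2w.
w, accum = 5.0, 0.0
rates = []
for _ in range(20):
    w, accum, rate = adagrad_step(w, 2.0 * w, accum)
    rates.append(rate)

print(rates[0], rates[-1], w)
```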

Autoencoders

"An autoencoder is trained to reproduce its own input data."[1]

"The key difference to note between a multilayer perceptron network diagram (from earlier chapters) and an autoencoder diagram is the output layer in an autoencoder has the same number of units as the input layer does." [1]

"Autoencoders are good at powering anomaly detection systems."

Mini-batches

If the cost function is C, then "stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient ∇C by computing ∇C_x for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning."

This "works by randomly picking out a small number m of randomly chosen training inputs. We'll label those random training inputs X_1, X_2, …, X_m and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the ∇C_{X_j} will be roughly equal to the average over all ∇C_x" (from neuralnetworksanddeeplearning.com).
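A small sketch of that idea on a one-parameter least-squares problem. The data, the cost function and the batch size of 256 are all made up for illustration; the point is that the average gradient over a random mini-batch lands close to the average over every training input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y is roughly 3x plus a little noise.
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.1, size=x.size)
w = 0.0  # current value of the single parameter

def gradient(w, xs, ys):
    """Average gradient of the cost 0.5 * (w*x - y)^2 over the given inputs."""
    return np.mean((w * xs - ys) * xs)

full_gradient = gradient(w, x, y)

m = 256  # mini-batch size
batch = rng.choice(x.size, size=m, replace=False)
mini_batch_gradient = gradient(w, x[batch], y[batch])

print(full_gradient, mini_batch_gradient)
```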

Boltzmann machines

Defined as "a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off". [1]

"The main difference between RBMs and the more general class of autoencoders is in how they calculate the gradients." [1]

Regularization

"Dropout and DropConnect mute parts of the input to each layer such that the neural network learns other portions".

ReLU

ReLU (rectified linear units) "are the current state of the art because they have proven to work in many situations. Because the gradient of a ReLU are either zero or constant, it is possible to reign in the vanishing exploding gradient issue. ReLU activation functions have shown [sic] to train better in practice than sigmoid functions". [1]

Leaky ReLUs

"Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so)" (from here).
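A minimal NumPy sketch of the two functions, using the slope of 0.01 mentioned in the quote. The function names are mine. Note that the leaky version keeps a small, non-zero gradient for negative inputs, which is what lets a "dead" unit recover.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def leaky_relu_gradient(x, negative_slope=0.01):
    return np.where(x > 0, 1.0, negative_slope)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))
print(leaky_relu(x))
print(leaky_relu_gradient(x))
```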

[1] Deep Learning: A Practitioner's Approach

Tuesday, September 12, 2017

Neural Networks are just Matrices

DeepLearning4J brings neural nets to Java programmers. They suggest in the getting started section to run the XorExample. This is a neural net that, given XOR inputs and outputs, learns its logic. This is non-trivial for a simple neural net (see here) as the true and false values are not linearly separable in a single XOR matrix. DL4J provides a way of making more complicated neural nets but hides a lot of detail.

Matrices

The network in XorExample "consists in 2 input-neurons, 1 hidden-layer with 4 hidden-neurons, and 2 output-neurons... the first fires for false, the second fires for true".

But instead of talking about neurons, it's much easier to think of this neural net as matrices (at least if you're familiar with simple linear algebra).

So, imagine this neural net of just:

A 4 x 2 matrix of the inputs ("features").
A hidden layer that is a 2 x 4 matrix
A layer that is a 4 x 2 matrix that yields the output.

We need to apply functions to these matrices before we multiply, but that's essentially it.
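A quick shape check of those three matrices, using zeros as placeholder values and leaving out the activation functions. The variable names here are only for illustration.

```python
import numpy as np

inputs = np.zeros((4, 2))          # 4 XOR cases, 2 features each
hidden_weights = np.zeros((2, 4))  # 2 inputs feeding 4 hidden neurons
output_weights = np.zeros((4, 2))  # 4 hidden neurons feeding 2 outputs

hidden = inputs @ hidden_weights   # shape (4, 4)
outputs = hidden @ output_weights  # shape (4, 2)
print(hidden.shape, outputs.shape)
```

Each of the four input rows ends up with two output values, one per output neuron.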

Multi-layer Neural Nets in a few lines of Python

To do this, I'm not going to use TensorFlow or other frameworks dedicated to neural nets. I'll just use Numpy as basically syntactic sugar around manipulating arrays of arrays.

Also, I'll present a more intuitive approach to the maths. A more thorough analysis can be found on Brian Dolhansky's blog here.

Finally, I tried to do this in Scala using the Breeze linear algebra library but it was just so much easier to do it in Python and Numpy as it ran in a fraction of the time it took Scala to even compile.

The code

As mentioned, we need Numpy.

import numpy as np

Now, let's take the input to a XOR operation (our first matrix):

features = np.matrix('0.0, 0.0;'
'1.0, 0.0;'
'0.0, 1.0;'
'1.0, 1.0')

and the expected output:

labels = np.matrix('1.0, 0.0;'
'0.0, 1.0;'
'0.0, 1.0;'
'1.0, 0.0')

(Remember that this is not trying to be a truth table. Instead, as in the DL4J example, a 1 in the first column indicates that the output is false and a 1 in the second column indicates that it is true.)

We need those other 2 matrices. It doesn't matter what their values are initially as we'll correct them whatever they are. So, let's choose some random values but in a well-defined shape:

weightsLayer1 = np.random.rand(2, 4)

weightsLayer2 = np.random.rand(4, 2)

We also need the gradient of the weights and biases. We could have subsumed the biases into our matrices - that's mathematically equivalent - but the DeepLearning4J example doesn't do this so we won't either. The weights and biases for the first and second layers respectively are:

_0_W = np.zeros([2, 4])

_0_b = np.zeros([1, 4])

_1_W = np.zeros([4, 2])

_1_b = np.zeros([1, 2])

Finally, we need a step value and a batch size:

learning_rate = 0.1

mini_batch_size = np.shape(features)[0]

Now we can do our first multiplication (or forward propagation):

s0 = features * weightsLayer1

s0 += _0_b

But we need to apply a function to this (an activation function). In this example, we'll use a sigmoid function. It has a simple derivative that we'll need later.

def f_sigmoid(X):
    return 1 / (1 + np.exp(-X))

So, now we can apply the sigmoid activation function to the first layer:

sigmoided = f_sigmoid(s0)

This we'll feed into the next layer with another matrix multiplication:

s1 = sigmoided * weightsLayer2

s1 += _1_b

Now we need our second activation function to apply to this matrix. It's the softmax function that basically normalizes to 1 all the values in each row allowing us to treat each row as a probability distribution. It looks like this (as stolen from Brian):

def f_softmax(X):
    Z = np.sum(np.exp(X), axis=1)
    Z = Z.reshape(Z.shape[0], 1)
    return np.exp(X) / Z

We apply this to the output from the previous layer and find the delta with what the output should be:

softmaxed = f_softmax(s1)
delta = softmaxed - labels

We calculate the delta weighted according to the weights of this layer, transposing appropriately:

epsilonNext = (weightsLayer2 * delta.T).T

Now, it's a good job that sigmoid function has an easy derivative. It looks like this:

dLdz = np.multiply(sigmoided, (1-sigmoided))

where, note, this is an element-wise multiplication, not a matrix multiplication. Similarly, we calculate the back propagation (which "allows the information from the cost to then flow backward through the network in order to compute the gradient" [1]) using the same Numpy operation:

backprop = np.multiply(dLdz, epsilonNext)

Intuitively, you can think of this as each element of the weighted delta only being multiplied by the gradient of the function at that element.
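As a sanity check on that derivative, the following sketch compares s * (1 - s) with a finite-difference estimate of the sigmoid's slope. The test points and step size are arbitrary choices.

```python
import numpy as np

def f_sigmoid(X):
    return 1 / (1 + np.exp(-X))

z = np.linspace(-3.0, 3.0, 7)

# The closed form used in the post: sigmoid(z) * (1 - sigmoid(z)).
analytic = f_sigmoid(z) * (1 - f_sigmoid(z))

# Central finite difference as an independent estimate of the slope.
h = 1e-5
numeric = (f_sigmoid(z + h) - f_sigmoid(z - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```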

Note:

"The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm such as stochastic gradient descent, is used to perform learning using the gradient." [1]

Anyway, we can now update the gradients and the weights:

_0_W = (features.T * backprop) + _0_W
_1_W = (sigmoided.T * delta) + _1_W
_0_b = _0_b - (learning_rate * np.sum(backprop, 0))
_1_b = _1_b - (learning_rate * np.sum(delta, 0))
weightsLayer1 = weightsLayer1 - (learning_rate * _0_W)
weightsLayer2 = weightsLayer2 - (learning_rate * _1_W)

Note that the biases are derived from the columnar sums of the backprop and delta matrices.

Now, repeat this some 500 times and the output looks like:

print(softmaxed)

[[ 1.00000000e+00 3.16080274e-36]
 [ 1.50696539e-52 1.00000000e+00]
 [ 1.64329041e-30 1.00000000e+00]
 [ 1.00000000e+00 8.51240114e-40]]

which for all intents and purposes is the same as labels. QED.
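For reference, here are the steps above gathered into one loop. This is a sketch and not a copy of the code in this post: it uses plain NumPy arrays with the @ operator instead of np.matrix, applies the gradients directly on each pass without accumulating them, and fixes an arbitrary random seed. Because of those differences it may need many more than 500 iterations, so it only reports the loss before and after training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

rng = np.random.default_rng(0)  # arbitrary seed so runs are repeatable
w1, b1 = rng.random((2, 4)), np.zeros((1, 4))
w2, b2 = rng.random((4, 2)), np.zeros((1, 2))
learning_rate = 0.1

def forward(x):
    hidden = sigmoid(x @ w1 + b1)
    return hidden, softmax(hidden @ w2 + b2)

def cross_entropy(predicted):
    # Probability assigned to the correct class of each row.
    true_class_prob = np.sum(labels * predicted, axis=1)
    return -np.mean(np.log(true_class_prob))

initial_loss = cross_entropy(forward(features)[1])
for _ in range(5000):
    hidden, predicted = forward(features)
    delta = predicted - labels                         # output-layer error
    backprop = (delta @ w2.T) * hidden * (1 - hidden)  # error pushed back through the sigmoid
    w2 -= learning_rate * hidden.T @ delta
    b2 -= learning_rate * delta.sum(axis=0, keepdims=True)
    w1 -= learning_rate * features.T @ backprop
    b1 -= learning_rate * backprop.sum(axis=0, keepdims=True)

final_loss = cross_entropy(forward(features)[1])
print(initial_loss, final_loss)
print(np.round(forward(features)[1], 3))
```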

Some things we ignored

For brevity, I didn't address the score (that tells us how close we are to our desired values).

Also, any attempt at regularization (that attempts to avoid over-fitting) was ignored. The DL4J XorExample set the L1 (New York taxi distance) and L2 (the Euclidean distance) to zero so we're true to the original there.

[1] Deep Learning (Goodfellow et al)
