I hid 25 samples among 10,000 that are different from the rest, then used a Variational Autoencoder (VAE) to find them. Here is my account of trying to find those anomalies.
Data
Each sample has 50 points in time (represented by a Long) that may be either bunched together in a few hours or scattered randomly across a calendar year.
By far the biggest improvement came from normalizing this data. Without normalization, the neural net was pretty useless.
So, before normalization, a single sample looks like:
1371877007, 1386677403, 1371954061, 1361641428, 1366151894, 1366819029, 1380334620, 1379574699, 1359865022, 1377141715, 1370407230, 1358989583, 1373813009, 1364038087, 1361247093, 1367920808, 1379825490, 1379755109, 1363559641, 1373945939, ...
and after normalization, it may look something like this:
0.2737, -0.0451, -0.6842, 1.6797, -1.3887, -0.0844, -0.6952, 0.9683, 0.7747, 1.6273, -1.0817, -0.0380, 1.3321, 0.2864, 0.9135, -1.3018, 1.0786, 0.0830, -0.3311, -1.6751, 1.6270, 1.4007, 0.8983, ...
Note that the normalized data is (roughly) zero-centred and (very roughly) in the region of -1 to 1. See below for why this is relevant.
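One simple way to get data of that shape is a per-sample z-score: subtract the sample's mean and divide by its standard deviation. A minimal sketch in plain Java:

```java
/**
 * Z-score normalization of one sample: subtract the mean and divide by the
 * standard deviation, giving roughly zero-centred values mostly within a few units of 0.
 */
static double[] normalize(long[] timestamps) {
    int n = timestamps.length;

    double mean = 0;
    for (long t : timestamps) mean += t;
    mean /= n;

    double variance = 0;
    for (long t : timestamps) variance += (t - mean) * (t - mean);
    double stdDev = Math.sqrt(variance / n);

    double[] normalized = new double[n];
    for (int i = 0; i < n; i++) {
        normalized[i] = (timestamps[i] - mean) / stdDev;
    }
    return normalized;
}
```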
Aside: it's really, really important for the data to be reproducible across runs. That is, although the data is random, it must be reproducibly random. I wasted a lot of time being fooled by randomness in the data.
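In practice that just means a fixed seed everywhere randomness creeps in: the data generator and the network itself. A minimal sketch (the seed value is arbitrary):

```java
import java.util.Random;

long seed = 42L;                 // arbitrary, but fixed across runs

// Reproducibly random data: the same samples (and the same 25 anomalies) every run.
Random dataRng = new Random(seed);

// DL4J's NeuralNetConfiguration.Builder also takes .seed(seed) so that weight
// initialisation (and hence training) is reproducible too; see the builder sketch below.
```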
What are VAEs?
"It is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a generative model." (Keras Blog)
"Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation." (TowardsDataScience)
Tuning
According to Andrej Karpathy:
"The most common hyperparameters in context of Neural Networks include:
- the initial learning rate
- learning rate decay schedule (such as the decay constant)
- regularization strength (L2 penalty, dropout strength)"
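In DL4J, all three of those live on the NeuralNetConfiguration builder. A sketch (the values are illustrative; the ones I actually settled on are discussed below):

```java
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.nd4j.linalg.learning.config.Adam;

NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
    .seed(42)                 // fixed seed, as per the aside above
    .updater(new Adam(1e-4))  // initial learning rate (Adam can also take a decay schedule)
    .l2(1e-3)                 // L2 regularization strength
    .dropOut(0.5);            // dropout (in DL4J this is the probability of retaining an activation)
```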
Activation Functions
"If you know the outputs have certain bounds, it makes sense to use an activation function that constrains you to those bounds." (StackOverflow)
Given our data, one might think that HARDSIGMOID, SIGMOID, SWISH etc. or even TANH would yield the best results (SWISH is just x*sigmoid(x)), whereas RELU, ELU etc. don't model it at all.
[Figure: activation functions, from Shruti Jadon on Medium.com]
But there is an interesting opinion at TowardsDataScience:
"The question was which one is better to use?
"Answer to this question is that nowadays we should use ReLu which should only be applied to the hidden layers. And if your model suffers form dead neurons during training we should use leaky ReLu or Maxout function.
"It’s just that Sigmoid and Tanh should not be used nowadays due to the vanishing Gradient Problem which causes a lots of problems to train,degrades the accuracy and performance of a deep Neural Network Model."However, I found no difference in accuracy with my VAE using ELU, LEAKYRELU nor RELU. In fact, playing with the 21 activation functions that came with DL4J, I did not see any variety when applying them to the hidden layers.
I only saw a big difference when changing the activation function at the bottleneck layer and in the reconstruction distribution (see below).
Regularization
Setting the L2 regularization parameter gave me the following results
L2 | Mean found (of 25) | Accuracy (%) | Standard deviation |
10⁻⁵ | 15.2 | 60.8 | 0.837 |
10⁻⁴ | 15.2 | 60.8 | 0.837 |
10⁻³ | 15.2 | 60.8 | 0.837 |
10⁻² | 15.2 | 60.8 | 0.837 |
10⁻¹ | 16 | 64 | 0.707 |
10⁰ | 16.2 | 64.8 | 0.447 |
10¹ | 16 | 64 | 0 |
10² | 16 | 64 | 0 |
All using the SWISH activation function.
Batch Sizes
Accuracy hovered around 16 or 17 up to and including a batch size of 64. Beyond that, it dropped off quickly, to 13 (53%) at a batch size of 128 and 6 (24%) at 256.
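For the record, the batch size is just whatever the DataSetIterator serves up; with everything in memory that's something like the following (loadSamples is a hypothetical stand-in for however the data gets built):

```java
import java.util.List;
import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

List<DataSet> samples = loadSamples();   // hypothetical loader: the 10,000 normalized samples
int batchSize = 64;                      // anything up to 64 was fine; 128 and 256 were not
DataSetIterator iterator = new ListDataSetIterator<>(samples, batchSize);
```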
Updater
Adam with an initial learning rate of 10⁻⁴ seemed to give better accuracy, at 17.8 (71.2%, sd 0.422), than RmsProp(10⁻³) and AdaDelta (both of which yielded an accuracy of 16 (64%) with a standard deviation of 0).
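Swapping updaters is a one-line change on the builder sketched above; these were the three I compared:

```java
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.learning.config.AdaDelta;
import org.nd4j.linalg.learning.config.RmsProp;

builder.updater(new Adam(1e-4));        // best: 17.8 (71.2%)
// builder.updater(new RmsProp(1e-3));  // 16 (64%)
// builder.updater(new AdaDelta());     // 16 (64%); AdaDelta takes no learning rate
```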
Reconstruction Distribution
Now fiddling with this knob did make quite a difference.
All the results so far used a BernoulliReconstructionDistribution with a SIGMOID activation. This was because I had cribbed the code from somewhere else, where the Bernoulli distribution was more appropriate since it represents "binary or 0 to 1 data only".
My data was better approximated by a Gaussian than a Bernoulli, so using a GaussianReconstructionDistribution with a TANH gave better results.
The DL4J JavaDocs state: "For activation functions, identity and perhaps tanh are typical - though tanh (unlike identity) implies a minimum/maximum possible value for mean and log variance. Asymmetric activation functions such as sigmoid or relu should be avoided". However, I didn't find SIGMOID or RELU made much difference to my data/ANN combination (although using CUBE led to zero anomalies being found).
This is similar to what I blogged last year about modelling features: the features should (in a very loose sense) model your output.
Anyway, using a Gaussian reconstruction distribution, accuracy jumped to 18.6 (74.4%) albeit with a large standard deviation of 3.438.
Then, through brute force, I discovered that using SOFTPLUS in both the pzxActivationFunction and the GaussianReconstructionDistribution gave me an average accuracy of 19.1 (76.4%, sd 3.542). This was the high-water mark of my investigation.
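Putting the pieces together, the best-scoring VAE layer looked something like this (a sketch: the latent size of 2 is illustrative, and the layer sizes follow the architecture described below):

```java
import org.deeplearning4j.nn.conf.layers.variational.GaussianReconstructionDistribution;
import org.deeplearning4j.nn.conf.layers.variational.VariationalAutoencoder;
import org.nd4j.linalg.activations.Activation;

int inputSize = 50;        // 50 points in time per sample
int x = inputSize / 2;     // encoder/decoder layer size: half the input vector

VariationalAutoencoder vae = new VariationalAutoencoder.Builder()
    .nIn(inputSize)
    .nOut(2)                                    // size of the latent space (illustrative)
    .encoderLayerSizes(x)                       // a single encoder layer of size x
    .decoderLayerSizes(x)                       // a single decoder layer of size x
    .activation(Activation.SWISH)               // hidden-layer activation: made little difference (see above)
    .pzxActivationFunction(Activation.SOFTPLUS) // SOFTPLUS here and...
    .reconstructionDistribution(
        new GaussianReconstructionDistribution(Activation.SOFTPLUS)) // ...here gave the best results
    .build();
```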
Architecture
All the results so far used just a single encoder layer and a single decoder layer, each half the size of the input vector. Let's call this size x.
Using hidden layers of size [x, x, 2, 2, x, x] did not change the best accuracy. Neither did [x, x/2, 2, 2, x/2, x] nor [x, x/2, x/4, 2, 2, x/4, x/2, x] nor even [x, x/2, x/4, 1, 1, x/4, x/2, x].
So, this avenue proved fruitless.
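For completeness, those deeper variants are just longer encoder/decoder size arrays on the same builder, continuing the sketch above; for example, the [x, x/2, x/4, 2, 2, x/4, x/2, x] version:

```java
// The [x, x/2, x/4, 2, 2, x/4, x/2, x] variant (no better than the single-layer version).
VariationalAutoencoder deeperVae = new VariationalAutoencoder.Builder()
    .nIn(inputSize)
    .nOut(2)                              // the "2, 2" in the middle is the latent size
    .encoderLayerSizes(x, x / 2, x / 4)
    .decoderLayerSizes(x / 4, x / 2, x)
    .pzxActivationFunction(Activation.SOFTPLUS)
    .reconstructionDistribution(new GaussianReconstructionDistribution(Activation.SOFTPLUS))
    .build();
```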
Conclusion
I am still something of a neophyte when it comes to neural nets, but although I could improve the accuracy, it still seemed more like guesswork than following a process. There was no a priori way I know of that would have indicated that SOFTPLUS was the best activation function to use in the reconstruction, for instance.
It's clear that there are some rules-of-thumb but I wish somebody would publish a full list. Even then, it seems very data-dependent. "For most data sets only a few of the hyper-parameters really matter, but [...] different hyper-parameters are important on different data sets" (Random Search for Hyper-Parameter Optimization).