Tuesday, February 26, 2019

Hands-on with Variational Autoencoders


I've hidden 25 samples among 10,000 that are different from the rest, then used a Variational Autoencoder (VAE) to find them. Here is my account of trying to find the anomalies.

Data

Each sample has 50 points in time (represented by a Long) that may be either bunched together in a few hours or scattered randomly across a calendar year.
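The generator I used isn't shown here, but a sketch of the sort of data I mean might look like the following (NumPy purely for illustration, and the choice of which behaviour marks the 25 hidden samples is mine and arbitrary):

import numpy as np

np.random.seed(0)   # fix the seed so the "random" data is reproducibly random (see the aside below)

N_SAMPLES, N_POINTS, N_ANOMALIES = 10_000, 50, 25
YEAR_START = 1_356_998_400          # 2013-01-01 as a Unix timestamp
YEAR = 365 * 24 * 3600

# Ordinary samples: 50 timestamps scattered uniformly across the calendar year.
data = np.random.randint(YEAR_START, YEAR_START + YEAR, size=(N_SAMPLES, N_POINTS))

# The hidden samples: their 50 timestamps are bunched into a window of a few hours.
anomalous_rows = np.random.choice(N_SAMPLES, N_ANOMALIES, replace=False)
for row in anomalous_rows:
    window_start = np.random.randint(YEAR_START, YEAR_START + YEAR - 3 * 3600)
    data[row] = np.random.randint(window_start, window_start + 3 * 3600, size=N_POINTS)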

By far the single biggest improvement came from normalizing this data. Without normalization, the neural net was pretty useless.

So, before normalization, a single sample looks like:

1371877007, 1386677403, 1371954061, 1361641428, 1366151894, 1366819029, 1380334620, 1379574699, 1359865022, 1377141715, 1370407230, 1358989583, 1373813009, 1364038087, 1361247093, 1367920808, 1379825490, 1379755109, 1363559641, 1373945939, ...

and after normalization, it may look something like this:

0.2737,   -0.0451,   -0.6842,    1.6797,   -1.3887,   -0.0844,   -0.6952,    0.9683,    0.7747,    1.6273,   -1.0817,   -0.0380,    1.3321,    0.2864,    0.9135,   -1.3018,    1.0786,    0.0830,   -0.3311,   -1.6751,    1.6270,    1.4007,    0.8983,    ...

Note that the normalized data is (roughly) zero-centred and (very roughly) in the region of -1 to 1. See below for why this is relevant.
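The numbers above look like plain standardization: subtract the mean and divide by the standard deviation. A minimal NumPy sketch of that, purely for illustration (not the code I actually ran):

import numpy as np

np.random.seed(0)

# One sample: 50 Unix timestamps drawn from somewhere in a calendar year.
sample = np.random.randint(1_356_998_400, 1_388_534_400, size=50).astype(float)

# Zero-centre and scale to unit variance. For timestamps spread uniformly over a
# year this lands everything roughly in the -1.7 to 1.7 range.
normalized = (sample - sample.mean()) / sample.std()

print(normalized.mean(), normalized.std())   # ~0 and ~1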

Aside: it's really, really important for the data to be reproducible across runs. That is, although the data is random, it must be reproducibly random. I wasted a lot of time being fooled by randomness in the data.


What are VAEs?

"It is an autoencoder that learns a latent variable model for its input data. So instead of letting your neural network learn an arbitrary function, you are learning the parameters of a probability distribution modeling your data. If you sample points from this distribution, you can generate new input data samples: a VAE is a generative model." (Keras Blog)

With vanilla autoencoders, "If the [latent space] has discontinuities (e.g. gaps between clusters) and you sample/generate a variation from there, the decoder will simply generate an unrealistic output, because the decoder has no idea how to deal with that region of the latent space.

"Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation." (TowardsDataScience)


Tuning

According to Andrej Karpathy:

"The most common hyperparameters in context of Neural Networks include:
  • the initial learning rate
  • learning rate decay schedule (such as the decay constant)
  • regularization strength (L2 penalty, dropout strength)"
But first, let's look at:


Activation Functions

"If you know the outputs have certain bounds, it makes sense to use an activation function that constrains you to those bounds." (StackOverflow)

Given our data, one might think that HARDSIGMOID, SIGMOID, SWISH etc. or even TANH would yield the best results (SWISH is just x*sigmoid(x)), whereas RELU, ELU etc. don't model the data's range at all.
From Shruti Jadon on Medium.com
(Graphic from here - EDIT see a more comprehensive list here).
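For reference, here are a few of those activation functions written out in NumPy (my own definitions, so e.g. the hard-sigmoid form may differ slightly from DL4J's): SIGMOID, HARDSIGMOID, SWISH, SOFTPLUS and TANH all squash or smooth the input, whereas RELU just clips it at zero.

import numpy as np

def sigmoid(x):      return 1.0 / (1.0 + np.exp(-x))
def hard_sigmoid(x): return np.clip(0.2 * x + 0.5, 0.0, 1.0)   # one common piecewise-linear form
def swish(x):        return x * sigmoid(x)                     # SWISH is just x * sigmoid(x)
def softplus(x):     return np.log1p(np.exp(x))                # a smooth approximation of RELU
def relu(x):         return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, hard_sigmoid, swish, softplus, np.tanh, relu):
    print(f.__name__, np.round(f(x), 3))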

But there is an interesting opinion at TowardsDataScience:
"The question was which one is better to use? 
"Answer to this question is that nowadays we should use ReLu which should only be applied to the hidden layers. And if your model suffers form dead neurons during training we should use leaky ReLu or Maxout function. 
"It’s just that Sigmoid and Tanh should not be used nowadays due to the vanishing Gradient Problem which causes a lots of problems to train,degrades the accuracy and performance of a deep Neural Network Model."
However, I found no difference in accuracy with my VAE using ELU, LEAKYRELU or RELU. In fact, playing with the 21 activation functions that come with DL4J, I did not see any variation in accuracy when applying them to the hidden layers.

The choice of activation function only made a big difference at the bottleneck layer and in the reconstruction distribution (see below).


Regularization

Setting the L2 regularization parameter gave me the following results ("Mean" is the mean number of the 25 hidden anomalies found across runs; "Accuracy" is that figure as a percentage):

L2       Mean    Accuracy (%)    Standard deviation
10^-5    15.2    60.8            0.837
10^-4    15.2    60.8            0.837
10^-3    15.2    60.8            0.837
10^-2    15.2    60.8            0.837
10^-1    16      64              0.707
10^0     16.2    64.8            0.447
10^1     16      64              0
10^2     16      64              0

All using the SWISH activation function.
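As a reminder of what that knob controls: L2 regularization adds the sum of the squared weights, scaled by the strength above, to whatever loss the network is minimising. A toy sketch (my own notation, nothing DL4J-specific):

import numpy as np

np.random.seed(0)

def l2_penalty(weights, l2_strength):
    # l2_strength is the value being varied in the table above.
    return l2_strength * sum(np.sum(w ** 2) for w in weights)

# Hypothetical weight matrices of a small encoder.
weights = [np.random.randn(50, 25), np.random.randn(25, 2)]

reconstruction_loss = 1.0   # placeholder for the VAE's own loss
for l2 in (1e-5, 1e-3, 1e-1, 1e1):
    print(l2, reconstruction_loss + l2_penalty(weights, l2))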


Batch Sizes

Accuracy hovered around 16 or 17 up to and including a batch size of 64. After that, it dropped off quickly to an accuracy of 13 (53%) and 6 (24%) for batch sizes of 128 and 256.


Updater

Adam with an initial learning rate of 10^-4 seemed to give better accuracy, at 17.8 (71.2%, sd 0.422), than RmsProp (10^-3) and AdaDelta (which both yielded an accuracy of 16 (64%) with a standard deviation of 0).
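The "initial value" here is the learning rate handed to the updater. A bare-bones sketch of a single Adam step (the standard textbook formulas, not DL4J's implementation) shows where it enters:

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for parameters w given gradient g; lr is the learning rate being tuned.
    m = beta1 * m + (1 - beta1) * g          # running estimate of the gradient's mean
    v = beta2 * v + (1 - beta2) * g ** 2     # running estimate of the gradient's (uncentred) variance
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, g=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
print(w)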


Reconstruction Distribution

Now fiddling with this knob did make quite a difference.

All the results so far were using a BernoulliReconstructionDistribution with a SIGMOID. This was because I had cribbed the code from somewhere else where the Bernoulli distribution was more appropriate as it represents "binary or 0 to 1 data only".

My data was not best approximated by a Bernoulli but by a Gaussian. So, using a GaussianReconstructionDistribution with a TANH gave better results.
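The difference is easiest to see in the reconstruction log-likelihoods themselves. A sketch using the standard formulas (not DL4J's classes): the Bernoulli version only makes sense for data in [0, 1], whereas the Gaussian version happily scores real-valued, roughly zero-centred data like mine.

import numpy as np

def bernoulli_log_likelihood(x, p, eps=1e-9):
    # Only meaningful for x in [0, 1]; p is the decoder's output after a SIGMOID.
    return np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps))

def gaussian_log_likelihood(x, mu, log_var):
    # Scores real-valued data; mu and log_var come from the decoder.
    return -0.5 * np.sum(log_var + np.log(2 * np.pi) + (x - mu) ** 2 / np.exp(log_var))

x = np.array([0.27, -0.05, -0.68, 1.68])        # normalized (real-valued) data
mu = np.array([0.25, 0.00, -0.60, 1.50])        # hypothetical decoder means
log_var = np.zeros_like(x)

print(gaussian_log_likelihood(x, mu, log_var))          # well defined for negative values...
print(bernoulli_log_likelihood(x, np.clip(mu, 0, 1)))   # ...whereas Bernoulli assumes x is in [0, 1]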

The DL4J JavaDocs state: "For activation functions, identity and perhaps tanh are typical - though tanh (unlike identity) implies a minimum/maximum possible value for mean and log variance. Asymmetric activation functions such as sigmoid or relu should be avoided". However, I didn't find SIGMOID or RELU made much difference to my data/ANN combination (although using CUBE led to zero anomalies being found).

This is similar to what I blogged last year: when modelling, the features should (in a very loose sense) model your output.

Anyway, using a Gaussian reconstruction distribution, accuracy jumped to 18.6 (74.4%) albeit with a large standard deviation of 3.438.

Then, through brute force, I discovered that using SOFTPLUS in both pzxActivationFunction and the GaussianReconstructionDistribution gave me an average accuracy of 19.1 (76.4%, sd 3.542). This was the high-water mark of my investigation.


Architecture

All the results so far were using just a single encoder layer and a single decoder layer, each half the size of the input vector. Let's call this value x.

Using hidden layers of size [x, x, 2, 2, x, x] did not change the best accuracy. Neither did [x, x/2, 2, 2, x/2, x] nor [x, x/2, x/4, 2, 2, x/4, x/2, x] nor even [x, x/2, x/4, 1, 1, x/4, x/2, x].

So, this avenue proved fruitless.


Conclusion

I am still something of a neophyte when it comes to neural nets, but although I can improve the accuracy, it still feels more like guesswork than following a process. There was no a priori way that I know of which would have indicated that SOFTPLUS was the best activation function to use in the reconstruction, for instance.

It's clear that there are some rules-of-thumb but I wish somebody would publish a full list. Even then, it seems very data-dependent. "For most data sets only a few of the hyper-parameters really matter, but [...] different hyper-parameters are important on different data sets" (Random Search for Hyper-Parameter Optimization).

Saturday, February 2, 2019

Statistical Covariance



The meaning of covariance in statistics is completely different from its meaning in programming languages, but it's very important to data scientists. Here are some miscellaneous notes I made while playing around.

The covariance matrix

... can be calculated for matrix X with something like this:

    import numpy as np            # assuming X is an n-by-p NumPy array (rows are observations)
    n = X.shape[0]                # the number of observations
    x = X - np.mean(X, axis = 0)  # centre each column on its mean
    C = np.dot(x.T, x) / (n - 1)  # the p-by-p covariance matrix (equivalent to np.cov(X, rowvar=False))

For x, we're just taking the means of each column and subtracting each mean from each element in its respective column.

Note that in the diagram above, the three vectors from the covariance all live in the same plane. This is not a coincidence. The covariance matrix “is ALWAYS singular for ANY square matrix because you subtract off the column means. This guarantees that you reduces the rank by one (unless it is already singular) before multiplying the matrix with its transpose.” (from MathWorks).
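A quick NumPy check of that claim (my own sketch; the R code below makes the same point via the determinant): centring the columns of a random square matrix knocks its rank down by one.

import numpy as np

np.random.seed(1)
n = 10
X = np.random.rand(n, n)
x = X - X.mean(axis=0)                            # subtract off the column means

print(np.linalg.matrix_rank(X))                   # almost surely 10 for a random matrix
print(np.linalg.matrix_rank(x))                   # 9: centring has reduced the rank by one
print(np.linalg.det(np.dot(x.T, x) / (n - 1)))    # ~0, bar floating-point noise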

Take this R code:

NROW <- 10
NCOL <- 10
res <- rep(NA, 1000)
for (i in 1:1000) {
   x <- matrix(runif(NROW*NCOL), ncol = NCOL, nrow = NROW)
   C <- cov(x)
   res[i] <- det(C)
}
hist(res, breaks = 100, main="det(C)")



and look at the histogram:

The reason the determinant isn't exactly zero every time is simply floating-point rounding error. You can see from the spread of the histogram that it should really be zero.

Anyway, there are many properties of an NxN matrix whose determinant is zero that are all equivalent (here are some) but the one we are interested in is that "the columns of the matrix are dependent vectors in ℝ^N".

The proof in 2-D looks like this. Take the matrix:

| a  c |
| b  d |

The means of the columns are:

μ1 = (a+b)/2
μ2 = (c+d)/2

So, centering this matrix results in:

| a-μ1  c-μ2 |
| b-μ1  d-μ2 |

Substituting in the values for μ1 and μ2 gives:

| (a-b)/2  (c-d)/2 |
| (b-a)/2  (d-c)/2 |

Multiplying this matrix by its transpose gives:

1/4 | (a-b)² + (c-d)²    -(a-b)² - (c-d)² |
    | -(a-b)² - (c-d)²    (a-b)² + (c-d)² |

and the determinant is therefore:

(1/4²) [ ((a-b)² + (c-d)²)² - ((a-b)² + (c-d)²)² ] = 0

The matrix is always singular, which necessarily means there is a linear dependency among its column vectors.

This generalizes to higher dimensions.
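If you don't fancy the algebra, SymPy will confirm the 2-D case symbolically (a quick check of the argument above, not part of the original proof):

import sympy as sp

a, b, c, d = sp.symbols('a b c d')
M = sp.Matrix([[a, c],
               [b, d]])

col_means = sp.Matrix([[(a + b) / 2, (c + d) / 2]] * 2)   # each column's mean, repeated for both rows
centred = M - col_means

print(sp.simplify((centred * centred.T).det()))           # 0: the product is always singular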

Note that we might choose the correlation matrix rather than the covariance matrix (see StackOverflow).

Note also that the “units [of covariance] are the product of the units of X and Y. So the covariance of weight and height might be in units of kilogram-meters, which doesn’t mean much.” [Think Stats p108].


Relationship with Cosine Similarities and Pearson Correlation

The triangle inequality is a defining property of norms and measures of distance. For vectors x and y it simply says that ‖x + y‖ ≤ ‖x‖ + ‖y‖: one side of a triangle can never be longer than the other two combined. It is a consequence of the law of cosines and a defining property of distances, that is, functions that satisfy:

non-negativity:      f(x, y) ≥ 0
identity:            f(x, y) = 0 if and only if x = y
symmetry:            f(x, y) = f(y, x)
triangle inequality: f(x, z) ≤ f(x, y) + f(y, z)

Of these, only symmetry holds for cosine similarity, so it is not a true distance, although it can be converted into one (see Wikipedia). Despite this, we can still use it for comparing data.

"Cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. What is invariant, though, is the Pearson correlation" [1] which is very similar. The relationship between the two looks like this:

Pearson correlation = cosine similarity (x - x̄, y - ȳ)

"People usually talk about cosine similarity in terms of vector angles, but it can be loosely thought of as a correlation, if you think of the vectors as paired samples." [1]

Pearson correlation is the normalized form of the covariance: the covariance divided by the product of the two standard deviations. Or, to put the covariance in Dirac notation:

covariance = <x - x̄, y - ȳ> / n

where n is the number of samples (dividing by n just takes the mean of the element-wise products), which is the same as:

cov(X, Y) = E[(X - E[X])(Y - E[Y])]

And so all these concepts are related.
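A small NumPy sketch (my own, with arbitrary numbers) makes those relationships concrete: the Pearson correlation is the cosine similarity of the centred vectors, and it is also the covariance divided by the product of the standard deviations.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([1.0, 3.0, 3.0, 8.0])

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

xc, yc = x - x.mean(), y - y.mean()           # centre both vectors

covariance = np.dot(xc, yc) / len(x)          # <x - x̄, y - ȳ> / n
pearson = covariance / (x.std() * y.std())    # covariance normalized by the standard deviations

print(cosine_similarity(xc, yc))              # cosine similarity of the *centred* vectors...
print(pearson)                                # ...equals the Pearson correlation
print(np.corrcoef(x, y)[0, 1])                # and NumPy agrees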

[1] Brendan O'Connor's blog