Tuesday, September 12, 2017

Neural Networks are just Matrices

DeepLearning4J brings neural nets to Java programmers. They suggest in the getting started section to run the XorExample. This is a neural net that, given XOR inputs and outputs, learns its logic. This is non-trivial for a simple neural net (see here) as the true and false values are not linearly separable in a single XOR matrix. DL4J provides a way of making more complicated neural nets but hides a lot of detail.


The network in XorExample "consists in 2 input-neurons, 1 hidden-layer with 4 hidden-neurons, and 2 output-neurons... the first fires for false, the second fires for true".

But instead of talking about neurons, it's much easier to think of this neural net as matrices (at least if you're familiar with simple linear algebra).

So, imagine this neural net of just:

  • A 4 x 2 matrix of the inputs ("features").
  • A hidden layer that is a 2 x 4 matrix 
  • A layer that is a 4 x 2 matrix that yields the output.

We need to apply functions to these matrices before we multiply, but that's essentially it.

Multi-layer Neural Nets in a few lines of Python

To do this, I'm not going to use TensorFlow or other frameworks dedicated to neural nets. I'll just use Numpy as basically syntactic sugar around manipulating arrays of arrays.

Also, I'll present a more intuitive approach to the maths. A more thorough analysis can be found on Brian Dohansky's blog here.

Finally, I tried to do this in Scala using the Breeze linear algebra library but it was just so much easier to do it in Python and Numpy as it ran in a fraction of the time it took Scala to even compile.

The code

As mentioned, we need Numpy.

import numpy as np

Now, let's take the input to a XOR operation (our first matrix):

features = np.matrix('0.0, 0.0;'
                     '1.0, 0.0;'
                     '0.0, 1.0;'
                     '1.0, 1.0')

and the expected output:

labels = np.matrix('1.0, 0.0;'
                   '0.0, 1.0;'
                   '0.0, 1.0;'
                   '1.0, 0.0')

(Remember that that this is not trying to be a truth table. Instead, the first column indicates that the output is true if it's 1 and and the second column indicates it's false if it's 1).

We need those other 2 matrices. It doesn't matter what their values are initially as we'll correct them whatever they are. So, let's chose some random values but in a well-defined shape:

weightsLayer1 = np.random.rand(2, 4)
weightsLayer2 = np.random.rand(4, 2)

We also need the gradient of the weights and biases. We could have subsumed the biases into our matrices - that's mathematically equivalent - but the DeepLearning4J example doesn't do this so we won't either.  The weights and biases for the first and second layers respectively are:

_0_W = np.zeros([2, 4])
_0_b = np.zeros([1, 4])
_1_W = np.zeros([4, 2])
_1_b = np.zeros([1, 2])

Finally, we need a step value and a batch size:

learning_rate = 0.1
mini_batch_size = np.shape(features)[0]

Now we can do our first multiplication (or forward propagation):

    s0 = features * weightsLayer1
    s0 += _0_b   

But we need to apply a function to this (an activation function). In this example, we'll use a sigmoid function. It has a simple derivative that we'll need later.

def f_sigmoid(X):
        return 1 / (1 + np.exp(-X))

So, now we can apply the sigmoid activation function to the first layer:

sigmoided = f_sigmoid(s0)

This we'll feed into the next layer with another matrix multiplication:

    s1 = sigmoided * weightsLayer2
    s1 += _1_b

No we need our second activation function to apply to this matrix. It's the softmax function that basically normalizes to 1 all the values in each row allowing us to treat each row as a probability distribution. It looks like this (as stolen from Brian):

def f_softmax(X):
    Z = np.sum(np.exp(X), axis=1)
    Z = Z.reshape(Z.shape[0], 1)
    return np.exp(X) / Z

We apply this to the output from the previous layer and find the delta with what the output should be:

softmaxed = f_softmax(s1)
delta = softmaxed - labels

We calculate the delta weighted according to the weights of this layer, Transposing appropriately:

epsilonNext = (weightsLayer2 * delta.T).T

Now, it's a good job that sigmoid function has an easy derivative. It looks like this:

dLdz = np.multiply(sigmoided, (1-sigmoided)) 

where, note, this is an element-wise multiplication, not a matrix multiplication. Similarly, we calculate the back propagation (which "allows the information from the cost to then flow backward through the network in order to compute the gradient" [1]) using the same Numpy operation:

backprop = np.multiply(dLdz, epsilonNext)

Intuitively, you can think of this as each element of the weighted delta only being multiplied by the gradient of the function at that element.

"The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm such as stochastic gradient descent, is used to perform learning using the gradient." [1]
Anyway, we can now update the gradients and the weights:

    _0_W = (features.T * backprop) + _0_W
    _1_W = (sigmoided.T * delta) + _1_W
    _0_b = _0_b - (learning_rate * np.sum(backprop, 0))
    _1_b = _1_b - (learning_rate * np.sum(delta, 0))
    weightsLayer1 = weightsLayer1 - (learning_rate * _0_W)
    weightsLayer2 = weightsLayer2 - (learning_rate * _1_W)

Note that the biases are derived from the columnar sums of the backprop and delta matrices.

Now, repeat this some 500 times and the output looks like:

print softmaxed

[[  1.00000000e+00   3.16080274e-36]
 [  1.50696539e-52   1.00000000e+00]
 [  1.64329041e-30   1.00000000e+00]
 [  1.00000000e+00   8.51240114e-40]]

which for all intents and purposes is the same as labels. QED.

Some things we ignored

For brevity, I didn't address the score (that tells us how close we are to our desired values).

Also, any attempt at regularization (that attempts to avoid over-fitting) was ignored. The DL4J XorExample set the L1 (New York taxi distance) and L2 (the Euclidean distance) to zero so we're true to the original there.

[1] Deep Learning (Goodfellow et al)

No comments:

Post a Comment