## Wednesday, May 23, 2018

Gradient Descent is perhaps the simplest means of finding minima. The maths is pretty easy. We are trying to minimize the sum of squared errors of our target value and our algorithm's output. Let J(w) be our cost function which is a function of weights w and is based on the SSE. Then

J(w) = Σi(targeti - outputi)2 / 2

where i is the index of a sample in our dataset and we divide by 2 for convenience (it will soon disappear).

Say that

f(w) = target - output = y - wT . x

then using the chain rule, we can calculate:

dJ / dw = (dJ / df) . (df / dw)

if η is our step value then Δwj, the value by which we adjust w.

≈ -η (dJ / dw)
= -η Σi(targeti - outputi)(-xij)

the negative coming from the fact that we want to work in the opposite direction of the gradient as we are finding the minimum. As you can see, it cancels out the leading negative and note, too, that the divide-by-2 has also disappeared.

Now, the difference between GD and Stochastic Gradient Descent comes in how we implement the algorithm. In GD, we calculate wj for all i before moving to the next position. In SGD, we calculate wj for each sample, i, and then move on (see Quora).

Note that in TensorFlow you can achieve SGD by using a mini-batch size of 1 (StackOverflow).

Also note "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize" (from here).

"Gradient descent maximizes a function using knowledge of its derivative. Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative" (from StackOverflow).

"More people should be using Newton's method in machine learning... Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness" (ibid). The post includes a good chat about best way to find minima and how GD compares to Newton's methods.

LBFGS and Spark

Spark's default solver for a MultilayerPerceptronClassifier uses lbfgs (a memory efficient version of BFGS) which at least for my dataset is both fast and accurate. The accuracy looks like:

 Algorithm Iterations Batch Size Accuracy (%) l-bfgs 1000 64 94.4 l-bfgs 100 32 94.4 l-bfgs 100 128 94.0 gd 2000 128 92.4 gd 1000 128 90.0 gd 1000 1 (ie, SGD) 88.5 gd 100 128 15.6

[Note: figures are indicative only.]

For my configuration (10 executors each with 12gb of memory and 3 cores plus a driver with 20gb of memory), l-bfgs was several times faster.

Interestingly, "there used to be l-BFGS implemented in standard TensorFlow but it was deleted because it never got robust enough. Because tensorflow distributed framework is quite low-level so people sometimes get surprised by how much work it takes to get something robust" (GitHub).