Agile Java Man: Making the gradient

Wednesday, May 23, 2018

Making the gradient

Gradient Descent is perhaps the simplest means of finding minima. The maths is pretty easy. We are trying to minimize the sum of squared errors of our target value and our algorithm's output. Let J(w) be our cost function which is a function of weights w and is based on the SSE. Then

J(w) = Σ_i(target_i - output_i)² / 2

where i is the index of a sample in our dataset and we divide by 2 for convenience (it will soon disappear).

Say that

f(w) = target - output = y - w^T . x

then using the chain rule, we can calculate:

dJ / dw = (dJ / df) . (df / dw)

if η is our step value then Δw_j, the value by which we adjust w.

≈ -η (dJ / dw)
= -η Σ_i(target_i - output_i)(-x_ij)

the negative coming from the fact that we want to work in the opposite direction of the gradient as we are finding the minimum. As you can see, it cancels out the leading negative and note, too, that the divide-by-2 has also disappeared.

Now, the difference between GD and Stochastic Gradient Descent comes in how we implement the algorithm. In GD, we calculate w_j for all i before moving to the next position. In SGD, we calculate w_j for each sample, i, and then move on (see Quora).

Note that in TensorFlow you can achieve SGD by using a mini-batch size of 1 (StackOverflow).

Also note "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize" (from here).

Gradient Descent vs. Newton's Methods

"Gradient descent maximizes a function using knowledge of its derivative. Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative" (from StackOverflow).

"More people should be using Newton's method in machine learning... Of course, Newton's method will not help you with L1 or other similar compressed sensing/sparsity promoting penalty functions, since they lack the required smoothness" (ibid). The post includes a good chat about best way to find minima and how GD compares to Newton's methods.

LBFGS and Spark

Spark's default solver for a MultilayerPerceptronClassifier uses lbfgs (a memory efficient version of BFGS) which at least for my dataset is both fast and accurate. The accuracy looks like:

Algorithm	Iterations	Batch Size	Accuracy (%)
l-bfgs	1000	64	94.4
l-bfgs	100	32	94.4
l-bfgs	100	128	94.0
gd	2000	128	92.4
gd	1000	128	90.0
gd	1000	1 (ie, SGD)	88.5
gd	100	128	15.6

[Note: figures are indicative only.]

For my configuration (10 executors each with 12gb of memory and 3 cores plus a driver with 20gb of memory), l-bfgs was several times faster.

Interestingly, "there used to be l-BFGS implemented in standard TensorFlow but it was deleted because it never got robust enough. Because tensorflow distributed framework is quite low-level so people sometimes get surprised by how much work it takes to get something robust" (GitHub).

Agile Java Man

Wednesday, May 23, 2018

Making the gradient

No comments:

Post a Comment

Blog Archive

About Me