Sunday, October 8, 2017

Gradients Refresher


There is a lot of talk of gradients in Spark's machine learning library. Here are some notes to remind you what it's all about.

Approximation techniques

Newton method uses differentials. Secant methods use finite differences.

"For problems with more than a few variables, it also quickly became clear that the cost of evaluation Jacobians or Hessians [see below] at every step could be exorbitant. Faster methods were needed that might make use of inexact Jacobians or Hessians and/or inexact solutions of the associated linear equations, while still achieving superlinear convergence. An early breakthrough of this kind was the discovery of  Quasi-Newtonian methods in the 1960s by Broyden, Davidon, Fletcher and Powell, in which partial information is used to generate steadily improving estimates of the true Jacobian or Hessian or its matrix factors." [1]

Under certain circumstances "the convergence is quadratic, meaning that the error at each stage is roughly the square of the error at the previous stage" [1]

Taylor Expansion

A polynomial can be expressed as:

f(x) = f(a) + (x - a) f'(a) + (x - a)2 f''(a) / 2! + ... + (x - a)n fn(a) / n! + ...

where f' is the first differential, f'' the second and fn the n-th. (see Boas [2] p25 for a proof).

The multivariate version (that is, when x and a are vectors) looks like this:

f(x) = f(a) + (x - a) (∂f(a)/∂x+ ∂f(a)/∂x2) + (x - a)2 (∂2f(a)/∂x12 + 2 ∂f(a)/∂x1∂x2 + ∂2f(a)/∂x22) / 2! + ...

in the case of 2 dimensions (that is, when x = (x1, x2)) although it could be any number of dimensions.

For a polynomial of degree 2, the first three terms are all you need as any further differentiation will always yield zero. Alternatively, we might choose that the first three terms are "good enough". Either way, let's now drop all the other terms. We also note that the above equation, when we drop the extraneous terms, can be more succinctly expressed as matrices, thus:

f(x) = f(a) + pT ∇ f(a) + pT B p / 2

where p = x - a and B is the Hessian (see below). By putting some simple values into a square matrix, it's not hard to see that this last term in the matrix equation when expanded is the same as the more verbose equation above it.

Gradients

φ is a vector if φ is a field ("a collection of values with a plus operator and a times operator"[3]). In Cartesian 3 dimensions, it's:

φ = grad φ = i ∂φ /∂x + j ∂φ /∂y + k ∂φ /∂z

where φ is a scalar and i, j and k are unit vectors although there is no reason to restrict ourselves to 3 dimensions. It's just commonly used in the mathematical sciences.

"The vector ∇ φ is perpendicular to the surface φ = const" (see Boas [2] p292, 293, 296)

Divergence

If  acts on a vector, V:

V = i Vx(x, y, z) + j Vy(x, y, z) + k Vz(x, y, z)

then

 . V = div V = ∂Vx/∂x + ∂Vy/∂y + ∂Vz/∂z

Again, this is the special case in 3-D but we need not be restricted to just 3 dimensions.

Laplacian

The Laplacian occurs in a variety of problems in physics (wave equations, diffusion [2] etc). Again, this is the 3-D version but it can be generalized to any number of dimensions:

2 φ = div grad φ = ∂2φ /∂x2 + ∂2φ /∂y2 + ∂2φ /∂z2

"Now, that is just the mathematical definition, depending on the field where the Laplacian appears it may have different interpretations. The one I like is the relation with the mean curvature of a surface (how much bent the surface is). The mean curvature is the divergence of the unit normal of a surface, if the normal is given by the gradient of a function (that means that your surface is a level curve, an equipotential, an isopotential, etc...) then the curvature is proportional to the Laplacian." (from Quora)

Jacobians

"The Jacobian of x, y with respect to s, t is the determinant" [2]

J =
∂x/δs∂x/∂t
∂y/∂s∂y/∂t

The Jacobian matrix is the matrix for which this is the determinant, namely:

Ji,j = ∂fi / ∂xj

From what I can tell, 'Jacobian' can refer to either the matrix or its determinant depending on the context.

Hessians

A Hessian is a matrix that looks like:

Hi,j = ∂2f / ∂xi ∂xj

Confusingly, 2 the operator is used for both the Laplacian and the Hessian. However, if  is a column matrix, the Hessian is T and the Laplacian is ∇ T.

Note that the "trace of the Hessian (the sum of the diagonal elements) is the Laplace operator." (from Quora.com)

Note that the "Hessian is always symmetric" [1].

[1] The Princeton Companion to Mathematics
[2] Mathematical Methods in the Physical Sciences, Boas
[3] Coding the Matrix, Klein

No comments:

Post a Comment