## Friday, November 10, 2017

### BFGS - part 2

Last time we looked at finding the minimum over a field. We used Python to describe a field as:

f = z ** 2 * cos(x) ** 2 +  cos(y) ** 2

(note: I just made this equation up for fun and it has no other significance).

The direction towards the minimum for every point is a vector, thus we have a vector field. (See this description of a vector field at Quora where the direction somebody is looking while riding a rollercoaster is analogous to a vector field. "It's OK to gradually turn your head, which will result in a vector field --- at every point of the rollercoaster ride, you'd be looking somewhere.")

[Aside: this is a great resource for plotting vector fields in Python]

What we didn't address is: where on that vector would we find the minimum? For example, let's take one vector for a point (source code for the plots can be found here). Direction to the minimum at a give (arbitrary) point.
I've shown the surface for which f=0 on which this point lies just for giggles.

Anyway, taking this point, and plot f over its length. Let's introduce:

φ(α) = f(xk + αpk)

where α ≥ 0

and we get:

So, our vector points towards a minimum but where exactly is it?

The Wolfe Conditions

The principal reason for imposing the Wolfe conditions in an optimization algorithm where is to ensure convergence of the gradient to zero (see this wiki).

"However, it is usually expensive to accurately compute a minimizer of φ (and, in fact, usually impossible to find the global minimizer of φ given the available information)" 

Armijo Condition

This stops us from overshooting the minimum with a step that's too large.

"In this case, the iterations repeatedly go from one side of a valley to the other, always reducing f but never by much. The problem is that the reduction in each step is very little compared to the length of the steps--the steps are too long." 

f(xk + αpk) ≤ f(xk) + αc1pkT ∇ f(xk)

Curvature Condition

This stops us from choosing a step too small.

"The second way that an algorithm can reduce f without reducing it sufficiently is to take steps that are too short."  For example, we may asymptotically approach a value but that value is above the minimum.

pkT ∇f(xk + αpk) ≥ c2 pkT ∇f(xk)

where α is the step length and 0 < c1 < c2 < 1.

Combined

Combining these two conditions on the same graph gives us:

where the green dots indicate a valid region for the Armijo condition (we don't overshoot) and the red dots indicate valid regions for the curvature condition (we don't undershoot). Where the two overlap is a potential region for our next step.

How the Wolfe conditions apply

Let's introduce:

sk = xk+1  - xk
yk = ∇f(xk+1) - ∇f(xk)

then

Bk+1 sk = yk

This is the secant equation.

Since the curvature condition above can now be expressed in terms of s and y as:

sykT > 0

then we are already fulfilling this condition. Why? Take skT and multiply it by the secant equation. Note that since Bk+1 is  positive definite (as we proved in the first blog post) sBk+1  skT > 0. QED.

Another post should hopefully draw all this together and explain why BFGS works.

 Globalizing Newton's Method: Line Searches (Prof Gockenbach)