Sunday, November 19, 2017

The what and why of ALS


ALS is another matrix decomposition technique. It has been used by people like Netflix in recommender systems. "The idea of characterizing users as collections of items, and items as collections of users, is the basis of collaborative filtering, in which users are clustered based on common item preferences, and items are clustered based on the affinities of common users."[1]

In ALS, "you decompose your large user/item matrix into lower dimensional user factors and item factors" (Quora).

The algorithm distributes easily and Spark implements it out-of-the-box.

What is it?

The maths behind it says: imagine a matrix of ratings, R, that approximates to $X^T Y$ where X represents the users and Y the items.

Thus, the problem is reduced to:

$$\min_{X,Y} \sum_{r_{ui}} \left(r_{ui} - x_u^T y_i\right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$

So, basically we want to minimize a standard least squares problem plus a regularization penalty. ALS does this by fixing X and solving for Y, then fixing Y and solving for X. Rinse and repeat until it converges to an answer.
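As a very rough sketch of those alternating updates in Python - everything here (the function name, a small dense ratings matrix, random fake data) is made up purely for illustration, and Spark's implementation handles sparsity and distribution for you - each half-step is just a regularized least squares solve:

import numpy as np

def toy_als(R, k=10, lam=0.1, iterations=10):
    # R is an n_users x n_items ratings matrix, assumed fully observed here
    n_users, n_items = R.shape
    X = np.random.rand(k, n_users)   # user factors (columns are the x_u)
    Y = np.random.rand(k, n_items)   # item factors (columns are the y_i)
    I = np.eye(k)
    for _ in range(iterations):
        # fix Y and solve the regularized least squares problem for X...
        X = np.linalg.solve(Y.dot(Y.T) + lam * I, Y.dot(R.T))
        # ...then fix X and solve for Y
        Y = np.linalg.solve(X.dot(X.T) + lam * I, X.dot(R))
    return X, Y

R = np.random.rand(20, 30)             # fake ratings, just to run it
X, Y = toy_als(R)
print(np.abs(R - X.T.dot(Y)).mean())   # mean reconstruction error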

But why?

But that's not the interesting thing (to me, at least). The question is why it works.

The second term in the above equation is the regularization term. "Regularization is a technique to structure the parameters in a form we prefer, often to solve the problem of overfitting" [2]. This regularization penalizes large factor values. OK, but that's unsatisfactory. Why did we choose that particular equation?

A little digging found this on StackExchange. Basically, if we say:

$$R = X^T Y + \varepsilon$$

where $\varepsilon$ is an error term with a Gaussian distribution, then our minimization equation drops out naturally.

If the problem is ill-posed (ie, there is not a unique solution) then we introduce the assumption that the prior distribution over the parameters is a multivariate Gaussian. The conjugate prior of a Gaussian is another Gaussian. Taking the log of the product of the likelihood and this prior, we see something very like the $\min_{X,Y}$ equation above. (See the section on the Bayesian interpretation of Tikhonov regularization.)

This comes straight from Bayes' theorem, where we ignore the denominator as it does not depend on our parameters (see maximum a posteriori, or MAP, estimation). We then maximize over the parameters (by differentiating) and find the most likely ones.
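To sketch how that drops out - introducing $\sigma^2$ for the noise variance and $\tau^2$ for the prior variance, symbols which do not appear above - a Gaussian likelihood for $\varepsilon$ and zero-mean Gaussian priors on the factors give, via Bayes' theorem and up to constants:

$$p(X, Y \mid R) \propto p(R \mid X, Y)\, p(X)\, p(Y)$$

$$-\log p(X, Y \mid R) = \frac{1}{2\sigma^2} \sum_{r_{ui}} \left(r_{ui} - x_u^T y_i\right)^2 + \frac{1}{2\tau^2} \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right) + \text{const}$$

so maximizing the posterior over X and Y is exactly minimizing the regularized least-squares objective, with $\lambda = \sigma^2 / \tau^2$.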

MAP deserves a blog post all to itself and I will do that when I have time.

Convexity

"The term 'convex' can be applied to both sets and to functions. A set S ∈ ℝis a convex set if the straight line segment connecting any two points in S lies entirely in S... The function f is a convex function if its domain S is a convex set and if for any two points x and y in S, the following property is satisfied:

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y)$$

for all $\alpha \in [0, 1]$." [3]

Interesting properties derive from this.

"For convex programming problems, and more particularly for linear programs, local solutions are also global solutions. General nonlinear problems, both constrained and unconstrained, may posses local solutions that are not global solutions" [3] (Convex programming describes a constrained optimization problem where the objective function is convex, the equality constraint functions are linear and the inequality functions are concave [see 3, p8]

"If the objective function in the optimization problem and the feasible region are both convex, then any local solution is in fact a global solution." [3]

An interesting explanation for why this is the case is here on StackExchange. If $x^*$ is a local minimum of f but some other point x has $f(x) < f(x^*)$, substituting both into the convexity inequality leads to a contradiction. Try it!
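Spelling that argument out: for any $\alpha \in (0, 1]$, convexity gives

$$f(\alpha x + (1 - \alpha) x^*) \le \alpha f(x) + (1 - \alpha) f(x^*) < f(x^*)$$

and the left-hand side is a point on the segment from $x^*$ to $x$ that can be made arbitrarily close to $x^*$ by shrinking $\alpha$. That contradicts $x^*$ being a local minimum, so any local minimum of a convex function must be a global one.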

For equality to hold in the convexity condition, f must be linear. The minimization above is quadratic rather than linear (and, taken jointly over X and Y, it is not even convex), so approximating the solution iteratively is the only option.

[1] Real World Machine Learning, Harrington
[2] Machine Learning with TensorFlow, Shukla
[3] Numerical Optimisation, Nocedal and Wright

Friday, November 10, 2017

BFGS - part 2


Last time we looked at finding the minimum over a field. We used Python to describe a field as:

f = z ** 2 * cos(x) ** 2 +  cos(y) ** 2

(note: I just made this equation up for fun and it has no other significance).

The direction towards the minimum for every point is a vector, thus we have a vector field. (See this description of a vector field at Quora where the direction somebody is looking while riding a rollercoaster is analogous to a vector field. "It's OK to gradually turn your head, which will result in a vector field --- at every point of the rollercoaster ride, you'd be looking somewhere.")

[Aside: this is a great resource for plotting vector fields in Python]

What we didn't address is: where on that vector would we find the minimum? For example, let's take one vector for a point (source code for the plots can be found here).
Direction to the minimum at a given (arbitrary) point.
I've shown the surface for which f=0 on which this point lies just for giggles.

Anyway, taking this point, let's plot f along the vector's length. Let's introduce:

$$\varphi(\alpha) = f(x_k + \alpha p_k)$$

where $\alpha \ge 0$

and we get:

So, our vector points towards a minimum but where exactly is it?
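A curve like this can be generated with a few lines of sympy and matplotlib - the point and direction below are arbitrary choices of mine, not the ones used in the plot above:

import numpy as np
import matplotlib.pyplot as plt
from sympy import symbols, cos, lambdify

x, y, z = symbols('x y z')
f = z ** 2 * cos(x) ** 2 + cos(y) ** 2
f_num = lambdify((x, y, z), f)      # turn the sympy expression into a numeric function

xk = np.array([1.0, 2.0, 3.0])      # an arbitrary point
pk = np.array([0.5, -1.0, -2.0])    # an arbitrary direction
alphas = np.linspace(0.0, 2.0, 100)
phis = [f_num(*(xk + a * pk)) for a in alphas]

plt.plot(alphas, phis)
plt.xlabel('alpha')
plt.ylabel('phi(alpha)')
plt.show()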

The Wolfe Conditions


The principal reason for imposing the Wolfe conditions in an optimization algorithm is to ensure convergence of the gradient to zero (see this wiki).

"However, it is usually expensive to accurately compute a minimizer of φ (and, in fact, usually impossible to find the global minimizer of φ given the available information)" [1]

Armijo Condition

This stops us from overshooting the minimum with a step that's too large.

"In this case, the iterations repeatedly go from one side of a valley to the other, always reducing f but never by much. The problem is that the reduction in each step is very little compared to the length of the steps--the steps are too long." [1]

$$f(x_k + \alpha p_k) \le f(x_k) + \alpha c_1\, p_k^T \nabla f(x_k)$$

Curvature Condition

This stops us from choosing a step that's too small.

"The second way that an algorithm can reduce f without reducing it sufficiently is to take steps that are too short." [1] For example, we may asymptotically approach a value but that value is above the minimum.

$$p_k^T \nabla f(x_k + \alpha p_k) \ge c_2\, p_k^T \nabla f(x_k)$$

where $\alpha$ is the step length and $0 < c_1 < c_2 < 1$.
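A minimal sketch of checking both conditions for a candidate step length (the helper below is hypothetical: f and grad_f are assumed to be Python callables returning a float and a numpy array, and c1=1e-4, c2=0.9 are typical default values):

import numpy as np

def wolfe_conditions(f, grad_f, xk, pk, alpha, c1=1e-4, c2=0.9):
    # returns (Armijo holds, curvature condition holds) for this step length
    slope = pk.dot(grad_f(xk))            # directional derivative of f at xk along pk
    armijo = f(xk + alpha * pk) <= f(xk) + c1 * alpha * slope
    curvature = pk.dot(grad_f(xk + alpha * pk)) >= c2 * slope
    return armijo, curvature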

Combined

Combining these two conditions on the same graph gives us:


where the green dots indicate a valid region for the Armijo condition (we don't overshoot) and the red dots indicate valid regions for the curvature condition (we don't undershoot). Where the two overlap is a potential region for our next step.
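To see those overlapping regions numerically, you could scan a range of step lengths using the wolfe_conditions sketch above on a toy function (again, everything below is made up purely for illustration):

import numpy as np

def f(v):                                # a toy quadratic, just to have something concrete
    return float(v.dot(v))

def grad_f(v):
    return 2.0 * v

xk = np.array([2.0, 1.0])
pk = -grad_f(xk)                         # the steepest descent direction at xk

for alpha in np.linspace(0.05, 1.0, 20):
    armijo, curvature = wolfe_conditions(f, grad_f, xk, pk, alpha)
    print(round(alpha, 2), 'armijo' if armijo else '-', 'curvature' if curvature else '-')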

How the Wolfe conditions apply

Let's introduce:

$$s_k = x_{k+1} - x_k$$
$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

then

$$B_{k+1} s_k = y_k$$

This is the secant equation.

The curvature condition above can now be expressed in terms of s and y as:

$$s_k^T y_k > 0$$

and we are already fulfilling this condition. Why? Multiply the secant equation on the left by $s_k^T$. Since $B_{k+1}$ is positive definite (as we proved in the first blog post), $s_k^T B_{k+1} s_k = s_k^T y_k > 0$. QED.
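The Wolfe conditions give the same guarantee directly (this is the standard argument in Nocedal & Wright): at the accepted step length $\alpha_k$, the curvature condition says

$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k$$

and subtracting $\nabla f(x_k)^T p_k$ from both sides gives

$$y_k^T p_k \ge (c_2 - 1)\, \nabla f(x_k)^T p_k > 0$$

since $c_2 < 1$ and $p_k$ is a descent direction. Multiplying by $\alpha_k > 0$ (recall $s_k = \alpha_k p_k$) gives $s_k^T y_k > 0$.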

Another post should hopefully draw all this together and explain why BFGS works.

[1] Globalizing Newton's Method: Line Searches (Prof Gockenbach)

Saturday, November 4, 2017

BFGS - part 1


Nothing to do with Roald Dahl but a trick used to optimize machine learning algorithms (see Spark's mllib library). We're iteratively trying to find the lowest point in some space, modelling the function at each step with a quadratic $m_k$, where k is the iteration number.

We start with the quadratic model:

$$m_k(p_k) = f(x_k) + \nabla f(x_k)^T p_k + \tfrac{1}{2}\, p_k^T B_k p_k$$

taken from a Taylor expansion, where $p_k$ is the step and $B_k$ is a Hessian matrix that is positive definite (for reasons concerning convergence - see below).

At the minimum of this model (that is, differentiating with respect to $p_k$ and setting the result to zero), $\nabla m_k = \nabla f(x_k) + B_k p_k = 0$; the constant term $f(x_k)$ disappears when we differentiate. Re-arranging gives:

$$p_k = -B_k^{-1} \nabla f(x_k)$$

So, the next step can be given as:

$$x_{k+1} = x_k + \alpha_k p_k = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$$

where $\alpha_k$ is the step size at the k-th step.
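A sketch of one such update in numpy - using the true Hessian as $B_k$, which makes this a plain (damped) Newton step rather than BFGS proper, and with a toy quadratic of my own choosing:

import numpy as np

def quasi_newton_step(grad, B, xk, alpha=1.0):
    # x_{k+1} = x_k - alpha * B_k^{-1} grad f(x_k), solving rather than explicitly inverting
    pk = -np.linalg.solve(B(xk), grad(xk))
    return xk + alpha * pk

# a toy quadratic f(v) = 0.5 * v^T A v, whose gradient is A v and whose Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda v: A.dot(v)
B = lambda v: A

xk = np.array([1.0, -2.0])
for k in range(3):
    xk = quasi_newton_step(grad, B, xk)
    print(k, xk)                         # converges to the minimum at the origin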

Why positive definite?

A positive definite matrix, M, satisfies $x^T M x > 0$ for all non-zero x. Note that if we do a Taylor expansion around the minimum point $x^*$, we get

$$f(x^* + p) = f(x^*) + p^T \nabla f(x^*) + \tfrac{1}{2}\, p^T \nabla^2 f(x^*)\, p + \dots$$

where obviously $\nabla f(x^*)$ is zero and $f(x^*)$ is less than $f(x^* + p)$ by the definition of $x^*$. Therefore, the final term, $\tfrac{1}{2}\, p^T \nabla^2 f(x^*)\, p$, is greater than 0 and hence the Hessian is positive definite. QED.

See p16 of Numerical Optimization (Nocedal & Wright) for a more convincing argument.
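As an aside, a quick numerical check of positive definiteness (a sketch using numpy's Cholesky factorization, which succeeds only for symmetric positive definite matrices):

import numpy as np

def is_positive_definite(M):
    # Cholesky succeeds if and only if the symmetric matrix M is positive definite
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 3.0]])))  # True
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False: eigenvalues are 3 and -1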

Surfaces

We can treat the space just like an electromagnetic field that has an electric potential at every point. So, although there is a scalar associated with every point in the space, how that scalar changes gives us a vector associated with every point too.

Let's take an example (full source code available here):

import mayavi.mlab as plt
import numpy as np
import math as m
from sympy import *

x, y, z = symbols('x y z')


f = z ** 2 * cos(x) ** 2 +  cos(y) ** 2

What does this look like? Well, since it's a scalar value at every point in 3-D space, it's hard to visualize. We could view it as a heatmap at various values of z (that is, take a slice on the x-y plane for a given z value).

Slices of the space of f at given values of z
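Something like those slices can be produced with sympy's lambdify and matplotlib (a sketch only; the actual plots in this post use Mayavi and the full source is linked above):

import numpy as np
import matplotlib.pyplot as plt
from sympy import symbols, cos, lambdify

x, y, z = symbols('x y z')
f = z ** 2 * cos(x) ** 2 + cos(y) ** 2
f_num = lambdify((x, y, z), f)

xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
for zval in [0.0, 1.0, 2.0]:
    plt.figure()
    plt.pcolormesh(xs, ys, f_num(xs, ys, zval))  # heatmap of the x-y slice at this z
    plt.colorbar()
    plt.title('f at z = %s' % zval)
plt.show()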
Or we could fix f for an arbitrary value and look at the surface that is defined by all points that map to that value (an "equipotential" in physics terms). For instance:
Surfaces at f=0 (bottom), f=2 (middle) and f=4 (top)
For simplicity, let's look at the surface formed by f=0 and look at the gradient - that is, $\nabla f(x_k)$ - along the arbitrarily chosen line x=0.

Because I'm a careless mathematician, it's a pleasure to find that Sympy does the work for me. I write:

div = [f.diff(v) for v in [x, y, z]]  # the gradient of f: [df/dx, df/dy, df/dz]

then I can calculate the gradient vector at all points on this surface with:

def gradient(xpt, ypt, zpt):
    gradx = div[0].subs(x, xpt).subs(y, ypt).subs(z, zpt).evalf()
    grady = div[1].subs(x, xpt).subs(y, ypt).subs(z, zpt).evalf()
    gradz = div[2].subs(x, xpt).subs(y, ypt).subs(z, zpt).evalf()
    return float(gradx), float(grady), float(gradz)

and plot it so:
Gradient where f=0 and x=0
Unsurprisingly, the gradient vectors are perpendicular to the surface.

Now we add the vectors corresponding to $p_k = -B_k^{-1} \nabla f(x_k)$ along this line of x=0. Again, Sympy does all the heavy lifting:

def eval_p_at(xpt, ypt, zpt):
    co_ords = [xpt, ypt, zpt]
    # the symbolic Hessian of f, evaluated numerically at this point
    hessian_fn = [[f.diff(a).diff(b) for a in [x, y, z]] for b in [x, y, z]]
    values = [sub_vals(item, co_ords) for item in hessian_fn]
    hess_mat_eval = Matrix(values)
    div_vals = sub_vals(div, co_ords)
    if hess_mat_eval.det() == 0:
        print("uninvertible hessian at", xpt, ypt, zpt)
        return 0, 0, 0
    else:
        p = - (hess_mat_eval.inv() * Matrix(div_vals))  # p = -B^-1 * grad f
        return float(p[0]), float(p[1]), float(p[2])


def sub_vals(items, vals):

    return [i.subs(x, vals[0]).subs(y, vals[1]).subs(z, vals[2]).evalf() for i in items]


where p is the expression we saw at the top of this post. Plotting it gives:
Gradient and p along f=0, x=0
Note how the p vectors are all pointing downwards, trying to minimize f.

Now, if we look at these vectors of p at various points on this arbitrarily chosen surface, f=0, it looks like this:

p vectors on f=0
which is starting to look like our heatmap if we looked directly down on it. The p vectors are trying to find the "coldest" spot, the minimum of the function f.

I'll flesh out the algorithm more in the next post.