ALS is another matrix decomposition technique. It has been used by people like Netflix in recommender systems. "The idea of characterizing users as collections of items, and items as collections of users, is the basis of collaborative filtering, in which users are clustered based on common item preferences, and items are clustered based on the affinities of common users."[1]
In ALS, "you decompose your large user/item matrix into lower dimensional user factors and item factors" (Quora).
The algorithm distributes easily and Spark implements it out-of-the-box.
What is it?
The maths behind it says: imagine a matrix of ratings, R, that approximates to XTY where X represents the users and Y the items.
Thus, the problem is reduced to:
minX,Y = Σrui(rui - xuTyi)2 + λ (Σu||xu||2 + Σi||yi||2)
So, basically we want to minimize a standard least squares problem plus an error. ALS does this by fixing X then solving for Y, then fixing Y solving for X. Rinse and repeat until it converges to an answer.
But why?
But that's not the interesting thing (to me, at least). The question is why it works.
The error in the above equation is often referred to as regularization. "Regularization is a technique to structure the parameters in a form we prefer, often to solve the problem of overfitting" [2] This regularization penalizes outliers. OK, but that's unsatisfactory. Why did we choose that particular equation?
A little digging found this on StackExchange. Basically, if we say:
R = XTY + ε
where ε is the error with a Gaussian distribution, our minimization equation drops out naturally.
If the problem is ill-posed (ie, there is not a unique solution) then we introduce the assumptions that the distribution is a multivariate Gaussian. The conjugate prior of a Gaussian is another Gaussian. Taking the log of the product of the distribution and its conjugate prior, we see something very like the minX,Y equation above. (See the section on the Bayesian interpretation of Tikhonov Regularization).
This comes straight from Bayes theorem where we ignore the denominator as it does not depend on our parameters (see maximum a posteriori or MAP). We then maximize the parameters (by differentiating) and find the most likely parameters.
MAP deserves a blog post all to itself and I will do that when I have time.
Convexity
"The term 'convex' can be applied to both sets and to functions. A set S ∈ ℝn is a convex set if the straight line segment connecting any two points in S lies entirely in S... The function f is a convex function if its domain S is a convex set and if for any two points x and y in S, the following property is satisfied:
f(α x + (1-α) y) ≤ αf(x) + (1-α)f(y)
Interesting properties derive from this.
"For convex programming problems, and more particularly for linear programs, local solutions are also global solutions. General nonlinear problems, both constrained and unconstrained, may posses local solutions that are not global solutions" [3] (Convex programming describes a constrained optimization problem where the objective function is convex, the equality constraint functions are linear and the inequality functions are concave [see 3, p8]
An interesting explanation for why this is the case is here on StackExchange. If x* is the lowest point of f, the substituting in any other x will always reduce the equation to f(x*) > f(x). Try it!
For the equation to hold, f must be a linear equation. Since the minimization equation above has a quadratic, it is not linear so approximating it is the only option.
[1] Real World Machine Learning, Harrington
[2] Machine Learning with TensorFlow, Shukla
[3] Numerical Optimisation, Nocedal and Wright