## Monday, August 10, 2015

### The Mathematics of Linear Regression

There's an excellent and free book online called The Elements of Statistical Learning that can be found here. I was browsing through it but hit one of those points where a derivation misses some steps and confuses the reader. It took me a few hours to work it out and it's not too hard but just in case somebody it's the same problem, here is a step-by-step derivation.

The section was on Linear Models and Least Squares (2.3.1) that says:
 Ŷ = β̂0 + p  ∑ j = 1 Xj β̂j

Where β̂is a constant called the bias, β̂ is a vector of coefficients, X is a vector of inputs and Ŷ is the prediction.

To make things more mathematically convenient, let's subsume β̂0 and the β̂ vector of coefficients into one big vector such that the above equation becomes:

Ŷ = XTβ̂

This is exactly the same equation in matrix notation rather than summation notation (XT is simply the transpose of X).

So, the question becomes what's the optimal value of β̂ ?

Let's re-express this thus: say that we have N trials each with a result yi (this y has nothing to do with Ŷ). All these results can be put in a vector, y. Let's say the matrix X has N rows each representing p data points that are properties of the system (that is, it's an N x p matrix).

Then the error can be expressed as:

y - X ß

(that is, the output values minus all of the input values that are multiplied by our coefficients). This is what we want to minimize.

How do we do this? Well an obvious way is to find the smallest sum of squares of the differences. So, let's square it:

(y - X ß)T (y - X ß)

then expand it noting that matrices are distributive, associative but not commutative and:

(A + B)T = AT + BT              (1)

and

(AB)T = BT AT                   (2)

so:

= (yT - ßT XT ) (y - X ß)
= yTy - yT X ß ßT XT y + ßT XT X ß

and differentiate it with respect to ß. Where the result of this differentation equals 0 is our optimal value for ß.

The first term is easy to differentiate. There is no dependence of y on ß so it disappears and our equation becomes.

δ (- yT X ß ßT XT y + ßT XT X ß) = 0     (3)
δß

The next bit is tricky. To help, let's take a look at this useful summary on matrix calculus. The interesting sections for us are Proposition 7 (equations 36 and 37) and Proposition 9 (equation 49). They say:

If:

α = yT A x

then:

δαyT A              (4)
δx

and:

δα = xT AT             (5)
δy

where A is independent of and y.

And if:

α = xT A x

then

δα = 2 xT A            (6)
δx

where A is symmetric.

Now, take the first term of (3). That's basically equation (4) if we replace A for X and x for ß.

Take the second term of (3). That's basically equation (5) if we replace A for XT and y for ß and x for y.

Finally, take the third term of (3) and that looks like (6) if we replace A for XX and x for ß. How do we know that XX is symmetric? Well any matrix multiplied by its transpose is symmetric (see here for the simple proof).

So, taking these three observations, equation (3) becomes:

0 = - yT - yT X + 2 ßXX
ßX- 2 yT X
= (ßXT - yTX

Now take the transpose of both sides (the transpose of 0 is 0) and using (1) and (2):

0 = XT (ßXT - yT)T
= XT ((ßXT)T- (yT)T)
= XT (ß y)
= XT ß XT y

Rearranging:

XT ß = XT y

and multiplying both sides with (XT X)-1 gives:

ß = (XT X)-1 XT y

which is equation 2.6 in The Elements of Statistical Learning