There's an excellent and free book online called *The Elements of Statistical Learning* that can be found here. I was browsing through it but hit one of those points where a derivation misses some steps and confuses the reader. It took me a few hours to work it out and it's not too hard but, just in case somebody hits the same problem, here is a step-by-step derivation.

The section was on

*Linear Models and Least Squares* (2.3.1), which says:

Ŷ = β̂_{0} + ∑_{j=1}^{p} X_{j} β̂_{j}

Where *β̂*_{0} is a constant called the bias, *β̂* is a vector of coefficients, *X* is a vector of inputs, and *Ŷ* is the prediction.

To make things more mathematically convenient, let's subsume

*β̂*_{0} and the *β̂* vector of coefficients into one big vector, such that the above equation becomes:

*Ŷ* = *X*^{T} *β̂*

This is exactly the same equation in matrix notation rather than summation notation (*X*^{T} is simply the transpose of *X*).

So, the question becomes: what's the optimal value of

*β̂*?

Let's re-express this thus: say that we have N trials, each with a result y_{i} (this y has nothing to do with *Ŷ*). All these results can be put in a vector, y. Let's say the matrix *X* has N rows, each representing p data points that are properties of the system (that is, it's an N × p matrix).

Then the error can be expressed as:

y - **X**ß

(that is, the output values minus all of the input values multiplied by our coefficients). This is what we want to minimize.
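To make this concrete, here's a small numerical sketch of the residual and its sum of squares, using NumPy with made-up data (the shapes and variable names are just illustrative; `beta` stands in for ß):

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 5, 3                      # N trials, p properties per trial
X = rng.normal(size=(N, p))      # the N x p matrix of inputs
y = rng.normal(size=N)           # the vector of N observed results
beta = rng.normal(size=p)        # some candidate coefficient vector

error = y - X @ beta             # the residual vector y - X ß
sum_of_squares = error @ error   # (y - X ß)^T (y - X ß)

print(error.shape)               # (5,)
print(sum_of_squares >= 0)       # True: a sum of squares is never negative
```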

How do we do this? Well, an obvious way is to find the smallest sum of squares of the differences. So, let's square it:

(y - **X**ß)^{T} (y - **X**ß)

then expand it, noting that matrices are distributive and associative but not commutative, and:

(**A** + **B**)^{T} = **A**^{T} + **B**^{T}  **(1)**

and

(**AB**)^{T} = **B**^{T} **A**^{T}  **(2)**

so:

= (y^{T} - ß^{T} **X**^{T}) (y - **X**ß)

= y^{T} y - y^{T} **X** ß - ß^{T} **X**^{T} y + ß^{T} **X**^{T} **X** ß

and differentiate it with respect to ß. Where the result of this differentiation equals 0 is our optimal value for ß.
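As a sanity check on that expansion, you can verify numerically that the four terms add up to the original squared error (a sketch with arbitrary NumPy data; `b` stands in for ß):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 6, 2
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
b = rng.normal(size=p)            # an arbitrary coefficient vector

lhs = (y - X @ b) @ (y - X @ b)   # (y - Xß)^T (y - Xß)
rhs = y @ y - y @ X @ b - b @ X.T @ y + b @ X.T @ X @ b

print(np.isclose(lhs, rhs))       # True: the expansion matches
```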

The first term is easy to differentiate. There is no dependence of y on ß, so it disappears and our equation becomes:

δ/δß (- y^{T} **X** ß - ß^{T} **X**^{T} y + ß^{T} **X**^{T} **X** ß) = 0  **(3)**

The next bit is tricky. To help, let's take a look at this useful summary on matrix calculus. The interesting sections for us are Proposition 7 (equations 36 and 37) and Proposition 9 (equation 49). They say:

If:

α = **y**^{T} **A** **x**

then:

δα/δ**x** = **y**^{T} **A**  **(4)**

and:

δα/δ**y** = **x**^{T} **A**^{T}  **(5)**

where **A** is independent of **x** and **y**.

And if:

α = **x**^{T} **A** **x**

then:

δα/δ**x** = 2 **x**^{T} **A**  **(6)**

where **A** is symmetric.
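A quick way to convince yourself of (4) and (6) is a finite-difference check. This sketch (not part of the book) compares a numerical gradient against the closed forms, using small random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)
y = rng.normal(size=n)

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# (4): d/dx (y^T A x) = y^T A
g4 = num_grad(lambda v: y @ A @ v, x)
print(np.allclose(g4, y @ A, atol=1e-6))      # True

# (6): d/dx (x^T A x) = 2 x^T A, for symmetric A
S = A + A.T                                    # any A + A^T is symmetric
g6 = num_grad(lambda v: v @ S @ v, x)
print(np.allclose(g6, 2 * x @ S, atol=1e-6))  # True
```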

Now, take the first term of (3). That's basically equation (4) if we replace **A** with **X** and **x** with ß.

Take the second term of (3). That's basically equation (5) if we replace **A** with **X**^{T}, **y** with ß, and **x** with **y**.

Finally, take the third term of (3): that looks like (6) if we replace **A** with **X**^{T}**X** and **x** with ß. How do we know that **X**^{T}**X** is symmetric? Well, any matrix multiplied by its transpose is symmetric (see here for the simple proof).
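The symmetry claim is easy to check numerically; a one-off sketch with any real-valued, non-square matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))   # a non-square, non-symmetric matrix
G = X.T @ X                   # the Gram matrix X^T X

print(np.allclose(G, G.T))    # True: X^T X equals its own transpose
```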

So, taking these three observations, equation (3) becomes:

0 = - y^{T} **X** - y^{T} **X** + 2 ß^{T} **X**^{T} **X**

= 2 ß^{T} **X**^{T} **X** - 2 y^{T} **X**

and, since the whole expression equals 0, we can divide through by 2:

0 = (ß^{T} **X**^{T} - y^{T}) **X**

Now take the transpose of both sides (the transpose of 0 is 0) and using (1) and (2):

0 = **X**^{T} (ß^{T} **X**^{T} - y^{T})^{T}

= **X**^{T} ((ß^{T} **X**^{T})^{T} - (y^{T})^{T})

= **X**^{T} (**X** ß - y)

= **X**^{T} **X** ß - **X**^{T} y

Rearranging:

**X**^{T} **X** ß = **X**^{T} y

and multiplying both sides by (**X**^{T} **X**)^{-1} gives:

ß = (**X**^{T} **X**)^{-1} **X**^{T} y

which is equation 2.6 in *The Elements of Statistical Learning*.
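Putting it all together, the closed form ß = (**X**^{T}**X**)^{-1} **X**^{T} y can be checked against NumPy's built-in least-squares solver. This is a sketch with synthetic data; in real code you'd prefer `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) over an explicit inverse, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 3
X = rng.normal(size=(N, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.01 * rng.normal(size=N)   # results with a little noise

# ß = (X^T X)^{-1} X^T y  -- the normal-equations solution derived above
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Compare with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))        # True: both agree
print(beta_hat)                                  # close to [2.0, -1.0, 0.5]
```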
