Here's a sweet little challenge. Given a linear function, write some code that estimates what it is given only the data it produces.

Let's make the function a simple one, say f(x): Double -> Double, such that we can plot a straight line. A first draft might look like this in Scala:

First, generate a random function of the form α x + β

def main(args: Array[String]): Unit = {

val intercept = Random.nextDouble()

val rate = Random.nextDouble()

val f = generateLinearFn(intercept, rate)

val (estRate, estIntercept) = guessRateAndIntercept(f)

println(s"real: intercept = $intercept, rate = $rate")

println(s"guessed: intercept = $estIntercept, rate = $estRate")

}

def generateLinearFn(intercept: Double, rate: Double): (Double => Double) = { x =>

intercept + (rate * x)

}

Now, generate some values for x and run them through a similar shaped function where we have guessed the values for α and β. Then, we wiggle α and β and compare the results of this ersatz funtion with the results from the real function. If a wiggle makes the results better, keep the adjusted values of α and β.

def guessRateAndIntercept(f: Double => Double): (Double, Double) = {

val nIterations = 100

val stepSize = 1d / nIterations

var rate = Random.nextDouble()

var intercept = Random.nextDouble()

val xs = (1 to 100).map(x => x.toDouble / 100)

for (i <- 0 to nIterations) {

intercept = intercept + bestStep(stepSize, delta => generateLinearFn(intercept + delta, rate), f, xs)

rate = rate + bestStep(stepSize, delta => generateLinearFn(intercept, rate + delta), f, xs)

println(s"intercept = $intercept, rate = $rate")

}

(rate, intercept)

}

def bestStep(stepSize: Double, df: Double => Double => Double, actualFn: Double => Double, xs: Seq[Double]): Double = {

val errorToStep = Seq(stepSize, 0d, -stepSize).map(step => df(step) -> step).map { case (guessFn, step) =>

(errorOver(actualFn, guessFn, xs), step)

}

errorToStep.minBy(_._1)._2

}

Where the errors are measured simply by the sum of squared values between the results from our approximate function and the real thing.

def errorOver(f: Double => Double, guessFn: Double => Double, data: Seq[Double]): Double =

sumOfSquaredErrors(data.map(guessFn), data.map(f))

def sumOfSquaredErrors(y1s: Seq[Double], y2s: Seq[Double]): Double =

y1s.zip(y2s).map{ case(x, y) =>

val diff = x - y

pow(diff, 2)

}.sum

OK, nothing clever here. It's just a simple approximation technique that gives reasonable answers:

real: intercept = 0.9830623503713442, rate = 0.8187143023851281

guessed: intercept = 0.9598594405202521, rate = 0.8497631295034207

So, how do the big boys do it in maths libraries? One cunning way can be found here in the Apache commons math library. It takes a bit of head-scratching to understand real: intercept = 0.9830623503713442, rate = 0.8187143023851281

guessed: intercept = 0.9598594405202521, rate = 0.8497631295034207

But it has some obvious flaws. This only handles functions with one independent variable (ie, it only deals with (Double) => Double. What about (Double, Double) => Double or (Double, Double, Double) => Double etc?)

*why*it works but this is what I found.

First, lets expand on the comments in the JavaDoc. Say, we have a set of real data and a proposed linear function that can approximate that data. Let's say we have n samples of real data and our function has k independent variables.

Aside: a note on terms used here:

"A simple way to describe variables is as eitherRight, so we have n equations that look something like this:dependent, if they represent some outcome of a study, orindependent, if they are presumed to influence the value of the dependent variable(s)... The dependent variable is also referred to as the 'Y variable' and the independent variables as the 'X variables'. Other names for the dependent variable include theoutcome variableand theexplained variable. Other names for the independent variables includeregressors,predictorvariables andexplanatoryvariables." [1]

k | ||

y = | Σ | x_{i} b_{i} + ε |

i = 1 |

where y is the actual output for trial n, x is an input, b is the coefficient we're trying to find and ε is the error between our prediction and the actual value.

This can be expressed in matrix form as:

Y = (X' B') + E

where Y is a vector of all the n outputs, X' is the matrix of all the n inputs, B' is all k coefficients and E is a vector of errors.

By adding an extra row to B' and an extra column to X', we can subsume the E vector into these terms to make the equation look neater and easier to handle, thus:

Y = X B

Now comes the clever bit. A matrix can be represented as the product of two matrices in a process called QR decomposition where Q and R are the two matrices. The interesting thing about this is that R is an upper triangular matrix. Why this is important will become clear later.

__QR Factorization__

This trick simply states that "a matrix whose columns are v

_{1}, v

_{2}, ... v

_{n}can be expressed as the product of two matrices: a matrix [Q] whose columns v*

_{1}, v*

_{2}, ... v*

_{n}are mutually orthogonal and a triangular matrix [R]." [2]

The algorithm goes like this. Take the vectors v

_{1}, v

_{2}, ... v

_{n}one at a time. For each vector, find its components that are orthogonal to those that have come before. This is done by subtracting from it its projection on the space represented by the vectors that have already been processed and adding the result to the set. Full proofs can be found in [2] but the gist is: imagine shaving off each vector its projection on the space defined by the vectors before it.

This is a recursive algorithm that would give you:

v

_{1}, v

_{2}, ... v

_{n}= σ

_{1}v*

_{1}, σ

_{2}( (v*

_{1}) - (v

_{2}. v*

_{1})), ... , σ

_{n}(v*

_{1 }- (v

_{2}. v*

_{1}) - ... (v

_{n}. (v*

_{1 }- (v

_{2}. v*

_{1}) - ... (v

_{n-1}. (v*

_{1}- ...))

or, to express it more neatly in matrix form:

| | v*_{11} |
v*_{12} |
v*_{13} |
... | | | | | σ_{11} |
σ_{12} |
σ_{13} |
... | | |

| | v*_{21} |
v*_{22} |
v*_{23} |
... | | | | | σ_{21} |
σ_{22} |
σ_{23} |
... | | |

| | v*_{31} |
v*_{32} |
v*_{33} |
... | | | | | σ_{31} |
σ_{32} |
σ_{33} |
... | | |

where v*

_{11}, v*

_{21}, v*

_{31}, ... are the components of v*

_{1}and σ

_{11}, σ

_{21}, σ

_{31}, ... are the components of σ

_{1}.

Now, since σ

_{1}is only ever multiplied by v*

_{1}(see the expansion of the recursive algorithm above) then all the elements σ

_{i1}must be 0 for i>1.

Similarly, since σ

_{2}is only ever multiplied by v*

_{1}and v*

_{2}(again, see the above expansion) then σ

_{i1}must be 0 for i>2 and so on.

So the matrix for σ must be an upper triangular matrix and the matrix product above can better be written as:

| | v*_{11} |
v*_{12} |
v*_{13} |
... | | | | | σ_{11} |
σ_{12} |
σ_{13} |
... | | |

| | v*_{21} |
v*_{22} |
v*_{23} |
... | | | | | 0 | σ_{22} |
σ_{23} |
... | | |

| | v*_{31} |
v*_{32} |
v*_{33} |
... | | | | | 0 | 0 | σ_{33} |
... | | |

Note that the v*

_{n }vectors can be chosen to have unit length of 1. This will come in handy later.

__Back to linear regression__

Knowing this, we can now take our equation

Y = X B

and note that X can be expressed as Q R

Y = Q R B

multiply both sides by Q

^{-1}(at the beginning of each side as matrix multiplication is not commutative):

Q

^{-1}Y = Q

^{-1}Q R B

but note that Q

^{-1}Q is multiplying every orthogonal column vector within it by all the column vectors. This means that the value will be 1 (they're unit vectors) when a column is multiplied by itself and 0 otherwise (from the definition of orthogonal vectors). So, the equation becomes:

Q

^{-1}Y = R B = A

because Q

^{-1}Q is the identity matrix (A is just short-hand for Q

^{-1}Y).

Now, we can solve through back-substitution. Since R is an upper rectangular matrix, this equation expands to:

r

_{n,n}b

_{n}= a

_{n,n}

r

_{n-1,n-1}b

_{n-1}+ r

_{n-1,n}b

_{n}= a

_{n-1, n-1 }+ a

_{n-1, n }

_{.}

_{.}

_{.}

Well, we know all the values of the A and R matrices (since they are derived from the outputs and the v*

_{n }vectors) we can solve the first equation for b

_{n}. We can substitute this into the second equation and solve it for b

_{n-1}and so on (this is back-substitution).

[1] Statistics in a Nutshell (O'Reilly)

[2] Coding the Matrix - Philip Klein

## No comments:

## Post a Comment