1 - (variance you fail to predict / total variance)
It is therefore model agnostic. (For more about the maths of variance, see a previous post).
A value of 1.0 means the model describes the data perfectly, 0.0 means it is the same as guessing the mean and less than zero means it's worse than useless.
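To make those three cases concrete, here's a minimal sketch. The helper name `r_squared` and the toy numbers are mine, made up purely for illustration; it uses the usual sum-of-squares form of the definition.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (unexplained variation / total variation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # what the predictions miss
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = r_squared(y, y)                              # predictions match exactly
mean_guess = r_squared(y, np.full_like(y, y.mean()))   # always guess the mean
worse = r_squared(y, y[::-1])                          # predictions run the wrong way

print(perfect, mean_guess, worse)
```

The three results land at 1, 0 and below zero respectively, matching the interpretation above.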
So, here's an interesting question: if my prediction for the next value in a series is the last known value, what is R²? In theory, it should tend to 1 and there is a mathematical argument for this.
Some general points.
1. The formula for covariance is Σ(X_t - X̅)(Y_t - Ȳ)/N.
Expand it and you'll see it's the same as E[XY] - E[X]E[Y].
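If you'd rather not do the algebra, a quick numerical check does the job. This is only a sketch on made-up random data; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # deliberately correlated with x

# Definition: average product of deviations from the means
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))

# Expanded form: E[XY] - E[X]E[Y]
cov_expanded = np.mean(x * y) - x.mean() * y.mean()

# NumPy's own estimate, with bias=True so it divides by N rather than N-1
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_definition, cov_expanded, cov_numpy)
```

All three agree up to floating point error.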
2. We assume that a value is the sum of previous shocks, each multiplied by a weight that shrinks geometrically with each step back in time. The ratio between successive weights is φ, the autoregressive coefficient.
So:
y_t = ε_t + φ ε_{t-1} + φ^2 ε_{t-2} + ... where |φ| < 1
And with a self-substitution, this becomes:
y_t = ε_t + φ y_{t-1}
We generally add a constant to this formula but for mathematical simplicity the rest of the argument assumes we've centred the variable on its mean - ie E[Y]=0.
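The self-substitution is easy to check numerically. Below is a small sketch, assuming a finite sample that starts at the first shock (so the "infinite" sum only runs over the shocks we have). The value φ = 0.7 and the variable names are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = 0.7          # arbitrary illustrative value with |phi| < 1
n = 200
eps = rng.normal(size=n)

# Recursive form: each value is the new shock plus phi times the last value
y_rec = np.zeros(n)
y_rec[0] = eps[0]
for t in range(1, n):
    y_rec[t] = eps[t] + phi * y_rec[t - 1]

# Expanded form: a geometrically weighted sum of all shocks so far
y_sum = np.array([
    sum(phi ** k * eps[t - k] for k in range(t + 1))
    for t in range(n)
])

print(np.max(np.abs(y_rec - y_sum)))   # difference is floating point noise
```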
3. The covariance of two independent variables is 0.
   1. Note that the joint probability distribution of x and y, if they're independent, is f(x,y) = f(x)f(y).
   2. Integrate to get the expected value of the product: ∫∫ xy f(x,y) dx dy = ∫ x f(x) dx ∫ y f(y) dy.
   3. When you do that, you'll find E[XY] = E[X]E[Y].
   4. Substituting this into the formula for covariance above, Cov(X,Y) = 0 if X and Y are independent.
4. The covariance of a variable with itself is its variance.
Substitute X=Y into the formula for covariance and you get Cov(X,Y) = Cov(X,X) = E[X^2] - E[X]^2 = Var(X), as you can see in a previous blog post.
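Points 3 and 4 can both be sanity-checked with simulated data. This is a sketch on made-up normal samples; the helper `cov` is mine. Note that in a finite sample the covariance of independent draws is only approximately zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)   # drawn independently of x

def cov(a, b):
    """Covariance as E[AB] - E[A]E[B], estimated from samples."""
    return np.mean(a * b) - a.mean() * b.mean()

cov_independent = cov(x, y)   # close to 0, but not exactly 0
cov_self = cov(x, x)          # same number as the variance of x

print(cov_independent, cov_self, np.var(x))
```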
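5. Take that formula in step 2, multiply both sides by y_{t-1} and take the expectation of each term: E[y_t y_{t-1}] = E[ε_t y_{t-1}] + φ E[y_{t-1}^2]. Because we centred Y on its mean, E[Y] = 0 and each expectation of a product is a covariance.
Cov(ε_t, y_{t-1}) is zero because of step 3: the new shock ε_t is independent of everything that came before it.
Cov(y_{t-1}, y_{t-1}) = Var(y_{t-1}) = E[Y^2] - E[Y]^2 = E[Y^2] because of step 4 and E[Y] = 0.
So, rearranging, φ = Cov(y_t, y_{t-1}) / Var(y_{t-1})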
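Here's a sketch of that result in action: simulate a series with a known φ and recover it as the ratio of the lag-1 covariance to the variance. The true value 0.6, the sample size and the variable names are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
phi_true = 0.6     # assumed value for the simulation
n = 50_000
eps = rng.normal(size=n)

# Build the series from the recursive formula
y = np.zeros(n)
for t in range(1, n):
    y[t] = eps[t] + phi_true * y[t - 1]

y = y - y.mean()               # centre on the mean, as the argument assumes
y_t, y_lag = y[1:], y[:-1]     # each value paired with the one before it

# phi = Cov(y_t, y_{t-1}) / Var(y_{t-1}), using E[Y] = 0
phi_hat = np.mean(y_t * y_lag) / np.mean(y_lag ** 2)

print(phi_hat)   # an estimate, so close to phi_true rather than equal to it
```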
For a random walk:
y_t = ε_t + y_{t-1}
Necessarily φ=1 because the most recent step is the last plus some randomness.
So, the variance our model fails to explain is Var(ε), because the prediction error y_t - y_{t-1} is just ε_t. The total variance is a different matter. The above formula can equivalently be written as:
y_t = y_0 + ε_1 + ε_2 + ε_3 + ...
and choosing y_0 = 0 then:
Var(y_t) = Var(ε_1) + Var(ε_2) + Var(ε_3) + ...
Var(y_t) = N Var(ε)
The total variance changes over time. Starting our walk at zero, y_t is the sum of N independent shocks after N steps. So, here we find the variance is N Var(ε).
Putting these two values into the equation at the top of this post,
R² = 1 - Var(ε) / (N Var(ε)) = 1 - 1/N
Naturally, as N tends to infinity, R² tends to 1.
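You can see the 1 - 1/N behaviour in a simulation. The sketch below is my own setup, not part of the original argument: it generates many independent random walks, "predicts" the final value of each with the value before it, and compares the error variance with the variance of the final values across walks. The number of walks and steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n_walks, n_steps = 20_000, 50
eps = rng.normal(size=(n_walks, n_steps))

# Each row is one walk starting at 0; column N-1 holds the value after N steps
walks = np.cumsum(eps, axis=1)

y_final = walks[:, -1]   # value after N steps
y_prev = walks[:, -2]    # naive prediction: the last known value

residual_var = np.var(y_final - y_prev)   # roughly Var(eps)
total_var = np.var(y_final)               # roughly N * Var(eps)

r2 = 1.0 - residual_var / total_var
expected = 1.0 - 1.0 / n_steps

print(r2, expected)
```

The simulated figure sits close to 1 - 1/N even though each prediction is off by a full shock.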
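So, a model that predicts the next value based on the last is a good one because R² is close to 1, right? No! This is a spurious regression. The high score comes from the total variance growing with N, not from any skill at predicting the next step: the error on each prediction is still a full Var(ε). Such regressions may be useful for diagnostics (maybe) but not as models.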
Linear Regression tip
I built a decent sales model with:
MAE : 1439240.3746 6.3954 % of mean
MAPE : 17.0928 %
R² : 0.9646
If you plot the residuals against the target values, you should see a random scatter plot around y=0.
This was not the case for me on my first try. There was a pattern. The residuals tended to be below zero for small target values and above zero for large. Basically, the plots were saying I was overestimating low values and underestimating large values.
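A caveat worth checking before reading too much into a pattern like this: for ordinary least squares with an intercept, the residuals are uncorrelated with the fitted values by construction, but they are always positively correlated with the target itself (the correlation is √(1 - R²)). So some upward tilt in a residuals-versus-target plot appears even when the model is correctly specified. Here's a minimal sketch on synthetic data (made up for illustration, not the sales data) where the true relationship really is linear.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)   # synthetic target, genuinely linear in x

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

r2 = 1.0 - residuals.var() / y.var()
corr_with_fitted = np.corrcoef(residuals, fitted)[0, 1]   # zero by construction
corr_with_target = np.corrcoef(residuals, y)[0, 1]        # positive, sqrt(1 - R^2)

print(r2, corr_with_fitted, corr_with_target)
```

That doesn't rule out a real problem with the model, but plotting residuals against the fitted values as well gives a cleaner read on whether the pattern is genuine.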
Since this was a linear regression, adding some non-linearity really helped. I both squared and took the log of all my regressors:
MAE : 1293045.8398 5.1369 % of mean
MAPE : 10.7136 %
R² : 0.9717
This was great but the improvement didn't appear to actually come from the non-linearity. Instead, it was this line removing lots of rows:
df = df.replace([np.inf, -np.inf], np.nan).dropna()
We need this because linear regression cannot handle anything but finite real numbers. Why removing the rows that had zero or negative values (and consequently infinite or NaN logged regressors) improved the metrics remains a mystery.
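To see exactly which rows that line throws away, here's a tiny sketch. The `sales` column and its values are invented for illustration. In NumPy, the log of zero is -inf and the log of a negative number is NaN, and the replace-then-dropna line removes both.

```python
import numpy as np
import pandas as pd

# Made-up data: one zero and one negative value among the positives
df = pd.DataFrame({"sales": [100.0, 0.0, -5.0, 20.0]})

# Silence the divide-by-zero / invalid-value warnings for this demo
with np.errstate(divide="ignore", invalid="ignore"):
    df["log_sales"] = np.log(df["sales"])

print(df)   # log_sales is -inf for the zero row and NaN for the negative row

cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)   # only the rows with strictly positive sales survive
```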
But it also caused another problem. As more columns were added, the probability of a row being excluded because it contained a NaN increased. Therefore, this little line of Python in our data pipeline had the unintended consequence of reducing the number of rows as the number of columns increased!
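The effect is easy to reproduce. In this sketch I assume every cell independently has a 5% chance of being missing (an invented figure, as are the column names), so the fraction of rows surviving a blanket `dropna()` falls roughly like 0.95 to the power of the number of columns. One possible mitigation, shown at the end, is to pass `subset=` so that only the columns the model actually uses can knock a row out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n_rows, n_cols = 10_000, 20
data = rng.normal(size=(n_rows, n_cols))
data[rng.random(size=data.shape) < 0.05] = np.nan   # assumed 5% missing per cell
df = pd.DataFrame(data, columns=[f"x{i}" for i in range(n_cols)])

# Rows surviving a blanket dropna() as the frame gets wider
rows_kept = {k: len(df.iloc[:, :k].dropna()) for k in (1, 5, 10, 20)}
print(rows_kept)

# Only drop rows with gaps in the regressors actually in the model
model_cols = ["x0", "x1"]   # hypothetical choice of regressors
kept_for_model = len(df.dropna(subset=model_cols))
print(kept_for_model)
```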