Sunday, May 17, 2026

R-squared

R² is a good way to test how well your model describes the data. It is literally

1 - (variance of the residuals / total variance)

It is therefore model agnostic. (For more about the maths of variance, see a previous post).

A value of 1.0 means the model describes the data perfectly, 0.0 means it is the same as guessing the mean and less than zero means it's worse than useless.
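As a minimal sketch of those three cases, here is R² computed by hand with NumPy. The function name r_squared and the toy numbers are made up for illustration; the ratio of the two sums of squares is the same as the ratio of the two variances because the 1/N factors cancel.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # what the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the data around its mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 4.0, 7.0])                  # made-up target values

perfect = r_squared(y, y)                             # predictions equal the data
mean_guess = r_squared(y, np.full_like(y, y.mean()))  # always guess the mean
worse = r_squared(y, y[::-1])                         # predictions in reverse order
```

Here perfect comes out as 1, mean_guess as 0 and worse as a negative number, matching the three cases described above.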

So, here's an interesting question: if my prediction for the next value in a series is the last known value, what is R²? In theory, it should approach 1 and there is a mathematical argument for this.

Some general points.

1. The formula for covariance is Σ(Xt-X̅)(Yt-Ȳ)/N
Expand it and you'll see it's the same as E[XY] - E[X]E[Y].

2. We assume that a value is the sum of the current and previous shocks, each multiplied by a weight that shrinks geometrically with each step back in time. We'll call the ratio between successive weights φ, the autoregressive coefficient.

So:

y_t = ε_t + φ ε_{t-1} + φ² ε_{t-2} + ... where |φ| < 1

And with a self-substitution, this becomes:

y_t = ε_t + φ y_{t-1}

We generally add a constant to this formula but for mathematical simplicity the rest of the argument assumes we've centred the variable on its mean, i.e. E[Y]=0.

3. The covariance of two independent variables is 0
   1. Note that the joint probability distribution of x and y, if they're independent, is f(x,y) = f(x)f(y).
   2. Compute the expected value of the product, that is, integrate xy f(x,y) = x f(x) · y f(y) over x and y.
   3. When you do that, the double integral factorises and you'll find E[XY] = E[X]E[Y].
   4. Substituting this into the formula for covariance above, Cov(X,Y) = 0 if X and Y are independent.

4. The covariance of a variable with itself is its variance.
Substitute Y=X into the formula for covariance and you get Cov(X,Y) = Cov(X,X) = E[X²] - E[X]² = Var(X) as you can see in a previous blog post.

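5. Take the formula in step 2, multiply both sides by y_{t-1} and take expectations of all the terms. Because we centred Y on its mean, these expectations are covariances:

Cov(y_t, y_{t-1}) = Cov(ε_t, y_{t-1}) + φ Cov(y_{t-1}, y_{t-1})

Cov(ε_t, y_{t-1}) is zero because of step 3: the new shock is independent of everything that came before it.

Cov(y_{t-1}, y_{t-1}) = Var(y_{t-1}) because of step 4, and Var(Y) = E[Y²] - E[Y]² = E[Y²] because E[Y]=0 since we centred Y on its mean.
So, rearranging, φ = Cov(y_t, y_{t-1}) / Var(y_{t-1})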
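The five points above can be checked numerically. The sketch below assumes Gaussian shocks, φ = 0.7 and a sample of 200,000 values, all arbitrary choices for illustration; the helper cov and every variable name are likewise made up here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 200_000, 0.7             # sample size and coefficient are arbitrary choices

def cov(a, b):
    """Population covariance: mean of the products of the deviations (step 1)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Step 1: the two forms of the covariance agree.
x = rng.normal(size=n)
z = x + rng.normal(size=n)        # deliberately correlated with x
identity_gap = abs(cov(x, z) - (np.mean(x * z) - x.mean() * z.mean()))

# Steps 3 and 4: independent draws have covariance near 0; Cov(X, X) is Var(X).
w = rng.normal(size=n)            # drawn independently of x
cov_independent = cov(x, w)
cov_self_gap = abs(cov(x, x) - np.var(x))

# Step 2: build y from the recursion y_t = eps_t + phi * y_{t-1}.
eps = rng.normal(size=n)
y = np.empty(n)
y[0] = eps[0]
for t in range(1, n):
    y[t] = eps[t] + phi * y[t - 1]

# The same value as a geometrically weighted sum of shocks, checked at one t.
t = 50
weights = phi ** np.arange(t + 1)                # 1, phi, phi^2, ...
y_t_from_shocks = np.sum(weights * eps[t::-1])   # eps_t, eps_{t-1}, ..., eps_0

# Step 5: phi recovered as Cov(y_t, y_{t-1}) / Var(y_{t-1}).
phi_hat = cov(y[1:], y[:-1]) / np.var(y[:-1])
```

With a sample this large, phi_hat should land close to the 0.7 used to generate the series, and cov_independent close to zero. Neither will be exact because they are estimates from a finite sample.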

For a random walk:

y_t = ε_t + y_{t-1}

Necessarily φ=1 because the most recent step is the last plus some randomness.

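So, the variance our model leaves unexplained is Var(ε): if the prediction for y_t is y_{t-1}, the error is y_t - y_{t-1} = ε_t.

For the total variance, note that the above formula can equivalently be written as:

y_N = y_0 + ε_1 + ε_2 + ε_3 + ... + ε_N

and choosing y_0 = 0 then, because the shocks are independent of each other:

Var(y_N) = Var(ε_1) + Var(ε_2) + Var(ε_3) + ... + Var(ε_N)
Var(y_N) = N Var(ε)

The total variance changes over time. Centering our walk at zero, y is the sum of N shocks after N steps. So, here we find the variance is N Var(ε).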

Putting these two values into the equation at the top of this post,

R² = 1 - Var(ε) / (N Var(ε)) = 1 - 1/N

Naturally, as N tends to infinity, R² tends to 1.
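A quick simulation can illustrate the 1 - 1/N result. This is a sketch under stated assumptions: Gaussian shocks with Var(ε) = 1, 20,000 independent walks of 50 steps, and variances measured across the ensemble of walks at the final step (which is what the derivation describes). The variance of a single path around its own mean is a different quantity and will not match these numbers exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_walks, n_steps = 20_000, 50             # arbitrary sizes for the simulation

# Each row is one walk: y_t = y_{t-1} + eps_t with y_0 = 0 and Var(eps) = 1.
eps = rng.normal(size=(n_walks, n_steps))
walks = np.cumsum(eps, axis=1)

y_last = walks[:, -1]                     # y_N, the value to predict
y_prev = walks[:, -2]                     # y_{N-1}, the "last known value" forecast

residual_var = np.var(y_last - y_prev)    # should be close to Var(eps)
total_var = np.var(y_last)                # should be close to N * Var(eps)
r2 = 1.0 - residual_var / total_var       # should be close to 1 - 1/N
```

Increasing n_steps pushes r2 towards 1 even though the forecast adds no information beyond the previous value.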

So, a model that predicts the next value based on the last is a good one because R² is 1, right? No! This is a spurious regression. Such a fit may be useful for diagnostics (maybe) but not as a model.

Linear Regression tip 

I built a decent sales model with:

MAE      :         1439240.3746               6.3954 % of mean
MAPE     :              17.0928 %
R²       :               0.9646

If you plot the residuals against the target values, you should see a random scatter plot around y=0.

This was not the case for me on my first try. There was a pattern. The residuals tended to be below zero for small target values and above zero for large. Basically, the plots were saying I was overestimating low values and underestimating large values. 
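One thing worth checking when reading such a plot: in ordinary least squares with an intercept, the residuals are uncorrelated with the fitted values by construction, but they are positively correlated with the actual target values (the correlation works out to the square root of 1 - R²). So a mild below-zero-for-small, above-zero-for-large tilt appears against the target even for a correctly specified model. That does not mean the pattern described here was only this effect. The sketch below uses made-up synthetic data and hypothetical variable names to show the size of the built-in tilt.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Synthetic, made-up data: a genuinely linear relationship plus noise.
X = rng.normal(size=(n, 3))
y = 10.0 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted                     # actual minus predicted

r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

corr_with_target = np.corrcoef(residuals, y)[0, 1]       # positive by construction
corr_with_fitted = np.corrcoef(residuals, fitted)[0, 1]  # zero up to rounding
```

Plotting the residuals against the fitted values as well as the target values helps separate this built-in tilt from genuine non-linearity.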

Since this was a linear regression, adding some non-linearity really helped. I both squared and logged all my regressors:

MAE      :         1293045.8398               5.1369 % of mean
MAPE     :              10.7136 %
R²       :               0.9717

This was great but the improvement didn't appear to actually come from the non-linearity. Instead, it was this line removing lots of rows:

        df = df.replace([np.inf, -np.inf], np.nan).dropna()

We need this because linear regression cannot tolerate anything but finite numbers. Why removing the rows that had zero or negative values (and consequently infinite or NaN logged regressors) improved the metrics remains a mystery.
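For reference, this is how NumPy's logarithm treats values that are not strictly positive. The input array is made up for illustration.

```python
import numpy as np

values = np.array([-5.0, 0.0, 1.0, 100.0])   # made-up regressor values

# Silence the RuntimeWarnings NumPy emits for log(0) and log(negative).
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(values)

# log(negative) is NaN and log(0) is -inf; only positive inputs stay finite.
is_finite = np.isfinite(logged)
```

So the replace-then-dropna line removes rows containing a zero (via -inf turned into NaN) as well as rows containing a negative value (already NaN).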

But it also caused another problem. As more columns were added, the probability of a row being excluded because it had a NaN increased. Therefore, this little line of Python in our data pipeline had the unintended consequence of reducing the number of rows as the number of columns increased!
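The effect is easy to reproduce. The sketch below assumes made-up data in which each cell is independently NaN with probability 5%, so a row with k columns survives with probability 0.95^k. The column names and the choice of model_columns are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_rows, n_cols = 10_000, 20

# Made-up data in which each cell is independently NaN with probability 5%.
data = rng.normal(size=(n_rows, n_cols))
data[rng.random(size=data.shape) < 0.05] = np.nan
df = pd.DataFrame(data, columns=[f"x{i}" for i in range(n_cols)])

# Rows surviving dropna() when only the first k columns are kept.
surviving = {k: len(df.iloc[:, :k].dropna()) for k in (1, 5, 10, 20)}

# Restricting the check to the columns the model uses limits the damage.
model_columns = ["x0", "x1"]              # hypothetical choice of regressors
kept = df.dropna(subset=model_columns)
```

One possible mitigation, shown in the last line, is to pass subset= to dropna so that only the columns that actually feed the model can cause a row to be discarded.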


Friday, May 15, 2026

Polaris and Cloud tokens

Polaris rather pleasingly mints cloud tokens that are subscoped to a directory in a bucket or blob container for AWS and GCP. That is, even if the token has been hijacked, the blast radius is limited by:
  • the token only allows access to a single directory and its subfolders, not the whole bucket
  • the token is no good after X minutes (where the default value of X is 60)
There's currently an outstanding ticket to give subscoping to Azure.

The code for vending for the different clouds belongs in the implementations of PolarisStorageIntegration.getSubscopedCreds and this is where the tokens are created. You could put breakpoints in these classes:

com.google.auth.oauth2.AccessToken
com.azure.core.credential.AccessToken
software.amazon.awssdk.auth.credentials.AwsSessionCredentials 

and grab the credentials and use them on the command line (that is, entirely outside of Polaris) thus:

# AWS
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_SESSION_TOKEN=...  aws s3 ls s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

# Azure
az storage blob list  --account-name $STORAGE_ACCOUNT   --container-name $CONTAINER --sas-token $SAS_TOKEN --prefix YOUR_DIRECTORY

# GCP
CLOUDSDK_AUTH_ACCESS_TOKEN=ya29... gcloud storage ls gs://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

But even if you did, the tokens no longer work after 60 minutes and in the case of AWS and GCP, you cannot even view directories for which the token was not defined.

Wednesday, May 6, 2026

The state of Polaris

Huzzah! My PR for Apache Polaris has been accepted and merged with its main branch! Here are some miscellaneous notes I made as I looked at what needed to be done and how to test my code.

Federated catalogs

How Polaris handles vended credentials in federated catalogs is still an ongoing concern [Polaris mailing list]. The issue concerns who has the say over what is vended. If the external catalog does not allow user X but the Polaris instance that defers to it does, is user X allowed to use that data or not?

In the ticket "Does Polaris support credential vending for external REST Catalogs?", Polaris maintainer, Alex Dutra, says:
"When the client requests credential vending, Polaris forwards the request to the remote catalog, but mints temporary credentials itself and vends them to the client. IOW, a PolarisStorageConfigurationInfo must have been configured when declaring the external catalog in Polaris, and it's this storage config that will be used for vending credentials."

Integration tests 

My GcpCatalogFederationIntegrationIT lives in runtime/service/src/cloudTest/ and, unlike its counterparts in runtime/service/src/, it needs to run against an already started Polaris instance (the latter start their own and run out of the box).

Note, you'll have to set:

polaris.features."ENABLE_CATALOG_FEDERATION"=true
polaris.features."ALLOW_OVERLAPPING_CATALOG_URLS"=true

in runtime/defaults/src/main/resources/application.properties and

"-Dpolaris.bootstrap.credentials=POLARIS,test-admin,test-secret",

in runtime/server/build.gradle.kts if you're running your Polaris from the source code because ServerManager hard codes the ClientCredentials.

Note that your Polaris will need Google credentials for my code - for example:

export GOOGLE_APPLICATION_CREDENTIALS=/home/henryp/gcp.json 

Now you can run Polaris with:

./gradlew --stop && ./gradlew run