Sunday, May 17, 2026

R-squared

R² is a good way to test how well your model describes the data. It is literally

1 - (variance of the prediction errors / total variance)

It is therefore model agnostic. (For more about the maths of variance, see a previous post).

A value of 1.0 means the model describes the data perfectly, 0.0 means it is the same as guessing the mean and less than zero means it's worse than useless.
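As a minimal sketch of the three cases above, here is R² computed by hand in plain Python. The function name r_squared and the toy numbers are my own, purely for illustration.

```python
def r_squared(actual, predicted):
    """Return 1 - (sum of squared errors / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(actual, actual)                  # model describes the data perfectly
mean_guess = r_squared(actual, [2.5] * 4)            # always guessing the mean
backwards = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # worse than guessing the mean
```

Here perfect is 1.0, mean_guess is 0.0 and backwards is negative (-3.0 for these numbers).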

So, here's an interesting question: if my prediction for the next value in a series is the last known value, what is R²? In theory, it should be close to 1 and there is a mathematical argument for this.

Some general points.

1. The formula for covariance is Σ(Xt-X̅)(Yt-Ȳ)/N
Expand it and you'll see it's the same as E[XY] - E[X]E[Y].

2. We assume that a value is the sum of previous shocks multiplied by a factor that geometrically reduces with each step. We'll call this φ, the autoregressive coefficient.

So:

y_t = ε_t + φ ε_{t-1} + φ² ε_{t-2} + ... where |φ|<1

And with a self-substitution, this becomes:

y_t = ε_t + φ y_{t-1}

We generally add a constant to this formula but for mathematical simplicity the rest of the argument assumes we've centred the variable on its mean - ie E[Y]=0.

3. The covariance of two independent distributions is 0
   1. Note that if x and y are independent, their joint probability distribution is f(x,y) = f(x)f(y).
   2. E[XY] is the integral of x y f(x,y) which, for independent variables, is the integral of x f(x) y f(y).
   3. That integral factorises into the integral of x f(x) times the integral of y f(y), so E[XY] = E[X]E[Y].
   4. Substituting this into the formula for covariance above, Cov(X,Y) = 0 if X and Y are independent.

4. The covariance of two distributions that are the same is the variance.
Substitute X=Y into the formula for covariance and you get Cov(X,Y) = Cov(X,X) = E[X²] - E[X]² = Var(X) as you can see in a previous blog post.

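5. Take that formula in step 2, multiply both sides by y_{t-1} and take the expected value of all the terms:

E[y_t y_{t-1}] = E[ε_t y_{t-1}] + φ E[y_{t-1}²]

Because we centred Y on its mean, these expected values are covariances. Cov(ε_t, y_{t-1}) is zero because of step 3: the latest shock is independent of everything that came before it.

Cov(y_{t-1}, y_{t-1}) = Var(y_{t-1}) = E[Y²] - E[Y]² = E[Y²] because E[Y]=0.
So, rearranging, φ = Cov(y_t, y_{t-1}) / Var(y_{t-1})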
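The result of step 5 can be sanity checked numerically. This is only a sketch: it simulates the autoregressive formula from step 2 with a φ of my choosing (0.7), standard normal shocks and a fixed seed, then estimates φ as covariance over variance. The helper names are hypothetical.

```python
import random

def simulate_ar1(phi, n_steps, seed=42):
    """Generate y_t = shock_t + phi * y_{t-1}, starting from y_0 = 0."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n_steps):
        y.append(rng.gauss(0.0, 1.0) + phi * y[-1])
    return y

def covariance(xs, ys):
    """Population covariance: sum of (x - mean_x)(y - mean_y) over N."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

series = simulate_ar1(phi=0.7, n_steps=50_000)
current, previous = series[1:], series[:-1]

# Cov(X, X) is Var(X), as in step 4.
phi_estimate = covariance(current, previous) / covariance(previous, previous)
```

With this many steps, phi_estimate should land close to the 0.7 that went in.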

For a random walk:

y_t = ε_t + y_{t-1}

Necessarily φ=1 because the most recent step is the last plus some randomness.

So, the variance of our prediction errors is Var(ε), because the error in predicting y_t with y_{t-1} is just ε_t.

The total variance is different. The above formula can equivalently be written as:

y_N = y_0 + ε_1 + ε_2 + ε_3 + ... + ε_N

and choosing y_0=0 then, because the shocks are independent (step 3) and so their variances simply add:

Var(y_N) = Var(ε_1) + Var(ε_2) + Var(ε_3) + ... + Var(ε_N)
Var(y_N) = N Var(ε)

Note that the total variance changes over time: it grows with the number of steps, N.

Putting these two values into the equation at the top of this post,

R² = 1 - Var(ε) / (N Var(ε)) = 1 - (1/N)

Naturally, as N tends to infinity, R² tends to 1.

So, a model that predicts the next value based on the last is a good one because R² is 1, right? No! This is a spurious regression. Such regressions may be useful for diagnostics (maybe) but not for models.
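To see this in practice, here is a small simulation of a random walk scored with the "last known value" prediction. It is a sketch under my own assumptions (standard normal shocks, a fixed seed, a walk of 10,000 steps). The R² of a single simulated path will not be exactly 1 - 1/N, since the argument above is about expected variances, but it should sit very close to 1.

```python
import random

def naive_forecast_r2(n_steps, seed=0):
    """R-squared of predicting each value of a random walk with the previous one."""
    rng = random.Random(seed)
    walk = [0.0]
    for _ in range(n_steps):
        walk.append(walk[-1] + rng.gauss(0.0, 1.0))

    actual, predicted = walk[1:], walk[:-1]
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # just the shocks squared
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # grows as the walk wanders
    return 1.0 - ss_res / ss_tot

r2_long_walk = naive_forecast_r2(10_000)
```

The score looks excellent even though the "model" knows nothing about where the walk is going next.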

Linear Regression tip 

I built a decent sales model with:

MAE      :         1439240.3746               6.3954 % of mean
MAPE     :              17.0928 %
R²       :               0.9646

If you plot the residuals against the target values, you should see a random scatter plot around y=0.

This was not the case for me on my first try. There was a pattern. The residuals tended to be below zero for small target values and above zero for large. Basically, the plots were saying I was overestimating low values and underestimating large values. 

Since this was a linear regression, adding some non-linearity really helped. I both squared and logged all my regressors:

MAE      :         1293045.8398               5.1369 % of mean
MAPE     :              10.7136 %
R²       :               0.9717

This was great but the improvement didn't appear to actually come from the non-linearity. Instead, it was this line removing lots of rows:

        df = df.replace([np.inf, -np.inf], np.nan).dropna()

We need this because linear regression is intolerant of anything but a real number. Why removing the rows that had zero or negative values (whose logged regressors are -inf or NaN) helped remains a mystery.

But it also caused another problem. As more columns were added, the probability of a row being excluded because it had a NaN increased. Therefore, this little line of Python in our data pipeline had the unintended consequence of reducing the number of rows as the number of columns increased!
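The effect is easy to reproduce without the real data. The sketch below assumes (and this is only an assumption for illustration) that every cell is independently missing with the same probability, in which case the fraction of rows surviving a dropna-style filter is roughly (1 - p) raised to the number of columns. The function name and the 5% figure are made up.

```python
import math
import random

def surviving_fraction(n_rows, n_columns, p_missing, seed=1):
    """Fraction of rows left after dropping every row containing a NaN."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(n_rows):
        row = [math.nan if rng.random() < p_missing else 1.0 for _ in range(n_columns)]
        if not any(math.isnan(value) for value in row):
            kept += 1
    return kept / n_rows

few_columns = surviving_fraction(10_000, n_columns=2, p_missing=0.05)
many_columns = surviving_fraction(10_000, n_columns=20, p_missing=0.05)
```

With 2 columns about 90% of the rows survive (0.95²); with 20 columns only about 36% do (0.95²⁰).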


Friday, May 15, 2026

Polaris and Cloud tokens

Polaris rather pleasingly mints cloud tokens that are subscoped to a directory in a bucket or blob container for AWS and GCP. That is, even if the token has been hijacked, the blast radius is limited by:
  • the token only allows access to a single directory and its subfolders, not the whole bucket
  • the token is no good after X minutes (where the default value of X is 60)
There's currently an outstanding ticket to give subscoping to Azure.

The code for vending for the different clouds belongs in the implementations of PolarisStorageIntegration.getSubscopedCreds and this is where the tokens are created. You could put breakpoints in the following classes:

com.google.auth.oauth2.AccessToken
com.azure.core.credential.AccessToken
software.amazon.awssdk.auth.credentials.AwsSessionCredentials 

and grab the credentials and use them on the command line (that is, entirely outside of Polaris) thus:

# AWS
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_SESSION_TOKEN=...  aws s3 ls s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

# Azure
az storage blob list  --account-name $STORAGE_ACCOUNT   --container-name $CONTAINER --sas-token $SAS_TOKEN --prefix YOUR_DIRECTORY

# GCP
CLOUDSDK_AUTH_ACCESS_TOKEN=ya29... gcloud storage ls gs://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

But even if you did, the tokens no longer work after 60 minutes and in the case of AWS and GCP, you cannot even view directories for which the token was not defined.
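If you want to script the AWS check rather than type it, something like the following would do. It is only a sketch that wraps the same aws s3 ls command shown above using Python's subprocess module. The function names are hypothetical, the credentials are whatever you grabbed in the debugger, and it assumes the aws CLI is on your PATH.

```python
import os
import subprocess

def aws_env(access_key_id, secret_access_key, session_token):
    """Copy the current environment and add the vended session credentials."""
    env = dict(os.environ)
    env.update({
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
        "AWS_SESSION_TOKEN": session_token,
    })
    return env

def aws_ls_command(bucket, directory):
    """The same command as above, as an argument list."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{directory}"]

def can_list(bucket, directory, env):
    """True if `aws s3 ls` exits with status 0 for this prefix."""
    result = subprocess.run(aws_ls_command(bucket, directory),
                            env=env, capture_output=True, text=True)
    return result.returncode == 0
```

Calling can_list for the directory the token was minted for and then for a sibling directory should show the subscoping at work: the expectation is that the first succeeds and the second does not.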

Wednesday, May 6, 2026

The state of Polaris

Huzzah! My PR for Apache Polaris has been accepted and merged with its main branch! Here are some miscellaneous notes I made as I looked at what needed to be done and how to test my code.

Federated catalogs

How Polaris handles vended credentials in federated catalogs is still an ongoing concern [Polaris mailing list]. The issue concerns who has say over what is vended. If the external catalog does not allow user X but the Polaris instance that defers to it does, is user X allowed to use that data or not?

In the ticket "Does Polaris support credential vending for external REST Catalogs?", Polaris maintainer, Alex Dutra, says:
"When the client requests credential vending, Polaris forwards the request to the remote catalog, but mints temporary credentials itself and vends them to the client. IOW, a PolarisStorageConfigurationInfo must have been configured when declaring the external catalog in Polaris, and it's this storage config that will be used for vending credentials."

Integration tests

My GcpCatalogFederationIntegrationIT lives in runtime/service/src/cloudTest/. Unlike their counterparts in runtime/service/src/, the tests there need to run against an already started Polaris instance (the counterparts start their own and run out of the box).

Note, you'll have to set:

polaris.features."ENABLE_CATALOG_FEDERATION"=true
polaris.features."ALLOW_OVERLAPPING_CATALOG_URLS"=true

in runtime/defaults/src/main/resources/application.properties and

"-Dpolaris.bootstrap.credentials=POLARIS,test-admin,test-secret",

in runtime/server/build.gradle.kts if you're running your Polaris from the source code because ServerManager hard codes the ClientCredentials.

Note that your Polaris will need Google credentials for my code - for example:

export GOOGLE_APPLICATION_CREDENTIALS=/home/henryp/gcp.json 

Now you can run Polaris with:

./gradlew --stop && ./gradlew run

Monday, April 20, 2026

Cloud Topologies and Azure

All clouds follow a similar pattern when it comes to networking. All of them have:
  • a VPC (virtual private cloud, or Virtual Network in Azure) that is an entirely isolated environment
  • subnets that are divided into public and private IP spaces by using non-overlapping CIDR blocks
  • an internet gateway that allows access to the outside world and performs the NAT
  • route tables that permit traffic flows by typically bringing together subnets and security groups. If subnets are rooms in a building, route tables are the corridors between them and the Network ACLs (NACLs) are the bouncers on the door. Security groups are more identity based and they do a similar job to NACLs albeit at the vNIC level.
  • a load balancer that allows incoming traffic from the internet. This differs from the internet gateway as it's more the receptionist directing people rather than the security guard on the front door.
  • the computers/VMs on which Kubernetes runs, called nodes.
The terminology may slightly differ but this is true for all the main cloud providers.
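The "non-overlapping CIDR blocks" requirement is easy to check before you create anything. Here is a small sketch using Python's standard ipaddress module. The address ranges are made up for illustration.

```python
import ipaddress

# Hypothetical address plan: one network, one public and one private subnet.
vpc = ipaddress.ip_network("10.0.0.0/16")
public_subnet = ipaddress.ip_network("10.0.1.0/24")
private_subnet = ipaddress.ip_network("10.0.2.0/24")

# The two subnets must not share any addresses...
subnets_overlap = public_subnet.overlaps(private_subnet)

# ...and both must sit inside the network's own range.
both_inside_vpc = public_subnet.subnet_of(vpc) and private_subnet.subnet_of(vpc)

# A block carved out of the public subnet would clash with it.
clashing = ipaddress.ip_network("10.0.1.128/25").overlaps(public_subnet)
```

Here subnets_overlap is False, both_inside_vpc is True and clashing is True.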

Azure

Now, bearing in mind that this is the general concept for all cloud/K8s clusters, the recipe in Azure is:
  1. Create the Resource Group
  2. Create an identity for this group.
  3. Create a network for this group and identity.
Now, when it comes to the network, the steps are:
  1. Create the Virtual Network compatible with the resource group
  2. Create the subnet for the Virtual Network
  3. Create a Network Security Group for the resource group. This defines the inbound and outbound rules. Funnily enough, the rules in the NSG are of higher priority the lower the number.
  4. We associate the subnet with the Network Security Group
  5. We assign a role to the subnet.
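The priority rule in step 3 can be sketched in a few lines. This is a deliberately simplified model, not Azure's implementation: real NSG rules also match on source, destination, protocol and direction, whereas this hypothetical Rule class matches on port alone. The point is only that rules are tried lowest number first and the first match wins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    priority: int  # lower number = evaluated first
    port: int
    action: str    # "Allow" or "Deny"

def evaluate(rules, port, default="Deny"):
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port:
            return rule.action
    return default

rules = [Rule(200, 443, "Deny"), Rule(100, 443, "Allow"), Rule(300, 22, "Deny")]

https_decision = evaluate(rules, 443)  # priority 100 beats priority 200
ssh_decision = evaluate(rules, 22)
http_decision = evaluate(rules, 80)    # nothing matches, so the default applies
```

So https_decision is "Allow" even though a "Deny" rule for the same port exists, because the "Allow" has the lower number.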
Now we can set up the Kubernetes cluster if we:
  1. Assign the subnet to a K8s Node Pool
  2. Create a network profile
This last one is interesting. It's "the reference to the network interface card" [Azure for Architects, Packt]. The network card is the actual physical hardware the packets must use but generally they're only referenced in Azure (and GCP) as virtual network interface cards (vNIC). In AWS, they're called ENIs.

Tuesday, April 7, 2026

Network security for numpties

Some concepts and terminology:

STUN (Session Traversal Utilities for NAT)
STUN servers can be queried via the STUN protocol to give the IP address you are known as on the internet (that is, after NAT).

Hole Punching
Outbound packets implicitly create an open port to allow the expected reply. This is exploited to allow a long running conversation to take place between peers.
Some NAT blocks this technique.

DERP (Designated Encrypted Relay for Packets)
A server through which traffic can pass if hole punching is not available.
This is secure since the packets are encrypted by the two peers, so the server just routes packets without being part of the conversation.

WireGuard
Uses UDP so fast.
Built into the Linux kernel using a very small number of lines of code.
Peers are not identified by their IP addresses but their public keys.

Tailscale
Part open source, part paid service that allows you to have a Virtual Private Network that can span multiple physical locations.
Each device talks to each other using all the technologies mentioned above.
The server part is proprietary but there are open source alternatives (Headscale, NetBird, Nebula from Slack).
Tailscale differs from the commercial Prisma Access in that its architecture is peer-to-peer whereas your traffic in Prisma passes through their Service Edge where packets can be inspected for security reasons.

Remote Desktops
There are a few (Selkies, Apache Guacamole, Kasm, n.eko etc) but they all follow one of two paradigms: old fashioned RDP; and WebRTC. The latter doesn't expose the VM directly (by using the technologies above), and uses encryption natively.

Kubernetes Cilium
Cilium is considered more secure than Calico as it uses WireGuard (see above) and eBPF - where Linux filters packets at the kernel level and also reduces copying data into user space so it's more efficient. (Apparently, Calico can now be configured to use WireGuard).

Note that if you want to use Cilium in AWS and you're using EksCluster to create your Kubernetes cluster, you first need to kubectl delete ds both aws-node and kube-proxy in the kube-system namespace.

OAuth logins
If you want to hide your Kubernetes service behind an OAuth page, you can use oauth2-proxy which starts a pod in your cluster that links to the OAuth provider defined by --oidc-issuer-url. In my case this is https://accounts.google.com and I've configured my Google account to redirect to my URL and have it generate the credentials under OAuth 2.0 Client IDs at the Google GUI.

Friday, March 20, 2026

Multi-cloud Devops tips

I'm using multiple clouds on a regular basis and constantly need to jump between them. So, here are some commands I use often:

Kubernetes

See all the clusters you have access to with:

kubectl config get-contexts

Your current one is highlighted with an asterisk. If you just want to see your current context, run:

kubectl config current-context

Change to another with:

kubectl config use-context NAME_OF_CONTEXT

With AWS, you might need to refresh your K8s config with:

aws eks update-kubeconfig --region $REGION --name $CLUSTER_NAME

after switching.

AWS

See who you currently are in AWS with:

aws sts get-caller-identity 

and just to check you have access to S3:

aws s3 ls s3://BUCKET_NAME/DIRECTORY/

Azure

See who you are with:

az ad signed-in-user show --query id -o tsv

or

az account show

And check access to BLOB storage with:

az storage blob list --account-name ACCOUNT_NAME --container-name BUCKET --output table --prefix acceptancetests/ --delimiter /

GCP

See who you are with:

gcloud auth list

and check storage access with:

gcloud storage ls gs://BUCKET_NAME/DIRECTORY

I've put all these commands here because I keep forgetting them.

Saturday, March 14, 2026

What happens in an LLM? (part 1)

A nice overview that's detailed but not too intricate is here [blog of SteelPh0enix AKA Wojciech Olech].

Note that when using a fully trained LLM, things are conceptually much simpler because it is more or less just a feedforward network. That is, the weights are immutable. State lives outside of the ANN and is updated by the output after each token runs through the feedforward network.

Attention!

The attention mechanism of an RNN takes the encodings and for each token, augments the input vector both forwards and backwards. "The rationale behind this is to capture additional information since current inputs may have a dependence on sequence elements that came either before or after it in a sentence, or both." [1] The two vectors are then concatenated to make one long vector. "We can consider this concatenated hidden state as the annotation of the source word since it contains the information of the jth word in both directions." [1]

The self-attention mechanism for each word has three vectors: query, key and value. It compares the query of each word to the key of the others. This process is done several times in parallel ("multi-head attention") using different Q/K/V weights and the results are combined.

The transformer architecture has superseded RNNs.

The cat sat on the mat

Conceptually, q represents the query (eg, the word "sat" is looking for something in the sentence to do the sitting); k is the key where the word "cat" is saying I am a noun that can sit; and v links the verb looking for a noun and the noun looking for a verb.

This is similar to when we apply singular value decomposition (SVD) to a document/term space and create a concept space.

Basic self attention

Imagine T input vectors x(i)
Tokens are embedded in a space of size d.
The T output vectors of self-attention are vectors z(i).
These vectors are calculated thus:

z(i) = Σ_{j=1..T} α_ij x(j)

where α is a matrix of the dot products of all the x vectors with softmax applied to each row (remember, softmax does not change the relative ordering of the logits but does convert them into probabilities).
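Written out in plain Python, basic self-attention is only a few lines. This is a sketch of the formula above with no learnable parameters. The three toy "token" vectors and the function names are made up for illustration.

```python
import math

def softmax(values):
    """Turn logits into probabilities (subtracting the max for numerical stability)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def basic_self_attention(xs):
    """z(i) is the sum over j of alpha_ij * x(j), alpha being softmaxed dot products."""
    outputs = []
    for x_i in xs:
        alphas = softmax([dot(x_i, x_j) for x_j in xs])
        z_i = [sum(a * x_j[k] for a, x_j in zip(alphas, xs)) for k in range(len(x_i))]
        outputs.append(z_i)
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T=3 tokens embedded in d=2
zs = basic_self_attention(tokens)
```

Each output vector is a weighted average of all the inputs, with the largest weights going to the inputs most similar to the token in question.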

Multi-headed attention is the same algorithm calculated for multiple heads, that is, multiple swimlanes of data that represent nuances in language structure. Note that these nuances (grammar, sentiment etc) are not deliberately targeted. They are an emergent property like concept space in SVD/NLP.

Self attention with learnable parameters

Here we project each x vector onto Uk, Uv and Uq. Note that Q=xUq etc and Uq etc are fixed for a feed forward network. That is, somebody has done the hard work of calculating them during training.

The projections onto q and k are then multiplied together (as a dot product) and the result is put into a matrix ωij where i is a token and j is any token (including i itself).

The matrix ω is divided by (typically) √d and softmaxed. 

That is:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
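The same formula as a plain Python sketch. The projection matrices here are just identity matrices that I have picked so the example runs; in a real model they are the weights found during training. All the names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, u_q, u_k, u_v):
    """softmax(Q K^T / sqrt(d_k)) V, where Q = x U_q, K = x U_k and V = x U_v."""
    q, k, v = matmul(x, u_q), matmul(x, u_k), matmul(x, u_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, d=2
identity = [[1.0, 0.0], [0.0, 1.0]]       # stand-in for trained weights
out = attention(x, identity, identity, identity)
```

With identity projections this collapses back to basic self-attention, apart from the scaling by √d_k.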

Tiling

Tiling is a technique used when performing matrix operations on data that won't fit into memory.

"With naïve algorithm, to compute each element of the result, we gonna need to fetch S elements from both matrices. The output matrix has S² elements, therefore the total count of loaded elements is 2S³." [Penny Xu's blog]

This breaks down as 2 [vectors - one from each matrix] * S [the size of those vectors] * S² [the number of elements that are the result of all these dot products - that is, the size of the matrix].

"With 32×32 tiling, to compute each 32×32 tile of the result, we gonna need to fetch S/32 tiles from both matrices. The output size in tiles is (S/32)², the total count of loaded [tiles] is 2*(S/32)³. Each 32×32 tile contains 32² elements, the total count of loaded elements is therefore (32²)*2*(S/32)³ = (2/32)*S³. Therefore, the tiling reduced global memory bandwidth by the factor of 32, which is a huge performance win." [ibid]

In other words, each tile holds 32² elements and 2*(S/32)³ tiles are loaded in total, so the total number of elements loaded is the first number times the second - that is, (2/32)*S³.
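The arithmetic is simple enough to check directly. A sketch, assuming S is an exact multiple of the tile size (the function names are mine):

```python
def naive_loads(s):
    """Elements loaded by the naive algorithm: 2 vectors of S elements for each of S*S outputs."""
    return 2 * s ** 3

def tiled_loads(s, tile=32):
    """Elements loaded with tiling: 2*(S/tile)^3 tile loads of tile*tile elements each."""
    tiles_per_side = s // tile
    tile_loads = 2 * tiles_per_side ** 3
    return tile_loads * tile ** 2

s = 1024
ratio = naive_loads(s) / tiled_loads(s)
```

For S=1024 the ratio is 32.0, the factor quoted above.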

Note that the resulting matrix is tiled also. So, if we're doing the matrix multiplication of C=AB, the total memory needed is one tile of each of A, B and C.
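A sketch of the loop structure in plain Python may make this clearer. It is illustrative only: Python lists give none of the performance benefit, and the tile size of 2 is chosen so the example is small. What it does show is that at any moment the inner loops only touch one tile of each of A, B and C.

```python
def tiled_matmul(a, b, tile=2):
    """C = AB computed one tile of C at a time, accumulating over tiles of A and B."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C (and A)
        for j0 in range(0, p, tile):      # tile column of C (and B)
            for k0 in range(0, m, tile):  # matching tile of A's columns / B's rows
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        c[i][j] += sum(a[i][k] * b[k][j]
                                       for k in range(k0, min(k0 + tile, m)))
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
product = tiled_matmul(a, a)
```

The min(...) calls handle matrices whose size is not an exact multiple of the tile size, as with the 3×3 example here.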

There's a further optimization. If we're looking for the maximum value (which is very common in neural nets where we typically employ the softmax function), we only need to store one value per tile per column/row.
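As a sketch of that idea, here is a row processed one tile at a time while keeping only a running maximum. I have also kept a running sum of exponentials, rescaled whenever the maximum changes, because that is the other number softmax needs; that second part goes slightly beyond what is described above, so treat it as my assumption about how the trick is typically used. The function name and tile size are made up.

```python
import math

def streaming_softmax_stats(row, tile=4):
    """Return (max, sum of exp(v - max)) for a row, seeing only one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale what we have so far to the new maximum, then add this tile.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(v - new_max) for v in chunk))
        running_max = new_max
    return running_max, running_sum

row = [0.5, 2.0, -1.0, 3.0, 1.5, 0.0, 4.0, -2.0, 2.5]
row_max, row_sum = streaming_softmax_stats(row)
```

The result matches what you would get from computing the maximum and the sum over the whole row at once, but no step ever needs more than one tile plus two numbers.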

[1] Machine Learning with PyTorch and Scikit-Learn.