Saturday, June 6, 2026

Production ready Polaris

Persisting the metadata

First, create a DB. I created a Postgres RDS database in AWS and allowed it to create the VPC, subnets etc. It took me a while to work out why I could connect from my laptop but not a Polaris running in Azure: the source in the security group AWS automatically generated allowed my IP address but not Microsoft's.

You can check what IP address the greater internet sees you as with:

curl -s https://checkip.amazonaws.com

Bootstrap Polaris with:

docker run --rm -it \
  --env="polaris.persistence.type=relational-jdbc" \
  --env="quarkus.datasource.username=$DB_USERNAME" \
  --env="quarkus.datasource.password=$DB_PASSWORD" \
  --env="quarkus.datasource.jdbc.url=jdbc:postgresql://$DB_HOST:5432/polaris_db" \
  apache/polaris-admin-tool:latest bootstrap -r POLARIS -c POLARIS,root,s3cr3t

(You can purge the database by using the above but with arguments purge -r POLARIS)

Then you can see in Postgres:

postgres=> \c polaris_db
polaris_db=> SELECT * FROM pg_catalog.pg_tables;
     schemaname     |           tablename           | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity 
--------------------+-------------------------------+------------+------------+------------+----------+-------------+-------------
 polaris_schema     | version                       | postgres   |            | t          | f        | f           | f
 polaris_schema     | entities                      | postgres   |            | t          | f        | f           | f
 polaris_schema     | grant_records                 | postgres   |            | t          | f        | f           | f
 polaris_schema     | principal_authentication_data | postgres   |            | t          | f        | f           | f

Note that the database itself must exist before you bootstrap. I created it with:

CREATE DATABASE polaris_db;
CREATE USER polaris_user WITH PASSWORD 'your_secure_password';
GRANT ALL PRIVILEGES ON DATABASE polaris_db TO polaris_user;
\c polaris_db
GRANT ALL ON SCHEMA public TO polaris_user;

If you mess up your Polaris, just run:

kubectl rollout restart deployment polaris-deployment

as now the data is all in the database.

Access Control

For integration tests, you can just use the client_id and client_secret with which you set up Polaris. But if you put it in production, you'll want to create users (Principals).

"At the most basic level, Polaris' persistence layer stores Entities and Grants, where Grants define the access-control-related relationship between entities." [Apache Polaris Catalog Federation Proposal]

To access Polaris, you need a Principal. This will have Principal Roles. They need to be associated with the Catalog Roles that in turn belong to a Catalog.
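To make that chain concrete, here is a toy sketch of the relationships in Python. It is not Polaris code: the class names, the can method and the example principal, roles and catalog are all made up for illustration, and the privilege strings are only there as labels.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRole:
    name: str
    catalog: str                                   # the Catalog this role belongs to
    privileges: set = field(default_factory=set)   # illustrative privilege labels


@dataclass
class PrincipalRole:
    name: str
    catalog_roles: list = field(default_factory=list)


@dataclass
class Principal:
    name: str
    principal_roles: list = field(default_factory=list)

    def can(self, privilege: str, catalog: str) -> bool:
        """Walk Principal -> Principal Role -> Catalog Role -> Catalog."""
        return any(
            catalog_role.catalog == catalog and privilege in catalog_role.privileges
            for principal_role in self.principal_roles
            for catalog_role in principal_role.catalog_roles
        )


# Hypothetical names throughout.
reader = CatalogRole("reader", catalog="aws", privileges={"TABLE_READ_DATA"})
analysts = PrincipalRole("analysts", catalog_roles=[reader])
alice = Principal("alice", principal_roles=[analysts])

print(alice.can("TABLE_READ_DATA", "aws"))    # True
print(alice.can("TABLE_WRITE_DATA", "aws"))   # False
```

The point is only that a Principal never holds a privilege directly; it gets there via two hops of roles.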

REST via curl

Effective debugging of Polaris can be done by poking its REST API.

First, you need a token:

POLARIS_TOKEN=$(curl -X POST "https://$HOST/api/catalog/v1/oauth/tokens" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=client_credentials&client_id=$CLIENT_ID&client_secret=$CLIENT_SECRET&scope=PRINCIPAL_ROLE:ALL" \
  | jq -r '.access_token')
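If you would rather do this from a script than from curl, the same request can be sketched with nothing but the Python standard library. The function names here are my own invention; the URL path, form fields and scope are copied from the curl command above.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the client-credentials request used above."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "PRINCIPAL_ROLE:ALL",
    }).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/api/catalog/v1/oauth/tokens",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_token(host: str, client_id: str, client_secret: str) -> str:
    """Send the request and pull out access_token, like the jq filter does."""
    request = build_token_request(host, client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Splitting the building from the sending makes it easy to inspect exactly what would go over the wire before pointing it at a real Polaris.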

View namespaces

curl -X GET "https://$HOST/api/catalog/v1/$CATALOG_NAME/namespaces/$NAMESPACE" -H "Authorization: Bearer ${POLARIS_TOKEN}" | jq
{
  "namespace": [
    "samples"
  ],
  "properties": {
    "owner": "henryp",
    "location": "s3a://emrys-afon-bucket/samples/"
  }
}

View tables

curl -X GET "https://$HOST/api/catalog/v1/$CATALOG_NAME/namespaces/$NAMESPACE/tables" \
  -H "Authorization: Bearer ${POLARIS_TOKEN}" \
  -H "Accept: application/json" \
  -s | jq .

or given a table:

curl -X GET "https://$HOST/api/catalog/v1/$CATALOG_NAME/namespaces/$NAMESPACE/tables/$TABLE" -H "Authorization: Bearer ${POLARIS_TOKEN}" | jq
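The two table URLs differ only in their last path segment, which a small helper makes obvious. This is a sketch with hypothetical function names; it assumes a single-level namespace and only builds the request rather than sending it.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def tables_url(host: str, catalog: str, namespace: str, table: Optional[str] = None) -> str:
    """URL that lists the tables in a namespace or, given a table, loads that table."""
    base = (
        f"https://{host}/api/catalog/v1/{quote(catalog, safe='')}"
        f"/namespaces/{quote(namespace, safe='')}/tables"
    )
    return base if table is None else f"{base}/{quote(table, safe='')}"


def authorised_get(url: str, token: str) -> urllib.request.Request:
    """GET request carrying the bearer token, as in the curl commands above."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


print(tables_url("HOST", "aws", "samples"))
# https://HOST/api/catalog/v1/aws/namespaces/samples/tables
print(tables_url("HOST", "aws", "samples", "my_table"))
# https://HOST/api/catalog/v1/aws/namespaces/samples/tables/my_table
```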

Clean up

Remove lingering details in the namespace with something like:

curl -X POST -H "Content-Type: application/json" -H "Authorization: Bearer ${POLARIS_TOKEN}"  "https://$HOST/api/catalog/v1/aws/namespaces/samples/properties" --data '{"removals": ["owner"] }' | jq

Delete Catalog

curl -X DELETE "https://$HOST/api/management/v1/catalogs/$CATALOG_NAME" \
  -H "Authorization: Bearer ${POLARIS_TOKEN}" \
  -H "Content-Type: application/json" \
  -o /dev/null -s -w "%{http_code}\n"

Sunday, May 17, 2026

R-squared

R² is a good way to test how well your model describes the data. It is literally

1 - (variance your model fails to explain / total variance)

It is therefore model agnostic. (For more about the maths of variance, see a previous post).

A value of 1.0 means the model describes the data perfectly, 0.0 means it is the same as guessing the mean and less than zero means it's worse than useless.
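Those three cases are easy to check by hand. Below is a minimal pure-Python sketch (the function name and the four data points are made up) that computes R² as one minus the ratio of the squared residuals to the squared deviations from the mean.

```python
def r_squared(actual, predicted):
    """1 - (unexplained sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean) ** 2 for a in actual)
    return 1 - unexplained / total


actual = [1.0, 2.0, 3.0, 4.0]

print(r_squared(actual, actual))                  # 1.0  (perfect)
print(r_squared(actual, [2.5] * 4))               # 0.0  (always guessing the mean)
print(r_squared(actual, [4.0, 3.0, 2.0, 1.0]))    # -3.0 (worse than useless)
```

Nothing in the function knows what kind of model produced the predictions, which is the sense in which the metric is model agnostic.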

So, here's an interesting question: if my prediction for the next value in a series is the last known value, what is R²? In theory, it should approach 1 and there is a mathematical argument for this.

Some general points.

1. The formula for covariance is Cov(X,Y) = Σ(X_t - X̅)(Y_t - Ȳ)/N
Expand it and you'll see it's the same as E[XY] - E[X]E[Y].

2. We assume that a value is the sum of previous shocks multiplied by a factor that geometrically reduces with each step. We'll call this factor φ, the autoregressive coefficient.

So:

y_t = ε_t + φ ε_{t-1} + φ^2 ε_{t-2} + ... where |φ|<1

And with a self-substitution, this becomes:

y_t = ε_t + φ y_{t-1}

We generally add a constant to this formula but for mathematical simplicity the rest of the argument assumes we've centred the variable on its mean - ie E[Y]=0.

3. The covariance of two unrelated (that is, independent) distributions is 0
   a. Note that the joint probability distribution of x and y, if they're independent, is f(x,y) = f(x)f(y).
   b. Integrate to get the expected value E[XY]: the integrand x y f(x,y) factorises into x f(x) multiplied by y f(y).
   c. When you do that, you'll find E[XY] = E[X]E[Y].
   d. Substituting this into the formula for covariance above, Cov(X,Y) = 0 if X and Y are independent.

4. The covariance of two distributions that are the same is the variance.
Substitute X=Y into the formula for covariance and you get Cov(X,Y) = Cov(X,X) = E[X^2] - E[X]^2 = Var(X) as you can see in a previous blog post.

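5. Take the formula in step 2 and take the covariance of both sides with y_{t-1}:

Cov(y_t, y_{t-1}) = Cov(ε_t, y_{t-1}) + φ Cov(y_{t-1}, y_{t-1})

Cov(ε_t, y_{t-1}) is zero because of step 3: the new shock is independent of everything that came before it.

Cov(y_{t-1}, y_{t-1}) = Var(y_{t-1}) because of step 4, and Var(Y) = E[Y^2] - E[Y]^2 = E[Y^2] because E[Y]=0 since we centred Y on its mean.
So, rearranging, φ = Cov(y_t, y_{t-1}) / Var(y_{t-1})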
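You can sanity-check points 1 and 5 numerically. This is a sketch rather than a proof: the choice of φ = 0.7, standard normal shocks, the seed and the series length are all arbitrary assumptions of mine.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sum of (x - mean_x)(y - mean_y) over N, as in point 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def simulate_ar1(phi, steps, rng):
    """Each value is a fresh standard normal shock plus phi times the previous value."""
    series, previous = [], 0.0
    for _ in range(steps):
        previous = rng.gauss(0.0, 1.0) + phi * previous
        series.append(previous)
    return series


rng = random.Random(42)
series = simulate_ar1(phi=0.7, steps=20_000, rng=rng)
current, lagged = series[1:], series[:-1]

# Point 1: both forms of the covariance should agree up to rounding error.
via_expectations = mean([x * y for x, y in zip(current, lagged)]) - mean(current) * mean(lagged)
print(abs(covariance(current, lagged) - via_expectations) < 1e-9)

# Point 5: Cov(current, lagged) / Var(lagged) should land close to 0.7.
phi_hat = covariance(current, lagged) / covariance(lagged, lagged)
print(round(phi_hat, 2))
```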

For a random walk:

y_t = ε_t + y_{t-1}

Necessarily φ=1 because the most recent step is the last plus some randomness.

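So, the variance our model fails to explain is Var(ε), since the error in predicting y_t from y_{t-1} is just ε_t. To find the total variance, note that the above formula can equivalently be written as:

y_t = y_0 + ε_1 + ε_2 + ε_3 + ... + ε_t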

and choosing y0=0 then:

Var(y_t) = Var(ε_1) + Var(ε_2) + Var(ε_3) + ... (the shocks are independent, so their variances add)
Var(y_N) = N Var(ε)

The total variance changes over time. With the walk centred at zero, y_N is the sum of N shocks after N steps, so the total variance is N Var(ε).

Putting these two values into the equation at the top of this post,

R² = 1 - Var(ε) / (N Var(ε)) = 1 - (1/N)

Naturally, as N tends to infinity, R² tends to 1.
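A quick simulation backs this up. The sketch below assumes standard normal shocks, so Var(ε) is 1, and picks N = 50 steps and 5,000 walks arbitrarily; it measures the variance of where the walks end up and plugs that into the formula.

```python
import random


def random_walk_endpoints(steps, walks, rng):
    """Where each walk ends after summing `steps` standard normal shocks."""
    endpoints = []
    for _ in range(walks):
        position = 0.0
        for _ in range(steps):
            position += rng.gauss(0.0, 1.0)
        endpoints.append(position)
    return endpoints


def variance(values):
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values) / len(values)


rng = random.Random(0)
steps = 50
endpoints = random_walk_endpoints(steps, walks=5_000, rng=rng)

total_variance = variance(endpoints)   # should be near steps * Var(shock) = 50
r2 = 1 - 1.0 / total_variance          # residual variance is Var(shock) = 1 by construction

print(round(total_variance, 1))
print(round(r2, 3), "versus 1 - 1/N =", 1 - 1 / steps)
```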

So, a model that predicts the next value based on the last is a good one because R² is close to 1, right? No! This is a spurious regression. Such regressions may be useful for diagnostics (maybe) but not as models.

Linear Regression tip 

I built a decent sales model with:

MAE      :         1439240.3746               6.3954 % of mean
MAPE     :              17.0928 %
R²       :               0.9646

If you plot the residuals against the target values, you should see a random scatter plot around y=0.

This was not the case for me on my first try. There was a pattern. The residuals tended to be below zero for small target values and above zero for large. Basically, the plots were saying I was overestimating low values and underestimating large values. 

Since this was a linear regression, adding some non-linearity really helped. I both squared and took the log of all my regressors:

MAE      :         1293045.8398               5.1369 % of mean
MAPE     :              10.7136 %
R²       :               0.9717

This was great but the improvement didn't appear to actually come from the non-linearity. Instead, it was this line removing lots of rows:

        df = df.replace([np.inf, -np.inf], np.nan).dropna()

We need this because linear regression is intolerant of anything but a real number. Why removing the rows that had zero or negative values (and consequently infinite or undefined logged regressors) helped remains a mystery.

But it also caused another problem. As more columns were added, the probability of a row being excluded because it had a NaN increased. Therefore, this little line of Python in our data pipeline had the unintended consequence of reducing the number of rows as the number of columns increased!
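The effect is easy to underestimate. Here is a back-of-the-envelope simulation in plain Python. It assumes every cell is bad independently with a 2% chance (a number invented for illustration, not measured from my data), and counts the rows a dropna-style filter would keep.

```python
import random


def surviving_rows(n_rows, n_columns, p_missing, rng):
    """Count rows with no missing cell, mimicking what dropna() keeps."""
    survivors = 0
    for _ in range(n_rows):
        if all(rng.random() >= p_missing for _ in range(n_columns)):
            survivors += 1
    return survivors


rng = random.Random(1)
for n_columns in (1, 5, 20):
    kept = surviving_rows(10_000, n_columns, 0.02, rng)
    expected = 10_000 * (1 - 0.02) ** n_columns
    print(n_columns, "columns:", kept, "rows kept, theory says about", round(expected))
```

Under these assumptions the surviving fraction is (1 - p)^k for k columns, so even a small per-cell failure rate compounds quickly as regressors are added.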


Friday, May 15, 2026

Polaris and Cloud tokens

Polaris rather pleasingly mints cloud tokens that are subscoped to a directory in a bucket or blob container for AWS and GCP. That is, even if the token has been hijacked, the blast radius is limited by:
  • the token only allows access to a single directory and its subfolders, not the whole bucket
  • the token is no good after X minutes (where the default value of X is 60)
There's currently an outstanding ticket to give subscoping to Azure.
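The two limits can be thought of as a prefix check and an expiry check. The toy model below is only that: in reality the cloud provider enforces both, and the class and method names here are hypothetical rather than anything in Polaris.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SubscopedToken:
    prefix: str                                  # directory the token was minted for
    issued_at: datetime
    lifetime: timedelta = timedelta(minutes=60)  # the default mentioned above

    def allows(self, path: str, now: datetime) -> bool:
        in_scope = path.startswith(self.prefix)
        unexpired = now < self.issued_at + self.lifetime
        return in_scope and unexpired


issued = datetime(2026, 5, 15, 12, 0)
token = SubscopedToken("s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN/", issued)

half_an_hour_later = issued + timedelta(minutes=30)
print(token.allows("s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN/data.parquet", half_an_hour_later))  # True
print(token.allows("s3://YOUR_BUCKET/SOMEWHERE_ELSE/data.parquet", half_an_hour_later))       # False
print(token.allows("s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN/data.parquet",
                   issued + timedelta(minutes=61)))                                           # False
```

Note the trailing slash on the prefix: without it, a sibling directory whose name merely starts with the same characters would pass this naive check.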

The code for vending for the different clouds belongs in the implementations of PolarisStorageIntegration.getSubscopedCreds and this is where the tokens are created. You could put breakpoints in the constructors of:

com.google.auth.oauth2.AccessToken
com.azure.core.credential.AccessToken
software.amazon.awssdk.auth.credentials.AwsSessionCredentials 

and grab the credentials and use them on the command line (that is, entirely outside of Polaris) thus:

# AWS
AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... AWS_SESSION_TOKEN=...  aws s3 ls s3://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

# Azure
az storage blob list  --account-name $STORAGE_ACCOUNT   --container-name $CONTAINER --sas-token $SAS_TOKEN --prefix YOUR_DIRECTORY

# GCP
CLOUDSDK_AUTH_ACCESS_TOKEN=ya29... gcloud storage ls gs://YOUR_BUCKET/DIRECTORY_FOR_TOKEN

But even if you did, the tokens no longer work after 60 minutes and in the case of AWS and GCP, you cannot even view directories for which the token was not defined.

Wednesday, May 6, 2026

The state of Polaris

Huzzah! My PR for Apache Polaris has been accepted and merged with its main branch! Here are some miscellaneous notes I made as I looked at what needed to be done and how to test my code.

Federated catalogs

How Polaris handles vended credentials in federated catalogs is still an ongoing concern [Polaris mailing list]. The issue concerns who has the say over what is vended. If the external catalog does not allow user X but the Polaris instance that defers to it does, is user X allowed to use that data or not?

In the ticket "Does Polaris support credential vending for external REST Catalogs?", Polaris maintainer, Alex Dutra, says:
"When the client requests credential vending, Polaris forwards the request to the remote catalog, but mints temporary credentials itself and vends them to the client. IOW, a PolarisStorageConfigurationInfo must have been configured when declaring the external catalog in Polaris, and it's this storage config that will be used for vending credentials."

Integration tests

My GcpCatalogFederationIntegrationIT lives in runtime/service/src/cloudTest/ and, unlike its counterparts in runtime/service/src/, it needs to run against an already started Polaris instance (the latter start their own and run out of the box).

Note, you'll have to set:

polaris.features."ENABLE_CATALOG_FEDERATION"=true
polaris.features."ALLOW_OVERLAPPING_CATALOG_URLS"=true

in runtime/defaults/src/main/resources/application.properties and

"-Dpolaris.bootstrap.credentials=POLARIS,test-admin,test-secret",

in runtime/server/build.gradle.kts if you're running your Polaris from the source code because ServerManager hard codes the ClientCredentials.

Note that your Polaris will need Google credentials for my code - for example:

export GOOGLE_APPLICATION_CREDENTIALS=/home/henryp/gcp.json 

Now you can run Polaris with:

./gradlew --stop && ./gradlew run

Monday, April 20, 2026

Cloud Topologies and Azure

All clouds follow a similar pattern when it comes to networking. All of them have:
  • a VPC (virtual private cloud, or VNet in Azure) that is an entirely isolated environment
  • subnets that are divided into public and private IP spaces by using non-overlapping CIDR blocks
  • an internet gateway that allows access to the outside world and performs the NAT
  • Route tables that permit traffic flows by typically bringing together subnets and security groups. If subnets are rooms in a building, route tables are the corridors between them and the Network ACLs (NACLs) are the bouncers on the door. Security groups are more identity based and they do a similar job to NACLs albeit at the vNIC level.
  • a load balancer that allows incoming traffic from the internet. This differs from the internet gateway as it's more the receptionist directing people rather than the security guard on the front door.
  • the computers/VMs on which Kubernetes runs, called nodes.
The terminology may differ slightly but this is true for all the main cloud providers.
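The "non-overlapping CIDR blocks" part is worth checking before you hand anything to your cloud's tooling. Here is a small sketch using Python's ipaddress module; the address ranges and the function name are just examples I picked.

```python
import ipaddress


def find_problems(vpc_cidr, subnet_cidrs):
    """Report subnets that fall outside the network or overlap each other."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    for name, subnet in subnets.items():
        if not subnet.subnet_of(vpc):
            problems.append(f"{name} is outside the VPC")
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(find_problems("10.0.0.0/16", {"public": "10.0.1.0/24", "private": "10.0.2.0/24"}))
# []
print(find_problems("10.0.0.0/16", {"public": "10.0.1.0/24", "private": "10.0.1.128/25"}))
# ['public overlaps private']
```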

Azure

Now, bearing in mind that this is the general concept for all cloud/K8s clusters, the recipe in Azure is:
  1. Create the Resource Group
  2. Create an identity for this group.
  3. Create a network for this group and identity.
Now, when it comes to the network, the steps are:
  1. Create the Virtual Network compatible with the resource group
  2. Create the subnet for the Virtual Network
  3. Create a Network Security Group for the resource group. This defines the inbound and outbound rules. Funnily enough, the rules in the NSG are of higher priority the lower the number.
  4. We associate the subnet with the Network Security Group
  5. We assign a role to the subnet.
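The priority ordering in step 3 is easier to see in code. This is a simplified model of my own, not Azure's implementation: rules are tried in ascending priority number, the first match wins, and anything unmatched is denied. Real NSG rules have more fields (protocol, direction, destination and so on).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int   # lower number is evaluated first
    source: str     # CIDR block the rule applies to
    port: int
    allow: bool


def decide(rules, source_ip, port, default_allow=False):
    """Return the verdict of the first matching rule in priority order."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if address in ipaddress.ip_network(rule.source) and port == rule.port:
            return rule.allow
    return default_allow


rules = [
    Rule(priority=200, source="0.0.0.0/0", port=5432, allow=False),
    Rule(priority=100, source="203.0.113.0/24", port=5432, allow=True),
]

print(decide(rules, "203.0.113.7", 5432))    # True: priority 100 matches first
print(decide(rules, "198.51.100.1", 5432))   # False: only the priority 200 deny matches
print(decide(rules, "203.0.113.7", 22))      # False: nothing matches, so default deny
```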
Now we can set up the Kubernetes cluster if we:
  1. Assign the subnet to a K8s Node Pool
  2. Create a network profile
This last one is interesting. It's "the reference to the network interface card" [Azure for Architects, Packt]. The network card is the actual physical hardware the packets must use but generally it is only referenced in Azure (and GCP) as a virtual network interface card (vNIC). In AWS, they're called ENIs.

Tuesday, April 7, 2026

Network security for numpties

Some concepts and terminology:

STUN (Session Traversal Utilities for NAT)
STUN servers can be queried via the STUN protocol to give the IP address you are known as on the internet (that is, after NAT).
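The protocol is pleasantly small. As a sketch of the wire format as I understand it from RFC 5389, here is how a Binding Request is laid out and how the IPv4 XOR-MAPPED-ADDRESS value in the reply is decoded. Nothing is sent over the network here, and the encode helper exists only so the decoder can be exercised offline.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001


def build_binding_request(transaction_id=None):
    """20-byte header: type, attribute length (0), magic cookie, 12-byte transaction id."""
    transaction_id = transaction_id or os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def decode_xor_mapped_address(value):
    """Decode the 8-byte value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    _reserved, family, x_port, x_address = struct.unpack("!BBHI", value)
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    port = x_port ^ (MAGIC_COOKIE >> 16)
    address = socket.inet_ntoa(struct.pack("!I", x_address ^ MAGIC_COOKIE))
    return address, port


def encode_xor_mapped_address(address, port):
    """Inverse of the decoder, for offline testing only."""
    (raw,) = struct.unpack("!I", socket.inet_aton(address))
    return struct.pack("!BBHI", 0, 0x01, port ^ (MAGIC_COOKIE >> 16), raw ^ MAGIC_COOKIE)


request = build_binding_request()
print(len(request))   # 20

value = encode_xor_mapped_address("203.0.113.9", 54321)
print(decode_xor_mapped_address(value))   # ('203.0.113.9', 54321)
```

In real use you would send the request over UDP to a STUN server and pick the attribute out of the response; that part is left out.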

Hole Punching
Outbound packets implicitly create an open port to allow the expected reply. This is exploited to allow a long running conversation to take place between peers.
Some NAT blocks this technique.

DERP (Designated Encrypted Relay for Packets)
A server through which traffic can pass if hole punching is not available.
This is secure since the packets are encrypted by the two peers, so the server just routes packets without being part of the conversation.

WireGuard
Uses UDP so fast.
Built into the Linux kernel using a very small number of lines of code.
Peers are not identified by their IP addresses but their public keys.

Tailscale
Part open source, part paid service that allows you to have a Virtual Private Network that can span multiple physical locations.
Each device talks to each other using all the technologies mentioned above.
The server part is proprietary but there are open source alternatives (Headscale, NetBird, Nebula from Slack).
Tailscale differs from the commercial Prisma Access in that its architecture is peer-to-peer whereas your traffic in Prisma passes through their Service Edge where packets can be inspected for security reasons.

Remote Desktops
There are a few (Selkies, Apache Guacamole, Kasm, n.eko etc) but they all follow one of two paradigms: old fashioned RDP; and WebRTC. The latter doesn't expose the VM directly (by using the technologies above), and uses encryption natively.

Kubernetes Cilium
Cilium is considered more secure than Calico as it uses WireGuard (see above) and eBPF - where Linux filters packets at the kernel level and also reduces copying data into user space so it's more efficient. (Apparently, Calico can now be configured to use WireGuard).

Note that if you want to use Cilium in AWS and you're using EksCluster to create your Kubernetes cluster, you first need to kubectl delete ds the aws-node DaemonSet (and kube-proxy too, if Cilium is replacing it), both of which live in the kube-system namespace.

OAuth logins
If you want to hide your Kubernetes service behind an OAuth page, you can use oauth2-proxy which starts a pod in your cluster that links to the OAuth provider defined by --oidc-issuer-url. In my case this is https://accounts.google.com and I've configured my Google account to redirect to my URL and have it generate the credentials under OAuth 2.0 Client IDs at the Google GUI.

Friday, March 20, 2026

Multi-cloud Devops tips

I'm using multiple clouds on a regular basis and constantly need to jump between them. So, here are some commands I use often:

Kubernetes

See all the clusters you have access to with:

kubectl config get-contexts

Your current one is highlighted with an asterisk. If you just want to see your current context, run:

kubectl config current-context

Change to another with:

kubectl config use-context NAME_OF_CONTEXT

With AWS, you might need to refresh your K8s config with:

aws eks update-kubeconfig --region $REGION --name $CLUSTER_NAME

after switching.

AWS

See who you currently are in AWS with:

aws sts get-caller-identity 

and just to check you have access to S3:

aws s3 ls s3://BUCKET_NAME/DIRECTORY/

Azure

See who you are with:

az ad signed-in-user show --query id -o tsv

or

az account show

And check access to blob storage with:

az storage blob list --account-name ACCOUNT_NAME --container-name BUCKET --output table --prefix acceptancetests/ --delimiter /

GCP

See who you are with:

gcloud auth list

and check storage access with:

gcloud storage ls gs://BUCKET_NAME/DIRECTORY

I've put all these commands here because I keep forgetting them.