Thursday, February 19, 2026

An unruly Terraform

If the Terraform state is out of synch with reality, you might need to change that state manually with something like:

tofu state list

followed by

tofu state rm XXX

I had to delete the load balancers manually through the AWS Web Console, and then the EKS cluster too. I then had to delete any references to them from my JSON by hand.
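As a sketch, assuming the dangling resources were the load balancer service and the EKS cluster (the addresses here are hypothetical):

tofu state list | grep -Ei 'jupyter|eks'        # find the offending addresses
tofu state rm 'kubernetes_service.jupyter-lb'   # stop tracking them
tofu state rm 'aws_eks_cluster.main'
tofu plan                                       # confirm state and reality now agree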

Tip: regularly delete the directory in which the Terraform lives, as state accumulates there that the next run implicitly relies upon. If you don't, then after a major refactor the configuration runs and everything looks fine, so you check in thinking you've done a good job; but there was an invisible dependency on the previous run, and checking out to a fresh directory fails. So:

Delete all files regularly

I was getting lots of:

│ Error: Get "https://21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com/api/v1/namespaces/default/services/jupyter-lb": dial tcp: lookup 21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com on 127.0.0.1:53: no such host
 

even after blatting my Terraform cdktf.out/stacks directory. It turns out state files were accumulating in the root directory of my project (which contains cdktf.out). Once they too were blatted, things looked better.
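These days, something like this (a minimal blat, assuming the default cdktf layout) goes with every major refactor:

rm -rf cdktf.out                  # the synthesized stacks
rm -rf .terraform .terraform.lock.hcl
rm -f terraform.*.tfstate*        # the state files lurking in the project root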

Changing the cdk.tf.json file resulted in:

│ Error: Inconsistent dependency lock file
│ 
│ The following dependency selections recorded in the lock file are inconsistent with the current configuration:
│   - provider registry.opentofu.org/hashicorp/helm: required by this configuration but no version is selected
│ 
│ To update the locked dependency selections to match a changed configuration, run:
│   tofu init -upgrade

The solution was, as the message says, to run:

tofu init -upgrade

GCP

You might see this error when running Terraform on GCP:

│ Error: Error setting access_token
│ 
│   with data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C,
│   on cdk.tf.json line 25, in data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C:
│   25:       }
│ 
│ oauth2: "invalid_grant" "reauth related error (invalid_rapt)" "https://support.google.com/a/answer/9368756"

It's nothing really to do with TF but rather your GCP credentials. Log in with:

gcloud auth application-default login

and try again. D'oh.

AWS

aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=$VPC --region $REGION

aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-security-groups --filters Name=vpc-id,Values=$VPC --region $REGION

This last one showed 3 security groups.
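Once tofu destroy had given up, the stragglers had to be deleted by hand. A sketch, with hypothetical IDs (the order matters: the security groups can't go until the network interfaces that reference them are gone):

aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0 --region $REGION
aws ec2 delete-security-group --group-id sg-0123456789abcdef0 --region $REGION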

The reason these AWS entities lingered is that my tofu destroy was always hanging. And the reason it never finished is that there were Kubernetes finalizers preventing deletion. To avoid this, I needed to run:

kubectl patch installation default -p '{"metadata":{"finalizers":[]}}' --type=merge
kubectl patch service YOUR_LOAD_BALANCER -p '{"metadata":{"finalizers":null}}'  --type=merge

Also, CRDs need to be destroyed:

for CRD in $(kubectl get crds --no-headers | awk '{print $1}') ; do
    kubectl patch crd $CRD --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
    kubectl delete crd $CRD --force --grace-period=0
done

I would then run these scripts from a local-exec provisioner in a resource.

I asked on the DevOps Discord server how normal this was:
PhillHenry
I'm using Terraform to manage my AWS stack that (amongst other things) creates a load balancer using an aws-load-balancer-controller. I'm finding destroying the stack just hangs then times out after 20 minutes.

I've had to introduce bash scripts that patch finalizers in services and installations plus force delete CRDs. Finally, tofu destroy cleans everything up but I can't help feeling I'm doing it all wrong by having to add hacks.

Is this normal? If not, can somebody point me in the right direction over what I'm doing wrong? 
snuufix
It is normal with buggy providers, it's just sad that even AWS is one.

Redeploying

Redeploying a component was simply a matter of running:

tofu apply -replace=kubernetes_manifest.sparkConnectManifest  --auto-approve

This is a great way to redeploy just my Spark Connect pod when I've changed the config.
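If you can't remember the exact resource address to pass to -replace, state list is your friend (assuming the name contains 'spark'):

tofu state list | grep -i spark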


Helm

If you want to find out what version of a Helm chart you're using when you forget to set it, this might help. It's where Helm caches the charts it downloads.

$ ls -ltr ~/.cache/helm/repository/
...
-rw-r--r-- 1 henryp henryp 107929 Nov 12 10:41 spark-kubernetes-operator-1.3.0.tgz
-rw-r--r-- 1 henryp henryp 317214 Nov 20 15:56 eks-index.yaml
-rw-r--r-- 1 henryp henryp    433 Nov 20 15:56 eks-charts.txt
-rw-r--r-- 1 henryp henryp  36607 Nov 24 09:15 aws-load-balancer-controller-1.15.0.tgz
-rw-r--r-- 1 henryp henryp 493108 Dec 11 14:47 kube-prometheus-stack-51.8.0.tgz
-rw-r--r-- 1 henryp henryp  38337 Dec 15 12:20 aws-load-balancer-controller-1.16.0.tgz
...
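If the release is still installed, the chart version is also in the CHART column of:

helm list --all-namespaces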

More Kubernetes notes

Networking

This was an interesting online conversation about how network packets sent to the cluster's IP address are redirected to a pod in a Kubernetes cluster.

Each pod has its own IP which is managed by the Container Networking Interface (CNI). Every node runs a kube-proxy which manages how cluster IPs map to pod IPs. The mappings are updated dynamically and only include pods passing their health checks.

The node receiving the TCP request does forward it to the destination pod, but the mechanism depends on the CNI. In cloud environments like AWS and GCP, the CNI just sends it directly out on to the network, and the network itself knows the pod IPs and takes care of it. This is so-called VPC Native networking.

Some CNIs have no knowledge of the existing network; they run an overlay inside the cluster that manages the transport, typically with IP encapsulation, sending the encapsulated packet to the destination node.

In VPC Native networking, your node just sends packets to the destination pod like a regular packet. The pods are fully understood and routable by the network itself. 

It works differently on-prem, and depends on your CNI. In an on-prem network, using most other CNIs, including microk8s's (which uses Calico), the network doesn't know anything about pod IPs. Calico sets up an overlay network which mimics a separate network to handle pod-to-pod communication.

In VPC Native networking, things that are outside your Kubernetes cluster can communicate directly with k8s pods. GCP actually supports this, while AWS uses security groups to block it by default (but you can enable it). In overlay CNIs like Calico or Flannel, you have to be inside the cluster to talk to pods in the cluster.

[hubt - Discord]
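A quick way to see the pod IP/cluster IP distinction on your own cluster (the service name here is hypothetical):

kubectl get pods -o wide           # each pod's IP, as assigned by the CNI
kubectl get svc                    # the virtual cluster IP of each service
kubectl get endpoints jupyter-lb   # the healthy pod IPs currently behind the service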

Debugging

This proved very useful when a pod suddenly died:

kubectl get events -A --sort-by=.lastTimestamp | grep -i POD_NAME

Turned out:

The node was low on resource: ephemeral-storage. Threshold quantity: 2139512454, available: 1826840Ki. Container spark-kubernetes-driver was using 1551680Ki, request is 0, has larger consumption of ephemeral-storage.

This is about the best way to see what happened to a pod once it dies.
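To check whether the node itself was under pressure (node name hypothetical):

kubectl describe node NODE_NAME | grep -A 6 'Conditions:'   # look for DiskPressure
kubectl get events -A --field-selector reason=Evicted       # all evictions, cluster-wide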

Saturday, February 14, 2026

GPU Programming pt 1.

This is a nice Java implementation of various LLMs that can also defer to TornadoVM to process the maths on a GPU. Looking at the code gives a good idea of the impedance mismatch between CPU and GPU programming as the Java code covers both. Here are some miscellaneous notes.

GPU Terminology

In TornadoVM's KernelContext you'll see various references to Global, Group and Local IDs.

The global ID is the overall ID of the thread. Note that it can be virtualized - that is, it can be larger than the number of threads your hardware supports.

The group ID refers to a logical grouping of threads. Note that a warp is a physical (ie, hardware-dependent) grouping of threads. Work groups are made of integer multiples of warps. Warps always have 32 threads on NVIDIA hardware and process in lockstep. Work groups, however, can execute their warps asynchronously.

If a warp hits an if/else statement, the branches are executed sequentially and you lose parallelism!

Threads in a work group can share memory. The local ID is the ID of a thread within that group.

GPU algorithms

Writing algorithms is different in the world of GPUs. Take this Java/TornadoVM code that is turned into GPU code:

context.localBarrier();

for (int stride = localWorkGroupSize / 2; stride > 0; stride >>= 1) {
    if (localId < stride) {
        localSum[localId] += localSum[localId + stride];
    }
    context.localBarrier();
}

is actually reducing an array to an int using the GPU's threads. It starts with half the warp of 32 threads: first 16 threads, then 8, then 4, etc. On each iteration, every active thread takes two elements of the array and adds them, so only half the threads are needed on the next iteration, and so on. All the other threads are "masked", that is, not used.

Mapping TornadoVM to GPU concepts

There is a line in fusedQKVMatmulX that basically says:

if (context.globalId < input_array_size) ...

Yeah, but what if the maximum globalId (the actual thread ID) is much lower than the array size? Do we ignore the rest of the array? 

The answer is no, because globalId is a virtual ID and does not represent the physical limits of your hardware. As it happens, my RTX 4070 (Laptop) has 4608 CUDA cores whereas the model I am running (Llama-3.2-1B-Instruct-F16) has a hidden size of 4096, so it seems it all fits onto the hardware without resorting to virtualization tricks.

The functions above don't generally have loops in them. The reason is that the loop is implicit: each GPU thread calls the function once, with its own global ID.

Graal- and TornadoVM

Note that TornadoVM relies heavily on GraalVM. If you look at the stack, you'll see the code in PTXCompiler.emitFrontEnd appears to exploit GraalVM's ability to examine the bytecode of the functions mentioned above. It does this so it can convert them into PTX, CUDA's low-level intermediate representation.

Consequently, you'll never see any breakpoints hit in these TransformerComputeKernelsLayered functions: they don't run on the JVM at all.

Wednesday, February 11, 2026

My Polaris PR

I've had some issues with a federated (a.k.a. external) catalog in Polaris connecting to GCP, so I raised this ticket outlining the problem.

Having a bit of time to implement it, I've raised a PR. The first thing I had to do was get familiar with:

The Polaris Architecture

Note the DTOs are automatically generated (see spec/polaris-management-service.yml). See client/python/spec/README.md for full instructions, but running:

redocly bundle spec/polaris-catalog-service.yaml -o spec/generated/bundled-polaris-catalog-service.yaml

brings all the YAML together into a single, bundled spec.

The reason for doing it this way is so you can generate both Python (with make client-regenerate) and Java (with ./gradlew :polaris-api-management-model:openApiGenerate) stubs that are in lockstep with the spec.

So, the DTOs are auto-generated but the DPOs are hand-coded. This is because the DPOs are internal whereas the DTOs are client-facing, and that client could be Java, Python or something else.

After making the change, then it's:

./gradlew assemble -x test && ./gradlew publishToMavenLocal -x test

to push it to the awaiting code in my project.

Git

Then as I make my changes, I keep pulling from original repo with:

git pull https://github.com/apache/polaris.git main --rebase

The --rebase at the end says "make my branch exactly the same as the original repo, then add my deltas at the end of its history."

Following the Polaris instructions, I noticed that my origin was the Polaris Git repo (see this with git remote -v). I actually found it easier to run:

git remote set-url origin https://github.com/PhillHenry/polaris.git
git remote add upstream  https://github.com/apache/polaris
git push --force-with-lease origin 3451_federated_google_auth # this is the branch

to push my changes (and any from Apache) to my own branch.

Now, with:

$ git remote -v
origin  https://github.com/PhillHenry/polaris.git (fetch)
origin  https://github.com/PhillHenry/polaris.git (push)
upstream        https://github.com/apache/polaris (fetch)
upstream        https://github.com/apache/polaris (push)

I can keep my repo in synch with the original and ensure that my changes are always the last commits in the history with:

git fetch upstream
git fetch origin
git rebase upstream/main

as rebase flattens the history graph and rewrites the commit hashes (the changes themselves are preserved).

To squash the commits, run:

git config --global core.editor "vim" # I prefer vim to emacs
git rebase -i HASH_OF_LAST_COMMIT_THAT_IS_NOT_YOURS

then edit the file such that the top line starts with pick and the subsequent commits begin with squash. Save it and you'll be prompted to write another file. Put the final, informative commit message here. Save that too, then push.
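For what it's worth, that first file looks something like this (the commits here are hypothetical); leave the top line as pick and change the rest to squash:

pick   1a2b3c4 Support GoogleAuthManager in federated catalogs
squash 5d6e7f8 Fix integration tests
squash 9a0b1c2 Address review comments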

If you get into a pickle,

git reset --hard
rm -fr ".git/rebase-merge"

gets you back to where you were.

Once you're happy

Don't forget to 

./gradlew spotlessApply

Note, this will change the files on disk. Also, run:

./gradlew  build -x rat

For my 24 core Intel Ultra 9 185H:

BUILD SUCCESSFUL in 23m 52s

so, I don't want to do this too often...

Debugging

Polaris is heavily dependent on Quarkus, which was throwing an HTTP 400 according to the logs but giving no further information. At this point it's good to put a breakpoint in org.jboss.resteasy.reactive.server.handlers.RequestDeserializeHandler; I suspected it was related to my new DTOs.

Google

Google by default stops a service account from impersonating itself.

So, to mitigate this in my integration tests, I've created two service accounts - one that my Polaris always runs as and the second to pretend to be the account that manages access to the external catalog. You get the Polaris SA to impersonate the external SA with:

gcloud iam service-accounts add-iam-policy-binding EXTERNAL_SA@PROJECT_ID.iam.gserviceaccount.com --member="serviceAccount:POLARIS_SA@PROJECT_ID.iam.gserviceaccount.com"  --role="roles/iam.serviceAccountTokenCreator"
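You can sanity-check that the impersonation works before running the tests with:

gcloud auth print-access-token --impersonate-service-account=EXTERNAL_SA@PROJECT_ID.iam.gserviceaccount.com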

An unexpected regression

Almost there, I came across this unexpected error:

2026-02-09 09:38:11,924 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] ... java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.iam.credentials.v1.stub.GrpcIamCredentialsStub
        at com.google.cloud.iam.credentials.v1.stub.IamCredentialsStubSettings.createStub(IamCredentialsStubSettings.java:145)       

The error was deep in some class initialization so I added this code:

try {
    ProtoUtils.marshaller(GenerateAccessTokenRequest.getDefaultInstance());
} catch (Throwable t) {
    if (t.getCause() != null) {  // guard: the Throwable may have no cause
        t.getCause().printStackTrace();
    }
    LOGGER.error("Failed to create IAM credentials stub", t);
}

which gave:

Caused by: com.google.protobuf.RuntimeVersion$ProtobufRuntimeVersionException: Detected incompatible Protobuf Gencode/Runtime versions when loading GenerateAccessTokenRequest: gencode 4.33.2, runtime 4.32.1. Runtime version cannot be older than the linked gencode version.
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersionImpl(RuntimeVersion.java:120)
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersion(RuntimeVersion.java:68)
        at com.google.cloud.iam.credentials.v1.GenerateAccessTokenRequest.<clinit>(GenerateAccessTokenRequest.java:32)
        ... 77 more
com.google.protobuf.RuntimeVersion$ProtobufRuntimeVersionException: Detected incompatible Protobuf Gencode/Runtime versions when loading GenerateAccessTokenRequest: gencode 4.33.2, runtime 4.32.1. Runtime version cannot be older than the linked gencode version.
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersionImpl(RuntimeVersion.java:120)
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersion(RuntimeVersion.java:68)
        at com.google.cloud.iam.credentials.v1.GenerateAccessTokenRequest.<clinit>(GenerateAccessTokenRequest.java:32)
        at org.apache.polaris.core.storage.gcp.GcpCredentialsStorageIntegration.createIamCredentialsClient(GcpCredentialsStorageIntegration.java:287)

Urgh. It appears that GenerateAccessTokenRequest (which is itself annotated @com.google.protobuf.Generated) in the JAR proto-google-cloud-iamcredentials-v1:2.83.0 says in its static initializer that it is associated with protobuf version 4.33.2. Meanwhile, RuntimeVersion in the JAR protobuf-java:4.32.1 checks this against itself and the check obviously fails.

Tuesday, February 3, 2026

Polaris Federation notes

Here are some miscellaneous notes I made as I worked my way through grokking Polaris Federation and Authorization.

First:

Polaris DevOps

Polaris has its integration tests in the integration-tests/src/main/java/ directory, not the test directory as you might have expected. The reason for this is that they can be packaged as a JAR and used elsewhere in the codebase.

The advantage to doing it this way is that the same tests can be run against a local Polaris, a Polaris in the cloud, a Polaris running in Docker etc.

So, if we take CatalogFederationIntegrationTest, we can see it subclassed in 
the spark-tests Gradle package where it can be run with:

./gradlew :polaris-runtime-spark-tests:intTest

If you try to run the superclass in its own module with Gradle, it cannot be found as it's not in the test directory. If you try to run it with your IDE, you'll find that classes to be wired in at runtime are missing. Running the subclass, CatalogFederationIT, starts a MinIO docker container against which it can run.

Federation

The DTOs (Data Transfer Objects) for creating catalogs etc live in 

org.apache.polaris.core.admin.model 

For example ExternalCatalog which can be serialized into JSON.

These are passed across the wire and are turned into DPOs (Data Persistence Objects) that live in 

org.apache.polaris.core.connection.iceberg

In the case of IcebergRestConnectionConfigInfoDpo, this DPO is not a mere anaemic domain model. It has the logic to, for instance, create the properties that will be used to instantiate the class that will govern authentication. It does this by delegating to this factory class:

org.apache.iceberg.rest.auth.AuthManagers

Notice that we have moved from Polaris to the world of Iceberg. The various AuthManagers implement access to OAuth2 providers, Google, SigV4 for AWS etc.

However, there is a mismatch. The AuthenticationParameters DTO classes don't fully align with the AuthManager classes. For instance, there doesn't appear to be a way of creating an external catalog with authorisation via org.apache.iceberg.gcp.auth.GoogleAuthManager.

So, after a day of investigating and trying to hack something together, it looks like this:
  • Iceberg can talk to Google no problem using org.apache.iceberg.gcp.auth.GoogleAuthManager.
  • However, there is currently no Polaris code to use GoogleAuthManager in an external catalog.
  • Instead, the only way to do it currently is to use the standard OAuth2 code.
  • However, Google does not completely follow the OAuth2 spec, hence this Iceberg ticket that led to the writing of GoogleAuthManager and this StackOverflow post that says GCP does not support the grant_type that Iceberg's OAuth2Util uses.
This has now been raised in this Polaris ticket.

Saturday, January 31, 2026

Notes on Poetry

The dependencies in Poetry can get screwed. If hashes don't agree then blatting the poetry.lock file will not help. Instead, run:

poetry cache clear pypi --all
poetry cache clear --all .
rm -rf ~/.cache/pypoetry

When updating a dependency, run:

poetry lock
poetry install

This will install an environment in a subfolder of:

~/.cache/pypoetry/virtualenvs/

You can point your IDE at the Python executable underneath this.

Poetry is pretty nice when showing you dependencies. Running something like:

poetry show pandas

shows you everything Pandas needs and everything that depends on it.
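There's also a tree view of the whole environment:

poetry show --tree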

This was necessary when decoding a bizarre error in a Jupyter notebook where an import was failing even though the module was clearly there. In this case, statsmodels and Pandas seemed to disagree.

The code to put in the notebook to check it was using the right version of a library is:

import statsmodels
import pandas

print(statsmodels.__version__)
print(pandas.__version__)
print(statsmodels.__file__)

Now, compare this to the Python environment:

poetry run python - <<EOF
import statsmodels, pandas
print(statsmodels.__version__)
print(statsmodels.__file__)
print(pandas.__version__)
EOF

I had changed dependency versions but my IDE (PyCharm) did not recognise the change until I restarted it.

Some useful one-liners

Run your tests with:

poetry run pytest

The whereabouts of your tests can be found in your pyproject.toml file. It should look something like:

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = "src"

With this, your tests can import anything under the ROOT/src directory.

Add a dependency with something like:

poetry add ipykernel

This will update your pyproject.toml (and poetry.lock) files.

Monday, January 12, 2026

The Federation: AWS Glue

Apache Polaris can act as a proxy to other catalogs. This still appears to be work in progress as the roadmap proposal has "Catalog Federation" as "Tentatively Planned" at least until release 1.5.

If you're running it from source, you'll need to enable:

polaris.features."ENABLE_CATALOG_FEDERATION"=true
polaris.features."SUPPORTED_CATALOG_CONNECTION_TYPES"=["ICEBERG_REST", "HIVE"]
polaris.features."SUPPORTED_EXTERNAL_CATALOG_AUTHENTICATION_TYPES"=["OAUTH", "BEARER", "SIGV4"]


Apache Polaris can be a proxy for an Iceberg REST endpoint. Whatever the underlying catalog, an org.apache.polaris.core.admin.model.ExternalCatalog is passed across the wire to create the catalog. Only the details differ.

AWS Glue

Glue is its own beast but it does offer an Iceberg REST endpoint. To use it, the AuthenticationParameters in the ExternalCatalog must be of type SigV4AuthenticationParameters.
"In IAM authentication, instead of using a password to authenticate against the [service], you create an authentication token that you include with your ... request. These tokens can be generated outside the [service] using AWS Signature Version 4(SigV4) and can be used in place of regular authentication." [1]
So, the SigV4AuthenticationParameters ends up taking the region, role ARN, etc. The role must be available to the Principal that is associated with the Polaris instance. In addition, there must be a --policy-document that allows the Action glue:GetCatalog.
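As a sketch (role and policy names hypothetical, and you may well need more Glue actions than just glue:GetCatalog):

aws iam put-role-policy \
  --role-name polaris-glue-federation \
  --policy-name allow-glue-get-catalog \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "glue:GetCatalog",
      "Resource": "*"
    }]
  }'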

Finally, the Glue database and table must be created with Parameters that contain iceberg.table.default.namespace and an IcebergInput block.
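Again, a sketch with hypothetical names; the exact Parameters will depend on your setup:

aws glue create-database --database-input '{
  "Name": "my_db",
  "Parameters": {"iceberg.table.default.namespace": "my_db"}
}'

aws glue create-table --database-name my_db \
  --table-input '{"Name": "my_table", "Parameters": {"iceberg.table.default.namespace": "my_db"}}' \
  --open-table-format-input '{"IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}}'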

TL;DR - most of the work is in configuring AWS not the calling code.

[1] Security and Microservice Architecture on AWS