Friday, March 6, 2026

Permissions and Lakes

OpenID Connect (authentication) is built on top of OAuth 2.0 (authorization). 
"It allows clients to verify the identity of the end user based on the authentication performed by an authorization server, as well as to obtain basic profile information about the end user in an interoperable and REST-like manner." - Zero Trust Networks (O'Reilly)
Keycloak, written in Java, is a common choice for an open source solution. Keycloak is an Identity Provider (IdP).

Polaris can vend credentials but it (rather than the cloud IAM system) controls who gets what. It acts as an ACL for ACLs, if you like. This code shows how a request is associated with a realm and a realm with a credential.

If you're going to use Polaris in production, you'll probably need a certificate from a recognised Certificate Authority like Let's Encrypt. The reason is that HTTPS clients validate certificates against a hard-coded list of trusted authorities who will sign off the certificate as genuine.

AWS users can use AWS Certificate Manager (ACM) to certify endpoints - rather than configuring Polaris to use SSL. You can have AWS manage the whole thing; or you can "install an externally signed private CA certificate on your subordinate CA. This CA certificate must be signed by a parent CA. Installing the certificate completes the creation and activation of the CA."

Either way, the idea is that the Elastic Load Balancer provides an HTTPS endpoint, does all the de/encryption gubbins and then forwards plain HTTP on to Polaris that sits securely in your Virtual Private Cloud.

To this end, it seems you must deploy the AWS Load Balancer Controller in your Kubernetes cluster much like the vpc-cni EksAddon.

Note that it can take a minute or two for a mapping from a domain name to an endpoint to be registered. Run:

dig A YOUR_DOMAIN_NAME +short

to see if your DNS is updated.
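If you'd rather script the wait than re-run dig by hand, the same check can be sketched in Python (the function names here are my own):

```python
import socket
import time

def resolves(name: str) -> bool:
    """True if the name currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(name, None)) > 0
    except socket.gaierror:
        return False

def wait_for_dns(name: str, timeout_s: float = 120.0, poll_s: float = 5.0) -> bool:
    """Poll until the record appears or we give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if resolves(name):
            return True
        time.sleep(poll_s)
    return False
```

Note this only tells you what *your* resolver sees; other resolvers may lag behind.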

Fine Grained Access

OpenFGA adds fine-grained access control. It is written in Go.

Kubernetes

Azure and GCP do things differently: they use a sidecar that leverages Let's Encrypt.
"Uploading and managing TLS secrets can be difficult. In addition, certificates can often come at a significant cost. To help solve this problem, there is a nonprofit called “Let’s Encrypt” running a free Certificate Authority that is API-driven. Since it is API-driven, it is possible to set up a Kubernetes cluster that automatically fetches and installs TLS certificates for you. It can be tricky to set up, but when working, it’s very simple to use. The missing piece is an open source project called cert-manager created by Jetstack, a UK startup, onboarded to the CNCF." - Kubernetes Up & Running 3rd Ed., O'Reilly

Certificates and Challenges

In the context of Kubernetes, the Certificate resource names the Secret where the final certificate will be stored and references an Issuer. The result is a Kubernetes Secret containing the actual public and private key for HTTPS.

The challenge proves to the CA that you own the domain. You can either use an HTTP-01 challenge, where you host a token at a URL under the domain, or a DNS-01 challenge, where you ask your DNS provider to host a record containing the token (see below).

Domain Ownership Validation

AWS has Route53 that nicely integrates management of domain names with Kubernetes. That is, you can seamlessly have an EKS ingress assigned an AWS-managed domain.

Google, however, has recently sold its domain name arm, so you need to persuade it that the domain you own is really yours before it will point it at your Kubernetes ingress. To do this, run:

gcloud certificate-manager dns-authorizations create ARBITRARY_STRING --domain="YOUR_DOMAIN" --project YOUR_PROJECT

Secrets

So much for security outside the cloud. Here is how you deal with it inside.

You'll need to install secrets-store-csi-driver-provider-aws, the AWS provider for the Secrets Store CSI Driver; it runs in Kubernetes and talks to AWS. It allows you to mount secrets into your container as if they were any other filesystem.
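Once mounted, reading a secret is just file I/O. A sketch, assuming the provider's conventional mount path (your SecretProviderClass may well mount elsewhere):

```python
from pathlib import Path

def read_secret(name: str, mount: str = "/mnt/secrets-store") -> str:
    """Secrets mounted by the CSI driver appear as ordinary read-only files."""
    return Path(mount, name).read_text().strip()

# e.g. db_password = read_secret("db-password")
```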

I followed the instructions in the AWS CSI driver (above) but I just could not get Option 1 to work, despite a lot of time spent checking that everything was OK. It's still a mystery, but Option 2 worked first time. 

Restart a deployment with something like:

kubectl rollout restart deployment polaris

This is especially useful if you've updated the secrets.

Cloud and K8s

I've spent the week setting up HTTPS certificates and domain names for my Azure and GCP K8s clusters.

At one point, the GCP K8s installation just kept hanging with "Still creating...". It turned out that we just hadn't allocated enough resources.
But not all hanging Kubernetes deployments are so easy to spot.

Automatically point domain names at K8s pod IPs

AWS is better integrated because you can buy the domain names through Route53. But for Azure and GCP K8s, we did the following.
  1. We bought a domain name from AWS via Route53. 
  2. We delegated the nameservers of this domain to Microsoft or Google.
  3. A Kubernetes sidecar starts up and contacts Let's Encrypt's API.
  4. Let's Encrypt returns a token.
  5. The sidecar combines this token with a thumbprint of its ACME account key (the result is the "Key Authorization") and hosts it on port 80.
  6. Let's Encrypt fetches that file and checks it against the account's public key. Now it can grant a certificate.
The sidecar speaks ACME (the Automatic Certificate Management Environment protocol) and is ephemeral:

$ kubectl get events -A --sort-by=.lastTimestamp | grep -i  acme
...
default         45m         Normal    Started                   pod/cm-acme-http-solver-6j5xp                                  Started container acmesolver
default         44m         Normal    Sync                      ingress/cm-acme-http-solver-lgl5w                              Scheduled for sync
default         43m         Normal    Killing                   pod/cm-acme-http-solver-6j5xp                                  Stopping container acmesolver
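The "Key Authorization" hosted in step 5 above involves no encryption: per RFC 8555 it is simply the challenge token joined to a SHA-256 thumbprint (RFC 7638) of the ACME account's public key, which the CA recomputes and compares. A sketch (the JWK values below are placeholders):

```python
import base64
import hashlib
import json

def b64url(data: bytes) -> str:
    """Unpadded base64url, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the lexically ordered JWK members.
    (A real implementation first strips the JWK down to its required members.)"""
    canonical = json.dumps(jwk, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode()).digest())

def key_authorization(token: str, account_jwk: dict) -> str:
    """RFC 8555 s8.1: the token, a '.', then the account key's thumbprint."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"

# Hypothetical EC account key (public parts only)
jwk = {"crv": "P-256", "kty": "EC", "x": "abc", "y": "def"}
print(key_authorization("5yGd57VQ", jwk))
```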

The outside world can talk to both a Service and an Ingress. Which you choose depends on what you want. Use an Ingress for HTTPS.

An Ingress always talks to a Service. Note that an Ingress is a logical abstraction: one domain can have multiple Ingresses. For example:

$ kubectl get ingress -A
NAMESPACE   NAME                        CLASS    HOSTS                   ADDRESS         PORTS     AGE
default     cm-acme-http-solver-ldflx   <none>   emryspolarisgcp.click   35.189.87.151   80        14m
default     polaris-ingress             nginx    emryspolarisgcp.click   35.189.87.151   80, 443   14m

is saying that emryspolarisgcp.click can point to different services depending on the port. Here, ACME is sticking around to finish the creation of the certificates.

This was for GCP, where the nameservers were incorrect (they changed every time we deployed the managed zones through Terraform). You might want to see them with:

gcloud dns managed-zones describe polaris-zone --project=afon-core

What can go wrong?

Traffic can be swallowed because of:
  • network security groups (or lack of them)
  • misconfigured ports
  • selectors not pointing at the correct pods
Incoming traffic goes through the system in this order:
  1. ingress (optional - see above)
  2. service
  3. endpoint
  4. pod
kubectl get ingress shows the name and the ports of the front-facing interface.
kubectl describe ingress XXX shows the service to which traffic is sent.
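Of those failure modes, mismatched selectors are the easiest to reproduce: a Service sends traffic to a Pod only if every key/value pair in the Service's selector appears in the Pod's labels. A toy version of the matching rule:

```python
def selects(selector: dict, pod_labels: dict) -> bool:
    """A Service matches a Pod iff every selector key/value appears in the Pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

selector = {"app": "polaris"}                      # from the Service spec
good_pod = {"app": "polaris", "tier": "backend"}   # extra labels are fine
typo_pod = {"app": "polares"}                      # silently receives no traffic
```

A typo in either the selector or the pod template produces no error, just an empty endpoint list.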

Note with an nginx load balancer, this service comes before the ingress:

$ dig A emryspolarisazure.click +short
20.108.199.92
$ kubectl get service -A
NAMESPACE       NAME                                               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
...
default         polaris-internal                                   ClusterIP      10.2.185.168   <none>          8181/TCP                     47m
ingress-nginx   nginx-ingress-ingress-nginx-controller             LoadBalancer   10.2.238.76    20.108.199.92   80:30157/TCP,443:32110/TCP   47m
ingress-nginx   nginx-ingress-ingress-nginx-controller-admission   ClusterIP      10.2.30.150    <none>          443/TCP                      47m
...

Debugging

If in doubt, port forward:

kubectl port-forward svc/polaris-internal 8080:8181 -n default

This will at least establish that the communication between your service and application is fine.

It's important to check that the firewall is at least expecting a connection on that IP address and port. Don't use curl for this, as it is subject to network security rules and to certificates being in place. So, run:

nc -zv 20.108.199.92 80

if you want to make sure that port is open, as firewalls allow a TCP three-way handshake even if the Network Security Group blocks further traffic.
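The same handshake-only probe can be done from Python, which is handy inside a debug container that lacks nc (the function name is mine):

```python
import socket

def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Attempt a TCP three-way handshake and nothing more (the analogue of `nc -zv`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, unreachable...
        return False
```

Like nc, this tells you nothing about TLS or HTTP; it only proves a listener (or a handshake-friendly firewall) is there.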

Upon setting up the stack with tofu, the certificate didn't look healthy and I couldn't access my site via HTTPS.

$ kubectl get certificate polaris-tls   
NAME          READY   SECRET        AGE
polaris-tls   False   polaris-tls   28m
$ kubectl get challenges -A
NAMESPACE   NAME                                STATE     DOMAIN                    AGE
default     polaris-tls-1-648081749-304604175   invalid   emryspolarisazure.click   36m
$ kubectl describe certificate polaris-tls
...
Events:
  Type     Reason     Age   From                                       Message
  ----     ------     ----  ----                                       -------
  Normal   Issuing    23m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal   Generated  23m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-rw6t9"
  Normal   Requested  23m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Warning  Failed     21m   cert-manager-certificates-issuing          The certificate request has failed to complete and will be retried: Failed to wait for order resource "polaris-tls-1-648081749" to become ready: order is in "invalid" state:
$ kubectl describe challenge -A
...
Events:
  Type     Reason     Age   From                     Message
  ----     ------     ----  ----                     -------
  Normal   Started    37m   cert-manager-challenges  Challenge scheduled for processing
  Normal   Presented  37m   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism
  Warning  Failed     35m   cert-manager-challenges  Accepting challenge authorization failed: acme: authorization error for emryspolarisazure.click: 400 urn:ietf:params:acme:error:connection: 51.132.211.134: Fetching http://emryspolarisazure.click/.well-known/acme-challenge/5yGd57VQUrjc2ns-Q-VEVIl3vl6WKFK4B2fQu643_TM: Timeout during connect (likely firewall problem)

Running:

kubectl delete certificate polaris-tls

did the trick as it forces the certificate to renew. Watch and wait for it to be ready with:

kubectl get certificate polaris-tls -w

You need to run this (or put the equivalent in your Terraform file):

kubectl annotate service nginx-ingress-ingress-nginx-controller   -n ingress-nginx   "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path=/healthz"

and that should now all work. A happy system should look like:

$ kubectl describe certificate polaris-tls
...
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Issuing    42m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  42m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-thlvj"
  Normal  Requested  42m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Normal  Issuing    40m   cert-manager-certificates-issuing          The certificate has been successfully issued

Thursday, March 5, 2026

Gradle cheat sheet

Some miscellaneous notes I found useful.

Building

Build with neither tests nor RAT:

gradle build -x rat -x test

List the projects with:

./gradlew projects

Viewing Dependencies

If you want to see the dependencies of a project, use:

./gradlew :PROJECT_NAME:dependencies [--configuration runtimeClasspath|compileClasspath]

where PROJECT_NAME is what you get from listing the projects.

To examine a particular dependency:

./gradlew  :polaris-server:dependencyInsight --dependency protobuf-java --configuration runtimeClasspath

Defining Dependencies

Typically, you'll find a libs.versions.toml that defines libraries but does not by itself include them. What Gradle does do is autogenerate a Java class that becomes the libs object in your build.gradle.kts files, so you can reference the entries in the .toml file as Java code.

Tests

If you want to run a single test:

./gradlew :PROJECT_NAME:test --tests FQN

Tasks

View tasks for a particular project with, for example:

./gradlew :polaris-config-docs-site:tasks

Thursday, February 19, 2026

An unruly Terraform

If the Terraform state is out of sync with reality, you might need to change that state manually with something like:

tofu state list

followed by

tofu state rm XXX

I had to delete load balancers manually through the AWS Web Console and then also the EKS instance. I then had to manually delete any references to them from my JSON.

Tip: regularly delete the directory in which the Terraform lives, as state accumulates there that the next run implicitly relies upon. If you don't, then after a major refactor the configuration runs and everything looks fine; you check in thinking you've done a good job, but there was an invisible dependency on the previous run, and checking out to a fresh directory fails. So:

Delete all files regularly

I was getting lots of:

│ Error: Get "https://21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com/api/v1/namespaces/default/services/jupyter-lb": dial tcp: lookup 21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com on 127.0.0.1:53: no such host
 

even after blatting my Terraform cdktf.out/stacks directory. Turns out state files were accumulating in the root directory of my project (which contained cdktf.out). Once they too were blatted, things looked better.

Changing the cdk.tf.json file resulted in:

│ Error: Inconsistent dependency lock file
│ 
│ The following dependency selections recorded in the lock file are inconsistent with the current configuration:
│   - provider registry.opentofu.org/hashicorp/helm: required by this configuration but no version is selected
│ 
│ To update the locked dependency selections to match a changed configuration, run:
│   tofu init -upgrade

The solution was to run tofu init -upgrade

GCP

You might see this error when running Terraform on GCP:

│ Error: Error setting access_token
│ 
│   with data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C,
│   on cdk.tf.json line 25, in data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C:
│   25:       }
│ 
│ oauth2: "invalid_grant" "reauth related error (invalid_rapt)" "https://support.google.com/a/answer/9368756"

It's nothing really to do with TF but rather your GCP credentials. Login with gcloud auth application-default login and try again. D'oh.

AWS

aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=$VPC --region $REGION

aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-security-groups --filters Name=vpc-id,Values=$VPC --region $REGION

This last one showed 3 security groups.

The reason these AWS entities lingered is that my tofu destroy always hung. And the reason it never finished is that there were finalizers preventing it. To avoid this, I needed to run:

kubectl patch installation default -p '{"metadata":{"finalizers":[]}}' --type=merge
kubectl patch service YOUR_LOAD_BALANCER -p '{"metadata":{"finalizers":null}}'  --type=merge

Also, CRDs need to be destroyed:

for CRD in $(kubectl get crds | awk '{print $1}') ; do {
    kubectl patch crd $CRD --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
    kubectl delete crd $CRD --force
} done

I would then run these scripts via a local-exec provisioner in a resource.

I asked on the DevOps Discord server how normal this was:
PhillHenry
I'm using Terraform to manage my AWS stack that (amongst other things) creates a load balancer using an aws-load-balancer-controller. I'm finding destroying the stack just hangs then times out after 20 minutes.

I've had to introduce bash scripts that patch finalizers in services and installations plus force delete CRDs. Finally, tofu detroy cleans everything up but I can't help feeling I'm doing it all wrong by having to add hacks.

Is this normal? If not, can somebody point me in the right direction over what I'm doing wrong? 
snuufix
It is normal with buggy providers, it's just sad that even AWS is one.
It appears I am not the only one:
The_Ketchup, CJO
This is mainly for my homelab to teardown when Im done for the day. So when the aws ingress controller makes an LB via K8s, terraform doesnt know about it so I have to manually go in and delete it in the aws console. Its not very clean. So I was thinking maybe if its managed under argocd it will know about it and delete it? Idk its kinda confusing. Maybe I jsut do kubectl delete ingress --all or something and THEN do terraform destroy?
Cuz right now it just wont delete my subnets since theres an LB in there when I do terraform destroy

Darkwind The Dark Duck
U could use AWS Nuke to clean anything remaining 😄
Redeploying

Redeploying a component was simply a matter of running:

tofu apply -replace=kubernetes_manifest.sparkConnectManifest  --auto-approve

This is a great way to redeploy just my Spark Connect pod when I've changed the config.

Helm

If you want to find out what version of a Helm chart you're using when you forget to set it, this might help. It's where Helm caches the charts it downloads.

$ ls -ltr ~/.cache/helm/repository/
...
-rw-r--r-- 1 henryp henryp 107929 Nov 12 10:41 spark-kubernetes-operator-1.3.0.tgz
-rw-r--r-- 1 henryp henryp 317214 Nov 20 15:56 eks-index.yaml
-rw-r--r-- 1 henryp henryp    433 Nov 20 15:56 eks-charts.txt
-rw-r--r-- 1 henryp henryp  36607 Nov 24 09:15 aws-load-balancer-controller-1.15.0.tgz
-rw-r--r-- 1 henryp henryp 493108 Dec 11 14:47 kube-prometheus-stack-51.8.0.tgz
-rw-r--r-- 1 henryp henryp  38337 Dec 15 12:20 aws-load-balancer-controller-1.16.0.tgz
...

More Kubernetes notes

Networking

This was an interesting online conversation about how network packets sent to the cluster's IP address are redirected to a pod in a Kubernetes cluster.

Each pod has its own IP which is managed by the Container Networking Interface (CNI). Every node runs a kube-proxy, which manages how cluster IPs map to pod IPs. The pod IP mappings are updated dynamically and only include pods passing their health checks.

The node receiving the TCP request forwards it to the destination pod, but the mechanism depends on the CNI. In cloud environments like AWS and GCP, the CNI just sends it directly out onto the network, and the network itself knows the pod IPs and takes care of it. This is so-called VPC Native networking. 

Some CNIs have no knowledge of the existing network, they run an overlay inside the cluster that manages the transport and typically that's done with IP encapsulation and sending the encapsulated packet to the destination node. 

In VPC Native networking, your node just sends packets to the destination pod like a regular packet. The pods are fully understood and routable by the network itself. 

It works differently on-prem, and it depends on your CNI. In an on-prem network, using most other CNIs, including microk8s which uses Calico, the network doesn't know anything about pod IPs. Calico sets up an overlay network which mimics a separate network to handle pod-to-pod communication.

In VPC Native networking, things that are outside your Kubernetes cluster can communicate directly with k8s pods. GCP actually supports this, while AWS uses security groups to block it by default (but you can enable it). In overlay CNIs like Calico or Flannel, you have to be inside the cluster to talk to pods in the cluster.

[hubt - Discord]
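The health-check filtering described above amounts to something like this toy model (real kube-proxy programs iptables/IPVS rules rather than running code per packet; the endpoint data here is made up):

```python
import itertools

def ready_backends(endpoints: list) -> list:
    """kube-proxy only load-balances across pods passing their readiness checks."""
    return [e["ip"] for e in endpoints if e["ready"]]

endpoints = [
    {"ip": "10.2.1.4", "ready": True},
    {"ip": "10.2.1.5", "ready": False},  # failing its health check: excluded
    {"ip": "10.2.1.6", "ready": True},
]

rr = itertools.cycle(ready_backends(endpoints))  # a crude round-robin
print([next(rr) for _ in range(4)])  # ['10.2.1.4', '10.2.1.6', '10.2.1.4', '10.2.1.6']
```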

Debugging

This proved very useful when a pod suddenly died:

kubectl get events -A --sort-by=.lastTimestamp | grep -i POD_NAME

Turned out:

The node was low on resource: ephemeral-storage. Threshold quantity: 2139512454, available: 1826840Ki. Container spark-kubernetes-driver was using 1551680Ki, request is 0, has larger consumption of ephemeral-storage.

This is about the best way to see what happened to a pod once it dies.

Saturday, February 14, 2026

GPU Programming pt 1.

This is a nice Java implementation of various LLMs that can also defer to TornadoVM to process the maths on a GPU. Looking at the code gives a good idea of the impedance mismatch between CPU and GPU programming as the Java code covers both. Here are some miscellaneous notes.

GPU Terminology

In TornadoVM's KernelContext you'll see various references to Global, Group and Local IDs.

The global ID is the overall ID of the thread. Note that it can be virtualized - that is, it can be larger than the number of threads your hardware supports.

The group ID refers to a logical grouping of threads. Note that a warp is a physical (ie, hardware dependent) grouping of threads. Work groups are made of integer multiples of warps. Warps always have 32 threads in NVidia hardware and process in lockstep. Work groups however can execute their warps asynchronously.

If a warp hits an if/else statement, the branches are executed sequentially and you lose parallelism!

Threads in a work group can share memory. The local ID is the ID of a thread within that group.

GPU algorithms

Writing algorithms is different in the world of GPUs. This Java/TornadoVM code that is turned into GPU code:

context.localBarrier();

for (int stride = localWorkGroupSize / 2; stride > 0; stride >>= 1) {
    if (localId < stride) {
        localSum[localId] += localSum[localId + stride];
    }
    context.localBarrier();
}

is actually reducing an array to an int using the GPU's threads. It starts with half the warp of 32 threads: 16 threads, then 8, then 4, and so on; each thread takes two elements of the array and adds them. Only half the number of threads are needed on the next iteration, and so on. All the other threads are "masked", that is, not used.
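The same stride-halving loop, transcribed into plain Python so it can be stepped through on a CPU (the sequential inner loop stands in for the threads that would run in parallel between barriers):

```python
def workgroup_reduce(local_sum: list) -> int:
    """Mirror of the TornadoVM loop: on each pass, the lower half of the
    threads adds the upper half's element onto its own. Power-of-two size assumed."""
    stride = len(local_sum) // 2
    while stride > 0:
        for local_id in range(stride):  # threads with local_id >= stride are masked
            local_sum[local_id] += local_sum[local_id + stride]
        # running the "threads" sequentially makes context.localBarrier() implicit
        stride >>= 1
    return local_sum[0]

data = list(range(32))          # one warp's worth of partial sums
print(workgroup_reduce(data))   # 496
```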

Mapping TornadoVM to GPU concepts

There is a line in fusedQKVMatmulX that basically says:

if (context.globalId < input_array_size) ...

Yeah, but what if the maximum globalId (the actual thread ID) is much lower than the array size? Do we ignore the rest of the array? 

The answer is no because globalId is a virtual ID and does not represent the physical limits of your hardware. As it happens, my RTX 4070 (Laptop) has 4608 CUDA cores whereas the model I am running (Llama-3.2-1B-Instruct-F16) has a hidden size of 4096 so it seems like it all fits into memory without resorting to virtualization tricks.

The functions above don't generally have loops in them because the loop is implicit: each GPU thread calls the function once.
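That implicit loop plus the bounds guard can be emulated in a few lines; launch and square_kernel are made-up names for illustration:

```python
def launch(kernel, n_elements: int, work_group_size: int = 32):
    """Emulate a 1-D grid launch: the grid is rounded up to a whole number of
    work groups, so some global IDs fall past the end of the data."""
    n_groups = -(-n_elements // work_group_size)  # ceiling division
    for global_id in range(n_groups * work_group_size):
        kernel(global_id)

out = []
def square_kernel(global_id, xs=[3, 1, 4, 1, 5]):
    if global_id < len(xs):  # the guard, as in fusedQKVMatmulX
        out.append(xs[global_id] ** 2)

launch(square_kernel, 5, work_group_size=4)  # 2 groups => global IDs 0..7
print(out)  # [9, 1, 16, 1, 25]
```

Global IDs 5, 6 and 7 run the kernel but do nothing, which is exactly what the guard is for.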

Graal- and TornadoVM

Note that TornadoVM heavily relies on GraalVM. If you look at the stack, you'll see the code in PTXCompiler.emitFrontEnd appears to exploit GraalVM's ability to examine the bytecode of the functions mentioned above. It does this so it can convert them into CUDA code.

Consequently, you'll never see any breakpoints hit in these TransformerComputeKernelsLayered functions.

Wednesday, February 11, 2026

My Polaris PR

I've had some issues with a federated (a.k.a. external) catalog in Polaris connecting to GCP, so I raised this ticket outlining the problem.

Having a bit of time to implement it, I've raised a PR. The first thing I had to do was get familiar with:

The Polaris Architecture

Note the DTOs are automatically generated (see spec/polaris-management-service.yml). See client/python/spec/README.md for full instructions, but running:

redocly bundle spec/polaris-catalog-service.yaml -o spec/generated/bundled-polaris-catalog-service.yaml

brings all the YAML together.

The reason for doing it this way is to generate both Python (with make client-regenerate) and Java (with ./gradlew :polaris-api-management-model:openApiGenerate) code that is in lockstep with the spec.

So, the DTOs are auto-generated but the DPOs are hand-coded. This is because DPOs are internal whereas DTOs are client facing, and that client could be Java, Python or something else.

After making the change, then it's:

./gradlew assemble -x test && ./gradlew publishToMavenLocal -x test

to push it to the awaiting code in my project.

Git

Then as I make my changes, I keep pulling from original repo with:

git pull https://github.com/apache/polaris.git main --rebase

The --rebase at the end is saying "make my branch exactly the same as the original repo then add my deltas on to it at the end of its history."

Following the Polaris instructions, I noticed that my origin was the Polaris Git repo (see this with git remote -v). I actually found it easier to run:

git remote set-url origin https://github.com/PhillHenry/polaris.git
git remote add upstream  https://github.com/apache/polaris
git push --force-with-lease origin 3451_federated_google_auth # this is the branch

to push my changes (and any from Apache) to my own branch.

Now, with:

$ git remote -v
origin  https://github.com/PhillHenry/polaris.git (fetch)
origin  https://github.com/PhillHenry/polaris.git (push)
upstream        https://github.com/apache/polaris (fetch)
upstream        https://github.com/apache/polaris (push)

I can keep my repo in sync with the original and ensure that my changes are always the last commits in the history with:

git fetch upstream
git fetch origin
git rebase upstream/main

as rebase flattens the history graph and rewrites the commit hashes (though not the content of the commits).

To squash the commits, run:

git config --global core.editor "vim" # I prefer vim to emacs
git rebase -i HASH_OF_LAST_COMMIT_THAT_IS_NOT_YOURS

then edit the file such that the top line starts with pick and the subsequent commits begin with squash. Save it, and you'll be prompted to write another file. Put the final, informative commit message here. Save it too, then push.

If you get into a pickle,

git reset --hard
rm -fr ".git/rebase-merge"

gets you back to where you were.

Once you're happy

Don't forget to 

./gradlew spotlessApply

Note, this will change the files on disk. Also, run:

./gradlew  build -x rat

For my 24 core Intel Ultra 9 185H:

BUILD SUCCESSFUL in 23m 52s

so, I don't want to do this too often...

Debugging

Polaris is heavily dependent on Quarkus which was throwing an HTTP 400 according to the logs but gave no further information. So, it's good at this point to put a breakpoint in org.jboss.resteasy.reactive.server.handlers.RequestDeserializeHandler as I suspected that it was related to my new DTOs. 

Google

Google by default stops an account from impersonating itself. 

So, to mitigate this in my integration tests, I've created two service accounts - one that my Polaris always runs as and the second to pretend to be the account that manages access to the external catalog. You get the Polaris SA to impersonate the external SA with:

gcloud iam service-accounts add-iam-policy-binding EXTERNAL_SA@PROJECT_ID.iam.gserviceaccount.com --member="serviceAccount:POLARIS_SA@PROJECT_ID.iam.gserviceaccount.com"  --role="roles/iam.serviceAccountTokenCreator"

An unexpected regression

Almost there, I came across this unexpected error:

2026-02-09 09:38:11,924 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] ... java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.iam.credentials.v1.stub.GrpcIamCredentialsStub
        at com.google.cloud.iam.credentials.v1.stub.IamCredentialsStubSettings.createStub(IamCredentialsStubSettings.java:145)       

The error was deep in some class initialization so I added this code:

      try {
          ProtoUtils.marshaller(GenerateAccessTokenRequest.getDefaultInstance());
      } catch (Throwable t) {
          if (t.getCause() != null) {  // guard against an NPE when there is no cause
              t.getCause().printStackTrace();
          }
          LOGGER.error("Failed to create IAM credentials stub", t);
      }

which gave:

Caused by: com.google.protobuf.RuntimeVersion$ProtobufRuntimeVersionException: Detected incompatible Protobuf Gencode/Runtime versions when loading GenerateAccessTokenRequest: gencode 4.33.2, runtime 4.32.1. Runtime version cannot be older than the linked gencode version.
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersionImpl(RuntimeVersion.java:120)
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersion(RuntimeVersion.java:68)
        at com.google.cloud.iam.credentials.v1.GenerateAccessTokenRequest.<clinit>(GenerateAccessTokenRequest.java:32)
        ... 77 more
com.google.protobuf.RuntimeVersion$ProtobufRuntimeVersionException: Detected incompatible Protobuf Gencode/Runtime versions when loading GenerateAccessTokenRequest: gencode 4.33.2, runtime 4.32.1. Runtime version cannot be older than the linked gencode version.
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersionImpl(RuntimeVersion.java:120)
        at com.google.protobuf.RuntimeVersion.validateProtobufGencodeVersion(RuntimeVersion.java:68)
        at com.google.cloud.iam.credentials.v1.GenerateAccessTokenRequest.<clinit>(GenerateAccessTokenRequest.java:32)
        at org.apache.polaris.core.storage.gcp.GcpCredentialsStorageIntegration.createIamCredentialsClient(GcpCredentialsStorageIntegration.java:287)

Urgh. It appears that GenerateAccessTokenRequest (which is itself @com.google.protobuf.Generated) in JAR proto-google-cloud-iamcredentials-v1:2.83.0 says in its static initializer that it is associated with protobuf version 4.33.2. Meanwhile, RuntimeVersion in JAR protobuf-java:4.32.1 checks this against itself and obviously fails it.
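The check that fails here is, in spirit, just a tuple comparison: the runtime must be at least as new as the version the generated code was built against. A Python sketch of the rule (not protobuf's actual code):

```python
def parse(v: str) -> tuple:
    """Split "4.32.1" into a comparable (4, 32, 1) tuple."""
    return tuple(int(p) for p in v.split("."))

def validate(gencode: str, runtime: str) -> None:
    """The spirit of protobuf's RuntimeVersion check: generated code may not
    be newer than the runtime it is linked against."""
    if parse(runtime) < parse(gencode):
        raise RuntimeError(
            f"gencode {gencode}, runtime {runtime}. "
            "Runtime version cannot be older than the linked gencode version.")

validate("4.32.1", "4.32.1")   # same version: fine
# validate("4.33.2", "4.32.1") would raise, as in the stack trace above
```

The fix is correspondingly mundane: align the protobuf-java runtime with (or ahead of) the version the gencode JAR was built against.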