Agile Java Man: 2025

Wednesday, December 31, 2025

Debugging JNI calls to the GPU

I'm playing aroung with a Java based LLM (code here). When running a JVM that calls the GPU using the TornadoVM, it crashed and in the log, I saw:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

C [libcuda.so.1+0x1302b0]

C [libcuda.so.1+0x332420]

C [libtornado-ptx.so+0x64b8] Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198

j uk.ac.manchester.tornado.drivers.ptx.PTXStream.cuLaunchKernel([BLjava/lang/String;IIIIIIJ[B[B)[[B+0 tornado.drivers.ptx@2.2.1-dev

...

Now, finding the Shared Object files (*.so), I called:

objdump -d /usr/lib/x86_64-linux-gnu/libcuda.so.1

objdump -d /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so

and looked at the addresses in the stack dump.

First, libtornado-ptx.so. Note that the address (0x64b8) is the return address from a call, that is the next line after the call that went Pete Tong.

64b3: e8 b8 e1 ff ff call 4670 <cuLaunchKernel@plt>

64b8: 48 83 c4 30 add $0x30,%rsp

So, it's the call to cuLaunchKernel that is interesting.

33241b: e8 00 de df ff call 130220 <exit@plt+0x4e460>

332420: 5a pop %rdx

and the final (top most) stack frame:

1302ab: 4d 85 e4 test %r12,%r12

1302ae: 74 58 je 130308 <exit@plt+0x4e548>

1302b0: 41 8b 04 24 mov (%r12),%eax

The instruction test %x,%y is a common idiom in null checks (basically, it's x and y are ANDed and the je jumps if the Zero Flag is set - note that this flag is set if the result of the AND is non-zero or both x and y are zero).

So, it looks like we've essentially got what's equivalent to a NullPointerException in the machine code. Still looking at what's null... [Solved: had to use a model that is compatible with GPULlama3.java)

Monday, December 15, 2025

AWS cheatsheet

Various command lines that have helped me recently.

IAM

List the role's attached, inline and assumed (trust) policies with:

aws iam list-attached-role-policies --role-name $ROLE_NAME

aws iam list-role-policies --role-name $ROLE_NAME

Whoami with:

aws sts get-caller-identity

Policies are a collection of actions and services that can be assigned. List all homemade policies with:

aws iam list-policies --scope Local --query 'Policies[].Arn' --output table

Similarly, list all roles with:

aws iam list-roles --query 'Roles[].RoleName' --output table

List all the Actions for a policy with:

aws iam get-policy-version --policy-arn $POLICY_ARN --version-id $(aws iam get-policy --policy-arn $POLICY_ARN --query 'Policy.DefaultVersionId' --output text) --query 'PolicyVersion.Document.Statement[].Action' --output json | jq -r '.[]' | sort -u

List all the trust policies for a given role:

aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json

Note that assuming a role implies some temporary elevation of privileges while attaching a role is more about defining what a role actually is.

List everything attached to a policy:

aws iam list-entities-for-policy --policy-arn $POLICY_ARN

Instance profiles contain roles. They act as a bridge to securely pass an IAM role to an EC2 instance, enabling the instance to access other AWS services without needing to store long-term, hard-coded credentials like access keys. You can see them with:

aws iam list-instance-profiles-for-role --query 'AttachedPolicies[*].PolicyArn' --role-name $ROLE_NAME --query "InstanceProfiles[].InstanceProfileName" --output text

In short:

Trust policies say who can access a role.
Permission policies say what a role can do.

Note that this is why trust polices have only one action: sts:AssumeRole.

Secrets

See access to K8s secrets with:

kubectl logs -n kube-system -l app=csi-secrets-store-provider-aws -XXX

See an AWS secret with:

aws secretsmanager get-secret-value --secret-id $SECRET_ARN --region $REGION

Deleting them is interesting as they will linger unless told otherwise:

aws --region $REGION secretsmanager delete-secret --secret-id $SECRET_NAME --force-delete-without-recovery

Infra

To see why your EKS deployments aren't working:

kubectl get events --sort-by=.metadata.creationTimestamp | tail -20

Terraform seems to have a problem deleting load balancers in AWS. You can see them with:

aws elbv2 describe-load-balancers

List the load balancers:

aws elb describe-load-balancers --region $REGION

List the VPCs:

aws ec2 describe-vpcs --region $REGION

Glue

Create with:

aws glue create-database --database-input '{"Name": "YOUR_DB_NAME"}' --region $REGION

Create an Iceberg table with:

aws glue create-table \

--database-name YOUR_DB_NAME \

--table-input '

{

"Name": "TABLE_NAME",

"TableType": "EXTERNAL_TABLE",

"StorageDescriptor": {

"Location": "s3://ROOT_DIRECTORY_OF_TABLE/",

"Columns": [

{ "Name": "id", "Type": "int" },

...

{ "Name": "randomInt", "Type": "int" }

"SerdeInfo": {

"SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"

}

"Parameters": {

"iceberg.table.default.namespace": "YOUR_DB_NAME"

}

}' \

--open-table-format-input '

{

"IcebergInput": {

"MetadataOperation": "CREATE",

"Version": "2"

}

}' \

--region $REGION

Get all the databases with:

aws glue get-databases --query 'DatabaseList[*].Name' --output table

Get tables with:

aws glue get-tables --database-name YOUR_DB_NAME

Drop with:

aws glue delete-table --name TABLE_NAME --database-name YOUR_DB_NAME

Monday, November 24, 2025

AWS and HTTPs

This proved surprisingly hard but the take-away points are:

A Kubernetes service account must be given permissioned to access AWS infra structure
The Kubernetes cluster needs AWS specific pods to configure the K8s ingress such that it receives traffic from outside the cloud
The ingress is where the SSL de/encryption is performed.
Creating the certificate is easy when using the AWS web console and there you just associate it with the domain name.

The recipe

The following steps assume you have an ingress and a service already up and running. I did the mapping between the two in Terraform. What follows below is how to allow these K8s primitives to use AWS so they can be contacted by the outside world.

You need to associate an OpenID provider with the cluster and create a Kubernetes service account that is permissioned to use the AWS load balancer. Note that lines that are predominantly Kubernetes are blue and AWS lines are red.

eksctl utils associate-iam-oidc-provider --cluster $CLUSTERNAME --approve

curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json

aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy --policy-document file://iam_policy.json

eksctl create iamserviceaccount --cluster $CLUSTERNAME --namespace kube-system --name aws-load-balancer-controller --attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy --approve

kubectl describe sa aws-load-balancer-controller -n kube-system # check it's there

Then you need to configure Kubernetes to use the AWS load balancer.

helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=$CLUSTERNAME --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

However, I could see my replicasets were failing when I ran:

kubectl get rs -A

with something like:

Type Reason Age From Message

---- ------ ---- ---- -------

Warning FailedCreate 67s (x15 over 2m29s) replicaset-controller Error creating: pods "aws-load-balancer-controller-68f465f899-" is forbidden: error looking up service account kube-system/aws-load-balancer-controller: serviceaccount "aws-load-balancer-controller" not found

and there are no load balancer pods.

So, it seemed I need to:

kubectl apply -f aws-lbc-serviceaccount.yaml

where aws-lbc-serviceaccount.yaml is:

apiVersion: v1

kind: ServiceAccount

metadata:

name: aws-load-balancer-controller

namespace: kube-system

annotations:

eks.amazonaws.com/role-arn: arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole

labels:

app.kubernetes.io/component: controller

app.kubernetes.io/name: aws-load-balancer-controller

app.kubernetes.io/instance: aws-load-balancer-controller

The pods were now starting but quickly failing with errors like:

{"level":"error","ts":"2025-11-21T17:31:22Z","logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}

We can set it with:

helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller --namespace kube-system --set clusterName=$CLUSTE_NAME --set vpcId=$VPC_ID --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

and now the pods are running.

However, my domain name was still not resolving. So, run this to get the OIDC (OpenID Connector) issuer:

aws eks describe-cluster --name $CLUSTERNAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///"

Note that this value changes every time the cluster is created.

Then run:

aws iam create-role \

--role-name ${IAM_ROLE_NAME} \

--assume-role-policy-document file://lbc-trust-policy.json

where lbc-trust-policy.json is:

{

"Version": "2012-10-17",

"Statement": [

{

"Effect": "Allow",

"Principal": {

"Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_ISSUER"

"Action": "sts:AssumeRoleWithWebIdentity",

"Condition": {

"StringEquals": {

"OIDC_ISSUER:aud": "sts.amazonaws.com",

"OIDC_ISSUER:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"

}

]

}

Get the ARN of that policy:

POLICY_ARN=$(aws iam list-policies --scope Local --query "Policies[?PolicyName=='AWSLoadBalancerControllerIAMPolicy'].Arn" --output text)

Create the role:

aws iam create-role --role-name AmazonEKSLoadBalancerControllerRole --assume-role-policy-document file://lbc-trust-policy.json

attach the policy:

aws iam attach-role-policy --role-name AmazonEKSLoadBalancerControllerRole --policy-arn ${POLICY_ARN}

then inform the cluster:

kubectl annotate serviceaccount aws-load-balancer-controller -n kube-system eks.amazonaws.com/role-arn="arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole" --overwrite

If you're logging the aws-load-balancer-controller-XXX pod, youll see it register this change if you restart the ingress with:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system

then check its status with:

kubectl describe ingress $INGRESS_NAME

Note the ADDRESS. It will be of the form k8s-XXX.REGION.elb.amazonaws.com. Let's define it as:

INGRESS_HOSTNAME=$(kubectl get ingress $INGRESS_NAME -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Find the HostedZoneId of your load balancer with:

aws elbv2 describe-load-balancers --query "LoadBalancers[?DNSName=='INGRESS_HOSTNAME'].CanonicalHostedZoneId" --output text --region $REGION

Registering domain in Route 53

Create the A type DNS entry with:

HOSTED_ZONE_ID=$(aws route53 list-hosted-zones-by-name \

--dns-name $FQDN \

--query "HostedZones[0].Id" --output text | awk -F'/' '{print $3}')

aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch file://route53_change.json

where route53_change.json is:

{

"Comment": "ALIAS record for EKS ALB Ingress",

"Changes": [

{

"Action": "UPSERT",

"ResourceRecordSet": {

"Name": "FQDN",

"Type": "A",

"AliasTarget": {

"HostedZoneId": "YOUR_HOST_ZONE_ID",

"DNSName": "INGRESS_HOSTNAME",

"EvaluateTargetHealth": false

}

]

}

After a few minutes, you'll see that IP address of the domain name and INGRESS_HOSTNAME are the same.

You can create your own hosted zone with:

aws route53 create-hosted-zone --name "polarishttps.emryspolaris.click" --caller-reference "$(date +%Y-%m-%d-%H-%M-%S)"

but this can lead to complications.

"Public-hosted zones have a route to internet-facing resources and resolve from the internet using global routing policies. Meanwhile, private hosted zones have a route to VPC resources and resolve from inside the VPC." - AWS for Solution Architects, O'Reilly

Certificate

We've now linked the domain name to an endpoint. Now we need to create a certificate. I did this through the AWS web console and after just a few clicks, it gave me the ARN.

You might need to wait a few minutes for it to become live but you can see the status of a certificate with:

aws acm describe-certificate --certificate-arn "$CERT_ARN" --region eu-west-1 --query "Certificate.Status" --output text

Tuesday, November 18, 2025

Snowflake and AWS

This is how you get Snowflake to talk to your AWS real estate. Before we start, get your AWS account ID with:

aws sts get-caller-identity

This will be used as your GLUE_CATALOG_ID (see below).

Now, you need to create in Snowflake a volume like this:

CREATE OR REPLACE EXTERNAL VOLUME YOUR_VOLUME_NAME

STORAGE_LOCATIONS = (

( NAME = 'eu-west-2'

STORAGE_PROVIDER = 'S3'

STORAGE_BASE_URL = 's3://ROOT_DIRECTORY_OF_TABLE/'

STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'

)

ALLOW_WRITES = FALSE;

You run this even though you have yet to create the role. Then run:

select system$verify_external_volume('YOUR_VOLUME_NAME');

This will give you some JSON that includes a STORAGE_AWS_IAM_USER_ARN. You never create this user. Snowflake does it itself. Its ARN is what you need to create a role in AWS that allows Snowflake's user to see data.

You create a role was created with an ordinary aws iam create-role --role-name S3ReadWriteRoleSF --assume-role-policy-document... using the ARN that we got from Snowflake, above. That is, our Snowflake instance has its own AWS user and you must give that user access to your real estate.

Now, give Snowflake access to your cloud assets with:

aws iam put-role-policy --role-name ROLE_NAME --policy-name GlueReadAccess --policy-document file://glue-read-policy.json

Where glue-read-policy.json just contains the Actions needed to talk to Glue.

Finally, we create the Glue catalog (note that this is not a REST catalog like Polaris) but Glue:

CREATE OR REPLACE CATALOG INTEGRATION CATALOG_NAME

CATALOG_SOURCE = GLUE

TABLE_FORMAT = ICEBERG

CATALOG_NAMESPACE = 'YOUR_DB_NAME'

GLUE_CATALOG_ID = 'GLUE_CATALOG_ID'

GLUE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'

GLUE_REGION = 'eu-west-2'

ENABLED = TRUE;

Now you bring all these threads together when you create a table with:

CREATE ICEBERG TABLE arbitrary_name

EXTERNAL_VOLUME = 'YOUR_VOLUME_NAME'

CATALOG = 'CATALOG_NAME'

CATALOG_TABLE_NAME = 'TABLE_NAME';

Create a REST catalog with:

CREATE OR REPLACE CATALOG INTEGRATION polaris_int

CATALOG_SOURCE = POLARIS

TABLE_FORMAT = ICEBERG

REST_CONFIG = (

CATALOG_URI = 'https://YOUR_HOST:8181/api/catalog/',

CATALOG_NAME = 'YOUR_CATALOG_NAME_IN_POLARIS'

)

REST_AUTHENTICATION = (

TYPE = BEARER

BEARER_TOKEN = 'TOKEN'

)

ENABLED = TRUE;

Note that the URI must be talking HTTPS not HTTP and the TOKEN is from Polaris.

Saturday, November 15, 2025

Debugging Google Cloud Kubernetes

A problem I was having when spinning up a K8s cluster and then trying to deploy my own Polaris was that the pod stuck in the Pending state. A quick kubectl describe pod gave the last event as "Pod didn't trigger scale-up:"

So, let's look at the events (a.k.a operations):

gcloud container operations list --project $PROJECT

Then to drill down on the operation of interest:

gcloud container operations describe operation-XXX --region $REGION --project $PROJECT

It seemed pretty quiet. The last two events were:

CREATE_CLUSTER began at 16:35:38 and ran to 16:41:37
DELETE_NODE_POOL started at 16:41:41 and ran to 16:46:02

So, that delete came hot on the heals of the cluster successfully being created. I looked at the logs with:

gcloud logging read "resource.labels.cluster_name=spark-cluster AND timestamp>=\"2025-11-14T16:41:35Z\" AND timestamp<=\"2025-11-14T16:41:42Z\"" --project=$PROJECT --limit 10 --order=desc

and one of these logs looked like this:

requestMetadata:

callerSuppliedUserAgent: google-api-go-client/0.5 Terraform/1.10.7 (+https://www.terraform.io)

Terraform-Plugin-SDK/2.36.0 terraform-provider-google/dev6,gzip(gfe)

...

response:

operationType: DELETE_NODE_POOL

This was saying that the DELETE_NODE_POOL originated from my own Terraform-Plugin-SDK! And the reason for that was my Terraform had:

"remove_default_node_pool": true

It did this because it then tried to create its own node pool. However, it seems that having 2 node pools at once exhausted the GCP quotas. My node failed to start but TF merrily went ahead and continued to delete the default pool.

You can see quotas with:

gcloud compute regions describe $REGION

and node pools with:

gcloud container node-pools describe default-pool --cluster $CLUSTER_NAME --region $REGION --project $PROJECT

Wednesday, November 5, 2025

Spark Operator

I've found that managing Spark clusters in Kubernetes is far easier using the Spark Operator. Here are some commands that helped me diagnose issues.

Dude, where's my appliction?

List your Spark applications with:

kubectl get sparkapplications

kubectl get sparkapplications spark-connector -o yaml

to see what might be causing trouble for, say, the connector.

It can be annoying when you can't delete a sparkapplication with

kubectl delete sparkapplication YOUR_APP

even though it's running. In my case, I thought a

kubectl rollout restart deployment spark-kubernetes-operator

left an orphaned cluster.

It's possible that you don't see anything even though there are Spark pods clearly there. In this case:

kubectl describe pod POD_NAME

and you should see something like:

...

Controlled By: StatefulSet/XXX

...

Great, so it looks like the Spark Operator has set the cluster up by delegating to Kubernetes primitives. Let's see them:

kubectl get statefulsets

and then we can just:

kubectl delete statefulset XXX

OK, so, dude, where's my cluster

But we're barking up the wrong tree. The YAML to create a cluster has kind: SparkCluster so we're using the wrong CRD with sparkapplications.

kubectl get crd | grep spark

sparkclusters.spark.apache.org 2025-11-04T10:52:56Z

...

Right, so now:

kubectl delete sparkclusters YOUR_CLUSTER

Python

As a little aside, I was seeing strange errors when running PySpark commands that appeared to be a versioning problems. A few commands that came in useful were:

import sys

print(sys.path)

to print where the Python executable was getting its libraries from and:

from pyspark.version import __version__

print(__version__)

to make sure we really did have the correct PySpark version.

As it happened, it was the wrong version of the Iceberg runtime in spark.jars.packages.

Monday, November 3, 2025

AWS, Kubernetes and more

Setting up a 3 node Kubernetes cluster in AWS is as simple as:

eksctl create cluster --name $CLUSTERNAME --nodes 3

but this really hides a huge amount of what is going on. Apart from IAM, eksctl automatically creates:

a new Virtual Private Cloud (VPC) in which sit the K8s control plane and workers. A VPC is "a logically isolated and secure network environment that is separate from the rest of the AWS cloud" [1]
two public subnets and two private subnets (best practice if you want high availability). By putting worker nodes in the private subnet, they cannot be maliciously scanned from the internet.
all necessary NAT Gateways to allow the private subnets to access the internet
Internet Gateways allowing the internet to talk to your public subnets.
Route Tables which are just rules for network traffic. It's the "Routers use a route table to determine the best path for data packets to take between networks" [2]

You can see some details with:

$ eksctl get cluster --name=$CLUSTERNAME --region=$REGION

NAME VERSION STATUS CREATED VPC SUBNETS SECURITYGROUPS PROVIDER

spark-cluster 1.32 ACTIVE 2025-10-27T10:36:02Z vpc-REDACTED subnet-REDACTED,subnet-REDACTED,subnet-REDACTED,subnet-REDACTED,subnet-READACTED,subnet-REDACTED sg-REDACTED EKS

Terraform

If you use Terraform, you might need to configure your local kubectl to talk to the EKS cluster by hand.

First, back up your old config with:

mv ~/.kube ~/.kube_bk

then run:

aws eks update-kubeconfig --name $CLUSTERNAME --region $REGION

But if you are running aws via Docker, this will have updated ~/.kube/config in the container, not the host. So, run:

docker run --rm -it -v ~/.aws:/root/.aws -v ~/.kube:/root/.kube amazon/aws-cli eks update-kubeconfig --name $CLUSTERNAME --region $REGION

Now it will write to your host's config but even then you'll have to change the command at the end of the file to point to a non-Docker version (yes, you'll have to install the AWS binary - preferably in a bespoke directory so you can continue using the Docker version).

Another issue I had was the connection to the new EKS cluster was different to my ~/.kube/config. This in itself was not a problem as you can put in (using Java and CDKTF):

LocalExecProvisioner.builder()

.when("create") // Run only when the resource is created

.command(String.format(

"aws eks update-kubeconfig --name %s --region %s",

CLUSTER_NAME,

AWS_REGION)

)

.type("local-exec")

.build()

which depends on the EksCluster and the DataAwsEksClusterAuth and in turn, the failing resource are to depend on it.

However, this introduced other problems.

First, I tried to get the reading of ~/.kube/config to depends_on the EKS cluster. That way, I'd only read it once the cluster was up and running, right? Well, no. This introduces a circular dependency as it's read before the cluster is started.

Any fiddling with the dependency tree leads to reading ~/.kube/config when it's stale. So, you need to initialize the Kubernetes details (which appears to be global and otherwise implicit) directly with:

String base64CertData = cluster.getCertificateAuthority().get(0).getData();

String encodedCert = com.hashicorp.cdktf.Fn.base64decode(base64CertData);

KubernetesProvider kubernetesProvider = KubernetesProvider.Builder.create(this, "kubernetes")

.host(cluster.getEndpoint())

.clusterCaCertificate(encodedCert)

.token(eksAuthData.getToken()) // Dynamically generated token

.build();

Strangely, you still need to define the environment variable, KUBE_CONFIG_PATH as some resources need it, albeit after it has been correctly amended with the current cluster's details.

Zombie Clusters

Running:

tofu destroy -auto-approve

just kept hanging. So, I ran:

tofu state list | grep -E "(nat_gateway|eip|eks_cluster)"

and found some EKS components running that I had to kill with:

tofu destroy -auto-approve -target=...

Finally, kubectl get pods barfed with no such host.

Load balancers

The next problem was the tofu destroy action was constantly saying:

aws_subnet.publicSubnet2: Still destroying... [id=subnet-XXX, 11m50s elapsed]

So, I ran:

aws ec2 describe-network-interfaces \

--filters "Name=subnet-id,Values=subnet-XXX" \

--query "NetworkInterfaces[].[NetworkInterfaceId, Description, InterfaceType, Status]" \

--output table

and got an ENI that I tried to delete with:

aws ec2 delete-network-interface --network-interface-id eni-XXX

only to be told that it was still in use. Ho, hum:

$ aws ec2 describe-network-interfaces \

--network-interface-ids eni-XXX \

--query "NetworkInterfaces[0].{ID:NetworkInterfaceId, Description:Description, Status:Status, Attachment:Attachment}" \

--output json

...

"InstanceOwnerId": "amazon-elb",

...

So, let's see what that load balancer is:

$ aws elb describe-load-balancers \

--query "LoadBalancerDescriptions[?contains(Subnets, 'subnet-XXX')].[LoadBalancerName]" \

--output text

which gives me its name and now I can kill it with:

aws elb delete-load-balancer --load-balancer-name NAME

Finally, the destroy just wasn't working, failing ultimately with:

│ Error: deleting EC2 VPC (vpc-XXX): operation error EC2: DeleteVpc, https response error StatusCode: 400, RequestID: 8412a305-..., api error DependencyViolation: The vpc 'vpc-XXX' has dependencies and cannot be deleted.

Just going into the web console and deleting it there was the simple but curious solution.

[1] Architecting AWS with Terraform

[2] The Self-Taught Cloud Computing Engineer

Thursday, October 23, 2025

Exactly Once Semantics in Iceberg/Kafka

Iceberg has a Kafka Connect component that ensures exactly one semantics despite there being two transactions for the two systems (Kafka and Iceberg) per write.

So, how does is this done? The answer is simple: idemptotency. This code here that prepares the data to be committed but by delegating to this code in MergingSnapshotProducer, it ensures there are no dupes.

Note the Coordinator commits metadata to Iceberg.

The Coordinator and the Worker dance a tango. The steps go like this:

The Worker writes Iceberg data to storage. Having saved its SinkRecords to its RecordWriter, it then goes on to processControlEvents.

This involves polling the internal topic for Kafka ConsumerRecords. If this worker is the in the consumer group, and the event is a START_COMMIT, it sends to its internal topic all the ContentFiles that it was responsible for writing, wrapped in a DataWritten object. It also sends a DataComplete object with these Events all as part of a single Kafka transaction.

In turn, if the Coordinator receives a DataComplete object, it calls the code that idempotently writes to Iceberg mentioned above within Iceberg transactions. That is, if the ContentFiles wrapped in the DataWritten object are already in the metadata, they are essentially ignored.

The Coordinator can also trigger a commit if it deems it to have timed out.

The key thing is the worker only acknowledges the SinkRecords it read from the external Kafka topic as part of the same transaction it uses to send the Events to its internal Kafka topic. That way, if the worker crashes after writing to Iceberg storage, those SinkRecords will be read again from Kafka and written again to Iceberg storage. However, the Kafka metadata will be updated exactly once - at the potential cost of some orphan files, it seems.

[In addition, Coordinator sends a CommitToTable and CommitComplete objects to its internal Kafka topic in an unrelated transaction. This appears to be for completeness as I can't see what purpose it serves.]

It's the Coordinator that, in a loop, sent a StartCommit object onto the internal Kafka topic in the first place. It only does this having deemed the commit ready (see previous blog post).

And so the cycle is complete.

Monday, October 20, 2025

Spark and K8s on AWS

This is a high-level overview of my work creating a Spark cluster running in a managed AWS Kubernetes cluster (EKS) giving Spark the permissions to write to cross cloud storage. I might write further

EKS

Create a Kubernetes cluster with:

eksctl create cluster --name spark-cluster --nodes 3

Note that this can take about 15 minutes.

Note that you can have several K8s contexts on your laptop. You can see them with:

kubectl config get-contexts

and choose one with:

kubectl config use-context <your-cluster-context-name>

then you can run kubectl get pods -A and see K8s in the cloud.

A couple of caveats: I had to install aws-iam-authenticator by building it from scratch. This also required me to install GoLang. Binaries are installed in ~/go/bin.

Authorization

The hard thing about setting up a Kubernetes cluster in AWS is configuring permissions. This guide helped me but it was strictly the section Option 2: Using EKS Pod Identity that helped me.

Basically, you have to configure a bridge between the K8s cluster and AWS. This is done through through a CSI (Container Storage Interface), a standard K8s mechanism for storage but in this case it's a means for storing secrets.

The high-level recipe for using secrets is:

Create the secret with aws secretsmanager create-secret ...
Create a policy with aws iam create-policy... This reference the secret's ARNs in step 1.
Create a role with aws iam create-role... that allows a role to be assumed via STS.
Attach the policy in step 2 with the role in step 3 with aws iam attach-role-policy...
Create a SecretProviderClass with kubectl apply -f... that references the secrets created in step 1
Associate your K8s Deployment with the SecretProviderClass using its volumes.

Thanks to the installation of the CSI addons, secrets can be mounted in a Polaris container that are then used to vend credentials to Spark.

Diagnosing Spark in the Cluster

Connecting to Spark from my laptop produced these logs:

25/09/23 18:15:58 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...

25/09/23 18:15:58 INFO TransportClientFactory: Successfully created connection to a457XXX.eu-west-2.elb.amazonaws.com/3.9.78.227:7077 after 26 ms (0 ms spent in bootstraps)

25/09/23 18:16:18 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...

25/09/23 18:16:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...

where the connections were timing out. So I logged in with:

kubectl exec -it prod-master-0 -- /bin/bash

and looked at the actual logs under /opt/spark/ which were far more illuminating.

Spark Kubernetes Operator

I used the Spark operator to configure a Spark cluster for me. However, examples/prod-cluster-with-three-workers.yaml seemed to be out of synch with the CRDs installed by Helm(?). The apiVersion seems to need to be:

apiVersion: spark.apache.org/v1alpha1

This change meant I could then start my Spark cluster.

Aside: to restart the whole cluster, use:

kubectl delete -f spark_cluster.yaml

kubectl apply -f spark_cluster.yaml

AWS CLI annoyances

Running the aws command rather annoyingly adds control characters. I spent a frustrated hour wondering why the exact text rendered from its output could not be used as input for another command.

To remove the invisible control characters, run something like:

aws secretsmanager list-secrets --query "SecretList[?Name=='CloudDetails'].ARN" --output text --no-cli-pager | tr -d '[:cntrl:]' | tr -d '\n'

where in this example, I trying to find the ARN of a particular secret.

Wednesday, October 15, 2025

Configuring Polaris Part 1

To vend credentials, Polaris needs an AWS (or other cloud provider) account. But what if you want to talk to several AWS accounts? Well this ticket suggests an interesting workaround. It's saying "yeah, use just one AWS account but if you need to use others, set up a role that allows access to other AWS accounts, accounts outside the one that role lives in."

We are working in a cross cloud environment. We talk to not just AWS but GCP and Azure clouds. We happen to host Polaris in AWS but this choice was arbitrary. We can give Polaris the ability to vend credentials for all clouds no matter where it sits.

Integration with Spark

It's the spark.sql.catalog.YOUR_CATALOG.warehouse SparkConf value that identifies the Polaris catalog.

The YOUR_CATALOG defines the namespace. In fact, the top level value, spark.sql.catalog.YOUR_CATALOG, tells Spark which catalog to use (Hive, Polaris, etc).

So, basically, your config should look something like:

spark.sql.catalog.azure.oauth2.token POLARIS_ACCESS_TOKEN

spark.sql.catalog.azure.client_secret s3cr3t

spark.sql.catalog.azure.uri http://localhost:8181/api/catalog

spark.sql.catalog.azure.token POLARIS_ACCESS_TOKEN

spark.sql.catalog.azure.type rest

spark.sql.catalog.azure.scope PRINCIPAL_ROLE:ALL

spark.sql.catalog.azure.client_id root

spark.sql.catalog.azure.warehouse azure

spark.sql.catalog.azure.header.X-Iceberg-Access-Delegation vended-credentials

spark.sql.catalog.azure.credential root:s3cr3t

spark.sql.catalog.azure.cache-enabled false

spark.sql.catalog.azure.rest.auth.oauth2.scope PRINCIPAL_ROLE:ALL

spark.sql.catalog.azure org.apache.iceberg.spark.SparkCatalog

This is the config specific to my Azure catalog. AWS and GCP would have very similar config.

One small issue [GitHub] is that I needed the Iceberg runtime lib to be the first in the Maven dependencies.

Local Debugging

Put:

"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005",

in build.gradle.kts in the

tasks.named<QuarkusRun>("quarkusRun") {

jvmArgs =

listOf(

section then run with:

./gradlew --stop && ./gradlew run

then you'll then be able to remotely debug by attaching to port 5005.

Configuring Polaris in a remote environment

Note that Polaris heavily uses Quarkus. "Quarkus aggregates configuration properties from multiple sources, applying them in a specific order of precedence." [docs]. First, java -D... properties, environment variables, application.properties (first on the local filepath then in the dependencies) and finally the hard-coded values.

Polaris in integration tests

Naturally, you're going to want to write a suite of regression tests. This is where the wonderful TestContainers shines. You can fire up a Docker container of Polaris in Java code.

There are some configuration issues. AWS and Azure are easy to configure within Polaris. You must just pass them the credentials as environment variables. GCP is a little harder as it's expecting a file of JSON containing its credentials (the Application Default Credentials file). Fortunately, TestContainers allows you to copy that file over once the container has started running.

myContainer = new GenericContainer<>("apache/polaris:1.1.0-incubating")

// AWS

.withEnv("AWS_ACCESS_KEY_ID", AWS_ACCESS_KEY_ID)

.withEnv("AWS_SECRET_ACCESS_KEY", AWS_SECRET_ACCESS_KEY)

// Azure

.withEnv("AZURE_CLIENT_SECRET", AZURE_CLIENT_SECRET)

.withEnv("AZURE_CLIENT_ID", AZURE_CLIENT_ID)

.withEnv("AZURE_TENANT_ID", AZURE_TENANT_ID)

// Polaris

.withEnv("POLARIS_ID", POLARIS_ID)

.withEnv("POLARIS_SECRET", POLARIS_SECRET)

.withEnv("POLARIS_BOOTSTRAP_CREDENTIALS", format("POLARIS,%s,%s", POLARIS_ID, POLARIS_SECRET))

// GCP

.withEnv("GOOGLE_APPLICATION_CREDENTIALS", GOOGLE_FILE)

.waitingFor(Wait.forHttp("/q/health").forPort(8182).forStatusCode(200));

;

myContainer.setPortBindings(List.of("8181:8181", "8182:8182"));

myContainer.start();

myContainer.copyFileToContainer(Transferable.of(googleCreds.getBytes()), GOOGLE_FILE);

The other thing you want for a reliable suite of tests is to wait until Polaris starts. Fortunately, Polaris is cloud native and offers a health endpoint which TestContainers can poll.

Polaris in EKS

I found I had to mix both AWS's own library (software.amazon.awssdk:eks:2.34.6) with the official Kubernetes library (io.kubernetes:client-java:24.0.0) before I could interrogate the Kubernetes cluster in AWS from my laptop and look at the logs of the Polaris container.

EksClient eksClient = EksClient.builder()

.region(REGION)

.credentialsProvider(DefaultCredentialsProvider.create())

.build();

DescribeClusterResponse clusterInfo = eksClient.describeCluster(

DescribeClusterRequest.builder().name(clusterName).build());

AWSCredentials awsCredentials = new BasicAWSCredentials(

AWS_ACCESS_KEY_ID,

AWS_SECRET_ACCESS_KEY);

var authentication = new EKSAuthentication(new STSSessionCredentialsProvider(awsCredentials),

region.toString(),

clusterName);

ApiClient client = new ClientBuilder()

.setBasePath(clusterInfo.cluster().endpoint())

.setAuthentication(authentication)

.setVerifyingSsl(true)

.setCertificateAuthority(Base64.getDecoder().decode(clusterInfo.cluster().certificateAuthority().data()))

.build();

Configuration.setDefaultApiClient(client);

Now you'll be able to query and monitor Polaris from outside AWS's Kubernetes offering, EKS.

Wednesday, September 17, 2025

Configuring Polaris for Azure

Azure Config

To dispense tokens, you need to register what Azure calls an app. You do this with:

az ad app create \

--display-name "MyMultiTenantApp" \

--sign-in-audience "AzureADMultipleOrgs"

The sign-in-audience is an enum that determines if we're allowing single- or multi-tenant access (plus some Microsoft specific accounts).

Create the service principle with:

az ad sp create --id <appId>

where <appId> was spat out by the create command.

Finally, you need to assign roles to the app:

az role assignment create --assignee <appId> --role "Storage Blob Data Contributor" --scope /subscriptions/YOUR_SUBSCRIPTION/resourceGroups/YOUR_RESOURCE_GROUP

One last thing: you might need the credentials for this app so you can pass them to Polaris. You can do this with:

az ad app credential reset --id <appId>

and it will tell you the password. This is AZURE_CLIENT_SECRET (see below).

Polaris

First, Polaris needs these environment variables set to access Azure:

AZURE_TENANT_ID

AZURE_CLIENT_ID

AZURE_CLIENT_SECRET

Note that these are the credentials of the user connecting to Azure and not those of the principal that will dispense tokens. That is, they're your app's credentials not your personal ones.

Configuring the Catalog

The values perculiar to Azure that you need to add to the storageConfigInfo that is common to all cloud providers. These are:

tenantId. You can find this by running az account show --query tenantId.
multiTenantAppName. This is the Application (client) ID that was generated when the app was created. You can see it in Microsoft Entra ID -> App Registrations -> All Applications in the Azure portal or using the CLI: az ad app list, find your app with the name you created above and use its appId.
consentUrl. I'm not entirely sure what this is but can be generated with APPID=$(az ad app list --display-name "MyMultiTenantApp" --query "[0].appId" -o tsv) && echo "https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=$APPID&response_type=code&redirect_uri=http://localhost:3000/redirect&response_mode=query&scope=https://graph.microsoft.com/.default&state=12345"

Find out which url to use:

$ az storage account show --name odinconsultants --resource-group my_resource --query "{kind:kind, isHnsEnabled:isHnsEnabled}" -o table

Kind IsHnsEnabled

--------- --------------

StorageV2 True

HNS stands for hierarchical nested structure.

For StorageV2 and HNS equal to True, use abfss in the allowedLocations part of the JSON sent to /api/management/v1/catalogs.

Testing the credentials

Check the validity of the SAS token with:

az storage blob list --account-name odinconsultants --container-name myblob --sas-token $SAS_TOKEN --output table

We get SAS_TOKEN by putting a break point in Iceberg's ADLSOutputStream constructor.

Iceberg bug?

Iceberg asserts that the key in the map of Azure properties sent by Polaris for the expiry time is the string is adls.sas-token-expires-at-ms.PROJECT_NAME. But Polaris is emitting the string expiration-time. I've raised a bug in the Iceberg project here (14069).

I also raised a concurrency bug here (14070) but closed it when I realised it had been fixed in the main branch even if it wasn't in the latest (1.9.2) release.

Anyway, the upshot is that my workaround is to mask the Iceberg code by putting this (almost identical) class first in my classpath.

Thursday, September 11, 2025

Configuring Polaris for GCP

After successfully configuring Polaris to vend AWS credentials, I've moved on to the Google Cloud Platform.

As far as the Spark client and Polaris are concerned, there's not much difference to AWS. The only major change is that upon creating the Catalog, you need to put in your JSON a gcsServiceAccount. This refers to the service account you need to create.

You create it with something like:

$ gcloud iam service-accounts create my-vendable-sa --display-name "Vendable Service Account"

and then add to it the user for whom it will deputize:

$ gcloud iam service-accounts add-iam-policy-binding my-vendable-sa@philltest.iam.gserviceaccount.com --member="user:phillhenry@gmail.com" --role="roles/iam.serviceAccountTokenCreator"

where philltest is the project and phillhenry is the account for which the token will proxy.

~~$ gcloud auth application-default login --impersonate-service-account=my-vendable-sa@philltest.iam.gserviceaccount.com~~

~~Note it will warn you to run something like gcloud auth application-default set-quota-project philltest or whatever your project is called.~~

This will open your browser. After you sign in, it will write a credential in JSON file (it will tell you where). You must point the environment variable GOOGLE_APPLICATION_CREDENTIALS to this file before you run Polaris.

~~Note that it's the application-default switch that writes the credential as JSON so it can be used by your applications.~~

This creates a token for you but not that for a service account. For that you need:

gcloud iam service-accounts keys create ~/service-account.json --iam-account my-vendable-sa@philltest.iam.gserviceaccount.com

You might need your friendly admin to run:

gcloud org-policies set-policy allow-key-creation.yaml

where that YAML file contains:

name: projects/fjords-450014/policies/iam.disableServiceAccountKeyCreation

spec:

rules:

- enforce: false

Not so fast...

Running my Spark code just barfed when it tried to write to GCS with:

my-vendable-sa@philltest.iam.gserviceaccount.com does not have serviceusage.services.use access

Note that if the error message mentions your user (in my case PhillHenry@gmail.com) not the service account, you've pointed GOOGLE_APPLICATION_CREDENTIALS at the wrong application_default_credentials.json file.

Basically, I could write to the bucket using the command line but not using the token. I captured the token by putting a breakpoint here in Iceberg's parsing of the REST response from Polaris. Using the token (BAD_ACCESS_TOKEN), I ran:

$ curl -H "Authorization: Bearer ${BAD_ACCESS_TOKEN}" https://storage.googleapis.com/storage/v1/b/odinconsultants_phtest

{

"error": {

"code": 401,

"message": "Invalid Credentials",

"errors": [

{

"message": "Invalid Credentials",

"domain": "global",

"reason": "authError",

"locationType": "header",

"location": "Authorization"

}

]

}

The mistake I was making was that the command line was associated with me (PhillHenry) not the service account - that's why I could upload on CLI. Check your CLOUDSDK_CONFIG environment variable to see which credential files you're using.

Seems that the service account must have the role to access the account:

$ gcloud projects get-iam-policy philltest --flatten="bindings[].members" --format="table(bindings.role)" --filter="my-vendable-sa@philltest.iam.gserviceaccount.com"

ROLE

roles/serviceusage.serviceUsageAdmin

roles/storage.admin

roles/storage.objectAdmin

roles/storage.objectCreator

This appears to fix it because the service account must be able to see the project before it can see the project's buckets [SO].

Now, Spark is happily writing to GCS.