Tuesday, February 3, 2026

Polaris Federation notes

Here are some miscellaneous notes I made as I worked my way through grokking Polaris Federation and Authorization.

First:

Polaris DevOps

Polaris has integration tests in the integration-tests/src/main/java/ directory, not the test directory as you might have expected. The reason for this is that they can then be packaged as a JAR and used elsewhere in the codebase.

The advantage to doing it this way is that the same tests can be run against a local Polaris, a Polaris in the cloud, a Polaris running in Docker etc.

So, if we take CatalogFederationIntegrationTest, we can see it subclassed in 
the spark-tests Gradle package where it can be run with:

./gradlew :polaris-runtime-spark-tests:intTest

If you try to run the superclass in its own module with Gradle, it cannot be found as it's not in the test directory. If you try to run it with your IDE, you'll find that classes to be wired in at runtime are missing. Running the subclass, CatalogFederationIT, starts a MinIO docker container against which it can run.

Federation

The DTOs (Data Transfer Objects) for creating catalogs etc live in 

org.apache.polaris.core.admin.model 

For example, ExternalCatalog, which can be serialized into JSON.
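To give a flavour of what goes across the wire, here is a sketch of such a payload. The field names below are my illustrative assumptions, not the authoritative Polaris schema:

```python
import json

# Illustrative sketch only: the field names are assumptions, not the official
# org.apache.polaris.core.admin.model.ExternalCatalog schema.
external_catalog = {
    "type": "EXTERNAL",
    "name": "my_federated_catalog",
    "properties": {"default-base-location": "s3://my-bucket/warehouse/"},
}

# Serialize to JSON, as the DTO would be before crossing the wire.
body = json.dumps(external_catalog, sort_keys=True)
```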

These are passed across the wire and are turned into DPOs (Data Persistence Objects) that live in 

org.apache.polaris.core.connection.iceberg

In the case of the IcebergRestConnectionConfigInfoDpo, this DPO is not a mere anemic domain model. It has the logic to, for instance, create the properties that will be used to instantiate the class that will govern authentication. It does this by delegating to this factory class:

org.apache.iceberg.rest.auth.AuthManagers

Notice that we have moved from Polaris to the world of Iceberg. The various AuthManagers implement access to OAuth2 providers, Google, SigV4 for AWS etc.
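To give a flavour of the dispatch, here is a toy Python sketch (the real thing is Java). The property key rest.auth.type is Iceberg's; the manager names here are abbreviated stand-ins:

```python
# Toy sketch of the factory dispatch AuthManagers performs (assumptions:
# the real mapping and class names differ; "rest.auth.type" is Iceberg's key).
def auth_manager_for(properties: dict) -> str:
    auth_type = properties.get("rest.auth.type", "oauth2")
    managers = {
        "oauth2": "OAuth2Manager",
        "sigv4": "SigV4AuthManager",
        "google": "GoogleAuthManager",
    }
    return managers[auth_type]
```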

However, there is a mismatch. The AuthenticationParameters DTO classes don't fully align with the AuthManager classes. For instance, there doesn't appear to be a way of creating an external catalog with authorisation via org.apache.iceberg.gcp.auth.GoogleAuthManager.

So, after a day of investigating and trying to hack something together, it looks like this:
  • Iceberg can talk to Google no problem using org.apache.iceberg.gcp.auth.GoogleAuthManager.
  • However, there is currently no Polaris code to use GoogleAuthManager in an external catalog.
  • Instead, the only way to do it currently is to use the standard OAuth2 code.
  • However, Google does not completely follow the OAuth2 spec, hence this Iceberg ticket that led to the writing of GoogleAuthManager and this StackOverflow post that says GCP does not support the grant_type that Iceberg's OAuth2Util uses.
This has now been raised in this Polaris ticket.
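The mismatch boils down to the grant_type. A sketch of the two token request bodies (the credential values are placeholders):

```python
from urllib.parse import urlencode

# The grant Iceberg's OAuth2Util issues (RFC 6749 client credentials):
iceberg_request = urlencode({
    "grant_type": "client_credentials",
    "client_id": "CLIENT_ID",          # placeholder
    "client_secret": "CLIENT_SECRET",  # placeholder
})

# The grant Google's token endpoint expects instead (RFC 7523 JWT bearer):
google_request = urlencode({
    "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
    "assertion": "SIGNED_JWT",         # placeholder for a signed service-account JWT
})
```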

Saturday, January 31, 2026

Notes on Poetry

The dependencies in Poetry can get screwed. If hashes don't agree then blatting the poetry.lock file will not help. Instead, run:

poetry cache clear pypi --all
poetry cache clear --all .
rm -rf ~/.cache/pypoetry

When updating a dependency, run:

poetry lock
poetry install

This will install an environment in a subfolder of:

~/.cache/pypoetry/virtualenvs/

You can point your IDE at the Python executable underneath this.

Poetry is pretty nice when showing you dependencies. Running something like:

poetry show pandas

shows you everything Pandas needs and everything that depends on it.

This was necessary when decoding a bizarre error in a Jupyter notebook where an import was failing even though the package was clearly there. In this case, statsmodels and Pandas seemed to be disagreeing.

The code to put in the notebook to check it was using the right version of a library is:

import statsmodels
import pandas

print(statsmodels.__version__)
print(pandas.__version__)
print(statsmodels.__file__)

Now, compare this to the Python environment:

poetry run python - <<EOF
import statsmodels, pandas
print(statsmodels.__version__)
print(statsmodels.__file__)
print(pandas.__version__)
EOF

I had changed dependency versions but my IDE (PyCharm) did not recognise the change until I restarted it.

Some useful one-liners

Run your tests with:

poetry run pytest

The whereabouts of your tests can be found in your pyproject.toml file. It should look something like:

testpaths = ["tests"]
pythonpath = "src"

With this, your tests can import anything under the ROOT/src directory.
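Under the hood, pytest's pythonpath option just prepends entries to sys.path. A self-contained sketch of the same mechanism (the package name is made up):

```python
import sys
import tempfile
from pathlib import Path

# Simulate a ROOT/src layout: create a hypothetical package under src/.
root = Path(tempfile.mkdtemp())
pkg = root / "src" / "notes_demo_pkg"
pkg.mkdir(parents=True)
(pkg / "__init__.py").write_text("ANSWER = 42\n")

# This is effectively what pythonpath = "src" does for your tests.
sys.path.insert(0, str(root / "src"))
import notes_demo_pkg
```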

Add a dependency with something like:

poetry add ipykernel

This will update your pyproject.toml (and poetry.lock).

Monday, January 12, 2026

The Federation: AWS Glue

Apache Polaris can act as a proxy to other catalogs. This still appears to be work in progress as the roadmap proposal has "Catalog Federation" as "Tentatively Planned" at least until release 1.5.

If you're running it from source, you'll need to enable:

polaris.features."ENABLE_CATALOG_FEDERATION"=true
polaris.features."SUPPORTED_CATALOG_CONNECTION_TYPES"=["ICEBERG_REST", "HIVE"]
polaris.features."SUPPORTED_EXTERNAL_CATALOG_AUTHENTICATION_TYPES"=["OAUTH", "BEARER", "SIGV4"]


Apache Polaris can be a proxy for an Iceberg REST endpoint. Whatever the backing catalog, an org.apache.polaris.core.admin.model.ExternalCatalog is passed across the wire to create a catalog. Only the details differ.

AWS Glue

Glue is its own beast but it does offer an Iceberg REST endpoint. To use it, the AuthenticationParameters in the ExternalCatalog must be of type SigV4AuthenticationParameters.
"In IAM authentication, instead of using a password to authenticate against the [service], you create an authentication token that you include with your ... request. These tokens can be generated outside the [service] using AWS Signature Version 4 (SigV4) and can be used in place of regular authentication." [1]
So, the SigV4AuthenticationParameters ends up taking the region, role ARN, etc. The role must be available to the Principal that is associated with the Polaris instance. In addition, there must be a --policy-document that allows the Action glue:GetCatalog.
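For reference, the authentication block inside the ExternalCatalog ends up looking something like this. The field names are my best guess at the shape, so treat them as assumptions and check the Polaris admin API spec:

```python
import json

# Assumed shape of SigV4AuthenticationParameters; the ARN and region are
# placeholders and the key names may not match the authoritative spec.
auth_params = {
    "authenticationType": "SIGV4",
    "roleArn": "arn:aws:iam::123456789012:role/PolarisGlueRole",
    "signingRegion": "eu-west-1",
    "signingName": "glue",
}
body = json.dumps(auth_params)
```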

Finally, the Glue database and table must be created with Parameters that contain iceberg.table.default.namespace and an IcebergInput block.

TL;DR - most of the work is in configuring AWS not the calling code.

[1] Security and Microservice Architecture on AWS

Tuesday, January 6, 2026

Cross account AWS Permissions

You can have one AWS account see the contents of the S3 bucket of another, entirely separate account if you configure it correctly. Note that S3 bucket names are unique across the whole AWS estate, irrespective of who owns them. This is for historical reasons, it seems.

Anyway, to have account Emrys (say) read the bucket of account Odin (say), run the commands below.

Note that all of this can be run from the same command line if you have Emrys as your [default] account and [odin] as Odin's in ~/.aws/config and credentials. You'll need source_profile = odin in config to point to the correct credentials.

First, we create a role in Odin that Emrys will assume:

aws iam create-role --role-name S3ReadWriteRoleEmrys --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
              "AWS": "arn:aws:iam::EMRYS_ID:root"
      } ,
      "Action": "sts:AssumeRole"
    }
  ]
}' --profile odin 

Then we create a policy and attach it to the role.

aws iam create-policy --policy-name ReadOdinS3IAMPolicy  --policy-document file://emrys-policy.json --profile odin

aws iam attach-role-policy   --role-name S3ReadWriteRoleEmrys   --policy-arn arn:aws:iam::ODIN_ID:policy/ReadOdinS3IAMPolicy --profile odin

Note that emrys-policy.json is just a collection of s3 Actions that act on a Resource that is Odin's bucket - nothing special.

Then, on the Emrys side, we create a policy that allows assuming that role and attach it to our user:

aws iam create-policy \
  --policy-name AssumeOdinRole \
  --policy-document file://assume-account-odin-role.json

aws iam attach-user-policy \
  --user-name MY_USER \
  --policy-arn arn:aws:iam::EMRYS_ID:policy/AssumeOdinRole

where assume-account-odin-role.json just contains the sts:AssumeRole for Odin's S3ReadWriteRoleEmrys.

Finally, we get the temporary credentials to read the bucket with:

aws sts assume-role \
  --role-arn arn:aws:iam::ODIN_ID:role/S3ReadWriteRoleEmrys \
  --role-session-name s3-access

Just paste the output of this into your AWS environment variables.
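The Credentials block in that output maps onto the standard AWS environment variables like so (the credential values here are placeholders):

```python
# Map the sts assume-role output onto the env vars the AWS CLI and SDKs read.
credentials = {
    "AccessKeyId": "ASIAEXAMPLE",   # placeholder
    "SecretAccessKey": "SECRET",    # placeholder
    "SessionToken": "TOKEN",        # placeholder
}

env = {
    "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
    "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
    "AWS_SESSION_TOKEN": credentials["SessionToken"],
}
```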

"For a principal to be able to switch roles, there needs to be an IAM policy attached to it that allows the principal to call the AssumeRole action on STS [Security Token Service].

"In addition, IAM roles can also have a special type of a resource-based policy attached to them, called an IAM role trust policy. An IAM role trust policy can be written just like any other resource-based policy, by specifying the principals that are allowed to assume the role in question... Any principal that is allowed to access a role is called a trusted entity for that role." [1]

Note that the key in the Principal map is significant as it defines the category of the identity. It can be 
  • a Service that is allowed to assume a role (eg, EKS)
  • AWS which indicates an IAM user or role or an assumed role (see below). Note that root does not indicate the most powerful user as in Unix. On the contrary, it means anybody legitimately associated with this account.
  • Federated which means it's a provider external to the native AWS ecosystem.
STS is the system an identity must apply to if it wishes to assume a role. This system checks that the identity is indeed allowed to do this.

An assumed role looks like this:

arn:aws:sts::123456789012:assumed-role/S3ReadWriteRoleSF/snowflake

where S3ReadWriteRoleSF is the normal, IAM role name and snowflake is the session name. This session name is merely a tag and has no intrinsic permissions (although it may be used in Condition/StringEquals). This will be set in --role-session-name (see above) when assuming the role.
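Pulling that ARN apart:

```python
# Split an assumed-role ARN into its role name and session name.
arn = "arn:aws:sts::123456789012:assumed-role/S3ReadWriteRoleSF/snowflake"
_, resource = arn.split(":assumed-role/")
role_name, session_name = resource.split("/")
```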

[1] Security and Microservice Architecture on AWS

Wednesday, December 31, 2025

Debugging JNI calls to the GPU

I'm playing around with a Java-based LLM (code here). When running a JVM that calls the GPU using TornadoVM, it crashed, and in the log I saw:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1302b0]
C  [libcuda.so.1+0x332420]
C  [libtornado-ptx.so+0x64b8]  Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198
j  uk.ac.manchester.tornado.drivers.ptx.PTXStream.cuLaunchKernel([BLjava/lang/String;IIIIIIJ[B[B)[[B+0 tornado.drivers.ptx@2.2.1-dev
...

Now, finding the Shared Object files (*.so), I called: 

objdump -d /usr/lib/x86_64-linux-gnu/libcuda.so.1 
objdump -d /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so

and looked at the addresses in the stack dump.

First, libtornado-ptx.so. Note that the address (0x64b8) is the return address from a call, that is, the next line after the call that went Pete Tong. 

    64b3:       e8 b8 e1 ff ff          call   4670 <cuLaunchKernel@plt>
    64b8:       48 83 c4 30             add    $0x30,%rsp

So, it's the call to cuLaunchKernel that is interesting.

Similarly, the libcuda.so.1 frame at 0x332420 is the return address of a call:

  33241b:       e8 00 de df ff          call   130220 <exit@plt+0x4e460>
  332420:       5a                      pop    %rdx

and the final (top most) stack frame:

  1302ab:       4d 85 e4                test   %r12,%r12
  1302ae:       74 58                   je     130308 <exit@plt+0x4e548>
  1302b0:       41 8b 04 24             mov    (%r12),%eax

The instruction test %x,%y is a common idiom in null checks: x and y are ANDed together and the je jumps if the Zero Flag is set. Note that this flag is set exactly when the result of the AND is zero, so for test %r12,%r12 the jump is taken precisely when the register holds a null pointer.
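In Python terms, the branch condition boils down to:

```python
# TEST reg,reg ANDs the register with itself and sets ZF when the result
# is zero; JE then jumps exactly when the register (a pointer) was null.
def je_taken(reg: int) -> bool:
    return (reg & reg) == 0
```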

So, it looks like we've essentially got what's equivalent to a NullPointerException in the machine code. Still looking at what's null... [Solved: had to use a model that is compatible with GPULlama3.java]

Monday, December 15, 2025

AWS cheatsheet

Various command lines that have helped me recently.

IAM

List the role's attached and inline policies with:

aws iam list-attached-role-policies --role-name $ROLE_NAME

aws iam list-role-policies --role-name $ROLE_NAME

Whoami with:

aws sts get-caller-identity 

Policies are a collection of actions and services that can be assigned. List all homemade policies with:

aws iam list-policies --scope Local --query 'Policies[].Arn' --output table

Similarly, list all roles with:

aws iam list-roles --query 'Roles[].RoleName' --output table

List all the Actions for a policy with:

aws iam get-policy-version --policy-arn $POLICY_ARN --version-id $(aws iam get-policy --policy-arn $POLICY_ARN --query 'Policy.DefaultVersionId' --output text) --query 'PolicyVersion.Document.Statement[].Action'   --output json | jq -r '.[]' | sort -u

List all the trust policies for a given role:

aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json

Note that assuming a role grants some temporary elevation of privileges, while attaching policies to a role is more about defining what the role can actually do.

List everything attached to a policy:

aws iam list-entities-for-policy --policy-arn $POLICY_ARN

Instance profiles contain roles. They act as a bridge to securely pass an IAM role to an EC2 instance, enabling the instance to access other AWS services without needing to store long-term, hard-coded credentials like access keys. You can see them with:

aws iam list-instance-profiles-for-role --role-name $ROLE_NAME --query "InstanceProfiles[].InstanceProfileName" --output text

In short: 
  • Trust policies say who can access a role. 
  • Permission policies say what a role can do.
Note that this is why trust policies have only one action: sts:AssumeRole.

Secrets

See access to K8s secrets with:

kubectl logs -n kube-system -l app=csi-secrets-store-provider-aws -XXX

See an AWS secret with:

aws secretsmanager get-secret-value --secret-id $SECRET_ARN --region $REGION

Deleting them is interesting as they will linger unless told otherwise:

aws --region $REGION secretsmanager  delete-secret --secret-id $SECRET_NAME --force-delete-without-recovery

Infra

To see why your EKS deployments aren't working:

kubectl get events --sort-by=.metadata.creationTimestamp | tail -20

Terraform seems to have a problem deleting load balancers in AWS. You can see them with:

aws elbv2 describe-load-balancers

List the classic (v1) load balancers:

aws elb describe-load-balancers --region $REGION

List the VPCs:

aws ec2 describe-vpcs --region $REGION

Glue

Create with:

aws glue create-database  --database-input '{"Name": "YOUR_DB_NAME"}'  --region $REGION

Create an Iceberg table with:

aws glue create-table \
    --database-name YOUR_DB_NAME \
    --table-input '
        {
            "Name": "TABLE_NAME",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Location": "s3://ROOT_DIRECTORY_OF_TABLE/",
                "Columns": [
                    { "Name": "id", "Type": "int" },
...
                    { "Name": "randomInt", "Type": "int" }
                ],
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                }
            },
            "Parameters": {
                "iceberg.table.default.namespace": "YOUR_DB_NAME"
            }
        }' \
    --open-table-format-input '
        {
            "IcebergInput": {
                "MetadataOperation": "CREATE",
                "Version": "2" 
            }
        }' \
    --region $REGION

Get all the databases with:

aws glue get-databases --query 'DatabaseList[*].Name' --output table

Get tables with:

aws glue get-tables --database-name YOUR_DB_NAME

Drop with:

aws glue delete-table --name TABLE_NAME --database-name YOUR_DB_NAME

Monday, November 24, 2025

AWS and HTTPS

This proved surprisingly hard but the take-away points are:
  • A Kubernetes service account must be given permission to access AWS infrastructure.
  • The Kubernetes cluster needs AWS-specific pods to configure the K8s ingress such that it receives traffic from outside the cloud.
  • The ingress is where the SSL de/encryption is performed.
  • Creating the certificate is easy via the AWS web console, where you just associate it with the domain name.
The recipe

The following steps assume you have an ingress and a service already up and running. I did the mapping between the two in Terraform. What follows below is how to allow these K8s primitives to use AWS so they can be contacted by the outside world.

You need to associate an OpenID provider with the cluster and create a Kubernetes service account that is permissioned to use the AWS load balancer. Note that lines that are predominantly Kubernetes are blue and AWS lines are red.

eksctl utils associate-iam-oidc-provider --cluster $CLUSTERNAME --approve

curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json

aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy     --policy-document file://iam_policy.json

eksctl create iamserviceaccount --cluster $CLUSTERNAME --namespace kube-system   --name aws-load-balancer-controller --attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy --approve

kubectl describe sa aws-load-balancer-controller -n kube-system # check it's there

Then you need to configure Kubernetes to use the AWS load balancer. 

helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=$CLUSTERNAME --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

However, I could see my replicasets were failing when I ran:

kubectl get rs -A

with something like:

  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  67s (x15 over 2m29s)  replicaset-controller  Error creating: pods "aws-load-balancer-controller-68f465f899-" is forbidden: error looking up service account kube-system/aws-load-balancer-controller: serviceaccount "aws-load-balancer-controller" not found

and there are no load balancer pods.

So, it seemed I needed to: 

kubectl apply -f aws-lbc-serviceaccount.yaml

where aws-lbc-serviceaccount.yaml is:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: aws-load-balancer-controller
    app.kubernetes.io/instance: aws-load-balancer-controller

The pods were now starting but quickly failing with errors like:

{"level":"error","ts":"2025-11-21T17:31:22Z","logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}

The controller could not discover the VPC ID from instance metadata, so we can set it explicitly with:

helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller   --namespace kube-system --set clusterName=$CLUSTERNAME --set vpcId=$VPC_ID --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

and now the pods are running.

However, my domain name was still not resolving. So, run this to get the OIDC (OpenID Connector) issuer:

aws eks describe-cluster --name $CLUSTERNAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///"

Note that this value changes every time the cluster is created.

Then run:

aws iam create-role \
    --role-name ${IAM_ROLE_NAME} \
    --assume-role-policy-document file://lbc-trust-policy.json

where lbc-trust-policy.json is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_ISSUER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_ISSUER:aud": "sts.amazonaws.com",
          "OIDC_ISSUER:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
        }
      }
    }
  ]
}

Get the ARN of that policy:

POLICY_ARN=$(aws iam list-policies --scope Local --query "Policies[?PolicyName=='AWSLoadBalancerControllerIAMPolicy'].Arn" --output text)

Create the role:

aws iam create-role --role-name AmazonEKSLoadBalancerControllerRole --assume-role-policy-document file://lbc-trust-policy.json

attach the policy:

aws iam attach-role-policy --role-name AmazonEKSLoadBalancerControllerRole --policy-arn ${POLICY_ARN}

then inform the cluster:

kubectl annotate serviceaccount aws-load-balancer-controller -n kube-system eks.amazonaws.com/role-arn="arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole" --overwrite

If you're logging the aws-load-balancer-controller-XXX pod, you'll see it register this change if you restart the ingress with:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system

then check its status with:

kubectl describe ingress $INGRESS_NAME

Note the ADDRESS. It will be of the form k8s-XXX.REGION.elb.amazonaws.com. Let's define it as:

INGRESS_HOSTNAME=$(kubectl get ingress $INGRESS_NAME  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Find the HostedZoneId of your load balancer with:

aws elbv2 describe-load-balancers   --query "LoadBalancers[?DNSName=='$INGRESS_HOSTNAME'].CanonicalHostedZoneId" --output text   --region $REGION

Registering domain in Route 53

Create the A type DNS entry with:

HOSTED_ZONE_ID=$(aws route53 list-hosted-zones-by-name \
    --dns-name $FQDN \
    --query "HostedZones[0].Id" --output text | awk -F'/' '{print $3}')
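The Id comes back in the form /hostedzone/ZONEID, which is why the awk strips the first two /-separated fields. Equivalently (the zone ID below is an example value):

```python
# list-hosted-zones-by-name returns Ids like "/hostedzone/ZONEID";
# strip the prefix just as the awk -F'/' '{print $3}' does.
raw_id = "/hostedzone/Z0123456789ABC"  # example value
hosted_zone_id = raw_id.split("/")[-1]
```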

aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID"  --change-batch file://route53_change.json

where route53_change.json is:

{
      "Comment": "ALIAS record for EKS ALB Ingress",
      "Changes": [
        {
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "FQDN",
            "Type": "A",
            "AliasTarget": {
              "HostedZoneId": "YOUR_HOST_ZONE_ID",
              "DNSName": "INGRESS_HOSTNAME",
              "EvaluateTargetHealth": false
            }
          }
        }
      ]
    }

After a few minutes, you'll see that the IP address of the domain name and INGRESS_HOSTNAME are the same.

You can create your own hosted zone with:

aws route53 create-hosted-zone --name "polarishttps.emryspolaris.click"     --caller-reference "$(date +%Y-%m-%d-%H-%M-%S)"

but this can lead to complications. 
"Public-hosted zones have a route to internet-facing resources and resolve from the internet using global routing policies. Meanwhile, private hosted zones have a route to VPC resources and resolve from inside the VPC." - AWS for Solution Architects, O'Reilly

Certificate

We've now linked the domain name to an endpoint. Now we need to create a certificate. I did this through the AWS web console and after just a few clicks, it gave me the ARN.

You might need to wait a few minutes for it to become live but you can see the status of a certificate with:

aws acm describe-certificate --certificate-arn "$CERT_ARN" --region eu-west-1 --query "Certificate.Status"  --output text