Tuesday, January 6, 2026

Cross account AWS Permissions

You can have one AWS account see the contents of an S3 bucket in another, entirely separate account if you configure it correctly. Note that S3 bucket names are unique across the whole AWS estate, irrespective of who owns them. This is for historical reasons, it seems.

Anyway, to have account Emrys (say) read the bucket of account Odin (say), run the commands below.

Note that all of this can be run from the same command line if you have Emrys as your [default] profile and Odin under [odin] in ~/.aws/config and ~/.aws/credentials. You'll need source_profile = odin in config to point at the correct credentials.
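
For what it's worth, the layout is roughly this (a sketch; keys, region and profile names are placeholders, and source_profile really comes into its own when the profile also sets a role_arn to assume):

# ~/.aws/credentials
[default]
aws_access_key_id = <Emrys access key>
aws_secret_access_key = <Emrys secret key>

[odin]
aws_access_key_id = <Odin access key>
aws_secret_access_key = <Odin secret key>

# ~/.aws/config
[default]
region = eu-west-2

[profile odin]
region = eu-west-2
source_profile = odin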

First, we create a role in Odin that Emrys will assume:

aws iam create-role --role-name S3ReadWriteRoleEmrys --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
              "AWS": "arn:aws:iam::EMRYS_ID:root"
      } ,
      "Action": "sts:AssumeRole"
    }
  ]
}' --profile odin 

Then we create a policy and attach it to the role.

aws iam create-policy --policy-name ReadOdinS3IAMPolicy  --policy-document file://emrys-policy.json --profile odin

aws iam attach-role-policy   --role-name S3ReadWriteRoleEmrys   --policy-arn arn:aws:iam::ODIN_ID:policy/ReadOdinS3IAMPolicy --profile odin

Note that emrys-policy.json is just a collection of s3 Actions that act on a Resource that is Odin's bucket - nothing special.
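
As a sketch, it might look something like this (the bucket name and the exact list of actions are assumptions; trim to what you actually need):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::ODINS_BUCKET",
        "arn:aws:s3:::ODINS_BUCKET/*"
      ]
    }
  ]
}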

Then in the Emrys real estate, we run:

aws iam create-policy \
  --policy-name AssumeOdinRole \
  --policy-document file://assume-account-odin-role.json

aws iam attach-user-policy \
  --user-name MY_USER \
  --policy-arn arn:aws:iam::EMRYS_ID:policy/AssumeOdinRole

where assume-account-odin-role.json just contains the sts:AssumeRole for Odin's S3ReadWriteRoleEmrys.
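
A minimal sketch of that file (the role ARN mirrors the role created in Odin above):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::ODIN_ID:role/S3ReadWriteRoleEmrys"
    }
  ]
}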

Finally, we get the temporary credentials to read the bucket with:

aws sts assume-role \
  --role-arn arn:aws:iam::ODIN_ID:role/S3ReadWriteRoleEmrys \
  --role-session-name s3-access

Just paste the output of this into your AWS environment variables.
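
Or, if you'd rather script it, something like this should work (assuming jq is installed):

CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::ODIN_ID:role/S3ReadWriteRoleEmrys \
  --role-session-name s3-access \
  --query 'Credentials' --output json)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.SessionToken')

aws s3 ls s3://ODINS_BUCKET/   # should now work with the temporary credentials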

"For a principal to be able to switch roles, there needs to be an IAM policy attached to it that allows the principal to call the AssumeRole action on STS [Security Token Service].

"In addition, IAM roles can also have a special type of a resource-based policy attached to them, called an IAM role trust policy. An IAM role trust policy can be written just like any other resource-based policy, by specifying the principals that are allowed to assume the role in question... Any principal that is allowed to access a role is called a trusted entity for that role." [1]

Note that the key in the Principal map is significant as it defines the category of the identity. It can be:
  • a Service that is allowed to assume a role (e.g., EKS)
  • AWS, which indicates an IAM user, a role, or an assumed role (see below). Note that root does not indicate the most powerful user as it does in Unix. On the contrary, it means anybody legitimately associated with that account.
  • Federated, which means the identity comes from a provider external to the native AWS ecosystem.
STS is the system an identity must apply to if it wishes to assume a role. This system checks that the identity is indeed allowed to do so.

An assumed role looks like this:

arn:aws:sts::123456789012:assumed-role/S3ReadWriteRoleSF/snowflake

where S3ReadWriteRoleSF is the normal, IAM role name and snowflake is the session name. This session name is merely a tag and has no intrinsic permissions (although it may be used in Condition/StringEquals). This will be set in --role-session-name (see above) when assuming the role.
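
For example, a trust policy statement can pin the session name with a condition like this (a sketch; sts:RoleSessionName is the relevant condition key):

{
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": { "sts:RoleSessionName": "snowflake" }
  }
}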

[1] Security and Microservice Architecture on AWS

Wednesday, December 31, 2025

Debugging JNI calls to the GPU

I'm playing around with a Java-based LLM (code here). When running a JVM that calls the GPU using TornadoVM, it crashed and in the log I saw:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1302b0]
C  [libcuda.so.1+0x332420]
C  [libtornado-ptx.so+0x64b8]  Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198
j  uk.ac.manchester.tornado.drivers.ptx.PTXStream.cuLaunchKernel([BLjava/lang/String;IIIIIIJ[B[B)[[B+0 tornado.drivers.ptx@2.2.1-dev
...

Now, having found the shared object files (*.so), I called:

objdump -d /usr/lib/x86_64-linux-gnu/libcuda.so.1 
objdump -d /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so

and looked at the addresses in the stack dump.
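
Dumping whole libraries like libcuda.so.1 is slow; if you only care about a small window around each frame, objdump can be restricted to a range of addresses (the offsets here are the ones from the stack dump above):

objdump -d --start-address=0x6480 --stop-address=0x64c0 /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so
objdump -d --start-address=0x332400 --stop-address=0x332440 /usr/lib/x86_64-linux-gnu/libcuda.so.1
objdump -d --start-address=0x130280 --stop-address=0x1302c0 /usr/lib/x86_64-linux-gnu/libcuda.so.1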

First, libtornado-ptx.so. Note that the address (0x64b8) is the return address from a call, that is, the instruction after the call that went Pete Tong.

    64b3:       e8 b8 e1 ff ff          call   4670 <cuLaunchKernel@plt>
    64b8:       48 83 c4 30             add    $0x30,%rsp

So, it's the call to cuLaunchKernel that is interesting.

Next, the second frame, in libcuda.so.1 at return address 0x332420:

  33241b:       e8 00 de df ff          call   130220 <exit@plt+0x4e460>
  332420:       5a                      pop    %rdx

and the final (top most) stack frame:

  1302ab:       4d 85 e4                test   %r12,%r12
  1302ae:       74 58                   je     130308 <exit@plt+0x4e548>
  1302b0:       41 8b 04 24             mov    (%r12),%eax

The instruction test %x,%y is a common idiom in null checks: x and y are ANDed and the je jumps if the Zero Flag is set. That flag is set only when the result of the AND is zero, and with test %r12,%r12 that means %r12 itself is zero, i.e. a null pointer.

So, it looks like we've essentially got what's equivalent to a NullPointerException in the machine code. Still looking at what's null... [Solved: had to use a model that is compatible with GPULlama3.java]

Monday, December 15, 2025

AWS cheatsheet

Various command lines that have helped me recently.

IAM

List a role's attached and inline policies with:

aws iam list-attached-role-policies --role-name $ROLE_NAME

aws iam list-role-policies --role-name $ROLE_NAME

Whoami with:

aws sts get-caller-identity 

Policies are collections of actions on services that can be assigned to identities. List all homemade (customer-managed) policies with:

aws iam list-policies --scope Local --query 'Policies[].Arn' --output table

Similarly, list all roles with:

aws iam list-roles --query 'Roles[].RoleName' --output table

List all the Actions for a policy with:

aws iam get-policy-version --policy-arn $POLICY_ARN --version-id $(aws iam get-policy --policy-arn $POLICY_ARN --query 'Policy.DefaultVersionId' --output text) --query 'PolicyVersion.Document.Statement[].Action'   --output json | jq -r '.[]' | sort -u

List the trust policy for a given role:

aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json

Note that assuming a role implies a temporary elevation of privileges, while attaching a policy to a role is more about defining what that role can actually do.

List everything attached to a policy:

aws iam list-entities-for-policy --policy-arn $POLICY_ARN

Instance profiles contain roles. They act as a bridge to securely pass an IAM role to an EC2 instance, enabling the instance to access other AWS services without needing to store long-term, hard-coded credentials like access keys. You can see them with:

aws iam list-instance-profiles-for-role --role-name $ROLE_NAME --query "InstanceProfiles[].InstanceProfileName" --output text

In short: 
  • Trust policies say who can access a role. 
  • Permission policies say what a role can do.
Note that this is why trust policies typically have only one action: sts:AssumeRole (or a variant such as sts:AssumeRoleWithWebIdentity).

Secrets

See access to K8s secrets with:

kubectl logs -n kube-system -l app=csi-secrets-store-provider-aws -XXX

See an AWS secret with:

aws secretsmanager get-secret-value --secret-id $SECRET_ARN --region $REGION

Deleting them is interesting as they will linger unless told otherwise:

aws --region $REGION secretsmanager  delete-secret --secret-id $SECRET_NAME --force-delete-without-recovery
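
Until then, a secret scheduled for deletion just sits there with a DeletedDate; you can check whether that is the case with something like:

aws secretsmanager describe-secret --secret-id $SECRET_NAME --region $REGION --query '{Name: Name, DeletedDate: DeletedDate}'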

Infra

To see why your EKS deployments aren't working:

kubectl get events --sort-by=.metadata.creationTimestamp | tail -20

Terraform seems to have a problem deleting load balancers in AWS. You can see them with:

aws elbv2 describe-load-balancers

List the classic load balancers:

aws elb describe-load-balancers --region $REGION

List the VPCs:

aws ec2 describe-vpcs --region $REGION

Glue

Create with:

aws glue create-database  --database-input '{"Name": "YOUR_DB_NAME"}'  --region $REGION

Create an Iceberg table with:

aws glue create-table \
    --database-name YOUR_DB_NAME \
    --table-input '
        {
            "Name": "TABLE_NAME",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Location": "s3://ROOT_DIRECTORY_OF_TABLE/",
                "Columns": [
                    { "Name": "id", "Type": "int" },
...
                    { "Name": "randomInt", "Type": "int" }
                ],
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                }
            },
            "Parameters": {
                "iceberg.table.default.namespace": "YOUR_DB_NAME"
            }
        }' \
    --open-table-format-input '
        {
            "IcebergInput": {
                "MetadataOperation": "CREATE",
                "Version": "2" 
            }
        }' \
    --region $REGION

Get all the databases with:

aws glue get-databases --query 'DatabaseList[*].Name' --output table

Get tables with:

aws glue get-tables --database-name YOUR_DB_NAME

Drop with:

aws glue delete-table --name TABLE_NAME --database-name YOUR_DB_NAME

Monday, November 24, 2025

AWS and HTTPs


This proved surprisingly hard but the take-away points are:
  • A Kubernetes service account must be given permission to access AWS infrastructure
  • The Kubernetes cluster needs AWS-specific pods to configure the K8s ingress such that it receives traffic from outside the cloud
  • The ingress is where the SSL de/encryption is performed.
  • Creating the certificate is easy when using the AWS web console; there you just associate it with the domain name.
The recipe

The following steps assume you have an ingress and a service already up and running. I did the mapping between the two in Terraform. What follows below is how to allow these K8s primitives to use AWS so they can be contacted by the outside world.

You need to associate an OpenID provider with the cluster and create a Kubernetes service account that is permitted to use the AWS load balancer.

eksctl utils associate-iam-oidc-provider --cluster $CLUSTERNAME --approve

curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json

aws iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy     --policy-document file://iam_policy.json

eksctl create iamserviceaccount --cluster $CLUSTERNAME --namespace kube-system   --name aws-load-balancer-controller --attach-policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy --approve

kubectl describe sa aws-load-balancer-controller -n kube-system # check it's there

Then you need to configure Kubernetes to use the AWS load balancer. 

helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=$CLUSTERNAME --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

However, I could see my replicasets were failing when I ran:

kubectl get rs -A

with something like:

  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  67s (x15 over 2m29s)  replicaset-controller  Error creating: pods "aws-load-balancer-controller-68f465f899-" is forbidden: error looking up service account kube-system/aws-load-balancer-controller: serviceaccount "aws-load-balancer-controller" not found

and there are no load balancer pods.

So, it seemed I needed to:

kubectl apply -f aws-lbc-serviceaccount.yaml

where aws-lbc-serviceaccount.yaml is:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: aws-load-balancer-controller
    app.kubernetes.io/instance: aws-load-balancer-controller

The pods were now starting but quickly failing with errors like:

{"level":"error","ts":"2025-11-21T17:31:22Z","logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}

We can set the VPC ID explicitly with:

helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller   --namespace kube-system --set clusterName=$CLUSTERNAME --set vpcId=$VPC_ID --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

and now the pods are running.

However, my domain name was still not resolving. So, run this to get the OIDC (OpenID Connect) issuer:

aws eks describe-cluster --name $CLUSTERNAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///"

Note that this value changes every time the cluster is created.
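
Since the issuer changes with every cluster, it's worth capturing it in a variable and templating the trust policy rather than editing it by hand. A sketch (the template file name is an assumption; it's just the lbc-trust-policy.json shown below with OIDC_ISSUER and AWS_ACCOUNT_ID left as literal placeholders):

OIDC_ISSUER=$(aws eks describe-cluster --name $CLUSTERNAME \
  --query "cluster.identity.oidc.issuer" --output text | sed -e "s|^https://||")
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

sed -e "s|OIDC_ISSUER|$OIDC_ISSUER|g" -e "s|AWS_ACCOUNT_ID|$AWS_ACCOUNT_ID|g" \
  lbc-trust-policy.template.json > lbc-trust-policy.json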

Then run:

aws iam create-role \
    --role-name ${IAM_ROLE_NAME} \
    --assume-role-policy-document file://lbc-trust-policy.json

where lbc-trust-policy.json is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::AWS_ACCOUNT_ID:oidc-provider/OIDC_ISSUER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_ISSUER:aud": "sts.amazonaws.com",
          "OIDC_ISSUER:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
        }
      }
    }
  ]
}

Get the ARN of that policy:

POLICY_ARN=$(aws iam list-policies --scope Local --query "Policies[?PolicyName=='AWSLoadBalancerControllerIAMPolicy'].Arn" --output text)

Create the role (this is the create-role command from above, with IAM_ROLE_NAME set to AmazonEKSLoadBalancerControllerRole):

aws iam create-role --role-name AmazonEKSLoadBalancerControllerRole --assume-role-policy-document file://lbc-trust-policy.json

attach the policy:

aws iam attach-role-policy --role-name AmazonEKSLoadBalancerControllerRole --policy-arn ${POLICY_ARN}

then inform the cluster:

kubectl annotate serviceaccount aws-load-balancer-controller -n kube-system eks.amazonaws.com/role-arn="arn:aws:iam::AWS_ACCOUNT_ID:role/AmazonEKSLoadBalancerControllerRole" --overwrite

If you're tailing the logs of the aws-load-balancer-controller-XXX pod, you'll see it register this change when you restart the controller deployment with:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system

then check its status with:

kubectl describe ingress $INGRESS_NAME

Note the ADDRESS. It will be of the form k8s-XXX.REGION.elb.amazonaws.com. Let's define it as:

INGRESS_HOSTNAME=$(kubectl get ingress $INGRESS_NAME  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Find the HostedZoneId of your load balancer with:

aws elbv2 describe-load-balancers   --query "LoadBalancers[?DNSName=='$INGRESS_HOSTNAME'].CanonicalHostedZoneId" --output text   --region $REGION

Registering domain in Route 53

Create the A type DNS entry with:

HOSTED_ZONE_ID=$(aws route53 list-hosted-zones-by-name \
    --dns-name $FQDN \
    --query "HostedZones[0].Id" --output text | awk -F'/' '{print $3}')

aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID"  --change-batch file://route53_change.json

where route53_change.json is:

{
      "Comment": "ALIAS record for EKS ALB Ingress",
      "Changes": [
        {
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "FQDN",
            "Type": "A",
            "AliasTarget": {
              "HostedZoneId": "YOUR_HOST_ZONE_ID",
              "DNSName": "INGRESS_HOSTNAME",
              "EvaluateTargetHealth": false
            }
          }
        }
      ]
    }
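
One thing to note: the HostedZoneId inside AliasTarget is the load balancer's CanonicalHostedZoneId (the value fetched from describe-load-balancers above), not the Route 53 hosted zone ID. A sketch that fills the file in from the variables already defined (ALB_ZONE_ID is assumed to hold that canonical ID):

ALB_ZONE_ID=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?DNSName=='$INGRESS_HOSTNAME'].CanonicalHostedZoneId" \
  --output text --region $REGION)

cat > route53_change.json <<EOF
{
  "Comment": "ALIAS record for EKS ALB Ingress",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$FQDN",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "$ALB_ZONE_ID",
          "DNSName": "$INGRESS_HOSTNAME",
          "EvaluateTargetHealth": false
        }
      }
    }
  ]
}
EOF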

After a few minutes, you'll see that the IP addresses of the domain name and of INGRESS_HOSTNAME are the same.

You can create your own hosted zone with:

aws route53 create-hosted-zone --name "polarishttps.emryspolaris.click"     --caller-reference "$(date +%Y-%m-%d-%H-%M-%S)"

but this can lead to complications. 
"Public-hosted zones have a route to internet-facing resources and resolve from the internet using global routing policies. Meanwhile, private hosted zones have a route to VPC resources and resolve from inside the VPC." - AWS for Solution Architects, O'Reilly

Certificate

We've now linked the domain name to an endpoint. Now we need to create a certificate. I did this through the AWS web console and after just a few clicks, it gave me the ARN.

You might need to wait a few minutes for it to become live but you can see the status of a certificate with:

aws acm describe-certificate --certificate-arn "$CERT_ARN" --region eu-west-1 --query "Certificate.Status"  --output text
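
If you'd rather block until it is issued, the ACM waiter should do the job (it polls until the certificate's validation completes):

aws acm wait certificate-validated --certificate-arn "$CERT_ARN" --region eu-west-1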

Tuesday, November 18, 2025

Snowflake and AWS

This is how you get Snowflake to talk to your AWS real estate. Before we start, get your AWS account ID with:

aws sts get-caller-identity 

This will be used as your GLUE_CATALOG_ID (see below).

Now, you need to create in Snowflake a volume like this:

CREATE OR REPLACE EXTERNAL VOLUME YOUR_VOLUME_NAME
  STORAGE_LOCATIONS = (
    ( NAME = 'eu-west-2'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://ROOT_DIRECTORY_OF_TABLE/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'
    )
  )
  ALLOW_WRITES = FALSE;  

You run this even though you have yet to create the role. Then run:

select system$verify_external_volume('YOUR_VOLUME_NAME');

This will give you some JSON that includes a STORAGE_AWS_IAM_USER_ARN. You never create this user. Snowflake does it itself. Its ARN is what you need to create a role in AWS that allows Snowflake's user to see data.

You create the role with an ordinary aws iam create-role --role-name S3ReadWriteRoleSF --assume-role-policy-document... using the ARN that we got from Snowflake, above. That is, our Snowflake instance has its own AWS user and you must give that user access to your real estate.
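
As a sketch, the trust policy for that role might look like this, assuming Snowflake also gave you a STORAGE_AWS_EXTERNAL_ID (both placeholder values come from Snowflake, not from anything you create yourself):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "STORAGE_AWS_IAM_USER_ARN"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "STORAGE_AWS_EXTERNAL_ID"
        }
      }
    }
  ]
}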

Now, give Snowflake access to your cloud assets with:

aws iam put-role-policy --role-name ROLE_NAME --policy-name GlueReadAccess --policy-document file://glue-read-policy.json

Where glue-read-policy.json just contains the Actions needed to talk to Glue.
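
A sketch of glue-read-policy.json, assuming read-only access is all Snowflake needs (the action list and the wildcard Resource are assumptions; tighten as appropriate):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": "*"
    }
  ]
}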

Finally, we create the catalog integration (note that this is a Glue catalog, not a REST catalog like Polaris):

CREATE OR REPLACE CATALOG INTEGRATION CATALOG_NAME
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'YOUR_DB_NAME'
  GLUE_CATALOG_ID = 'GLUE_CATALOG_ID'
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'
  GLUE_REGION = 'eu-west-2'
  ENABLED = TRUE;

Now you bring all these threads together when you create a table with:

CREATE ICEBERG TABLE arbitrary_name 
  EXTERNAL_VOLUME = 'YOUR_VOLUME_NAME'
  CATALOG = 'CATALOG_NAME'
  CATALOG_TABLE_NAME = 'TABLE_NAME';

Create a REST catalog with:

CREATE OR REPLACE CATALOG INTEGRATION polaris_int
    CATALOG_SOURCE = POLARIS
    TABLE_FORMAT = ICEBERG
    REST_CONFIG = (
        CATALOG_URI = 'https://YOUR_HOST:8181/api/catalog/v1/',
        CATALOG_NAME = 'YOUR_CATALOG_NAME_IN_POLARIS'
        )
    REST_AUTHENTICATION = (
        TYPE = BEARER
        BEARER_TOKEN = 'TOKEN'
    )
    ENABLED = TRUE;

Note that the URI must be HTTPS, not HTTP, and that the TOKEN comes from Polaris.

Saturday, November 15, 2025

Debugging Google Cloud Kubernetes

A problem I was having when spinning up a K8s cluster and then trying to deploy my own Polaris was that the pod was stuck in the Pending state. A quick kubectl describe pod gave the last event as "Pod didn't trigger scale-up".

So, let's look at the events (a.k.a operations):

gcloud container operations list --project $PROJECT

Then to drill down on the operation of interest:

gcloud container operations describe operation-XXX --region $REGION --project $PROJECT

It seemed pretty quiet. The last two events were:
  • CREATE_CLUSTER began at 16:35:38 and ran to 16:41:37
  • DELETE_NODE_POOL started at 16:41:41 and ran to 16:46:02
So, that delete came hot on the heels of the cluster successfully being created. I looked at the logs with:

gcloud logging read "resource.labels.cluster_name=spark-cluster AND timestamp>=\"2025-11-14T16:41:35Z\" AND timestamp<=\"2025-11-14T16:41:42Z\"" --project=$PROJECT --limit 10 --order=desc 

and one of these logs looked like this:

  requestMetadata:
    callerSuppliedUserAgent: google-api-go-client/0.5 Terraform/1.10.7 (+https://www.terraform.io)
      Terraform-Plugin-SDK/2.36.0 terraform-provider-google/dev6,gzip(gfe)
...
  response:
    operationType: DELETE_NODE_POOL

This was saying that the DELETE_NODE_POOL originated from my own Terraform-Plugin-SDK! And the reason for that was my Terraform had:

        "remove_default_node_pool": true

It did this because it then tried to create its own node pool. However, it seems that having two node pools at once exhausted the GCP quotas. My node pool failed to start but TF merrily went ahead and deleted the default pool anyway.

You can see quotas with:

gcloud compute regions describe $REGION

and node pools with:

gcloud container node-pools describe default-pool --cluster $CLUSTER_NAME --region $REGION --project $PROJECT
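
The quota list is long; to see only the quotas that are actually being consumed, something like this helps (assumes jq is installed):

gcloud compute regions describe $REGION --format=json | jq '.quotas[] | select(.usage > 0)'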

Wednesday, November 5, 2025

Spark Operator

I've found that managing Spark clusters in Kubernetes is far easier using the Spark Operator. Here are some commands that helped me diagnose issues.

Dude, where's my application?

List your Spark applications with:

kubectl get sparkapplications

or

kubectl get sparkapplications spark-connector -o yaml

to see what might be causing trouble for, say, the connector.

It can be annoying when you can't delete a sparkapplication with

kubectl delete sparkapplication YOUR_APP

even though it's running. In my case, I thought

kubectl rollout restart deployment spark-kubernetes-operator

left an orphaned cluster.

It's possible that you don't see anything even though there are Spark pods clearly there. In this case:

kubectl describe pod POD_NAME

and you should see something like:

...
Controlled By:  StatefulSet/XXX
...

Great, so it looks like the Spark Operator has set the cluster up by delegating to Kubernetes primitives. Let's see them:

kubectl get statefulsets

and then we can just:

kubectl delete statefulset XXX

OK, so, dude, where's my cluster

But we're barking up the wrong tree. The YAML to create a cluster has kind: SparkCluster so we're using the wrong CRD with sparkapplications.

kubectl get crd | grep spark
sparkclusters.spark.apache.org                              2025-11-04T10:52:56Z
...

Right, so now:

kubectl delete sparkclusters YOUR_CLUSTER

Python

As a little aside, I was seeing strange errors when running PySpark commands that appeared to be a versioning problem. A few commands that came in useful were:

import sys
print(sys.path)

to print where the Python executable was getting its libraries from and:

from pyspark.version import __version__
print(__version__)

to make sure we really did have the correct PySpark version. 

As it happened, it was the wrong version of the Iceberg runtime in spark.jars.packages.
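
The fix is to match the Iceberg runtime artifact to the Spark and Scala versions actually in use; for example, something along these lines (versions here are illustrative, and --packages on the command line is equivalent to setting spark.jars.packages):

pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1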