Wednesday, December 31, 2025

Debugging JNI calls to the GPU

I'm playing around with a Java-based LLM (code here). When I ran a JVM that calls the GPU via TornadoVM, it crashed, and in the log I saw:

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x1302b0]
C  [libcuda.so.1+0x332420]
C  [libtornado-ptx.so+0x64b8]  Java_uk_ac_manchester_tornado_drivers_ptx_PTXStream_cuLaunchKernel+0x198
j  uk.ac.manchester.tornado.drivers.ptx.PTXStream.cuLaunchKernel([BLjava/lang/String;IIIIIIJ[B[B)[[B+0 tornado.drivers.ptx@2.2.1-dev
...

Having found the shared object files (*.so), I disassembled them with:

objdump -d /usr/lib/x86_64-linux-gnu/libcuda.so.1 
objdump -d /usr/local/bin/Java/tornadovm-2.2.1-dev-ptx/lib/libtornado-ptx.so

and looked at the addresses in the stack dump.
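The full disassembly of libcuda.so.1 is huge, so it can help to restrict objdump to a small window around each address from the stack dump. Here is a minimal sketch, assuming GNU objdump and its --start-address/--stop-address options. The helper name addr_window and the default radius of 0x20 bytes are my own inventions for illustration:

```shell
# addr_window ADDR [RADIUS]
# Hypothetical helper: prints objdump flags that limit disassembly to the
# bytes from ADDR-RADIUS up to ADDR+RADIUS. RADIUS defaults to 0x20.
addr_window() (
    addr=$(( $1 ))
    radius=$(( ${2:-0x20} ))
    # Clamp at zero so a small address never produces a negative start.
    start=$(( addr > radius ? addr - radius : 0 ))
    printf -- '--start-address=0x%x --stop-address=0x%x\n' "$start" $(( addr + radius ))
)
```

It would be used as objdump -d $(addr_window 0x64b8) followed by the path to the library. Since x86 instructions vary in length, a window that starts mid-instruction may show a few garbled lines before the decoder resynchronises.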

First, libtornado-ptx.so. Note that the address (0x64b8) is the return address from a call, that is, the instruction immediately after the call that went Pete Tong.

    64b3:       e8 b8 e1 ff ff          call   4670 <cuLaunchKernel@plt>
    64b8:       48 83 c4 30             add    $0x30,%rsp

So, it's the call to cuLaunchKernel that is interesting. Following it into libcuda.so.1, the frame at 0x332420 is again the return address of a call:

  33241b:       e8 00 de df ff          call   130220 <exit@plt+0x4e460>
  332420:       5a                      pop    %rdx
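Both return addresses can be cross-checked against the bytes objdump prints. The e8 opcode is a near relative call: five bytes long, with a 32-bit signed little-endian displacement that is added to the address of the next instruction. A small sketch of that arithmetic follows (call_info is a made-up helper name, not part of any tool):

```shell
# call_info RETURN_ADDR B0 B1 B2 B3
# RETURN_ADDR is the address from the stack dump; B0..B3 are the four bytes
# that follow the e8 opcode, in the order objdump prints them.
call_info() (
    ret=$(( $1 ))
    # Little-endian: the last byte printed is the most significant.
    rel=$(( 0x$5$4$3$2 ))
    # Sign-extend the 32-bit displacement.
    if [ "$rel" -ge $(( 0x80000000 )) ]; then
        rel=$(( rel - 0x100000000 ))
    fi
    # The call itself is 5 bytes long, so it starts 5 bytes before the return address.
    printf 'call site 0x%x, target 0x%x\n' $(( ret - 5 )) $(( ret + rel ))
)

call_info 0x64b8 b8 e1 ff ff      # the call in libtornado-ptx.so
call_info 0x332420 00 de df ff    # the call in libcuda.so.1
```

The results agree with the disassembly: call sites 0x64b3 and 0x33241b, targets 0x4670 and 0x130220.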

The final (topmost) stack frame, still in libcuda.so.1:

  1302ab:       4d 85 e4                test   %r12,%r12
  1302ae:       74 58                   je     130308 <exit@plt+0x4e548>
  1302b0:       41 8b 04 24             mov    (%r12),%eax

The instruction test %x,%y is a common idiom in null checks. Basically, x and y are ANDed, the result is thrown away, and the je jumps if the Zero Flag is set. Note that this flag is set when the result of the AND is zero, which for test %r12,%r12 means %r12 itself is zero.
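The flag logic is easy to get backwards, so here is a tiny model of it in shell arithmetic. It is only a sketch of the Zero Flag rule, not of the other flags that test also updates, and zf_after_test is a name I made up:

```shell
# zf_after_test X Y
# Prints the value the Zero Flag would have after `test X,Y`:
# 1 if (X AND Y) is zero, 0 otherwise.
zf_after_test() (
    if [ $(( $1 & $2 )) -eq 0 ]; then
        echo 1
    else
        echo 0
    fi
)

zf_after_test 0 0             # null pointer tested against itself: ZF=1, je taken
zf_after_test 0x7f00 0x7f00   # non-null pointer: ZF=0, je falls through
zf_after_test 0xf0 0x0f       # different operands with no common bits: ZF=1
```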

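So, it looks like we've essentially got the machine-code equivalent of a NullPointerException. One caveat: the faulting address (0x1302b0) comes after the je, so if execution fell through the test, %r12 was non-zero and the mov is dereferencing a bad pointer rather than a literal null. Still looking at what's null... [Solved: I had to use a model that is compatible with GPULlama3.java]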

Monday, December 15, 2025

AWS cheatsheet

Various command lines that have helped me recently.

IAM

List the role's attached (managed) and inline policies with:

aws iam list-attached-role-policies --role-name $ROLE_NAME

aws iam list-role-policies --role-name $ROLE_NAME

Whoami with:

aws sts get-caller-identity 

Policies are a collection of actions and services that can be assigned. List all homemade policies with:

aws iam list-policies --scope Local --query 'Policies[].Arn' --output table

Similarly, list all roles with:

aws iam list-roles --query 'Roles[].RoleName' --output table

List all the Actions for a policy with:

aws iam get-policy-version --policy-arn $POLICY_ARN --version-id $(aws iam get-policy --policy-arn $POLICY_ARN --query 'Policy.DefaultVersionId' --output text) --query 'PolicyVersion.Document.Statement[].Action'   --output json | jq -r '.[]' | sort -u

List all the trust policies for a given role:

aws iam get-role --role-name $ROLE_NAME --query 'Role.AssumeRolePolicyDocument' --output json

Note that assuming a role implies some temporary elevation of privileges, while attaching a policy to a role is more about defining what the role actually is.
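To make the "temporary" part concrete, here is a minimal sketch of assuming a role from the command line. It relies on aws sts assume-role, which returns a short-lived access key, secret key and session token. The function name assume_role_env and the default session name are placeholders of mine:

```shell
# assume_role_env ROLE_ARN [SESSION_NAME]
# Prints export statements for the temporary credentials of the assumed role.
assume_role_env() (
    set -f   # no globbing when the credentials are word-split below
    creds=$(aws sts assume-role \
        --role-arn "$1" \
        --role-session-name "${2:-cheatsheet-session}" \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output text) || exit 1
    # With --output text the three values arrive tab-separated on one line.
    set -- $creds
    printf 'export AWS_ACCESS_KEY_ID=%s\n' "$1"
    printf 'export AWS_SECRET_ACCESS_KEY=%s\n' "$2"
    printf 'export AWS_SESSION_TOKEN=%s\n' "$3"
)
```

Running eval "$(assume_role_env $ROLE_ARN)" and then aws sts get-caller-identity should show the assumed role until the credentials expire. This only works if the role's trust policy allows the caller to assume it.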

List everything attached to a policy:

aws iam list-entities-for-policy --policy-arn $POLICY_ARN

Instance profiles contain roles. They act as a bridge to securely pass an IAM role to an EC2 instance, enabling the instance to access other AWS services without needing to store long-term, hard-coded credentials like access keys. You can see them with:

aws iam list-instance-profiles-for-role --role-name $ROLE_NAME --query "InstanceProfiles[].InstanceProfileName" --output text

Secrets

See access to K8s secrets with:

kubectl logs -n kube-system -l app=csi-secrets-store-provider-aws -XXX

See an AWS secret with:

aws secretsmanager get-secret-value --secret-id $SECRET_ARN --region $REGION

Deleting them is interesting, as by default a secret is only scheduled for deletion and lingers for a recovery window (30 days) unless told otherwise:

aws --region $REGION secretsmanager  delete-secret --secret-id $SECRET_NAME --force-delete-without-recovery
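If a softer delete is wanted, delete-secret also takes --recovery-window-in-days (between 7 and 30), which cannot be combined with --force-delete-without-recovery. A sketch of a wrapper that checks the range first follows. The name soft_delete_secret and the 7-day default are my own choices, and it assumes $REGION is set as elsewhere on this page:

```shell
# soft_delete_secret SECRET_ID [DAYS]
# Schedules a secret for deletion with an explicit recovery window.
soft_delete_secret() (
    days=${2:-7}
    # Secrets Manager only accepts a window of 7 to 30 days.
    if [ "$days" -lt 7 ] || [ "$days" -gt 30 ]; then
        echo "recovery window must be between 7 and 30 days" >&2
        exit 2
    fi
    aws --region "$REGION" secretsmanager delete-secret \
        --secret-id "$1" --recovery-window-in-days "$days"
)
```

While the window is open, the deletion can be cancelled with aws --region $REGION secretsmanager restore-secret --secret-id $SECRET_NAME.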

Infra

To see why your EKS deployments aren't working:

kubectl get events --sort-by=.metadata.creationTimestamp | tail -20

Terraform seems to have a problem deleting load balancers in AWS. You can see them with:

aws elbv2 describe-load-balancers

List the Classic Load Balancers (the older elb API; elbv2 covers Application, Network and Gateway Load Balancers):

aws elb describe-load-balancers --region $REGION

List the VPCs:

aws ec2 describe-vpcs --region $REGION

Glue

Create a database with:

aws glue create-database  --database-input '{"Name": "YOUR_DB_NAME"}'  --region $REGION

Create an Iceberg table with:

aws glue create-table \
    --database-name YOUR_DB_NAME \
    --table-input '
        {
            "Name": "TABLE_NAME",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Location": "s3://ROOT_DIRECTORY_OF_TABLE/",
                "Columns": [
                    { "Name": "id", "Type": "int" },
...
                    { "Name": "randomInt", "Type": "int" }
                ],
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                }
            },
            "Parameters": {
                "iceberg.table.default.namespace": "YOUR_DB_NAME"
            }
        }' \
    --open-table-format-input '
        {
            "IcebergInput": {
                "MetadataOperation": "CREATE",
                "Version": "2" 
            }
        }' \
    --region $REGION

Get all the databases with:

aws glue get-databases --query 'DatabaseList[*].Name' --output table

Get tables with:

aws glue get-tables --database-name YOUR_DB_NAME

Drop a table with:

aws glue delete-table --name TABLE_NAME --database-name YOUR_DB_NAME