Agile Java Man

Saturday, March 14, 2026

What happens in an LLM? (part 1)

A nice overview that's detailed but not too intricate is here [blog of SteelPh0enix AKA Wojciech Olech]

Note that when using a fully trained LLM, things are conceptually much simpler because it is more or less just a feedforward network. That is, the weights are immutable. State lives outside of the ANN and is updated by the output after each token runs through the feedforward network.

Attention!

The attention mechanism of an RNN takes the encodings and for each token, augments the input vector both forwards and backwards. "The rationale behind this is to capture additional information since current inputs may have a dependence on sequence elements that came either before or after it in a sentence, or both." [1] The two vectors are then concatenated to make one long vector. "We can consider this concatenated hidden state as the annotation of the source word since it contains the information of the jth word in both directions." [1]

The self-attention mechanism for each word has three vectors: query, key and and value. It compares the query of each word to the key of the others. This process is done in parallel ("multi-head attention") using different Q/K/V weights and the results combined.

The transformer architecture has superceded RNNs.

The cat sat on the mat

Conceptually, q, represents the query (eg, the word "sat" is looking for something to in the sentence to do the sitting); k is the key where the word "cat" is saying I am a noun that can sit; and v links the verb looking for a noun and the noun looking for a verb.

This is similar to when we apply singular value decomposition to a document/term space and create a concept space.

Basic self attention

Imagine T input vectors x⁽ⁱ⁾.

Tokens are embedded in a space of size d.

The T output vectors of self-attention are vectors z⁽ⁱ⁾.

These vectors are calculated thus:

z^{(i) =}Σ_j=1^T α_ij x⁽ⁱ⁾

where α is a matrix of the dot products of all the x vectors with softmax applied to it (remember, softmax does not change the relative sizes of the logits but does convert them into probabilities).

Multi-headed attention is the same algorithm calculated for multiple heads (that is, mutliple swimlanes of datat that represent nuances in language structure).

Self attention with learnable parameters

Here we project each x vector onto U_k, U_v and U_q. Note that Q=xU_q etc and U_q etc are fixed for a feed forward network. That is, somebody has done the hard work of calculating them during training.

The projections onto q and v are then multiplied together and the result is put into a matrix ω_ij where i is a token and j any other token.

The matrix ω is divided by (typically) √d and softmaxed.

That is:

Attention(Q, V, K) = softmax ( Q K^T / √d_k ) V

Tiling

Tiling is a technique when performing matrix operations on data that won't fit into memory.

"With naïve algorithm, to compute each element of the result, we gonna need to fetch S elements from both matrices. The output matrix has S² elements, therefore the total count of loaded elements is 2S³." [Penny Xu's blog]

This breaks down as 2 [vectors - one from each matrix] * S [the size of those vectors] * S²[the number of elements that are the result of all these dot products - that is, the size of the matrix].

"With 32×32 tiling, to compute each 32×32 tile of the result, we gonna need to fetch S/32 tiles from both matrices. The output size in tiles is (S/32)², the total count of loaded [elements] is 2*(S/32)³. Each 32×32 tile contains 32² elements, the total count of loaded elements is therefore (32²)*2*(S/32)³ = (2/32)*S³. Therefore, the tiling reduced global memory bandwidth by the factor of 32, which is a huge performance win." [ibid]

In other words, having broken the matrix down into a grid of size 32×32, and each block in the grid involving 2*(S/32)³ operations, the total number of operations is this first number times the second - that is, (2/32)*S³.

Note that the resulting matrix is tiled also. So, if we doing the matrix multiplication of C=AB, the total memory needed is one tile of each of A, B and C.

There's a further optimization. If we're looking for the maximum value (which is very common in neural nets where we typically employ the softmax function), we only need to store one value per tile per column/row.

[1] Machine Learning with PyTorch and SciKitLearn.

Friday, March 6, 2026

Permissions and Lakes

OpenID (authorisation) is built on top of OAuth (authentication).

"It allows clients to verify the identity of the end user based on the authentication performed by an authorization server, as well as to obtain basic profile information about the end user in an interoperable and RESTlike manner." Zero Trust Networks (O'Reilly)

The (Java) KeyCloak is a common choice for an open source solution. KeyCloak is an Identity Provider (IdP).

Polaris can vend credentials but it (rather than the cloud IAM system) controls who gets what. It acts as an ACL for ACLs, if you like. This code shows how a request is associated with a realm and a realm with a credential.

If you're going to use Polaris in production, you'll probably need a certificate from a recognised Certificate Authority like Let's Encrypt. The reason is that all HTTPS clients need to make a call to a list of hard coded authorities who will sign off the certificate as genuine.

AWS users can use AWS Certificate Manager (ACM) to certify endpoints - rather than configuring Polaris to use SSL. You can have AWS manage the whole thing; or you can "install an externally signed private CA certificate on your subordinate CA. This CA certificate must be signed by a parent CA. Installing the certificate completes the creation and activation of the CA."

Either way, the idea is that the Elastic Load Balancer provides an HTTPS endpoint, does all the de/encryption gubbins and then forwards plain HTTP on to Polaris that sits securely in your Virtual Private Cloud.

To this end, it seems you must deploy the AWS Load Balancer Controller in your Kubernetes cluster much like the vpc-cni EksAddon.

Note that it can take a minute or two for a mapping from a domain name to an endpoint to be registered. Run:

dig A YOUR_DOMAIN_NAME +short

to see if your DNS is updated.

Fine Grained Access

OpenFGA adds Fine Grained Access control. This implementation is written in Go.

Kubernetes

Azure and GCP do things differently. It uses a sidecar that leverages Let's Encrypt.

"Uploading and managing TLS secrets can be difficult. In addition, certificates can often come at a significant cost. To help solve this problem, there is a nonprofit called “Let’s Encrypt” running a free Certificate Authority that is API-driven. Since it is API-driven, it is possible to set up a Kubernetes cluster that automatically fetches and installs TLS certificates for you. It can be tricky to set up, but when working, it’s very simple to use. The missing piece is an open source project called cert-manager created by Jetstack, a UK startup, onboarded to the CNCF." - Kubernetes Up & Running 3rd Ed., O'Reilly

Certificates and Challenges

In the context of Kubernetes, the certificate contains the secret name where the final certificate will be stored and a reference to an issuer. The result is a Kubernetes secret containing the actual public and private key for HTTPS.

The challenge proves to the CA that you own the domain. You can either use a HTTP-01 challenge where you host a token on a URL that uses the domain.

Or you use DNS-01 where you ask your DNS provider to host a record containing the token (see below).

Domain Ownership Validation

AWS has Route53 that nicely integrates management of domain names with Kubernetes. That is, you can seemlessly have an EKS ingress assigned an AWS managed domain.

Google, however, have recently sold their domain name arm so you need to persuade them that the domain you own is really yours before they'll point it at your Kubernetes ingress. To do this, run:

gcloud certificate-manager dns-authorizations create ARBITRARY_STRING --domain="YOUR_DOMAIN" --project YOUR_PROJECT

Secrets

So much for security outside the cloud. Here is how you deal with it inside.

You'll need to install secrets-store-csi-driver-provider-aws which is a CRD that runs in Kubernetes and talks to AWS. It allows you to mount the secrets in your container as if they were any other filesystem.

I followed the instructions in the AWS CSI driver (above) but I could just not get Option 1 to work despite a lot of time checking that everything was OK. It's still a mystery but Option 2 worked first time.

Restart a deployment with something like:

kubectl rollout restart deployment polaris

This is especially useful if you've updated the secrets.

Cloud and K8s

I've spent the week setting up HTTPS certificates and domain names for my Azure and GCP K8s clusters.

At one point, the GCP K8s installation just kept hanging with "Still creating...". It turned out that we just didn't allocate enough resources:

But not all hanging Kubernetes deployments are so easy to spot.

Automatically point domain names at K8s pod IPs

AWS is better integrated because you can buy the domain names through Route53. But for Azure and GCP K8s, we did the following.

we bought a domain name from AWS via Route53.
we delegated the nameservers of this domain to Microsoft or Google.
a Kubernetes sidecar starts up and contacts Let's Encrypt's API .
Let's encrypt returns a token
the sidecar encodes with its private key and hosts it (a.k.a Key Authorization) on port 80.
Let's encrypt reads that file and decodes it with the cluster's public key. Now it can grant a certificate.

The sidecar is called ACME (Automatic Certificate Management Environment) and is ephemeral:

$ kubectl get events -A --sort-by=.lastTimestamp | grep -i acme

...

default 45m Normal Started pod/cm-acme-http-solver-6j5xp Started container acmesolver

default 44m Normal Sync ingress/cm-acme-http-solver-lgl5w Scheduled for sync

default 43m Normal Killing pod/cm-acme-http-solver-6j5xp Stopping container acmesolver

the outside world can talk to both a service and an ingress. Which you choose depends on what you want. Use Ingress for HTTPS.

An Ingress always talks to a service. Note that it is a logical abstraction and can have multiple ingresses for one domain. For example:

$ kubectl get ingress -A

NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE

default cm-acme-http-solver-ldflx <none> emryspolarisgcp.click 35.189.87.151 80 14m

default polaris-ingress nginx emryspolarisgcp.click 35.189.87.151 80, 443 14m

is saying the emryspolarisgcp.click can point to different services depending on its ports. Here ACME is sticking around to finish the creation of certificates.

This was for GCP where the nameservers were incorrect (they changed every time we deployed the managed zones though Terraform). You might want to see them with:

gcloud dns managed-zones describe polaris-zone --project=afon-core

What can go wrong?

Traffic can be swallowed because:

network security groups (or lack of them)
misconfigured ports
selectors not pointing at the correct pods

Incoming traffic goes through the system in this order:

ingress (optional - see above)
service
endpoint
pod

kubectl get ingress shows the name and the ports of the front facing interface

kubectl describe ingress XXX shows the service to which traffic is sent

Note with an nginx load balancer, this service comes before the ingress:

$ dig A emryspolarisazure.click +short

20.108.199.92

$ kubectl get service -A

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE

... 50m

default polaris-internal ClusterIP 10.2.185.168 <none> 8181/TCP 47m

ingress-nginx nginx-ingress-ingress-nginx-controller LoadBalancer 10.2.238.76 20.108.199.92 80:30157/TCP,443:32110/TCP 47m

ingress-nginx nginx-ingress-ingress-nginx-controller-admission ClusterIP 10.2.30.150 <none> 443/TCP 47m

...

Debugging

If in doubt, port forward:

kubectl port-forward svc/polaris-internal 8080:8181 -n default

This will at least establish that the communication between your service and application is fine.

It's important to check that the firewall is at least expecting a connection for that IP address and port. Don't use curl for this as it is subject to network security rules and certificates being in place. So, run:

nc -zv 20.108.199.92 80

if you want to make sure that port is open as firewalls allow a TCP three way handshake even if the Network Security Group blacks further traffic.

Upon setting up the stack with tofu, the certificate doesn't look healthy and I can't access my site via HTTP.

$ kubectl get certificate polaris-tls

NAME READY SECRET AGE

polaris-tls False polaris-tls 28m

$ kubectl get challenges -A

NAMESPACE NAME STATE DOMAIN AGE

default polaris-tls-1-648081749-304604175 invalid emryspolarisazure.click 36m

$ kubectl describe certificate polaris-tls

...

Events:

Type Reason Age From Message

---- ------ ---- ---- -------

Normal Issuing 23m cert-manager-certificates-trigger Issuing certificate as Secret does not exist

Normal Generated 23m cert-manager-certificates-key-manager Stored new private key in temporary Secret resource "polaris-tls-rw6t9"

Normal Requested 23m cert-manager-certificates-request-manager Created new CertificateRequest resource "polaris-tls-1"

Warning Failed 21m cert-manager-certificates-issuing The certificate request has failed to complete and will be retried: Failed to wait for order resource "polaris-tls-1-648081749" to become ready: order is in "invalid" state:

$ kubectl describe challenge -A

...

Events:

Type Reason Age From Message

---- ------ ---- ---- -------

Normal Started 37m cert-manager-challenges Challenge scheduled for processing

Normal Presented 37m cert-manager-challenges Presented challenge using HTTP-01 challenge mechanism

Warning Failed 35m cert-manager-challenges Accepting challenge authorization failed: acme: authorization error for emryspolarisazure.click: 400 urn:ietf:params:acme:error:connection: 51.132.211.134: Fetching http://emryspolarisazure.click/.well-known/acme-challenge/5yGd57VQUrjc2ns-Q-VEVIl3vl6WKFK4B2fQu643_TM: Timeout during connect (likely firewall problem)

Running:

kubectl delete certificate polaris-tls

did the trick as it forces the certificate to renew. Watch and wait for it to be ready with:

kubectl get certificate polaris-tls -w

You need to run this (or put the equivalent in your Terraform file):

kubectl annotate service nginx-ingress-ingress-nginx-controller -n ingress-nginx "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path=/healthz"

and that should now all work. A happy system should look like:

$ kubectl describe certificate polaris-tls

...

Events:

Type Reason Age From Message

---- ------ ---- ---- -------

Normal Issuing 42m cert-manager-certificates-trigger Issuing certificate as Secret does not exist

Normal Generated 42m cert-manager-certificates-key-manager Stored new private key in temporary Secret resource "polaris-tls-thlvj"

Normal Requested 42m cert-manager-certificates-request-manager Created new CertificateRequest resource "polaris-tls-1"

Normal Issuing 40m cert-manager-certificates-issuing The certificate has been successfully issued

Thursday, March 5, 2026

Gradle cheat sheet

Some miscellaneous notes I found useful.

Building

Build with neither tests nor RAT:

gradle build -x rat -x test

List the projects with:

./gradlew projects

Viewing Dependencies

If you want to see the dependencies of a project, use:

gradlew :PROJECT_NAME:dependencies [--configuration runtimeClasspath|compileClasspath]

where PROJECT_NAME is what you get from listing the projects.

To examine a particular dependency:

./gradlew :polaris-server:dependencyInsight --dependency protobuf-java --configuration runtimeClasspath

Defining Dependencies

Typically, you'll find a libs.versions.toml that defines libraries but does not by itself include them. What Gradle does do is autogenerate a Java class that becomes the lib object in your build.gradle.kts files and you can reference the entities in the .toml file as Java code.

Tests

If you want to run a single test:

gradlew :PROJECT_NAME:test --tests FQN

Tasks

View tasks for a particular project with, for example:

./gradlew :polaris-config-docs-site:tasks

Thursday, February 19, 2026

An unruly Terraform

If the Terraform state is out of synch with reality, you might need to change that state manually with something like:

tofu state list

followed by

tofu state rm XXX

I had to delete load balancers manually through the AWS Web Console and then also the EKS instance. I then had to manually delete any references to them from my JSON.

Tip: regularly delete the directory in which the Terraform lives as state gets kept there that the next run implicitly relies upon. The consequence if you don't is that after a major refactor, you run the configuration and everything looks fine. You check in thinking you've done a good job but there was an invisible dependency on the previous running and checking out to a fresh directory fails. So:

Delete all files regularly

I was getting lots of:

│ Error: Get "https://21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com/api/v1/namespaces/default/services/jupyter-lb": dial tcp: lookup 21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com on 127.0.0.1:53: no such host

│

even after blatting my Terrafrom cdktf.out/stacks directory. Turns out state files were accumulating in the root directory of my project (which contained cdktf.out). Once they too were blatted, things looked better.

Changing the cdktf.out.json file resulted in:

│ Error: Inconsistent dependency lock file

│

│ The following dependency selections recorded in the lock file are inconsistent with the current configuration:

│ - provider registry.opentofu.org/hashicorp/helm: required by this configuration but no version is selected

│

│ To update the locked dependency selections to match a changed configuration, run:

│ tofu init -upgrade

The solution was to run tofu init -upgrade

GCP

You might see this error when running Terraform on GCP:

│ Error: Error setting access_token

│

│ with data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C,

│ on cdk.tf.json line 25, in data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C:

│ 25: }

│

│ oauth2: "invalid_grant" "reauth related error (invalid_rapt)" "https://support.google.com/a/answer/9368756"

It's nothing really to do with TF but rather your GCP credentials. Login with gcloud auth application-default login and try again. D'oh.

AWS

aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=$VPC --region $REGION

aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-security-groups --filters Name=vpc-id,Values=$VPC --region $REGION

This last one showed 3 security groups.

The reason that these AWS entities lingered is because my tofu destroy was always hanging. And the reason it never finished is that there were finalizers that prevented it. To avoid this, I needed to run:

kubectl patch installation default -p '{"metadata":{"finalizers":[]}}' --type=merge

kubectl patch service YOUR_LOAD_BALANCER -p '{"metadata":{"finalizers":null}}' --type=merge

Also, CRDs need to be destroyed:

for CRD in $(kubectl get crds | awk '{print $1}') ; do {

kubectl patch crd $CRD --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

kubectl delete crd $CRD --force

} done

I would then run these scripts as a local-exec provisioned in a resource.

I asked on the DevOps Discord server how normal this was:

PhillHenry
I'm using Terraform to manage my AWS stack that (amongst other things) creates a load balancer using an aws-load-balancer-controller. I'm finding destroying the stack just hangs then times out after 20 minutes.

I've had to introduce bash scripts that patch finalizers in services and installations plus force delete CRDs. Finally, tofu detroy cleans everything up but I can't help feeling I'm doing it all wrong by having to add hacks.

Is this normal? If not, can somebody point me in the right direction over what I'm doing wrong?

snuufix
It is normal with buggy providers, it's just sad that even AWS is one.

It appears I am not the only one:

The_Ketchup, CJO
This is mainly for my homelab to teardown when Im done for the day. So when the aws ingress controller makes an LB via K8s, terraform doesnt know about it so I have to manually go in and delete it in the aws console. Its not very clean. So I was thinking maybe if its managed under argocd it will know about it and delete it? Idk its kinda confusing. Maybe I jsut do kubectl delete ingress --all or something and THEN do terraform destroy?
Cuz right now it just wont delete my subnets since theres an LB in there when I do terraform destroy

Darkwind The Dark Duck
U could use AWS Nuke to clean anything remaining 😄

Redeploying

Redeploying a component was simply a matter of running:

tofu apply -replace=kubernetes_manifest.sparkConnectManifest --auto-approve

This is a great way to redeploy just my Spark Connect pod when I've changed the config:

Helm

If you want to find out what version of a Helm chart you're using when you forget to set it, this might help. It's where Helm caches the charts it downloads.

$ ls -ltr ~/.cache/helm/repository/

...

-rw-r--r-- 1 henryp henryp 107929 Nov 12 10:41 spark-kubernetes-operator-1.3.0.tgz

-rw-r--r-- 1 henryp henryp 317214 Nov 20 15:56 eks-index.yaml

-rw-r--r-- 1 henryp henryp 433 Nov 20 15:56 eks-charts.txt

-rw-r--r-- 1 henryp henryp 36607 Nov 24 09:15 aws-load-balancer-controller-1.15.0.tgz

-rw-r--r-- 1 henryp henryp 493108 Dec 11 14:47 kube-prometheus-stack-51.8.0.tgz

-rw-r--r-- 1 henryp henryp 38337 Dec 15 12:20 aws-load-balancer-controller-1.16.0.tgz

...

More Kubernetes notes

Networking

This was an interesting online conversation about how network packets set to the cluster's IP address are redirected to a pod in Kubernetes cluster.

Each pod has its own IP which is managed by the Container Networking Interface (CNI). Every node runs a kube-proxy which manages how a cluster IPs map to pod IPs. Pod IPs are updated dynamically and only include pods passing their health check.
The node receiving the TCP request does forward to the destination pod, but the mechanisms sort of depend on the CNI. In cloud environments like AWS and GCP, the CNI just sends it directly out on to the network and the network itself knows the pod IPs and takes care of it. Those are so-called VPC Native networking.

Some CNIs have no knowledge of the existing network, they run an overlay inside the cluster that manages the transport and typically that's done with IP encapsulation and sending the encapsulated packet to the destination node.

In VPC Native networking, your node just sends packets to the destination pod like a regular packet. The pods are fully understood and routable by the network itself.

It works differently on-prem. It depends on your CNI. In an on-prem network, using most other CNIs, including microk8s which uses Calico, the network doesn't know anything about pod IPs. Calico sets up an overlay network which mimic a separate network to handle pod-to-pod communication.
In VPC Native networking, things that are outside your kubernetes cluster can communicate directly with k8s pods. GCP actually supports this, while AWS uses security groups to block this by default (but you can enable it). in overlay CNIs like Calico or Flannel, you have to be inside the cluster to talk to pods in the cluster.

[hubt - Discord]

Debugging

This proved very useful when a pod suddenly died:

kubectl get events -A --sort-by=.lastTimestamp | grep -i POD_NAME

Turned out:

The node was low on resource: ephemeral-storage. Threshold quantity: 2139512454, available: 1826840Ki. Container spark-kubernetes-driver was using 1551680Ki, request is 0, has larger consumption of ephemeral-storage.

This is about the best way to see what happened to a pod once it dies.

Saturday, February 14, 2026

GPU Programming pt 1.

This is a nice Java implementation of various LLMs that can also defer to TornadoVM to process the maths on a GPU. Looking at the code gives a good idea of the impedance mismatch between CPU and GPU programming as the Java code covers both. Here are some miscellaneous notes.

GPU Terminology

In TornadoVM's KernelContext you'll see various references to Global, Group and Local IDs.

The global ID is the overall ID of the thread. Note that it can be virtualized - that is, it can be larger than the number of threads your hardware supports.

The group ID refers to a logical grouping of threads. Note that a warp is a physical (ie, hardware dependent) grouping of threads. Work groups are made of integer multiples of warps. Warps always have 32 threads in NVidia hardware and process in lockstep. Work groups however can execute their warps asynchronously.

If a warp hits an if/else statement, the branches are executed sequentially and you lose parallelism!

Threads in a work group can share memory. The local ID is the ID of a thread within that group.

GPU algorithms

Writing algorithms is different in the world of GPUs. This Java/TornadoVM code that is turned into GPU code:

context.localBarrier();

for (int stride = localWorkGroupSize / 2; stride > 0; stride >>= 1) {

if (localId < stride) {

localSum[localId] += localSum[localId + stride];

}

context.localBarrier();

}

is actually reducing an array to an int using the GPUs threads. It first with half the warf of 32 threads. Starting with 16 threads, then 8, then 4 etc each thread takes 2 elements in the array and adds them. Only half the number of threads are needed on the next iteration and so on. All the other threads are "masked", that is, not used.

Mapping TornadoVM to GPU concepts

There is a line in fusedQKVMatmulX that basically says:

if (context.globalId < input_array_size) ...

Yeah, but what if the maximum globalId (the actual thread ID) is much lower than the array size? Do we ignore the rest of the array?

The answer is no because globalId is a virtual ID and does not represent the physical limits of your hardware. As it happens, my RTX 4070 (Laptop) has 4608 CUDA cores whereas the model I am running (Llama-3.2-1B-Instruct-F16) has a hidden size of 4096 so it seems like it all fits into memory without resorting to virtualization tricks.

The functions above don't generally have loops in them. The reason is that the loop is implicit. Each GPU thread calls the function.

Graal- and TornadoVM

Note that TornadoVM heavily relies on GraalVM. If you look at the stack, you'll see the code in PTXCompiler.emitFrontEnd appears to exploit GraalVM's ability to examine the bytecode of the functions mentioned above. It does this so it can convert them into CUDA code.

Consequently, you'll never see any breakpoints hit in these TransformerComputeKernelsLayered functions.

Agile Java Man

Saturday, March 14, 2026

What happens in an LLM? (part 1)

Friday, March 6, 2026

Permissions and Lakes

Cloud and K8s

Thursday, March 5, 2026

Gradle cheat sheet

Thursday, February 19, 2026

An unruly Terraform

More Kubernetes notes

Saturday, February 14, 2026

GPU Programming pt 1.

Blog Archive

About Me