Wednesday, April 16, 2025

Kafka in Kubernetes

If you ask Strimzi to set up a Kafka cluster out of the box, you'll see this when connecting from outside the Kubernetes cluster:

$ kafka-topics.sh --bootstrap-server 10.152.183.163:9092 --list
...
[2025-04-04 16:35:59,464] WARN [AdminClient clientId=adminclient-1] Error connecting to node my-cluster-dual-role-0.my-cluster-kafka-brokers.kafka.svc:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
java.net.UnknownHostException: my-cluster-dual-role-0.my-cluster-kafka-brokers.kafka.svc
...

Where that IP address comes from:

$ kubectl get service -A
NAMESPACE     NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE...
kafka         my-cluster-kafka-bootstrap            ClusterIP   10.152.183.163   <none>        9091/TCP,9092/TCP,9093/TCP                     21d

It appears that Strimzi does not expose external Kafka ports by default. So, add an external port with:

kubectl get kafka my-cluster -n kafka -o yaml > /tmp/kafka.yaml

then edit /tmp/kafka.yaml adding:

    - name: external
      port: 32092
      tls: false
      type: nodeport

in the spec/kafka/listeners block and apply it with:

kubectl apply -n kafka -f /tmp/kafka.yaml

Now I can see:

$ kubectl get svc -n kafka
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE   SELECTOR
...
my-cluster-kafka-bootstrap            ClusterIP   10.152.183.163   <none>        9091/TCP,9092/TCP,9093/TCP                     21d
my-cluster-kafka-external-bootstrap   NodePort    10.152.183.27    <none>        32092:32039/TCP                                11d

It appears that Strimzi has created a new service for us - hurrah! 

However, making a call to Kafka still fails, and this is down to the very architecture of Kubernetes. I am indeed reaching a Kafka broker inside Kubernetes, but the broker then refers me on (via the advertised listener in the metadata it returns) to another domain name, my-cluster-dual-role-0.my-cluster-kafka-brokers.kafka.svc. The host knows nothing about this internal Kubernetes domain name. Incidentally, the same thing happens with Kafka in a pure Docker configuration.

Kubernetes pods resolve their domain names using the internal DNS.

$ kubectl exec -it my-cluster-dual-role-1 -n kafka -- cat /etc/resolv.conf
Defaulted container "kafka" out of: kafka, kafka-init (init)
search kafka.svc.cluster.local svc.cluster.local cluster.local home
nameserver 10.152.183.10
options ndots:5

This nameserver is kube-dns (I'm using MicroK8s):

$ kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   21d

and we can query it from the host:

$ nslookup my-cluster-dual-role-external-0.kafka.svc.cluster.local 10.152.183.10
Server: 10.152.183.10
Address: 10.152.183.10#53

Name: my-cluster-dual-role-external-0.kafka.svc.cluster.local
Address: 10.152.183.17

Now, to get the host to use the Kubernetes DNS for K8s domain names, I had to:

$ sudo apt update
$ sudo apt install dnsmasq
$ sudo vi /etc/dnsmasq.d/k8s.conf

This was a new file and needed to contain:

# Don't clash with systemd-resolved which listens on loopback address 127.0.0.53:
listen-address=127.0.0.1
bind-interfaces
# Rewrite .svc to .svc.cluster.local
address=/.svc/10.152.183.10
server=/svc.cluster.local/10.152.183.10

That listen-address line was because sudo ss -ulpn | grep :53 showed both dnsmasq and systemd-resolved were fighting over the same port.

I also had to add:

[Resolve]
DNS=127.0.0.1
FallbackDNS=8.8.8.8
Domains=~svc.cluster.local

to /etc/systemd/resolved.conf to tell systemd-resolved to defer to dnsmasq for domains ending in svc.cluster.local. Finally, restart the services:

$ sudo systemctl restart systemd-resolved
$ sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
$ sudo systemctl restart dnsmasq

Now let's use that external port we configured at the top of the post:

$ kubectl get svc -n kafka
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE
...
my-cluster-kafka-external-bootstrap   NodePort    10.152.183.27    <none>        32092:32039/TCP          
$ ./kafka-topics.sh --bootstrap-server 10.152.183.27:32092 --create --topic my-new-topic  --partitions 3  --replication-factor 2
Created topic my-new-topic.
$ ./kafka-topics.sh --bootstrap-server 10.152.183.27:32092 --list
my-new-topic

Banzai!

Tuesday, April 15, 2025

What is a (local) LLM?

The short answer: just a bunch of matrices and config files.

The long answer: if you download an LLM from, say, HuggingFace, you'll see a directory structure a little like this:

$ tree ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
├── refs
│   └── main
└── snapshots
    └── 5b8c30c51e0641d36a2f9007c88425c93db0cb07
        ├── config.json
        ├── generation_config.json
        ├── model-00001-of-00004.safetensors
        ├── model-00002-of-00004.safetensors
        ├── model-00003-of-00004.safetensors
        ├── model-00004-of-00004.safetensors
        ├── model.safetensors.index.json
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── tokenizer.json

[slightly modified for clarity].

You can see the actual matrices with something like:

from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("mlabonne/TwinLlama-3.1-8B")
embedding_layer = model.get_input_embeddings()  # This contains token ID → vector mappings
embedding_weights = embedding_layer.weight  # Shape: (vocab_size, embedding_dim)
print(embedding_weights.size())

Output:

torch.Size([128256, 4096])

But if you want to get even more raw, you can just read the files directly:

from safetensors.torch import load_file

state_dict = load_file("model-00001-of-00004.safetensors")
embeddings = state_dict['model.embed_tokens.weight']
print(embeddings.size())

Output:

torch.Size([128256, 4096])

That first dimension is the size of the vocabulary, which we can confirm with:

$ grep -r 128256 ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
...
config.json:  "vocab_size": 128256 

You can see what the actual tokens (note: not words) are in tokenizer.json.
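For instance, you can load that tokenizer and watch it split text into sub-word tokens (a sketch; the exact IDs depend entirely on the tokenizer):

from transformers import AutoTokenizer

# Loads tokenizer.json, tokenizer_config.json etc. from the snapshot directory
tokenizer = AutoTokenizer.from_pretrained("mlabonne/TwinLlama-3.1-8B")

ids = tokenizer.encode("Kafka in Kubernetes", add_special_tokens=False)
print(ids)                                   # a handful of integers, each < vocab_size (128256)
print(tokenizer.convert_ids_to_tokens(ids))  # sub-word tokens, not whole words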

File formats

Safetensors is a Hugging Face format for storing tensors. The "safe" part refers to the security implications of PyTorch's .pt and .bin formats, which are pickled and can therefore carry executable code inside the files.

GGUF is the file format used by llama.cpp and compatible runtimes. It stores everything (both weights and metadata) in one file, is well suited to CPU inference and supports quantization.

You can convert from Safe Tensors to GGUF by running something like this from the llama.cpp repo:

python  convert_hf_to_gguf.py --outtype f16  --outfile /home/henryp/Temp/models--mlabonne--TwinLlama-3.1-8B-DPO.gguf /home/henryp/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B-DPO/snapshots/4a76d14414118c00fbbaed96cf1dd313a10f1409/

That directory is wherever the model's config.json lives.

Tensor files store their weights in various formats. FP16 and BF16 are both 16-bit encodings with the difference being which bits they allocate to the mantissa and exponent. BF16 has a larger range (same as FP32) but lower precision. FP16 is the traditional format for GPUs with more modern GPUs now also supporting BF16. 

You can even just use integers like INT8 and Q4_0, the latter being a 4-bit integer with the _0 referring to which technique was used to quantize it (e.g. scaling). These different strategies trade off accuracy, speed, complexity etc.
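A quick PyTorch sketch of the FP16/BF16 trade-off (illustrative only):

import torch

x = torch.tensor([3.141592653589793])
print(x.to(torch.float16))    # FP16: 1 sign, 5 exponent, 10 mantissa bits
print(x.to(torch.bfloat16))   # BF16: 1 sign, 8 exponent, 7 mantissa bits

print(torch.finfo(torch.float16).max)   # ~65504: small range, more precision
print(torch.finfo(torch.bfloat16).max)  # ~3.4e38: same range as FP32, less precision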

The model

There's a nice Java implementation of Llama.cpp here. It's very nicely written and even allows you to compile the Java code to a native binary (use the latest GraalVM). This class represents the weights that are loaded from disk. In the case of a Llama model, the GGUF file is read from disk, parsed and simply becomes an object of this class.

Monday, April 14, 2025

Interpreting GLMs

How do you check the quality of your Generalized Linear Models?

You cannot use ROC curves and AUC for linear regression models by their very nature (they don't give yes/no answers). One alternative is the Concordance Index (a.k.a. C-index). It takes pairs of observations where one experienced the outcome and the other didn't, and asks how often the model ranked them correctly, i.e. gave the higher predicted risk to the one with the outcome. You want a value greater than 0.5, but how much greater is (as always) context-dependent. In the social sciences, 0.75 may be great but pants in the hard sciences.
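Here's a minimal sketch of the idea in Python (hypothetical data, and a naive O(n²) loop rather than anything you'd run on a large dataset):

import numpy as np

def concordance_index(predicted_risk, outcome):
    # Compare every pair where one observation had the outcome and the other didn't.
    # Concordant: the model gave the higher risk to the one with the outcome.
    concordant, ties, total = 0, 0, 0
    for i in range(len(outcome)):
        for j in range(len(outcome)):
            if outcome[i] == 1 and outcome[j] == 0:
                total += 1
                if predicted_risk[i] > predicted_risk[j]:
                    concordant += 1
                elif predicted_risk[i] == predicted_risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / total

risk = np.array([0.9, 0.2, 0.7, 0.4])
event = np.array([1, 0, 0, 1])
print(concordance_index(risk, event))  # > 0.5 means better than chance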

The Variance Inflation Factor (VIF) indicates how much each factor's variance is inflated by collinearity with the other factors. Thresholds are quite domain specific, but a value below 10 is generally considered acceptable for prediction and below 5 for inference.
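statsmodels will compute VIFs for you; a sketch with made-up predictors:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; income and years_employed are deliberately correlated
X = pd.DataFrame({"age":            [23, 45, 31, 52, 38, 61],
                  "income":         [20, 50, 35, 70, 42, 80],
                  "years_employed": [ 1, 20,  8, 30, 15, 35]})
X = sm.add_constant(X)

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))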

Because of the mathematics behind them, the R-squared used in ordinary least squares does not apply to a general GLM. Instead, a pseudo R-squared is used, which tends to have a lower value. It is not recommended as a goodness-of-fit metric for GLMs.

Poisson and binomial models commonly have high deviance; lower deviance indicates a better fit. The ratio of deviance to the degrees of freedom (the number of observations minus the number of parameters) is a better guide to goodness of fit. Values moderately above 1.0 indicate overdispersion, where the variance in the data is greater than the model assumes. This matters less for the negative binomial. Underdispersion suggests the data has clusters.

The alpha on a negative binomial should be about 1.0. This is the default in StatsModels. More than that indicates overdispersion.

Another metric is the ratio of the Pearson chi-squared statistic to the residual degrees of freedom. Depending on your tolerance, this should typically sit somewhere between roughly 0.5-0.8 at the low end and 1.2-2.0 at the high end. Again, high values indicate overdispersion, low values underdispersion.
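With a fitted statsmodels GLM, both ratios are one-liners; a sketch on synthetic Poisson data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(lam=np.exp(X @ np.array([0.3, 0.5, -0.2])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.deviance / fit.df_resid)      # deviance / residual degrees of freedom
print(fit.pearson_chi2 / fit.df_resid)  # Pearson chi-squared / residual degrees of freedom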

Chi-squared can tell you there is a relationship between categorical variables, but not how strong it is. For that, use Cramér's V, which gives a value between 0.0 and 1.0. Interpretation is domain specific, but relationships generally start to look worth a suspicious glance from around 0.3 upwards.
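A sketch of Cramér's V using scipy and a made-up contingency table:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts for two categorical variables (2x2 contingency table)
table = np.array([[30, 10],
                  [15, 45]])

chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(cramers_v)  # between 0.0 and 1.0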

Wednesday, April 2, 2025

Storage on Kubernetes, Part 2

Having opted to install MicroK8s and its storage and network addons, I then tried to add another node. Adding another node is still proving problematic, but storage was much easier than on raw Kubernetes.

To make kubectl and Helm talk to MicroK8s, run this:

alias kubectl='microk8s.kubectl'
export KUBECONFIG=/var/snap/microk8s/current/credentials/client.config

And install Strimzi as I outline in a previous post.

When I tried to view the logs of the strimzi-cluster-operator, it errored with "tls: failed to verify certificate: x509: certificate is valid for" ... I still haven't worked this out, but this GitHub ticket helped me. Apparently, I needed to call kubectl --insecure-skip-tls-verify-backend=true logs... for reasons I don't entirely understand yet. [Addendum: it looks like I needed to run sudo microk8s refresh-certs --cert ca.crt, as the certificates I was using to connect to the containers were apparently out of date.]

Anyway, when I deployed my slightly modified version of kafka-with-dual-role-nodes.yaml, changing the storage type from jbod [Just a Bunch of Disks] to persistent-claim, the Strimzi operator was puking "The size is mandatory for a persistent-claim storage". Looking at the code, it seems Strimzi is expecting a child config node of size.

Now, I just needed some YAML to create a MicroK8s PVC (which I stole from SO), and my 3-node Kafka cluster appears to be up and running, writing its data to the host machine's /var/snap/microk8s/common/default-storage path.

Gotcha!

I took my laptop to a coffee shop and worked there and everything was broken again! Kubelite was puking with:

$ journalctl -f -u snap.microk8s.daemon-kubelite -f --no-hostname --no-pager

...
Apr 02 11:15:15 microk8s.daemon-kubelite[309439]: W0402 11:15:15.083568  309439 logging.go:55] [core] [Channel #5 SubChannel #6]grpc: addrConn.createTransport failed to connect to {Addr: "unix:///var/snap/microk8s/7964/var/kubernetes/backend/kine.sock:12379", ServerName: "kine.sock:12379", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/snap/microk8s/7964/var/kubernetes/backend/kine.sock:12379: connect: connection refused"

kine acts as an in-process or standalone service that translates Kubernetes API interactions into operations on the dqlite database. It essentially makes dqlite behave like etcd from the Kubernetes API server's perspective.

Debugging gives:

$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
...

but this is a lie! Kubelite (and daemon-k8s-dqlite) say they're running but this is not the whole story. They're constantly restarting because there is a problem.

$ journalctl -f -u snap.microk8s.daemon-k8s-dqlite.service -f --no-hostname --no-pager
...
Apr 02 10:57:50 microk8s.daemon-k8s-dqlite[274391]: time="2025-04-02T10:57:50+01:00" level=fatal msg="Failed to create server" error="failed to create dqlite app: listen to 192.168.1.253:19001: listen tcp 192.168.1.253:19001: bind: cannot assign requested address"
...

That's not my current IP address! That's my address back in the office! I've rebooted my laptop since being connected to that network so I don't know where it's getting it from.

I could change the files in /var/snap/microk8s/current/ but a return to the office made everything work again.

Logging tips

See what a service is doing with, for example:

journalctl -u snap.microk8s.daemon-cluster-agent

and check a service's status with, for example:

systemctl status snap.microk8s.daemon-kubelite


Friday, March 28, 2025

Iceberg Tidbits

Some miscellaneous Iceberg notes.

Cleaning up

There is a logic to this. Russell Spitzer of the Apache team says:
  • run rewrite first
  • expire second
  • you don't need to run remove_orphans unless something broke [Discord]
Here is a quick BDD I wrote that illustrates what's going on. Basically: 

  1. rewrite_data_files puts all the data into as few files as possible. 
  2. expire_snapshots then deletes any files that are surplus.
Fewer files means less IO and a more efficient query.
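Roughly, those two steps look like this from PySpark (a sketch: the catalog and table names are made up, and both procedures take many more optional arguments):

from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is on the classpath and a catalog called "my_catalog" is configured
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .getOrCreate())

# 1. Compact the data into as few files as possible
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

# 2. Expire old snapshots so the files they reference can be deleted
spark.sql("""CALL my_catalog.system.expire_snapshots(
               table => 'db.events',
               older_than => TIMESTAMP '2025-03-01 00:00:00',
               retain_last => 5)""")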

Russell explains the reason we don't use remove_orphans:
To clarify, Remove Orphan Files is only for cleaning up files from failed writes or compactions that failed in a way that the process could not clean up.
Inside of Remove orphan files are 2 expensive remote steps
  • Listing everything in the directories owned by the table
  • Join of that list with metadata file paths
Then the list of files to delete is brought back to the driver where deletes are performed [Discord]
This BDD demonstrates removing orphans through the Java API. I wanted to use CALL system.remove_orphan_files but couldn't. Instead, I got the error:

java.lang.IllegalArgumentException: Cannot remove orphan files with an interval less than 24 hours. Executing this procedure with a short interval may corrupt the table if other operations are happening at the same time. If you are absolutely confident that no concurrent operations will be affected by removing orphan files with such a short interval, you can use the Action API to remove orphan files with an arbitrary interval.
at org.apache.iceberg.spark.procedures.RemoveOrphanFilesProcedure.validateInterval(RemoveOrphanFilesProcedure.java:209)

which is a bit much for a simple BDD.

I forced Iceberg to break by having two threads try to write to the same table at the same time.

Pushdown

This is more Spark than just Iceberg but it's an interesting use case.

The IN clause is not pushed down, at least in Spark 3.5.1 - see some nice analysis here. TL;DR: instead of using IN, the most efficient query converts it into a set of ANDs and ORs. Note that pushdown is (according to Russell Spitzer) pushing the actual function to the data, not mapping it to some other function that is semantically equivalent (as we see in that link).

This filter was not pushed down
Again, I've got a BDD to demonstrate this. Note that if the filters element is empty in a Physical Plan's BatchScan, your function has not been pushed down.
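A quick way to eyeball this yourself from PySpark (a sketch: spark is an existing SparkSession and the table and column names are made up):

from pyspark.sql.functions import col

df = spark.table("db.events")

# IN becomes an isin() filter; explain() shows whether it appears in the BatchScan's filters element
df.filter(col("country").isin("GB", "FR", "DE")).explain()

# The semantically equivalent chain of ORs, for comparison
df.filter((col("country") == "GB") | (col("country") == "FR") | (col("country") == "DE")).explain()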

Monday, March 24, 2025

Hands on LLM training

Here is a collection of tips on tuning LLMs. A lot of these comments are taken from the Unsloth Discord server.

What is Unsloth?

Unsloth is a Python library that allows you to fine tune models without using silly amounts of compute. It does this by using smaller numeric precision.
"Almost everyone here is doing load_in_4bit. You load the model as a (dynamically) quantized 4bit version, then create an (uninitialized) bf16 QLoRA train said QLoRA merge the QLoRA onto the quantized model, generating an fp16 or bf16 model (up-quantized). [Optionally,] quantize it back down to your desired level" [Discord]
"LORAs are always fp16" [Discord]

AWQ (Activation-aware Weight Quantization) "preserves a small fraction of the weights that are important for LLM performance to compress a model to 4-bits" [HuggingFace].

Models

What size model can I fit into my GPU's VRAM?
A "safe bet would be to divide by 1.5 - 2x your total available memory (ram + vram) to get the B parameters vram / 1.5 or 2 to get the active params, so for example, I have 8gb vram i can load a 8b 4bit quantized model directly onto my gpu because of its relatively small approx 4-5gb vram footprint." [Discord]
This sentiment seems popular at the moment:

Mistral Small 2501

Hyperparameters

You want a rank "that is large enough to imitate the writing style and copy the knowledge from our instruction samples. You can increase this value to 64 or 128 if your results are underwhelming."

"The importance of the reference model is directly controlled via a beta parameter between 0 and 1. The reference model is ignored when beta is equal to 0, which means that the trained model can be very different from the SFT one. In practice, a value of 0.1 is the most popular one, but this can be tweaked" 
LLM Engineering Handbook

In the event that you're not "able to achieve consistently accurate answers to my questions... the model's responses are sometimes correct and sometimes incorrect, even for questions directly from the fine-tuning dataset", the advice is:
"You may want to up the rank and the alpha
Rank = how many weights it effects. Alpha is how strong they are effected. PEFT only effects the last few layers" [Discord]
(PEFT is Parameter Efficient Fine Tuning. LORA is just one of these methodologies.)
"Alpha should at least equal to the rank number, and rank should be bigger for smaller models/more complex datasets; it usually is between 4 and 64." [Discord]
Batch size takes up a lot of VRAM, so if you are hitting OOMs, choose smaller batches. This will mean training takes longer. To counter this, you can increase the number of gradient accumulation steps. This in effect batches the batches: gradients accumulate over several forward/backward passes and the weight update is applied less often.
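With the Hugging Face Trainer, smaller batches plus gradient accumulation is a two-argument change; a sketch (values are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,   # small batches to avoid OOMs
    gradient_accumulation_steps=8,   # effective batch size of 16; weights update once per 8 batches
)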

Data

Data makes or breaks an LLM. An old hand who goes by the name MrDragonFox on Discord says:
"Most problems are really in the data it self has to be prepped. 80% of the work is there" [Discord]
"You will need good data. ML 101: garbage in, garbage out. That is a science it it self." [Discord]
"Over 80% of the moat is in the data - once you have that - you can start with experimenting" [Discord]
It's a view echoed in LLM Engineer's Handbook: "In most use cases, creating an instruction dataset is the most difficult part of the fine-tuning process."

Cleaning the data is essential:
"Avoid abbreviations if you can. As a general rule, you shouldn't expect [a] model to understand and be able to do a task a human with the same context would be unable to accomplish" - etrotta, Discord
But quantity is not quality. In general, "if I have 30k samples (user-assistant), will finetuning a base or instruct model result in a strictly better model?"
"No. Fine tuning can worsen the model's performance if your data is bad or poorly formatted. Even if the data is good, it could cause the model to 'forget' things outside of the dataset you used to fine tune it (although that is a bit less likely with LoRA). Fine tuning will make the outputs closer to whatever you trained it on. For it to make a better model, "whatever you trained it on" must be better than what it's currently responding and follow the same structure you will use later." [Discord]
Overfitting and underfitting

"What causes loss to make a staircase pattern when hitting a new epoch?"
"It's very normal, it's overfitting" [Discord]

"Underfitting is probably seen more often as a common phenomenon where a low rank model fails to generalize due to a lack of learnable params
"Generally speaking you are looking for a smooth descent of loss and a gradual convergence of around 0.5. Making data more diverse and novel compared to what the model has seen already is good to combat overfitting and reducing LR/epochs can help" [Discord]
Increasing rank and alpha should be included here.

Apparently this does not apply to RSLoRA (Rank-Stabilized LoRA), which addresses the training instability and performance degradation that can arise in low-rank adaptation methods when fine-tuning on complex datasets.
"I think for RSLoRA, alpha is sqrt(hidden_size), not sqrt(r) as claimed in the blog post. You can have whatever LoRA size (rank) you want for GRPO, with or without RS. But note that for GRPO it's usually a good idea to have max_grad_norm=0.1 (or some other low value) because GRPO models tend to go insane if you overcook them. It's always either a=r or a=sqrt(hidden_size) if use_rslora=True"
If a fine-tuned model doesn't zero-shot a relatively simple question it was trained on, you "might need more examples or deeper rank" [Discord].

Wednesday, March 19, 2025

Storage on Kubernetes Part 1

... is not easy. And it appears I'm not the only one to think so.
"Storage is hard. Fast, reliable storage is really hard" - Hubt on Discord
These are some notes I made while trying to set up a Kubernetes home lab on spare hardware.
 
K8s approach to Storage

Once more, K8s delegates the work, this time to storage drivers.

"Applications that require persistent storage would fail to start because the cluster doesn’t yet have a storage driver. Like network drivers, several storage drivers are available for Kubernetes. The Container Storage Interface (CSI) provides the standard that storage drivers need to meet to work with Kubernetes. We’ll use Longhorn, a storage driver from Rancher; it’s easy to install and doesn’t require any underlying hard­ ware like extra block devices or access to cloud­based storage." [1] 

A prerequisite for Longhorn is that I need to run this on my boxes:

sudo apt install -y apt-transport-https open-iscsi nfs-common
sudo systemctl enable --now iscsid

Again, I need to allow all connections between my nodes, so time to fiddle with the firewall.

After installing Longhorn, running this reported that my replica count could not be satisfied:

kubectl -n longhorn-system logs -l app=longhorn-manager

with error "No available disk candidates to create a new replica of size 10737418240". Patching did not seem to help:

kubectl patch setting default-replica-count -n longhorn-system --type='merge' -p '{"value": "1"}'

Neither did:

kubectl edit storageclass longhorn

to edit the numberOfReplicas

(Note that Longhorn comes with a GUI that you can see if you port forward with:

kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

but this didn't help either).

So, instead, I downloaded the YAML, edited the numberOfReplicas by hand and deployed to the cluster.

Unfortunately, when I deleted my kafka and longhorn-service namespaces, the command would not terminate. It seemed that the kafka PVCs depended on the PVs that used Longhorn. 

Cyclic dependencies
I finally managed to kill the longhorn-system namespace, which was stuck in Terminating, by manually deleting the PVs with kubectl edit and 

kubectl get namespace longhorn-system -o json > ns.json

deleting the finalizers in ns.json by hand and running:

kubectl replace --raw "/api/v1/namespaces/longhorn-system/finalize" -f ns.json

For the PVCs, I had to do the same thing, but since they depended on Longhorn webhooks, I needed to delete those first with:

kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
kubectl delete mutatingwebhookconfiguration <webhook-name>
kubectl delete validatingwebhookconfiguration <webhook-name>

Finally, 

kubectl get all -n <namespace>
kubectl get pvc -n <namespace>

indicated that everything had been cleaned up.

Phew! But now I'm back where I started and this was a lot of work (albeit a great way to understand Kubernetes). 

I then deployed Longhorn again only to have it complain "failed to get backup target default: backuptarget.longhorn.io". Oh, boy.
"I like microk8s for having everything out of the box mostly running and turnable on with some little effort. Except metallb 😛" Darkwind on Discord
MicroK8s is a lightweight Kubernetes implementation that is great for CI/CD and (apparently) just works out-of-the-box. I might just install that...

[1] Book of Kubernetes

Tuesday, March 11, 2025

Cloud Maintenance

Although it's often referred to as "infrastructure as code", there is very little code in what most people call DevOps. It's mostly markup. This can cause maintenance issues. There are, however, ways of dealing with this situation.

CDK8s is the Cloud Development Kit for Kubernetes. It has TypeScript, Java, Python and Go implementations.
"I wouldn’t head down the Helm path for that before I took a long look at CDK8s.  Helm templates are a nightmare for debugging in my experience. Instead having a real programming language backing up the templates is so much better...  It renders the YAML, it does not apply it. I use it with ArgoCD as my deployment mechanism. So I take the resulting YAML and check it into git for ArgoCD to apply to the cluster.  Execution of the CDK8s code and check into git is automatic as part of the CI jobs." [Reddit]
CDK8s is a templating framework that allows you to build K8s config files in a number of languages, including Java.

"KCL is an open-source, constraint-based record and functional programming language. It leverages mature programming language technology and practices to facilitate the writing of many complex configurations. KCL is designed to improve modularity, scalability, and stability around configuration, simplify logic writing, speed up automation, and create a thriving extension ecosystem." [DevOps.dev]. Its config is made up of schema, lambdas and rules [home] that constrain not just the structure of the document but also the values.

KCL is a new language, while CDK8s leverages popular languages that already exist.

Saturday, March 8, 2025

Automated documentation

I've never seen anybody go back to fixing documentation that goes out of date - but then I've only been software engineering for 30 years. Since nobody ever fixes it, a better strategy is to automate it.

To this end, there is OpenAPI (formerly Swagger) for documenting REST API endpoints. There are tools that generate the OpenAPI config files from code in Python, Java, Scala and others. You can also go the other way and generate code from the OpenAPI config files (example for Python here). 

An example can be found in the Apache Polaris codebase where Gradle builds the classes given a YAML file. IntelliJ quite cleverly recognises it as an OpenAPI file and allows you to test the API by invoking the service through the IDE.

IntelliJ's version of Postman

It will even generate the DTOs as defined in the YAML file.

Databricks' DABs

If you're writing Databricks code in an IDE that is to be run on an ad hoc basis (rather than via some CI/CD pipeline), you might want to use the Databricks VS Code plugin. This will automatically build your Databricks Asset Bundle for you. Upon signing in, a databricks.yml file will be created at the root of your project. It contains the minimum amount of information needed to deploy your code to .bundle in your DBx home directory, under a sub-folder named after the bundle's name field.

You can also deploy bundles via VS Code. Configure a root_path under workspace in databricks.yml and press the cloud button on the BUNDLE RESOURCE EXPLORER pane within the Databricks module:

Upload via the cloud button
the bundle will be uploaded to the workspace and directory specified. You can, of course, use the databricks CLI. But for ad hoc uploads, VS Code is very convenient. By default, deployments are to /Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}.

Configuration can be for development or production mode. The advantage of development mode is that it "turns off any schedules and automatic triggers for jobs and turns on development mode for Delta Live Tables pipelines. This lets developers execute and test their code without impacting production data or resources." [1]

[1] Data Engineering With Databricks, Cookbook.

Tuesday, February 11, 2025

Kafka on Kubernetes notes

I installed Strimzi which keeps a Kafka cluster in line inside a Kubernetes cluster. The instructions seem easy. 

helm repo add strimzi https://strimzi.io/charts/
helm repo update
kubectl delete namespace kafka
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace kafka --create-namespace

then run your configuration of choice:

kubectl apply -f kafka/kraft/kafka-ephemeral.yaml -n kafka

from the strimzi-kafka-operator/examples directory. Note: you'll have to change the cluster name in that YAML from my-cluster to your cluster name (kafka for me).

But I had a helluva job making it work. This post is how I solved the many problems.

What is Strimzi?

First, let's look at the architecture of Kubernetes.

Kubernetes is built on the notion of primitives that are constructed to create the whole. You can extend Kubernetes using a CustomResourceDefinition. A "CRD, allows us to add any new resource type to a cluster dynamically. We simply provide the API server with the name of the new resource type and a specification that’s used for validation, and immediately the API server will allow us to create, read, update, and delete resources of that new type." [1]

Note that CRDs have no actual functionality of their own. For that you need Operators. CRDs are the plug into which Operators fit. They are notified of events on the CRD via an Informer. "The Kubernetes API server provides a declarative API, where the primary actions are to create, read, update, and delete resources in the cluster." [1] That is, you tell K8s what you want and it does its best to achieve that aim. Your Operator is that workhorse.

In our example above, we deploy the CRD with helm install... and invoke its API with kubectl apply -f....

"DaemonSets are used to manage the creation of a particular Pod on all or a selected set of nodes in a cluster. If we configure a DaemonSet to create Pods on all nodes, then if new nodes are added to the cluster, new pods will be created to run on these new nodes." [2] This is useful for system related pods like Flannel, a CNI (Container Network Interface) plugin, which should run on each node in the cluster. Contrast this to ReplicaSets for which a typical use case is managing your application pods.

"The kube-dns Service connects to a DNS server Deployment called Core­DNS that listens for changes to Services in the Kubernetes cluster. CoreDNS updates the DNS server configuration as required to stay up to date with the current cluster configuration." [1]. CoreDNS gave me some issues too (see below).

BTW, you can also look at Cruise Control, an open-source Java project that "helps run Apache Kafka clusters at large scale".

Debugging

And it's Flannel that started crashing for me with strange errors. It was largely through bloody-mindedness that I fixed things. Here are a few things I learned along the way. 

The first is that I needed to start kubeadm with --pod-network-cidr=10.244.0.0/16 [SO] when I was following the instructions in a previous post (apparently, you need another CIDR if you use a different plugin like Calico). This prevents "failed to acquire lease" error messages.

Flannel was still complaining with a "cni0" already has an IP address different from 10.... error message [SO]. It appears that some network config from the previous installation needed to be rolled back. Well,  kubeadm reset does warn you that The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d.

So, to completely expunge all traces of the previous Kubernetes installation, I needed to run something like this script on all boxes:

sudo kubeadm reset && \
sudo rm -rf /etc/cni/net.d && \
sudo ipvsadm --clear && \
sudo systemctl stop kubelet && \
sudo systemctl stop docker && \
sudo systemctl stop containerd.service && \
rm -rf ~/.kube && echo Done!
sudo systemctl restart containerd && sudo systemctl restart kubelet

Bear in mind that this will completely blat your installation so you will need to run sudo kubeadm init... again.

Next, I was getting infinite loops [GitHub coredns] in domain name resolution, which causes the pod to crash with CrashLoopBackOff. This official doc helped me. As I understand it, Kubernetes should not use /etc/resolv.conf as this will forward a resolution on to K8s, which will then look up its nameserver in this file, and so on forever. Running:

sudo echo "resolvConf: /run/systemd/resolve/resolv.conf" >> /etc/kubernetes/kubelet.conf 
sudo systemctl restart kubelet

solved that. Note that /run/systemd/resolve/resolv.conf should contain something like:

nameserver 192.168.1.254
nameserver fe80::REDACTED:d575%2
search home

and no more. If you have more than 3 entries [GitHub], kubelet prints an error but it seems to continue anyway.

Next were "Failed to watch *v1.Namespace" in my coredns pods.

First I tried debugging by deploying a test pod with:

kubectl run -n kafka dns-test --image=busybox --restart=Never -- sleep 3600
kubectl exec -it -n kafka dns-test -- sh

(If you want to get a shell in a pod that has more than one container, add a -c CONTAINER_NAME [SO].)

This confirmed that there was indeed a network issue as it could not contact the api-server either. Note that although BusyBox is convenient, you might prefer "alpine rather than busybox as ... we’ll want to use some DNS commands that require us to install a more full-featured DNS client." [1]

Outside the containers, this worked on the master host:

nslookup kubernetes.default.svc.cluster.local 10.96.0.10

but not on a worker box. It should have, as it's the virtual IP address of CoreDNS (see this by running kubectl get svc -n kube-system).

The problem wasn't Kubernetes config at all but firewalls. Running this on my boxes:

sudo iptables -A INPUT -s IP_ADDRESS_OF_OTHER_BOX -j ACCEPT
sudo iptables -A FORWARD -s IP_ADDRESS_OF_OTHER_BOX  -j ACCEPT

(where IP_ADDRESS_OF_OTHER_BOX is for each box in the cluster) finally allowed my Strimzi Kafka cluster to start and all the other pods seemed happy too. Note there are security implications to these commands as they allow all traffic from IP_ADDRESS_OF_OTHER_BOX.

Notes on logging

To get all the untruncated output of the various tools, you'll need:

kubectl get pods -A  --output=wide
journalctl  --no-pager  -xeu kubelet
systemctl status -l --no-pager  kubelet

And to test connections to the name servers use:

curl -v telnet://10.96.0.10:53

rather than ping as ping may be disabled.

This command shows all events in a given namespace:

kubectl get events -n kube-system

and this will print out something to execute showing the state of all pods:

for POD in $(kubectl get pods -A | awk '{print $2 " -n " $1}' | grep -v NAME) ; do { echo "kubectl describe pod $(echo $POD) | grep -A20 ^Events"  ;   } done

Simple scriptlets, but they helped me, so I'm making a note for future reference.

[1] Book of Kubernetes, Holm
[2] The Kubernetes Workshop

Monday, February 3, 2025

NVidia Drivers on Linux

Whenever I do an upgrade, it generally goes well except for the video drivers. Updating is generally:

sudo apt update
sudo apt upgrade
sudo do-release-upgrade

and addressing any issues interactively along the way (for instance, something had changed my tlp.conf).

However, even after an otherwise successful upgrade, I was having trouble with my monitor. Firstly, the laptop screen was blank while the external one was OK (the upgrade had not installed NVidia drivers); then the other way around; then the colours were odd (driver version 525).

At first, I went on a wild goose chase with missing X11 config. I think the conf file is supposed to be absent, and the advice in this old askubuntu post is outdated. The official NVidia site itself was also not helpful.

I had to blat all NVidia drivers with [LinuxMint]:

sudo apt purge *nvidia*
sudo apt autoremove

I also updated the kernel with [askubuntu]:

sudo update-initramfs -u

and the kernel headers:

sudo apt install linux-headers-$(uname -r)

then (as recommended at askubuntu):

sudo ubuntu-drivers install nvidia:535

Version 535 seems stable and rebooting gave me a working system. YMMV.

I had tried 525 as sometimes the release before last is more stable (or so I read) but no joy. To clean them up, I needed to run:

sudo mv /var/lib/dpkg/info/nvidia-dkms-525.* ~/Temp

as not all versions seem to be stable.

Wednesday, January 29, 2025

Databricks in an ADF pipeline

ADF is a nasty but ubiquitous technology in the Azure space. It's low-code, and that means a maintenance nightmare. What's more, if you try to do anything remotely clever, you'll quickly hit a brick wall.

Fortunately, you can make calls from ADF to Databricks notebooks where you can write code. In our case, this was to grab the indexing SQL from SQL Server and version control it in Git. At the time of writing, there was no Activity in ADF to access Git.

To access Git you need a credential. You don't want to hardcode it into the notebook as anybody who has access to the notebook can see it. So, you store it as a secret with something like:

echo GIT_CREDENTIAL | databricks secrets put-secret YOUR_SCOPE YOUR_KEY
databricks secrets put-acl YOUR_SCOPE YOUR_USER READ

where YOUR_SCOPE is the namespace in which the secret lives (can be anything modulo some reserved strings); YOUR_KEY is the key for retrieval and YOUR_USER is the user onto whom you wish to bestow access.

Now, your notebook can retrieve this information with:

git_credential = dbutils.secrets.get(scope="YOUR_SCOPE", key="YOUR_KEY")

and although others might see this code, the value is only accessible at runtime and only when YOUR_USER is running it.

Next, we want to get an argument that is passed to the Databricks notebook from ADF. You can do this in Python with:

dbutils.widgets.text(MY_KEY, "")
parameter = dbutils.widgets.get(MY_KEY)

where MY_KEY is a base parameter in the calling ADF Notebook Activity, for example:

Passing arguments to a Databricks notebook from ADF

Finally, you pass a return value back to ADF with:

dbutils.notebook.exit(YOUR_RETURN_VALUE)

and use it back in the ADF workflow by referencing @activity("YOUR_NOTEBOOK_ACTIVITY_NAME").output.runOutput - in this case, where we want to run the SQL the notebook returned:

ADF using the value returned from the Databricks notebook

That's how ADF talks to Databricks and how Databricks replies. There's still the small matter of how the notebook invokes Git.

This proved unexpectedly complicated. To check the code out was easy:

import subprocess
print(subprocess.run(["git", "clone", f"https://NHSXEI:{git_credential.strip()}@dev.azure.com/YOUR_GIT_URL/{repo_name}"], capture_output=True))

but running some arbitrary shell commands in that directory from Python was complicated, as I could not cd into the new directory. This is because cd is a shell built-in [SO], not an executable. So instead I used a hack: my Python dynamically generated some shell script, wrote it to a file and executed that file from a static piece of shell script. A bit icky, but it does the job.
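The hack looked roughly like this (a sketch; the repository name and the commands inside the generated script are placeholders, not the actual ones used):

import subprocess

repo_name = "my-repo"  # hypothetical

# Dynamically generate the shell commands we want to run inside the cloned repo...
script = f"""#!/bin/bash
cd {repo_name}
git add .
git commit -m "Automated commit of indexing SQL"
git push
"""
with open("/tmp/run_git.sh", "w") as f:
    f.write(script)

# ...then execute the generated file from a static piece of shell
print(subprocess.run(["bash", "/tmp/run_git.sh"], capture_output=True))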

Monday, January 27, 2025

Upgrading Kubernetes

I'm adding a new node to my home K8s cluster. Conveniently, I can find the command to run on the client who wants to join by running this on the master [SO]:

kubeadm token create --print-join-command

However, joining proved a problem because my new machine has Kubernetes 1.32 and my old cluster is still on 1.28. K8s allows a version skew of one, but this was just too big a jump. 

So, time to upgrade the cluster.

First I had to update the executables by following the official documents to put the correct keyrings in place. I also had to override the held packages with --allow-change-held-packages as all my boxes are Ubuntu and can be upgraded in lockstep. This meant I didn't have to run:

sudo kubeadm upgrade apply 1.32
sudo kubeadm config images pull --kubernetes-version 1.32.1

However, I must have bodged something, as getting the nodes showed the master was in a state of Ready,SchedulingDisabled [SO], where uncordon did the trick. I was also getting "Unable to connect to the server: tls: failed to verify certificate: x509" (amongst other errors), so I rolled back [ServerFault] all config with:

sudo kubeadm reset

To be honest, I had to do this a few times until I worked out that my bodge was an incorrect IP address in my /etc/hosts file for one of my nodes - d'oh.

Then I followed the original instructions I used to set up the cluster last year. But don't forget the patch I mention in a previous post. Also note that you must add KUBELET_EXTRA_ARGS="--cgroup-driver=cgroupfs" to /etc/default/kubelet and the JSON to /etc/docker/daemon.json on the worker nodes too.

[Addendum. I upgraded an Ubuntu box from 20 to 22 and had my flannel and proxy pods on that box constantly crashing. Proxy was reporting "nodePortAddresses is unset; NodePort connections will be accepted on all local IPs. Consider using `--nodeport-addresses primary`". Following the instructions in the previous paragraph a second time solved the problem, as the upgrade had clearly blatted some config.]

I checked the services [SO] with:

systemctl list-unit-files | grep running | grep -i kube 

which showed the kubelet is running (enabled means it will restart upon the next reboot; you can have one without the other) and 

sudo systemctl status kubelet

Things seemed OK:

$ kubectl get pods --all-namespaces 
NAMESPACE      NAME                          READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-7zmz8         1/1     Running   4 (88m ago)   89m
kube-flannel   kube-flannel-ds-sh4nk         1/1     Running   9 (64m ago)   86m
kube-system    coredns-668d6bf9bc-748ql      1/1     Running   0             94m
kube-system    coredns-668d6bf9bc-zcxfp      1/1     Running   0             94m
kube-system    etcd-nuc                      1/1     Running   8             94m
kube-system    kube-apiserver-nuc            1/1     Running   8             94m
kube-system    kube-controller-manager-nuc   1/1     Running   6             94m
kube-system    kube-proxy-dd4gc              1/1     Running   0             94m
kube-system    kube-proxy-hlzhj              1/1     Running   9 (63m ago)   86m
kube-system    kube-scheduler-nuc            1/1     Running   6             94m

Note the nuc node name indicates it's running on my cluster's master, and the other pods (flannel, coredns and kube-proxy) have an instance on each node in the cluster.

Note also that we'd expect two Flannel pods as there are two nodes in the cluster.

It's worth noting at this point that kubectl is a client side tool. In fact, you won't be able to see the master until you scp /etc/kubernetes/admin.conf on the master into your local ~/.kube/config.

Contrast this with kubeadm which is a cluster side tool.

Wednesday, January 22, 2025

Notes on LLMs

I've been learning to build LLMs the hard way. These are some notes I made on my journey.

Intro

I'm not going to train my LLM from scratch. Instead, I'm going to use somebody else's ("Pre-Training") and tune it ("Supervised Fine-Tuning"). The differences can be seen here [Microsoft], but suffice it to say that Pre-Training takes huge resources whereas Supervised Fine-Tuning can be achieved on modest hardware.

The Environment

I'll be extensively using these Python libraries:
  • TRL "is a library created and maintained by Hugging Face to train LLMs using SFT and preference alignment." [1]
  • Unsloth "uses custom kernels to speed up training (2-5x) and reduce memory use (up to 80% less memory)... It is based on TRL and provides many utilities, such as automatically converting models into the GGUF quantization [see below] format." [1]
  • "Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code!" [HuggingFace] It offloads large matrices to disk or CPU
Preference alignment is an umbrella term. It "addresses the shortcomings of SFT by incorporating direct human or AI feedback into the training process" whereas SFT is just trained on a corpus of text learning its structure.

But on Ubuntu 20, my code gives:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

This appears to be referring to the Linux kernel [SO]

OOMEs

Using a newer kernel (6.5.0-1025-oem) gets me further but then barfs with:

  File "/home/henryp/venv/llms/lib/python3.10/site-packages/transformers/activations.py", line 56, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.17 GiB. GPU 0 has a total capacity of 7.75 GiB of which 529.62 MiB is free. Including non-PyTorch memory, this process has 7.22 GiB memory in use. Of the allocated memory 6.99 GiB is allocated by PyTorch, and 87.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Some advice I received was:
  • You need to reduce the batch size even more.
  • You can even try using different optimizers as optimizers hold a lot of memory - use 8bitadam or adafactor.
  • You can try LoRA - low-rank adaptation.
  • You can also quantize the model to load in lower bit precisions [Discord]
"LoRA is a parameter-efficient technique for fine-tuning LLMs... The primary purpose of LoRA is to enable the fine-tuning of LLMs with significantly reduced
computational resources."

The idea is that when we fine tune the weights matrix (W), we'd semantically do this:

W' = W + M

But if M is large, that's a lot of memory we're talking about.

The trick with LoRA is that instead of M we use two matrices of much smaller rank, A and B, and replace M with AB. As long as A has the same number of rows as M and B has the same number of columns, their shared inner dimension (referred to as r in the library code) can be much smaller. "Larger ranks may capture more diverse tasks but could lead to overfitting" [1, p214].
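A toy PyTorch sketch of the saving (dimensions are made up but Llama-sized):

import torch

d, k, r = 4096, 4096, 16            # a Llama-sized layer and a small LoRA rank
A = torch.randn(d, r)               # same number of rows as M
B = torch.randn(r, k)               # same number of columns as M
M = A @ B                           # the full-rank update we never have to store

print(d * k)            # parameters in M: ~16.8 million
print(d * r + r * k)    # parameters in A and B: ~131 thousand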

Sure, we lose some accuracy but when my code now reads:

model, tokenizer = FastLanguageModel.from_pretrained( # Unsloth
...
            load_in_4bit=True, # "You can activate QLoRA by setting load_in_4bit to True"  LLMEngineering, p251
        )

it runs to completion. Note that QLoRA combines LoRA with quantization. "The core of QLoRA’s approach involves quantizing the base model parameters to a custom 4-bit NormalFloat (NF4) data type, which significantly reduces memory usage." [1]

Note that I also tried to set a quantization_config and, although the fine-tuning stage worked, the code using the resulting model then complained about non-zero probabilities, for reasons I don't yet understand. 

Preference Alignments

Reinforcement Learning from Human Feedback "typically involves training a separate reward model and then using reinforcement learning algorithms like PPO to fine-tune the language model... This is typically done by presenting humans with different answers and asking them to indicate which one they prefer." [1]

Proximal Policy Optimization (PPO) "is one of the most popular RLHF algorithms. Here, the reward model is used to score the text that is generated by the trained model. This reward is regularized by an additional Kullback–Leibler (KL) divergence factor, ensuring that the distribution of tokens stays similar to the model before training (frozen model)." [1]

Direct Preference Optimization uses Kullback-Leibler divergence [previous post] to compare an accepted answer with a rejected one, removing the need for an RL algorithm.

The idea with preference alignment is that the adjustment to the underlying model is much smaller. Compare my trained model weights (M in the above equation):

$ du -sh saved_model/model_mlabonne
321M    saved_model/model_mlabonne

with the model weight on which it was based (W in the above equation):

$ du -sh ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
15G     /home/henryp/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/

and it's tiny.

[1] The LLM Engineers Handbook