Saturday, February 24, 2024

Home-made Kubernetes cluster

When trying to run ArgoCD, I came across this problem that was stopping me from connecting. Using kubectl port-forward..., I was finally able to connect. But even then, if I ran:

$ kubectl get services --namespace argocd
NAME                                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
argocd-applicationset-controller          ClusterIP      10.98.20.142     <none>        7000/TCP,8080/TCP            19h
argocd-dex-server                         ClusterIP      10.109.252.231   <none>        5556/TCP,5557/TCP,5558/TCP   19h
argocd-metrics                            ClusterIP      10.106.130.22    <none>        8082/TCP                     19h
argocd-notifications-controller-metrics   ClusterIP      10.109.57.97     <none>        9001/TCP                     19h
argocd-redis                              ClusterIP      10.100.158.58    <none>        6379/TCP                     19h
argocd-repo-server                        ClusterIP      10.111.224.112   <none>        8081/TCP,8084/TCP            19h
argocd-server                             LoadBalancer   10.102.214.179   <pending>     80:30081/TCP,443:30838/TCP   19h
argocd-server-metrics                     ClusterIP      10.96.213.240    <none>        8083/TCP                     19h

Why was my EXTERNAL-IP still pending? It appears that this is a natural consequence of running my K8s cluster in Minikube [SO].
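(For the record, the port-forward was the usual ArgoCD incantation, something like

$ kubectl port-forward svc/argocd-server --namespace argocd 8080:443

which makes the UI answer on https://localhost:8080.)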

So, I decided to build my own Kubernetes cluster. This step-by-step guide proved really useful. I built a small cluster of 2 nodes on heterogeneous hardware. Note that although you can use different OSs and hardware, you really need to use the same version of K8s on all boxes (see this SO).
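The broad strokes (this is the standard kubeadm flow rather than anything peculiar to the guide): initialise the control plane on one box, install a pod network, then join the other box with a token minted on the control plane.

# On the control plane (nuc): print a join command with a fresh token
$ sudo kubeadm token create --print-join-command
# On the worker (adele): run the printed command, which looks something like
$ sudo kubeadm join 192.168.1.148:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

With both boxes joined, kubectl get nodes shows the pair: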

$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
adele   Ready    <none>          18h   v1.28.2   192.168.1.177   <none>        Ubuntu 18.04.6 LTS   5.4.0-150-generic   containerd://1.6.21
nuc     Ready    control-plane   18h   v1.28.2   192.168.1.148   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic    containerd://1.7.2

Great! However, Flannel did not seem to be working properly:

$ kubectl get pods --namespace kube-flannel -o wide 
NAME                    READY   STATUS             RESTARTS         AGE    IP              NODE    NOMINATED NODE   READINESS GATES
kube-flannel-ds-4g8gg   0/1     CrashLoopBackOff   34 (2m53s ago)   152m   192.168.1.148   nuc     <none>           <none>
kube-flannel-ds-r4xvt   0/1     CrashLoopBackOff   26 (3m11s ago)   112m   192.168.1.177   adele   <none>           <none>

And journalctl -fu kubelet was puking "Error syncing pod, skipping" messages.
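The Flannel pods' own logs are the place to dig; something like this shows what the last crashed container said:

$ kubectl logs --namespace kube-flannel kube-flannel-ds-4g8gg --previous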

Aside: Flannel is a container on each node that coordinates the segmentation of the virtual network. For coordination, it can use etcd, which can be thought of like Zookeeper in the Java ecosystem. "Flannel does not control how containers are networked to the host, only how the traffic is transported between hosts." [GitHub]
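(For reference, Flannel is normally installed just by applying its manifest, which is what creates the kube-flannel-ds DaemonSet seen above; a guide may pin a specific release rather than latest:

$ kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
)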

The guide seemed to omit one detail, which led to the Flannel container puking something like this error:

E0427 06:08:23.685930 13405 memcache.go:265] couldn't get current server API group list: Get "https://X.X.X.X:6443/api?timeout=32s": dial tcp X.X.X.X:6443: connect: connection refused

Following this SO answer revealed that the cluster's CIDR had not been set. So, I patched it following this [SO] advice, like so:

kubectl patch node nuc -p '{"spec":{"podCIDR":"10.244.0.0/16"}}'
kubectl patch node adele -p '{"spec":{"podCIDR":"10.244.0.0/16"}}'

which will work until the next reboot (one of the SO answers describes how to make it permanent, as does this one).
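For what it's worth, the clean way to make it permanent on a kubeadm-bootstrapped cluster is to bake the CIDR in when initialising the control plane, so the controller manager allocates pod CIDRs itself:

$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16

10.244.0.0/16 is Flannel's default network; equivalently, kube-controller-manager can be given --allocate-node-cidrs=true and --cluster-cidr=10.244.0.0/16.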

Anyway, that did the trick, and now the cluster seems to be behaving well.

Incidentally, this gives a lot of log goodies:

kubectl cluster-info dump

Thursday, February 15, 2024

Spark and Schemas

I helped somebody on Discord with a tricksy problem. S/he was using a Python UDF in PySpark and seeing NullPointerExceptions. This suggests a Java problem, as the Python equivalent of an NPE looks more like "AttributeError: 'NoneType' object has no attribute ...". But why would Python code cause Spark to throw an NPE?

The problem was that the UDF's returnType was a struct in which a StructField was declared as not nullable.


A call to charge_type.lower in their code was a red herring, as they had clearly changed more than one thing while experimenting (always change one thing at a time!).
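Their actual code isn't reproduced here, but a minimal sketch of the failure mode looks something like this (charge_type is the field name from their code; everything else is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The returnType promises that charge_type is never null...
schema = StructType([StructField("charge_type", StringType(), nullable=False)])

@udf(returnType=schema)
def normalise(charge_type):
    # ...but the Python side can still hand back a None for it.
    return (charge_type.lower(),) if charge_type is not None else (None,)

df = spark.createDataFrame([("FIXED",), (None,)], ["charge_type"])
df.select(normalise(col("charge_type"))).show()   # blows up in generated code on the None row

Whether you see exactly this depends on the Spark version, but the shape of the problem is the same: the declared schema and the values actually returned disagree.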

Note that Spark regards the nullable field as advisory only.
When you define a schema where all columns are declared to not have null values, Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column.
- Spark: The Definitive Guide
And the reason is in this code, where Spark generates bespoke code. If nullable is false, it does not check the reference unnecessarily. But if the reference is null, Spark barfs like so:

Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_2$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$11(EvalPythonExec.scala:148)

So, the Python code returned without an error but caused the JVM code to blow up, as the struct it returned contained nulls when the schema said it wouldn't.
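The fix is either to make sure the UDF never returns None for that field, or simply to be honest in the schema, i.e. in the sketch above:

schema = StructType([StructField("charge_type", StringType(), nullable=True)])

With nullable=True (which is the default anyway), the generated code checks for null before writing, and you get a null field back instead of an exception.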