Friday, March 6, 2026

Cloud and K8s

I've spent the week setting up HTTPS certificates and domain names for my Azure and GCP K8s clusters.

At one point, the GCP K8s installation just kept hanging on "Still creating...". It turned out we simply hadn't allocated enough resources. But not all hanging Kubernetes deployments are so easy to spot.

Automatically point domain names at K8s pod IPs

AWS is better integrated because you can buy the domain names through Route53. But for Azure and GCP K8s, we did the following.
  1. we bought a domain name from AWS via Route53.
  2. we delegated the nameservers of this domain to Microsoft or Google.
  3. a Kubernetes solver pod starts up and contacts Let's Encrypt's API.
  4. Let's Encrypt returns a token.
  5. the solver combines that token with a thumbprint of its ACME account's public key (a.k.a. the Key Authorization) and serves it over HTTP on port 80.
  6. Let's Encrypt fetches that file and verifies it against the account's public key it already knows. Now it can grant a certificate.
This protocol is called ACME (Automatic Certificate Management Environment), and the solver pod that runs it is ephemeral:

$ kubectl get events -A --sort-by=.lastTimestamp | grep -i acme
...
default         45m         Normal    Started                   pod/cm-acme-http-solver-6j5xp                                  Started container acmesolver
default         44m         Normal    Sync                      ingress/cm-acme-http-solver-lgl5w                              Scheduled for sync
default         43m         Normal    Killing                   pod/cm-acme-http-solver-6j5xp                                  Stopping container acmesolver
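
The protocol steps above map onto cert-manager configuration. A minimal ClusterIssuer sketch for the HTTP-01 flow (the issuer name and email are hypothetical, and this assumes cert-manager is installed alongside an nginx ingress class):

```yaml
# A cert-manager ClusterIssuer that drives the HTTP-01 dance described above.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com      # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-prod-key  # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx          # spawns the ephemeral cm-acme-http-solver pods
```

Referencing this issuer from an Ingress annotation is what triggers the solver pods seen in the events above.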

The outside world can talk to either a Service or an Ingress. Which you choose depends on what you want: use an Ingress for HTTPS.

An Ingress always talks to a Service. Note that an Ingress is a logical abstraction: one domain can have multiple ingresses. For example:

$ kubectl get ingress -A
NAMESPACE   NAME                        CLASS    HOSTS                   ADDRESS         PORTS     AGE
default     cm-acme-http-solver-ldflx   <none>   emryspolarisgcp.click   35.189.87.151   80        14m
default     polaris-ingress             nginx    emryspolarisgcp.click   35.189.87.151   80, 443   14m

is saying that emryspolarisgcp.click can route to different services depending on the port. Here the ACME solver is sticking around to finish creating the certificate.
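
For reference, the permanent ingress above comes from a manifest along these lines. This is a sketch, assuming cert-manager and the nginx ingress class; the issuer name is hypothetical, and the backend service name and port are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: polaris-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical issuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - emryspolarisgcp.click
      secretName: polaris-tls       # cert-manager stores the certificate here
  rules:
    - host: emryspolarisgcp.click
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: polaris-internal  # illustrative ClusterIP service
                port:
                  number: 8181
```

The cert-manager annotation is what causes the second, temporary cm-acme-http-solver ingress to appear next to this one.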

This was for GCP, where the nameservers were incorrect (they changed every time we deployed the managed zones through Terraform). You can see the current ones with:

gcloud dns managed-zones describe polaris-zone --project=afon-core
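
The zone itself was declared in Terraform along these lines (the resource name is hypothetical). Exporting name_servers makes it easier to update the Route53 delegation after each apply:

```hcl
resource "google_dns_managed_zone" "polaris" {
  name     = "polaris-zone"
  dns_name = "emryspolarisgcp.click."  # trailing dot is required
  project  = "afon-core"
}

# GCP assigns a fresh set of nameservers when the zone is recreated,
# so surface them as an output for the Route53 delegation.
output "polaris_name_servers" {
  value = google_dns_managed_zone.polaris.name_servers
}
```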

What can go wrong?

Traffic can be swallowed because of:
  • network security groups (or lack of them)
  • misconfigured ports
  • selectors not pointing at the correct pods
Incoming traffic goes through the system in this order:
  1. ingress (optional - see above)
  2. service
  3. endpoint
  4. pod
kubectl get ingress shows the name and the ports of the front-facing interface.
kubectl describe ingress XXX shows the service to which traffic is sent.
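
The "selectors not pointing at the correct pods" failure is worth a sketch: a Service only gets Endpoints when its selector matches the pods' labels exactly. All names and labels here are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: polaris-internal
spec:
  selector:
    app: polaris        # must match the pod template's labels exactly
  ports:
    - port: 8181
      targetPort: 8181  # the containerPort the application listens on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: polaris
spec:
  selector:
    matchLabels:
      app: polaris
  template:
    metadata:
      labels:
        app: polaris    # a typo here leaves the Service with no Endpoints
    spec:
      containers:
        - name: polaris
          image: polaris:latest   # hypothetical image
          ports:
            - containerPort: 8181
```

If kubectl get endpoints shows <none> for the service, the selector and the pod labels don't line up, and traffic dies at step 3 of the chain above.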

Note that with the nginx ingress controller, its LoadBalancer service comes before the ingress: the external IP below belongs to the controller's service, which then applies the ingress rules:

$ dig A emryspolarisazure.click +short
20.108.199.92
$ kubectl get service -A
NAMESPACE       NAME                                               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
...                     50m
default         polaris-internal                                   ClusterIP      10.2.185.168   <none>          8181/TCP                     47m
ingress-nginx   nginx-ingress-ingress-nginx-controller             LoadBalancer   10.2.238.76    20.108.199.92   80:30157/TCP,443:32110/TCP   47m
ingress-nginx   nginx-ingress-ingress-nginx-controller-admission   ClusterIP      10.2.30.150    <none>          443/TCP                      47m
...

Debugging

If in doubt, port forward:

kubectl port-forward svc/polaris-internal 8080:8181 -n default

This will at least establish that the communication between your service and application is fine.

It's important to check that the firewall is at least expecting a connection on that IP address and port. Don't use curl for this, as it exercises the whole HTTP(S) stack: network security rules and certificates need to be in place too. Instead, run:

nc -zv 20.108.199.92 80

to make sure that the port is open. Bear in mind, though, that some firewalls allow the TCP three-way handshake to complete even if the Network Security Group blocks further traffic.
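
nc does nothing more than attempt that TCP handshake. If nc isn't available, a few lines of Python perform the same check; this is a pure connectivity probe, not an HTTP request:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port succeeds, like nc -zv."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_open("20.108.199.92", 80) mirrors the nc command above.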

Upon setting up the stack with tofu, the certificate didn't look healthy and I couldn't access my site via HTTPS.

$ kubectl get certificate polaris-tls   
NAME          READY   SECRET        AGE
polaris-tls   False   polaris-tls   28m
$ kubectl get challenges -A
NAMESPACE   NAME                                STATE     DOMAIN                    AGE
default     polaris-tls-1-648081749-304604175   invalid   emryspolarisazure.click   36m
$ kubectl describe certificate polaris-tls
...
Events:
  Type     Reason     Age   From                                       Message
  ----     ------     ----  ----                                       -------
  Normal   Issuing    23m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal   Generated  23m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-rw6t9"
  Normal   Requested  23m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Warning  Failed     21m   cert-manager-certificates-issuing          The certificate request has failed to complete and will be retried: Failed to wait for order resource "polaris-tls-1-648081749" to become ready: order is in "invalid" state:
$ kubectl describe challenge -A
...
Events:
  Type     Reason     Age   From                     Message
  ----     ------     ----  ----                     -------
  Normal   Started    37m   cert-manager-challenges  Challenge scheduled for processing
  Normal   Presented  37m   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism
  Warning  Failed     35m   cert-manager-challenges  Accepting challenge authorization failed: acme: authorization error for emryspolarisazure.click: 400 urn:ietf:params:acme:error:connection: 51.132.211.134: Fetching http://emryspolarisazure.click/.well-known/acme-challenge/5yGd57VQUrjc2ns-Q-VEVIl3vl6WKFK4B2fQu643_TM: Timeout during connect (likely firewall problem)

Running:

kubectl delete certificate polaris-tls

did the trick, as it forces cert-manager to reissue the certificate. Watch and wait for it to become ready with:

kubectl get certificate polaris-tls -w

On Azure, you also need to run this (or put the equivalent in your Terraform config) so that the load balancer's health probe hits an endpoint that returns 200:

kubectl annotate service nginx-ingress-ingress-nginx-controller -n ingress-nginx "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path=/healthz"
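
If you install the controller through the Helm chart (e.g. via Terraform's helm_release), the equivalent lives in the chart values. A sketch, assuming the official ingress-nginx chart:

```yaml
# values.yaml for the ingress-nginx Helm chart
controller:
  service:
    annotations:
      # Point Azure's load balancer health probe at an endpoint that returns 200;
      # otherwise the probe can hit "/" and mark the backend unhealthy.
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
```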

and that should now all work. A happy system should look like:

$ kubectl describe certificate polaris-tls
...
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Issuing    42m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  42m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-thlvj"
  Normal  Requested  42m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Normal  Issuing    40m   cert-manager-certificates-issuing          The certificate has been successfully issued
