Friday, March 6, 2026

Cloud and K8s

I've spent the week setting up HTTPS certificates and domain names for my Azure and GCP K8s clusters.

At one point, the GCP K8s installation just kept hanging on "Still creating...". It turned out we simply hadn't allocated enough resources. But not all hanging Kubernetes deployments are so easy to spot.

Automatically point domain names at K8s pod IPs

AWS is better integrated because you can buy the domain names through Route53. But for Azure and GCP K8s, we did the following.
  1. we bought a domain name from AWS via Route53.
  2. we delegated the nameservers of this domain to Microsoft or Google.
  3. a Kubernetes solver pod starts up and contacts Let's Encrypt's API.
  4. Let's Encrypt returns a token.
  5. the solver combines that token with a thumbprint of its ACME account's public key (a.k.a. the Key Authorization) and serves it over HTTP on port 80.
  6. Let's Encrypt fetches that file and verifies it against the account's public key it already knows. Now it can grant a certificate.
This protocol is called ACME (Automatic Certificate Management Environment), and the solver pod that runs it is ephemeral:

$ kubectl get events -A --sort-by=.lastTimestamp | grep -i acme
...
default         45m         Normal    Started                   pod/cm-acme-http-solver-6j5xp                                  Started container acmesolver
default         44m         Normal    Sync                      ingress/cm-acme-http-solver-lgl5w                              Scheduled for sync
default         43m         Normal    Killing                   pod/cm-acme-http-solver-6j5xp                                  Stopping container acmesolver
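
The protocol steps above map onto cert-manager configuration. A minimal ClusterIssuer sketch for the HTTP-01 flow (the issuer name and email are hypothetical, and this assumes cert-manager is installed alongside an nginx ingress class):

```yaml
# A cert-manager ClusterIssuer that drives the HTTP-01 dance described above.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com      # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-prod-key  # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx          # spawns the ephemeral cm-acme-http-solver pods
```

Referencing this issuer from an Ingress annotation is what triggers the solver pods seen in the events above.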

The outside world can talk to either a Service or an Ingress. Which you choose depends on what you want: use an Ingress for HTTPS.

An Ingress always talks to a Service. Note that an Ingress is a logical abstraction: one domain can have multiple ingresses. For example:

$ kubectl get ingress -A
NAMESPACE   NAME                        CLASS    HOSTS                   ADDRESS         PORTS     AGE
default     cm-acme-http-solver-ldflx   <none>   emryspolarisgcp.click   35.189.87.151   80        14m
default     polaris-ingress             nginx    emryspolarisgcp.click   35.189.87.151   80, 443   14m

is saying that emryspolarisgcp.click can route to different services depending on the port. Here the ACME solver is sticking around to finish creating the certificate.
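
For reference, the permanent ingress above comes from a manifest along these lines. This is a sketch, assuming cert-manager and the nginx ingress class; the issuer name is hypothetical, and the backend service name and port are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: polaris-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical issuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - emryspolarisgcp.click
      secretName: polaris-tls       # cert-manager stores the certificate here
  rules:
    - host: emryspolarisgcp.click
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: polaris-internal  # illustrative ClusterIP service
                port:
                  number: 8181
```

The cert-manager annotation is what causes the second, temporary cm-acme-http-solver ingress to appear next to this one.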

This was for GCP, where the nameservers were incorrect (they changed every time we deployed the managed zones through Terraform). You can see the current ones with:

gcloud dns managed-zones describe polaris-zone --project=afon-core
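
The zone itself was declared in Terraform along these lines (the resource name is hypothetical). Exporting name_servers makes it easier to update the Route53 delegation after each apply:

```hcl
resource "google_dns_managed_zone" "polaris" {
  name     = "polaris-zone"
  dns_name = "emryspolarisgcp.click."  # trailing dot is required
  project  = "afon-core"
}

# GCP assigns a fresh set of nameservers when the zone is recreated,
# so surface them as an output for the Route53 delegation.
output "polaris_name_servers" {
  value = google_dns_managed_zone.polaris.name_servers
}
```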

What can go wrong?

Traffic can be swallowed because of:
  • network security groups (or lack of them)
  • misconfigured ports
  • selectors not pointing at the correct pods
Incoming traffic goes through the system in this order:
  1. ingress (optional - see above)
  2. service
  3. endpoint
  4. pod
kubectl get ingress shows the name and the ports of the front-facing interface.
kubectl describe ingress XXX shows the service to which traffic is sent.
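
The "selectors not pointing at the correct pods" failure is worth a sketch: a Service only gets Endpoints when its selector matches the pods' labels exactly. All names and labels here are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: polaris-internal
spec:
  selector:
    app: polaris        # must match the pod template's labels exactly
  ports:
    - port: 8181
      targetPort: 8181  # the containerPort the application listens on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: polaris
spec:
  selector:
    matchLabels:
      app: polaris
  template:
    metadata:
      labels:
        app: polaris    # a typo here leaves the Service with no Endpoints
    spec:
      containers:
        - name: polaris
          image: polaris:latest   # hypothetical image
          ports:
            - containerPort: 8181
```

If kubectl get endpoints shows <none> for the service, the selector and the pod labels don't line up, and traffic dies at step 3 of the chain above.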

Note that with the nginx ingress controller, its LoadBalancer service comes before the ingress: the external IP below belongs to the controller's service, which then applies the ingress rules:

$ dig A emryspolarisazure.click +short
20.108.199.92
$ kubectl get service -A
NAMESPACE       NAME                                               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
...                     50m
default         polaris-internal                                   ClusterIP      10.2.185.168   <none>          8181/TCP                     47m
ingress-nginx   nginx-ingress-ingress-nginx-controller             LoadBalancer   10.2.238.76    20.108.199.92   80:30157/TCP,443:32110/TCP   47m
ingress-nginx   nginx-ingress-ingress-nginx-controller-admission   ClusterIP      10.2.30.150    <none>          443/TCP                      47m
...

Debugging

If in doubt, port forward:

kubectl port-forward svc/polaris-internal 8080:8181 -n default

This will at least establish that the communication between your service and application is fine.

It's important to check that the firewall is at least expecting a connection on that IP address and port. Don't use curl for this, as it exercises the whole HTTP(S) stack: network security rules and certificates need to be in place too. Instead, run:

nc -zv 20.108.199.92 80

to make sure that the port is open. Bear in mind, though, that some firewalls allow the TCP three-way handshake to complete even if the Network Security Group blocks further traffic.
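
nc does nothing more than attempt that TCP handshake. If nc isn't available, a few lines of Python perform the same check; this is a pure connectivity probe, not an HTTP request:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port succeeds, like nc -zv."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_open("20.108.199.92", 80) mirrors the nc command above.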

Upon setting up the stack with tofu, the certificate didn't look healthy and I couldn't access my site via HTTPS.

$ kubectl get certificate polaris-tls   
NAME          READY   SECRET        AGE
polaris-tls   False   polaris-tls   28m
$ kubectl get challenges -A
NAMESPACE   NAME                                STATE     DOMAIN                    AGE
default     polaris-tls-1-648081749-304604175   invalid   emryspolarisazure.click   36m
$ kubectl describe certificate polaris-tls
...
Events:
  Type     Reason     Age   From                                       Message
  ----     ------     ----  ----                                       -------
  Normal   Issuing    23m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal   Generated  23m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-rw6t9"
  Normal   Requested  23m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Warning  Failed     21m   cert-manager-certificates-issuing          The certificate request has failed to complete and will be retried: Failed to wait for order resource "polaris-tls-1-648081749" to become ready: order is in "invalid" state:
$ kubectl describe challenge -A
...
Events:
  Type     Reason     Age   From                     Message
  ----     ------     ----  ----                     -------
  Normal   Started    37m   cert-manager-challenges  Challenge scheduled for processing
  Normal   Presented  37m   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism
  Warning  Failed     35m   cert-manager-challenges  Accepting challenge authorization failed: acme: authorization error for emryspolarisazure.click: 400 urn:ietf:params:acme:error:connection: 51.132.211.134: Fetching http://emryspolarisazure.click/.well-known/acme-challenge/5yGd57VQUrjc2ns-Q-VEVIl3vl6WKFK4B2fQu643_TM: Timeout during connect (likely firewall problem)

Running:

kubectl delete certificate polaris-tls

did the trick, as it forces cert-manager to reissue the certificate. Watch and wait for it to become ready with:

kubectl get certificate polaris-tls -w

On Azure, you also need to run this (or put the equivalent in your Terraform config) so that the load balancer's health probe hits an endpoint that returns 200:

kubectl annotate service nginx-ingress-ingress-nginx-controller -n ingress-nginx "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path=/healthz"
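
If you install the controller through the Helm chart (e.g. via Terraform's helm_release), the equivalent lives in the chart values. A sketch, assuming the official ingress-nginx chart:

```yaml
# values.yaml for the ingress-nginx Helm chart
controller:
  service:
    annotations:
      # Point Azure's load balancer health probe at an endpoint that returns 200;
      # otherwise the probe can hit "/" and mark the backend unhealthy.
      service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
```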

and that should now all work. A happy system should look like:

$ kubectl describe certificate polaris-tls
...
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Issuing    42m   cert-manager-certificates-trigger          Issuing certificate as Secret does not exist
  Normal  Generated  42m   cert-manager-certificates-key-manager      Stored new private key in temporary Secret resource "polaris-tls-thlvj"
  Normal  Requested  42m   cert-manager-certificates-request-manager  Created new CertificateRequest resource "polaris-tls-1"
  Normal  Issuing    40m   cert-manager-certificates-issuing          The certificate has been successfully issued
