I've spent the week setting up HTTPS certificates and domain names for my Azure and GCP K8s clusters.
At one point, the GCP K8s installation just kept hanging with "Still creating...". It turned out that we simply hadn't allocated enough resources.
AWS is better integrated because you can buy the domain names through Route53. But for Azure and GCP K8s, we did the following:
- we bought a domain name from AWS via Route53.
- we delegated the nameservers of this domain to Microsoft or Google.
- cert-manager, running in the cluster, contacts Let's Encrypt's API and asks for a certificate for the domain.
- Let's Encrypt returns a token.
- cert-manager combines the token with a thumbprint of its ACME account key (the result is the Key Authorization) and starts a short-lived solver pod that serves it on port 80 at /.well-known/acme-challenge/<token>.
- Let's Encrypt fetches that file over HTTP and compares it with the value it expects for that account. If they match, the domain counts as validated and it can grant a certificate.
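To make the challenge step concrete, here is a minimal Python sketch of how a Key Authorization is built, following my reading of RFC 8555 (section 8.1) and RFC 7638. This is not cert-manager's code; the function names are my own, and the JWK in the usage note is a made-up placeholder rather than a real key.

```python
import base64
import hashlib
import json

# Members that RFC 7638 requires in the thumbprint input, per key type.
REQUIRED_MEMBERS = {
    "RSA": ("e", "kty", "n"),
    "EC": ("crv", "kty", "x", "y"),
}


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME expects."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint of a public JWK (RFC 7638)."""
    members = REQUIRED_MEMBERS[jwk["kty"]]
    # Canonical form: required members only, sorted keys, no whitespace.
    canonical = json.dumps(
        {name: jwk[name] for name in members},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """Body served at /.well-known/acme-challenge/<token>."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"
```

Calling key_authorization("some-token", {"kty": "EC", "crv": "P-256", "x": "...", "y": "..."}) returns the token, a dot, and a 43-character thumbprint. Nothing is encrypted here: the value only proves that whoever controls the domain's port 80 also holds the ACME account.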
ACME (Automatic Certificate Management Environment) is the protocol used here. The pod that answers the challenge, cm-acme-http-solver, is ephemeral:
$ kubectl get events -A --sort-by=.lastTimestamp | grep -i acme
...
default 45m Normal Started pod/cm-acme-http-solver-6j5xp Started container acmesolver
default 44m Normal Sync ingress/cm-acme-http-solver-lgl5w Scheduled for sync
default 43m Normal Killing pod/cm-acme-http-solver-6j5xp Stopping container acmesolver
The outside world can talk to both a Service and an Ingress. Which you choose depends on what you want. Use an Ingress for HTTPS.
An Ingress always routes to a Service. Note that an Ingress is a logical abstraction, and one domain can have multiple Ingresses. For example:
$ kubectl get ingress -A
NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE
default cm-acme-http-solver-ldflx <none> emryspolarisgcp.click 35.189.87.151 80 14m
default polaris-ingress nginx emryspolarisgcp.click 35.189.87.151 80, 443 14m
shows that emryspolarisgcp.click can point to different services: the solver's Ingress handles only port 80, while polaris-ingress handles 80 and 443. Here the ACME solver is sticking around to finish the creation of certificates.
This was on GCP, where the nameservers were incorrect (they changed every time we deployed the managed zones through Terraform). You can see the current ones with:
gcloud dns managed-zones describe polaris-zone --project=afon-core
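Since the zone's nameservers can drift away from what the registrar holds, a small comparison helps. The Python sketch below assumes you have copied both lists by hand (from the nameServers field of the gcloud output and from the Route53 registered domain); the function names are my own.

```python
def normalize(nameserver: str) -> str:
    """Lower-case a nameserver and drop whitespace and the trailing dot."""
    return nameserver.strip().rstrip(".").lower()


def missing_delegations(registrar_ns, zone_ns):
    """Return the zone's nameservers that the registrar does not list."""
    registered = {normalize(ns) for ns in registrar_ns}
    return sorted(
        normalize(ns) for ns in zone_ns if normalize(ns) not in registered
    )
```

An empty result means every nameserver of the managed zone is known to the registrar. Anything else is a nameserver that still has to be added on the Route53 side.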
What can go wrong?
Traffic can be swallowed because of:
- network security groups (or lack of them)
- misconfigured ports
- selectors not pointing at the correct pods
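The selector problem is easy to reason about once you see the rule written down. This Python sketch models the equality-based matching a Service applies to pod labels; it is an illustration with a function name of my own, not Kubernetes code.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every key/value pair in the selector appears in the pod's labels."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

A pod may carry extra labels and still match, but a single differing value (say app=polaris on the Service and app=polaris-api on the pod) leaves the Service with no endpoints, and traffic goes nowhere.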
Incoming traffic goes through the system in this order:
- ingress (optional - see above)
- service
- endpoint
- pod
kubectl get ingress shows the name and the ports of the front-facing interface.
kubectl describe ingress XXX shows the service to which traffic is sent.
Note that with the nginx ingress controller, the controller's LoadBalancer service sits in front of the Ingress:
$ dig A emryspolarisazure.click +short
20.108.199.92
$ kubectl get service -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
... 50m
default polaris-internal ClusterIP 10.2.185.168 <none> 8181/TCP 47m
ingress-nginx nginx-ingress-ingress-nginx-controller LoadBalancer 10.2.238.76 20.108.199.92 80:30157/TCP,443:32110/TCP 47m
ingress-nginx nginx-ingress-ingress-nginx-controller-admission ClusterIP 10.2.30.150 <none> 443/TCP 47m
...
Debugging
If in doubt, port forward:
kubectl port-forward svc/polaris-internal 8080:8181 -n default
This will at least establish that the communication between your service and application is fine.
It's important to check that the firewall is at least expecting a connection for that IP address and port. Don't use curl for this as it is subject to network security rules and certificates being in place. So, run:
nc -zv 20.108.199.92 80
if you want to make sure that the port is open, as firewalls allow a TCP three-way handshake even if the Network Security Group blocks further traffic.
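If nc isn't installed, the same check can be approximated in Python. This is a minimal sketch that only attempts a TCP connection, as nc -z does; the function name and the default timeout are my own choices.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the three-way handshake and nothing more.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the example above you would call port_is_open("20.108.199.92", 80). No HTTP request is sent, so certificates and ingress rules don't come into it.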
After setting up the stack with tofu, the certificate didn't look healthy and I couldn't access my site via HTTP.
$ kubectl get certificate polaris-tls
NAME READY SECRET AGE
polaris-tls False polaris-tls 28m
$ kubectl get challenges -A
NAMESPACE NAME STATE DOMAIN AGE
default polaris-tls-1-648081749-304604175 invalid emryspolarisazure.click 36m
$ kubectl describe certificate polaris-tls
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Issuing 23m cert-manager-certificates-trigger Issuing certificate as Secret does not exist
Normal Generated 23m cert-manager-certificates-key-manager Stored new private key in temporary Secret resource "polaris-tls-rw6t9"
Normal Requested 23m cert-manager-certificates-request-manager Created new CertificateRequest resource "polaris-tls-1"
Warning Failed 21m cert-manager-certificates-issuing The certificate request has failed to complete and will be retried: Failed to wait for order resource "polaris-tls-1-648081749" to become ready: order is in "invalid" state:
$ kubectl describe challenge -A
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 37m cert-manager-challenges Challenge scheduled for processing
Normal Presented 37m cert-manager-challenges Presented challenge using HTTP-01 challenge mechanism
Warning Failed 35m cert-manager-challenges Accepting challenge authorization failed: acme: authorization error for emryspolarisazure.click: 400 urn:ietf:params:acme:error:connection: 51.132.211.134: Fetching http://emryspolarisazure.click/.well-known/acme-challenge/5yGd57VQUrjc2ns-Q-VEVIl3vl6WKFK4B2fQu643_TM: Timeout during connect (likely firewall problem)
Running:
kubectl delete certificate polaris-tls
did the trick as it forces the certificate to renew. Watch and wait for it to be ready with:
kubectl get certificate polaris-tls -w
On Azure, you also need to run this (or put the equivalent in your Terraform file) so that the load balancer's health probe requests /healthz:
kubectl annotate service nginx-ingress-ingress-nginx-controller -n ingress-nginx "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path=/healthz"
and everything should then work. A happy system looks like this:
$ kubectl describe certificate polaris-tls
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Issuing 42m cert-manager-certificates-trigger Issuing certificate as Secret does not exist
Normal Generated 42m cert-manager-certificates-key-manager Stored new private key in temporary Secret resource "polaris-tls-thlvj"
Normal Requested 42m cert-manager-certificates-request-manager Created new CertificateRequest resource "polaris-tls-1"
Normal Issuing 40m cert-manager-certificates-issuing The certificate has been successfully issued
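If you want to check readiness from a script rather than by eye, the READY column comes from the Certificate's status conditions. The Python sketch below assumes you have already loaded the output of kubectl get certificate polaris-tls -o json into a dict; the function name is my own.

```python
def certificate_is_ready(certificate: dict) -> bool:
    """True if the Certificate carries a Ready condition with status "True"."""
    conditions = certificate.get("status", {}).get("conditions", [])
    return any(
        condition.get("type") == "Ready" and condition.get("status") == "True"
        for condition in conditions
    )
```

Note that the status is the string "True", not a boolean. A Certificate that has no status yet is treated as not ready.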