Thursday, February 19, 2026

An unruly Terraform

If the Terraform state is out of sync with reality, you might need to change that state manually with something like:

tofu state list

followed by

tofu state rm XXX

I had to delete load balancers manually through the AWS Web Console and then the EKS cluster itself. I then had to manually delete any references to them from my JSON.
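
A minimal sketch of that state surgery (the resource address here is made up; take the real one from the tofu state list output):

tofu state list | grep -i jupyter    # find the address of the stale resource
tofu state rm 'aws_lb.jupyter'       # forget it; this does not touch the real AWS resource
tofu plan                            # confirm the plan now matches reality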

Tip: regularly delete the directory in which the Terraform lives, as state gets kept there that the next run implicitly relies upon. If you don't, then after a major refactor you run the configuration, everything looks fine and you check in thinking you've done a good job; but there was an invisible dependency on the previous run, and checking out to a fresh directory and running from there fails. So:

Delete all files regularly
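
A minimal sketch of such a clean-up, assuming a cdktf project whose local state really is disposable (the file names are how cdktf typically lays things out; adjust to taste):

rm -rf cdktf.out                 # the synthesized stacks, their .terraform dirs and lock files
rm -f  terraform.*.tfstate*      # local state files cdktf drops into the project root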

I was getting lots of:

│ Error: Get "https://21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com/api/v1/namespaces/default/services/jupyter-lb": dial tcp: lookup 21D13D424AA794FA2A76DE52CA79FBE9.gr7.eu-west-2.eks.amazonaws.com on 127.0.0.1:53: no such host
 

even after blatting my Terraform cdktf.out/stacks directory. Turns out state files were accumulating in the root directory of my project (which contained cdktf.out). Once they too were blatted, things looked better.
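
A quick way to hunt for those stragglers before the next run (again assuming cdktf's default local-state naming):

find . -maxdepth 2 -name 'terraform*.tfstate*' -not -path './cdktf.out/*'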

Changing the cdk.tf.json file resulted in:

│ Error: Inconsistent dependency lock file
│ 
│ The following dependency selections recorded in the lock file are inconsistent with the current configuration:
│   - provider registry.opentofu.org/hashicorp/helm: required by this configuration but no version is selected
│ 
│ To update the locked dependency selections to match a changed configuration, run:
│   tofu init -upgrade

The solution was to run tofu init -upgrade.
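
Note that with cdktf the synthesized stack, and therefore its lock file, lives under cdktf.out/stacks, so that's where the command needs to run. Something like (YOUR_STACK being whatever your stack is called):

cd cdktf.out/stacks/YOUR_STACK
tofu init -upgrade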

GCP

You might see this error when running Terraform on GCP:

│ Error: Error setting access_token
│ 
│   with data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C,
│   on cdk.tf.json line 25, in data.google_client_config.gcp-polaris-deployment_currentClient_7C40CA9C:
│   25:       }
│ 
│ oauth2: "invalid_grant" "reauth related error (invalid_rapt)" "https://support.google.com/a/answer/9368756"

It's really nothing to do with Terraform but rather your GCP credentials. Log in with gcloud auth application-default login and try again. D'oh.
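
That is, refresh the application default credentials that the google provider typically picks up, then retry:

gcloud auth application-default login   # re-runs the browser OAuth dance and rewrites the ADC file
tofu plan                               # should now get past the invalid_rapt error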

AWS
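
After a hanging destroy, these were useful for seeing what had been left behind in the VPC. The values below are only examples; point $VPC and $REGION at your own stranded VPC:

VPC=vpc-0123456789abcdef0    # hypothetical ID of the VPC that would not delete
REGION=eu-west-2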

aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=$VPC --region $REGION

aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC --region $REGION

aws ec2 describe-security-groups --filters Name=vpc-id,Values=$VPC --region $REGION

This last one showed 3 security groups.

The reason these AWS entities lingered is that my tofu destroy was always hanging. And the reason it never finished is that there were finalizers preventing it. To avoid this, I needed to run:

kubectl patch installation default -p '{"metadata":{"finalizers":[]}}' --type=merge
kubectl patch service YOUR_LOAD_BALANCER -p '{"metadata":{"finalizers":null}}'  --type=merge
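
Before patching, you can check whether a resource really is pinned by a finalizer (jupyter-lb being the service from the error earlier):

kubectl get service jupyter-lb -o jsonpath='{.metadata.finalizers}'
kubectl get installation default -o jsonpath='{.metadata.finalizers}'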

Also, CRDs need to be destroyed:

# strip the finalizers from every CRD, then force-delete it
for CRD in $(kubectl get crds --no-headers | awk '{print $1}') ; do
    kubectl patch crd $CRD --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
    kubectl delete crd $CRD --force
done

I would then run these scripts from a local-exec provisioner in a resource.

I asked on the DevOps Discord server how normal this was:
PhillHenry:
I'm using Terraform to manage my AWS stack that (amongst other things) creates a load balancer using an aws-load-balancer-controller. I'm finding destroying the stack just hangs then times out after 20 minutes.

I've had to introduce bash scripts that patch finalizers in services and installations plus force delete CRDs. Finally, tofu destroy cleans everything up but I can't help feeling I'm doing it all wrong by having to add hacks.

Is this normal? If not, can somebody point me in the right direction over what I'm doing wrong? 
snuufix:
It is normal with buggy providers, it's just sad that even AWS is one.

Redeploying

Redeploying a component was simply a matter of running:

tofu apply -replace=kubernetes_manifest.sparkConnectManifest  --auto-approve

This is a great way to redeploy just my Spark Connect pod when I've changed the config.
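
If you can't remember the exact resource address to pass to -replace, the state will tell you (the grep pattern is just an example):

tofu state list | grep -i spark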


Helm

If you want to find out what version of a Helm chart you're using because you forgot to pin it, this might help: it's the directory where Helm caches the charts it downloads.

$ ls -ltr ~/.cache/helm/repository/
...
-rw-r--r-- 1 henryp henryp 107929 Nov 12 10:41 spark-kubernetes-operator-1.3.0.tgz
-rw-r--r-- 1 henryp henryp 317214 Nov 20 15:56 eks-index.yaml
-rw-r--r-- 1 henryp henryp    433 Nov 20 15:56 eks-charts.txt
-rw-r--r-- 1 henryp henryp  36607 Nov 24 09:15 aws-load-balancer-controller-1.15.0.tgz
-rw-r--r-- 1 henryp henryp 493108 Dec 11 14:47 kube-prometheus-stack-51.8.0.tgz
-rw-r--r-- 1 henryp henryp  38337 Dec 15 12:20 aws-load-balancer-controller-1.16.0.tgz
...
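
If the release is still installed, a more direct route is to ask Helm itself; the CHART column shows the chart name and version for every release:

helm list --all-namespaces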
