So, let's look at the events (a.k.a. operations):
gcloud container operations list --project $PROJECT
Then to drill down on the operation of interest:
gcloud container operations describe operation-XXX --region $REGION --project $PROJECT
It seemed pretty quiet. The last two events were:
- CREATE_CLUSTER began at 16:35:38 and ran to 16:41:37
- DELETE_NODE_POOL started at 16:41:41 and ran to 16:46:02
So, that delete came hot on the heels of the cluster being successfully created. I looked at the logs with:
gcloud logging read "resource.labels.cluster_name=spark-cluster AND timestamp>=\"2025-11-14T16:41:35Z\" AND timestamp<=\"2025-11-14T16:41:42Z\"" --project=$PROJECT --limit 10 --order=desc
and one of these logs looked like this:
requestMetadata:
  callerSuppliedUserAgent: google-api-go-client/0.5 Terraform/1.10.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.36.0 terraform-provider-google/dev6,gzip(gfe)
...
response:
  operationType: DELETE_NODE_POOL
This was saying that the DELETE_NODE_POOL originated from my own Terraform run! And the reason for that was that my Terraform config had:
"remove_default_node_pool": true
It did this because my Terraform then tried to create its own node pool to replace the default one. However, it seems that having two node pools at once exhausted my GCP quota. My new node pool failed to start, but Terraform merrily went ahead and deleted the default pool anyway.
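For context, the usual Terraform pattern here looks something like the sketch below. This is a minimal illustration only: the resource names, region variable, node count and machine type are assumptions, not my actual config.

resource "google_container_cluster" "spark" {
  name     = "spark-cluster"
  location = var.region

  # GKE always creates a default node pool with the cluster; this tells the
  # provider to delete it once the cluster is up, so a custom pool can replace it.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "spark_nodes" {
  # Hypothetical name, size and machine type, for illustration only.
  name       = "spark-node-pool"
  cluster    = google_container_cluster.spark.name
  location   = var.region
  node_count = 3

  node_config {
    machine_type = "e2-standard-4"
  }
}

The catch is that the custom pool is created while the default pool still exists, so the project briefly needs enough quota to cover both at once.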
You can see your regional quotas with:
gcloud compute regions describe $REGION
and node pools with:
gcloud container node-pools describe default-pool --cluster $CLUSTER_NAME --region $REGION --project $PROJECT