Saturday, November 15, 2025

Debugging Google Cloud Kubernetes

A problem I was having when spinning up a K8s cluster and then trying to deploy my own Polaris was that the pod got stuck in the Pending state. A quick kubectl describe pod gave the last event as "Pod didn't trigger scale-up:".
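For the record, the kubectl side of the investigation looked roughly like this (the pod name here is hypothetical):

kubectl get pods

kubectl describe pod polaris-0

kubectl get events --sort-by=.lastTimestamp

The Events section at the bottom of the describe output is where "Pod didn't trigger scale-up" appeared. That event comes from the cluster autoscaler, so it was telling me that no node pool could be scaled up to fit the pod.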

So, let's look at the events (a.k.a. operations):

gcloud container operations list --project $PROJECT
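If the list is long, gcloud's global --filter and --sort-by flags can narrow it down. The field name below is my assumption, taken from the Operation resource in the GKE API:

gcloud container operations list --project $PROJECT --filter="operationType=DELETE_NODE_POOL" --sort-by=startTime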

Then to drill down on the operation of interest:

gcloud container operations describe operation-XXX --region $REGION --project $PROJECT

It seemed pretty quiet. The last two events were:
  • CREATE_CLUSTER began at 16:35:38 and ran to 16:41:37
  • DELETE_NODE_POOL started at 16:41:41 and ran to 16:46:02
So, that delete came hot on the heels of the cluster being successfully created. I looked at the logs with:

gcloud logging read "resource.labels.cluster_name=spark-cluster AND timestamp>=\"2025-11-14T16:41:35Z\" AND timestamp<=\"2025-11-14T16:41:42Z\"" --project=$PROJECT --limit 10 --order=desc 
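Bracketing by timestamp worked, but in hindsight filtering on the audit log's method name is more direct. A sketch; the methodName value is my assumption about what GKE writes into these audit entries:

gcloud logging read "protoPayload.methodName:\"DeleteNodePool\" AND resource.labels.cluster_name=\"spark-cluster\"" --project=$PROJECT --limit 5 --format=yaml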

One of the entries the timestamp query returned looked like this:

  requestMetadata:
    callerSuppliedUserAgent: google-api-go-client/0.5 Terraform/1.10.7 (+https://www.terraform.io)
      Terraform-Plugin-SDK/2.36.0 terraform-provider-google/dev6,gzip(gfe)
...
  response:
    operationType: DELETE_NODE_POOL

This was saying that the DELETE_NODE_POOL originated from my own Terraform run (note the Terraform-Plugin-SDK in callerSuppliedUserAgent)! And the reason for that was that my Terraform had:

        "remove_default_node_pool": true

It did this because it was about to create its own node pool to replace the default one. However, it seems that having two node pools at once exhausted my GCP quotas: the nodes in my new pool failed to start, but Terraform merrily went ahead and deleted the default pool anyway.
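For context, this is the standard Terraform pattern for GKE, because a cluster cannot be created with zero node pools: you create the smallest possible default pool, throw it away, and manage your own pool as a separate resource. A stripped-down sketch (resource names and counts are made up, not my actual config):

resource "google_container_cluster" "spark" {
  name     = "spark-cluster"
  location = var.region

  # A cluster must be born with a node pool, so create the smallest
  # possible default pool and delete it as soon as the cluster is up.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name       = "primary-pool"
  cluster    = google_container_cluster.spark.name
  location   = var.region
  node_count = 3  # these nodes count against regional quotas (CPUs, disks, addresses)
}

The catch is the ordering: once the cluster exists, the default pool gets deleted regardless of whether the replacement pool's nodes ever manage to start.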

You can see quotas with:

gcloud compute regions describe $REGION
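That prints a lot; to see just the quota block, a gcloud format projection should do it:

gcloud compute regions describe $REGION --project $PROJECT --format="yaml(quotas)"

The likely suspects for nodes failing to start are metrics like CPUS, IN_USE_ADDRESSES and the disk quotas.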

You can see node pools with:

gcloud container node-pools describe default-pool --cluster $CLUSTER_NAME --region $REGION --project $PROJECT
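And to see every pool on the cluster at once:

gcloud container node-pools list --cluster $CLUSTER_NAME --region $REGION --project $PROJECT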
