Tuesday, November 18, 2025

Snowflake and AWS

This is how you get Snowflake to talk to your AWS real estate. Before we start, get your AWS account ID with:

aws sts get-caller-identity 

The Account field in the output is your AWS account ID; this will be used as your GLUE_CATALOG_ID (see below).
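
If you just want the account number on its own, the --query flag does the trick:

aws sts get-caller-identity --query Account --output text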

Now, you need to create an external volume in Snowflake like this:

CREATE OR REPLACE EXTERNAL VOLUME YOUR_VOLUME_NAME
  STORAGE_LOCATIONS = (
    ( NAME = 'eu-west-2'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://ROOT_DIRECTORY_OF_TABLE/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'
    )
  )
  ALLOW_WRITES = FALSE;  

You run this even though you have yet to create the role. Then run:

select system$verify_external_volume('YOUR_VOLUME_NAME');

This will give you some JSON that includes a STORAGE_AWS_IAM_USER_ARN. You never create this user; Snowflake creates it itself. Its ARN is what you need when creating a role in AWS that allows Snowflake's user to see your data.

You create this role with an ordinary aws iam create-role --role-name S3ReadWriteRoleSF --assume-role-policy-document... using the ARN that we got from Snowflake, above. That is, your Snowflake instance has its own AWS user and you must give that user access to your real estate.
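
For reference, the trust policy you pass to --assume-role-policy-document typically looks something like the sketch below. The file name is arbitrary, and the placeholders are the STORAGE_AWS_IAM_USER_ARN from the verify call plus the STORAGE_AWS_EXTERNAL_ID that Snowflake also reports (DESC EXTERNAL VOLUME shows it); the external ID condition is optional but a good idea:

# write a minimal trust policy allowing Snowflake's IAM user to assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "STORAGE_AWS_IAM_USER_ARN" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "STORAGE_AWS_EXTERNAL_ID" } }
  }]
}
EOF
# ROLE_NAME is whatever you used in the external volume's STORAGE_AWS_ROLE_ARN
aws iam create-role --role-name ROLE_NAME --assume-role-policy-document file://trust-policy.json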

Now, give Snowflake access to your cloud assets with:

aws iam put-role-policy --role-name ROLE_NAME --policy-name GlueReadAccess --policy-document file://glue-read-policy.json

Where glue-read-policy.json just contains the Actions needed to talk to Glue.
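
Something like the following should be enough (a sketch only; the exact set of actions Snowflake needs may differ, and you can tighten Resource to your specific Glue catalog, database and table ARNs):

cat > glue-read-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "glue:GetDatabase",
      "glue:GetDatabases",
      "glue:GetTable",
      "glue:GetTables"
    ],
    "Resource": "*"
  }]
}
EOF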

Finally, we create the Glue catalog integration (note that this is not a REST catalog like Polaris, but Glue):

CREATE OR REPLACE CATALOG INTEGRATION CATALOG_NAME
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'YOUR_DB_NAME'
  GLUE_CATALOG_ID = 'GLUE_CATALOG_ID'
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::GLUE_CATALOG_ID:role/ROLE_NAME'
  GLUE_REGION = 'eu-west-2'
  ENABLED = TRUE;

Now you bring all these threads together when you create a table with:

CREATE ICEBERG TABLE arbitrary_name 
  EXTERNAL_VOLUME = 'YOUR_VOLUME_NAME'
  CATALOG = 'CATALOG_NAME'
  CATALOG_TABLE_NAME = 'TABLE_NAME';

By contrast, you create a REST catalog integration (here, Polaris) with:

CREATE OR REPLACE CATALOG INTEGRATION polaris_int
    CATALOG_SOURCE = POLARIS
    TABLE_FORMAT = ICEBERG
    REST_CONFIG = (
        CATALOG_URI = 'https://YOUR_HOST:8181/api/catalog/v1/'
        )
    REST_AUTHENTICATION = (
        TYPE = BEARER
        BEARER_TOKEN = 'TOKEN'
    )
    ENABLED = TRUE;

Note that the URI must use HTTPS, not HTTP.

Saturday, November 15, 2025

Debugging Google Cloud Kubernetes

A problem I was having when spinning up a K8s cluster and then trying to deploy my own Polaris was that the pod was stuck in the Pending state. A quick kubectl describe pod gave the last event as "Pod didn't trigger scale-up:"

So, let's look at the events (a.k.a. operations):

gcloud container operations list --project $PROJECT

Then to drill down on the operation of interest:

gcloud container operations describe operation-XXX --region $REGION --project $PROJECT

It seemed pretty quiet. The last two events were:
  • CREATE_CLUSTER began at 16:35:38 and ran to 16:41:37
  • DELETE_NODE_POOL started at 16:41:41 and ran to 16:46:02
So, that delete came hot on the heels of the cluster successfully being created. I looked at the logs with:

gcloud logging read "resource.labels.cluster_name=spark-cluster AND timestamp>=\"2025-11-14T16:41:35Z\" AND timestamp<=\"2025-11-14T16:41:42Z\"" --project=$PROJECT --limit 10 --order=desc 

and one of these logs looked like this:

  requestMetadata:
    callerSuppliedUserAgent: google-api-go-client/0.5 Terraform/1.10.7 (+https://www.terraform.io)
      Terraform-Plugin-SDK/2.36.0 terraform-provider-google/dev6,gzip(gfe)
...
  response:
    operationType: DELETE_NODE_POOL

This was saying that the DELETE_NODE_POOL originated from my own Terraform-Plugin-SDK! And the reason for that was my Terraform had:

        "remove_default_node_pool": true

It did this because it then tried to create its own node pool. However, it seems that having two node pools at once exhausted the GCP quotas. My new node pool failed to start but TF merrily went ahead and deleted the default pool anyway.

You can see quotas with:

gcloud compute regions describe $REGION
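
To narrow that down to just the quota figures, something like this should work:

gcloud compute regions describe $REGION --format="json(quotas)"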

You can see node pools with:

gcloud container node-pools describe default-pool --cluster $CLUSTER_NAME --region $REGION --project $PROJECT

Wednesday, November 5, 2025

Spark Operator

I've found that managing Spark clusters in Kubernetes is far easier using the Spark Operator. Here are some commands that helped me diagnose issues.

Dude, where's my application?

List your Spark applications with:

kubectl get sparkapplications

It can be annoying when you can't delete a sparkapplication with

kubectl delete sparkapplication YOUR_APP

even though it's running. In my case, I thought

kubectl rollout restart deployment spark-kubernetes-operator

left an orphaned cluster.

It's possible that kubectl get sparkapplications shows nothing even though there are Spark pods clearly there. In this case:

kubectl describe pod POD_NAME

and you should see something like:

...
Controlled By:  StatefulSet/XXX
...

Great, so it looks like the Spark Operator has set the cluster up by delegating to Kubernetes primitives. Let's see them:

kubectl get statefulsets

and then we can just:

kubectl delete statefulset XXX

OK, so, dude, where's my cluster?

But we're barking up the wrong tree. The YAML to create a cluster has kind: SparkCluster so we're using the wrong CRD with sparkapplications.

kubectl get crd | grep spark
sparkclusters.spark.apache.org                              2025-11-04T10:52:56Z
...

Right, so now:

kubectl delete sparkclusters YOUR_CLUSTER
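
A quick sanity check that everything has actually gone away:

kubectl get sparkclusters
kubectl get statefulsets
kubectl get pods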

Python

As a little aside, I was seeing strange errors when running PySpark commands that appeared to be a versioning problem. A few commands that came in useful were:

import sys
print(sys.path)

to print where the Python executable was getting its libraries from and:

from pyspark.version import __version__
print(__version__)

to make sure we really did have the correct PySpark version. 

As it happened, it was the wrong version of the Iceberg runtime in spark.jars.packages.
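
Pinning the runtime explicitly avoids that class of problem. The coordinates below are purely illustrative; match the Spark minor version and Scala version to whatever your cluster is actually running:

# e.g. Spark 3.5 built for Scala 2.12
pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1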

Monday, November 3, 2025

AWS, Kubernetes and more

Setting up a 3 node Kubernetes cluster in AWS is as simple as:

eksctl create cluster --name $CLUSTERNAME --nodes 3

but this really hides a huge amount of what is going on. Apart from IAM, eksctl automatically creates: 
  • a new Virtual Private Cloud (VPC) in which sit the K8s control plane and workers. A VPC is "a logically isolated and secure network environment that is separate from the rest of the AWS cloud" [1]
  • two public subnets and two private subnets (best practice if you want high availability). By putting worker nodes in the private subnet, they cannot be maliciously scanned from the internet.
  • all necessary NAT Gateways to allow the private subnets to access the internet
  • Internet Gateways allowing the internet to talk to your public subnets.
  • Route Tables, which are just rules for network traffic: "Routers use a route table to determine the best path for data packets to take between networks" [2]
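
If you want to see all of that spelled out before anything is created, eksctl can print the config it would use without actually building the cluster:

eksctl create cluster --name $CLUSTERNAME --nodes 3 --dry-run
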
You can see some details with:

$ eksctl get cluster --name=$CLUSTERNAME --region=$REGION
NAME VERSION STATUS CREATED VPC SUBNETS SECURITYGROUPS PROVIDER
spark-cluster 1.32 ACTIVE 2025-10-27T10:36:02Z vpc-REDACTED subnet-REDACTED,subnet-REDACTED,subnet-REDACTED,subnet-REDACTED,subnet-REDACTED,subnet-REDACTED sg-REDACTED EKS

Terraform

If you use Terraform, you might need to configure your local kubectl to talk to the EKS cluster by hand.

First, back up your old config with:

mv ~/.kube ~/.kube_bk

then run:

aws eks update-kubeconfig --name $CLUSTERNAME --region $REGION

But if you are running aws via Docker, this will have updated ~/.kube/config in the container, not the host. So, run:

docker run --rm -it  -v ~/.aws:/root/.aws -v ~/.kube:/root/.kube  amazon/aws-cli eks update-kubeconfig --name $CLUSTERNAME --region $REGION

Now it will write to your host's config but even then you'll have to change the command at the end of the file to point to a non-Docker version (yes, you'll have to install the AWS binary - preferably in a bespoke directory so you can continue using the Docker version).
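
A quick way to check which binary the kubeconfig will actually invoke:

kubectl config view --minify -o jsonpath='{.users[0].user.exec.command}'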

Another issue I had was that the connection details for the new EKS cluster were different from what was in my ~/.kube/config. This in itself was not a problem as you can put in (using Java and CDKTF):

LocalExecProvisioner.builder()
    .when("create") // Run only when the resource is created
    .command(String.format(
        "aws eks update-kubeconfig --name %s --region %s",
        CLUSTER_NAME,
        AWS_REGION)
    )
    .type("local-exec")
    .build()

which depends on the EksCluster and the DataAwsEksClusterAuth and, in turn, the failing resources are made to depend on it.

However, this introduced other problems. 

First, I tried to get the reading of ~/.kube/config to depends_on the EKS cluster. That way, I'd only read it once the cluster was up and running, right? Well, no. This introduces a circular dependency as it's read before the cluster is started.

Any fiddling with the dependency tree leads to reading ~/.kube/config when it's stale. So, you need to initialize the Kubernetes provider configuration (which appears to be global and otherwise implicit) directly with:

// The cluster's CA certificate comes back base64-encoded from the EKS resource
String base64CertData = cluster.getCertificateAuthority().get(0).getData();
String decodedCert    = com.hashicorp.cdktf.Fn.base64decode(base64CertData);
// Configure the Kubernetes provider straight from the cluster outputs rather than ~/.kube/config
KubernetesProvider kubernetesProvider = KubernetesProvider.Builder.create(this, "kubernetes")
    .host(cluster.getEndpoint())
    .clusterCaCertificate(decodedCert)
    .token(eksAuthData.getToken()) // Dynamically generated token
    .build();

Strangely, you still need to define the environment variable KUBE_CONFIG_PATH as some resources need it, albeit only after the file has been correctly amended with the current cluster's details.
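
A minimal sketch, assuming the config lives in the default location:

export KUBE_CONFIG_PATH=$HOME/.kube/config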

Zombie Clusters

Running:

tofu destroy -auto-approve

just kept hanging. So, I ran:

tofu state list | grep -E "(nat_gateway|eip|eks_cluster)"

and found some EKS components running that I had to kill with:

tofu destroy -auto-approve -target=...
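
The -target addresses are whatever tofu state list printed; these are made-up examples only:

# hypothetical resource addresses -- substitute the ones from your own state
tofu destroy -auto-approve -target=aws_eks_cluster.this -target=aws_nat_gateway.this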

Finally, kubectl get pods barfed with no such host.

Load balancers

The next problem was that the tofu destroy action kept saying:

aws_subnet.publicSubnet2: Still destroying... [id=subnet-XXX, 11m50s elapsed]

So, I ran:

aws ec2 describe-network-interfaces \
    --filters "Name=subnet-id,Values=subnet-XXX" \
    --query "NetworkInterfaces[].[NetworkInterfaceId, Description, InterfaceType, Status]" \
    --output table

and got an ENI that I tried to delete with:

aws ec2 delete-network-interface --network-interface-id eni-XXX

only to be told that it was still in use. Ho, hum:

$ aws ec2 describe-network-interfaces \
    --network-interface-ids eni-XXX \
    --query "NetworkInterfaces[0].{ID:NetworkInterfaceId, Description:Description, Status:Status, Attachment:Attachment}" \
    --output json
...
        "InstanceOwnerId": "amazon-elb",
...

So, let's see what that load balancer is:

$ aws elb describe-load-balancers \
    --query "LoadBalancerDescriptions[?contains(Subnets, 'subnet-XXX')].[LoadBalancerName]" \
    --output text

which gives me its name and now I can kill it with:

aws elb delete-load-balancer --load-balancer-name NAME

Finally, the destroy just wasn't working, failing ultimately with:

│ Error: deleting EC2 VPC (vpc-XXX): operation error EC2: DeleteVpc, https response error StatusCode: 400, RequestID: 8412a305-..., api error DependencyViolation: The vpc 'vpc-XXX' has dependencies and cannot be deleted.

Just going into the web console and deleting it there was the simple but curious solution.

[1] Architecting AWS with Terraform
[2] The Self-Taught Cloud Computing Engineer