Monday, October 20, 2025

Spark and K8s on AWS

This is a high-level overview of my work creating a Spark cluster running in a managed AWS Kubernetes cluster (EKS) and giving Spark the permissions to write to cross-cloud storage. I might write further posts on the individual pieces.

EKS

Create a Kubernetes cluster with:

eksctl create cluster --name spark-cluster --nodes 3

Note that this can take about 15 minutes.

Note too that you can have several K8s contexts on your laptop. You can list them with:

kubectl config get-contexts

and choose one with:

kubectl config use-context <your-cluster-context-name>

then you can run kubectl get pods -A and see the pods running in your cloud cluster.

One caveat: I had to install aws-iam-authenticator by building it from source, which also required installing Go. The resulting binaries land in ~/go/bin.
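
If you already have the Go toolchain, the build should be a one-liner. This is the route I'd expect to work, assuming the project's usual sigs.k8s.io module path:

go install sigs.k8s.io/aws-iam-authenticator/cmd/aws-iam-authenticator@latest
export PATH="$HOME/go/bin:$PATH"   # make the freshly built binary visible
aws-iam-authenticator version      # sanity check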

Authorization

The hard thing about setting up a Kubernetes cluster in AWS is configuring permissions. This guide helped me, but it was specifically the section Option 2: Using EKS Pod Identity that was relevant to my setup.

Basically, you have to configure a bridge between the K8s cluster and AWS. This is done through a CSI (Container Storage Interface) driver, a standard K8s mechanism for storage that in this case serves as a means of mounting secrets (the Secrets Store CSI Driver with its AWS provider).
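
For reference, installing the pieces looks something like the following. These commands are taken from the Secrets Store CSI Driver and AWS provider docs; spark-cluster is my cluster name from earlier:

# The generic Secrets Store CSI Driver
helm repo add secrets-store-csi-driver https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm install csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver --namespace kube-system

# The AWS provider that lets the driver talk to Secrets Manager
kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml

# The EKS Pod Identity agent that vends AWS credentials to pods
aws eks create-addon --cluster-name spark-cluster --addon-name eks-pod-identity-agent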

The high-level recipe for using secrets is as follows (both the CLI and the K8s halves are sketched below):
  1. Create the secret with aws secretsmanager create-secret ...
  2. Create a policy with aws iam create-policy ... that references the ARN of the secret from step 1.
  3. Create a role with aws iam create-role ... whose trust policy allows it to be assumed via STS.
  4. Attach the policy from step 2 to the role from step 3 with aws iam attach-role-policy ...
  5. Create a SecretProviderClass with kubectl apply -f ... that references the secret created in step 1.
  6. Associate your K8s Deployment with the SecretProviderClass using its volumes.
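
Steps 1 to 4 in shell form might look like this. It's a minimal sketch: CloudDetails, spark-secrets-policy, spark-secrets-role and spark-sa are hypothetical names, and the trust policy assumes EKS Pod Identity is doing the assuming:

# 1. Store the secret
aws secretsmanager create-secret --name CloudDetails --secret-string file://cloud-details.json

# 2. A policy that can read exactly that secret (paste in the ARN from step 1)
aws iam create-policy --policy-name spark-secrets-policy --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
    "Resource": "<secret ARN from step 1>"
  }]
}'

# 3. A role whose trust policy lets EKS Pod Identity assume it via STS
aws iam create-role --role-name spark-secrets-role --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "pods.eks.amazonaws.com"},
    "Action": ["sts:AssumeRole", "sts:TagSession"]
  }]
}'

# 4. Glue the policy and the role together
aws iam attach-role-policy --role-name spark-secrets-role --policy-arn <policy ARN from step 2>

# And tell EKS which service account maps to the role
aws eks create-pod-identity-association --cluster-name spark-cluster \
  --namespace default --service-account spark-sa --role-arn <role ARN from step 3>
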
Thanks to the installation of the CSI addons, secrets can be mounted in a Polaris container and then used to vend credentials to Spark.
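
Concretely, steps 5 and 6 might look like the manifests below. The SecretProviderClass schema is the Secrets Store CSI Driver's; spark-aws-secrets is another hypothetical name, and the Deployment fragment is trimmed to the relevant parts:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: spark-aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "CloudDetails"
        objectType: "secretsmanager"

And in the Deployment's pod template:

    spec:
      serviceAccountName: spark-sa          # the account associated with the role above
      volumes:
        - name: cloud-secrets
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "spark-aws-secrets"
      containers:
        - name: polaris
          # image etc. elided
          volumeMounts:
            - name: cloud-secrets
              mountPath: "/mnt/secrets"
              readOnly: true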

Diagnosing Spark in the Cluster

Connecting to Spark from my laptop produced these logs:

25/09/23 18:15:58 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...
25/09/23 18:15:58 INFO TransportClientFactory: Successfully created connection to a457XXX.eu-west-2.elb.amazonaws.com/3.9.78.227:7077 after 26 ms (0 ms spent in bootstraps)
25/09/23 18:16:18 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...
25/09/23 18:16:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...

where the connections were timing out. So I logged in with:

kubectl exec -it prod-master-0 -- /bin/bash

and looked at the actual logs under /opt/spark/, which were far more illuminating.
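
If you don't want an interactive shell, something like this does the same diagnosis. I'm assuming the stock image's layout, where the daemon logs land under /opt/spark/logs; the file name is illustrative:

kubectl exec prod-master-0 -- ls /opt/spark/logs
kubectl exec prod-master-0 -- tail -n 200 '/opt/spark/logs/<the newest .out file>'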

Spark Kubernetes Operator

I used the Spark operator to configure a Spark cluster for me. However, examples/prod-cluster-with-three-workers.yaml seemed to be out of sync with the CRDs installed by Helm(?). The apiVersion seems to need to be:

apiVersion: spark.apache.org/v1alpha1

This change meant I could then start my Spark cluster.
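
For the record, the top of my corrected manifest looked roughly like this. I believe the kind is SparkCluster; the rest of the spec is exactly the example file's, so I won't reproduce it:

# spark_cluster.yaml -- corrected header
apiVersion: spark.apache.org/v1alpha1   # matches the CRDs Helm installed for me
kind: SparkCluster
metadata:
  name: prod
spec:
  # ... as in examples/prod-cluster-with-three-workers.yaml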

Aside: to restart the whole cluster, use:

kubectl delete -f spark_cluster.yaml
kubectl apply  -f spark_cluster.yaml

AWS CLI annoyances

Rather annoyingly, running the aws command adds control characters to its output. I spent a frustrating hour wondering why text copied verbatim from its output could not be used as input for another command.

To remove the invisible control characters, run something like:

aws secretsmanager list-secrets --query "SecretList[?Name=='CloudDetails'].ARN" --output text --no-cli-pager | tr -d '[:cntrl:]' | tr -d '\n' 

where, in this example, I'm trying to find the ARN of a particular secret.
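
The cleaned output can then be captured and reused safely, for example when building the policy in the Authorization recipe above:

# Capture the cleaned ARN so it can be spliced into later commands
SECRET_ARN=$(aws secretsmanager list-secrets \
  --query "SecretList[?Name=='CloudDetails'].ARN" \
  --output text --no-cli-pager | tr -d '[:cntrl:]')
echo "Using secret ARN: $SECRET_ARN"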
