This is a high-level overview of my work creating a Spark cluster running in a managed AWS Kubernetes cluster (EKS) and giving Spark the permissions to write to cross-cloud storage. I might write further posts on this later.
EKS
eksctl create cluster --name spark-cluster --nodes 3
Note that this can take about 15 minutes.
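If your kubeconfig doesn't already have an entry for the new cluster (eksctl normally adds one for you), something like this should create it. I'm assuming the eu-west-2 region that appears in the logs later in this post:
aws eks update-kubeconfig --name spark-cluster --region eu-west-2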
Note that you can have several K8s contexts on your laptop. You can see them with:
kubectl config get-contexts
and choose one with:
kubectl config use-context <your-cluster-context-name>
then you can run kubectl get pods -A and see K8s in the cloud.
A couple of caveats: I had to install aws-iam-authenticator by building it from source, which also required me to install Go. The resulting binaries are installed in ~/go/bin.
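For what it's worth, one way to do the build boils down to something like the following (assuming Go is already installed; pinning @latest is my assumption here, use whatever release you need):
# Build and install aws-iam-authenticator from source; the binary lands in ~/go/bin
go install sigs.k8s.io/aws-iam-authenticator/cmd/aws-iam-authenticator@latest
# Make sure ~/go/bin is on your PATH so kubectl can find the authenticator
export PATH="$PATH:$HOME/go/bin"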
Authorization
The hard thing about setting up a Kubernetes cluster in AWS is configuring permissions. This guide helped me, although it was specifically the section Option 2: Using EKS Pod Identity that I needed.
Basically, you have to configure a bridge between the K8s cluster and AWS. This is done through a CSI (Container Storage Interface) driver, a standard K8s mechanism for storage, although in this case it is used as a means of mounting secrets.
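For reference, one common way of installing the Secrets Store CSI Driver and its AWS provider is via Helm. This is just a sketch; the chart names and namespaces here are the upstream defaults and may differ from whatever the guide (or an EKS add-on) installs:
# The Secrets Store CSI Driver itself (the generic secrets-as-volumes machinery)
helm repo add secrets-store-csi-driver https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm install csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver --namespace kube-system
# The AWS provider that lets the driver fetch from Secrets Manager / Parameter Store
helm repo add aws-secrets-manager https://aws.github.io/secrets-store-csi-driver-provider-aws
helm install secrets-provider-aws aws-secrets-manager/secrets-store-csi-driver-provider-aws --namespace kube-system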
The high-level recipe for using secrets is (a rough sketch with example commands follows the list):
- Create the secret with aws secretsmanager create-secret ...
- Create a policy with aws iam create-policy... This references the ARN of the secret created in step 1.
- Create a role with aws iam create-role... that can be assumed via STS.
- Attach the policy from step 2 to the role from step 3 with aws iam attach-role-policy...
- Create a SecretProviderClass with kubectl apply -f... that references the secret created in step 1.
- Associate your K8s Deployment with the SecretProviderClass using its volumes.
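Putting that recipe together, a minimal sketch might look like the following. CloudDetails is the secret name used later in this post; everything else (the policy, role and SecretProviderClass names, and the policy and trust documents) is a placeholder for illustration:
# 1. Create the secret (the JSON payload here is just an example)
aws secretsmanager create-secret \
    --name CloudDetails \
    --secret-string '{"accessKeyId":"...","secretAccessKey":"..."}'
# 2. Create a policy allowing secretsmanager:GetSecretValue on the ARN from step 1
#    (policy.json must reference that ARN)
aws iam create-policy \
    --policy-name read-cloud-details \
    --policy-document file://policy.json
# 3. Create a role that can be assumed via STS; for EKS Pod Identity the trust
#    document allows the pods.eks.amazonaws.com service principal
aws iam create-role \
    --role-name spark-secrets-role \
    --assume-role-policy-document file://trust-policy.json
# 4. Attach the policy to the role
aws iam attach-role-policy \
    --role-name spark-secrets-role \
    --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/read-cloud-details
# 5. Create a SecretProviderClass that references the secret
kubectl apply -f - <<EOF
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: cloud-details-spc
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "CloudDetails"
        objectType: "secretsmanager"
EOF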
Thanks to the installation of the CSI add-ons, secrets can be mounted into the Polaris container and then used to vend credentials to Spark.
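For the last step, the association looks something like this in the Deployment's pod spec. This is a made-up, minimal Deployment (the image, service account name and mount path are all assumptions) just to show where the SecretProviderClass is referenced:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: polaris
spec:
  replicas: 1
  selector:
    matchLabels:
      app: polaris
  template:
    metadata:
      labels:
        app: polaris
    spec:
      serviceAccountName: polaris-sa        # the service account tied to the IAM role above
      containers:
        - name: polaris
          image: apache/polaris:latest      # assumed image name
          volumeMounts:
            - name: cloud-details
              mountPath: /mnt/secrets       # secrets appear here as files
              readOnly: true
      volumes:
        - name: cloud-details
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: cloud-details-spc
EOF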
Diagnosing Spark in the Cluster
Connecting to Spark from my laptop produced these logs:
25/09/23 18:15:58 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...
25/09/23 18:15:58 INFO TransportClientFactory: Successfully created connection to a457XXX.eu-west-2.elb.amazonaws.com/3.9.78.227:7077 after 26 ms (0 ms spent in bootstraps)
25/09/23 18:16:18 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...
25/09/23 18:16:38 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://a457XXX.eu-west-2.elb.amazonaws.com:7077...
where the connections were timing out. So I logged into the master pod with:
kubectl exec -it prod-master-0 -- /bin/bash
and looked at the actual logs under /opt/spark/, which were far more illuminating.
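If you just want a quick look without an interactive shell, something like this works too (the exact log file names and locations will vary):
# List candidate log files under /opt/spark, then tail whichever looks relevant
kubectl exec prod-master-0 -- sh -c 'find /opt/spark -name "*.log" -o -name "*.out"'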
Spark Kubernetes Operator
I used the Spark operator to configure a Spark cluster for me. However, examples/prod-cluster-with-three-workers.yaml seemed to be out of sync with the CRDs installed by Helm(?). The apiVersion appears to need to be:
apiVersion: spark.apache.org/v1alpha1
This change meant I could then start my Spark cluster.
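If you hit the same mismatch, it's worth asking the cluster which versions the installed CRD actually serves. Here I'm assuming the resource kind is SparkCluster, so the CRD would be named sparkclusters.spark.apache.org:
kubectl get crd sparkclusters.spark.apache.org -o jsonpath='{.spec.versions[*].name}'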
Aside: to restart the whole cluster, use:
kubectl delete -f spark_cluster.yaml
kubectl apply -f spark_cluster.yaml
AWS CLI annoyances
Rather annoyingly, running the aws command adds control characters to its output. I spent a frustrating hour wondering why text copied verbatim from its output could not be used as input for another command.
To remove the invisible control characters, run something like:
aws secretsmanager list-secrets --query "SecretList[?Name=='CloudDetails'].ARN" --output text --no-cli-pager | tr -d '[:cntrl:]' | tr -d '\n'
where, in this example, I was trying to find the ARN of a particular secret.