To vend credentials, Polaris needs an AWS (or other cloud provider) account. But what if you want to talk to several AWS accounts? This ticket suggests an interesting workaround: use just one AWS account, and if you need to reach others, set up a role in that account that is allowed to assume roles in accounts outside the one it lives in.
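As a sketch of what that workaround looks like in IAM terms: the role Polaris uses gets a policy allowing it to assume a role in the second account (the account IDs and role names below are made-up placeholders, not anything from the ticket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::222222222222:role/polaris-storage-access"
    }
  ]
}
```

The target role in account 222222222222 needs a matching trust policy that names the Polaris account (or role) as a trusted principal, otherwise the AssumeRole call is refused.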
We work in a cross-cloud environment, talking not just to AWS but also to GCP and Azure. We happen to host Polaris in AWS, but that choice was arbitrary: we can give Polaris the ability to vend credentials for all three clouds no matter where it sits.
Integration with Spark
It's the spark.sql.catalog.YOUR_CATALOG.warehouse SparkConf value that identifies the Polaris catalog. YOUR_CATALOG defines the namespace, and the top-level value, spark.sql.catalog.YOUR_CATALOG, tells Spark which catalog implementation to use (Hive, Polaris, etc).
So, basically, your config should look something like:
spark.sql.catalog.azure                                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.azure.type                                rest
spark.sql.catalog.azure.uri                                 http://localhost:8181/api/catalog
spark.sql.catalog.azure.warehouse                           azure
spark.sql.catalog.azure.credential                          root:s3cr3t
spark.sql.catalog.azure.client_id                           root
spark.sql.catalog.azure.client_secret                       s3cr3t
spark.sql.catalog.azure.token                               POLARIS_ACCESS_TOKEN
spark.sql.catalog.azure.oauth2.token                        POLARIS_ACCESS_TOKEN
spark.sql.catalog.azure.scope                               PRINCIPAL_ROLE:ALL
spark.sql.catalog.azure.rest.auth.oauth2.scope              PRINCIPAL_ROLE:ALL
spark.sql.catalog.azure.header.X-Iceberg-Access-Delegation  vended-credentials
spark.sql.catalog.azure.cache-enabled                       false
This is the config specific to my Azure catalog. AWS and GCP would have very similar config.
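To try this interactively, the same settings can be passed as --conf flags to the Spark SQL shell. This is a sketch assuming a local Spark 3.5 installation; the Iceberg runtime artifact and its version are illustrative, not something mandated by Polaris:

```
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
  --conf spark.sql.catalog.azure=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.azure.type=rest \
  --conf spark.sql.catalog.azure.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.azure.warehouse=azure \
  --conf spark.sql.catalog.azure.credential=root:s3cr3t \
  --conf spark.sql.catalog.azure.scope=PRINCIPAL_ROLE:ALL \
  --conf spark.sql.catalog.azure.header.X-Iceberg-Access-Delegation=vended-credentials
```

From there, `USE azure;` selects the Polaris-backed catalog for subsequent SQL.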
Local Debugging
In build.gradle.kts, add the JDWP agent flag to the jvmArgs of the quarkusRun task:

tasks.named<QuarkusRun>("quarkusRun") {
    jvmArgs =
        listOf(
            "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
        )
}

then run with:

./gradlew --stop && ./gradlew run

and you'll be able to debug remotely by attaching to port 5005.
Polaris in integration tests
Naturally, you're going to want a suite of regression tests. This is where the wonderful Testcontainers shines: you can fire up a Docker container of Polaris from Java code.

There are some configuration wrinkles. AWS and Azure are easy to configure within Polaris: you just pass the credentials as environment variables. GCP is a little harder, as it expects a JSON file containing its credentials (the Application Default Credentials file). Fortunately, Testcontainers lets you copy that file over once the container has started running.
myContainer = new GenericContainer<>("apache/polaris:1.1.0-incubating")
        // AWS
        .withEnv("AWS_ACCESS_KEY_ID", AWS_ACCESS_KEY_ID)
        .withEnv("AWS_SECRET_ACCESS_KEY", AWS_SECRET_ACCESS_KEY)
        // Azure
        .withEnv("AZURE_CLIENT_SECRET", AZURE_CLIENT_SECRET)
        .withEnv("AZURE_CLIENT_ID", AZURE_CLIENT_ID)
        .withEnv("AZURE_TENANT_ID", AZURE_TENANT_ID)
        // Polaris bootstrap credentials
        .withEnv("POLARIS_ID", POLARIS_ID)
        .withEnv("POLARIS_SECRET", POLARIS_SECRET)
        .withEnv("POLARIS_BOOTSTRAP_CREDENTIALS", format("POLARIS,%s,%s", POLARIS_ID, POLARIS_SECRET))
        // GCP: points at the Application Default Credentials file we copy in below
        .withEnv("GOOGLE_APPLICATION_CREDENTIALS", GOOGLE_FILE)
        // block until the health endpoint says the container is up
        .waitingFor(Wait.forHttp("/q/health").forPort(8182).forStatusCode(200));
myContainer.setPortBindings(List.of("8181:8181", "8182:8182"));
myContainer.start();
// GCP wants a file inside the container, not an environment variable value
myContainer.copyFileToContainer(Transferable.of(googleCreds.getBytes()), GOOGLE_FILE);
The other thing you want for a reliable suite of tests is to wait until Polaris starts. Fortunately, Polaris is cloud native and offers a health endpoint which TestContainers can poll.
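Once the health check passes, the tests can authenticate against Polaris. The token endpoint used below is the Iceberg REST OAuth path that Polaris exposes; this is a minimal sketch, assuming the root/s3cr3t bootstrap credentials from the container config above:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import static java.nio.charset.StandardCharsets.UTF_8;

public class TokenRequest {
    // Build the x-www-form-urlencoded body for a client_credentials grant
    public static String formBody(String clientId, String clientSecret, String scope) {
        return "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, UTF_8)
                + "&scope=" + URLEncoder.encode(scope, UTF_8);
    }

    public static void main(String[] args) {
        String body = formBody("root", "s3cr3t", "PRINCIPAL_ROLE:ALL");
        // POST this to the token endpoint once the container reports healthy;
        // send with HttpClient.newHttpClient().send(request, BodyHandlers.ofString())
        // and read the "access_token" field from the JSON response.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8181/api/catalog/v1/oauth/tokens"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.uri() + " " + body);
    }
}
```

The resulting access token is what you'd feed into the Spark config above as POLARIS_ACCESS_TOKEN.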
Polaris in EKS
I found I had to mix AWS's own SDK (software.amazon.awssdk:eks:2.34.6) with the official Kubernetes library (io.kubernetes:client-java:24.0.0) before I could interrogate the Kubernetes cluster in AWS from my laptop and look at the logs of the Polaris container.
// Use the AWS SDK to look up the cluster's endpoint and CA certificate
EksClient eksClient = EksClient.builder()
        .region(REGION)
        .credentialsProvider(DefaultCredentialsProvider.create())
        .build();
DescribeClusterResponse clusterInfo = eksClient.describeCluster(
        DescribeClusterRequest.builder().name(clusterName).build());

// Authenticate the Kubernetes client against EKS via STS session credentials
AWSCredentials awsCredentials = new BasicAWSCredentials(
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY);
var authentication = new EKSAuthentication(
        new STSSessionCredentialsProvider(awsCredentials),
        region.toString(),
        clusterName);

// Point the Kubernetes ApiClient at the EKS endpoint, trusting its CA
ApiClient client = new ClientBuilder()
        .setBasePath(clusterInfo.cluster().endpoint())
        .setAuthentication(authentication)
        .setVerifyingSsl(true)
        .setCertificateAuthority(Base64.getDecoder().decode(clusterInfo.cluster().certificateAuthority().data()))
        .build();
Configuration.setDefaultApiClient(client);
Now you'll be able to query and monitor Polaris from outside AWS's Kubernetes offering, EKS.
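From there, finding the Polaris pod and tailing its logs is a couple of calls to the Kubernetes client. This is a sketch using client-java's fluent request builders; the "default" namespace and the app=polaris label selector are assumptions about your deployment, not something Polaris mandates:

```java
import io.kubernetes.client.openapi.ApiException;
import io.kubernetes.client.openapi.apis.CoreV1Api;
import io.kubernetes.client.openapi.models.V1Pod;

// assumes Configuration.setDefaultApiClient(client) has already run, as above
CoreV1Api api = new CoreV1Api();

// find the Polaris pod(s) by label
for (V1Pod pod : api.listNamespacedPod("default")
        .labelSelector("app=polaris")
        .execute()
        .getItems()) {
    // tail the last 100 lines of the pod's log
    String logs = api.readNamespacedPodLog(pod.getMetadata().getName(), "default")
            .tailLines(100)
            .execute();
    System.out.println(logs);
}
```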