Polaris can vend credentials for the top three cloud providers. That is, your Spark instance does not need to be granted access to the cloud provider itself as long as it can connect to the relevant Polaris catalog.
There are three steps to configuring vended credentials for Spark:
- Configure your cloud account so that it's happy handing out access tokens
- Configure Polaris: both the credentials used to access Polaris itself and the catalog that is essentially a proxy to the cloud provider
- Configure Spark's SparkConf
Here's what you must do for AWS:
Permissioning Polaris to use AWS tokens
The key Action you need in order to vend tokens is sts:AssumeRole. If you were doing this in the AWS GUI, you'd go to IAM/Roles, select the Role that can read your S3 bucket, click on Trust relationships and grant the user (or, better, a group) the right to assume it with some JSON like:
{
    "Sid": "Statement2",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::985050164616:user/henryp"
    },
    "Action": "sts:AssumeRole"
}
This is only half the job. You then need a reciprocal relationship for, in my case, user/henryp. Here, I go to IAM/Users, find my user and create an inline entry in Permissions policies. This just needs the Statement:
{
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::985050164616:role/myrole"
}
Where myrole is the role that can read the bucket.
Ideally, this would all be automated in scripts rather than point-and-click but I'm at the exploratory stage at the moment.
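In the meantime, one quick way to sanity-check both halves of the relationship is to assume the role yourself with the AWS SDK. This is a minimal sketch, assuming the v2 SDK's sts module is on the classpath and that the default credentials chain resolves to the user named in the trust relationship (user/henryp in my case); the session name is arbitrary:

import software.amazon.awssdk.services.sts.StsClient
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest

// Uses the default credentials provider chain, which should resolve to the
// principal named in the trust relationship.
val sts = StsClient.create()

val response = sts.assumeRole(
  AssumeRoleRequest.builder()
    .roleArn("arn:aws:iam::985050164616:role/myrole")
    .roleSessionName("polaris-vending-test") // arbitrary session name
    .build())

// If both policies are in place, this prints temporary credentials.
val creds = response.credentials()
println(s"Access key ${creds.accessKeyId()} expires at ${creds.expiration()}")

If this call fails with an access denied error, Polaris won't be able to vend credentials either.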
What's going on in Polaris
Calling:
spark.sql(s"CREATE NAMESPACE IF NOT EXISTS $catalog.$namespace")
triggers these steps:
- Spark's CatalogManager.load initializes an Iceberg SparkCatalog, which fetches a token through its HTTPClient via OAuth2Util.fetchToken.
- Polaris's TokenRequestValidator will validateForClientCredentialsFlow and insist that the clientId and clientSecret are neither null nor empty.
- These values come from SparkConf's spark.sql.catalog.$catalog.credential setting after Iceberg's OAuth2Util.parseCredential splits them into ID and secret on the colon. The validator also ensures the scope and grantType are something Polaris recognises.
The upshot is that despite what some guides say, you don't want credential to be BEARER.
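For illustration, this is roughly what that colon-splitting amounts to (the real OAuth2Util.parseCredential in Iceberg returns a richer structure; this sketch is just the gist):

// Approximation of OAuth2Util.parseCredential: a credential containing a
// colon is clientId:clientSecret; without one, the whole string is the secret.
def parseCredential(credential: String): (Option[String], String) =
  credential.split(":", 2) match {
    case Array(id, secret) => (Some(id), secret)
    case _                 => (None, credential)
  }

assert(parseCredential("my-client-id:my-client-secret") == (Some("my-client-id"), "my-client-secret"))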
Configuring Spark
Calling:
spark.createDataFrame(data).writeTo(tableName).create()
does trigger credential vending, as you can see by putting a breakpoint in AwsCredentialStorageIntegration.getSubscopedCreds. This invokes AWS's StsAuthSchemeInterceptor.trySelectAuthScheme and, if you have configured AWS correctly, you'll get some credentials back from a cloud call.
This all comes from a call to IcebergRestCatalogApi.createTable.
Note that it's in DefaultAwsClientFactory that the s3FileIOProperties holding the vended credentials live, after being populated in the call to s3() in the Iceberg codebase.
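Putting the pieces together (and assuming the access delegation header discussed below is set), a minimal end-to-end check looks like this, where the namespace and table names are arbitrary placeholders:

val namespace = "my_namespace"
val tableName = s"$catalog.$namespace.my_table"

spark.sql(s"CREATE NAMESPACE IF NOT EXISTS $catalog.$namespace")

// createTable on the Polaris REST API hands back sub-scoped credentials,
// which Iceberg's S3FileIO then uses for the actual writes to the bucket.
val data = Seq(("a", 1), ("b", 2))
spark.createDataFrame(data).toDF("letter", "number").writeTo(tableName).create()

// Reads go through the same vended-credentials path.
spark.table(tableName).show()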
After a lot of head-scratching, this resource said my SparkConf needed:
.set(s"spark.sql.catalog.$catalog.header.X-Iceberg-Access-Delegation", "vended-credentials")
This is defined here. The other options are remote signing, where it seems Polaris will sign requests itself, and "unknown". But it's important for this to be set, as only then will the table's credentials lookup path be used.
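For reference, a complete catalog configuration along these lines might look roughly like the following. The URI, warehouse and client ID/secret are placeholders for my local setup; only the credential, scope and header settings are discussed above, the rest being standard Iceberg REST catalog options:

import org.apache.spark.SparkConf

val catalog = "polaris"
val conf = new SparkConf()
  .set(s"spark.sql.catalog.$catalog", "org.apache.iceberg.spark.SparkCatalog")
  .set(s"spark.sql.catalog.$catalog.type", "rest")
  .set(s"spark.sql.catalog.$catalog.uri", "http://localhost:8181/api/catalog")
  .set(s"spark.sql.catalog.$catalog.warehouse", "my_warehouse")
  // clientId:clientSecret, as split by OAuth2Util.parseCredential - not BEARER
  .set(s"spark.sql.catalog.$catalog.credential", "my-client-id:my-client-secret")
  // a scope Polaris recognises
  .set(s"spark.sql.catalog.$catalog.scope", "PRINCIPAL_ROLE:ALL")
  // without this header, the table's credentials lookup path is never used
  .set(s"spark.sql.catalog.$catalog.header.X-Iceberg-Access-Delegation", "vended-credentials")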