Thursday, October 23, 2025
Exactly Once Semantics in Iceberg/Kafka
Monday, October 20, 2025
Spark and K8s on AWS
- Create the secret with aws secretsmanager create-secret ...
- Create a policy with aws iam create-policy... This policy references the ARN of the secret created in step 1.
- Create a role with aws iam create-role... that allows the role to be assumed via STS.
- Attach the policy from step 2 to the role from step 3 with aws iam attach-role-policy...
- Create a SecretProviderClass with kubectl apply -f... that references the secret created in step 1.
- Associate your K8s Deployment with the SecretProviderClass using its volumes (see the sketch below).
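A rough sketch of the whole flow. Every name here (my-db-creds, spark-secrets-policy, spark-secrets-role, the JSON policy files, the account id) is a placeholder, not something from the original notes:

# Step 1: create the secret in Secrets Manager.
aws secretsmanager create-secret --name my-db-creds --secret-string '{"username":"spark","password":"changeme"}'

# Step 2: create a policy granting read access to that secret (policy.json references the secret's ARN).
aws iam create-policy --policy-name spark-secrets-policy --policy-document file://policy.json

# Step 3: create a role whose trust policy (trust.json) allows it to be assumed via STS.
aws iam create-role --role-name spark-secrets-role --assume-role-policy-document file://trust.json

# Step 4: attach the policy to the role.
aws iam attach-role-policy --role-name spark-secrets-role --policy-arn arn:aws:iam::<account_id>:policy/spark-secrets-policy

# Step 5: a SecretProviderClass pointing the Secrets Store CSI driver at the secret.
kubectl apply -f - <<EOF
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: spark-aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "my-db-creds"
        objectType: "secretsmanager"
EOF

# Step 6: mount it in the Deployment via a CSI volume, e.g. under spec.template.spec:
#   volumes:
#     - name: secrets-store
#       csi:
#         driver: secrets-store.csi.k8s.io
#         readOnly: true
#         volumeAttributes:
#           secretProviderClass: "spark-aws-secrets"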
Wednesday, October 15, 2025
Configuring Polaris Part 1
Wednesday, September 17, 2025
Configuring Polaris for Azure
- tenantId. You can find this by running az account show --query tenantId.
- multiTenantAppName. This is the Application (client) ID that was generated when the app was created. You can see it in Microsoft Entra ID -> App Registrations -> All Applications in the Azure portal, or via the CLI: run az ad app list, find the app with the name you created above and use its appId.
- consentUrl. I'm not entirely sure what this is, but it can be generated with APPID=$(az ad app list --display-name "MyMultiTenantApp" --query "[0].appId" -o tsv) && echo "https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=$APPID&response_type=code&redirect_uri=http://localhost:3000/redirect&response_mode=query&scope=https://graph.microsoft.com/.default&state=12345"
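Pulling those three values together in one shell snippet (the app display name and redirect URI are the example values from above):

TENANT_ID=$(az account show --query tenantId -o tsv)
APP_ID=$(az ad app list --display-name "MyMultiTenantApp" --query "[0].appId" -o tsv)
CONSENT_URL="https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=$APP_ID&response_type=code&redirect_uri=http://localhost:3000/redirect&response_mode=query&scope=https://graph.microsoft.com/.default&state=12345"
echo "tenantId:           $TENANT_ID"
echo "multiTenantAppName: $APP_ID"
echo "consentUrl:         $CONSENT_URL"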
Thursday, September 11, 2025
Configuring Polaris for GCP
Configuring Polaris for AWS
- Configure your cloud account such that it's happy handing out access tokens
- Configure Polaris: both the credentials to access Polaris and the Catalog, which is essentially a proxy to the cloud provider.
- Configure Spark's SparkConf (a sketch follows below).
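A minimal sketch of the Spark side, assuming a local Polaris instance on port 8181, a Polaris catalog named my_catalog and the client id/secret created earlier. The catalog alias, URI and credentials are all placeholders:

# Register an Iceberg REST catalog called "polaris" that talks to the Polaris endpoint
# and asks it to vend storage credentials on behalf of Spark.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.polaris.type=rest \
  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
  --conf spark.sql.catalog.polaris.warehouse=my_catalog \
  --conf spark.sql.catalog.polaris.credential=<client_id>:<client_secret> \
  --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials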
Thursday, September 4, 2025
Three things about Docker
docker history --no-trunc apache/polaris
Restoring OS properties
Saturday, August 30, 2025
(It's a) Kind and Strimzi
"NodePort is a Kubernetes Service type designed to make Pods reachable from a port available on the host machine, the worker node. The first thing to understand is that NodePort Services allow us to access a Pod running on a Kubernetes node, on a port of the node itself. After you expose Pods using the NodePort type Service, you’ll be able to reach the Pods by getting the IP address of the node and the port of the NodePort Service, such as <node_ip_address>:<node port>. The port can be declared in your YAML declaration or can be randomly assigned by Kubernetes. Most of the time, the NodePort Service is used as an entry point to your Kubernetes cluster." [The Kubernetes Bible]
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "emulationMajor" (class io.fabric8.kubernetes.client.VersionInfo), not marked as ignorable (9 known properties: "goVersion", "gitTreeState", "platform", "minor", "gitVersion", "gitCommit", "buildDate", "compiler", "major"])
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 4, column: 22] (through reference chain: io.fabric8.kubernetes.client.VersionInfo["emulationMajor"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61) ~[com.fasterxml.jackson.core.jackson-databind-2.16.2.jar:2.16.2]
...
at io.fabric8.kubernetes.client.utils.KubernetesSerialization.unmarshal(KubernetesSerialization.java:257) ~[io.fabric8.kubernetes-client-api-6.13.4.jar:?]
Monday, August 25, 2025
Cloud Architecture
- must be secure
- must be entirely FOSS based
- must be cross-cloud
- allows a bring-your-own policy
Tuesday, August 12, 2025
BERT
Sunday, August 3, 2025
Iceberg Distributions
(Here p is the number of distinct values of partitionField and d is the number of partitions in the DataFrame being written.)

| write.distribution-mode | Number of files | Notes |
|---|---|---|
| "hash" | p | df.writeTo(tableName).append() |
| "hash", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
| "hash", sorted table | p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
| "hash", sorted table but only one value for partitionField | 1 | because p=1; assumes the size of the data to write is < write.spark.advisory-partition-size-bytes. Otherwise multiple files are written (Spark 3.5). |
| "none" | d * p | df.writeTo(tableName).append() |
| "none", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
| "none", sorted table | d * p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
| "none", sorted table but only one value for partitionField | d | because p=1 |
"Fanout writer is better in all cases. We were silly. The memory requirements were tiny IMHO. Without fanout, you need to presort within the task but that ends up being way more expensive (and memory intesive) IMHO. In the latest versions @Anton Okolnychyi removed the local sort requirements if fanout is enabled, so I would recommend fanout always be enabled and especially if you are using distribution mode is none."
Sunday, June 15, 2025
Lessons from a migration
Don't give people raw SQL access. They can make profound changes, and you'll have no logs and little ability to correct them.
A Quick look at Quarkus
- it allows polyglot development
- it allows Ahead-of-Time (AOT) compilation.
Making ML projects more robust
- A bug in calculating age
- A bug in sampling
 
