- tenantId. You can find this by running az account show --query tenantId.
- multiTenantAppName. This is the Application (client) ID that was generated when the app was created. You can see it in the Azure portal under Microsoft Entra ID -> App Registrations -> All Applications, or with the CLI: run az ad app list, find the app with the name you created above, and use its appId.
- consentUrl. I'm not entirely sure what this is, but it can be generated with APPID=$(az ad app list --display-name "MyMultiTenantApp" --query "[0].appId" -o tsv) && echo "https://login.microsoftonline.com/common/oauth2/v2.0/authorize?client_id=$APPID&response_type=code&redirect_uri=http://localhost:3000/redirect&response_mode=query&scope=https://graph.microsoft.com/.default&state=12345"
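If you'd rather not build the URL by string concatenation in the shell, here is a minimal Python sketch that assembles the same URL. The function name build_consent_url and the all-zeros app ID are placeholders of mine; the endpoint and query parameters are copied from the command above. Unlike the echo version, urlencode percent-encodes redirect_uri and scope.

```python
from urllib.parse import urlencode

# Endpoint taken verbatim from the shell one-liner above.
AUTHORIZE_ENDPOINT = "https://login.microsoftonline.com/common/oauth2/v2.0/authorize"


def build_consent_url(app_id, redirect_uri="http://localhost:3000/redirect", state="12345"):
    """Assemble the consentUrl for a given Application (client) ID."""
    params = {
        "client_id": app_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "response_mode": "query",
        "scope": "https://graph.microsoft.com/.default",
        "state": state,
    }
    # urlencode percent-encodes values such as the redirect URI.
    return f"{AUTHORIZE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder app ID; substitute the appId returned by `az ad app list`.
    print(build_consent_url("00000000-0000-0000-0000-000000000000"))
```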
Wednesday, September 17, 2025
Configuring Polaris for Azure
Thursday, September 11, 2025
Configuring Polaris for GCP
Configuring Polaris for AWS
- Configure your cloud account such that it's happy handing out access tokens
- Configure Polaris, both the credentials to access Polaris and the Catalog that is essentially a proxy to the cloud provider
- Configure Spark's SparkConf.
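For the last step, a rough sketch of the Spark settings might look like the following. This assumes Polaris is reached as an Iceberg REST catalog; the helper name polaris_spark_conf and every value passed to it (catalog name, URI, credentials, warehouse) are placeholders of mine, not values from these posts.

```python
def polaris_spark_conf(catalog_name, polaris_uri, client_id, client_secret, warehouse):
    """Build Spark settings for an Iceberg REST catalog served by Polaris.

    All arguments are placeholders to be replaced with your own values.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": polaris_uri,
        # The warehouse is the name of the Catalog created in Polaris.
        f"{prefix}.warehouse": warehouse,
        # Polaris principal credentials in clientId:clientSecret form.
        f"{prefix}.credential": f"{client_id}:{client_secret}",
        f"{prefix}.scope": "PRINCIPAL_ROLE:ALL",
        # Ask the catalog to vend short-lived cloud credentials.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


if __name__ == "__main__":
    conf = polaris_spark_conf(
        "polaris", "http://localhost:8181/api/catalog", "my-client-id", "my-secret", "my_catalog"
    )
    for key, value in conf.items():
        print(f"{key}={value}")
```

Each key/value pair can then be passed to SparkConf.set or to the session builder's config method.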
Thursday, September 4, 2025
Three things about Docker
docker history --no-trunc apache/polaris
Restoring OS properties
Saturday, August 30, 2025
(It's a) Kind and Strimzi
"NodePort is a Kubernetes Service type designed to make Pods reachable from a port available on the host machine, the worker node. The first thing to understand is that NodePort Services allow us to access a Pod running on a Kubernetes node, on a port of the node itself. After you expose Pods using the NodePort type Service, you’ll be able to reach the Pods by getting the IP address of the node and the port of the NodePort Service, such as <node_ip_address>:<node port>. The port can be declared in your YAML declaration or can be randomly assigned by Kubernetes. Most of the time, the NodePort Service is used as an entry point to your Kubernetes cluster." [The Kubernetes Bible]
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "emulationMajor" (class io.fabric8.kubernetes.client.VersionInfo), not marked as ignorable (9 known properties: "goVersion", "gitTreeState", "platform", "minor", "gitVersion", "gitCommit", "buildDate", "compiler", "major"])
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 4, column: 22] (through reference chain: io.fabric8.kubernetes.client.VersionInfo["emulationMajor"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61) ~[com.fasterxml.jackson.core.jackson-databind-2.16.2.jar:2.16.2]
...
at io.fabric8.kubernetes.client.utils.KubernetesSerialization.unmarshal(KubernetesSerialization.java:257) ~[io.fabric8.kubernetes-client-api-6.13.4.jar:?]
Monday, August 25, 2025
Cloud Architecture
- must be secure
- must be entirely FOSS based
- must be cross-cloud
- allows a bring-your-own policy
Tuesday, August 12, 2025
BERT
Sunday, August 3, 2025
Iceberg Distributions
write.distribution-mode | Number of files | Notes |
"hash" | p | df.writeTo(tableName).append() |
"hash", sorted DataFrame | p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
"hash", sorted table | p | df.sort("partitionField").writeTo(tableName).append() |
"hash", sorted table but only one value for partitionField | 1 | because p=1; assumes the size of the data to write is < write.spark.advisory-partition-size-bytes. Otherwise multiple files are written (Spark 3.5). |
"none" | d * p | df.writeTo(tableName).append() |
"none", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
"none", sorted table | d * p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
"none", sorted table but only one value for partitionField | d | because p=1 |
"Fanout writer is better in all cases. We were silly. The memory requirements were tiny IMHO. Without fanout, you need to presort within the task but that ends up being way more expensive (and memory intesive) IMHO. In the latest versions @Anton Okolnychyi removed the local sort requirements if fanout is enabled, so I would recommend fanout always be enabled and especially if you are using distribution mode is none."
Sunday, June 15, 2025
Lessons from a migration
Don't give people raw SQL access. They can make profound changes and you'll have no logs and little ability to correct it.
A Quick look at Quarkus
- it allows polyglot development
- it allows Ahead-of-Time (AOT) compilation.
Making ML projects more robust
[Figure: A bug in calculating age]
[Figure: A bug in sampling]
Saturday, May 31, 2025
Iceberg and Kafka Connect
"A task is a thread that performs the actual sourcing or sinking of data.
The number of tasks per connector is determined by the implementation of the connector...
For sink connectors, the number of tasks should be equal to the number of partitions of the topic. The task distribution among workers is determined by task rebalance which is a very similar process to Kafka consumer group rebalance."
"there should only be one task assigned partition 0 of the first topic, so elect that one the leader" [Commiter in Iceberg code].