Thursday, September 11, 2025
Configuring Polaris for GCP
Configuring Polaris for AWS
- Configure your cloud account such that it's happy handing out access tokens
- Configure Polaris, both the credentials to access Polaris and the Catalog that is essentially a proxy to the cloud provider
- Configure Spark's SparkConf.
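As a rough sketch of the last step, this is the kind of property map Spark needs to reach Polaris through Iceberg's REST catalog. The catalog name, URI, credentials and warehouse are placeholders I made up for illustration, and the `scope` and access-delegation header are the values commonly shown in Polaris quick starts rather than anything specific to one cloud.

```python
def polaris_spark_conf(catalog, uri, client_id, client_secret, warehouse):
    """Build Spark properties for an Iceberg REST catalog such as Polaris.

    All argument values are supplied by the caller; nothing here talks to
    a cloud provider or to Polaris itself.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register the catalog and point it at the REST endpoint.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        # Polaris principal credentials, in client_id:client_secret form.
        f"{prefix}.credential": f"{client_id}:{client_secret}",
        f"{prefix}.warehouse": warehouse,
        # Assumption: the principal may use all of its roles.
        f"{prefix}.scope": "PRINCIPAL_ROLE:ALL",
        # Ask the catalog to vend short-lived cloud credentials.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


# Hypothetical values, for illustration only.
conf = polaris_spark_conf(
    catalog="polaris",
    uri="http://localhost:8181/api/catalog",
    client_id="my-client-id",
    client_secret="my-client-secret",
    warehouse="my_catalog",
)
```

Each key and value can then be passed to `SparkSession.builder.config(key, value)` in PySpark, or set on a `SparkConf` in Scala.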
Thursday, September 4, 2025
Three things about Docker
docker history --no-trunc apache/polaris
Restoring OS properties
Saturday, August 30, 2025
(It's a) Kind and Strimzi
"NodePort is a Kubernetes Service type designed to make Pods reachable from a port available on the host machine, the worker node. The first thing to understand is that NodePort Services allow us to access a Pod running on a Kubernetes node, on a port of the node itself. After you expose Pods using the NodePort type Service, you’ll be able to reach the Pods by getting the IP address of the node and the port of the NodePort Service, such as <node_ip_address>:<node port>. The port can be declared in your YAML declaration or can be randomly assigned by Kubernetes. Most of the time, the NodePort Service is used as an entry point to your Kubernetes cluster." [The Kubernetes Bible]
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "emulationMajor" (class io.fabric8.kubernetes.client.VersionInfo), not marked as ignorable (9 known properties: "goVersion", "gitTreeState", "platform", "minor", "gitVersion", "gitCommit", "buildDate", "compiler", "major"])
at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 4, column: 22] (through reference chain: io.fabric8.kubernetes.client.VersionInfo["emulationMajor"])
at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61) ~[com.fasterxml.jackson.core.jackson-databind-2.16.2.jar:2.16.2]
...
at io.fabric8.kubernetes.client.utils.KubernetesSerialization.unmarshal(KubernetesSerialization.java:257) ~[io.fabric8.kubernetes-client-api-6.13.4.jar:?]
Monday, August 25, 2025
Cloud Architecture
- must be secure
- must be entirely FOSS based
- must be cross-cloud
- must allow a bring-your-own policy
Tuesday, August 12, 2025
BERT
Sunday, August 3, 2025
Iceberg Distributions
write.distribution-mode | Number of files | Notes |
"hash" | p | df.writeTo(tableName).append() |
"hash", sorted DataFrame | p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
"hash", sorted table | p | df.sort("partitionField").writeTo(tableName).append() |
"hash", sorted table but only one value for partitionField | 1 | because p=1; assumes the size of the data to write is < write.spark.advisory-partition-size-bytes. Otherwise multiple files are written (Spark 3.5). |
"none" | d * p | df.writeTo(tableName).append() |
"none", sorted DataFrame | p | df.sort("partitionField").writeTo(tableName).append() |
"none", sorted table | d * p | ...TBLPROPERTIES ('sort-order' = 'partitionField ASC NULLS FIRST'... |
"none", sorted table but only one value for partitionField | d | because p=1 |
"Fanout writer is better in all cases. We were silly. The memory requirements were tiny IMHO. Without fanout, you need to presort within the task but that ends up being way more expensive (and memory intesive) IMHO. In the latest versions @Anton Okolnychyi removed the local sort requirements if fanout is enabled, so I would recommend fanout always be enabled and especially if you are using distribution mode is none."