Michael Collado: it's going to be both, so you shouldn't really use it in that scenario. The Polaris server is going to read/write metadata.json files in its own container's file system and the spark notebook will read/write data files in its own container's filesystem, so... [Discord]
Saturday, December 21, 2024
Debugging Polaris in Docker
Monday, December 9, 2024
Pipeline testing
precisely x medical events over D days for y different unique patients, distributed evenly over hospitals {a,b,c} where each patient is admitted on day i mod D and discharged i mod w days later, where i is the unique patient id in range [0,y).
Finally, the discharge date is null every z patients because we know we have bad data (urgh).
If we turn this natural language into a (Python) code signature, it looks like this:
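The original signature is not reproduced in this excerpt, so the following is only a hypothetical sketch of what such a generator might look like. The names (MedicalEvent, generate_events and every parameter name) are invented for illustration, as are two assumptions: events cycle through the patients in order, and a patient's discharge date is null when their id is a multiple of z.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MedicalEvent:
    patient_id: int
    hospital: str
    admitted: date
    discharged: Optional[date]  # None models the known-bad data


def generate_events(
    num_events: int,    # x in the description
    num_days: int,      # D
    num_patients: int,  # y
    max_stay: int,      # w
    null_every: int,    # z
    hospitals: Sequence[str] = ("a", "b", "c"),
    start: date = date(2024, 1, 1),  # arbitrary anchor date (assumption)
) -> List[MedicalEvent]:
    """Deterministic test data: nothing here is random, so assertions can be exact."""
    events = []
    for j in range(num_events):
        i = j % num_patients  # patient id; events cycle through patients (assumption)
        admitted = start + timedelta(days=i % num_days)
        if i % null_every == 0:
            discharged = None  # bad data: every z-th patient has no discharge date
        else:
            discharged = admitted + timedelta(days=i % max_stay)
        hospital = hospitals[j % len(hospitals)]  # round-robin gives an even spread
        events.append(MedicalEvent(i, hospital, admitted, discharged))
    return events
```

Because every value is a pure function of the inputs, a test can state exactly how many rows each hospital should have on each day rather than asserting on ranges.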
The assertions
Crafting the test data also raised some interesting corner cases that we needed to take back to the business analysts. For example, if a patient is discharged the same day they're admitted, do they show up on that day's occupancy numbers or not? If the discharge date is null, what do we do with this patient?
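One way to keep those corner cases visible is to make each decision an explicit parameter until the analysts have ruled on it. This is a minimal sketch; the function and flag names are hypothetical, and it does not claim to know which answers the business chose.

```python
from datetime import date
from typing import Optional


def is_occupying(
    admitted: date,
    discharged: Optional[date],
    day: date,
    count_discharge_day: bool,   # does the day of discharge count as occupied?
    count_null_discharge: bool,  # do patients with no discharge date count?
) -> bool:
    """Return True if the patient counts towards occupancy on the given day."""
    if day < admitted:
        return False
    if discharged is None:
        return count_null_discharge
    if count_discharge_day:
        return day <= discharged
    return day < discharged
```

With a same-day admission and discharge, the answer flips purely on count_discharge_day, which is exactly the question to put to the analysts.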
Diagnosing Kafka
"Producers write data to and consumers read data from topic partition leaders. Synchronous data replication from leaders to followers ensures that messages are copied to more than one broker. Kafka producers can set the acks configuration parameter to control when a write is considered successful." [Disaster Recovery for Multi- Datacenter Apache Kafka Deployments, Confluent]
Wednesday, November 27, 2024
Cancel culture
"What happens when a pod starts up, and what happens when a pod shuts down?
"When a pod starts in a rolling deployment without the readiness probe configured ... the pod starts receiving traffic even though the pod is not ready. The absence of a readiness probe makes the application unstable. ...
Monday, November 11, 2024
Databricks in Azure
which opens my browser to the job's console.
Saturday, October 26, 2024
NVIDIA Rapids
...
at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:246)
yum -y install git
"cudaFree(0) to actually allocate the set device - no process exclusive required since we are relying on Spark to schedule it properly and not give it to multiple executors"
Saturday, October 5, 2024
Optimising GPU code
Changes in Java's memory
MemorySegment memorySegment = Arena.global().allocate(1024 * 1024 * 128, 8);
System.out.println(memorySegment.address());
Friday, September 27, 2024
Running Iceberg Catalogs in a test suite
Russell Spitzer: This isn't actually defined anywhere. So at the moment we are kind of in a wild west, the Spark implementation deletes absolutely everything. I believe for Trino it just sends the request to the catalog and it's up to the catalog to decide. So for Spark "no", and it explicitly sends a drop table request to the rest catalog without a purge flag.
Thursday, September 26, 2024
Azure Automation
Sunday, September 22, 2024
GPU Notes
The organization of threads in CUDA terms:
- Thread: single unit of execution --- each thread has its own memory called registers
- Block: group of threads --- all threads in a block have access to a shared memory [Cuda Terminology].
- Grid: group of blocks --- all threads in a grid have access to [mutable] global memory and [immutable, global] constant memory.
[Penny Xu's blog]
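To make the hierarchy concrete, this small Python sketch enumerates a one-dimensional grid and computes the flat index each thread would typically use to pick its element of an array. It is a simulation only, not GPU code; the function name is hypothetical, and the formula mirrors the usual CUDA idiom blockIdx.x * blockDim.x + threadIdx.x.

```python
from typing import List, Tuple


def global_thread_ids(grid_dim: int, block_dim: int) -> List[Tuple[int, int, int]]:
    """Return (block index, thread index within block, global index) for a 1-D grid."""
    ids = []
    for block_idx in range(grid_dim):       # blocks in the grid
        for thread_idx in range(block_dim):  # threads in each block
            ids.append((block_idx, thread_idx, block_idx * block_dim + thread_idx))
    return ids
```

For example, a grid of 2 blocks with 3 threads each yields global indices 0 to 5, with threads 0 to 2 sharing the first block's shared memory and threads 3 to 5 sharing the second's.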
Thursday, September 5, 2024
Architecting Azure
Bursty network logs
Sunday, July 21, 2024
Hive for an Iceberg test suite
Cannot write into v1 table: `spark_catalog`.`database`.`spark_file_test_writeto`.
"Now V2 Predicate function pushdown is allowed (or at least should be coming soon) but for that you must use the datasource functions and not the spark wones" [Slack]"Best practice for partitions is partition by low cardinality and sort for high cardanality. This may reduce the level of partitions you need" [Slack]
So, I tried creating the table with:
ERROR | o.a.h.h.metastore.RetryingHMSHandler - MetaException(message:Unable to update transaction database java.sql.SQLSyntaxErrorException: Table/View 'NEXT_LOCK_ID' does not exist.
Sunday, June 16, 2024
First look at the Unity Catalog
Friday, May 17, 2024
More SQL Server Tuning
Friday, April 19, 2024
Tuning SQL Server
select * from sys.indexes
Saturday, April 6, 2024
When adding more CPUs does not help distressed CPUs
Insanely high CPU usage
Insanely large GC Times
Friday, April 5, 2024
Network Adventures in Azure Databricks
"Resource groups are units of deployment in ARM [Azure Resource Manager].
"They are containers grouping multiple resource instances in a security and management boundary.
"A resource group is uniquely named in a subscription.
"Resources can be provisioned on different Azure regions and yet belong to the same resource group.
"Resource groups provide additional services to all the resources within them. Resource groups provide metadata services, such as tagging, which enables the categorization of resources; the policy-based management of resources; RBAC; the protection of resources from accidental deletion or updates; and more...
"They have a security boundary, and users that don't have access to a resource group cannot access resources contained within it. Every resource instance needs to be part of a resource group; otherwise, it cannot be deployed." [Azure for Architects]
"A VNet is required to host a virtual machine. It provides a secure communication mechanism between Azure resources so that they can connect to each other.
"The VNets provide internal IP addresses to the resources, facilitate access and connectivity to other resources (including virtual machines on the same virtual network), route requests, and provide connectivity to other networks.
"A virtual network is contained within a resource group and is hosted within a region, for example, West Europe. It cannot span multiple regions but can span all datacenters within a region, which means we can span virtual networks across multiple Availability Zones in a region. For connectivity across regions, virtual networks can be connected using VNet-to-VNet connectivity." [Azure for Architects]
"Subnets provide isolation within a virtual network. They can also provide a security boundary. Network security groups (NSGs) can be associated with subnets, thereby restricting or allowing specific access to IP addresses and ports. Application components with separate security and accessibility requirements should be placed within separate subnets." [Azure for Architects]
AWS Real Estate
Just some notes I've made playing around with AWS real estate.
ECS
Amazon's offering that scales Docker containers. Whereas EC2 is simply a remote VM, ECS is a "logical grouping of EC2 machines" [SO].
Fargate
Is a serverless way to run containers: in effect, ECS without having to provision or manage the underlying EC2 machines [SO].
Kinesis
A proprietary Amazon Kafka replacement. While Kafka writes data locally, Kinesis uses a quorum of shards.
MSK
Amazon also offers a hosted Kafka solution called MSK (Managed Streaming for Apache Kafka).
Lambda
Runs containers, like Docker, that exist for up to 15 minutes and whose storage is ephemeral.
Glue
A little like Hive. It has crawlers, which are batch jobs that compile metadata, thus doing some of the job of Hive's metastore. In fact, you can configure the metastore that Spark uses so that Glue is its backing store. See:
EMR
EMR is AWS's MapReduce tool on which we can run Spark. "You can configure Hive to use the AWS Glue Data Catalog as its metastore." [docs] If you want to run Spark locally but still take advantage of Glue, follow these instructions.
Athena
Athena is AWS's hosted Trino offering. You can make data in S3 buckets available to Athena by using Glue crawlers.
Step Functions
AWS's orchestration of different services within Amazon's cloud.
CodePipeline
...is AWS's CI/CD offering.
Databases
DynamoDB is a key/value store and Aurora is a distributed relational DB.