"Resource groups are units of deployment in ARM [Azure Resource Manager].
"They are containers grouping multiple resource instances in a security and management boundary.
"A resource group is uniquely named in a subscription.
"Resources can be provisioned on different Azure regions and yet belong to the same resource group.
"Resource groups provide additional services to all the resources within them. Resource groups provide metadata services, such as tagging, which enables the categorization of resources; the policy-based management of resources; RBAC; the protection of resources from accidental deletion or updates; and more...
"They have a security boundary, and users that don't have access to a resource group cannot access resources contained within it. Every resource instance needs to be part of a resource group; otherwise, it cannot be deployed." [Azure for Architects]
"A VNet is required to host a virtual machine. It provides a secure communication mechanism between Azure resources so that they can connect to each other.
"The VNets provide internal IP addresses to the resources, facilitate access and connectivity to other resources (including virtual machines on the same virtual network), route requests, and provide connectivity to other networks.
"A virtual network is contained within a resource group and is hosted within a region, for example, West Europe. It cannot span multiple regions but can span all datacenters within a region, which means we can span virtual networks across multiple Availability Zones in a region. For connectivity across regions, virtual networks can be connected using VNet-to-VNet connectivity." [Azure for Architects]
"Subnets provide isolation within a virtual network. They can also provide a security boundary. Network security groups (NSGs) can be associated with subnets, thereby restricting or allowing specific access to IP addresses and ports. Application components with separate security and accessibility requirements should be placed within separate subnets." [Azure for Architects]
Just some notes I've made playing around with AWS real estate.
ECS
Amazon's offering that scales Docker containers. Whereas EC2 is simply a remote VM, ECS is a "logical grouping of EC2 machines" [SO]
Fargate
Is a serverless version of EC2 [SO].
Kinesis
A proprietary Amazon alternative to Kafka. While Kafka brokers write data to local disks, Kinesis partitions a stream into shards and replicates records across availability zones.
MSK
Amazon also offers a hosted Kafka solution called MSK (Managed Streaming for Apache Kafka).
Lambda
Runs code in Docker-like containers that exist for up to 15 minutes and whose storage is ephemeral.
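A minimal handler sketch; the function name and event shape are arbitrary:

# Lambda invokes this function once per event; the 15-minute ceiling and
# ephemeral /tmp storage apply to each invocation.
def handler(event, context):
    name = event.get("name", "world")  # hypothetical event field
    return {"statusCode": 200, "body": f"hello, {name}"}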
Glue
A little like Hive. It has crawlers, batch jobs that compile metadata, doing some of the job of Hive's metastore. In fact, you can configure the metastore that Spark uses to have Glue as its backing store; see the EMR notes below.
EMR
EMR (Elastic MapReduce) is AWS's managed big data platform on which we can run Spark. "You can configure Hive to use the AWS Glue Data Catalog as its metastore." [docs] If you want to run Spark locally but still take advantage of Glue, follow these instructions.
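As a rough sketch of that delegation in PySpark (on EMR the Glue client factory class is on the classpath already; locally you would need the Glue Data Catalog client jars):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-metastore")
    # point Hive's metastore client at the Glue Data Catalog
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # lists Glue databases, not a local metastore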
Athena
Athena is AWS's hosted Trino offering. You can make data in S3 buckets available to Athena by using Glue crawlers.
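A boto3 sketch of querying such a crawler-populated table; the database, table, bucket, and region names are invented:

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

execution = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_database"},
    # Athena writes result files to S3; you choose where
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
# poll get_query_execution with this ID until the state is SUCCEEDED
print(execution["QueryExecutionId"])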
Step Functions
AWS's orchestration of different services within Amazon's cloud.
CodePipeline
...is AWS's CI/CD offering.
Databases
DynamoDB is a key/value store and Aurora is a distributed relational DB.
Although the Hive Metastore (HMS) backs most Spark deployments, it's not recommended for Iceberg: HMS does not support retries or the deconflicting of commits.
"HadoopCatalog has a number of drawbacks and we strongly discourage it for production use. There are certain features like rename and drop that may not be safe depending on the storage layer and you may also require a lock manager to get atomic behavior. JdbcCatalog is a much better alternative for the backing catalog." [Iceberg Slack]
Iceberg comes with a DynamoDB (AWS) implementation of the lock manager. Looking at the code, it appears that acquiring the lock uses an optimistic strategy: you can tell DynamoDB to put a row in the table iff it doesn't exist already. If it does, the underlying AWS library throws a software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException. There's a test for this in the AWS module here. It needs an AWS account to run.
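A boto3 sketch of that optimistic put, with an illustrative table and key schema (not Iceberg's actual lock table layout):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

def try_acquire_lock(table: str, entity_id: str) -> bool:
    try:
        dynamodb.put_item(
            TableName=table,
            Item={"entityId": {"S": entity_id}},
            # fail the write if another writer already holds the lock
            ConditionExpression="attribute_not_exists(entityId)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # somebody else holds the lock
        raise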
"This is necessary for a file system-based catalog to ensure atomic transaction in storages like S3 that do not provide file write mutual exclusion." [Iceberg docs] This is a sentiment echoed in this blog.
The issue is the rename, not the data transfer. "Each object transfer is atomic. That is, either a whole file is transferred, or none of it is. But the directory structure is not atomic and a failure can cause mv to fail mid-way." [AWS Questions]
In the old world of HDFS, Spark would write its output to a temporary directory and then atomically rename that directory to the final destination. However, S3 is not a file system but a blob store, and the notion of a directory is just that: notional. When we change a "directory's" name, all the files within it must be renamed one by one, and renaming all the files Spark outputs is not atomic in S3. Implementations that talk to their own storage must implement Hadoop's OutputCommitter; Spark calls these hooks when setting up, committing, and aborting writes.
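To see what a "directory rename" really costs on S3, here is a sketch: a copy-then-delete per object, so a crash part-way through leaves a mixture of old and new keys (the bucket and prefixes are made up):

import boto3

s3 = boto3.client("s3")

def rename_prefix(bucket: str, src: str, dst: str) -> None:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            new_key = dst + old_key[len(src):]
            # each copy/delete pair is atomic, but the loop as a whole is not
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": old_key},
            )
            s3.delete_object(Bucket=bucket, Key=old_key)

rename_prefix("my-bucket", "tmp/job-output/", "final/job-output/")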
The only mention of the lock manager in "Apache Iceberg: The Definitive Guide" is:
"If you are using AWS Glue 3.0 with Iceberg 0.13.1, you must also set the additional configurations for using the Amazon DynamoDB lock manager to ensure atomic transactions. AWS Glue 4.0, on the other hand, uses optimistic locking by default."
which is a bit too cryptic for me. Apparently, Glue 4.0 bundles a different version of Iceberg, one that uses optimistic locking by default [Discourse].
Catalogs
Catalogs "allows [Iceberg] to ensure consistency with multiple readers and writers and discover what tables are available in the environment... the primary high level requirement for a catalog implementation to work as an Iceberg catalog is to map a table path (e.g., “db1.table1”) to the file path of the metadata file that has the table’s current state."
The built-in catalogs include the HadoopCatalog, HiveCatalog, JdbcCatalog, RESTCatalog, and cloud-specific implementations such as the GlueCatalog and DynamoDbCatalog.
So, which should you use? From contributor Daniel Weeks on Slack:
"If you're not using HMS currently, I would suggest going with JdbcCatalog, which you can also use directly or with a REST frontend... I would strongly suggest using JDBC Catalog unless there's something specific you need. HMS is built for hive and iceberg is not hive. There is both a lot of completely and baggage that comes with hive. For example, if you change the table scheme directly in hive, it does not change the schema in your iceberg table. Same with setting table properties. JDBC is super lightweight and native to iceberg, so if you don't have hive, I would avoid using it.
"There are multiple projects that are starting to adopt REST and I expect that only to grow, but that doesn't mean you necessarily need it right now. The main thing to think about is using multiple catalogs (not limit yourself to a single one). You can use JDBC directly now (most engines support it), but you can always add a REST frontend later. They can co-exist and REST can even proxy to your JDBC backend"
[1] "Apache Iceberg: The Definitive Guide"
When trying to run ArgoCD, I came across this problem that was stopping me from connecting. Using kubectl port-forward..., I was finally able to connect. But even then, the argocd-server service's EXTERNAL-IP was stuck at <pending> when I ran:
$ kubectl get services --namespace argocd
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-applicationset-controller ClusterIP 10.98.20.142 <none> 7000/TCP,8080/TCP 19h
argocd-dex-server ClusterIP 10.109.252.231 <none> 5556/TCP,5557/TCP,5558/TCP 19h
argocd-metrics ClusterIP 10.106.130.22 <none> 8082/TCP 19h
argocd-notifications-controller-metrics ClusterIP 10.109.57.97 <none> 9001/TCP 19h
argocd-redis ClusterIP 10.100.158.58 <none> 6379/TCP 19h
argocd-repo-server ClusterIP 10.111.224.112 <none> 8081/TCP,8084/TCP 19h
argocd-server LoadBalancer 10.102.214.179 <pending> 80:30081/TCP,443:30838/TCP 19h
argocd-server-metrics ClusterIP 10.96.213.240 <none> 8083/TCP 19h
For further diagnosis, you can dump the entire cluster's state with kubectl cluster-info dump.
I helped somebody on Discord with a tricksy problem. S/he was using a Python UDF in PySpark and seeing NullPointerExceptions. This suggests a Java problem as the Python error message for an NPE looks more like "AttributeError: 'NoneType' object has no attribute ..." But why would Python code cause Spark to throw an NPE?
The problem was that the UDF defined a returnType struct which stated that a StructField was not nullable.
"When you define a schema where all columns are declared to not have null values, Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column." [Spark: The Definitive Guide]
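A sketch that reproduces the trap (whether it fails, and how loudly, varies by Spark version):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("nullable-udf").getOrCreate()

# nullable=False is only a hint to the optimizer; Spark does not enforce it
schema = StructType([StructField("value", StringType(), nullable=False)])

@udf(returnType=schema)
def may_return_null(s):
    return (None,)  # violates the declared non-nullability

df = spark.createDataFrame([("a",)], ["s"]).select(may_return_null("s"))
# can surface as a JVM NullPointerException on the executors, not a Python error
df.show()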