Some tools that I've had too little time to investigate thoroughly.
TestContainers
The free and open source TestContainers offers huge convenience to developers. For instance, you can fire up a very lightweight Postgres container in just a second or two. This ZIO SQL test (AgregatingSpec) ran in just 3.85s on my laptop. In that time, it started a Docker container, populated the Postgres database in it with test data, ran some Scala code against it, then tore down the container. The container can live as long as the JVM, so all your tests can share it. When TestContainers detects that the JVM is exiting, it kills the container.
MinIO
If you need to run S3 API compatible storage locally, you can try MinIO. It's open source, written in Go, and lets you run a local Docker container that emulates Amazon storage.
DuckDB
This open source, C++ application allows you to run SQL against Parquet files without having to fire up a whole platform. You can even run DBeaver against it.
Crossplane
Crossplane is an open source Go project that "connects your Kubernetes cluster to external, non-Kubernetes resources, and allows platform teams to build custom Kubernetes APIs to consume those resources." [docs]
Scala Native
You can now convert Scala code to standalone executable binaries using Scala Native [baeldung]. It currently only works with single-threaded applications. The output can be converted to WebAssembly...
WebAssembly
Wikipedia describes WebAssembly as "a portable binary-code format and a corresponding text format for executable programs ... for facilitating interactions between such programs and their host environment." It is an "open standard and aims to support any language on any operating system".
Tapir
Tapir is a type-safe Scala library that documents HTTP endpoints.
GraphQL
GraphQL is a type system, query language, etc. accessible through a single endpoint that returns only what is asked of it and no surplus information. It's a spec, and there are implementations in a number of languages. The graph bit comes in insofar as a "query is a path in the graph, going from the root type to its subtypes until we reach scalar types with no subfields." [Bogdan Nedelcu]
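To make the "only returns what is asked" idea concrete, here is a toy sketch in plain Python. It is not a GraphQL implementation and uses no GraphQL library: the select function, the nested-dict "query" format and the sample user data are all made up for illustration. It only shows a selection walking from a root object down to scalar leaves.

```python
# Toy illustration only: not a real GraphQL implementation or library API.
# A "query" here is a nested dict of field names; None marks a scalar leaf.

def select(data, selection):
    """Return only the fields of `data` named in `selection`."""
    if selection is None:          # scalar leaf: no subfields to descend into
        return data
    if isinstance(data, list):     # apply the same selection to every item
        return [select(item, selection) for item in data]
    return {field: select(data[field], sub) for field, sub in selection.items()}


# Hypothetical data held by the server.
user = {
    "id": 1,
    "name": "Ada",
    "email": "ada@example.com",
    "posts": [
        {"title": "Hello", "body": "First post", "likes": 3},
        {"title": "DAGs", "body": "Second post", "likes": 7},
    ],
}

# Roughly what the GraphQL query `{ name posts { title } }` asks for.
query = {"name": None, "posts": {"title": None}}

result = select(user, query)
print(result)
```

The result contains the name and the post titles and nothing else. The id, email, body and likes fields never leave the "server".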
LLVM
LLVM is an open source tool chain written in C++. The 'VM' in LLVM originally stood for Virtual Machine, but this is no longer the case. Instead of being a virtual machine, it compiles source code from many major languages into a common intermediate representation that can then be turned into machine code.
GraalVM
GraalVM is an open source JDK and JRE written in Java itself and has its roots in project Maxine. But it's more than that. It offers compilation to native code and supports polyglot code via its Truffle framework, which implements languages as AST interpreters.
Quarkus
Based on GraalVM (above), Quarkus is an open source Java framework tailored for Kubernetes. Since the JVM code is natively compiled, startup time and memory footprint are small.
Spring Boot
Spring Boot is an "opinionated" Java framework that favours convention-over-configuration and runs Spring apps with the minimum of fuss.
Python/Java Interop
Together, Python and Java dominate the data engineering landscape. These languages can interoperate via Py4J, which uses sockets to allow Python to invoke Java code, and Jython, which runs Python code wholly inside the JVM. Py4J is used extensively in Spark to allow PySpark devs to talk to Spark JVMs.
Jython, unfortunately, does not support Python 3.
Project Nessie
Nessie is an open source, JVM project that promises to do for big data what Git did for code: versioning, branching etc. It apparently sits nicely on top of Iceberg and Databricks.
The lakeFS project is an open source, Go project that offers similar functionality.
Cloud native CI/CD
Tekton is an open source, Kubernetes native CI/CD framework written in Go.
Argo is a Go based, Kubernetes native tool. For instance, it handles rolling deployments, building on K8s' RollingUpdate strategy, which does not natively control traffic flow during an update.
CircleCI seems to be mostly closed source.
Pipelines
Interestingly, CI/CD and data pipelines both use directed acyclic graphs but with very different intent. User Han on Discord eloquently spelled out the architectural distinction:
Specifically the reason is in batch data [Pipeline] processing, we tend to scale things out horizontally by a whole lot, sometimes using GPUs. This is not a common feature supported by CI/CD workflow tools. In summary:
Jenkins, CodePipeline, Github Actions, TeamCity, Argo, etc ==> used to build DAGs for CI/CD, tends to have shorter run time, less compute requirement, and fairly linear in dependencies.
Airflow, Dagster, Prefect, Flyte, etc ==> used to build data and/or machine learning pipelines. It tend to have longer run time, larger horizontal scaling needs, and sometimes complex dependencies. Data pipelines also sometimes have certain needs, e.g., backfilling, resume, rerun, parameterization, etc that's not common in CI/CD pipelines
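The difference in DAG shape can be sketched with Python's standard graphlib module (Python 3.9+). This is only an illustration, not how any of the tools above is implemented: the task names and the stages helper are hypothetical. Each graph maps a task to the set of tasks it depends on, and the helper groups tasks into stages whose members could in principle run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps a task to the set of tasks it depends on.
# All task names are hypothetical.
ci_cd = {
    "test": {"build"},
    "package": {"test"},
    "deploy": {"package"},
}

data_pipeline = {
    "clean_a": {"extract"},
    "clean_b": {"extract"},
    "clean_c": {"extract"},
    "join": {"clean_a", "clean_b", "clean_c"},
    "report": {"join"},
}


def stages(dag):
    """Group tasks into stages; tasks within one stage have no mutual dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()                       # raises CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready()) # every task whose dependencies are done
        result.append(ready)
        sorter.done(*ready)                # mark them finished to unlock the next stage
    return result


ci_stages = stages(ci_cd)
data_stages = stages(data_pipeline)
print(ci_stages)
print(data_stages)
```

The CI/CD graph comes out as four stages of one task each, which is the "fairly linear" shape from the quote. The data pipeline has a stage of three independent clean tasks after extract, and that fan-out is where the horizontal scaling happens.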
Istio is an open source GoLang project that transparently provides "a uniform way to integrate microservices, manage traffic flow across microservices, enforce policies and aggregate telemetry data."