Thursday, December 16, 2021

Big Data or Pokemon?

It's hard to keep track of developments in big data. So much so that there is a quiz to see if you can differentiate big data tools from Pokemon characters. It's surprisingly difficult.

So, to help an old fool like me avoid being dissed by the cool kids, here is a cheat sheet of what's currently awesome:

Trino
Once known as PrestoSQL, Trino is a SQL engine that can sit on top of heterogeneous data sources. AWS offers a managed Presto service under the name "Athena". The incubating Apache Kyuubi appears to be similar but tailored for Spark.
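A sketch of the kind of federated query Trino enables, joining tables that live in different systems. The catalog, schema and table names here are invented:

```sql
-- Hypothetical Trino query: join a Hive table with a PostgreSQL table
-- in a single statement. "hive" and "postgresql" are catalog names
-- configured on the Trino cluster; everything else is made up.
SELECT o.order_id, c.customer_name
FROM hive.sales.orders o
JOIN postgresql.crm.customers c
  ON o.customer_id = c.customer_id;
```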

Amundsen
Amundsen is a "data discovery and metadata engine". Apache Atlas is a Hadoop-based metadata and governance application written in Java.

Apache Griffin
Griffin is a JVM-based tool for checking data quality. TensorFlow Data Validation and Great Expectations are similar tools for Python.
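To make the idea concrete, here is a minimal sketch in plain pandas of the kind of rule-based checks these tools automate. The column names, data and thresholds are all made up; real tools let you declare such rules and report on them at scale:

```python
import pandas as pd

# Toy dataset standing in for a real table; names are invented.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [34, 29, None, 41],
})

# The kinds of declarative checks data-quality tools automate:
checks = {
    "user_id is unique": df["user_id"].is_unique,
    "age has no nulls": df["age"].notna().all(),
    "age within 0-120": df["age"].dropna().between(0, 120).all(),
}

failures = [name for name, passed in checks.items() if not passed]
print(failures)  # the null in "age" should trip one check
```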

Apache Arrow
Arrow is a language-agnostic columnar processing framework. You might need it if you want to use User Defined Aggregate Functions in PySpark [StackOverflow]. It's written in a number of languages, predominantly C++ and Java, and can help leverage GPUs.

Azkaban
Azkaban is a Hadoop workflow management tool from LinkedIn. It's open source and written in Java.

Koalas
Koalas brings the Pandas API to PySpark. It depends on Arrow to do this, apparently.
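The point of Koalas is that familiar pandas code runs unchanged on Spark. A sketch of the API shape, written here in plain pandas (the data is invented); as of Spark 3.2 Koalas has been folded into PySpark itself as `pyspark.pandas`:

```python
import pandas as pd
# With Koalas the intent is that you only swap the import, e.g.
#   import databricks.koalas as ks
# and use ks.DataFrame the same way, with Spark doing the work.

df = pd.DataFrame({"tool": ["Spark", "Pandas"], "year": [2014, 2008]})
newest = df.sort_values("year", ascending=False).iloc[0]["tool"]
print(newest)
```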

DBT
DBT is an open source Python project that does the T in ELT. Transforms are in templated SQL, apparently.
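Roughly what a dbt "model" looks like: a SELECT statement with Jinja templating that dbt compiles and materialises in the warehouse. The file, model and column names below are invented:

```sql
-- models/active_customers.sql (hypothetical model)
-- dbt resolves ref() to the upstream table and builds this model
-- as a table or view in the warehouse.
SELECT customer_id, first_seen_at
FROM {{ ref('raw_customers') }}
WHERE churned = false
```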

Apache Pinot
Pinot is a distributed, columnar data store written in Java that ingests batched and streaming data. It's a little like Apache Druid. "The only sustainable difference between Druid and Pinot is that Pinot depends on Helix framework and going to continue to depend on ZooKeeper, while Druid could move away from the dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database." [Medium]

Debezium
From the docs: "Debezium is an open source [Java] project that provides a low latency data streaming platform for change data capture (CDC). You setup and configure Debezium to monitor your databases, and then your applications consume events for each row-level change made to the database."
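Registering a connector is roughly a JSON document POSTed to the Debezium/Kafka Connect REST API. A sketch for MySQL, with illustrative host, credential and database names (a real deployment needs further settings, such as the schema-history topic):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "...",
    "database.server.name": "dbserver1",
    "database.include.list": "inventory"
  }
}
```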

Samza
Apache Samza allows real-time analysis of streams. It comes from the same people who gave us Kafka (LinkedIn) and is written in Java and some Scala.

Pulsar
Apache Pulsar is like Apache Kafka but adds messaging on top of its streaming offering. It's mostly written in Java.

AirByte
This appears to be a half Java, half Python collection of connectors to "sync data from applications, APIs & databases to warehouses, lakes & other destinations."

Hudi
A Java/Scala Apache project whose name stands for "Hadoop Upserts Deletes and Incrementals", which pretty much describes it. It can integrate with Spark.
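For anyone unsure what an "upsert" buys you, here is the semantics sketched in plain pandas (this is not Hudi's API, just an illustration with invented data): unmatched old rows are kept, matched rows are overwritten, and new rows are appended, which is what Hudi does at data-lake scale.

```python
import pandas as pd

# Existing records in the lake (toy data).
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}).set_index("id")

# Incoming batch: updates id 2, inserts id 3.
incoming = pd.DataFrame({"id": [2, 3], "value": ["B", "c"]}).set_index("id")

# Incoming rows win where keys collide; old rows survive otherwise.
merged = incoming.combine_first(existing)
print(merged["value"].to_dict())  # {1: 'a', 2: 'B', 3: 'c'}
```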

DataHub
This is an open source "data catalog built to enable end-to-end data discovery, data observability, and data governance" [homepage] that came out of LinkedIn. It is written in Python and Java.

The adoption of big data tools within the industry can be found in this report [LinkedIn].

Andreessen Horowitz make their predictions here.

