It's hard to keep track of developments in big data. So much so that there is a quiz to see if you can differentiate big data tools from Pokemon characters. It's surprisingly difficult.
So, to help an old fool like me avoid being dissed by the cool kids, here is a cheat sheet of what's currently awesome:
Trino
Once known as PrestoSQL, Trino is a SQL engine that can sit on top of heterogeneous data sources. AWS offers managed Presto under the name "Athena". The incubating Apache Kyuubi appears to be similar but is tailored for Spark.
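For a flavour of how you'd talk to it from Python, here's a minimal sketch using the trino client library (pip install trino); the host, catalog and table names are all invented for illustration:

```python
import trino

# Hypothetical coordinator and catalog; swap in your own connector names.
conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",      # one of the configured data sources
    schema="default",
)
cur = conn.cursor()
# The same SQL dialect works regardless of the underlying store,
# and queries can join across catalogs.
cur.execute("SELECT count(*) FROM orders")
print(cur.fetchone())
```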
Amundsen
Amundsen is a "data discovery and metadata engine". Apache Atlas is a Hadoop-based metadata and governance application written in Java.
Apache Griffin
Griffin is a JVM-based tool for checking data quality. TensorFlow Data Validation and Great Expectations are Python equivalents.
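To make "checking data quality" concrete, here's a hedged sketch of the sort of assertion Great Expectations supports, using its pandas-flavoured API (the column names and data are invented):

```python
import pandas as pd
import great_expectations as ge

# Wrap a plain pandas DataFrame so it grows expect_* methods.
df = ge.from_pandas(
    pd.DataFrame({"user_id": [1, 2, None], "amount": [9.99, 0.0, 5.0]})
)

# Each expectation returns a result object with a success flag,
# so suites of these can gate a pipeline.
print(df.expect_column_values_to_not_be_null("user_id"))
print(df.expect_column_values_to_be_between("amount", min_value=0))
```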
Apache Arrow
Arrow is a language-agnostic columnar processing framework. You might need it if you want to use User Defined Aggregate Functions in PySpark [StackOverflow]. It's written in a number of languages, predominantly C++ and Java, and can help leverage GPUs.
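As an example of where Arrow earns its keep, here's a minimal sketch of the kind of Arrow-backed aggregate UDF that StackOverflow answer is pointing at. In Spark 3.x a Series-to-scalar pandas UDF acts as an aggregate, with the data crossing the JVM/Python boundary as Arrow record batches rather than pickled rows. It assumes pyarrow is installed; the schema is made up:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 5.0)], ["dept", "amount"]
)

# Series -> scalar means this behaves as an aggregate function.
@pandas_udf("double")
def mean_amount(v: pd.Series) -> float:
    return float(v.mean())

df.groupBy("dept").agg(mean_amount("amount")).show()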
Azkaban
Azkaban is a Hadoop workflow management tool from LinkedIn. It's open source and written in Java.
Koalas
Koalas brings the Pandas API to PySpark. It depends on Arrow to do this, apparently.
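A minimal sketch of the idea (the data is invented; it needs pyspark and pyarrow installed). Worth knowing: Koalas has since been folded into Spark itself as pyspark.pandas.

```python
import databricks.koalas as ks

# Looks like pandas, but the work is done by Spark underneath.
kdf = ks.DataFrame({"dept": ["a", "a", "b"], "amount": [1.0, 2.0, 5.0]})
print(kdf.groupby("dept")["amount"].mean())
```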
DBT
DBT is an open source Python project that does the T in ELT. Transforms are in templated SQL, apparently.
Apache Pinot
Pinot is a distributed, columnar data store written in Java that ingests batched and streaming data. It's a little like Apache Druid. "The only sustainable difference between Druid and Pinot is that Pinot depends on Helix framework and going to continue to depend on ZooKeeper, while Druid could move away from the dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database." [Medium]
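If you want to poke at Pinot from Python, there's a community DB-API client, pinotdb; a hedged sketch, with the broker host and table made up:

```python
from pinotdb import connect

# Connect to a (hypothetical) Pinot broker over its SQL endpoint.
conn = connect(host="pinot-broker.example.com", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()
cur.execute("SELECT dept, count(*) FROM events GROUP BY dept LIMIT 10")
for row in cur:
    print(row)
```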
Debezium
From the docs: "Debezium is an open source [Java] project that provides a low latency data streaming platform for change data capture (CDC). You setup and configure Debezium to monitor your databases, and then your applications consume events for each row-level change made to the database."
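Debezium's usual deployment pushes those change events into Kafka, one topic per table. A hedged sketch of what a consumer might look like in Python with kafka-python — the topic name and event shape follow Debezium's documented conventions, but treat the specifics as assumptions:

```python
import json
from kafka import KafkaConsumer

# Debezium names topics server.schema.table; this one is hypothetical.
consumer = KafkaConsumer(
    "dbserver1.inventory.customers",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")) if b else None,
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone records carry no value
        continue
    # With the default JSON converter the change sits under "payload":
    # "op" is c/u/d (create/update/delete), "after" is the new row state.
    payload = event.get("payload", event)
    print(payload["op"], payload.get("after"))
```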
Figures on industry adoption of big data tools can be found in this report [LinkedIn].
Andreessen Horowitz make their predictions here.