Thursday, December 16, 2021

Big Data or Pokemon?

It's hard to keep track of developments in big data. So much so that there is a quiz to see if you can differentiate big data tools from Pokemon characters. It's surprisingly difficult.

So that an old fool like me can at least look like he's avoiding being dissed by the cool kids, here is a cheat sheet of what's currently awesome:

Trino
Once known as PrestoSQL, Trino is a SQL engine that can sit on top of heterogeneous data sources. AWS offers Presto under the name "Athena". The incubating Apache Kyuubi appears to be similar but tailored for Spark.
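
What "sitting on top of heterogeneous data sources" means in practice is that a single query can join tables from different systems. A hedged sketch using the trino Python client (the host, catalogs and tables below are invented for illustration):

```python
# A sketch using the 'trino' Python client (pip install trino). The host,
# catalog names and tables are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",   # hypothetical coordinator
    port=8080,
    user="me",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# The selling point: one SQL statement can join tables that live in
# different systems (here, a Hive table and a PostgreSQL table).
cur.execute("""
    SELECT e.user_id, count(*) AS events
    FROM hive.web.events e
    JOIN postgresql.public.users u ON u.id = e.user_id
    GROUP BY e.user_id
""")
print(cur.fetchall())
```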

Amundsen
Amundsen is a "data discovery and metadata engine". Apache Atlas is a similar, Hadoop-based metadata and governance application written in Java.

Apache Griffin
Griffin is a JVM-based tool for checking data quality. TensorFlow Data Validation and Great Expectations are similar tools in the Python world.

Apache Arrow
Arrow is a language-agnostic columnar processing framework. You might need it if you want to use User Defined Aggregate Functions in PySpark [StackOverflow]. It's written in a number of languages, predominantly C++ and Java, and can help leverage GPUs.
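
For instance, pandas UDFs in PySpark move data between the JVM and Python as Arrow record batches rather than row by row. A minimal sketch of a grouped-aggregate pandas UDF (Spark 3.x with pyarrow installed; the DataFrame is made up):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 5.0)], ["k", "v"])

# A grouped-aggregate pandas UDF: each group's column arrives in Python as a
# pandas Series, serialised via Arrow, instead of one row at a time.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.groupBy("k").agg(mean_udf(df["v"])).show()
```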

Azkaban
Azkaban is a Hadoop workflow management tool from LinkedIn. It's open source and written in Java.

Koalas
Koalas brings the Pandas API to PySpark. It depends on Arrow to do this, apparently.
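
A minimal sketch of what that looks like (using the standalone databricks-koalas package; I believe the same API now also ships inside newer Spark releases as pyspark.pandas):

```python
import databricks.koalas as ks

# pandas-style calls, but the work is done by Spark under the hood.
kdf = ks.DataFrame({"x": [1, 1, 2], "y": [10.0, 20.0, 30.0]})
print(kdf["y"].mean())
print(kdf.groupby("x").sum())
```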

DBT
DBT is an open source Python project that does the T in ELT. Transforms are in templated SQL, apparently.

Apache Pinot
Pinot is a distributed, columnar data store written in Java that ingests batched and streaming data. It's a little like Apache Druid. "The only sustainable difference between Druid and Pinot is that Pinot depends on Helix framework and going to continue to depend on ZooKeeper, while Druid could move away from the dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database." [Medium]

Debezium
From the docs: "Debezium is an open source [Java] project that provides a low latency data streaming platform for change data capture (CDC). You setup and configure Debezium to monitor your databases, and then your applications consume events for each row-level change made to the database."

Samza
Apache Samza allows real-time analysis of streams. It comes from the same people who gave us Kafka (LinkedIn) and is written in Java and some Scala.

Pulsar
Apache Pulsar is like Apache Kafka but adds messaging on top of its streaming offering. It's mostly written in Java.

Airbyte
This appears to be a half Java, half Python collection of connectors to "sync data from applications, APIs & databases to warehouses, lakes & other destinations."

Hudi
A Java/Scala Apache project whose name stands for "Hadoop Upserts Deletes and Incrementals", which pretty much describes it. It can integrate with Spark.
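
A hedged sketch of the Spark integration: writing a DataFrame as a Hudi table so that later writes with the same record key become upserts (this assumes the hudi-spark bundle is on the classpath and the session is configured for Hudi; the table, path and field names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "alice", "2021-12-16 10:00:00", "uk"),
     (2, "bob",   "2021-12-16 11:00:00", "us")],
    ["id", "name", "ts", "region"],
)

(df.write.format("hudi")
   .option("hoodie.table.name", "users")
   .option("hoodie.datasource.write.recordkey.field", "id")    # key for upserts/deletes
   .option("hoodie.datasource.write.precombine.field", "ts")   # latest ts wins on collision
   .option("hoodie.datasource.write.partitionpath.field", "region")
   .option("hoodie.datasource.write.operation", "upsert")
   .mode("append")
   .save("/tmp/hudi/users"))
```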

DataHub
This is an open source "data catalog built to enable end-to-end data discovery, data observability, and data governance" [homepage] that came out of LinkedIn. It is written in Python and Java.

Industry adoption of these big data tools is covered in this report [LinkedIn].

Andreessen Horowitz make their predictions here.


Friday, December 3, 2021

More Logistic Regression

Log-Likelihood
How good is our model? I'm told the best way to think about the likelihood in a logistic regression is as the probability of seeing the data given the parameters. Compare this with what fitting a logistic regression actually does: finding the parameters that make that probability as high as possible.
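
Concretely, for binary labels y_i and predicted probabilities p_i, the log-likelihood is:

\[
\ell(\beta) \;=\; \sum_i \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big],
\qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}
\]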

First, some terminology:

Exo and Endo
"If an independent variable is correlated with the error term, we can use the independent variable to predict the error term, which violates the notion that the error term represents unpredictable random error... This assumption is referred to as exogeneity. Conversely, when this type of correlation exists, which violates the assumption, there is endogeneity." - Regression Analysis, Jim Frost.

"An endogenous variable is a variable in a statistical model that's changed or determined by its relationship with other variables within the model. In other words, an endogenous variable is synonymous with a dependent variable, meaning it correlates with other factors within the system being studied. Therefore, its values may be determined by other variables. Endogenous variables are the opposite of exogenous variables, which are independent variables or outside forces. Exogenous variables can have an impact on endogenous factors, however." [Investopedia]

Spark
Unfortunately, Spark does not seem to have a calculation for log-likelihood out of the box. This forced me to code my own. I looked at the code in the Python library, StatsModels, and converted it to PySpark.
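
The gist is something like the following sketch (not the exact code; it assumes a training DataFrame train_df with Spark ML's default "features" and "label" columns):

```python
import pyspark.sql.functions as F
from pyspark.ml.regression import GeneralizedLinearRegression

# Spark's GeneralizedLinearRegression uses IRLS; no regularization since
# we're after inferential statistics.
glr = GeneralizedLinearRegression(family="binomial", link="logit", regParam=0.0)
model = glr.fit(train_df)
preds = model.transform(train_df)   # "prediction" holds p = P(y=1|x)

# Log-likelihood = sum over rows of y*log(p) + (1-y)*log(1-p).
ll = (preds
      .select((F.col("label") * F.log("prediction")
               + (1 - F.col("label")) * F.log(1 - F.col("prediction"))).alias("ll"))
      .agg(F.sum("ll"))
      .first()[0])
print(ll)
```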

How closely Spark's output compared to StatsModels' was very data dependent. I guess this is inevitable since Spark uses IRLS [Wikipedia] as the solver while StatsModels was using l-BFGS. I was forced to use l-BFGS in StatsModels as I was otherwise getting "singular matrix" errors [SO].

But then, a passing data scientist helpfully pointed out that I was looking at the wrong StatsModels class (Logit in discrete_model.py not GLM in generalized_linear_model.py). Taking the GLM code, I got within 6% of StatsModels when both were run on the same data. Could I do better? Well, first I rolled back the use of l-BFGS to the default solver (which is IRLS for both libraries) and removed a regularization parameter in the Spark code that had crept in from a copy-and-paste - oops. Now, the difference between the two was an impressive 4.5075472233299e-06. Banzai!

We may choose to have no regularization if we're looking for inferential statistics (which I am). This might lead to overfitting but that's OK. "Overfitting is predominantly an issue when building predictive models in which the goal is application to data not used to build the model itself... As far as other applications that are not predictive, overfitting is more secondary" [SO]

The Data
The size of the data makes a difference. The more data there is, the lower the likelihood of that data given the parameters. This is simply because the likelihood is a product of per-observation probabilities, each at most 1 - equivalently, the state space is bigger and that means there are more potentially wrong states.
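
A toy illustration: if every observation were predicted with probability 0.9, the log-likelihood would simply be N times log(0.9), so ten times the data gives roughly ten times the (negative) log-likelihood.

```python
import math

p = 0.9                          # assume each point is predicted with probability 0.9
for n in (100, 1_000, 10_000):
    print(n, n * math.log(p))    # ~ -10.5, -105.4, -1053.6
```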

Also, adding a bias to a single feature can change the log-likelihood significantly. Making a previously unimportant feature biased towards a certain outcome 2/3 of the time reduced the LL in my data by an order of magnitude. Not surprisingly, if the data is constrained to a manifold, then it's more likely your model will find it.