Wednesday, February 23, 2022

ML Ops

"ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently" according to Wikipedia. In particular, my job includes "reproducibility and diagnostics of models and predictions" (see the "Goals" section). Despite repeatability being the bedrock of good science, it's surprising how few data scientists appreciate the importance of a pipeline that can reproduce results on demand.

ML Ops is an especially interesting field because, unlike most areas of software and data engineering, the mapping between inputs and outputs is often not one-to-one. For instance, in a typical banking app, an input (the purchase of a financial instrument) leads to exactly one output (the client now owns that financial instrument), modulo any cross-cutting concerns such as logging. But in ML models, this is not always the case. Including a socioeconomic feature from the data set may change the model's output in subtle ways. Or it may not at all. In fact, how can you be sure that you really did include it? Are you sure there wasn't a bug?

There seems to be a dearth of good resources for ML Ops, so here are some observations I've made.

Ontologies

Ontologies seem a really nice idea but I've never seen them work. This is partly due to the paradoxes that derive from trying to be all things to all men. True life story: the discharge date for some hospitals was 01/01/1900 for about 100 patients. This led to cumulative lengths-of-stay being negative. "OK," says the ontology team. "It's only 100 patients out of 10 million, so let's remove them." But the other 99 columns for these 100 patients were fine. So, during reconciliation with external systems, the QA team had a devil of a job trying to work out why their numbers did not add up. They had no knowledge of the 100 removed patients whose data was absolutely fine apart from their discharge dates.

Governance

Hold weekly town halls that encourage questions and maximise the number of eyes looking at the data while the data cleaning takes place. The format of this is very much like Prime Minister's Question Time. In addition, a regularly updated blog and threads that all teams can read aid knowledge transfer. What you really don't want is the data engineers working in isolation and not communicating regularly with downstream clients like the data scientists and QAs.

Test data

Fake data can sometimes be a problem. Say I create some data but want to bias it: I make all people with a socioeconomic score less than X fall into one class rather than the other and run a logistic regression model on it. All goes well. Then, one day, it is decided to one-hot encode the socioeconomic score to represent 'poor' versus everybody else, and the threshold for this partition happens to be X. Suddenly my tests start failing with "PerfectSeparationError: Perfect separation detected, results not available" and I don't immediately know why (this SO reply points out why perfect separation makes the model blow up).
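Here's a minimal sketch of that failure mode with made-up data (the feature name and threshold are illustrative; note that older statsmodels versions raise PerfectSeparationError while newer ones may only warn):

import numpy as np
import statsmodels.api as sm
from statsmodels.tools.sm_exceptions import PerfectSeparationError

rng = np.random.default_rng(42)
score = rng.uniform(0, 100, size=200)          # fake socioeconomic score
y = (score < 40).astype(int)                   # class is entirely determined by the threshold

# One-hot encode the score with the very same threshold: the classes are now perfectly separated
X = sm.add_constant((score < 40).astype(int))

try:
    sm.Logit(y, X).fit()
except PerfectSeparationError as e:
    print(e)   # "Perfect separation detected, results not available"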

Non-determinism

If any functions have non-deterministic results (for example, F.first when you aggregate a groupBy in Spark) then you are immediately at odds with your QA team. For example, a health provider has demographic data, including socioeconomic status, but this can change over time. They want to reduce patient events to just patients. This means a groupBy on the patient ID. But which socioeconomic indicator do we use in the aggregation?
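As a sketch of both the problem and one deterministic fix (column names are illustrative): F.first picks an arbitrary row per group, whereas a window ordered by event date always picks the same one.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [(1, "2020-01-01", "C"), (1, "2021-06-01", "B"), (2, "2019-03-01", "A")],
    ["patient_id", "event_date", "socioeconomic_band"],
)

# Non-deterministic: which band comes back depends on partitioning and ordering
non_det = events.groupBy("patient_id").agg(F.first("socioeconomic_band").alias("band"))

# Deterministic: explicitly take the band from each patient's most recent event
w = Window.partitionBy("patient_id").orderBy(F.col("event_date").desc())
det = (events
       .withColumn("rn", F.row_number().over(w))
       .filter("rn = 1")
       .drop("rn"))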

Plenty of Data Sets

When pipelining data, it's better to persist more data sets than fewer. This facilitates QA as otherwise it's hard to understand what functions acted on a row. The disadvantage is that a pipeline may take longer to run since we're constantly persisting files to disk. Still, the greater visibility you get is worth it.
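As an illustration (paths, data, and the helper are all made up), a tiny checkpointing function that writes every intermediate dataset to disk so QA can inspect what each step produced:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def checkpoint(df: DataFrame, name: str, base: str = "/tmp/pipeline") -> DataFrame:
    """Persist an intermediate dataset and read it back for the next step."""
    path = f"{base}/{name}"
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)

raw = spark.createDataFrame(
    [(1, "2021-01-01", 3), (1, "2021-02-01", 5), (2, "2021-01-15", -2)],
    ["patient_id", "event_date", "length_of_stay"],
)

cleaned = checkpoint(raw.filter(F.col("length_of_stay") >= 0), "cleaned_events")
per_patient = checkpoint(
    cleaned.groupBy("patient_id").agg(F.sum("length_of_stay").alias("total_los")),
    "per_patient",
)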

As far as I know, there is no tool that shows the lineage of functions applied to a single row.

Repeatability

I recommend having non-production transforms that examine the model's output. For instance, I'm currently working on a workflow that looks for insights in health data. I have a transform that counts the number of statistically significant insights that also have a significant effect size (how you calculate these thresholds is up to you). The point is that, after every pull request, the pipeline is run and the number of insights is compared to the previous run. If it changes significantly, there's a reasonable chance something nasty was introduced to the code base.

[Figure: healthy-looking (auto-generated) numbers of insights from my models]
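A minimal sketch of that kind of check, with illustrative thresholds and a JSON file standing in for wherever you record the previous run's count:

import json
import pathlib

P_THRESHOLD = 0.05          # illustrative significance threshold
EFFECT_THRESHOLD = 0.1      # illustrative minimum effect size
HISTORY = pathlib.Path("insight_counts.json")

def count_significant(insights):
    """insights: iterable of (p_value, effect_size) pairs from the model run."""
    return sum(1 for p, effect in insights if p < P_THRESHOLD and abs(effect) > EFFECT_THRESHOLD)

def check(insights, tolerance=0.2):
    current = count_significant(insights)
    previous = json.loads(HISTORY.read_text())["count"] if HISTORY.exists() else current
    HISTORY.write_text(json.dumps({"count": current}))
    if previous and abs(current - previous) / previous > tolerance:
        raise AssertionError(f"Insight count moved from {previous} to {current}: check the last merge")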

Without a decent platform, it's hard to have repeatable builds (see here and here for horror stories). Now, I have my issues with Palantir's Foundry, but at least it has repeatability built in. Under the covers, it uses plain old Git for its SCM capabilities. Even nicer, models can be built on a 'data' branch that corresponds to the Git branch, so you won't overwrite the data created by another branch when you do a full build.

Wednesday, February 9, 2022

Mini Driver

You can easily blow up the driver in Spark by calling collect() on a particularly large dataset. But is there any other way to do so?

This was the problem facing me this week: perfectly sensible looking code that resulted in:

Job aborted due to stage failure: Total size of serialized results of 4607 tasks (1024.1 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)

Well, the answer is yes: innocuous-looking joins can overwhelm the driver if they are broadcast, because the broadcast table is distributed to the executors via the driver.

"This is due to a limitation with Spark’s size estimator. If the estimated size of one of the DataFrames is less than the autoBroadcastJoinThreshold, Spark may use BroadcastHashJoin to perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out of memory error." [Databricks]

The solution is to call explain() and "review the physical plan. If the broadcast join returns BuildLeft, cache the left side table. If the broadcast join returns BuildRight, cache the right side table."

This can be achieved by calling the Dataset.hint(...) method with the values you can find in JoinStrategyHint.hintAliases.
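For example (the DataFrames here are illustrative), you can inspect the plan and steer Spark away from a broadcast join either with a hint or by lowering the threshold:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

facts = spark.range(10_000_000).withColumnRenamed("id", "patient_id")
dims = spark.range(100_000).withColumnRenamed("id", "patient_id")

facts.join(dims, "patient_id").explain()   # look for BroadcastHashJoin / BuildLeft / BuildRight

# Force a non-broadcast strategy with one of the hint aliases, e.g. a sort-merge join
facts.join(dims.hint("merge"), "patient_id").explain()

# Or stop Spark choosing broadcast joins automatically
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)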

Generalized Linear Models

GLMs provide a unified framework for modeling data originating from the exponential family of densities, which includes the Gaussian, Binomial, and Poisson, among others. A GLM is defined in [1] as:

"A generalised linear model (or GLM) consists of three components:

  1. A random component, specifying the conditional distribution of the response variable Y_i (for the ith of n independently sampled observations).
  2. The linear predictor, that is, a linear function of the regressors:
    η_i = α + β_1 X_{i1} + β_2 X_{i2} + ...
  3. A smooth and invertible link function, g, which converts the expectation of the response variable, μ_i ≡ E(Y_i), to the linear predictor" (see point 2). So:
    g(μ_i) = g(E[Y_i]) = η_i

Note that the linear predictor allows for interaction effects (where the effect of one regressor depends on the value of another), curvilinear effects (i.e. powers-of-x terms), and so on. The right-hand side is simply a linear combination of the explanatory variables.

The key to understanding this is to ask: what is an exponential family? Its probability mass function looks like this:

f(x|β) = h(x) e^{T(x)g(β) − A(β)}

where x and β are the data and parameters respectively, and h, g, T, and A are known functions. Now, the thing is that you can shoe-horn a few distributions into this form. Take the binomial distribution you were taught at high school:

f(k, n, p) = nCk p^k (1 − p)^{n−k}

with a bit of basic algebra (try it!), you can get it to look like:

f(x|p) = nCk e^{x log[p/(1−p)] + n log(1−p)}

Hey, that's the form of the exponential family! Here, the data x is the count k, h(x) = nCk, T(x) = x, A(p) = −n log(1−p), and the natural parameter is

g(p) = log[p/(1−p)]

Interestingly, point 3 (above) says that this g, applied to the mean, equals our linear predictor from point 2, and from that you can derive the sigmoid/logit relationship (try it!).
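Spelling that out: set the logit link equal to the linear predictor and invert it:

g(μ_i) = log[μ_i/(1 − μ_i)] = η_i
⟹  μ_i/(1 − μ_i) = e^{η_i}
⟹  μ_i = e^{η_i}/(1 + e^{η_i}) = 1/(1 + e^{−η_i})

which is exactly the sigmoid applied to the linear predictor.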

Note that this is not why we use the logit function in logistic regression. Often, the argument is that the inverse link must map the linear predictor, which ranges over ±∞, to [0, 1] because we're interested in probabilities. Although the logit's inverse does this, so do infinitely many other functions. So, here is a link that details why the logit is the only truly suitable choice.

Now, remember that a Bernoulli distribution is just a binomial with n=1, so this completely describes the case for logistic regression. This is why, when we're looking at a model with a binary response, we tell the GLM to use the Binomial family (see, for instance, here in the Spark docs). What we do for other use cases I shall deal with in another post.
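For instance, a minimal PySpark sketch with made-up data (column names are illustrative) of fitting a binary response with the binomial family and logit link:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0.0, 12.0), (1.0, 20.0), (0.0, 35.0), (1.0, 40.0), (0.0, 55.0), (1.0, 60.0)],
    ["label", "socioeconomic_score"],
)
features = VectorAssembler(inputCols=["socioeconomic_score"], outputCol="features").transform(df)

glm = GeneralizedLinearRegression(family="binomial", link="logit", labelCol="label")
model = glm.fit(features)
print(model.coefficients, model.intercept)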

[1] Applied Regression Analysis & Generalized Linear Models, Fox, 3rd edition.