"ML Ops is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently" according to Wikipedia. In particular, my job includes "reproducibility and diagnostics of models and predictions" (see the "Goals" section). Despite repeatability being the bedrock of good science, it's surprising how few data scientists appreciate the importance of a pipeline that can reproduce results on demand.
ML Ops is an especially interesting field because, unlike most areas of software and data engineering, the mapping from inputs to outputs is often not one-to-one. For instance, in a typical banking app, an input (the purchase of a financial instrument) leads to exactly one output (the client now owns that financial instrument), modulo any cross-cutting concerns such as logging. But in ML models, this is not always the case. Including a socioeconomic feature from the data set may change the model's output in subtle ways. Or it may not change it at all. In fact, how can you be sure that you really did include it? Are you sure there wasn't a bug?
There seems to be a dearth of good resources for ML Ops, so here are some observations I've made.
Ontologies
Ontologies seem a really nice idea but I've never seen them work. This is partly due to paradoxes that derive from trying to be all things to all men. True story: the discharge date for some hospitals was 01/01/1900 for about 100 patients, which led to cumulative lengths-of-stay becoming negative. "OK," says the ontology team. "It's only 100 patients out of 10 million, so let's remove them." But the other 99 columns for these 100 patients were fine. So, during reconciliation with external systems, the QA team had a devil of a job working out why their numbers did not add up. They had no knowledge of the 100 missing patients whose data was absolutely fine apart from the discharge dates.
Governance
Hold weekly town halls that encourage questions and maximise the number of eyes on the data while the data cleaning takes place. The format is very much like Prime Minister's Question Time. In addition, a regularly updated blog and threads that all teams can read aid knowledge transfer. What you really don't want is the data engineers working in isolation and not communicating regularly with downstream clients like the data scientists and QAs.
Test data
Fake data can sometimes be a problem. Say I create some data but want to bias it: I make everybody with a socioeconomic score less than X fall into one class rather than the other and run a logistic regression model on it. All goes well. Then, one day, it is decided to one-hot encode the socioeconomic score to represent 'poor' versus everybody else, and the threshold for this partition happens to be X. Suddenly my tests start failing with "PerfectSeparationError: Perfect separation detected, results not available" and I don't immediately know why (this SO reply points out that perfect separation causes the model to blow up).
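Here is a minimal sketch of that failure mode, assuming statsmodels for the logistic regression; the column names, the threshold X, and the random data are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
x_threshold = 0.3  # hypothetical socioeconomic cut-off "X"

socio = rng.uniform(0, 1, n)
# Biased fake data: class membership is decided entirely by the threshold.
label = (socio < x_threshold).astype(int)

# One-hot encoding "poor vs everybody else" with the *same* threshold
# turns the feature into a perfect predictor of the label.
poor = (socio < x_threshold).astype(float)

features = sm.add_constant(poor)
# Depending on the statsmodels version, this either raises
# PerfectSeparationError ("Perfect separation detected, results not available")
# or warns about perfect separation and returns runaway coefficients.
result = sm.Logit(label, features).fit()
```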
Non-determinism
If there are any functions that have non-deterministic results (for example F.first when you aggregate a groupBy in Spark) then you are immediately at odds with your QA team. For example, a health provider has demographic data including socioeconomic status, but this can change over time. They want to reduce patient events to just patients. This means a groupBy on the patient ID. But which socioeconomic indicator do we use in the aggregation?
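As a hedged illustration of why F.first is risky, here is a PySpark sketch; the column names (patient_id, event_ts, socio_status) are made up. The second version pins down an explicit ordering so the QA team can reproduce the result.

```python
# F.first after a groupBy is non-deterministic because row order within
# each group is not guaranteed after a shuffle.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame(
    [
        ("p1", "2020-01-01", "low"),
        ("p1", "2021-06-01", "medium"),
        ("p2", "2019-03-15", "high"),
    ],
    ["patient_id", "event_ts", "socio_status"],
)

# Non-deterministic: which socio_status "first" picks depends on partitioning.
flaky = events.groupBy("patient_id").agg(
    F.first("socio_status").alias("socio_status")
)

# Deterministic alternative: explicitly take the status from the latest event.
w = Window.partitionBy("patient_id").orderBy(F.col("event_ts").desc())
stable = (
    events.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```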
Plenty of Data Sets
When pipelining data, it's better to persist more data sets than fewer. This facilitates QA as otherwise it's hard to understand what functions acted on a row. The disadvantage is that a pipeline may take longer to run since we're constantly persisting files to disk. Still, the greater visibility you get is worth it.
As far as I know, there is no tool that shows the lineage of functions applied to a single row.
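To make that concrete, here is a minimal sketch of a pipeline that persists every intermediate data set so QA can diff one stage against the next; the paths, column names, and transforms are purely illustrative.

```python
from pyspark.sql import DataFrame, SparkSession


def run_pipeline(spark: SparkSession, raw_path: str, out_dir: str) -> DataFrame:
    raw = spark.read.parquet(raw_path)

    # Stage 1: basic cleaning, persisted so QA can see exactly what was dropped.
    cleaned = raw.dropna(subset=["patient_id"])
    cleaned.write.mode("overwrite").parquet(f"{out_dir}/01_cleaned")

    # Stage 2: renaming/enrichment, persisted for the same reason.
    enriched = cleaned.withColumnRenamed("ses", "socioeconomic_status")
    enriched.write.mode("overwrite").parquet(f"{out_dir}/02_enriched")

    # Each persisted step costs I/O, but it lets QA compare stage N against
    # stage N+1 to work out which transform changed a given row.
    return enriched
```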
Repeatability
I recommend having non-production transforms that examine the model's output. For instance, I'm currently working on a workflow that looks for insights in health data. I have a transform that counts the number of statistically significant insights that also have a significant effect (how you calculate these thresholds is up to you). The point being: after every pull request, the pipeline is run and the number of insights is compared to the previous run. If it changes significantly, there's a reasonable chance something nasty was introduced to the code base (a sketch of such a check appears after the figure below).
Healthy-looking (auto-generated) numbers of insights from my models
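The check itself can be as simple as the following sketch; the p-value and effect-size thresholds, the column names, the baseline file, and the 10% tolerance are all assumptions for illustration.

```python
import json

import pandas as pd

P_VALUE_THRESHOLD = 0.05      # illustrative significance cut-off
EFFECT_SIZE_THRESHOLD = 0.2   # illustrative minimum effect size
RELATIVE_TOLERANCE = 0.10     # illustrative allowed drift between runs


def count_significant(insights: pd.DataFrame) -> int:
    """Count insights that are both statistically and practically significant."""
    mask = (insights["p_value"] < P_VALUE_THRESHOLD) & (
        insights["effect_size"].abs() > EFFECT_SIZE_THRESHOLD
    )
    return int(mask.sum())


def check_against_previous(insights: pd.DataFrame, baseline_path: str) -> None:
    """Fail the build if the insight count swings too far from the last run."""
    current = count_significant(insights)
    with open(baseline_path) as fh:
        previous = json.load(fh)["significant_insights"]
    # A large swing after a pull request suggests something nasty crept in.
    if previous and abs(current - previous) / previous > RELATIVE_TOLERANCE:
        raise AssertionError(
            f"Insight count moved from {previous} to {current}; investigate."
        )
```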
Without a decent platform, it's hard to have repeatable builds (see here and here for horror stories). Now, I have my issues with Palantir's Foundry, but at least it has repeatability built in. Under the covers, it uses plain old Git for its SCM capabilities. What's even nicer is that models can be built on a 'data' branch that corresponds to the Git branch, so you won't overwrite the data created by another branch when you do a full build.