Agile Java Man: April 2022

Tuesday, April 19, 2022

Tips for effective MLOps

Some miscellaneous tips I've discovered after over a year of being hands-on with a clinical ETL pipeline.

Technology

Set up a local dev environment (Git, Pip, IDE of choice, Python environment etc). Being able to test locally cannot be more important. For instance, I've been using Palantir's Foundry but since it's just a wrapper around Spark, you can have a full, locally run, test suite using PySpark.

If you can't get the data on your laptop (GDPR etc), use simulated data. If your model cannot find a signal in the noise you control, it likely won't when it faces real data. Make these tests automated so it can be part of your CI/CD pipeline.

Keep everything as deterministic as you can. Using commands like SQL's first means you have no guarantees what will come out of the sausage machine. How can your QAs test that?

Prefer to write intermediary data sets to persistent storage and load them again in the next stage over processing everything in memory with a single output. It might make the pipeline take longer to run but it gives you something like an "audit log" of the steps in your model.

Have diagnositic transforms in your pipeline. Typically, they make assertions to ensure self-consistency (eg, all patients' hospital length-of-stay must be non-negative; the aggregated total of overnight patients cannot exceed the total number of beds). These transforms output any violations into data sets that will never be seen by the end user, just the MLOps team. Diagnostic transforms have the advantage over the local test in that they can use real data as they are embedded in the production pipeline.

Process

Testing data is harder than testing code so spare a thought for your QAs. Talk to the them long before you cut a line of code and establish whether they can test what you're building. If not, find out what you need to give them.

Talk to data owners, tell them what you're trying to do and ask them how they'd do it. They often tell you things you'd never know a priori - like which data is, ahem, "less reliable".

Spend some time (a week?) just playing with the data (EDA) so things become more apparent. It's important to know, for instance, if a column called "region" in A&E data set is the region of the patient or the care provider. Since a patient in A&E may be treated the other side of the country (perhaps they were taken ill while on holiday), the two regions might have nothing to do with one another.

Make sure everybody is singing from the same hymn sheet with a ubiquitous language. Calculating the number of weeks a patient waits couldn't be easier, right? Take the number of days between the start and end date and divide by 7, right? But do we round up or down? And MS SQLServer has its own weird way of doing this calculation.

Data

Don't forget the effect of time on your data (data drift). For instance, care provider networks often get restructured so does hospital trust X in 2020 go by a different name today? Patients move around the country so the same person may be associated with more than one locale. Did more data sources come online over time?

How often is the data updated and how stale is it? These are not the same thing. For instance, the data may be updated daily but its content lags behind by several months.

Were dodgy assumptions made on upstream data sets that you need to be aware of? For instance, removing dead people from the the patients data set might have made sense upstream when dealing with vaccination data but not for historical accident and emergency data.

When unifying data sets, watch out for semantically equivalent values that may be lexically different (eg, if one data set uses an ethnicity of "British white" and another that says "White British", then a union of the two will look like two different ethnicities). This problem becomes harder when one data set uses, say, different age bands. How would you reconcile them?

Monday, April 11, 2022

More MLOps: bug hunting

We have several ML models for which we calculate the odds (the coefficients in a linear regression model) and the risks (the difference in probabilities between the factual and the counterfactual).

Previously, we saw that for one of our six models, the sign of the odds and risks differed when they should always be the same.

Fortunately, we caught this by putting a transform into our pipeline that takes the output of both the linear regression model and factual/counterfactual probabilities and simply checks them.

Part of our ML pipeline. Some transforms create production data, others diagnostic.

In the graph above, anything with _regression_tests appended is a diagnostic transform while everthing else is production data that will ultimately be seen by the customer.

Recently, a diagnostic test was reporting:

difference in risk (0.14285714285714285) does not agree with the sign of the coefficient (-4182.031458784066) for STP QR1 when p-value = 0.988606800326618

Transform containing pipeline diagnostic data.

Aside: there are other violations in this file but they're not unexpected. For instance, there are 42 health regions in England but the data has some random rubbish from the upstream data capture systems. The last row is due to us expecting at least 10 statistically signifcant outputs from our model when we only found 9. That's fine, we'll turn up the squelch on that one. A red flag would be if the number of insights dropped to zero. This happened when we needed the outcome to be binary but accidentally applied a function of type ℕ => {0,1} twice. Oops.

In a previous post, we found the problem was the "risks" and the "odds" transforms, although using the same input file, were consuming that file at significantly different times. The problem manifested itself when most regions for that one model had different signs. But this single diagnostic we see in the violations above is different. Only one region (QR1) differs. All the others in the data set were fine.

Looking at the offending region, the first things to note was that we had very little data for it. In fact, of the 42 health regions, it had the least by a large margin.

Number of rows of data per health region for a particular model

So, the next step was to build a Python test harness and using these 127 rows and run the data against production code locally. To run this on my laptop is so much easier as I can, say, create a loop and run the model 20 times to see if there was some non-determinsm happening. Apart from tiny changes in numbers due to numerical instability, I could not recreate the bug.

So, like in the previous post, could my assumption that the data is the same for both risk and odds calculations be flawed? The pipeline was saying that the two descendant data sets were built seconds after the parent.

Both risk and odds data descend from the same parent

Could there be some bizarre AWS eventual consistency issue? This seems not the case as S3 read/write consistency now has strong consistency (see this AWS post from December 2020).

Conclusion

Why just one region in one model has a different sign is still a mystery. Multiple attempts to see the same bug have failed. However, since there is so little data for this region, the p-values are very high (c. 0.99) we disregard any insights anyway. We'll just have to keep an eye out for it happening again.

Saturday, April 9, 2022

The Scala 3 Type System - part 2

This is the successor to my first post about Scala 3's great new type system.

Peano Axioms (again)

For fun, I'm modelling the natural numbers thus:

sealed trait Nat
case object Zero extends Nat
case class Succ[N <: Nat](n: N) extends Nat

transparent inline def toNat(inline x: Int): Nat =
inline x match
case 0 => Zero
case 1 => Succ(Zero)
case _ => Succ(toNat(x - 1))

But despite me using transparent, the function toNat just returns the super-type toNat not Succ[Succ[Succ[....

scala> toNat(3)
val res0: Succ[Nat] = Succ(Succ(Succ(Zero)))

So, with a bit of help from the guys on the Scala metaprogramming Discord channel, I came up with this:

transparent inline def toNat[I <: Int]: ToNat[I] =
inline erasedValue[I] match
case z: 0 => Zero
case s: S[t] => Succ(toNat[t])

which worked:

scala> toNat[5]

val res0: Succ[Succ[Succ[Succ[Succ[Zero.type]]]]] = Succ(Succ(Succ(Succ(Succ(Zero)))))

You'll notice two major differences. First, toNat no longer takes any arguments. We use the parameterized type in lieu of a parameter

The second is this bizarre S[_] (which is imported from scala.compiletime.ops.int). It gives us the notion of a successor. Let's look more closely at it.

S Express(ion)

From the Scala source code:

/** Successor of a natural number where zero is the type 0 and successors are reduced as if the definition was:

*
* ```scala
* type S[N <: Int] <: Int = N match {
* case 0 => 1
* case 1 => 2
* case 2 => 3
* // ...
* case 2147483646 => 2147483647
* }
* ```
* @syntax markdown
*/
type S[N <: Int] <: Int

What sorcery is this? I see no implementation of that huge match statement!

PhillHenry
The bit I didn't understand was S[t] => ... I assume the compiler enforces that t must be preceeding natural number of S. Is this compiler magic? Or can anybody enforce that constraint in plain Scala?
ragnar
Its compiler magic in the same way that the literal 3 is compiler magic. Basically val x: S[3] = 4 and val y: S[x.type] = 5 work as you would expect, the same holds true in reverse for pattern matching.

(the Scala/metaprogramming Discord server, 3 April 2022).

Compiler Options

By default, the compiler will limit the depth of recursion to 32 levels (after all, you need to tell the compiler when things have become pathological). This is far too small for us and 1024 sounds a bit more sensible. However, I use SBT and wasted an hour telling SBT exactly how to use this value. You need to add:

scalacOptions ++= Seq("-Xmax-inlines", "1024")

Note -X is broken into 2 elements of the Seq. If you foolishly try to add it as a single element as it would appear in the CLI, the compiler/SBT will ignore it.

The takeaway point

A good approach to metaprogramming is to imagine that the functions are being called but the arguments correspond to the parameterized types, not the function's parameters.

Friday, April 8, 2022

Python/Pandas Tips

A few miscellaneous tips that I keep referring to so I thought I'd make a post.

Flatmap in Python

Given a list of lists t,

flat_list = [item for sublist in t for item in sublist]

This is a more Pythonic solution [StackOverflow] than

from itertools import chain

where you might do something like this:

spark_map = F.create_map([F.lit(x) for x in chain(*python_map.items())])

Here, the chain is basically flattening a list (tuples) in a list (.items).

Pandas

Even if your main code does not use Pandas, it can be very useful for writing assertions in your tests.

Say, you have data in the form of lists of lists in variable rows. You can organise this into a Pandas data frame with something like [SO]:

df = pd.DataFrame(rows, range(len(rows)), ["COLUMN_1", "COLUMN_2"])

Now, if I want to get the COLUMN_1 value for the highest COLUMN_2, I do:

df.iloc[df['PREDICTION_COLUMN'].idxmax()]["COHORT"]

One can sort Pandas data frames with something like this [StackOverflow]:

df = df.reindex(df.coefficients.abs().sort_values().index)

where the absolute value of the column coefficients is what I want to sort on.

One can filter with something like:

cleaned = df[(df["p_values"] < 0.05) & ((df["coefficients"] > 0.1) | (df["coefficients"] < -0.1))]

Or, if you want to filter something that does not contain a string, say:

fun[~fun["feature"].str.contains("intercept")].head(50)

Pandas doesn't try to guess the types in the file so if you have a date for instance, you need to do something like:

df['time'] = pd.to_datetime(df['time'])

to convert a string to a timestamp type.

To create a new column from old columns, the syntax is simply:

df["z_score"] = df["coefficients"] / df["standard_error"]

You can concat dataframes with something like

pd.concat([df, df2], axis=1)

where the axis is that of the concatenation - 0 for rows (vertically), 1 for columns (horizontally) etc.

But to join, you need something like:

pd.merge(df1, df2, on=cols, how="inner")

where the parameters are obvious.

Tests

Useful tip when running tests: if you want only a subset to run, you can use a glob ignore. For example:

python -m pytest integration_tests/ --ignore-glob='integration_tests/test_actionable*.py'

will ignore tests that start with test_actionable.

Agile Java Man