In our ML pipeline, we use generalized linear models to calculate the odds of certain clinical outcomes. We showed this to the client but odds were hard for them to understand. "Can we have probabilities instead?" they asked.
So, having trained the GLMs, we feed the same data back through them and calculate the average probabilities for each cohort. We then bastardise the data to create the counterfactuals. For example, socioeconomic status is of interest, so let's set everybody to the same level (the counterfactual), make our predictions again and compare with the factual results.
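As a rough sketch of that step (assume model is the fitted pipeline, including any encoding, and that patients, cohort and socioeconomic_status merely stand in for the real column names):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, lit}

// Factual predictions: score the cohort data exactly as it is.
val factual = model.transform(patients)

// Counterfactual predictions: force everybody to the same socioeconomic level
// ("3" is an arbitrary choice for illustration) and score again.
val counterfactual = model.transform(patients.withColumn("socioeconomic_status", lit("3")))

// Average predicted probability per cohort under each scenario.
def cohortRisk(df: DataFrame): DataFrame =
  df.groupBy("cohort").agg(avg("prediction").as("risk"))

cohortRisk(factual)
  .join(cohortRisk(counterfactual).withColumnRenamed("risk", "counterfactual_risk"), "cohort")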
Now, one would imagine that if the corresponding GLM coefficient were positive, then the average probability from the counterfactuals would be less than that of the factual data. This is simple maths, right? You take the dot product of your feature vector with the coefficients and plug it into the sigmoid function. A positive coefficient on a feature the counterfactual removes means a bigger dot product, so a smaller denominator and a larger probability for the factual data.
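To make that concrete (the numbers here are made up purely for illustration):

// p = 1 / (1 + e^(-x·β)): a bigger dot product means a smaller denominator and a bigger probability.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

val baseline = -1.2  // x·β with the feature of interest forced to its counterfactual value of 0
val beta     =  0.8  // a positive coefficient on that feature

sigmoid(baseline)         // ≈ 0.23, the counterfactual probability
sigmoid(baseline + beta)  // ≈ 0.40, the factual probability, necessarily larger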
We were mostly seeing this, but about 10% of our predictions violated this rule.
Investigation
A transform compares the two approaches (risks and odds) and, with a bit of Pandas, we can look at the worst offenders:
> import pandas as pd
> df = pd.read_csv("/home/henryp/Documents/CandF/fitted_vs_coeffs.txt", delimiter='\t')
> sub = df[["coefficients", "cohort_risk", "counterfactual_cohort_risk", "p_values", "feature", "model_id"]].copy()
> sub["diff"] = sub["cohort_risk"] - sub["counterfactual_cohort_risk"]
> sub = sub.reindex(sub["diff"].abs().sort_values().index)
> sub[-5:]
coefficients cohort_risk counterfactual_cohort_risk p_values feature model_id diff
25 -0.520122 0.239911 0.184961 0.121023 ethnic_category_one_hot_Other a_and_e_cancer 0.054950
23 -0.561247 0.277648 0.219477 0.006748 ethnic_category_one_hot_Asian a_and_e_cancer 0.058171
44 -0.292537 0.545455 0.480899 0.821720 ethnic_category_one_hot_Asian wait_18_ophthalmology 0.064556
50 -0.224494 0.307358 0.229469 0.480723 ethnic_category_one_hot_Black a_and_e_cancer 0.077889
8 -5.340222 0.636364 0.551042 0.957624 ethnic_category_one_hot_Asian wait_18_ophthalmology 0.085322
We note that the p-values indicate a lack of statistical significance in the coefficients, even though the difference between the factual and counterfactual probabilities can be pretty large.
But given that the same data should produce the same p-values and coefficients each time (modulo floating point issues), how could this happen?
In generalized linear regression models, the coefficients don't change much from run to run, even when shuffling the data. This is true irrespective of the p-values. Let's demonstrate.
Shuffling
The Scala code taken from the official Spark docs for a GLR (here), fitted twice on shuffled data, gives this:
glr.fit(dataset.orderBy(rand())).coefficients.toArray.zip(glr.fit(dataset.orderBy(rand())).coefficients.toArray).map{ case (x, y) => x - y }
res24: Array[Double] = Array(1.231653667943533E-16, -2.220446049250313E-16, 2.220446049250313E-16, 0.0, 1.1102230246251565E-16, -2.220446049250313E-16, -1.6653345369377348E-16, 2.220446049250313E-16, 2.220446049250313E-16, -2.220446049250313E-16)
Running this a few times shows the differences are minuscule even though the p-values are pretty large:
glr.fit(dataset.orderBy(rand())).summary.pValues
res25: Array[Double] = Array(0.989426199114056, 0.32060241580811044, 0.3257943227369877, 0.004575078538306521, 0.5286281628105467, 0.16752945248679119, 0.7118614002322872, 0.5322327097421431, 0.467486325282384, 0.3872259825794293, 0.753249430501097)
Any differences in the coefficients between runs are almost certainly due to floating point arithmetic.
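(For completeness: the snippets above assume a glr estimator and a dataset already defined in the shell. The setup would have looked something like the GLR example in the Spark docs; the file path and parameters below are from that example rather than our clinical data.)

import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.functions.rand

// The sample data set that ships with the Spark distribution.
val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

// regParam is left at its default of 0.0, i.e. no regularisation.
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)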
Modification
However, things are very different when we modify the data; even a small perturbation can radically change coefficients with high p-values.
Note that there are only 500 rows in this test data set. So let's fit two models, each dropping on average just one row at random. Then we see things like this:
def compare2(): Unit = {
  // Keep each row with probability 1 - tol; with ~500 rows, this drops about one row per fit.
  val tol = 0.002

  val fitted      = glr.fit(dataset.where(rand() > tol))
  val pValues     = fitted.summary.pValues
  val stdErrors   = fitted.summary.coefficientStandardErrors
  val coeffs_rand = fitted.coefficients.toArray

  // A second fit on another, almost identical, random subset.
  val other = glr.fit(dataset.where(rand() > tol))

  // Absolute and percentage differences between the two sets of coefficients.
  val diffs    = other.coefficients.toArray.zip(coeffs_rand).map { case (x, y) => x - y }
  val diffs_pc = diffs.zip(coeffs_rand).map { case (x, y) => (x / y) * 100 }

  // pValues has the intercept's p-value as its last element; zip drops it, pairing feature coefficients only.
  val meta = pValues.zip(diffs).zip(diffs_pc).zip(stdErrors).map { case (((p, d), pc), s) => (p, d, pc, s) }

  println("%-15s%-15s%-15s%-15s".format("p-value", "difference", "% difference", "std error"))
  meta.sortBy(x => -math.abs(x._1)).foreach { case (p, d, pc, s) =>
    println("%-15.3f%-15.3f%-15.3f%-15.3f".format(p, d, pc, s))
  }
}
scala> compare2()
p-value difference % difference std error
0.977 0.051 -224.562 0.793
0.602 0.101 -24.475 0.792
0.574 0.043 9.546 0.793
0.551 -0.035 7.432 0.795
0.459 0.044 -7.145 0.828
0.456 0.085 14.590 0.776
0.283 -0.068 -7.897 0.803
0.275 0.133 -15.230 0.796
0.134 -0.067 -5.484 0.811
0.006 0.119 5.188 0.830
We need to run it only a dozen or so times to see that the coefficients with the largest p-values tend to have the largest percentage discrepancies between the two fits, and that their sign can even change. This makes sense: a large p-value means the coefficient is small relative to its standard error, so a tiny change in the data can move it by a large fraction of its value, or flip its sign entirely.
The Culprit
These CLI antics led me to question the "given the same data" assumption. A closer look at the timestamps of the input data sets revealed they were built a week apart (Palantir's Foundry will, by default, silently fall back to data sets on a different branch if they have not been built on the current branch, and that data may be stale). Well, our data slowly grows over time. So, could these small differences be responsible for big swings in the GLM coefficients? I created in the pipeline a single snapshot of the data from which both the risk and odds values were derived, and there were zero discrepancies. Sweet.