Friday, March 25, 2022

Riddles of Sphinx

Sphinx is a tool to autogenerate documentation. It's generic but is often used in the Python ecosystem. 

Once configured, you run it via an old-school Makefile. It will then do to your Python code what Javadoc does to Java. It doesn't appear to have the ability to mix code and documentation in the way ScalaTest does.

How to use it

I found this StackOverflow answer a lot clearer than the official documentation. The upshot of it is:

  1. Create a docs folder. In this, you can run sphinx-quickstart if you like (separating source from build - see SOURCEDIR and BUILDDIR in the Makefile) but, one way or another, your conf.py file will need to know where to find the code. If you've followed the instructions in the SO answer, it's two levels up, so conf.py needs to have:

    import os
    import sys
    sys.path.insert(0, os.path.abspath(os.path.join('..', '..', 'src')))


    (note the two '..') in it. Also, note that this path is the top-level directory of your code; if there are relative imports in your code, they need to be relative to this directory.
  2. You'll also need the extensions to parse Python (see the sketch after this list).
  3. The index.rst file needs to be told what to pull in. You do this by adding a reference under toctree (let's arbitrarily call it module) that corresponds to another file (module.rst) in the same directory. 
  4. In turn, our arbitrarily named module.rst needs, in its toctree, a reference to our final sub-module. We'll call it my_docs and it will necessitate a file called my_docs.rst in the same directory.
  5. Here, in the automodule block, we'll tell Sphinx which file we want documented. Note that you don't need the .py extension.
  6. Note that the .rst files do not need to conform to your directory structure.
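
To make steps 2 to 5 concrete, here is a sketch (module, my_docs and my_code.py are just the arbitrary names from the steps above, not anything Sphinx mandates). For step 2, conf.py needs the autodoc extension:

extensions = ['sphinx.ext.autodoc']

Then index.rst (step 3) gets:

.. toctree::
   :maxdepth: 2

   module

module.rst (step 4) gets:

.. toctree::
   :maxdepth: 2

   my_docs

and my_docs.rst (step 5) pulls in my_code.py:

.. automodule:: my_code
   :members: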

PDFs

Sphinx natively creates LaTeX files, but to create a PDF I needed to run:

sudo apt-get install latexmk

which allowed me to run:

make latexpdf

Dependencies

I had a problem with third-party libraries on which my code depended. Having found my code, Sphinx was telling me "no module named pyspark" when it hit the import statement in my code. This SO answer helped out but it left me with a nasty hard-coding in my conf.py. This may be more down to the way Python manages its dependencies (which is horrible compared to Java).
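
Two common workarounds, both living in conf.py (neither is necessarily what the SO answer proposes, and the path below is purely illustrative):

# Option 1: hard-code the location of the third-party packages
sys.path.insert(0, os.path.abspath('/path/to/my/venv/lib/python3.8/site-packages'))

# Option 2: have autodoc stub out modules it cannot import
autodoc_mock_imports = ['pyspark']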

Writing effective Sphinx docstrings

You can reference constants in your docstring with something like :py:const:`FQN.CONSTANT_NAME` (where FQN is the fully qualified path to your Python file). If you also put something like this in your .rst file:

.. autoattribute:: FQN.CONSTANT_NAME

then Sphinx will not only provide a hyperlink to the constant but also print out the value of that constant.
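
For example (the module and constant names are made up), given a my_code.py containing:

TIMEOUT_SECONDS = 30

def poll():
    """Polls the server, giving up after :py:const:`my_code.TIMEOUT_SECONDS` seconds."""
    ...

and .. autoattribute:: my_code.TIMEOUT_SECONDS in the .rst file, the rendered page shows the value 30 and the reference in poll's docstring becomes a hyperlink to it.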

Executing the code in comments

For this, try doctest. It not only executes the code in your docstrings, it also checks the results.
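
A minimal (made-up) example:

def double(x):
    """Doubles its argument.

    >>> double(2)
    4
    """
    return 2 * x

Running python -m doctest -v my_code.py executes the >>> lines and fails if the output doesn't match. Sphinx also ships a sphinx.ext.doctest extension for running examples embedded in the documentation itself.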


Wednesday, March 23, 2022

MLOps debugging: an example

In our ML pipeline, we use generalized linear models to calculate the odds of certain clinical outcomes. We showed this to the client but odds were hard for them to understand. "Can we have probabilities instead?" they asked. 

So, having trained the GLMs, we fit the same data and calculate the average probabilities for each cohort. We then bastardise the data to create the counterfactuals. For example, socioeconomic status is of interest, so let's give everybody the same level (the counterfactual), make our predictions again and compare with the factual results.

Now, one would imagine that if the corresponding GLM coefficient were positive, then the average probability from the counterfactuals would be less than that of the factual data. This is simple maths, right? You take the dot product of your feature vector with the coefficients and plug it into the sigmoid function, 1 / (1 + e^-z). A positive coefficient makes z bigger, the denominator smaller and so the probability larger.
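
As a toy illustration of that arithmetic (the numbers are made up and have nothing to do with the real models), take a single one-hot feature with a positive coefficient and zero it out in the counterfactual:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

intercept, coeff = -1.0, 0.8                     # made-up values
factual        = sigmoid(intercept + coeff * 1)  # feature present
counterfactual = sigmoid(intercept + coeff * 0)  # feature zeroed out
print(factual, counterfactual)                   # ~0.45 vs ~0.27

The counterfactual probability comes out lower, which is the relationship we expected to see in the real data.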

We were mostly seeing this but about 10% of our predictions violated this rule.

Investigation

A transform compares the two approaches (risks and odds). With a bit of Pandas:

>>> import pandas as pd
>>> df = pd.read_csv("/home/henryp/Documents/CandF/fitted_vs_coeffs.txt", delimiter='\t')
>>> sub = df[["coefficients", "cohort_risk", "counterfactual_cohort_risk", "p_values", "feature", "model_id"]].copy()
>>> sub["diff"] = sub["cohort_risk"] - sub["counterfactual_cohort_risk"]
>>> sub = sub.reindex(sub["diff"].abs().sort_values().index)
>>> sub[-5:]
    coefficients  cohort_risk  counterfactual_cohort_risk  p_values  feature                       model_id               diff
25     -0.520122     0.239911                    0.184961  0.121023  ethnic_category_one_hot_Other a_and_e_cancer         0.054950
23     -0.561247     0.277648                    0.219477  0.006748  ethnic_category_one_hot_Asian a_and_e_cancer         0.058171
44     -0.292537     0.545455                    0.480899  0.821720  ethnic_category_one_hot_Asian wait_18_ophthalmology  0.064556
50     -0.224494     0.307358                    0.229469  0.480723  ethnic_category_one_hot_Black a_and_e_cancer         0.077889
8      -5.340222     0.636364                    0.551042  0.957624  ethnic_category_one_hot_Asian wait_18_ophthalmology  0.085322


we note that the p-values indicate a lack of statistical significance in the coefficients even when the difference between factual and counterfactual probabilities is pretty large.

But given the same data should produce the same p-values and coefficients each time (modulo floating point issues), how could this happen?

In generalized linear regression models, the coefficients don't change much between runs on the same data, even when shuffling it. This is true irrespective of the p-values. Let's demonstrate.

Shuffling

Taking the Scala code for a GLR from the official Spark docs (here) and fitting twice on shuffled data gives this:

glr.fit(dataset.orderBy(rand())).coefficients.toArray.zip(glr.fit(dataset.orderBy(rand())).coefficients.toArray).map{ case (x, y) => x - y }

res24: Array[Double] = Array(1.231653667943533E-16, -2.220446049250313E-16, 2.220446049250313E-16, 0.0, 1.1102230246251565E-16, -2.220446049250313E-16, -1.6653345369377348E-16, 2.220446049250313E-16, 2.220446049250313E-16, -2.220446049250313E-16)

Running this a few times shows the differences are minuscule even though the p-values are pretty large:

glr.fit(dataset.orderBy(rand())).summary.pValues

res25: Array[Double] = Array(0.989426199114056, 0.32060241580811044, 0.3257943227369877, 0.004575078538306521, 0.5286281628105467, 0.16752945248679119, 0.7118614002322872, 0.5322327097421431, 0.467486325282384, 0.3872259825794293, 0.753249430501097)

Any differences in the coefficients between runs are almost certainly due to floating point arithmetic.

Modification

However, things are very different when we modify the data; even a small perturbation can radically change coefficients with high p-values.

Note that there are only 500 rows in this test data set. But let's fit two models where on average we drop one row at random. Then we see things like this:

def compare2(): Unit = {
  val tol         = 0.002
  val fitted      = glr.fit(dataset.where(rand() > tol))
  val pValues     = fitted.summary.pValues
  val stdErrors   = fitted.summary.coefficientStandardErrors
  val coeffs_rand = fitted.coefficients.toArray
  val other       = glr.fit(dataset.where(rand() > tol))
  val diffs       = other.coefficients.toArray.zip(coeffs_rand).map { case (x, y) => x - y }
  val diffs_pc    = diffs.zip(coeffs_rand).map { case (x, y) => (x / y) * 100 }
  val meta        = pValues.zip(diffs).zip(diffs_pc).zip(stdErrors).map { case (((p, d), pc), s) => (p, d, pc, s) }
  println("%-15s%-15s%-15s%-15s".format("p-value", "difference", "% difference", "std error"))
  meta.sortBy(x => -math.abs(x._1)).foreach { case (p, d, pc, s) => println("%-15.3f%-15.3f%-15.3f%-15.3f".format(p, d, pc, s)) }
}

scala> compare2()
p-value        difference     % difference   std error      
0.977          0.051          -224.562       0.793          
0.602          0.101          -24.475        0.792          
0.574          0.043          9.546          0.793          
0.551          -0.035         7.432          0.795          
0.459          0.044          -7.145         0.828          
0.456          0.085          14.590         0.776          
0.283          -0.068         -7.897         0.803          
0.275          0.133          -15.230        0.796          
0.134          -0.067         -5.484         0.811          
0.006          0.119          5.188          0.830        

We need to run it only a dozen or so times to see that the coefficients with the largest p-values tend to have the largest percentage discrepancies between the two fits, and that the sign can change.

The Culprit

These CLI antics led me to question the "given the same data" assumption. A closer look at the timestamps of the input data sets revealed they were built a week apart (Palantir's Foundry will by default silently fall back to data sets on a different branch if they have not been built on the current branch, so the data may be stale). Well, our data slowly grows over time. So, could these small differences be responsible for big swings in the GLM coefficients? I created a single snapshot of the data in the pipeline from which both the risk and odds values were derived, and there were zero discrepancies. Sweet.

Sunday, March 20, 2022

Scala 3 Type System Notes

Importance of types

With a sufficiently precise type system, pathological states become unreachable; the state space of the application simply does not include them.

Scala 3 has some new keywords that help metaprogramming. Here are some notes I made as I tried to grok them.

New Scala 3 Keywords

The inline modifier "tells the compiler that it should inline any invocation of this method at compile-time. If it's not possible, compiler will fail the compilation." Note that the compiler will complain with inline match can only be used in an inline method if you try to put it in a normal match clause.

The inline keyword will partially act like the Scala 2 or C/C++ version of the keyword if the expression is too complex to evaluate at compile time. That is, it will stick the code in at compile time and let it evaluate at runtime. If you don't want this partial evaluation, "you can tell the compiler to avoid complex expression as parameter. This is done by using an inline if" [Scalac blog]. So, by using a def and an if that are both inlined you could, for example, have all debug statements removed at compile time.

Note, the parameters to a function can also be inlined.

"erasedValue comes from compiletime package. It's usually used in tandem with inline match. It allows us to match on the expression type, but we cannot access extracted value as that code is executed at compile-time

"constValue comes from compiletime too. It returns the value of a singleton type."

Note that "in Scala 3 you must use a lower-case identifier for a type being extracted from a pattern match." [Michał Sitko's excellent blog post]

The keyword using appears to be the Scala 3 equivalent of implicit in Scala 2. For compilation to succeed, every time you see a using there must be a corresponding given that satisfies its demands.

The keyword with can stand in for the equals sign (and the new ClassName that follows it) when declaring a given. Quoting the Scala Book, this SO answer says:

Because it is common to define an anonymous instance of a trait or class to the right of the equals sign when declaring an alias given, Scala offers a shorthand syntax that replaces the equals sign and the "new ClassName" portion of the alias given with just the keyword with.
The keyword given makes zero or more functions available in the ether for a corresponding using.

The keyword transparent can be applied to both traits and inline methods. In each case, it's saying the returned type is more specific at compile time. If it cannot be more specific at compile time (say, it depends on a condition that is only known at runtime) then it will fall back to the most general type.

Opaque types seem to basically be tiny types that are very easy to use (see the official documentation). To paraphrase: outside of the module, it is not possible to discover that the opaque type is actually implemented as its alias.

Examples

If we define the natural numbers as:

  trait Nat
  class _0 extends Nat
  class Succ[A <: Nat] extends Nat
  trait <[A <: Nat, B <: Nat]

then the Peano axioms can be represented by:

  given basic[B <: Nat]: <[_0, Succ[B]] with {}
  given inductive[A <: Nat, B <: Nat](using lt: <[A, B]): <[Succ[A], Succ[B]] with {}

(From RockTheJVM)

So, that inductive function allows compilation if it is satisfied in a (potentially) recursive manner until we reach basic. Having these axioms in the ether, we can compile this:

type _1 = Succ[_0]
type _2 = Succ[_1] // Succ[Succ[_0]]
type _3 = Succ[_2] // Succ[Succ[Succ[_0]]]
type _4 = Succ[_3] // Succ[Succ[Succ[Succ[_0]]]]
...

I hope to build on this to make data engineering pipelines easier to debug. I'll expand on this idea in a future post as this one has become long enough.

Monday, March 7, 2022

Model Selection - part 2

We want to predict who will receive a vaccination. Although the outcome is binary (the patient did or did not receive a vaccination), we can build a confusion matrix if, for each class, we pretend it's a binary problem. So, if we started with the booked_vaccinated class, our confusion matrix looks like:

true positive:  prediction == booked_vaccinated AND label == booked_vaccinated

false positive: prediction == booked_vaccinated AND label != booked_vaccinated

true negative:  prediction != booked_vaccinated AND label != booked_vaccinated

false negative: prediction != booked_vaccinated AND label == booked_vaccinated
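
A sketch of that counting in Python (the labels and predictions below are toy data; only the class name booked_vaccinated comes from the real problem):

import numpy as np

labels      = np.array(["booked_vaccinated", "declined", "booked_vaccinated", "no_response"])
predictions = np.array(["booked_vaccinated", "booked_vaccinated", "declined", "no_response"])

positive = "booked_vaccinated"
tp = np.sum((predictions == positive) & (labels == positive))  # 1
fp = np.sum((predictions == positive) & (labels != positive))  # 1
tn = np.sum((predictions != positive) & (labels != positive))  # 1
fn = np.sum((predictions != positive) & (labels == positive))  # 1

Repeating this with each class in turn playing the positive role gives one one-vs-rest confusion matrix per class.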

Now that we have a confusion matrix, we can look at AUCs to test our model. But is AUC the only metric?

AUC or Log-Likelihood

Here are some miscellaneous notes from a data scientist who spent a good chunk of his PhD optimizing models:

  1. AUC is good for ascertaining whether the model is well specified and realistic. It can tell you which choice of features is better than others.
  2. Log-likelihood tells us how well the model fits the data. If the data is constant and the model structure is the same, the LL can tell us which choice of hyperparameters is better than others.

Note that the log-likelihood surface of a linear model is concave (equivalently, the negative log-likelihood is convex). That is, there are no local maxima, so optimisation is guaranteed to find the global maximum.

Because log-likelihoods are logs of probabilities, even small differences are important. A difference in log-likelihood is the log of a ratio of probabilities, so a difference of a few units can be significant (a model whose log-likelihood is 2 units higher is e² ≈ 7.4 times more likely to have generated the data).

Guarding against Regressions

By having a stage in our pipeline that invokes the AUC calculations, we can keep an eye on our models as the code is developed. For instance, I have a Palantir @transform that just prints out a single row like this:

[Image: AUC calculations for an ML pipeline]

If this changes radically, I know I've done something wrong.

Thursday, March 3, 2022

Model Selection

Area Under the Precision-Recall Curve 

Unlike ROC curves, PR curves have regions that are impossible to reach. There are therefore techniques to calculate the area under the curve (AUC) by interpolating. [2]

"One can easily show that, like in ROC space, each valid confusion matrix, where TP > 0, defines a single and unique point in PR space. In PR space, both recall and precision depend on the TP cell of the confusion matrix, in contrast to the true positive rate and false positive rate used in ROC space. This dependence, together with the fact that a specific data set contains a fixed number of negative and positive examples, imposes limitations on what precisions are possible for a particular recall." [1]

This isn't just an interesting fact. It has implications for calculating the AUC, which requires some approximations. For instance, if the model predicts no positives (either true or false), the precision is undefined because its denominator, TP + FP, is zero.

Apache Spark "computes the area under the curve (AUC) using the trapezoidal rule." [Spark docs] However, "the trapezoidal rule ... uses linear interpolation and can be too optimistic." [SKLearn docs]

SKLearn's average_precision_score (see above) calculates the AUC by multiplying each x-delta by the corresponding y-value (that is, using rectangles rather than trapezoids). Since the deltas on the x-axis add up to 1.0, the result of this calculation is also the average. The y-value at x=0 can be estimated from the first data point (see [2]).
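
To see the two interpolation schemes side by side (the data here is randomly generated, purely for illustration):

import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true  = rng.integers(0, 2, size=1000)                                   # toy labels
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.25, size=1000), 0, 1)  # toy scores, correlated with the labels

precision, recall, _ = precision_recall_curve(y_true, y_score)

print(auc(recall, precision))                    # trapezoidal rule (linear interpolation)
print(average_precision_score(y_true, y_score))  # rectangles, i.e. average precision

The two numbers are close but not identical; the quotes above explain why the linearly interpolated one can be the more optimistic.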

Properties of the PR-curve

Another feature of a PR curve is the monkey score line. For a classifier that is wildly guessing, the PR curve will look like a flat line, as the precision (= TP / (TP + FP)) will hover around the fraction of positives in the data, P / (P + N) (StackExchange).
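
You can see this baseline with a few lines of Python (the data is made up): give a classifier random scores over a data set that is roughly 20% positive and the precision hovers around 0.2 whatever the recall.

import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true  = (rng.random(100_000) < 0.2).astype(int)  # ~20% positives
y_score = rng.random(100_000)                      # wild guessing

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(y_true.mean())          # the positive-class prevalence, ~0.2
print(np.median(precision))   # the curve sits around the same value

This flat line at the class prevalence is the PR-space analogue of the diagonal in ROC space.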

The perfect plot of a PR curve is a perfect ROC curve rotated 90-degrees clockwise. Furthermore, there are 3 interesting properties of PR curves [2]:

  1. Interpolation between two precision-recall points is non-linear.
  2. The ratio of positives and negatives defines the baseline.
  3. A ROC point and a precision-recall point always have a one-to-one relationship.

[1] Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation

[2] The blog of Takaya Saito and Marc Rehmsmeier