Sunday, November 21, 2021

Why set cannot be a functor

This is well known but keeps tripping me up when I think of an example so here it is.

Set demands that its elements implement Eq (or equals(), or whatever is semantically equivalent). List has no such requirement. An equals implementation that treats structurally different values as equal can lead to Set violating the functor laws.

Imagine an implementation that looked something like:

class OddEquals(val id: Int, val label: String) {
  // Equality (and hashing) depend only on the primary key, not the label.
  override def equals(a: Any): Boolean = a match {
    case other: OddEquals => id == other.id
    case _                => false
  }

  override def hashCode(): Int = id
}

which would be a fairly typical piece of code for objects whose identity depends solely on their primary key.

Now, imagine two services. The first finds an entity given any related string; we'll call it f. I'll fake it like this:

  type F = String => OddEquals
  val f: F = x => new OddEquals(x.length, x)

The other service simply extracts the entity's name:

  type G = OddEquals => String
  val g: G = _.label

The functor law says:

xs.map(x => g(f(x))) == xs.map(f).map(g)

But if xs is Set("hello", "world"), the left-hand side is the same as the input while the right-hand side is Set("hello"). Both strings have length 5, so f maps them to OddEquals values that the Set considers equal and collapses into a single element before g can extract their labels.

Compare this to List("hello", "world") and you'll see that the left- and right-hand sides are equal (that is, the same as xs).
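
Putting those pieces together makes the failure concrete (a quick sketch using the f and g defined above):

  val xs = Set("hello", "world")

  xs.map(x => g(f(x)))  // Set("hello", "world"): compose first, then map once
  xs.map(f).map(g)      // Set("hello"): f maps both strings to id 5, so the Set collapses them before g runs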

Wednesday, November 17, 2021

Python's Garbage

Python's GC appears to be hugely different from the GC in Java we know and love. I have some PySpark code that collects data from a data set back to the driver and then processes it in Pandas. The first few collects work nicely but ultimately PySpark barfs with memory issues, and no amount of increasing the driver memory seems to solve the problem. This is odd, because each collect only retrieves a modest amount of data, and once it has been processed I'm no longer interested in it.

"Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system.

"As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don't. There is one thing that always works, however, because it is done at the OS, not language, level.

"Then if you do something like

import multiprocessing
result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

"Then the function is executed at a different process. When that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, the garbage collector, could do to stop that." [StackOverflow]

However, this does not work well for turning PySpark DataFrames into Pandas DataFrames: Spark is lazy, so the DataFrame cannot be serialized and handed to the new process (you cannot serialize, for instance, an open network port).

Ultimately, I had to cut my losses and revert from StatsModels to Spark's own logistic regression implementation.

In addition to this, we have the horrors of what happens with a Spark UDF when written in Python:

"Each row of the DataFrame is serialised, sent to the Python Runtime and returned to the JVM. As you can imagine, it is nothing optimal. There are some projects that try to optimise this problem. One of them is Apache Arrow, which is based on applying UDFs with Pandas." [Damavis]

Yuck. 
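
For what it's worth, the Arrow work mentioned in that quote surfaces in PySpark as pandas UDFs, which move whole columns across the Python boundary as pandas Series instead of pickling row by row. A rough sketch (the DataFrame and column names here are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumnRenamed("id", "wait_minutes")  # toy stand-in data

# Row-at-a-time UDF: every row is serialised, sent to the Python worker and back.
@udf(DoubleType())
def hours(minutes):
    return minutes / 60.0

# Vectorised, Arrow-backed pandas UDF: whole columns travel as pandas Series.
@pandas_udf(DoubleType())
def hours_vectorised(minutes: pd.Series) -> pd.Series:
    return minutes / 60.0

# Same answer either way; the difference is how the data gets shipped to Python.
df.withColumn("hours", hours("wait_minutes")).show(3)
df.withColumn("hours", hours_vectorised("wait_minutes")).show(3)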

Wednesday, November 10, 2021

Logistic Regression in Action

Once again, I'm looking at patient data waiting times. This time, it's the A&E department. I'm not so much interested in how long they wait but rather the binary condition of whether they waited more or less than four hours.

Logistic Regression with Spark 

I used Spark's GeneralizedLinearRegression. Hyperparameter tuning did not produce much difference in output.
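
The real pipeline is more involved, but the shape of it was roughly this (df, feature_cols and the "breached" label column are placeholders, not the real schema):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

# df and feature_cols stand in for the A&E data and its feature columns;
# "breached" is 1 if the patient waited more than four hours, 0 otherwise.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train, test = assembler.transform(df).randomSplit([0.9, 0.1], seed=42)

glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  featuresCol="features", labelCol="breached")
model = glr.fit(train)
predictions = model.transform(test)

# The training summary carries the coefficient statistics discussed below.
print(model.summary.coefficientStandardErrors)
print(model.summary.pValues)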

Splitting the data 90/10 for train and test gave me the following curves on the test set:

Prediction where true value is 1

The above shows the predictions for patients who we know really did wait more than 4 hours for treatment. The plot below shows predictions for those we know waited less:

Prediction where true value is 0

Clearly, the model has spotted something but the overlap between the two states is large. What should the threshold be to best differentiate one from the other? Fortunately, I'm not trying to predict but to infer.

You can evaluate how good the model is with some simple PySpark code:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol=dependent_col,
                                          rawPredictionCol=prediction_col)
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})

Evaluating with Spark's BinaryClassificationEvaluator gave me an area under the ROC curve of 0.79. Hardly stellar but not terrible.

"A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5... Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance." [MachineLearningMastery]

Probabilities, Odds, and Odds-Ratios

"Odds are the probability of an event occurring divided by the probability of the event not occurring. An odds ratio is the odds of the event in one group ... divided by the odds in another group." [Making sense of odds and odds ratios]

"The [linear regression] model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)... The choice of denominator is arbitrary in that the estimates are equivariant under this choice." [Elements of Statistical Learning]

K here is the number of groups. So, for the case where we're looking at true/false outputs, K=2 and the equations in ESL become the simple equation in the following section.

Model output

An example: "given estimated parameters and values for x1 and x2 , the model might predict y = 0.5, but the only meaningful values of y are 0 and 1.  It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance. But it is also possible for this model to predict y = 1.1 or y = −0.1, and those are not valid probabilities. Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities." [Think Stats, 2nd Edition, Downey]

In a two-state output, Downey's equation would be:

y = log(odds) = log(p / (1-p)) =  β0 + β1 x1 + ...
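
Rearranging for p gives the logistic function, which is how a log-odds prediction is turned back into a probability:

p = 1 / (1 + e^-(β0 + β1 x1 + ...))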

Odds(-ratios) are used when reporting effect size. "when X, the predictor is continuous, the odds ratio is constant across values of X.  But probabilities aren’t" because of that non-linear relationship above between odds and p.

"It works exactly the same way as interest rates.  I can tell you that an annual interest rate is 8%.  So at the end of the year, you’ll earn $8 if you invested $100, or $40 if you invested $500.  The rate stays constant, but the actual amount earned differs based on the amount invested. Odds ratios work the same." [TheAnalysisFactor] Since there is no bound on x1 etc, you could easily have a value greater than 1.0 or less than 0.0 which just doesn't make sense for probabilities.

Interpreting the coefficients

I have 36 features and the p-values of approximately 30 of them are less than 0.05. This seems suspiciously good, even if some of the corresponding coefficients are not particularly large. "If that difference between your coefficient and zero isn't interesting to you, don't feel obligated to interpret it substantively just because p < .05." [StackOverflow]

"How does one interpret a coefficient of 0.081 (Std. Error = 0.026)? ... Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).

"This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors)... A Z score greater than approximately 2 in absolute value is significant at the 5% level." [Elements of Statistical Learning]

Interpreting the intercept

If we take the example here [QuantifyingHealth], where we're looking at the relationship between tobacco consumption (x1) and heart disease (y) over a 10-year period, then the interpretation of the intercept's coefficient depends on what x1 = 0 actually represents.

For example, say that the intercept's coefficient is -1.93. This leads us to believe the probability of heart disease for a non-smoker (x1 = 0) is 0.13, by plugging -1.93 into the equation above and rearranging.
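
Concretely, using the logistic function from earlier with x1 = 0:

p = 1 / (1 + e^1.93) ≈ 0.13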

If the smoking consumption is standardized, x1=0 "corresponds to the mean lifetime usage of tobacco in kg, and the interpretation becomes: For an average consumer of tobacco, the probability of having a heart disease in the next 10 years is 0.13" [ibid].

If we're only interested in smokers not the general population, then x1≠0. "Let’s pick the maximum [623kg] as a reference and calculate the limit of how much smoking can affect the risk of heart disease... We get P = 0.22. The interpretation becomes: the maximum lifetime consumption of 623 kg of tobacco is associated with a 22% risk of having heart disease".
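
As a quick check on those numbers: P = 0.22 corresponds to log-odds of log(0.22 / 0.78) ≈ -1.27, so the 623 kg of tobacco contributes roughly -1.27 - (-1.93) ≈ 0.66 to the linear predictor on top of the intercept.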