Once again, I'm looking at patient waiting-time data. This time, it's the A&E department. I'm not so much interested in how long patients wait as in the binary outcome of whether they waited more or less than four hours.
Logistic Regression with Spark
I used Spark's GeneralizedLinearRegression. Hyperparameter tuning did not produce much difference in output.
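For reference, here's a minimal sketch of the kind of pipeline that implies. The 90/10 split matches what's described below; feature_cols, raw_df and the waited_over_4_hours label are placeholder names, not the real schema of my data.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

# Assemble the raw feature columns into a single vector column
# (feature_cols and raw_df are placeholders).
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(raw_df)

# 90/10 train/test split.
train, test = df.randomSplit([0.9, 0.1], seed=42)

# A binomial family with a logit link is logistic regression.
glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  featuresCol="features",
                                  labelCol="waited_over_4_hours")
model = glr.fit(train)
predictions = model.transform(test)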
Splitting the data 90/10 for train and test gave me the following plots for the test set:
[Figure: predictions where the true value is 1]
The plot above shows the predictions for patients who we know really did wait more than 4 hours for treatment. The plot below shows the predictions for those we know waited less:
[Figure: predictions where the true value is 0]
Clearly, the model has spotted something but the overlap between the two states is large. What should the threshold be to best differentiate one from the other? Fortunately, I'm not trying to predict but to infer.
You can evaluate how good the model is with some simple PySpark code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# dependent_col holds the 0/1 label (waited more than four hours or not) and
# prediction_col holds the model's raw prediction; `predictions` is the
# DataFrame produced by transforming the test set with the fitted model.
evaluator = BinaryClassificationEvaluator(labelCol=dependent_col, rawPredictionCol=prediction_col)
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
auprc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
Evaluating with Spark's BinaryClassificationEvaluator gave me an area under the ROC curve of 0.79. Hardly stellar but not terrible.
"A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5... Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance." [MachineLearningMastery]
Probabilities, Odds, and Odds-Ratios
"Odds are the probability of an event occurring divided by the probability of the event not occurring. An odds ratio is the odds of the event in one group ... divided by the odds in another group." [Making sense of odds and odds ratios]
"The [linear regression] model is specified in terms of K−1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)... The choice of denominator is arbitrary in that the estimates are equivariant under this choice." [Elements of Statistical Learning]
K here is the number of groups. So, for the case where we're looking at true/false outputs, K=2 and the equations in ESL become the simple equation in the following section.
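Written out in ESL's notation (with class K as the arbitrary denominator):
log( Pr(G=k | X=x) / Pr(G=K | X=x) ) = βk0 + βk·x,   for k = 1, ..., K−1
With K=2 there is only one such equation, and since Pr(G=2|X=x) = 1 − Pr(G=1|X=x) it reduces to the single log-odds form given in the next section.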
Model output
An example: "given estimated parameters and values for x1 and x2 , the model might predict y = 0.5, but the only meaningful values of y are 0 and 1. It is tempting to interpret a result like that as a probability; for example, we might say that a respondent with particular values of x1 and x2 has a 50% chance. But it is also possible for this model to predict y = 1.1 or y = −0.1, and those are not valid probabilities. Logistic regression avoids this problem by expressing predictions in terms of odds rather than probabilities." [Think Stats, 2nd Edition, Downey]
In a two-state output, Downey's equation would be:
y = log(odds) = log(p / (1-p)) = β0 + β1 x1 + ...
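Rearranging for p shows why this works: whatever unbounded value y takes,
p = odds / (1 + odds) = e^y / (1 + e^y) = 1 / (1 + e^(−y))
always lands in (0, 1). This is also the rearrangement used for the intercept examples further down.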
Odds (and odds ratios) are used when reporting effect size. "[W]hen X, the predictor is continuous, the odds ratio is constant across values of X. But probabilities aren't" because of the non-linear relationship between odds and p shown above.
"It works exactly the same way as interest rates. I can tell you that an annual interest rate is 8%. So at the end of the year, you’ll earn $8 if you invested $100, or $40 if you invested $500. The rate stays constant, but the actual amount earned differs based on the amount invested. Odds ratios work the same." [TheAnalysisFactor] Since there is no bound on x1 etc, you could easily have a value greater than 1.0 or less than 0.0 which just doesn't make sense for probabilities.
Interpreting the coefficients
I have 36 features and the p-values of approximately 30 of them are less than 0.05. This seems suspiciously good, even if sometimes the corresponding coefficients are not particularly large. "If that difference between your coefficient and zero isn't interesting to you, don't feel obligated to interpret it substantively just because p < .05." [StackOverflow]
"How does one interpret a coefficient of 0.081 (Std. Error = 0.026)? ... Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).
"This summary includes Z scores for each of the coefficients in the model (coefficients divided by their standard errors)... A Z score greater than approximately 2 in absolute value is significant at the 5% level." [Elements of Statistical Learning]
Interpreting the intercept
If we take the example here [QuantifyingHealth], where we're looking at the relationship between tobacco consumption (x1) and heart disease (y) over a 10-year period, then the interpretation of the intercept depends on how x1 is coded and which population we care about.
For example, say that the intercept is -1.93. Plugging -1.93 into the equation above and rearranging tells us that the probability of heart disease for a non-smoker (x1=0) is 0.13.
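The rearrangement is just the inversion given earlier:
p = e^(−1.93) / (1 + e^(−1.93)) ≈ 0.145 / 1.145 ≈ 0.13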
If the smoking consumption is standardized, x1=0 "corresponds to the mean lifetime usage of tobacco in kg, and the interpretation becomes: For an average consumer of tobacco, the probability of having a heart disease in the next 10 years is 0.13" [ibid].
If we're only interested in smokers, not the general population, then x1≠0. "Let’s pick the maximum [623kg] as a reference and calculate the limit of how much smoking can affect the risk of heart disease... We get P = 0.22. The interpretation becomes: the maximum lifetime consumption of 623 kg of tobacco is associated with a 22% risk of having heart disease".