Friday, June 25, 2021

Analysing a Regression Analysis Model

I'm playing around with hospital waiting lists trying to find out what factors affect waiting times. 

Taking the Pearson correlation between my features (the ethnic make-up of each hospital waiting list and the time spent waiting), the data looks like this:

Pearson Correlation Heatmap: Ethnicity and waiting time

What if I normalise the rows?
Pearson Correlation Heatmap: normalized rows

Well, that was silly. Of course there will be a negative correlation between ethnicities: once each row is normalised, the shares must sum to 1.0, so a larger share for one group forces the others down.
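A minimal sketch of why this happens, using made-up data rather than the hospital data: independent head counts for five hypothetical groups are roughly uncorrelated, but once each row is divided by its total the shares become negatively correlated. The group count, sample size and seed are assumptions chosen purely for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)

# Hypothetical data: 2000 waiting lists, 5 groups, independent head counts.
counts = [[rng.randint(1, 100) for _ in range(5)] for _ in range(2000)]

# Normalise each row so that its shares sum to 1.0.
shares = [[c / sum(row) for c in row] for row in counts]

raw_corr = pearson([r[0] for r in counts], [r[1] for r in counts])
share_corr = pearson([r[0] for r in shares], [r[1] for r in shares])
print(f"raw counts: {raw_corr:+.3f}, normalised shares: {share_corr:+.3f}")
```

With k identically distributed groups, the correlation between any two shares should sit near -1/(k-1), so the negative values in the heatmap say nothing about the patients themselves.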

Anyway, it was at this point I found the data was dirty beyond salvage due to upstream cleaning processes gone wrong. Having got a new data set, I tried again (this time ignoring the Pearson correlation between ethnicities):

Pearson correlation: ethnicities vs waiting list time

Note that this time, the data was standardized. It looks like waiting time is positively correlated with the proportion of white people on the list.

Inferential Linear Regression 

Putting Pearson correlation to one side, let's see a linear regression model trained on this data. Note that Spark's linear regression algorithm supports ElasticNet, which allows a mix of L1 (lasso) and L2 (ridge) regularization.
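As a rough sketch of what that mix means, here is the penalty term as I understand it from Spark's documentation, written in plain Python rather than Spark. The function and argument names are my own; they are meant to mirror Spark's regParam (overall strength) and elasticNetParam (0.0 for pure L2, 1.0 for pure L1), and the weights are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty that elastic net adds to the squared-error loss.

    reg_param mirrors Spark's regParam; elastic_net_param mirrors
    elasticNetParam (0.0 = ridge only, 1.0 = lasso only).
    """
    l1 = sum(abs(w) for w in weights)   # lasso part: sum of |w|
    l2 = sum(w * w for w in weights)    # ridge part: sum of w^2
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) * 0.5 * l2
    )


weights = [3.0, -4.0]  # made-up coefficients, not from the model in this post
ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
print(ridge, lasso, mixed)
```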

Using dummy encoding, the results look like this:

coefficient            category    p-value
2.9759572500367764     White       0.011753367452440378
0.607824511239789      Black       0.6505335882518613
1.521480345096513      Other       0.2828384545133975
1.2063242152220122     Mixed       0.4440061099633672
14.909193776961727     intercept   0.0

Hmm, those p-values look far from conclusive. OK, let's make the most populous group the zero-vector, that is, the baseline category that every other group is compared against:

coefficient            category    p-value
-3.133385162559836     Mixed       0.01070291868710882
-1.7512494284036988    Black       0.10414898264964889
-1.4364038661386487    Other       0.08250720507990783
-2.3504004542073984    Asian       0.0006682319884454557
17.88573117661592      intercept   0.0

Now this is more interesting. The p-values are still a little too high to draw conclusions about two groups (Black and Other), but it's starting to look like the waiting time is lower, relative to the baseline group, if you are of Asian or Mixed ethnicity.
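Why should changing the baseline change the p-values at all? With dummy encoding, each coefficient is a difference from the baseline group, so each p-value tests a different comparison once the baseline moves. Note that in the two tables above 14.91 + 2.98 ≈ 17.89: the first intercept plus the White coefficient lands on the second intercept. Here is a minimal sketch with made-up waiting times (not my data). It relies on the fact that, for a single categorical predictor, the least-squares fit reduces to group means; the helper name dummy_fit is my own.

```python
from statistics import mean

# Made-up waiting times (weeks) per group; not the data from this post.
waits = {
    "White": [18.0, 19.0, 17.0, 18.0],
    "Asian": [15.0, 16.0, 14.0],
    "Black": [16.0, 17.0],
}


def dummy_fit(waits_by_group, baseline):
    """Least-squares fit of wait ~ dummy-encoded group for a given baseline.

    With one categorical predictor the solution is just the group means:
    the intercept is the baseline mean, and each coefficient is that
    group's mean minus the baseline mean.
    """
    intercept = mean(waits_by_group[baseline])
    coefficients = {
        group: mean(values) - intercept
        for group, values in waits_by_group.items()
        if group != baseline
    }
    return intercept, coefficients


for baseline in ("Asian", "White"):
    intercept, coefficients = dummy_fit(waits, baseline)
    print(f"baseline={baseline}: intercept={intercept}, {coefficients}")
```

The fitted group means are identical either way; only the reference point, and therefore the hypothesis each p-value is testing, has changed.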

Adding other columns makes the ethnicity coefficients look even more certain, although the p-values for these new categories (including socioeconomic status and age) were themselves not particularly conclusive.

coefficient            category    p-value
-4.309164982942572     Mixed       0.0004849926653480718
-2.3795206765105696    Black       0.027868866311572482
-2.2066035364090726    Other       0.008510096871574113
-2.9196528536463315    Asian       2.83433623644580e-05
20.342211763968347     intercept   0.0

Now those p-values look pretty good and they're in agreement with my Pearson correlation. 

"A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! [Alternatively,] omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model." [Regression Analysis, Jim Frost]
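To convince myself of the omitted-variable point, here is a small simulation on synthetic data (nothing to do with the hospital data). All the numbers are assumptions: y really depends on x1 with slope 2 and on x2 with slope 3, and x2 is correlated with x1. Leaving x2 out pushes the x1 slope towards 2 + 3 × 0.8 = 4.4, while including it recovers something close to 2.

```python
import random

rng = random.Random(0)
n = 5000

# Synthetic data: x2 is correlated with x1, and both drive y.
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0.0, 0.6) for a in x1]
y = [2.0 * a + 3.0 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]


def centred(values):
    """Subtract the mean so the intercept drops out of the equations."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


c1, c2, cy = centred(x1), centred(x2), centred(y)

# Model A omits x2: simple least-squares slope of y on x1.
slope_omitting_x2 = dot(c1, cy) / dot(c1, c1)

# Model B includes x2: solve the 2x2 normal equations with Cramer's rule.
s11, s12, s22 = dot(c1, c1), dot(c1, c2), dot(c2, c2)
s1y, s2y = dot(c1, cy), dot(c2, cy)
det = s11 * s22 - s12 * s12
slope_x1 = (s1y * s22 - s2y * s12) / det
slope_x2 = (s2y * s11 - s1y * s12) / det

print(f"x1 slope, x2 omitted:  {slope_omitting_x2:.2f}")
print(f"x1 slope, x2 included: {slope_x1:.2f} (x2 slope {slope_x2:.2f})")
```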

Work continues.
