Wednesday, June 30, 2021

Journeys in Data Engineering

I'm currently helping a huge health provider with its data issues. Here are some things I've learned:

Don't bikeshed. There's no point wondering whether you've captured the correct ethnic description ("White? Irish White? Eastern European White?") when there are bigger fish to fry. For instance, hospitals were putting patients' personal information in the primary-key field. This sent the data officer apoplectic, and the field was cleansed before the data was sent downstream. But as a result, the downstream keys were no longer unique. The data scientists consuming this data aggregated it and unwittingly came to conclusions based on one individual having apparently visited the hospital over three million times.
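A cheap check would have caught that before anybody aggregated anything. A sketch, with hypothetical field names, of what such a check might look like:

```python
from collections import Counter

def duplicate_keys(rows, key_field):
    """Return the keys that appear more than once; a non-empty result
    means downstream joins and aggregations will silently inflate."""
    counts = Counter(row[key_field] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical records whose key was blanked by an upstream
# cleansing step, leaving non-unique keys behind.
records = [
    {"patient_key": "REDACTED", "visits": 1},
    {"patient_key": "REDACTED", "visits": 1},
    {"patient_key": "P123",     "visits": 2},
]
print(duplicate_keys(records, "patient_key"))  # {'REDACTED': 2}
```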

Don't proactively go looking for data quality issues. They'll come looking for you. Continue to build your models but be extremely circumspect. This is a more efficient process than spending time wondering if the data looks good.

Just because the data looks good, it doesn't mean it's usable. How often is the data updated? Is it immutable, or is it frequently adjusted? In short, there's an axis in data space orthogonal to data quality, and it's a function of time. Perfect data that becomes old is (probably) no longer perfect. Perfect data that becomes incomplete is (probably) no longer perfect.

Give a thought to integration tests. Platforms like Palantir are pretty under-developed in this area (the answer to my requests for an integration-test platform: "we're working on it"). So, you may need to write bespoke code that just kicks the tires of your data every time you do a refresh.
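For what it's worth, the tire-kicking I have in mind can be as simple as this (a sketch in plain Python, with hypothetical field names):

```python
def kick_the_tires(rows, key_field, required_fields):
    """Minimal post-refresh checks: the dataset is non-empty, no
    required field is missing, and keys are unique. Raises
    AssertionError on the first failure."""
    assert rows, "dataset is empty"
    for row in rows:
        for f in required_fields:
            assert row.get(f) is not None, f"missing {f!r} in {row}"
    keys = [row[key_field] for row in rows]
    assert len(keys) == len(set(keys)), "duplicate keys found"

rows = [{"id": 1, "ward": "A"}, {"id": 2, "ward": "B"}]
kick_the_tires(rows, "id", ["id", "ward"])  # passes silently
```

Nothing clever, but run automatically after every refresh it catches the embarrassing failures before your consumers do.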

Remember that documentation is the code. I've had a good experience using ScalaTest with its Given/When/Then steps. It ensured that when the integration tests ran, the code generated the documentation, so the two could never fall out of sync. This is, of course, much harder (impossible?) inside a walled garden like Palantir.

Stick with a strongly typed language. This can be hard since pretty much all data scientists use Python (and therefore PySpark), but there is nothing worse than waiting several minutes for your code to run only to find out that a typo has tripped you up. In a compiled language, such problems are caught before the job is even submitted. I'd go so far as to say that Python is simply not the correct tool for distributed computing.

Python tools like Airflow are made much easier to run by using Docker, since Python library version management is a pain in the arse. Far better for each application to have its own Docker container.

Never forget that your schema is your data contract, just as an API is your coding contract. Change it, and people downstream may squeal.
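One way to make that contract explicit is to check every refresh against an agreed set of column names and types. A minimal sketch (the column names here are hypothetical):

```python
# The agreed contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"patient_key": str, "wait_weeks": int, "ethnicity": str}

def schema_violations(rows, expected=EXPECTED_SCHEMA):
    """Flag, per row, any missing columns, unexpected columns,
    and values of the wrong type."""
    problems = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        bad_types = {c for c in expected.keys() & row.keys()
                     if not isinstance(row[c], expected[c])}
        if missing or extra or bad_types:
            problems.append((i, missing, extra, bad_types))
    return problems

ok = [{"patient_key": "P1", "wait_weeks": 12, "ethnicity": "Asian"}]
schema_violations(ok)  # []
```

In a real pipeline you'd express the same idea with your platform's schema objects (e.g. Spark's StructType), but the principle is the same: the contract is written down and enforced, not implied.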

Persisting intermediate data sets hugely helps debugging.

Finally, don't use null to indicate something whose meaning you actually know. If, for instance, a record is in a certain state, it's better to set a value to a flag that explicitly indicates that state. If you use null, somebody reading the data can't tell whether the record is in this state or the data is simply missing.
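A sketch of the distinction, using a hypothetical triage state:

```python
from enum import Enum

class TriageState(Enum):
    AWAITING = "awaiting"   # a known state: triage hasn't happened yet
    DONE = "done"

def describe(state):
    """None means 'we don't know'; an explicit flag means 'we know'."""
    if state is None:
        return "missing data"
    return f"record is in state {state.value!r}"

describe(TriageState.AWAITING)  # "record is in state 'awaiting'"
describe(None)                  # "missing data"
```

With an explicit flag, "we know triage hasn't happened" and "the field never arrived" are two different facts, not one ambiguous null.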

Friday, June 25, 2021

Analysing a Regression Analysis Model

I'm playing around with hospital waiting lists trying to find out what factors affect waiting times. 

Using the Pearson correlation of my features (which are the ethnic make-up of a hospital waiting list and the time spent waiting), the data looks like this:

Pearson Correlation Heatmap: Ethnicity and waiting time

What if I normalise the rows?
Pearson Correlation Heatmap: normalized rows

Well that was silly. Of course there will be a negative correlation between ethnicities as the total needs to sum to 1.0.

Anyway, it was at this point I found the data was dirty beyond salvage due to upstream cleaning processes gone wrong. Having got a new data set, I tried again (this time ignoring the Pearson correlation between ethnicities):

Pearson correlation: ethnicities vs waiting list time

Note that this time, the data was standardized. It looks like waiting list time goes up for white people.

Inferential Linear Regression 

Putting Pearson correlation to one side, let's see a linear regression model trained on this data. Note that in Spark's linear regression algorithm, one is able to use ElasticNet, which allows a mix between L1 (lasso) and L2 (ridge regression) regularization.
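As a sketch of what that mixing parameter does, here is the ElasticNet penalty written out in NumPy, following my reading of Spark's documentation (regParam is the overall strength, elasticNetParam the L1/L2 mix):

```python
import numpy as np

def elastic_net_penalty(w, reg_param, alpha):
    """The penalty Spark mixes via elasticNetParam (here `alpha`):
    alpha=1.0 is pure L1 (lasso), alpha=0.0 is pure L2 (ridge)."""
    l1 = np.abs(w).sum()
    l2 = 0.5 * (w ** 2).sum()
    return reg_param * (alpha * l1 + (1 - alpha) * l2)

w = np.array([1.0, -2.0])
elastic_net_penalty(w, reg_param=0.1, alpha=1.0)  # ~0.3  (pure lasso)
elastic_net_penalty(w, reg_param=0.1, alpha=0.0)  # ~0.25 (pure ridge)
```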

Using dummy encoding, the results look like this:

coefficient          category   p-value
 2.9759572500367764  White      0.011753367452440378
 0.607824511239789   Black      0.6505335882518613
 1.521480345096513   Other      0.2828384545133975
 1.2063242152220122  Mixed      0.4440061099633672
14.909193776961727   intercept  0.0
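For clarity, the dummy encoding here can be sketched in plain Python: one column per category except a reference group, which is encoded as the all-zeros vector (in the run above, Asian is the category absent from the table, so it is the reference):

```python
def dummy_encode(values, categories, reference):
    """One 0/1 column per category except the reference, which is
    encoded as all zeros; its effect folds into the intercept."""
    cols = [c for c in categories if c != reference]
    return [[1 if v == c else 0 for c in cols] for v in values]

cats = ["White", "Black", "Other", "Mixed", "Asian"]
dummy_encode(["White", "Asian"], cats, reference="Asian")
# "White" -> [1, 0, 0, 0]; "Asian" (the reference) -> [0, 0, 0, 0]
```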

Hmm, those p-values look far from conclusive. OK, let's make the most populous group (here, White) the reference category, encoded as the zero vector:

coefficient          category   p-value
-3.133385162559836   Mixed      0.01070291868710882
-1.7512494284036988  Black      0.10414898264964889
-1.4364038661386487  Other      0.08250720507990783
-2.3504004542073984  Asian      0.0006682319884454557
17.88573117661592    intercept  0.0

Now this is more interesting. The p-values are still a little too high to draw conclusions about two groups but it's starting to look like the waiting list size is lower if you are Asian or Mixed ethnicity.

Adding other columns, including socioeconomic status and age, makes the ethnicity coefficients look even more certain, although the p-values for these new categories were themselves not particularly conclusive.

coefficient          category   p-value
-4.309164982942572   Mixed      0.0004849926653480718
-2.3795206765105696  Black      0.027868866311572482
-2.2066035364090726  Other      0.008510096871574113
-2.9196528536463315  Asian      2.83433623644580e-05
20.342211763968347   intercept  0.0

Now those p-values look pretty good and they're in agreement with my Pearson correlation. 

"A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! [Alternatively,] omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model."[Regression Analysis, Jim Frost]

Work continues.

Thursday, June 10, 2021

Notes on Regression Analysis


Regression Analysis 

"A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each unit change in an independent variable when you hold all of the other independent variables constant." [Multicollinearity in Regression Analysis]

This last point is key. Normalizing rows means that there will be anticorrelation between fields. This is because if one value increases, the others must necessarily decrease, as they all sum to 1.0.
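A quick NumPy simulation (with made-up Poisson counts, nothing to do with the real hospital data) shows the effect: columns that start out roughly independent become clearly anticorrelated once each row is normalised to sum to 1.0:

```python
import numpy as np

rng = np.random.default_rng(42)
# Three independent count columns, e.g. counts per group per hospital.
counts = rng.poisson(lam=[50, 30, 20], size=(500, 3)).astype(float)

# Raw counts: the columns are (roughly) uncorrelated.
raw_corr = np.corrcoef(counts, rowvar=False)

# Normalise each row to sum to 1.0: anticorrelation appears.
shares = counts / counts.sum(axis=1, keepdims=True)
norm_corr = np.corrcoef(shares, rowvar=False)

print(raw_corr[0, 1])   # near zero
print(norm_corr[0, 1])  # clearly negative
```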

Similarly, one-hot encoding by definition increases multicollinearity because if feature X has value 1, then I know that all the others have 0, "which can be problematic when your sample size is small" [1].

The linked article then describes how Variance Inflation Factors ("VIFs") can be used to identify multicollinearity.

"If you just want to make predictions, the model with severe multicollinearity is just as good!" [MiRA]

To remove or not?

The case for adding columns: Frost [1] has a good example of how excluding correlated features can give the wrong result. He describes a study testing the health impact of coffee which at first showed coffee was bad for you. However, the study ignored smoking. Since smokers are statistically more likely to be coffee drinkers, you can't exclude smoking from the model.

The case for removing columns: "P-values less than the significance level indicate that the term is statistically significant. When a variable is not significant, consider removing it from the model." [1]

Inferential Statistics

But what if we don't want to make a model? "Regression analysis is a form of inferential statistics [in which] p-values and coefficients are the key regression output." [1]

Note that the p-value indicates whether we should reject the null hypothesis. It is not an estimate of how accurate our coefficient is. Even if the coefficient is large, if the p-value is also large, "the observed difference ... might represent random error. If we were to collect another random sample and perform the analysis again, this [coefficient] might vanish." [1]

Which brings us back to multicollinearity as it "reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model... [it] affects the coefficients and p-values." 

["Statistical power in a hypothesis test is the probability that the test can detect an effect that truly exists." - Jim Frost]

[1] Regression Analysis, Jim Frost