The software industry has made great strides to make the engineering process more robust. Despite being an adjacent disciple, data science is still in the dark ages.
You can unit test any code so there is no excuse for not writing them. However, a lot of data science tools are not ergonomic in this regard. Notebooks are great for EDA, less so for robust software practises.
So, although they are not a replacement for automated tests, here are some tips if you're stuck in the world of notebooks.
Observability
If your platform does not support it out of the box, add it manually. For instance, in one project, there was one table that another 5 joined to at some point downstream in the pipeline. That first table was run several times with different parameters. This lead to the potential problem that the 6 tables in total may be out of synch with each other. A solution was to add a runtime date in the first table and all the other 5 preserve this value when they are generated. The Python code that then uses this data asserts that the date is consistent across all tables.
Plot graphs of important features
In a bioinformatics project, the age of a patient was calculated in Python as the time between a patient's date of birth and the start of the observation period. This was reasonable. But several weeks later, the client asked for all of a patients medical details to be used in the study. Consequently, the observation period - calculated in SQL - effectively became a patient's first contact with the medical services.
For most people, this happens in the first year of life. For them, their age became zero. This wasn't universally true. Imigrants would have their first medical intervention later in life. So, somebody browsing the data might think it odd there was so many infants in their sample but assume that it was just a statistical fluctuation as there were clearly 30 and 40 year olds in there too.
However, when you plotted the age after the SQL code change, the graph of the population's age looked like this:
![]() |
A bug in calculating age |
Sampling
The moment you add constraints to sampled data, you must beware of some statistical oddities.
We wanted to assign a cutoff date to the negative cohort randomly sampled from the date from the positive. Sounds straightforward, right? Well, then you have to add a constraint that the cutoff date only made sense if it came after a patient's date of birth. But this very reasonable constraint skews the data. The reason being that somebody born in 2000 can be given any cutoff date 2000 and 2025 giving them an age at cutoff of 0 to 25 years old. After sampling, we might give them a cutoff date of 2003 and consequently an age of 3.
![]() |
A bug in sampling |
However, somebody born in 2022 can have a maximum age of 3 after sampling. Consequently, we have a situation where the 25-year old born in 2000 can contribute to the 3-year old bucket after sampling, but the 3-year old born in 2022 cannot contribute to the 25-year old bucket. Hence, you get far more 3-year olds than are really in the data.
No comments:
Post a Comment