Wednesday, June 30, 2021

Journeys in Data Engineering

I'm currently helping a huge health provider with its data issues. Here are some things I've learned:

Don't bikeshed. There's no point wondering whether you've captured the correct ethnic description ("White? Irish White? Eastern European White?") when there are bigger fish to fry. For instance, hospitals were putting the patients' personal information in the primary key field. This sent the data officer apoplectic until it was cleansed from the files before being sent downstream. But as a result, the downstream keys were not unique. The data scientists consuming this data aggregated it and came to conclusions unwittingly based on one individual having visited the hospital over three million times.
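
A cheap uniqueness check would have caught this long before the data scientists did. A minimal sketch (the dataset path and the "patient_key" column are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object DuplicateKeyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DuplicateKeyCheck").getOrCreate()

    val visits = spark.read.parquet("/data/staging/visits")

    // Group on the supposed primary key and keep only groups with more than one row.
    val duplicates = visits
      .groupBy("patient_key")
      .count()
      .filter("count > 1")

    // Fail loudly rather than let three million "visits" by one key slip downstream.
    val nDuplicates = duplicates.count()
    assert(nDuplicates == 0, s"$nDuplicates keys are not unique")

    spark.stop()
  }
}
```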

Don't proactively go looking for data quality issues. They'll come looking for you. Continue to build your models but be extremely circumspect. This is a more efficient process than spending time wondering if the data looks good.

Just because the data looks good, it doesn't mean it's usable. How often is the data updated? Is it immutable or is it frequently adjusted? In short, there's an orthogonal axis in data space, separate from data quality, and it's a function of time. Perfect data that becomes old is (probably) no longer perfect. Perfect data that becomes incomplete is (probably) no longer perfect.
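
This suggests checks along the time axis too. A sketch of a staleness check (the path, the "updated_at" column and the 24-hour threshold are all assumptions):

```scala
import java.time.{Duration, Instant}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object FreshnessCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FreshnessCheck").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/data/staging/visits")

    // The most recent update timestamp anywhere in the data.
    val latest = df.agg(max($"updated_at")).as[java.sql.Timestamp].head()

    // Perfect data that becomes old is (probably) no longer perfect.
    val age = Duration.between(latest.toInstant, Instant.now())
    assert(age.toHours < 24, s"Data is ${age.toHours} hours old")

    spark.stop()
  }
}
```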

Give a thought to integration tests. Platforms like Palantir are pretty under-developed in this area (answer to my requests for an integration test platform: "we're working on it"). So, you may need to write bespoke code that just kicks the tires of your data every time you do a refresh. 
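
For what it's worth, my tire-kicking amounts to little more than cheap assertions like these (the paths, the column expectations and the 10% tolerance are made up for the sake of the example):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RefreshSmokeTest {
  def checks(current: DataFrame, previous: DataFrame): Unit = {
    // The refresh should not silently drop columns...
    val missing = previous.columns.toSet -- current.columns.toSet
    assert(missing.isEmpty, s"Columns vanished: $missing")

    // ...or lose a large fraction of the rows.
    val (now, before) = (current.count(), previous.count())
    assert(now >= before * 0.9, s"Row count fell from $before to $now")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RefreshSmokeTest").getOrCreate()
    checks(
      current  = spark.read.parquet("/data/refresh/latest"),
      previous = spark.read.parquet("/data/refresh/previous")
    )
    spark.stop()
  }
}
```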

Remember that documentation is the code. I've had a nice experience using ScalaTest with its Given/When/Thens. It ensured that when running integration tests, the code generated the documentation so the two would never fall out of sync. This is, of course, much harder (impossible?) when running in a walled garden like Palantir.
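
For the unfamiliar, the style looks something like this (the feature and checks are invented; the point is that the Given/When/Then strings appear in the test report, so the documentation is generated by the code that runs):

```scala
import org.scalatest.GivenWhenThen
import org.scalatest.featurespec.AnyFeatureSpec

class VisitDeduplicationSpec extends AnyFeatureSpec with GivenWhenThen {

  Feature("Deduplicating hospital visits") {
    Scenario("two rows share the same key") {
      Given("a dataset with a repeated patient key")
      val rows = Seq("key-1" -> "2021-06-01", "key-1" -> "2021-06-02")

      When("we deduplicate on the key")
      val deduped = rows.toMap // keeps one row per key

      Then("only one row per key remains")
      assert(deduped.size == 1)
    }
  }
}
```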

Stick with a strongly typed language. This can be hard since pretty much all data scientists use Python (and therefore PySpark) but there is nothing worse than waiting several minutes for your code to run only to find out that a typo has tripped you up. In a compiled language, such problems would be caught before the job ever ran. I'd go so far as to say that Python is simply not the correct tool for distributed computing.
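
To make that concrete, here's a contrived sketch using Spark's typed Dataset API (the case class, field names and path are invented):

```scala
import org.apache.spark.sql.SparkSession

object TypedExample {
  // An invented record type; the compiler now knows the shape of the data.
  final case class Visit(patientKey: String, admitted: java.sql.Date)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TypedExample").getOrCreate()
    import spark.implicits._

    val visits = spark.read.parquet("/data/staging/visits").as[Visit]

    // Misspell the field, e.g. visits.map(_.admited), and scalac rejects the
    // program immediately: "value admited is not a member of Visit".
    // The PySpark equivalent would fail minutes into the run.
    val keys = visits.map(_.patientKey)
    println(keys.count())

    spark.stop()
  }
}
```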

Python tools like Airflow are made much easier to run with Docker, since Python library version management is a pain in the arse. Far better for each application to have its own Docker container.
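
The idea is no more complicated than this (the base image tag and the requirements file are placeholders):

```dockerfile
# Pin the Airflow and Python versions inside the image so this application
# carries its own dependencies, isolated from every other application's.
FROM apache/airflow:2.1.0-python3.8

# Install this application's pinned libraries; other apps get their own image.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```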

Never forget that your schema is your data contract as much as an API is your coding contract. Change it, and people downstream may squeal.
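
One way to make the contract explicit is to pin the expected schema and fail the refresh loudly on drift. A sketch with invented field names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

object SchemaContract {
  // The contract, written down where the compiler and the pipeline can see it.
  val expected: StructType = StructType(Seq(
    StructField("patient_key", StringType, nullable = false),
    StructField("admitted",    DateType,   nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaContract").getOrCreate()
    val df = spark.read.parquet("/data/staging/visits")
    assert(df.schema == expected,
      s"Schema drift:\nexpected ${expected.treeString}\ngot      ${df.schema.treeString}")
    spark.stop()
  }
}
```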

Persisting intermediate data sets hugely helps debugging.
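
Something as simple as this helper (the path convention is invented) pays for itself the first time a pipeline misbehaves:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object Checkpoints {
  // Persist an intermediate stage somewhere inspectable, then carry on.
  def checkpoint(df: DataFrame, stage: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).parquet(s"/data/debug/$stage")
    df // return the same frame so the call chains inline
  }
}
// Usage: val cleaned = Checkpoints.checkpoint(dedupe(raw), "after-dedupe")
```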

Finally, don't use null to indicate something for which you know the meaning. If, for instance, a record is in a certain state, it's better to set the value to an explicit flag that says so. If you use null, somebody reading that data can't tell whether the record is in that state or whether the value is simply missing.
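
In a language with a decent type system the distinction is easy to model. A sketch with invented names:

```scala
object NullVsFlag {
  // An explicit status value separates "we know this record is pending"
  // from "we don't know at all"; a null column can't tell those apart.
  sealed trait ReviewStatus
  case object Pending  extends ReviewStatus
  case object Approved extends ReviewStatus

  final case class Record(id: String, status: Option[ReviewStatus])

  val known   = Record("r1", Some(Pending)) // known to be in the pending state
  val missing = Record("r2", None)          // status genuinely unknown
}
```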
