Monday, June 19, 2023

Data Contracts

What are they? "Data Contracts are first and foremost a cultural change toward data-centric collaboration" [Chad Sanderson's blog and here]

Examples of why they're necessary

A company wants to find the average age of the customers who bought a particular product. The answer comes out as 42. They are somewhat surprised by this, as it's older than they expected, so they check their workings: add up all the ages and divide by the number of customers who bought it. After confirming the maths, the value is indeed 42 and they report it to their boss. Unfortunately, the mean age was artificially inflated because a subset of customers had an age of 999, a value the system that captured the data used as a placeholder for 'unknown'.
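To make that failure mode concrete, here is a toy Python sketch (the ages are invented, not from any real data set) showing how a couple of '999 = unknown' placeholders quietly inflate a mean:

# Toy data: customer ages, with 999 used upstream as an 'unknown' placeholder.
ages = [34, 29, 41, 38, 999, 999, 45, 31]

naive_mean = sum(ages) / len(ages)          # placeholder values inflate the result
known = [a for a in ages if a != 999]       # strip the sentinel before aggregating
clean_mean = sum(known) / len(known)

print(f"naive mean: {naive_mean:.1f}")      # 277.0, obviously nonsense here
print(f"clean mean: {clean_mean:.1f}")      # 36.3, what was actually intended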

The next example actually happened. We were measuring average length of stay (LoS) in hospitals. When sampling the data, everything looked fine. But out of millions of patients, a very small number (~30) had a discharge date of 1/1/1900. Clearly, the system that captured the data used this value as a token for 'unknown'. Those rows erroneously reduced the overall LoS. The bug was only caught when we drilled down into individual hospitals and saw that some average LoS figures were negative. Until then, we were merrily reporting the wrong national figure.
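The same idea in miniature (the dates below are invented, not the real hospital data): a 1/1/1900 placeholder turns into a hugely negative stay that drags the average down, and the negative values are also the clue that exposes the bug.

from datetime import date

# Hypothetical admission/discharge pairs; the second row carries the
# 1/1/1900 'unknown' placeholder described above.
stays = [
    (date(2023, 3, 1), date(2023, 3, 8)),
    (date(2023, 3, 5), date(1900, 1, 1)),
]

lengths = [(discharged - admitted).days for admitted, discharged in stays]
print(lengths)                      # [7, -44988]: one hugely negative 'stay'
print(sum(lengths) / len(lengths))  # the average is dragged far below the true figure

# Rows violating 'discharge must not precede admission' point straight at the bug.
suspect = [(a, d) for a, d in stays if d < a]
print(suspect)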

Is it a purely cultural problem?

The trouble with cultural solutions is that they depend on unreliable units called "humans". For instance, a friend of mine was the victim of an upstream breaking change originating in the Zurich office. When he contacted the team responsible, they were unaware of him, of his London team, and of the fact that they were consumers of this data at all.

I asked on the Apache dev mailing lists if we could implement a more robust, technical solution for Spark. Watch this space for developments.

Possible solutions

Andrew Jones (who is writing a book on data contracts) uses JSON Schema to validate his data. "JSON Schema is a powerful tool for validating the structure of JSON data" [JSON Schema docs]. 
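As a small illustration (not an excerpt from the book), this is how the Python jsonschema package can be used to reject the 999-style placeholder from the first example; the field names here are invented:

from jsonschema import validate, ValidationError  # pip install jsonschema

# A minimal schema for a customer record; bounding 'age' means
# 999-style placeholders are rejected at the boundary.
customer_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 120},
    },
    "required": ["customer_id", "age"],
}

try:
    validate(instance={"customer_id": "c-42", "age": 999}, schema=customer_schema)
except ValidationError as e:
    print(f"rejected: {e.message}")   # "999 is greater than the maximum of 120"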

Elliot West of Dremio (see the mailing list traffic) also favours JSON Schema. However, because JSON has only a few data types (strings, numbers, booleans, arrays, objects and null), JSON Schema alone is not rich enough to enforce constraints like "date X must be before date Y".
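A minimal sketch of that limitation (again with invented field names): the schema below happily accepts a record whose dates are individually well-formed, so the cross-field ordering rule has to be checked separately in ordinary code.

from datetime import date
from jsonschema import validate

# The schema can describe each field in isolation...
stay_schema = {
    "type": "object",
    "properties": {
        "admitted": {"type": "string", "format": "date"},
        "discharged": {"type": "string", "format": "date"},
    },
    "required": ["admitted", "discharged"],
}

record = {"admitted": "2023-03-05", "discharged": "1900-01-01"}
validate(record, stay_schema)   # passes: both values are well-formed date strings

# ...but the rule relating the two fields has to live outside the schema.
if date.fromisoformat(record["discharged"]) < date.fromisoformat(record["admitted"]):
    print("contract violation: discharge date precedes admission date")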

Implementations

This is a new area of development, but Databricks' Delta Live Tables (DLT) claims it can “prevent bad data from flowing into tables through validation and integrity checks and avoid data quality errors with predefined error policies (fail, drop, alert or quarantine data).”
Unfortunately, it seems to be Python-only: “Can I use Scala or Java libraries in a Delta Live Tables pipeline? No, Delta Live Tables supports only SQL and Python. You cannot use JVM libraries in a pipeline. Installing JVM libraries will cause unpredictable behavior, and may break with future Delta Live Tables releases.” [docs]
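For what it's worth, DLT expresses these rules as "expectations" via Python decorators. The sketch below is only an illustration, not lifted from the docs: the table name, column names and expectation names are invented, and spark is the session object the DLT runtime provides.

import dlt

# Hypothetical pipeline table applying the two checks from the examples above.
@dlt.table(comment="Hospital stays with basic data-contract checks")
@dlt.expect_or_drop("plausible_age", "age BETWEEN 0 AND 120")               # drop 999-style placeholders
@dlt.expect_or_fail("discharge_after_admission", "discharged >= admitted")  # stop the pipeline on negative stays
def clean_stays():
    return spark.read.table("raw_stays")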
