Thursday, January 20, 2022

Impure Evil

"Debugging is like being the detective in a crime movie where you are also the murderer."

Here's a case study of the dangers of using functions that are not pure in a lazily evaluated environment. While it's obvious that the rand function is not pure, it's less obvious for others.

The requirements

We have a table of national accident & emergency events that also contains patient data. We want just the patients from this data. Some patients appear more than once. The question is: how do we reduce multiple rows for the same patient into a single row?

This is not as straightforward as it sounds. The same patient may appear in events that are years apart. In that time, they may have lived in a different area code. Since the choice was not overly important, one solution was simply to pick any of them, so we used the first function. After all, why not? We just want an arbitrary value in the event we have more than one choice.
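A minimal sketch of that approach in Spark (Scala). The table, column and value names here (events, patient_id, area_code, hospital, has_z) are illustrative stand-ins, not the real schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("ae-patients").getOrCreate()
    import spark.implicits._

    // Hypothetical A&E event data: the same patient can appear in several events,
    // possibly with different area codes and hospitals.
    val events = Seq(
      ("p1", "AC-1", "Hospital A", true),
      ("p1", "AC-2", "Hospital B", true),
      ("p2", "AC-9", "Hospital A", false)
    ).toDF("patient_id", "area_code", "hospital", "has_z")

    // One row per patient, letting first() make an arbitrary pick whenever there is a choice.
    val patients = events
      .groupBy("patient_id")
      .agg(
        first("area_code").as("area_code"),
        first("hospital").as("hospital"),
        first("has_z").as("has_z")
      )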

Once we have this data, we want to express the ratio of patients with property Z to all patients, segmented by hospital attended. It is not relevant what Z actually is, but we will have to make two calculations: one for the set of those with Z (the numerator) and one for all patients irrespective of Z (the denominator).
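Continuing the sketch above (still with hypothetical column names), the two calculations might look something like this:

    // Denominator: every patient, counted per hospital.
    val denominators = patients
      .groupBy("hospital")
      .agg(count("*").as("all_patients"))

    // Numerator: only the patients with property Z, counted per hospital.
    val numerators = patients
      .filter($"has_z")
      .groupBy("hospital")
      .agg(count("*").as("z_patients"))

    // The ratio we actually report.
    val ratios = denominators
      .join(numerators, Seq("hospital"), "left")
      .withColumn("ratio", $"z_patients" / $"all_patients")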

The problem

When we ran this in Spark, we saw some values where the numerator was greater than the denominator.

The diagnosis

The problem arises when we use impure functions, especially with lazy evaluation.

  1. The first function will return a non-deterministic value for patients with multiple attendances. There is no guarantee which it will choose.
  2. Spark is lazy by default.
  3. So, upon calculating the denominators, first is invoked and it gives X for a given patient. Then, when calculating the numerators, first is invoked again (because of #2) but this time returns Y (because of #1) for the same patient. The result is that sometimes more people will fit the criteria for the numerator than the denominator for any given slice-and-dice of the data, as the plan sketch below makes concrete.
Note that this would also be true of most database implementations. For instance, SQL Server's FIRST_VALUE is also non-deterministic.
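To see the Spark case concretely: patients in the sketch above is only a lazy plan, so nothing is pinned down until an action runs. The join that builds ratios will typically contain its own copy of the groupBy/first() aggregation under each side (unless Spark happens to reuse an exchange), which is visible in the physical plan:

    // Each branch of the join re-reads the events and re-runs first(), so the same
    // patient can land in Hospital A for the denominator and Hospital B for the numerator.
    ratios.explain()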

The solution

Calling cache() will sometimes help but is not robust. If, for instance, an executor crashes, the data is regenerated by another instance and it may not necessarily be the same as what was generated before.
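As a sketch, that stop-gap is just:

    // Caching pins the deduplicated patients in memory, so later actions usually see
    // the same first() picks -- but a lost partition is recomputed from scratch,
    // and the recomputation is free to pick differently.
    val cachedPatients = patients.cache()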

A better solution is to adopt a deterministic algorithm when choosing from a collection of values. We used max, but anything else deterministic would do.
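A sketch of the fix, replacing first() with max() in the deduplication above (the column names are still the hypothetical ones):

    // max() always resolves to the same value for the same input rows,
    // so every evaluation of the plan agrees on each patient's attributes.
    val patientsDeterministic = events
      .groupBy("patient_id")
      .agg(
        max("area_code").as("area_code"),
        max("hospital").as("hospital"),
        max("has_z").as("has_z")
      )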

However, note that there is no guarantee of consistency between values. For instance, we have a patient's area code expressed both as a code recognised nationally and as a code recognised by the health providers. There should be a one-to-one mapping between them. But given more than one of each, applying a lexical max to both columns independently may not produce a valid pair.
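A small illustration of this caveat, with made-up code values:

    // national_code and provider_code should map one-to-one...
    val codes = Seq(
      ("p1", "NAT-A", "PROV-2"),  // NAT-A pairs with PROV-2
      ("p1", "NAT-B", "PROV-1")   // NAT-B pairs with PROV-1
    ).toDF("patient_id", "national_code", "provider_code")

    // ...but taking max of each column independently yields (NAT-B, PROV-2),
    // a combination that never appears in the source data.
    codes.groupBy("patient_id")
      .agg(max("national_code"), max("provider_code"))
      .show()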
