Wednesday, January 18, 2023

More ML pipeline debugging

The simpler model is sometimes better. 

We used Meta's Prophet to forecast the flow of patients between care providers. However, we had problems. 

First, the errors were pretty high when backtesting: the average error over all care providers was about 20%.

Second, the sum of the forecasts over all care providers in a region was sometimes wildly larger than we'd expect. Curiously, the forecasts for the individual providers in the region were reasonably good.

Third, even when the total number of patients flowing between care providers was plausible, it wasn't compatible with the model that forecast the number of patients at each care provider.

A simpler model

We tried a model that used just the historical average number of patients for a care provider. This average was taken over all data for the same day of the week and the same month of the year (but ignoring the Covid pandemic, when the figures were weird).

Example: for any given Monday in January, we simply looked at the average flow between two nodes for all Mondays in every January over all non-Covid years. 
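
Here's a minimal pandas sketch of that lookup. The data, the column names and the exact Covid window below are all made up for illustration:

import numpy as np
import pandas as pd

# Made-up daily patient counts for one care provider over several years.
dates = pd.date_range("2017-01-01", "2022-12-31", freq="D")
flows = pd.DataFrame({"date": dates,
                      "patients": np.random.poisson(100, len(dates))})

# Drop the Covid period when the figures were weird (the window here is assumed).
covid = (flows["date"] >= "2020-03-01") & (flows["date"] <= "2021-06-30")
pre_covid = flows[~covid].copy()

# The forecast for, say, a Monday in January is the mean over all
# non-Covid Mondays in every January.
pre_covid["dow"] = pre_covid["date"].dt.dayofweek
pre_covid["month"] = pre_covid["date"].dt.month
historical_avg = pre_covid.groupby(["dow", "month"])["patients"].mean()

historical_avg.loc[(0, 1)]   # Monday (0) in January (1)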

This approach yielded a lower error than Prophet - about 16%. See below for what metric we used to quantify the error. 

Odd figures

When we rendered the Prophet forecasts in the front-end application, they didn't seem too bad most of the time. For instance, we predicted that 3027 people would go through the emergency departments of all the hospitals in region X tomorrow when today the figure was 2795. OK, not too shocking.

But if we looked at flows with small numbers, the figures looked crazy. For instance, we predicted the number of people being discharged from all hospitals in a region into mental health units would be 15 tomorrow when this week it had actually averaged about 1.

One of the issues was Prophet itself. Prophet time series for small numbers may well have predictions below zero. Obviously, the users did not like us predicting a negative number of patients. But if we simply mapped all negative numbers to zero, we might get surprises.

Let's take some random numbers:

>>> import numpy as np 
>>> xs = np.random.normal(1, 2, 10)
>>> sum(xs)
9.788255911110234

but:

>>> gt0 = lambda x: 0 if x<0 else x
>>> sum(list(map(gt0, xs)))
12.020263767841016

The forecasts for individual care providers were considered good even though Prophet would regularly predict, say, 1 person being sent from an emergency department to mental health facilities when the figure was often actually zero.

But this 100% error was operationally fine for the clinicians because it was such a small number. It only became an issue when summed over all providers in the region. 
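
Here's a toy illustration of how individually harmless errors add up across a region (all the numbers are invented). Suppose fifty providers each really send 0 or 1 patient today and the model predicts roughly 1 for each of them. No single forecast is off by more than one patient, yet the regional total is nearly double the truth:

>>> import numpy as np
>>> actual = np.array([0, 1] * 25)    # each provider really sends 0 or 1 patient today
>>> predicted = np.ones(50)           # the model predicts roughly 1 for every provider
>>> np.abs(predicted - actual).max()  # no provider is off by more than one patient
1.0
>>> actual.sum(), predicted.sum()     # but the regional total is nearly double
(25, 50.0)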

Absent data

We thought using the simpler historical averages would solve this problem, but it didn't. So, we took a closer look at the data. An emergency department would only discharge people into a mental health unit roughly once a week. In the absence of such a discharge, there would be no record. Why should there be? The hospitals only record events that happen, not events that don't.

For both Prophet and the historical averages, we weren't feeding this information into our models. But the absence of data is itself information! For instance, the average of a series is different if you choose to ignore the zeros:

>>> np.mean([5,0,0,0,0,3,0,0])
1.0
>>> np.mean([5,3])
4.0

Not a revelation when you think about it, but backfilling this information via Spark was a bit gnarly. This StackOverflow answer helped. Once we did this, all the figures agreed.
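
This isn't the actual StackOverflow recipe, but a minimal PySpark sketch of the general idea, with made-up column names and values: cross join every date with every provider, left join the recorded discharges onto that scaffold, and turn the gaps into explicit zeros.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only discharges that actually happened get recorded.
flows = spark.createDataFrame(
    [("2022-11-07", "ED_A", 5),
     ("2022-11-07", "MH_B", 1),
     ("2022-11-14", "ED_A", 3)],
    ["event_date", "provider", "patients"])

# In reality you'd cross join a full calendar, not just the dates that appear in the data.
scaffold = flows.select("event_date").distinct() \
                .crossJoin(flows.select("provider").distinct())

# The left join leaves nulls where nothing was recorded; fillna makes them explicit zeros.
backfilled = scaffold.join(flows, ["event_date", "provider"], "left") \
                     .fillna(0, subset=["patients"])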

Vertices and Edges

Using historical data, the average flows between nodes necessarily summed to the values at the nodes. This was not true for independent Prophet models.

Imagine we have three care providers, a, b and c. The numbers of patients that flow through them daily are A, B and C respectively. Patients flow from a to c and from b to c. Let's say the flow from a to c is X and the flow from b to c is Y.

The expected number of patients flowing through c is therefore exactly:

E[C] = E[X+Y] = E[X] + E[Y]  

Compare this to Prophet (or any other approach that forecasts C, X and Y independently, each from just its own history). In general:

E[C] != E[X] + E[Y]

This is important at the front end, where users were complaining that the predicted flows on their screen didn't add up.
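
Here's a toy demonstration of the difference. Historical averages are additive because the mean is linear; an independently fitted per-series model need not be (the median below is just a stand-in for any non-linear model such as Prophet):

>>> import numpy as np
>>> x = np.array([0, 0, 5, 0, 0, 3, 0, 0])    # daily flow a -> c
>>> y = np.array([1, 0, 0, 2, 0, 0, 0, 4])    # daily flow b -> c
>>> c = x + y                                 # everything arriving at c
>>> np.mean(x) + np.mean(y) == np.mean(c)     # historical averages add up exactly
True
>>> np.median(x) + np.median(y), np.median(c) # an independent per-series model needn't
(0.0, 1.5)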

Which error metric to use?

I was using root mean squared error (RMSE) to calculate errors whereas a colleague was using mean absolute error (MAE). Which is most appropriate must be decided on a case-by-case basis. Because it squares the errors, RMSE punishes a data set with a large number of outliers; whereas "If being off by 10 is just twice as bad as being off by 5, then MAE is more appropriate". Note that "MAE will never be higher than RMSE because of the way they are calculated" - see this SO answer.
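
For reference, here's what the two metrics look like on some made-up errors containing a single outlier; the MAE is indeed lower than the RMSE:

>>> import numpy as np
>>> actual    = np.array([10, 12,  9, 11, 50])   # one outlier
>>> predicted = np.array([11, 11, 10, 10, 20])
>>> errors = predicted - actual
>>> round(np.mean(np.abs(errors)), 2)            # MAE
6.8
>>> round(np.sqrt(np.mean(errors ** 2)), 2)      # RMSE - dragged up by the one big miss
13.45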

Conclusion

Start with the simplest model you can. It's generally easier to implement, cognitively less demanding, and the chances are that it will give reasonable results anyway. If the results are not good enough, only then investigate shiny new toys.