Sunday, June 15, 2025

Lessons from a migration

Don't use low/no-code. It really is a false economy. It appeals to managers during sales pitches because it looks so simple. But your team will spend all their time and all your money debugging in a most un-ergonomic manner. And as for testing...

Azure Data Factory is a no-code solution that's great for simple DAGs. However, it is not suitable for anything more enterprise-y. For instance, if you're copying large amounts of data from one database to another and there is a failure, there is no scope for cleaning up the data. The first you'll know of it (because nobody ever reads the logs) is when the users complain that the numbers coming out of their SQL queries are wrong.

ADF doesn't even allow nested ForEach loops! It stores pipelines as JSON, so watch it get confused when the pipeline itself contains JSON!

Don't give people raw SQL access. They can make profound changes, and you'll have no logs and little ability to correct them.

Everything needs an automated audit log. There are going to be a large number of questions like: "why did it do this?" It's never simply a matter of success or failure. There is huge nuance - e.g., type conversion may mean what was in the source system is not exactly the same as what lands in the destination system. Is this a pass or a fail?
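As a minimal sketch of what this might look like in Python (the logger name, table and values are all invented), record both sides of every copy so those questions can be answered after the fact:

import datetime
import json
import logging

logger = logging.getLogger("migration.audit")

def audit_copy(table, column, source_value, dest_value):
    # Record what was read and what was written. Success/failure alone
    # is not enough: a type conversion (e.g. DECIMAL -> FLOAT) can
    # change a value without the copy "failing".
    logger.info(json.dumps({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "table": table,
        "column": column,
        "source": repr(source_value),
        "destination": repr(dest_value),
        "exact_match": source_value == dest_value,
    }))

audit_copy("customers", "balance", 12.30, 12.299999999999999)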

Processes need orchestrating. One process reads a dataset while another deletes it or writes to it, causing nondeterministic errors. You get similar issues when you issue a cancel.
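A real orchestrator is the answer, but even crude mutual exclusion removes this class of error. Here's a sketch assuming the data lives in Postgres (the connection string and dataset ID are invented); an advisory lock stops a writer deleting a dataset mid-read:

import psycopg2

def with_dataset_lock(conn, dataset_id, fn):
    # Hold a Postgres advisory lock for the duration of fn so that a
    # reader and a writer of the same dataset cannot interleave.
    with conn.cursor() as cur:
        cur.execute("SELECT pg_advisory_lock(%s)", (dataset_id,))
    try:
        return fn()
    finally:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_advisory_unlock(%s)", (dataset_id,))

conn = psycopg2.connect("dbname=warehouse")  # invented connection string
with_dataset_lock(conn, 42, lambda: print("safe to read or rewrite"))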

Communication - docs are always changing. Be aware that most knowledge is tribal, not recorded.

Scheduling: everything was based on time, so two runs of the same daily job could overlap if the first took more than 24 hours. Data migrating from one DB to another within the same cloud zone and subscription ran at ~10MB/s, which meant some tables took hours. And it didn't scale. As the project progressed, more tables were added to the migration, and some jobs started taking days. So the weekly transfer was still going when people came into the office on Monday morning. Consequently, their queries were returning inconsistent results - the silent killer.
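If you're stuck with time-based schedules, at least make overlapping runs impossible. A minimal sketch (POSIX-only; the lock path is invented): take a non-blocking lock at startup and skip the run if the previous one is still going.

import fcntl
import sys

def acquire_run_guard(lock_path):
    # Returns a held lock file, or None if a previous run of this job
    # still holds the lock.
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file

guard = acquire_run_guard("/tmp/nightly_copy.lock")
if guard is None:
    sys.exit("previous run still in progress; skipping this one")

Triggering jobs off completion events rather than the clock is the real fix; the guard just stops the bleeding.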

The metadata must be constrained. If it lives in a database, it must have referential integrity. This is easily overlooked because it's not business data, but if you want to eliminate mysterious errors, it's essential to get this right.
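For instance, with SQLAlchemy (the table and column names here are hypothetical), the control tables that drive a migration deserve the same foreign keys you'd demand of business data:

from sqlalchemy import (Column, DateTime, ForeignKey, Integer, MetaData,
                        String, Table)

metadata = MetaData()

source_table = Table(
    "source_table", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String, nullable=False, unique=True),
)

pipeline_run = Table(
    "pipeline_run", metadata,
    Column("id", Integer, primary_key=True),
    # A run cannot reference a table that doesn't exist:
    Column("source_table_id", ForeignKey("source_table.id"), nullable=False),
    Column("started_at", DateTime, nullable=False),
)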

Like water finding its natural level, people gravitate to the path of least resistance. As a general rule, teams do not follow best practices, naming conventions or industry standards. This is not strictly true, but it's so common that you must assume every team's data and code is a Rube Goldberg machine.

Hold regular meetings with major stakeholders. These meetings can be brief (about 15 minutes is OK once the cadence is established) but they do need to be frequent (at least twice a week, preferably daily).

A quick look at Quarkus

Microservice framework

Quarkus depends a lot on SmallRye, "a set of implementations of the MicroProfile specifications. We said that Quarkus is also a MicroProfile implementation, so this begs for a bit of explanation. Each SmallRye project implements one of the MicroProfile specifications" [1].

The MicroProfile specification is a set of standards for building microservices - for instance, handling Kubernetes health checks.

Native code

GraalVM has two main features: 
  • it allows polyglot development
  • it allows Ahead-of-Time (AOT) compilation.
This latter feature is what Quarkus is interested in, although "Mandrel is a downstream (forked) distribution of the Oracle GraalVM CE with the main goal of providing a way to build a native executable specifically designed to support Quarkus... Mandrel focuses mainly on the native-image build tool. It doesn’t provide a full GraalVM toolset." [1]

Generate the binary with:

mvn package -Pnative

This delegates to GRAALVM_HOME/bin/native-image and gcc.

If we delegate to Graal, why use Quarkus at all? Well, Graal struggles with reflection. Quarkus provides shims so that popular frameworks don't use reflection and can therefore be compiled to native code.

Note that although it produces an executable binary, the artifact still needs a garbage collector. In the output, you'll see:

Garbage collector: Serial GC (max heap size: 80% of RAM)

You don't need Quarkus to build native executables, though. For instance, to convert the Java port of Llama.cpp, you can build it with just make native - although note that I had to use version 24 of GraalVM, as earlier versions didn't like the use of AVX (vector extensions).

Thread management

Much like Scala's Cats and ZIO, Quarkus manages which threads your code runs on. For instance, it has a notion of blocking threads. "The @Blocking annotation on both methods tells Quarkus that the method executes a blocking operation (persist or delete) on a database, so it needs to execute on worker thread, which allows blocking." [1]

[1] Quarkus in Action (Manning)

Making ML projects more robust

The software industry has made great strides in making the engineering process more robust. Despite being an adjacent discipline, data science is still in the dark ages.

You can unit test any code, so there is no excuse for not writing tests. However, a lot of data science tools are not ergonomic in this regard. Notebooks are great for EDA, less so for robust software practices.

So, although they are not a replacement for automated tests, here are some tips if you're stuck in the world of notebooks.

Observability

If your platform does not support it out of the box, add it manually. For instance, in one project, there was one table that five others joined to at some point downstream in the pipeline. That first table was generated several times with different parameters. This led to the potential problem that the six tables might be out of sync with each other. A solution was to add a run date to the first table and have the other five preserve this value when they are generated. The Python code that then uses this data asserts that the date is consistent across all tables.
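A minimal sketch of that assertion (the file and column names are hypothetical):

import pandas as pd

# Each table carries the run_date it inherited from the first table
# in the pipeline.
tables = {name: pd.read_parquet(f"{name}.parquet")
          for name in ["base", "labs", "visits", "meds", "diagnoses",
                       "outcomes"]}

run_dates = {name: set(df["run_date"].unique())
             for name, df in tables.items()}

# Fail fast if any table was built from a different run.
assert len(set().union(*run_dates.values())) == 1, \
    f"tables out of sync: {run_dates}"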

Plot graphs of important features

In a bioinformatics project, the age of a patient was calculated in Python as the time between the patient's date of birth and the start of the observation period. This was reasonable. But several weeks later, the client asked for all of a patient's medical details to be used in the study. Consequently, the observation period - calculated in SQL - effectively became the patient's first contact with the medical services.

For most people, this happens in the first year of life, so their age became zero. This wasn't universally true: immigrants would have their first medical intervention later in life. So somebody browsing the data might think it odd that there were so many infants in their sample, but assume it was just a statistical fluctuation as there were clearly 30- and 40-year-olds in there too.

However, when you plotted the age after the SQL code change, the graph of the population's age looked like this:

A bug in calculating age

Without an end-to-end suite of tests, nobody thought to update the age calculation logic.
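The cheap defence is to plot the distribution of any feature you depend on. A minimal sketch (the file and column names are hypothetical):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_parquet("cohort.parquet")

# The spike at age 0 would have been impossible to miss here.
df["age_at_observation_start"].plot.hist(bins=50)
plt.xlabel("age (years)")
plt.ylabel("patients")
plt.title("Age at start of observation period")
plt.show()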

Sampling 

The moment you add constraints to sampled data, you must beware of some statistical oddities. 

We wanted to assign a cutoff date to the negative cohort, randomly sampled from the dates in the positive cohort. Sounds straightforward, right? Well, then you have to add the constraint that a cutoff date only makes sense if it comes after a patient's date of birth. But this very reasonable constraint skews the data. The reason is that somebody born in 2000 can be given any cutoff date between 2000 and 2025, giving them an age at cutoff of 0 to 25 years old. After sampling, we might give them a cutoff date of 2003 and consequently an age of 3.

A bug in sampling

However, somebody born in 2022 can have a maximum age of 3 after sampling. Consequently, we have a situation where the 25-year-old born in 2000 can contribute to the 3-year-old bucket after sampling, but the 3-year-old born in 2022 cannot contribute to the 25-year-old bucket. Hence, you get far more 3-year-olds than are really in the data.
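You can demonstrate the skew in a few lines. A toy simulation (all the numbers are invented; uniform distributions assumed):

import random
from collections import Counter

random.seed(0)

# Toy positive cohort: cutoff years spread uniformly over 2000-2025.
positive_cutoffs = [random.randint(2000, 2025) for _ in range(100_000)]

ages = []
for _ in range(100_000):
    birth_year = random.randint(2000, 2025)   # a negative-cohort patient
    cutoff = random.choice(positive_cutoffs)  # sample a cutoff date...
    if cutoff >= birth_year:                  # ...keep only valid pairs
        ages.append(cutoff - birth_year)

# Age 0 dominates: everyone can contribute to it, but only patients
# born in 2000 can ever reach age 25.
print(Counter(ages).most_common(5))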