Friday, June 24, 2022

Free Monads

Remember from a previous post that the Free monad can represent a tree where each node has either a branch or a label (implying that only leaves carry labels).

When we "convert method calls into references to an ADT [we get] the Free monad. When an ADT mirrors the arguments of related functions, it is called a Church encoding. Free is named because it can be generated for free for any S[_] . For example, we could set S to be the [algebras] and generate a data structure representation of our program... In FP, an algebra takes the place of an interface in Java... This is the layer where we define all side-effecting interactions of our system." [1]

My friend, Aaron Pritzlaff, has written a small project that (apart from demonstrating interoperability) does not use any Cats or ZIO magic. 

Here, we build case classes that are all instances of Operation[A] (where the A represents the type of the operation's result). These can conveniently be lifted into a Free[F, A] (where F is now our Operation) with .liftM.
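
To make this concrete, here is a minimal, hand-rolled sketch of the idea. The Save and Load operations (and the exact shape of the ADT) are illustrative assumptions of mine, not Aaron's actual API:

sealed trait Free[F[_], A] {
  def flatMap[B](f: A => Free[F, B]): Free[F, B] = FlatMap(this, f)
  def map[B](f: A => B): Free[F, B]              = flatMap(a => Pure(f(a)))
}
case class Pure[F[_], A](a: A)                                     extends Free[F, A] // a final value
case class Suspend[F[_], A](fa: F[A])                              extends Free[F, A] // a single, as-yet-unexecuted operation
case class FlatMap[F[_], A, B](fa: Free[F, A], f: A => Free[F, B]) extends Free[F, B] // a continuation chained onto another Free

// The algebra: one case class per operation, parameterised by the operation's result type.
sealed trait Operation[A]
case class Save(key: String, value: String) extends Operation[Unit]
case class Load(key: String)                extends Operation[Option[String]]

// Lifting an operation simply wraps it in a Suspend node.
def liftM[F[_], A](fa: F[A]): Free[F, A] = Suspend(fa)

// Building a program merely builds a data structure; nothing has run yet.
val program: Free[Operation, Option[String]] =
  for {
    _ <- liftM(Save("greeting", "hello"))
    v <- liftM(Load("greeting"))
  } yield v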

As it happens, this creates a subtype of Free[ called Suspend[ which is essentially a placeholder for the A: when we flatMap on it, the function we supply has type A => Free[F, A]. We can keep doing this until we have built up the tree of Free[Operation, A]s promised at the beginning of this post. A shallow path represents a for comprehension that terminates early. 

The important thing to remember is that we have built a data structure that is yet to be executed.

We would execute the tree with a call to ".foldMap [which] has a marketing buzzword name: MapReduce." [1] In Aaron's implementation, we just recursively call .foldMap on the Free tree we constructed. Either we terminate with a pure value or we hit a Suspend node that is mapped with a natural transformation (using the ~> function), one that maps not A to B but F[_] to G[_]. This is our interpreter and, for us, G must be a monad so we can call flatMap as we pop the stack.

As we work our way back through the stack of monads, calling flatMap as we go, we're invoking those A => Free[F, A] functions we spoke about earlier. But note that they're acting on the Gs that our interpreter instantiated.
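
Continuing the sketch above, a stripped-down interpreter might look like the following. Here G is just the identity monad backed by a mutable map, purely for illustration; in real code it would more likely be something like IO:

import scala.collection.mutable

type Id[A] = A
val store = mutable.Map.empty[String, String]

// The natural transformation: Operation[A] => Id[A], i.e. F[_] => G[_].
def interpret[A](op: Operation[A]): Id[A] = op match {
  case Save(k, v) => store.update(k, v)  // Operation[Unit]           becomes Id[Unit]
  case Load(k)    => store.get(k)        // Operation[Option[String]] becomes Id[Option[String]]
}

// A simplified foldMap: walk the tree, interpreting Suspend nodes and applying
// the stored continuations as we pop back up the stack.
def run[A](free: Free[Operation, A]): Id[A] = free match {
  case Pure(a)        => a
  case Suspend(fa)    => interpret(fa)
  case FlatMap(fa, f) => run(f(run(fa)))
}

val result = run(program)  // Some("hello")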

Although it's good to understand, the Free monad might not be your best choice according to the Cats people:

Rob Norris @tpolecat Sep 24 17:49 2020

Free is out of style but it's still great for some applications. Tagless style is much more common now for writing DSLs and APIs in general. Free is good if you really need to limit what the user can do (the algebra is exactly your data type's (i.e., F's) constructors plus Monad. No more, no less. Sometimes you want this.

The end result is easier to use than tagless because you're just dealing with data. You don't have to thread an effect parameter (typically) everywhere. So your business logic can be very concrete, which is easier for beginners I think.

Fabio Labella @SystemFw Sep 24 19:35 2020

the idea behind Free is great for implementing advanced monads like IO or Stream (which are implemented with a similar strategy even though they don't use Free literally)

Daniel Spiewak @djspiewak Sep 25 15:49 2020

I use Free for prototyping sometimes. It can be a useful tool to see how your effect algebra teases apart at a granular level without actually committing to an implementation, but it really only works for algebraic effects, and you very quickly bump up against some annoying type metaprogramming problems if you want to push it.

I think we've pretty much all learned that there are better ways of encoding effect composition than Free, and simultaneously the mocking it enables, while powerful, isn't that useful in practice... It's still a cool toy though.

[1] Functional Programming for Mortals with Cats


Monday, June 6, 2022

Packaging Python

Java programmers don't know the meaning of classpath hell until they've played with Python. Here are some notes I took while ploughing through the excellent Practical MLOps (Gift & Deza). Following their instructions, I was attempting to get an ML model served using Flask in a Docker container. Spoiler: it didn't work out of the box.

Since the correct OnnxRuntime wheel for my Python runtime did not exist, I had to build onnxruntime with --build-wheel while making the artifact.

This is where I encountered my first dependency horror:

CMake 3.18 or higher is required.  You are running version 3.10.2

when running onnxruntime/build.sh. (You can put a new version first in your PATH and avoid having to install it at the OS level).

This finally yielded onnxruntime-1.12.0-cp36-cp36m-linux_x86_64.whl, which could be installed into my environment with pip install WHEEL_FILE... except that the cp number must correspond to your Python version (3.6 in this case).

Moving virtual environments between machines is hard. You'd be best advised to use pip freeze to capture the environment. But ignoring this advice yields an interesting insight into the Python dependency system:

The first problem is that if you've created the environment with python -m venv then the scripts have your directory structure baked into them, as a simple grep will demonstrate. Copying the entire directory structure, up to and including the virtual environment, solved that.

But running the code gave me "No module named ..." errors. Looking at the sys.path didn't show my site-packages [SO] despite me having run activate. Odd. OK, so I defined PYTHONPATH and then I could see my site-packages in sys.path.

Then, you want to use exactly the same Python version. No apt-get Python for us! We have to manually install it [SO]. When doing this on a Docker container, I had to:

RUN apt-get update
RUN apt-get install -y wget
RUN apt-get install -y gcc
RUN apt-get install -y make
RUN apt-get install -y zlib1g-dev

Note that this [SO] helped me to create a Docker container that just pauses the moment it starts. This allows you to login and inspect it without it instantly dying on a misconfiguration.

The next problem: there are many compiled binaries in your virtual environment.

# find $PYTHONPATH/ -name \*.so | wc -l
185

Copying these between architectures is theoretically possible, but "as complexity of the code increases [so does] the likelihood of being linked against a library that is not installed" [SO].

Indeed, when I ran my Python code, I got a Segmentation Fault which can happen if "there's something wrong with your Python installation." [SO]

Python builds

A quick addendum on how Python builds projects: the standard way is no longer standard: "[A]s of the last few years all direct invocations of setup.py are effectively deprecated in favor of invocations via purpose-built and/or standards-based CLI tools like pip, build and tox" [Paul Ganssle's blog]

Wednesday, May 25, 2022

The CLI for busy Data Scientists and Engineers

I've been asked to give a talk on the command line interface for a mixed audience of Data Scientists and Engineers. Since the world is becoming ever more container based, I'm focussing on Linux.

Containers

Sometimes you need to diagnose things within the container. Jump into the container with:

docker exec -it CONTAINER_ID /bin/bash

To get the basic diagnostic tools mentioned below, you'll generally need to execute:

apt-get update               # you need to run this first
apt-get install net-tools    # gives you netstat
apt-get install iputils-ping # gives you ping
apt-get install procps       # gives you ps
apt-get install lsof        

Note that these installations will all be gone the next time you fire up the container, as the underlying image does not change.

You can find out which package your favourite command belongs to by running something like this:

$ dpkg -S /bin/netstat
net-tools: /bin/netstat

Formatting

You can use regular expressions in grep with the -P switch. For example, let's search for lines that are strictly composed of two 5-letter words separated by a space:

$ echo hello world  | grep -P "\w{5}\s\w{5}$"
hello world
$ echo hello wordle | grep -P "\w{5}\s\w{5}$"
$

You can extract elements from a string with an arbitrary delimiter with awk. For example, this takes the first and sixth elements from a line of CSV:

$ echo this,is,a,line,of,csv | awk -F',' '{print $1 " " $6}'
this csv
$

To prettify output, use column like this:

$ echo "hello, world. I love you!
goodbye cruelest world! So sad" | column -t
hello,   world.    I       love  you!
goodbye  cruelest  world!  So    sad

$

To print to standard out as well as to a file, use tee. For example:

$ echo And now for something completely different | tee /tmp/monty_python.txt
And now for something completely different
$ cat /tmp/monty_python.txt 
And now for something completely different

To capture everything typed and output to your terminal (very useful in training), use script:

$ script -f /tmp/my_keystrokes.log
Script started, file is /tmp/my_keystrokes.log
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
Script started on 2022-05-13 16:10:37+0100
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
$

Beware its recursive nature! Anyway, you stop it with an exit.

You can poll an output with watch. For example, this will keep an eye on the Netty threads in a Java application (IntelliJ as it happens):

watch "jstack `jps | grep Main | awk '{print \$1}'` | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note that the $1 has been escaped and the quote mark within the quotes has been triple escaped. The switch -A10 is just to show the 10 lines After what we pattern matched. Backticks execute a command within a command. Of course, we can avoid escaping the $1 with:

$ watch "jstack $(jps | grep Main | awk '{print $1}') | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note that $(...).

Resources

The command htop gives lots of information on a running system. Pressing P or M orders the processes by processor or memory usage respectively. VIRT and RES are your virtual memory (how much your application has asked for) and resident memory (how much it's actually using); the latter is normally the more important. The load average tells you how much work is backing up: anything over the number of processors you have is suboptimal. How many processors do you have?

$ grep -c ^processor /proc/cpuinfo 
16
$

The top command also lists zombie tasks. I'm told these are threads that are irretrievably stuck, probably due to some hardware driver issue.

File handles can be seen using lsof. This can be useful to see, for example, where something is logging. For instance, guessing that IntelliJ logs to a file that has log in its name, we can run:

$ lsof -p 12610 2>/dev/null | grep log$
java    12610 henryp   12w      REG              259,3    7393039 41035613 /home/henryp/.cache/JetBrains/IntelliJIdea2021.2/log/idea.log
$

The 2>/dev/null redirects errors (file descriptor 2, standard error) to a dark pit where they are ignored.

To see what your firewall is dropping (useful when you've misconfigured a node), run:

$ sudo iptables -L -n -v -x

To see current network connections, run:

$ netstat -nap 

You might want to pipe that to grep LISTEN to see what processes are listening and on which port. Very useful if something already has control of port 8080.

For threads, you can see what's going on by accessing the /proc directory. While threads are easy to see in Java (jstack), Python is a little more opaque, not least because the Global Interpreter Lock (GIL) only really allows one physical thread of execution (even if Python can allow logical threads). To utilise more processors, you must start a heavyweight process (see "multiprocessing" here). Anyway, find the process ID you're interested in and run something like:

$ sudo cat /proc/YOUR_PROCESS_ID/stack
[<0>] do_wait+0x1cb/0x230
[<0>] kernel_wait4+0x89/0x130
[<0>] __do_sys_wait4+0x95/0xa0
[<0>] __x64_sys_wait4+0x1e/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

as everything is a file in Unix, right? This happens to be the stack in Python code that is time.sleeping. You'll see a similar looking stack for a Java thread that happens to be waiting.

If you want to pin work to certain cores, use something like taskset. For example, if I wanted to run COMMAND on all but one of my 16 cores, I run:

taskset 0xFFFE COMMAND

This is very useful if some data munging is so intense it's bringing my system down. Using this, at least one thread is left for the OS.

Finally, vmstat gives you lots of information about the health of the box such as blocks being read from and written to disk (bi/bo), the number of processes that are runnable (not necessarily running) and the number blocked (r/b), and the number of context switches per second (cs).


Tuesday, April 19, 2022

Tips for effective MLOps

Some miscellaneous tips I've discovered after over a year of being hands-on with a clinical ETL pipeline.

Technology

Set up a local dev environment (Git, Pip, IDE of choice, Python environment etc). Being able to test locally could not be more important. For instance, I've been using Palantir's Foundry but since it's just a wrapper around Spark, you can have a full, locally run test suite using PySpark.

If you can't get the data on your laptop (GDPR etc), use simulated data. If your model cannot find a signal in the noise you control, it likely won't when it faces real data. Make these tests automated so they can be part of your CI/CD pipeline.
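
As a sketch of what a locally run, simulated-data test might look like (the stack in question is PySpark on Foundry; this uses Spark's Scala API and invented column names purely to illustrate the idea):

import org.apache.spark.sql.SparkSession

object SimulatedDataSpec extends App {
  // Spin up Spark entirely on the laptop; no cluster, no real data.
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("simulated-data-test")
    .getOrCreate()
  import spark.implicits._

  // Simulated data with a property we control: length of stay is never negative.
  val simulated = Seq(("patient1", 3), ("patient2", 0), ("patient3", 7))
    .toDF("patient_id", "length_of_stay")

  // The assertion a CI/CD pipeline could run on every commit.
  assert(simulated.filter($"length_of_stay" < 0).count() == 0)

  spark.stop()
}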

Keep everything as deterministic as you can. Using commands like SQL's first means you have no guarantees what will come out of the sausage machine. How can your QAs test that?
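
A sketch of the difference (Spark's Scala API; the events DataFrame and its columns are assumptions of mine):

import org.apache.spark.sql.{DataFrame, functions => F}
import org.apache.spark.sql.expressions.Window

val events: DataFrame = ???  // assumed to have columns patient_id, event and event_time

// Non-deterministic: which row first() picks depends on partitioning and ordering.
val flaky = events.groupBy("patient_id").agg(F.first("event").as("event"))

// Deterministic: order explicitly, then take exactly one row per patient.
val window = Window.partitionBy("patient_id").orderBy(F.col("event_time").asc)
val stable = events
  .withColumn("rn", F.row_number().over(window))
  .where(F.col("rn") === 1)
  .drop("rn")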

Prefer to write intermediary data sets to persistent storage and load them again in the next stage over processing everything in memory with a single output. It might make the pipeline take longer to run but it gives you something like an "audit log" of the steps in your model.

Have diagnostic transforms in your pipeline. Typically, they make assertions to ensure self-consistency (eg, all patients' hospital lengths-of-stay must be non-negative; the aggregated total of overnight patients cannot exceed the total number of beds). These transforms output any violations into data sets that will never be seen by the end user, just the MLOps team. Diagnostic transforms have the advantage over local tests in that they can use real data, as they are embedded in the production pipeline.
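
A hedged sketch of such a transform (again Spark's Scala API; the column names and output path are my own assumptions, not the real pipeline's):

import org.apache.spark.sql.{DataFrame, functions => F}

val admissions: DataFrame = ???  // assumed to have admission_date and discharge_date columns

// Every violating row goes into its own data set, seen only by the MLOps team.
// An empty output means the self-consistency check passed.
val lengthOfStayViolations = admissions
  .withColumn("length_of_stay", F.datediff(F.col("discharge_date"), F.col("admission_date")))
  .where(F.col("length_of_stay") < 0)

lengthOfStayViolations.write.mode("overwrite").parquet("/diagnostics/length_of_stay_violations")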

Process

Testing data is harder than testing code so spare a thought for your QAs. Talk to them long before you cut a line of code and establish whether they can test what you're building. If not, find out what you need to give them.

Talk to data owners, tell them what you're trying to do and ask them how they'd do it. They often tell you things you'd never know a priori - like which data is, ahem, "less reliable". 

Spend some time (a week?) just playing with the data (EDA) so things become more apparent. It's important to know, for instance, if a column called "region" in an A&E data set is the region of the patient or of the care provider. Since a patient in A&E may be treated on the other side of the country (perhaps they were taken ill while on holiday), the two regions might have nothing to do with one another.

Make sure everybody is singing from the same hymn sheet with a ubiquitous language. Calculating the number of weeks a patient waits couldn't be easier, right? Take the number of days between the start and end dates and divide by 7, right? But do we round up or down? And MS SQL Server has its own weird way of doing this calculation.
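
A toy example of why it needs spelling out (plain Scala, not the actual pipeline code):

import java.time.LocalDate
import java.time.temporal.ChronoUnit

val start = LocalDate.of(2022, 1, 1)
val end   = LocalDate.of(2022, 1, 11)                      // a 10-day wait

val days             = ChronoUnit.DAYS.between(start, end) // 10
val weeksRoundedDown = days / 7                            // 1
val weeksRoundedUp   = (days + 6) / 7                      // 2: same wait, different answer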

Data

Don't forget the effect of time on your data (data drift). For instance, care provider networks often get restructured: does hospital trust X from 2020 go by a different name today? Patients move around the country so the same person may be associated with more than one locale. Did more data sources come online over time?

How often is the data updated and how stale is it? These are not the same thing. For instance, the data may be updated daily but its content lags behind by several months.

Were dodgy assumptions made on upstream data sets that you need to be aware of? For instance, removing dead people from the patients data set might have made sense upstream when dealing with vaccination data but not for historical accident and emergency data.

When unifying data sets, watch out for semantically equivalent values that may be lexically different (eg, if one data set uses an ethnicity of "British white" and another that says "White British", then a union of the two will look like two different ethnicities). This problem becomes harder when one data set uses, say, different age bands. How would you reconcile them?
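
A sketch of one way to handle the simple case (Spark's Scala API; the data sets and the hard-coded mapping are my own assumptions):

import org.apache.spark.sql.{DataFrame, functions => F}

val datasetA: DataFrame = ???  // both assumed to have an ethnicity column
val datasetB: DataFrame = ???

// Map lexically different but semantically equivalent labels onto one canonical value.
def canonicalEthnicity(df: DataFrame): DataFrame =
  df.withColumn("ethnicity",
    F.when(F.col("ethnicity") === "British white", "White British")
     .otherwise(F.col("ethnicity")))

val unified = canonicalEthnicity(datasetA).unionByName(canonicalEthnicity(datasetB))

Reconciling different age bands is harder: generally the best you can do is coarsen both data sets to the widest bands they have in common.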

Monday, April 11, 2022

More MLOps: bug hunting

We have several ML models for which we calculate the odds (the coefficients in a linear regression model) and the risks (the difference in probabilities between the factual and the counterfactual). 

Previously, we saw that for one of our six models, the sign of the odds and risks differed when they should always be the same.

Fortunately, we caught this by putting a transform into our pipeline that takes the output of both the linear regression model and factual/counterfactual probabilities and simply checks them.

Part of our ML pipeline. Some transforms create production data, others diagnostic.

In the graph above, anything with _regression_tests appended is a diagnostic transform while everything else is production data that will ultimately be seen by the customer.

Recently, a diagnostic test was reporting:
difference in risk (0.14285714285714285) does not agree with the sign of the coefficient (-4182.031458784066) for STP QR1 when p-value = 0.988606800326618
Transform containing pipeline diagnostic data.

Aside: there are other violations in this file but they're not unexpected. For instance, there are 42 health regions in England but the data has some random rubbish from the upstream data capture systems. The last row is due to us expecting at least 10 statistically significant outputs from our model when we only found 9. That's fine, we'll turn up the squelch on that one. A red flag would be if the number of insights dropped to zero. This happened when we needed the outcome to be binary but accidentally applied a function of type ℕ => {0,1} twice. Oops.

In a previous post, we found the problem was that the "risks" and the "odds" transforms, although using the same input file, were consuming that file at significantly different times. The problem manifested itself when most regions for that one model had different signs. But this single diagnostic we see in the violations above is different. Only one region (QR1) differs. All the others in the data set were fine. 

Looking at the offending region, the first thing to note was that we had very little data for it. In fact, of the 42 health regions, it had the least by a large margin.
Number of rows of data per health region for a particular model

So, the next step was to build a Python test harness using these 127 rows and run the data against the production code locally. Running this on my laptop is so much easier as I can, say, create a loop and run the model 20 times to see if there was some non-determinism happening. Apart from tiny changes in numbers due to numerical instability, I could not recreate the bug.

So, like in the previous post, could my assumption that the data is the same for both risk and odds calculations be flawed? The pipeline was saying that the two descendant data sets were built seconds after the parent.

Both risk and odds data descend from the same parent

Could there be some bizarre AWS eventual consistency issue? This seems not to be the case as S3 reads and writes now have strong consistency (see this AWS post from December 2020).

Conclusion

Why just one region in one model has a different sign is still a mystery. Multiple attempts to see the same bug have failed. However, since there is so little data for this region, the p-values are very high (c. 0.99) and we disregard any insights anyway. We'll just have to keep an eye out for it happening again.

Saturday, April 9, 2022

The Scala 3 Type System - part 2

This is the successor to my first post about Scala 3's great new type system.

Peano Axioms (again)

For fun, I'm modelling the natural numbers thus:

   sealed trait Nat
   case object Zero extends Nat
   case class Succ[N <: Nat](n: N) extends Nat 

   transparent inline def toNat(inline x: Int): Nat =
     inline x match
       case 0 => Zero
       case 1 => Succ(Zero)
       case _ => Succ(toNat(x - 1))


But despite me using transparent, the function toNat just returns the super-type Nat (as Succ[Nat]), not Succ[Succ[Succ[....

scala> toNat(3)
val res0: Succ[Nat] = Succ(Succ(Succ(Zero)))


So, with a bit of help from the guys on the Scala metaprogramming Discord channel, I came up with this:

  transparent inline def toNat[I <: Int]: ToNat[I] =
    inline erasedValue[I] match
      case z: 0    => Zero
      case s: S[t] => Succ(toNat[t])

which worked:

scala> toNat[5]
val res0: Succ[Succ[Succ[Succ[Succ[Zero.type]]]]] = Succ(Succ(Succ(Succ(Succ(Zero)))))

You'll notice two major differences. First, toNat no longer takes any arguments. We use the parameterized type in lieu of a parameter.

The second is this bizarre S[_] (which is imported from scala.compiletime.ops.int). It gives us the notion of a successor. Let's look more closely at it.
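
Before that, an aside: the definition of ToNat isn't shown in the snippet above. A plausible definition, and this is my guess rather than necessarily the original code, is a match type that mirrors the value-level recursion, again leaning on S:

import scala.compiletime.ops.int.S

type ToNat[I <: Int] <: Nat = I match
  case 0    => Zero.type
  case S[n] => Succ[ToNat[n]]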

S Express(ion)

From the Scala source code:

  /** Successor of a natural number where zero is the type 0 and successors are reduced as if the definition was:

   *
   *  ```scala
   *  type S[N <: Int] <: Int = N match {
   *    case 0 => 1
   *    case 1 => 2
   *    case 2 => 3
   *    // ...
   *    case 2147483646 => 2147483647
   *  }
   *  ```
   *  @syntax markdown
   */
  type S[N <: Int] <: Int

What sorcery is this? I see no implementation of that huge match statement!

PhillHenry
The bit I didn't understand was S[t] => ... I assume the compiler enforces that t must be preceeding natural number of S. Is this compiler magic? Or can anybody enforce that constraint in plain Scala?

ragnar
Its compiler magic in the same way that the literal 3 is compiler magic. Basically val x: S[3] = 4 and val y: S[x.type] = 5 work as you would expect, the same holds true in reverse for pattern matching.

(the Scala/metaprogramming Discord server, 3 April 2022).

Compiler Options

By default, the compiler will limit the number of successive inlines to 32 (after all, you need to tell the compiler when things have become pathological). This is far too small for us and 1024 sounds a bit more sensible. However, I use SBT and wasted an hour working out exactly how to pass this value to it. You need to add:

     scalacOptions ++= Seq("-Xmax-inlines",  "1024")

Note that the flag and its value are two separate elements of the Seq. If you foolishly try to add them as a single string, as they would appear on the CLI, the compiler/SBT will ignore it.

The takeaway point

A good approach to metaprogramming is to imagine that the functions are being called but the arguments correspond to the parameterized types, not the function's parameters.  

Friday, April 8, 2022

Python/Pandas Tips

A few miscellaneous tips that I keep referring to so I thought I'd make a post.

Flatmap in Python

Given a list of lists t,

flat_list = [item for sublist in t for item in sublist]

This is a more Pythonic solution [StackOverflow] than 

from itertools import chain

where you might do something like this:

spark_map = F.create_map([F.lit(x) for x in chain(*python_map.items())])

Here, the chain is basically flattening a list of tuples (from .items()).

Pandas

Even if your main code does not use Pandas, it can be very useful for writing assertions in your tests.

Say, you have data in the form of lists of lists in variable rows. You can organise this into a Pandas data frame with something like [SO]:

df = pd.DataFrame(rows, range(len(rows)), ["COLUMN_1", "COLUMN_2"])

Now, if I want to get the COLUMN_1 value for the highest COLUMN_2, I do:

df.iloc[df["COLUMN_2"].idxmax()]["COLUMN_1"]

One can sort Pandas data frames with something like this [StackOverflow]:

df = df.reindex(df.coefficients.abs().sort_values().index)

where the absolute value of the column coefficients is what I want to sort on.

One can filter with something like:

cleaned = df[(df["p_values"] < 0.05) & ((df["coefficients"] > 0.1) | (df["coefficients"] < -0.1))]

Or, if you want to filter something that does not contain a string, say:

fun[~fun["feature"].str.contains("intercept")].head(50)

Pandas doesn't try to guess the types in the file so if you have a date for instance, you need to do something like:

df['time'] = pd.to_datetime(df['time'])

to convert a string to a timestamp type.

To create a new column from old columns, the syntax is simply:

df["z_score"] = df["coefficients"] / df["standard_error"]

You can concat dataframes with something like 

pd.concat([df, df2], axis=1)

where the axis is that of the concatenation - 0 for rows (vertically), 1 for columns (horizontally) etc.

But to join, you need something like:

pd.merge(df1, df2, on=cols, how="inner")

where the parameters are obvious.

Tests

Useful tip when running tests: if you want only a subset to run, you can use a glob ignore. For example:

python -m pytest integration_tests/ --ignore-glob='integration_tests/test_actionable*.py'

will ignore tests that start with test_actionable.