Wednesday, December 21, 2022

Time series

Meta's much hyped Prophet seems to have problems identifying spikes in small numbers. For instance, we had a hospital that was admitting 4- to 5-times the normal number of patients on Fridays. Then, one day for whatever reason, Friday became a day like any other. Despite a lot of brute-force hyperparameter tuning, the ratio of RMSE/mean stayed at about 0.69.

Prophet (mis)handling significant behavioural change

From the issues on Github, "by default the model will do a bad job on this time series because the model assumes smooth seasonality. That assumption comes because seasonality is modeled with a truncated Fourier series, which basically means it cannot change very rapidly."

You can add regressors to the model but, firstly, I don't want to do this manually (I'd have to inspect thousands of data sets by eye!) and, when I tried it, my RMSE was worse - for reasons as yet unknown. Plots showed that it simply translated predictions for that spline down the y-axis; there was no change in how it treated the periodicity.
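For reference, a minimal sketch of that add-a-regressor approach (the admissions.csv file and the friday_spike flag are hypothetical; older Prophet versions import from fbprophet instead):

import pandas as pd
from prophet import Prophet

df = pd.read_csv("admissions.csv")  # hypothetical file with Prophet's usual ds/y columns
df["friday_spike"] = (pd.to_datetime(df["ds"]).dt.dayofweek == 4).astype(int)  # flag Fridays

m = Prophet()
m.add_regressor("friday_spike")
m.fit(df)

future = m.make_future_dataframe(periods=28)
future["friday_spike"] = (future["ds"].dt.dayofweek == 4).astype(int)
forecast = m.predict(future)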

SARIMAX

On the same data, the default SARIMAX implementation in statsmodels gives you:

SARIMAX with default values

You need to explicitly tell it that there is weekly seasonality. In this case seasonal_order = (1, 0, 1, 7) works well: the 7 means we expect weekly behaviour. Indeed, SARIMAX quickly recognises the change:

SARIMAX with an explicitly provided seasonality
And the overall RMSE/mean ratio becomes a little better at 0.61.
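For the record, a minimal sketch of the above, assuming a pandas Series y of daily admissions (the non-seasonal order is just a placeholder):

from statsmodels.tsa.statespace.sarimax import SARIMAX

# the trailing 7 in seasonal_order tells the model to expect weekly seasonality
model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7))
fit = model.fit(disp=False)
forecast = fit.forecast(steps=28)  # four weeks ahead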

Correlations

Don't be fooled by in-series correlations versus cross-series correlations. This SO link shows how the random walks generated by tossing two coins can appear to be correlated (just by luck) when of course they could not possibly be. This is because each has in-series correlation: each value in the cumulative total of HEADS minus TAILS is within ±1 of the previous one.
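A quick way to convince yourself of this (my own sketch, not the code from the SO answer):

import numpy as np

rng = np.random.default_rng(42)

def coin_walk(n: int) -> np.ndarray:
    # cumulative total of HEADS minus TAILS
    return np.cumsum(rng.choice([-1, 1], size=n))

a, b = coin_walk(1000), coin_walk(1000)
print(np.corrcoef(a, b)[0, 1])  # frequently far from zero, despite the walks being independent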

SciPy appears to have a tool to find the correlation between series even if the effect is lagged.
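Something like scipy.signal.correlate will do it; a hedged sketch (the synthetic series and the 3-step lag are made up):

import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = np.roll(a, 3) + 0.1 * rng.normal(size=200)  # b is a shifted copy of a, plus noise

xcorr = signal.correlate(a - a.mean(), b - b.mean(), mode="full")
lags = signal.correlation_lags(len(a), len(b), mode="full")
print(lags[np.argmax(xcorr)])  # the lag at which the two series line up best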

Saturday, November 26, 2022

MLOps: debugging a pipeline

The domain

Healthcare in England is broken down into about 40 regions. For each region, we want to measure the differences in clinical outcomes conditioned on the ethnic and socioeconomic categories of the patients. To do this, we feed the data for each health region into a Spark GLM.
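For orientation, a hedged sketch of what such a per-region fit can look like in Spark ML (the column names, family and link are assumptions, not our production model):

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

indexer = StringIndexer(inputCol="ethnicity", outputCol="ethnicity_idx")
encoder = OneHotEncoder(inputCols=["ethnicity_idx"], outputCols=["ethnicity_vec"])  # Spark 3.x API
assembler = VectorAssembler(inputCols=["ethnicity_vec", "deprivation_index"], outputCol="features")
glm = GeneralizedLinearRegression(family="binomial", link="logit",
                                  featuresCol="features", labelCol="outcome")

pipeline = Pipeline(stages=[indexer, encoder, assembler, glm])
model = pipeline.fit(region_df)  # region_df holds one health region's data (hypothetical name)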

The problem

Everything was fine with our pipeline for six months before it started to blow up with:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function(GeneralizedLinearRegressionModel$$Lambda$4903/0x0000000101ee9840: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, double) => double)

Now, before we do a deep dive, the first thing to note is that we have a robust suite of tests that use synthetic data and they are all passing. 

Secondly, the code that was blowing up was used by five other data sets and they were all working fine in production.

If the code seems OK but one path through the ML pipeline was blowing up in code common to other paths, what does this suggest? Well, if it's not the code, there must be something suspicious about the data, right? The tests use synthetic data so of course they would pass.

The investigation

The first course of action when debugging is to take a good, long stare at the error. This might be obvious but many devs pay insufficient attention to it as it's generally a hundred lines of hard-to-read stack trace. This is like a detective who disregards the crime scene because there's too much evidence to collect. 

Anyway, our murder scene was full of Scala and Python stack traces but, persevering, we found that the line triggering the error was a call to Dataframe.collect(). This is generally suspicious but on this occasion we knew we were dealing with a very small data set, so it seemed safe. Indeed, there were no OOMEs, which are the most common problem with calls to collect().

But remember that Spark is lazily evaluated: the root cause could be something deeper in the stack. So, navigating a few stack frames back, we see some one-hot encoding of ethnic groups. Hmm, what can go wrong with one-hot encoding? Well, one potential gotcha is that when there is only one category, an exception will be raised.

However, this seemed unlikely. We break down ethnicities into only five groups and there are over a million people in each health region. It would be extraordinarily unlikely if there were a region that only had patients of a single ethnicity. 

Time to look at the data.

Any region with such homogenous patient data probably has very little data to begin with, so let's count the number of rows per region. And bingo! There it is: a region called null that has a single (white) patient. This was a recent development in the data being fed into the model, which explained why things had worked so well for so long.
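The check itself is a one-liner in PySpark (the DataFrame and column names here are hypothetical):

from pyspark.sql import functions as F

(df.groupBy("region")
   .agg(F.count("*").alias("n_patients"))
   .orderBy("n_patients")
   .show())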

The offending row comes from upstream data sets curated by a different department entirely, so we're still considering what to do. For now, we could apply a band-aid and filter out any region called null or, better still, any region with fewer than a few thousand patients (as otherwise we're likely to get single-category cohorts).

One model to rule them?

At the end of the day, the code, the model and the data need to be considered holistically. For instance, which data sets you feed into a model must be evaluated beforehand. 

As an example, we also condition on age bands in this particular GLM, so if we were to feed neonatal or paediatric data into the model it would blow up as all patients would fall into the 0-18 age band. Obvious when you think about it but perhaps surprising if you've inherited somebody else's code.
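A cheap guard against this whole class of problem is a pre-flight check before fitting; a sketch under assumed column names:

from pyspark.sql import DataFrame

def assert_not_degenerate(df: DataFrame, categorical_cols=("ethnicity", "age_band")) -> None:
    # refuse to fit if any categorical column collapses to a single value,
    # which would leave one-hot encoding (and the GLM) nothing to work with
    for col in categorical_cols:
        distinct = df.select(col).distinct().count()
        if distinct < 2:
            raise ValueError(f"Column {col!r} has only {distinct} distinct value(s)")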

Saturday, November 12, 2022

Architectural patterns

Some architectural terms (old and new) that I keep bumping into.

Eventual Consistency
"Eventual consistency — also called optimistic replication — is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, ultimately all accesses to that item will return the last updated value.  Eventually-consistent services are often classified as providing BASE semantics (basically-available, soft-state, eventual consistency), in contrast to traditional ACID ... Another great model that can be directly implemented in the application layer is strong eventual consistency (SEC), which can be achieved via conflict-free replicated data types (CRDT), giving us the missing safety property of eventual consistency.  

"Event-driven applications usually favor eventual consistency, for the most part. However, we could also opt-in for strong consistency in particular system areas. Thus, it is fair to say we can combine both consistency models depending on our use case." - Functional Event-Driven Architecture, Volpe

The impacts of consistency on Microservices

Microservices should ideally be totally independent. For example, in a highway management system, the weather service is totally orthogonal to the roadworks service even though both have an impact on congestion. However, microservices in the real world often have soft dependencies. As a result, "in a microservices world, we don’t have the luxury of relying on a single strongly consistent database. In that world, inconsistency is a given." [James Roper]

Hugo Oliviera Rocha outlines some antipatterns here. The first is "events as simple notifications. The source system publishes an event notifying the consumers that something changed in its domain. Then the consumers will request additional information to the source system... The main issue and the main reason why this option should be seldom used is when you apply it to a larger scale.

"[I]nstead of requesting the source system for additional information, it is possible to save the data internally as a materialized read model... The main issue isn’t the disk space, it is the initialization, maintenance, and keeping that data accurate."

He says event sourcing is just a band-aid and suggests using fat (i.e., denormalised) messages. The downside is that they can be chunky.

CRDT
"To implement eventually consistent counting correctly, you need to make use of structures called conflict-free replicated data types (commonly referred to as CRDTs). There are a number of CRDTs for a variety of values and operations: sets that support only addition, sets that support addition and removal, numbers that support increments, numbers that support increments and decrements, and so forth." - Big Data, Nathan Marz

To a functional programmer, this looks a lot like semigroups and reducing.
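A toy illustration of that intuition (not any particular library): a grow-only counter whose merge is an element-wise max, i.e. a commutative, associative and idempotent operation you can reduce over in any order:

def increment(counter: dict, replica: str, by: int = 1) -> dict:
    # each replica only ever bumps its own slot
    updated = dict(counter)
    updated[replica] = updated.get(replica, 0) + by
    return updated

def merge(a: dict, b: dict) -> dict:
    # element-wise max: the order and repetition of merges don't matter
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

r1 = increment({}, "replica-1")
r2 = increment(increment({}, "replica-2"), "replica-2")
assert value(merge(r1, r2)) == value(merge(r2, r1)) == 3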

Data Mesh
"Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, domain-specific data consumers and views “data-as-a-product,” with each domain handling their own data pipelines. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards." [TowardsDataScience]

"Data Mesh is a journey so you cannot implement Data Mesh per-se, you need to adopt the principles and start to make incremental changes." Adidas's journey [Medium]. Of the seven points given, two (decentralization and self-service) are the antithesis of ontologies.

Batch Views
"The batch views are like denormalized tables in that one piece of data from the master dataset may get indexed into many batch views. The key difference is that the batch views are defined as functions on the master dataset. Accordingly, there is no need to update a batch view because it will be continually rebuilt from the master dataset. This has the additional benefit that the batch views and master dataset will never be out of sync."  Big Data, Nathan Marz

Saga Pattern
"The Saga Pattern is as microservices architectural pattern to implement a transaction that spans multiple services. A saga is a sequence of local transactions. Each service in a saga performs its own transaction and publishes an event. The other services listen to that event and perform the next local transaction" [DZone]

Example in Cats here.

Type 1 and 2 data evolution
Slowly changing dimensions [Wikipedia] is a "concept that was introduced by Kimball and Ross in The Data Warehouse Toolkit." A strategy could be that the data source "tracks historical data by creating multiple records. This is called a type 2 dimension." [The Enterprise Big Data Lake - Gorelik]

Type 1 overwrites a row's data, as opposed to type 2, which adds a new row.

Data Marts
Definitions for data marts tend to be a bit woolly but the best I heard was from a colleague who defined it as "data structured for use cases and particularly queries."

Data Marts tend to use type 2 dimensions (see above). 

Hexagon Architecture
Hexagon, a.k.a. Onion, a.k.a. Ports and Adapters, "give us patterns on how to separate our domain from the ugliness of implementation." [Scala Pet Store on GitHub] This is an old pattern, as anybody who has written microservices will know, but the name was new to me. The idea is that there are many faces the app shows the outside world for means of communication but the kernel inside "is blissfully ignorant of the nature of the input device." [Alistair Cockburn] This facilitates testing and reduces the cognitive overhead that comes from having business logic scattered over many tiers and codebases.

Microservices
This is a huge area but here are some miscellaneous notes.

Before you jump on board with the Java based Lagom, it's worth noting that Martin Fowler wrote "Don't start with microservices – monoliths are your friend". This provoked a whole debate here. It's all worth reading but the comment that stuck out for me was:
"Former Netflix engineer and manager here. My advice:
Start a greenfield project using what you know ... Microservices is more often an organization hack than a scaling hack. Refactor to separate microservices when either: 1) the team is growing and needs to split into multiple teams, or 2) high traffic forces you to scale horizontally. #1 is more likely to happen first. At 35-50 people a common limiting factor is coordination between engineers. A set of teams with each team developing 1 or more services is a great way to keep all teams unblocked because each team can deploy separately. You can also partition the business complexity into those separate teams to further reduce the coordination burden."
A fine example of Conway's Law.

Builds in large organisations

Interestingly, Facebook report Git not being scalable. Meanwhile, Google uses Bazel which is supposed to be polyglot and very scalable.

Strangler Pattern
This is one of those obvious patterns that I never knew had a name.

"The Strangler pattern is one in which an “old” system is put behind an intermediary facade. Then, over time external replacement services for the old system are added behind the facade... Behind the scenes, services within the old system are refactored into a new set of services." [RedHat]

The main downside can be the maintenance effort.

Medallion architecture

This [DataBricks] divides data sets into bronze (raw), silver (cleaned) and gold (application-ready).

(GitHub) Action stations

Here are some notes I made on learning GitHub Actions:

There are some implicit environment variables. For instance, GITHUB_ENV (docs) is a temporary file that can hold environment variables like this:

          echo "ENVIRONMENT=develop" >> $GITHUB_ENV

This only appears to take effect in the next run block.

In addition to these, there are contexts, which are "a way to access information about workflow runs, runner environments, jobs, and steps." For instance, github.ref refers to "the branch or tag ref that triggered the workflow run" (docs) and you use it with something like:

        if: endsWith(github.ref, '/develop')

To set up secrets you follow the instructions here. It asks you to go to the Settings tab on the GitHub page. If you can't see it, you don't have permission to change them. You can reference these secrets like any other context. For example, to log in to AWS:

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: '${{ secrets.AWS_ACCESS_KEY_ID }}'
          aws-secret-access-key: '${{ secrets.AWS_SECRET_ACCESS_KEY }}'
          aws-region: eu-west-2


where aws-actions/configure-aws-credentials@v1 (and its ilk) are plugins that facilitate access to third-party tools.

Contexts can also reference the output of actions. For example:

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: 'Build, tag, and push image to Amazon ECR'
        env:
          ECR_REGISTRY: '${{ steps.login-ecr.outputs.registry }}'


where login-ecr is an arbitrary ID but outputs.registry is part of the action's data structure.

Saturday, November 5, 2022

The semantics of Streaming

A clever but non-technical colleague asked why a batch system could not simply stream its data. The reason this is a big ask is that the semantics of batch and streaming are different, no matter how we try to pretend they are not. 

A file has a notion of completeness: it has an end. A stream does not necessarily. You might like to send a message in a stream that indicates it has finished, but now you impose an ordering constraint that the file did not necessarily have.

And if you impose a constraint on order, you can no longer parallelize reading the stream. Again, no such constraint exists with a file. 

Note that these semantic objections are orthogonal to the argument that streams can be viewed as tables [Confluent]. That argument is merely an abstraction whereas the rest of this post focusses on the real differences between streams and batches.

Size

Using Scala's (2.13) built-in streams, we can create a stream of Fibonacci numbers with:

val fibs: Stream[Int] = 0 #:: fibs.scanLeft(1)(_ + _) // from the docs: `scanLeft` is analogous to `foldLeft`

We can then pretend that this stream is a Seq just like any other.

val seq: Seq[Int] = fibs
println(seq.take(5).mkString(", ")) // 0, 1, 1, 2, 3

But what kind of Seq never terminates when you call a simple .size on it?

Aside from the fact that Seq is generally frowned upon (it makes no performance guarantees, unlike Vector and List; Cats incidentally eschews its use and you can't do things like call sequence on it), we can't pretend that potentially infinite streams are the same as strictly finite sequences.

Empty Streams

... present problems. Paul Snively on the FS2 chat said:
I don't know if it matters, but keep in mind that the types of Stream.empty and Stream.emits(List.empty[A]) are not the same.
You can see in the REPL that this is true:

scala> Stream.emits(List.empty[String])
val res0: fs2.Stream[[x] =>> fs2.Pure[x], String] = Stream(..)
scala> Stream.empty
val res1: fs2.Stream[fs2.Pure, fs2.INothing] = Stream(..)

Things are even worse if you try to "run" the stream:

scala> Stream.emits(List.empty[String]).repeat

This just hangs while also using an entire core. So does this:

scala> Stream.empty.repeat

Effectful streams
Lucas Kasser @lkasser1 Jul 03 06:22
If I have a Stream[IO, A], is there a way to access the individual IOs? I'd like to be able to get a Stream[IO, IO[A]] so that I can retry individual elements in the stream.
I've looked through the docs, but I didn't see any function like uneval

Fabio Labella @SystemFw Jul 03 09:58
No, it's not possible because a Stream is not just a List of IOs
it's monadic, so it's more like a tree (some of the IOs depends on the result of previous ones)
Complete vs Incomplete Data

Some ciphers (for instance, RSA) need the whole of the data to de/encrypt. "Some modes of operation can make block ciphers [like AES] act as stream ciphers." [SO] This differs from a true streaming cipher (like ChaCha20) but, by using Chunks, we can simulate it.

Grouping & Streaming in Spark
"Developing the translation layer (called runner) from Apache Beam to Apache Spark we faced an issue with the Spark Structured Streaming framework: the problem is that this framework does not support more than one aggregation in a streaming pipeline. For example, you cannot do a group by then a reduce by in a streaming pipeline. There is an open ticket in the Spark project, an ongoing design and an ongoing PR, but, as for now, they received no update since the summer 2019. As a consequence, the Beam runner based on this framework is on hold waiting for this feature from the Spark project." [Etienne Chauchot's blog]
Basically, if there are two grouping operations, op1 and op2, the grouping in op1 might make the data fed into op2 out of date. It might have gone stale while it was sitting in op1's buffer.
"[S]treaming systems define the notion of watermark. It is what gives the system the notion of completeness of data in a constant flow of streaming data. It is the point in time when the system should not receive older elements. As streaming systems rely on windowing to divide this stream of data, the watermark can also be defined as the system notion of when all the data in a certain window can be expected to have arrived in the streaming pipeline. When the watermark passes the end of the window, the system outputs data." [ibid]

 

Tuesday, October 11, 2022

CI/CD in the Cloud

I asked on Discord:

Are there any tools to weave all the components together so both developers and the CI/CD tool can easily run the same suite of integration tests? I've done a PoC that takes one application in the pipeline and runs it locally while it talks to an AWS environment I've provisioned with Terraform. I intend to use docker-compose to bring up all apps altogether for integration testing. The going is slow. How do other people tackle the same problem?

and user hubt gave me some great insights. Here is an abridged version of his/her answers:

Some people move to Kubernetes instead. There are tons of CI/CD tools built on it.

I have used shared dev environments for a long time. Not having individual dev environments has its drawbacks but it also has strengths too. 

One advantage is everyone sees the same data and database. At some point you start to run into problems where your bugs are based on specific data. With individual dev environments, these bugs are hard to reproduce for everyone unless there is a way of sharing both code and data. If you have a super solid, consistent data set that you have curated and built for every test then you are miles ahead of the game. But maintaining and updating that data set and test suite is very hard to do.

We have a promotion process from shared dev to staging to prod. We run continuous integration tests in dev and staging. People break those on a regular basis.

We don't have developers run them. We just [run] them automatically hourly. Our process works reasonably well. The hourly integration test cadence is driven more by how long the integration test suite takes rather than being strictly hourly.

[If somebody deletes a file from the dev environment thinking it wasn't needed] they would have broken dev and we would debug dev or staging to fix things.

Admittedly this probably doesn't scale for 100+ developers. I deal with a team of a few dozen developers so it isn't quite the same as supporting a team of 4 or a team of 100+.

We also have separate unit tests for some things. those are targeted and should be 100% successful. Integration tests can fail for a variety of different reasons unrelated to someone's code. Unit tests should not. So, yes the integration tests are more general.

[Regarding creating the dev environment on every test suite run] I think that only makes sense in very tightly controlled and constrained environments like if you have an application that has no database.

It's not like we are able to recreate our production environment regularly, so in some ways you want to handle things like you do production. Recreating environments per test run would make sense if you are shipping packaged software. But as a cloud product/web service it makes less sense.

Discord user dan_hill2802 also gave his insights here and here:

We have developers able to easily build AWS environments for development and testing and tear down again when done. The pipelines use the same mechanism to build independent environments in AWS, run tests, then tear down

We aim to have applications as cloud agnostic as possible, so a lot of application development is done with docker containers locally. We also use localstack.cloud [open source community edition] for mocking some of the cloud endpoints. However, it's still not quite representative enough so we enable developers to deploy the application and all supporting infrastructure in AWS, they can also attach to their environment for debugging. The "deploy" is a wrap up of a few steps, including Terraform apply, setting up database users for the app, seeding the database with baseline data. The teardown then does the reverse of that
The individual AWS stacks are mostly for infrastructure engineers working on the Terraform code, where local really isn't an option. We then made it available to the Devs too who were asking about being able to test in AWS

The tools we use (in order of priority)
  1. Terraform
  2. Make
  3. Kitchen
  4. Concourse CI (but could use any other CI tool)
Number 4 also allows Devs to create AWS environments without any AWS credentials.
Our approach

I gave up on my original plan to use Docker Compose. Instead, CI/CD for our Postgres oriented app is entirely in the cloud and looks like this:
  1. Upon each raised PR, a test environment is built using Terraform inside GitHubActions. 
  2. GitHubActions checks out the code, runs the Alembic DB scripts to prepare the Postgres database, and then runs the integration tests against this DB. 
  3. Postgres is torn down after the tests. This is partly to save money (the DB is unused except during PRs) and partly to ensure nobody messes up the Alembic scripts. That is, we ensure that the DB schema can always be built from scratch.
Our Athena oriented app is slightly different. We opted for a shared dev environment as hubt has (I'd have liked separate dev environments but, with 4 developers, the juice did not seem worth the squeeze). However, we also employed dan_hill2802's idea of "seeding the database with baseline data".

Our approach for this app did not involve creating an Athena instance (as I had inherited somebody else's ecosystem). Rather, we:
  1. Created some Python code to create synthetic test data. This also uses pyarrow.parquet to read the schema of a file used in manual tests to pad the synthetic data with columns which are not used as part of the test. Athena will complain if you don't do this.
  2. Upload the synthetic data to an S3 bucket that Athena can see by registering it with AWS Glue (think Hive metastore). You only have to register the files once and that was done manually since this Athena instance was built manually. Automating this was left as tech debt.
  3. Now, the integration tests can run SQL queries through Athena with boto3.client(service_name="athena", ...), as sketched below. This data will be the same on every run of this test suite.
These three steps happen upon each PR push. Failure prevents any promotion. The environment is shared so somebody running this test on their laptop could in theory interfere with a PR but we're too small a team to worry unduly about this.
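The Athena call in step 3 looks roughly like this (the database, table and results-bucket names are made up; the boto3 calls themselves are standard Athena API):

import time
import boto3

athena = boto3.client(service_name="athena", region_name="eu-west-2")

execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM synthetic_admissions",            # hypothetical table
    QueryExecutionContext={"Database": "synthetic_test_db"},            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# poll until Athena has finished running the query
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]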

Automation, automation, automation

I almost never log in to the AWS website as this is the antithesis of automation. Using the AWS CLI makes scripting the work so much easier.

If you're going to have scripts that automate everything, they need to create a new environment on a regular basis. If they don't, people will start putting things into the scripts that assume an environment (or a part of it) is already there. Worse, they make changes to the environment by hand (that's why we added the requirement to build the DB from scratch - see above). Think of this as a regression test for your build scripts.

10 second builds? Ha!

Deploying the environment is slow and can take well over 10 minutes. 

  • It takes approximately 4 minutes to create a Postgres DB in AWS and another 4 minutes to tear it down when done. 
  • It also takes a few minutes to create a Python environment (that's fresh each time we run the tests) since we need to download and install all the dependencies. 
  • Finally, it appears that GitHubActions runs in the US while our DBs are in the EU and this cross-Atlantic traffic is slow (think loading hundreds of megs of baseline data into a DB).
Conclusion

Although our approach is not perfect, the perfect is the enemy of the good. Our approach is yielding results and there is an appetite to make it even better, but later. 

Baby steps.

Monday, October 10, 2022

AWS DevOps cheat sheet

Miscellaneous notes that I keep needing to refer to. 

Read a file in S3 from the command line

If you want to, say, count the number of lines in a file, run:

aws s3 cp s3://BUCKET/PATH - | wc -l

Note the critical hyphen.

Avoid Pagination

The --no-cli-pager switch is the daddy here. If I want to list all my RDS databases in a script, I don't want them to be paginated nor truncated, so run:

aws rds describe-db-instances --no-cli-pager

Logging

Run something like this to get the logs:

aws logs tail YOUR_LOG_GROUP_NAME --since 5d

where YOUR_LOG_GROUP_NAME is in your container_definitions/options/awslogs-group of resource "aws_ecs_task_definition" "backend_task" if you're using Terraform.

The 5d means the last 5 days, but it could be, say, 1h (the last hour), or use --follow if you want to tail it.

Caveat

Make sure you configure the health check of your services correctly. One of our services was returning an HTTP 404 for the page the health checker was trying to hit. Everything was fine other than that one page being missing, but AWS saw the 404 and kept deciding to kill the service. Oops.

Terraform

This isn't specifically related to AWS but since I use the two together, I'll post here.

Create new workspaces to isolate yourself from breaking other people's work with:

terraform workspace new staging

Then, if you really foul something up, you can delete the staging environment with:

terraform apply -destroy

Bad boys and girls might change the environment by hand. In this case, you have to tell Terraform to import it with something like:

terraform import -var-file="vars/production.tfvars"  -var="x=y" ... aws_db_subnet_group.my_db my-db

You can see the dependency graph of your infrastructure with:

terraform graph | dot -Tsvg > ~/Pictures/graph.svg

Docker

This has helped me a few times when trying to work out locally why my Docker container does not work when deployed in ECS. Say your container dies immediately. You can debug (on Ubuntu) the Docker daemon itself with:

journalctl -f -u docker.service

Then you can see the IDs of short-lived containers and get their logs (see the Docker docs).

Wednesday, October 5, 2022

SOTA MLOps Tools

Another post on the proliferation of tools. I'm writing this to just keep up-to-date on what the cool kids are talking about.

Feature Store

The most popular offering is Feast (FOSS and mostly Python) but it's batch only. There is also Tecton (closed source, managed platform) that claims to do streaming.

Feature stores basically store features that are hard or expensive to compute. For example, in Edge ML (where smart devices process data locally), some features may be derived from an aggregation of data the device does not otherwise have access to.

"Feature store is an increasingly loaded term that can be used by different people to refer to very different things... There are three main problems that a feature store can help address
  • Feature management: "features used for one model can be used in another"
  • Feature computation: "feature engineering logic, after being defined, needs to be computed... If the computation of this feature isn't too expensive, it might be acceptable computing this feature each time [or] you might want to execute it only once the first time it is required, then store it"
  • Feature consistency: "feature definitions written in Python during development might need to be converted into the language used in production... Modern feature stores... unify the logic for both batch features and streaming."
"Some features stores only manage feature definitions without computing features from data; some feature stores do both.

"Out of 95 companies I surveyed in January 2022, only around 40% of them use a features store. Out of those who use a feature store, half of them build their own" [1]

However, note that some say Feature Stores are not worth the complexity [TowardsDataScience].

Model Serving

MLFlow is an open source project from Databricks. By putting its dependencies in your code, you can serve models via HTTP. However, you must still create a Docker container for its server.

Metaflow was open sourced by Netflix. It's a framework that creates a DAG of the pipeline a data scientist may use.

Kubeflow is an open source offering from Google.

Seldon Core helps to deploy pickled SKLearn models. Because the models are pickled, any pre-processing of the input must be done before calling it. For example, you'd call the served model with something like:

curl -s -d '{"data": {"ndarray":[[1.0, 2.0, 5.0, 6.0]]}}' \
     -X POST http://localhost:8003/seldon/seldon/sklearn/api/v1.0/predictions \
     -H "Content-Type: application/json"

(from here).

Development Tools

Apache TVM is "an end to end Machine Learning Compiler Framework for CPUs, GPUs and accelerators".

DVC is a Git based tool for versioning data.

[1] Designing Machine Learning Systems, Chip Huyen

Thursday, September 22, 2022

Rage against the Markup

Markup is pervasive in DevOps. But markup is also:

  • hard to refactor
  • limited in its control flow
  • not type-safe
  • hard to test

Refactoring

I see in our codebase something like this:

      - name: Set env to Develop
        if: endsWith(github.ref, '/develop')
        run: |
          echo "ENVIRONMENT=develop" >> $GITHUB_ENV
      - name: Set env to Staging
        if: endsWith(github.ref, '/main')
        run: |
          echo "ENVIRONMENT=staging" >> $GITHUB_ENV
      - name: Set env to Productions
        if: endsWith(github.ref, '/production')
        run: |
          echo "ENVIRONMENT=production" >> $GITHUB_ENV

Not only is this ugly, it's copy-and-pasted everywhere. I can't refactor it and what's more, there is a...

(Lack) of Control

Imagine I want to create an AWS Postgres instance with Terraform, then provision the DB with a schema using the Python library Alembic, all via GitHub Actions. GHA can call the Terraform file and create a DB, but how do I get the URL of that Postgres instance so I can give it to Alembic for it to connect and create the tables? Weaving different technologies together in a Turing-complete language is easy; with markup, less so. I had to hack some calls to the AWS CLI and parse the JSON it returned, all in bash.
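For what it's worth, that glue step is less painful in Python than in bash; a hedged sketch using boto3 (the instance identifier, user and database names are made up):

import os
import subprocess
import boto3

rds = boto3.client("rds", region_name="eu-west-2")
instance = rds.describe_db_instances(DBInstanceIdentifier="my-staging-db")["DBInstances"][0]
endpoint = instance["Endpoint"]["Address"]
port = instance["Endpoint"]["Port"]

# hand the URL to Alembic via the environment, then run the migrations
os.environ["DATABASE_URL"] = f"postgresql://app_user:{os.environ['DB_PASSWORD']}@{endpoint}:{port}/app_db"
subprocess.run(["alembic", "upgrade", "head"], check=True)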

Type-safety issue #1

An example of a lack of type safety can be found in any Terraform script. We had something like this:

resource "aws_ecs_task_definition" "compute_task" {
  family                   = var.task_name
  container_definitions    = <<DEFINITION
  [
    {
      "name": "${var.task_name}",
      "image": "${aws_ecr_repository.docker_container_registry.repository_url}:${var.tag}",
...

Notice the CloudControl JSON being injected into Terraform. Now, trying to add a reference to the aws_iam_role here (as is suggested on numerous websites - see a previous post here) is silently ignored. This wouldn't happen using, say, an SDK in a type-safe language, as you can obviously only access the methods it offers you.

Type-safety issue #2

The secrets in GitHub actions can only be uppercase alpha-numeric and underscore, apparently. AWS identifiers can include alpha-numerics and a dash. Mess these up and you spend time fixing tedious bugs. Tiny types would help with this.

Type-safety issue #3

Another example: for creating a DB, we used the password M3dit8at!0n$ - seems OK, right? The DB built fine, the GitHub Actions script then also created the schema fine but we could not login to the DB. Cue hours of frantic checking that there were no network or permission issues. The problem? That bloody password includes characters that need to be escaped on the Linux CLI and that's how Terraform and Alembic were invoked! They were at least consistent - that is, the infrastructure was Terraformed and Alembic built the schema, but for the rest of us, the password didn't work.

In Java, a String is just a String and its content isn't going to break the program's flow. Not so in markup land.

Testing times

Which leads to testing. I made my changes in my GitHub Actions file to use the password in the secret ${{ secrets.DB_PASSWORD-staging }} and ran my mock GHA job with act like so:

act -j populate -s DB_PASSWORD-staging=...

and the whole thing worked wonderfully. Only when I tried to create the secret in-situ was I told that DB_PASSWORD-staging was an invalid name.

And how do you test this abomination?

Spot the errors
This was the result of some hasty copy-and-paste.

Solutions...?

What you can do in Scala 3 with metaprogramming is truly amazing. For example, proto-quill creates ASTs that are passed around and are checked against the DB at compile time! I think this might be a bit overkill and a more practical approach is IP4S that compile-time checks your URLs, ports etc. I have a couple of much-neglected projects to at least give the gist of a solution (here's one) that I'll expand on soon.

Thursday, September 15, 2022

Python webapps for Java Devs

Unlike Java, Python doesn't have true (parallel) multithreading built in. In The Quick Python Book, we read: "Python doesn’t use multiple processors well... the standard implementation of Python is not designed to use multiple cores. This is due to a feature called the global interpreter lock, or GIL... if you need concurrency out of the box, Python may not be for you." Unlike in Java, multithreaded code is not a first-class citizen in Python, so things are somewhat complicated.

If you're running a FastAPI application, you're probably using Docker. But if you want to debug it on your laptop, you might need to run it by hand with uvicorn. A similar technique can be employed with Flask [SO].

[Uvicorn is an ASGI compliant web server. ASGI is the Asynchronous Server Gateway Interface, "a standard interface between async-capable Python web servers, frameworks, and applications".]

Uvicorn allows hot deployment of code, much like Tomcat. You can run it with something like:

env $(grep -v "#" .env | xargs) poetry run uvicorn src.api.main:app --reload

where your file src.api.main has a field called app that's a fastapi.FastAPI.
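For completeness, a minimal sketch of what src/api/main.py could look like (the health-check route is just an example):

from fastapi import FastAPI

app = FastAPI()  # uvicorn's src.api.main:app points at this object

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}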

[Poetry is a dependency management tool that's a bit like SBT in that it's interactive. You start it off with poetry shell]

Another web server supporting another standard (WSGI, "a specification that describes how a web server communicates with web applications") is Gunicorn. This is a pre-fork server, that is to say, it is a layer above the POSIX fork system call.

Simple Python webapps often use Alembic for DB creation and migration. To get yourself going, you call

alembic init db


to create a directory called 'db' holding all the revisions, tell the autogenerated env.py file where to find the classes that represent your tables etc, then run 

alembic revision --autogenerate -m "snapshot"


(where snapshot is essentially a comment) and you'll find a file somewhere under db that contains Python code to create SQL tables. Now, just run it with 

alembic upgrade head

and you'll see the tables in your database if you've correctly populated your .env file with a DATABASE_URL. (DotEnv seems to be the standard Python way of reading properties files, falling back to environment variables if it can't find what it's looking for.) 
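A minimal sketch of that DotEnv pattern, using the python-dotenv package (the fallback URL is made up):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory; already-set environment variables win
database_url = os.getenv("DATABASE_URL", "postgresql://localhost/dev_db")  # hypothetical fallback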

There's more to Alembic than that but you get the idea. It's like Java's Flyway.

Alembic uses the SQLAlchemy library, which is Python's answer to Java's Hibernate. I was pleasantly surprised to see that I had to call session.flush() to populate my POJO-equivalents with keys generated by the DB. Takes me back to the mid-noughties :)
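A small sketch of that flush-to-get-keys behaviour (the Patient table is hypothetical; SQLAlchemy 1.4+ style):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Patient(Base):
    __tablename__ = "patient"
    id = Column(Integer, primary_key=True)  # DB-generated key
    name = Column(String, nullable=False)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    p = Patient(name="Alice")
    session.add(p)
    print(p.id)      # None: nothing has hit the DB yet
    session.flush()  # the INSERT runs and the generated key comes back
    print(p.id)      # 1
    session.commit()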

For streaming, Python has the smart-open library. Think of this as the Pythonic equivalent of FS2.

Finally, there's a Python equivalent of Java's Lombok that auto-generates boilerplate such as constructors and equality methods. It's the dataclasses library.

Sunday, September 4, 2022

Kafka/FS2 trouble shooting

I had FS2 and Kafka working quite nicely together in example code ... until I started using explicit transactions. Then my code was pausing and finally failing with:

org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000ms while awaiting EndTxn(false)

Cranking up the logging on the client side to TRACE revealed a tight loop that output something like this (simplified):

2022-09-02 16:50:41,898 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Request TxnOffsetCommitRequestData(...) dequeued for sending
2022-09-02 16:50:41,998 | DEBUG | o.a.k.c.producer.internals.Sender - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending transactional request TxnOffsetCommitRequestData(...) to node 127.0.0.1:9092 (id: 1001 rack: null) with correlation ID 527
2022-09-02 16:50:41,998 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending TXN_OFFSET_COMMIT request with header RequestHeader(apiKey=TXN_OFFSET_COMMIT, apiVersion=3, clientId=CLIENT_ID, correlationId=527) and timeout 30000 to node 1001: TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TXN_OFFSET_COMMIT response from node 1001 for request with header RequestHeader(...)
2022-09-02 16:50:41,999 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received transactional response TxnOffsetCommitResponseData(...) for request TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TxnOffsetCommit response for consumer group MY_GROUP: {test_topic-1=UNKNOWN_TOPIC_OR_PARTITION}


The topic itself certainly did exist:

$ kafka-topics.sh  --bootstrap-server localhost:9092  --topic test_topic --describe
Topic: test_topic PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: test_topic Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001

The 'solution' was to use more partitions than replicas. "I made the replication factor less than the number of partitions and it worked for me. It sounds odd to me but yes, it started working after it." [SO]

But why? Putting breakpoints in the Kafka code for all references to UNKNOWN_TOPIC_OR_PARTITION and running my client code again led me to KafkaApis.handleTxnOffsetCommitRequest (which seems reasonable since the FS2-Kafka client is trying to handle the offsets manually). There I could see that my partition in org.apache.kafka.common.TopicPartition was 1 when the Kafka server was expecting 0. Oops. I had guessed this number to make the client compile and forgot to go back to fix it. 

So, the topic did exist but the partition did not. Creating more partitions just means that there is something there to commit, which addresses the symptom rather than the cause. 

The takeaway point is that committing the transactions by hand requires knowledge of the structure of the topic.

Tuesday, August 16, 2022

More AWS/GitHub notes

If you don't fancy hand-crafting YAML to set up your AWS infrastructure, Amazon offers the Cloud Development Kit (CDK).

They've taken an interesting architectural decision. They've written the logic in JavaScript executed via node. But to make it a polyglot kit, they've added what are essentially bindings that run the node executable. I guess this makes the logic less likely to diverge between languages but it also means more hassle setting up your environment.

What it means is that the Java (or the supported language of your choice) does not talk to the AWS cloud directly. It generates files that you must then feed to cdk. This is different to Fabric8's Kubernetes client or  the docker-java library, both of which allow you to control the containerization in the JVM with no further tooling required.

[Aside: Terraform have a similar toolkit to CDK here but I gave up due to the lack of documentation].

CDK set up

AWS's Java CDK binding needs node in the PATH environment variable. Note that IntelliJ doesn't seem to put it into the PATH by default and I was getting the unhelpful "SyntaxError: Unexpected token ?" - a very unintuitive message. It appears that the Java implementation executes this node command using ProcessBuilder (see JsiiRuntime.startRuntimeIfNeeded).

You install the CDK command-line tooling (a node package) with 

npm install -g aws-cdk

You can initialize a Java project with:

cdk init app --language java

You then write your Java code to describe your environment using the libraries pulled in by the generated pom.xml. Knowing exactly what to do can be hard so you might want to look at some examples on GitHub here.

When you're done, run your Java code. Then you can call cdk synth to generate the metadata and cdk deploy to deploy it and as if by magic, your S3 bucket etc is deployed in AWS! Note you must run this in the top-level directory of your Java project. Apparently, state is saved and shared between the Java build and the call to cdk.

All an act

You can run GitHub actions locally with act. This really helps with debugging. 

If your .github YAML looks like this:

jobs:
  code:

Then run act with something like this:

act -n --env-file .github/pr.yml -j code -s AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -s AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

where -j specifies which job (YAML block) you're running.

Remove the -n if you want it to be a real run rather than a dummy run.

If you want to use secrets, you can just set them as environment variables. For instance:

        env:
          AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY_ID }}
        run: /root/.local/bin/poetry run python -m pytest integration_tests

will use your system's environment variable AWS_ACCESS_KEY_ID as the value of AWS_ACCESS_KEY that the integration tests use.

Tuesday, July 19, 2022

Docker, ECS and access

AWS now offers the ability to remotely login to Docker containers running in ECS. The secret sauce in the Terraform script was to point execution_role_arn under the 

resource "aws_ecs_task_definition" "compute_task"

to the ARN of an aws_iam_role that has the right policy. A really good guide is here. However, I still had a few issues.

First, you need to install session-manager-plugin. I followed all the instructions to install it on Ubuntu here and it seemed to install without error. But, when I ran:

$ aws ecs execute-command --cluster CLUSTER_NAME --task TASK_ARN --container CONTAINER --interactive --command "/bin/bash"

SessionManagerPlugin is not found. Please refer to SessionManager Documentation here: http://docs.aws.amazon.com/console/systems-manager/session-manager-plugin-not-found

Which was odd as it evidently was installed:

henryp@adele:~$ session-manager-plugin 

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.

Interestingly, the excellent IntelliJ AWS plugin could connect, but I noticed that it used its own session-manager-plugin even though it claimed to be exactly the same version.

So, I created a Docker image to run on my local machine that has the session-manager-plugin installed. Amazon does not appear to offer this so I needed to build my own. I had a file called AwsDocker/Dockerfile that had this:

FROM amazon/aws-cli
RUN curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm" -o "session-manager-plugin.rpm" && \
    yum install -y ./session-manager-plugin.rpm

I built it with:

docker build --no-cache  -t my_aws AwsDocker/

and ran it with:

docker run -v $HOME/.aws/credentials:/root/.aws/credentials:ro -t -i my_aws --debug --region eu-west-2 ecs execute-command --cluster CLUSTER_NAME --task TASK_ARN  --interactive --command "/bin/bash"

And lo! I managed to log in. (You can lose the --debug if you want, as it's verbose, but it does help sometimes.)

Note that there are many ways [SO] to add the credentials to the local Docker image and the one I chose from the SO answer is a bit broken. The line above fixes it.

Friday, June 24, 2022

Free Monads

Remember from a previous post that the Free monad can represent a tree where each node has either a branch or a label (implying that only leaves have labels).

When we "convert method calls into references to an ADT [we get] the Free monad. When an ADT mirrors the arguments of related functions, it is called a Church encoding. Free is named because it can be generated for free for any S[_] . For example, we could set S to be the [algebras] and generate a data structure representation of our program... In FP, an algebra takes the place of an interface in Java... This is the layer where we define all side-effecting interactions of our system." [1]

My friend, Aaron Pritzlaff, has written a small project that (apart from demonstrating interoperability) does not use any Cats or ZIO magic. 

Here, we build case classes that are all instances of Operation[A] (where the A represents the type of the operation's result). This can conveniently be lifted into a Free[F, A] (where F is now our Operation) with .liftM.

As it happens, this creates a type of Free[ called a Suspend[ that's essentially a placeholder for the A since, when we flatMap on it, the function we use is A => Free[F, A]. And we can keep doing this until we have built up a tree of Free[Operation, A]s, as we promised at the beginning of this post. A shallow path represents a for comprehension that terminates early. 

The important thing to remember is that we have built a data structure that is yet to be executed.

We would execute the tree with a call to ".foldMap [which] has a marketing buzzword name: MapReduce." [1] In Aaron's implementation, we just recursively call .foldMap on the Free tree we constructed. Either we'll terminate with a pure value or a Suspended node that will be mapped to a natural transformation (using the ~> function) that maps not A to B, but F[_] to G[_]. This is our interpreter and, for us, G must be a monad so we can call flatMap as we pop the stack.

As we work our way back through the stack of monads calling flatMap as we go, we're invoking those functions, A => Free[F, A], that we spoke about earlier. But note that they're acting on the Gs that our interpreter instantiated.

Although it's good to understand, the Free monad might not be your best choice according to the Cats people:

Rob Norris @tpolecat Sep 24 17:49 2020

Free is out of style but it's still great for some applications. Tagless style is much more common now for writing DSLs and APIs in general. Free is good if you really need to limit what the user can do (the algebra is exactly your data type's (i.e., F's) constructors plus Monad. No more, no less. Sometimes you want this.

The end result is easier to use than tagless because you're just dealing with data. You don't have to thread an effect parameter (typically) everywhere. So your business logic can be very concrete, which is easier for beginners I think.

Fabio Labella @SystemFw Sep 24 19:35 2020

the idea behind Free is great for implementing advanced monads like IO or Stream (which are implemented with a similar strategy even though they don't use Free literally)

Daniel Spiewak @djspiewak Sep 25 15:49 2020

I use Free for prototyping sometimes. It can be a useful tool to see how your effect algebra teases apart at a granular level without actually committing to an implementation, but it really only works for algebraic effects, and you very quickly bump up against some annoying type metaprogramming problems if you want to push it.

I think we've pretty much all learned that there are better ways of encoding effect composition than Free, and simultaneously the mocking it enables, while powerful, isn't that useful in practice... It's still a cool toy though.

[1] Functional Programming for Mortals with Cats


Monday, June 6, 2022

Packaging Python

Java programmers don't know the meaning of classpath hell until they've played with Python. Here are some notes I took while ploughing through the excellent Practical MLOps (Gift & Deza). Following their instructions, I was attempting to get an ML model served using Flask in a Docker container. Spoiler: it didn't work out of the box.

Since the correct OnnxRuntime wheel for my Python runtime did not exist, I had to build onnxruntime with --build-wheel while making the artifact.

This is where I encountered my first dependency horror:

CMake 3.18 or higher is required.  You are running version 3.10.2

when running onnxruntime/build.sh. (You can put a new version first in your PATH and avoid having to install it at the OS level).

This finally yielded onnxruntime-1.12.0-cp36-cp36m-linux_x86_64.whl, which could be installed into my environment with pip install WHEEL_FILE... except that the cp number must correspond to your Python version (3.6 in this case).

Moving virtual environments between machines is hard. You'd be best advised to use pip freeze to capture the environment. But ignoring this advice yields an interesting insight into the Python dependency system:

The first problem is that if you've created the environment with python -m venv then the scripts have your directory structure baked into them, as a simple grep will demonstrate. Copying the entire directory structure up to the virtual environment solved that.

But running the code gave me "No module named ..." errors. Looking at the sys.path didn't show my site-packages [SO] despite me having run activate. Odd. OK, so I defined PYTHONPATH and then I could see my site-packages in sys.path.

Then, you want to use exactly the same Python version. No apt-get Python for us! We have to manually install it [SO]. When doing this on a Docker container, I had to:

RUN apt-get update
RUN apt-get install -y wget
RUN apt-get install -y gcc
RUN apt-get install -y make
RUN apt-get install -y zlib1g-dev

Note that this [SO] helped me to create a Docker container that just pauses the moment it starts. This allows you to login and inspect it without it instantly dying on a misconfiguration.

The next problem: there are many compiled binaries in your virtual environment.

# find $PYTHONPATH/ -name \*.so | wc -l
185

Copying these between architectures is theoretically possible but "as complexity of the code increases [so does] the likelihood of being linked against a library that is not installed" [SO]

Indeed, when I ran my Python code, I got a Segmentation Fault which can happen if "there's something wrong with your Python installation." [SO]

Python builds

A quick addendum on how Python builds projects: the standard way is no longer standard: "[A]s of the last few years all direct invocations of setup.py are effectively deprecated in favor of invocations via purpose-built and/or standards-based CLI tools like pip, build and tox" [Paul Gannsle's blog]

Wednesday, May 25, 2022

The CLI for busy Data Scientists and Engineers

I've been asked to give a talk on the command line interface for a mixed audience of Data Scientists and Engineers. Since the world is becoming ever more container based, I'm focussing on Linux.

Containers

Sometimes you need to diagnose things within the container. Jump into the container with:

docker exec -it CONTAINER_ID /bin/bash

To get the basic diagnostic tools mentioned below, you'll generally need to execute:

apt-get update               # you need to run this first
apt-get install net-tools    # gives you netstat
apt-get install iputils-ping # gives you ping
apt-get install procps       # gives you ps
apt-get install lsof        

Note these installations will all be gone next time you fire up the image as the underlying image does not change.

You can find out which module your favourite command belongs to by running something like this:

$ dpkg -S /bin/netstat
net-tools: /bin/netstat

Formatting

You can use regular expressions in grep with the -P (Perl-compatible) switch. For example, let's search for lines that are strictly composed of two 5-letter words separated by a space:

$ echo hello world  | grep -P "\w{5}\s\w{5}$"
hello world
$ echo hello wordle | grep -P "\w{5}\s\w{5}$"
$

You can extract elements from a string with an arbitrary delimiter with awk. For example, this takes the first and sixth elements from a line of CSV:

$ echo this,is,a,line,of,csv | awk -F',' '{print $1 " " $6}'
this csv
$

To prettify output, use column like this:

$ echo "hello, world. I love you!
goodbye cruelest world! So sad" | column -t
hello,   world.    I       love  you!
goodbye  cruelest  world!  So    sad

$

To print to standard out as well as to a file, use tee. For example:

$ echo And now for something completely different | tee /tmp/monty_python.txt
And now for something completely different
$ cat /tmp/monty_python.txt 
And now for something completely different

To capture everything typed and output to your terminal (very useful in training), use script:

$ script -f /tmp/my_keystrokes.log
Script started, file is /tmp/my_keystrokes.log
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
Script started on 2022-05-13 16:10:37+0100
$ echo hello world
hello world
$ cat /tmp/my_keystrokes.log 
$

Beware its recursive nature! Anyway, you stop it with an exit.

You can poll an output with watch. For example, this will keep an eye on the Netty threads in a Java application (IntelliJ as it happens):

watch "jstack `jps | grep Main | awk '{print \$1}'` | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note that the $1 has been escaped and the quote mark within the quote has been triple escaped. The switch -A10 just shows the 10 lines After what we pattern matched. Backticks execute a command within a command. Of course, we can avoid this escaping with:

$ watch "jstack $(jps | grep Main | awk '{print $1}') | grep -A10 ^\\\"Netty\ Builtin\ Server"

Note that $(...).

Resources

The command htop gives lots of information on a running system. Pressing P or M orders the processes by processor or memory usage respectively. VIRT and RES are your virtual memory (how much your application has asked for) and resident memory (how much it's actually using); the latter is normally the more important. The load average tells you how much work is backing up. Anything over the number of processors you have is suboptimal. How many processors do you have?

$ grep -c ^processor /proc/cpuinfo 
16
$

The top command also lists zombie tasks. I'm told that these are processes that are irretrievably stuck, probably due to some hardware driver issue.

File handles can be seen using lsof. This can be useful to see, for example, where something is logging. For instance, guessing that IntelliJ logs to a file that has log in its name, we can run:

$ lsof -p 12610 2>/dev/null | grep log$
java    12610 henryp   12w      REG              259,3    7393039 41035613 /home/henryp/.cache/JetBrains/IntelliJIdea2021.2/log/idea.log
$

The 2>/dev/null pipes errors (file descriptor 2) to a dark pit where they are ignored.

To see what your firewall is dropping (useful when you've misconfigured a node), run:

$ sudo iptables -L -n -v -x

To see current network connections, run:

$ netstat -nap 

You might want to pipe that to grep LISTEN to see what processes are listening and on which port. Very useful if something already has control of port 8080.

For threads, you can see what's going on by accessing the /proc directory. While threads are easy to see in Java (jstack), Python is a little more opaque, not least because the Global Interpreter Lock (GIL) only really allows one physical thread of execution (even if Python can allow logical threads). To utilise more processors, you must start heavyweight processes (see "multiprocessing" here and the sketch below). Anyway, find the process ID you're interested in and run something like:

$ sudo cat /proc/YOUR_PROCESS_ID/stack
[<0>] do_wait+0x1cb/0x230
[<0>] kernel_wait4+0x89/0x130
[<0>] __do_sys_wait4+0x95/0xa0
[<0>] __x64_sys_wait4+0x1e/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

as everything is a file in Unix, right? This happens to be the stack in Python code that is time.sleeping. You'll see a similar looking stack for a Java thread that happens to be waiting.
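On the multiprocessing point above, a minimal sketch of sidestepping the GIL with heavyweight processes so that CPU-bound work can use several cores:

from multiprocessing import Pool

def cpu_bound(n: int) -> int:
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four OS processes, not GIL-bound threads
        print(pool.map(cpu_bound, [10_000_000] * 4))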

If you want to pin work to certain cores, use something like taskset. For example, if I wanted to run COMMAND on all but one of my 16 cores, I run:

taskset 0xFFFE COMMAND

This is very useful if some data munging is so intense it's bringing my system down. Using this, at least one core is left for the OS.

Finally, vmstat gives you lots of information about the health of the box, such as blocks being read/written from/to disk (bi/bo), the number of processes runnable (not necessarily running) and the number blocked (r/b), and the number of context switches per second (cs).