Tuesday, October 11, 2022

CI/CD in the Cloud

I asked on Discord:

Are there any tools to weave all the components together so both developers and the CI/CD tool can easily run the same suite of integration tests? I've done a PoC that takes one application in the pipeline and runs it locally while it talks to an AWS environment I've provisioned with Terraform. I intend to use docker-compose to bring up all the apps together for integration testing. The going is slow. How do other people tackle the same problem?

and user hubt gave me some great insights. Here is an abridged version of his/her answers:

Some people move to Kubernetes instead. There are tons of CI/CD tools built on it.

I have used shared dev environments for a long time. Not having individual dev environments has its drawbacks but it also has strengths.

One advantage is everyone sees the same data and database. At some point you start to run into problems where your bugs are based on specific data. With individual dev environments, these bugs are hard to reproduce for everyone unless there is a way of sharing both code and data. If you have a super solid, consistent data set that you have curated and built for every test then you are miles ahead of the game. But maintaining and updating that data set and test suite is very hard to do.

We have a promotion process from shared dev to staging to prod. We run continuous integration tests in dev and staging. People break those on a regular basis.

We don't have developers run them. We just [run] them automatically hourly. Our process works reasonably well. The hourly integration test cadence is driven more by how long the integration test suite takes rather than being strictly hourly.

[If somebody deletes a file from the dev environment thinking it wasn't needed] they would have broken dev and we would debug dev or staging to fix things.

Admittedly this probably doesn't scale for 100+ developers. I deal with a team of a few dozen developers so it isn't quite the same as supporting a team of 4 or a team of 100+.

We also have separate unit tests for some things. Those are targeted and should be 100% successful. Integration tests can fail for a variety of different reasons unrelated to someone's code. Unit tests should not. So, yes, the integration tests are more general.

[Regarding creating the dev environment on every test suite run] I think that only makes sense in very tightly controlled and constrained environments like if you have an application that has no database.

It's not like we are able to recreate our production environment regularly, so in some ways you want to handle things like you do production. Recreating environments per test run would make sense if you are shipping packaged software. But as a cloud product/web service it makes less sense.

Discord user dan_hill2802 also gave his insights here and here:

We have developers able to easily build AWS environments for development and testing and tear down again when done. The pipelines use the same mechanism to build independent environments in AWS, run tests, then tear down

We aim to have applications as cloud agnostic as possible, so a lot of application development is done with docker containers locally. We also use localstack.cloud [open source community edition] for mocking some of the cloud endpoints. However, it's still not quite representative enough so we enable developers to deploy the application and all supporting infrastructure in AWS, they can also attach to their environment for debugging. The "deploy" is a wrap up of a few steps, including Terraform apply, setting up database users for the app, seeding the database with baseline data. The teardown then does the reverse of that
The individual AWS stacks are mostly for infrastructure engineers working on the Terraform code, where local really isn't an option. We then made it available to the Devs too who were asking about being able to test in AWS

The tools we use (in order of priority)
  1. Terraform
  2. Make
  3. Kitchen
  4. Concourse CI (but could use any other CI tool)
Number 4 also allows Devs to create AWS environments without any AWS credentials.
Our approach

I gave up on my original plan to use Docker Compose. Instead, CI/CD for our Postgres-oriented app is entirely in the cloud and looks like this:
  1. Upon each raised PR, a test environment is built using Terraform inside GitHub Actions. 
  2. GitHub Actions checks out the code, runs the Alembic DB scripts to prepare the Postgres database (sketched below), and then runs the integration tests against this DB. 
  3. Postgres is torn down after the tests. This is partly to save money (the DB is unused except during PRs) and partly to ensure nobody messes up the Alembic scripts. That is, we ensure that the DB schema can always be built from scratch.
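
For what it's worth, here is a minimal sketch of the migration half of step 2. It assumes an alembic.ini at the repo root and a DATABASE_URL environment variable pointing at the freshly Terraformed database - both names are our conventions, not anything standard.

import os

from alembic import command
from alembic.config import Config

# Equivalent to running `alembic upgrade head` against the throwaway
# per-PR database before the integration tests start.
cfg = Config("alembic.ini")
cfg.set_main_option("sqlalchemy.url", os.environ["DATABASE_URL"])
command.upgrade(cfg, "head")
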
Our Athena-oriented app is slightly different. We opted for a shared dev environment like hubt's (I'd have liked separate dev environments but with 4 developers, the juice did not seem worth the squeeze). However, we also employed dan_hill2802's idea of "seeding the database with baseline data".  

Our approach for this app did not involve creating an Athena instance (as I had inherited somebody else's ecosystem). Rather, we:
  1. Wrote some Python code to generate synthetic test data. This also uses pyarrow.parquet to read the schema of a file used in manual tests so we can pad the synthetic data with columns that are not used as part of the test. Athena will complain if you don't do this.
  2. Uploaded the synthetic data to an S3 bucket that Athena can see by registering it with AWS Glue (think Hive metastore). You only have to register the files once and that was done manually since this Athena instance was built manually. Automating this was left as tech debt.
  3. Now, the integration tests can run SQL queries through Athena with boto3.client(service_name="athena", ...) - see the sketch below. This data will be the same on every run of this test suite.
These three steps happen upon each PR push. Failure prevents any promotion. The environment is shared so somebody running this test on their laptop could in theory interfere with a PR but we're too small a team to worry unduly about this.
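
To give a flavour of step 3, this is roughly how the tests fire a query at Athena and wait for the result. The region, bucket, database and table names here are made up for illustration.

import time

import boto3

athena = boto3.client(service_name="athena", region_name="eu-west-1")

def run_query(sql: str) -> list:
    # Kick off the query; Athena writes its results to an S3 location.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "integration_test_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until Athena has finished with the query.
    state = "RUNNING"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]

    assert state == "SUCCEEDED", f"query finished in state {state}"
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

rows = run_query("SELECT count(*) FROM synthetic_events")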

Automation, automation, automation

I almost never log in to the AWS console, as doing things by hand there is the antithesis of automation. Using the AWS CLI makes scripting the work so much easier.

If you're going to have scripts that automate everything, they need to create a new environment on a regular basis. If they don't, people will start putting things into the scripts that assume an environment (or a part of it) is already there. Worse, they'll start making changes to the environment by hand (that's why we added the requirement to build the DB from scratch - see above). Think of regularly recreating the environment as a regression test for your build scripts.

10 second builds? Ha!

Deploying the environment is slow and can take well over 10 minutes. 

  • It takes approximately 4 minutes to create a Postgres DB in AWS and another 4 minutes to tear it down when done. 
  • It also takes a few minutes to create a Python environment (which is fresh each time we run the tests) since we need to download and install all the dependencies. 
  • Finally, it appears that GitHub Actions runs in the US while our DBs are in the EU and this cross-Atlantic traffic is slow (think loading hundreds of megs of baseline data into a DB).
Conclusion

Although our approach is not perfect, the perfect is the enemy of the good. Our approach is yielding results and there is an appetite to make it even better, but later. 

Baby steps.

Monday, October 10, 2022

AWS DevOps cheat sheet

Miscellaneous notes that I keep needing to refer to. 

Read a file in S3 from the command line

If you want to, say, look at the number of lines in a file, run:

aws s3 cp s3://BUCKET/PATH - | wc -l

Note the critical hyphen.

Avoid Pagination

The --no-cli-pager switch is the daddy here. If I want to list all my RDS databases in a script, I don't want them to be paginated or truncated, so run:

aws rds describe-db-instances --no-cli-pager

Logging

Run something like this to get the logs:

aws logs tail YOUR_LOG_GROUP_NAME --since 5d

where YOUR_LOG_GROUP_NAME is in your container_definitions/options/awslogs-group of resource "aws_ecs_task_definition" "backend_task" if you're using Terraform.

The 5d means the last 5 days, but it could be, say, 1h (the last hour); add --follow if you want to tail it.

Caveat

Make sure you configure the health checks of your services correctly. One of our services was returning an HTTP 404 for the page the health checker was trying to hit. Everything was fine apart from that one missing page, but AWS saw the 404 and kept deciding to kill the service. Oops.

Terraform

This isn't specifically related to AWS but since I use the two together, I'll note it here.

Create new workspaces to isolate yourself from breaking other people's work with:

terraform workspace new staging

Then, if you really foul something up, you can delete the staging environment with:

terraform apply -destroy

Bad boys and girls might change the environment by hand. In this case, you have to tell Terraform to import it with something like:

terraform import -var-file="vars/production.tfvars"  -var="x=y" ... aws_db_subnet_group.my_db my-db

You can see the dependency graph of your infrastructure with:

terraform graph | dot -Tsvg > ~/Pictures/graph.svg

Docker

This has helped me a few times when trying to work out locally why my Docker container does not work when deployed in ECS. Say your container dies immediately. You can debug (on Ubuntu) the Docker daemon itself with:

journalctl -f -u docker.service

Then you can see the IDs of short-lived containers and get their logs (see the Docker docs).

Wednesday, October 5, 2022

SOTA MLOps Tools

Another post on the proliferation of tools. I'm writing this just to keep up to date with what the cool kids are talking about.

Feature Store

The most popular offering is Feast (FOSS and mostly Python) but it's batch only. There is also Tecton (closed source, managed platform) that claims to do streaming.

Feature stores basically store features that are hard or expensive to compute. For example, in Edge ML (where smart devices process data locally), some data may be derived from an aggregation of data the device does not otherwise have access to.

"Feature store is an increasingly loaded term that can be used by different people to refer to very different things... There are three main problems that a feature store can help address
  • Feature management: "features used for one model can be used in another"
  • Feature computation: "feature engineering logic, after being defined, needs to be computed... If the computation of this features isn't too expeisnive, it might be acceptable computing this feature each time [or] you might want to execute it only once the first time it is required, then store it"
  • Feature consistency: "feature definitions written in Python during development might need to be converted into the language used in production... Modern feature stores... unify the logic for both batch features and streaming."
"Some features stores only manage feature definitions without computing features from data; some feature stores do both.

"Out of 95 companies I surveyed in January 2022, only around 40% of them use a features store. Out of those who use a feature store, half of them build their own" [1]

However, note that some say Feature Stores are not worth the complexity [TowardsDataScience].

Model Serving

MLflow is an open source project from Databricks. By putting its dependencies in your code, you can serve models via HTTP. However, you must still create a Docker container for its server.

Metaflow was open sourced by Netflix. It's a framework that creates a DAG of the pipeline a data scientist may use.

Kubeflow is an open source offering from Google.

Seldon Core helps to deploy pickled SKLearn models. Because the models are pickled, any pre-processing of the input must be done before calling it. For example, you'd call the served model with something like:

curl -s -d '{"data": {"ndarray":[[1.0, 2.0, 5.0, 6.0]]}}' \
  -X POST http://localhost:8003/seldon/seldon/sklearn/api/v1.0/predictions \
  -H "Content-Type: application/json"

(from here).

Development Tools

Apache TVM is an "end to end Machine Learning Compiler Framework for CPUs, GPUs and accelerators".

DVC is a Git based tool for versioning data.

[1] Designing Machine Learning Systems, Chip Huyen

Thursday, September 22, 2022

Rage against the Markup

Markup is pervasive in DevOps. But markup is also:

  • hard to refactor
  • limited in its control flow
  • not type-safe
  • hard to test

Refactoring

I see in our codebase something like this:

      - name: Set env to Develop
        if: endsWith(github.ref, '/develop')
        run: |
          echo "ENVIRONMENT=develop" >> $GITHUB_ENV
      - name: Set env to Staging
        if: endsWith(github.ref, '/main')
        run: |
          echo "ENVIRONMENT=staging" >> $GITHUB_ENV
      - name: Set env to Productions
        if: endsWith(github.ref, '/production')
        run: |
          echo "ENVIRONMENT=production" >> $GITHUB_ENV

Not only is this ugly, it's copy-and-pasted everywhere. I can't refactor it and what's more, there is a...

(Lack) of Control

Imagine I want to create an AWS Postgres instance with Terraform and then provision the DB with a schema using the Python library Alembic, all via GitHub Actions. GHA can call the Terraform file and create a DB, but how do I get the URL of that Postgres instance so I can give it to Alembic for it to connect and create the tables? Weaving different technologies together in a Turing-complete language is easy; with markup, less so. I had to hack together some calls to the AWS CLI and parse the JSON it returned, all in bash.
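
For comparison, the whole dance is only a few lines in a real language. Here is roughly what that bash hack boils down to, sketched in Python with boto3 (the instance identifier, user and database names are invented):

import os

import boto3

rds = boto3.client("rds")
endpoint = rds.describe_db_instances(DBInstanceIdentifier="my-staging-db")["DBInstances"][0]["Endpoint"]

# Build the URL that Alembic needs to connect and create the tables.
database_url = (
    f"postgresql://app_user:{os.environ['DB_PASSWORD']}@"
    f"{endpoint['Address']}:{endpoint['Port']}/app_db"
)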

Type-safety issue #1

An example of a lack of type safety can be found in any Terraform script. We had something like this:

resource "aws_ecs_task_definition" "compute_task" {
  family                   = var.task_name
  container_definitions    = <<DEFINITION
  [
    {
      "name": "${var.task_name}",
      "image": "${aws_ecr_repository.docker_container_registry.repository_url}:${var.tag}",
...

Notice the container-definition JSON being injected into Terraform. Now, trying to add a reference to the aws_iam_role here (as is suggested on numerous websites - see a previous post here) is silently ignored. This wouldn't happen using, say, an SDK in a type-safe language, as you can obviously only access the methods it offers you.

Type-safety issue #2

The secrets in GitHub Actions can apparently only contain uppercase alphanumeric characters and underscores. AWS identifiers can include alphanumerics and dashes. Mix these up and you spend time fixing tedious bugs. Tiny types would help with this.

Type-safety issue #3

Another example: for creating a DB, we used the password M3dit8at!0n$ - seems OK, right? The DB built fine, and the GitHub Actions script then also created the schema fine, but we could not log in to the DB. Cue hours of frantic checking that there were no network or permission issues. The problem? That bloody password includes characters that need to be escaped on the Linux CLI, and that's how Terraform and Alembic were invoked! They were at least consistent - that is, the infrastructure was Terraformed and Alembic built the schema - but for the rest of us, the password didn't work.
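
One defence (besides picking a blander password) is to quote anything that goes anywhere near a shell. Python's shlex shows the idea; the value here is our problematic password:

import shlex

password = "M3dit8at!0n$"
# shlex.quote wraps the value so a POSIX shell passes it through untouched
# instead of interpreting the $ and ! characters.
print(shlex.quote(password))  # 'M3dit8at!0n$'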

In Java, a String is just a String and its content isn't going to break the program's flow. Not so in markup land.

Testing times

Which leads to testing. I made my changes in my GitHub Actions file to use the password in the secret ${{ secrets.DB_PASSWORD-staging }} and ran my mock GHA (act) like so:

act -j populate -s DB_PASSWORD-staging=...

and the whole thing worked wonderfully. Only when I tried to create the secret in-situ was I told that DB_PASSWORD-staging was an invalid name.

And how do you test this abomination?

[Image: spot the errors - a GitHub Actions file that was the result of some hasty copy-and-paste.]

Solutions...?

What you can do in Scala 3 with metaprogramming is truly amazing. For example, proto-quill creates ASTs that are passed around and checked against the DB at compile time! I think this might be a bit overkill; a more practical approach is IP4S, which checks your URLs, ports etc. at compile time. I have a couple of much-neglected projects that at least give the gist of a solution (here's one) and that I'll expand on soon.

Thursday, September 15, 2022

Python webapps for Java Devs

Unlike Java, Python doesn't have true parallel multithreading built in. In The Quick Python Book, we read: "Python doesn’t use multiple processors well... the standard implementation of Python is not designed to use multiple cores. This is due to a feature called the global interpreter lock, or GIL... if you need concurrency out of the box, Python may not be for you." Multithreaded code is not a first-class citizen in Python the way it is in Java, so things are somewhat complicated. 

If you're running a FastAPI application, you're probably using Docker. But if you want to debug it on your laptop, you might need to run it by hand with uvicorn. A similar technique can be employed with Flask [SO].

[Uvicorn is an ASGI compliant web server. ASGI is the Asynchronous Server Gateway Interface, "a standard interface between async-capable Python web servers, frameworks, and applications".]

Uvicorn allows hot deployment of code, much like Tomcat. You can run it with something like:

env $(grep -v "#" .env | xargs) poetry run uvicorn src.api.main:app --reload

where your file src.api.main has a field called app that's a fastapi.FastAPI.
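
For reference, a minimal src/api/main.py that the command above could serve might look like this (the /health route is just an example, not anything from our real app):

from fastapi import FastAPI

app = FastAPI()  # the "app" field that uvicorn is pointed at

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}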

[Poetry is a dependency management tool that's a bit like SBT in that it's interactive. You start it off with poetry shell]

Another web server, supporting another standard (WSGI, "a specification that describes how a web server communicates with web applications"), is Gunicorn. This is a pre-fork server, that is to say, it sits a layer above the POSIX fork call.

Simple Python webapps often use Alembic for DB creation and migration. To get yourself going, you call

alembic init db


to create a directory called db holding all the revisions. Then tell the autogenerated env.py file where to find the classes that represent your tables etc., and run 

alembic revision --autogenerate -m "snapshot"


(where snapshot is essentially a comment) and you'll find a file somewhere under db that contains Python code to create SQL tables. Now, just run it with 

alembic upgrade head

and you'll see the tables in your database if you've correctly populated your .env file with a DATABASE_URL. (DotEnv seems to be the standard Python way of reading properties files, falling back to environment variables if it can't find what it's looking for.) 
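
For example, with the python-dotenv package, picking up that DATABASE_URL looks roughly like this:

import os

from dotenv import load_dotenv

# Reads key=value pairs from a .env file into the process environment;
# variables already set in the real environment are left alone.
load_dotenv()

database_url = os.environ["DATABASE_URL"]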

There's more to Alembic than that but you get the idea. It's like Java's Flyway.

It uses the SQLAlchemy library that is Python's answer to Java's Hibernate. I was pleasantly surprised to see that I had to call session.flush() to populate my POJO-equivalents with keys generated by the DB. Takes me back to the mid-noughties :)
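
A condensed sketch of that flush-to-get-keys behaviour (the model is invented, and an in-memory SQLite stands in for our Postgres):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")  # stand-in for our Postgres URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    user = User(name="alice")
    session.add(user)
    print(user.id)   # None - nothing has been sent to the DB yet
    session.flush()  # sends the INSERT and populates the generated key
    print(user.id)   # now set by the database
    session.commit()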

For streaming, Python has the smart-open library. Think of this as the Pythonic equivalent of FS2.
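
For instance, streaming an object out of S3 line by line, without downloading it first, looks roughly like this (the bucket and key are made up):

from smart_open import open  # deliberately shadows the builtin open

with open("s3://my-bucket/path/to/events.jsonl") as f:
    for line in f:
        print(line.rstrip())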

Finally, there's a Python equivalent of Java's Lombok that auto-generates boilerplate such as constructors, equality and repr methods. It's the dataclasses library.
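
A tiny example:

from dataclasses import dataclass

@dataclass
class Coordinates:
    x: float
    y: float

point = Coordinates(1.0, 2.0)
print(point)  # Coordinates(x=1.0, y=2.0) - __init__, __repr__ and __eq__ come for free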

Sunday, September 4, 2022

Kafka/FS2 troubleshooting

I had FS2 and Kafka working quite nicely together in example code ... until I started using explicit transactions. Then my code was pausing and finally failing with:

org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000ms while awaiting EndTxn(false)

Cranking up the logging on the client side to TRACE revealed a tight loop that output something like this (simplified):

2022-09-02 16:50:41,898 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Request TxnOffsetCommitRequestData(...) dequeued for sending
2022-09-02 16:50:41,998 | DEBUG | o.a.k.c.producer.internals.Sender - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending transactional request TxnOffsetCommitRequestData(...) to node 127.0.0.1:9092 (id: 1001 rack: null) with correlation ID 527
2022-09-02 16:50:41,998 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending TXN_OFFSET_COMMIT request with header RequestHeader(apiKey=TXN_OFFSET_COMMIT, apiVersion=3, clientId=CLIENT_ID, correlationId=527) and timeout 30000 to node 1001: TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TXN_OFFSET_COMMIT response from node 1001 for request with header RequestHeader(...)
2022-09-02 16:50:41,999 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received transactional response TxnOffsetCommitResponseData(...) for request TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TxnOffsetCommit response for consumer group MY_GROUP: {test_topic-1=UNKNOWN_TOPIC_OR_PARTITION}


The topic itself certainly did exist:

$ kafka-topics.sh  --bootstrap-server localhost:9092  --topic test_topic --describe
Topic: test_topic PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: test_topic Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001

The 'solution' was to use more partitions than replicas. "I made the replication factor less than the number of partitions and it worked for me. It sounds odd to me but yes, it started working after it." [SO]

But why? Putting breakpoints in the Kafka code for all references to UNKNOWN_TOPIC_OR_PARTITION and running my client code again led me to KafkaApis.handleTxnOffsetCommitRequest (which seems reasonable since the FS2-Kafka client is trying to handle the offsets manually). There I could see that my partition in org.apache.kafka.common.TopicPartition was 1 when the Kafka server was expecting 0. Oops. I had guessed this number to make the client compile and forgot to go back to fix it. 

So, the topic did exist but the partition did not. Creating more partitions just means that there is something there to commit, sort of addressing the symptom rather than the cause. 

The takeaway point is that committing the transactions by hand requires knowledge of the structure of the topic.

Tuesday, August 16, 2022

More AWS/GitHub notes

If you don't fancy hand-crafting YAML to set up your AWS infrastructure, Amazon offers the Cloud Development Kit (CDK).

They've taken an interesting architectural decision. They've written the logic in JavaScript executed via node. But to make it a polyglot kit, they've added what are essentially bindings that run the node executable. I guess this makes the logic less likely to diverge between languages but it also means more hassle setting up your environment.

What it means is that the Java (or the supported language of your choice) does not talk to the AWS cloud directly. It generates files that you must then feed to cdk. This is different to Fabric8's Kubernetes client or  the docker-java library, both of which allow you to control the containerization in the JVM with no further tooling required.

[Aside: Terraform has a similar toolkit to the CDK here but I gave up due to the lack of documentation.]

CDK set up

AWS's Java CDK binding needs node in the PATH environment variable. Note that IntelliJ doesn't seem to put it into the PATH by default and I was getting the very unintuitive "SyntaxError: Unexpected token ?". It appears that the Java implementation executes the node command using ProcessBuilder (see JsiiRuntime.startRuntimeIfNeeded). 

You install the CDK toolkit (which runs on node) with 

npm install -g aws-cdk

You can initialize a Java project with:

cdk init app --language java

You then write your Java code to describe your environment using the libraries pulled in by the generated pom.xml. Knowing exactly what to do can be hard so you might want to look at some examples on GitHub here.

When you're done, run your Java code. Then you can call cdk synth to generate the metadata and cdk deploy to deploy it and as if by magic, your S3 bucket etc is deployed in AWS! Note you must run this in the top-level directory of your Java project. Apparently, state is saved and shared between the Java build and the call to cdk.

All an act

You can run GitHub actions locally with act. This really helps with debugging. 

If your .github YAML looks like this:

jobs:
  code:

Then run act with something like this:

act -n --env-file .github/pr.yml -j code -s AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -s AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY

where -j specifies which job (YAML block) you're running.

Remove the -n if you want it to be a real run rather than a dummy run.

If you want to use secrets, you can just set them as environment variables. For instance:

        env:
          AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY_ID }}
        run: /root/.local/bin/poetry run python -m pytest integration_tests

will use your system's environment variable AWS_ACCESS_KEY_ID as the value of AWS_ACCESS_KEY that the integration tests use.