Tuesday, October 11, 2022

CI/CD in the Cloud

I asked on Discord:

Are there any tools to weave all the components together so both developers and the CI/CD tool can easily run the same suite of integration tests? I've done a PoC that takes one application in the pipeline and runs it locally while it talks to an AWS environment I've provisioned with Terraform. I intend to use docker-compose to bring up all apps altogether for integration testing. The going is slow. How do other people tackle the same problem?

and user hubt gave me some great insights. Here is an abridged version of their answers:

Some people move to Kubernetes instead. There are tons of CI/CD tools built on it.

I have used shared dev environments for a long time. Not having individual dev environments has its drawbacks, but it also has its strengths.

One advantage is everyone sees the same data and database. At some point you start to run into problems where your bugs are based on specific data. With individual dev environments, these bugs are hard to reproduce for everyone unless there is a way of sharing both code and data. If you have a super solid, consistent data set that you have curated and built for every test then you are miles ahead of the game. But maintaining and updating that data set and test suite is very hard to do.

We have a promotion process from shared dev to staging to prod. We run continuous integration tests in dev and staging. People break those on a regular basis.

We don't have developers run them. We just [run] them automatically hourly. Our process works reasonably well. The hourly integration test cadence is driven more by how long the integration test suite takes rather than being strictly hourly.

[If somebody deletes a file from the dev environment thinking it wasn't needed] they would have broken dev and we would debug dev or staging to fix things.

Admittedly this probably doesn't scale for 100+ developers. I deal with a team of a few dozen developers so it isn't quite the same as supporting a team of 4 or a team of 100+.

We also have separate unit tests for some things. Those are targeted and should be 100% successful. Integration tests can fail for a variety of different reasons unrelated to someone's code. Unit tests should not. So, yes, the integration tests are more general.

[Regarding creating the dev environment on every test suite run] I think that only makes sense in very tightly controlled and constrained environments like if you have an application that has no database.

It's not like we are able to recreate our production environment regularly, so in some ways you want to handle things like you do in production. Recreating environments per test run would make sense if you are shipping packaged software. But as a cloud product/web service it makes less sense.

Discord user dan_hill2802 also gave his insights here and here:

We have developers able to easily build AWS environments for development and testing and tear them down again when done. The pipelines use the same mechanism to build independent environments in AWS, run tests, then tear them down.

We aim to have applications as cloud agnostic as possible, so a lot of application development is done with Docker containers locally. We also use localstack.cloud [open source community edition] for mocking some of the cloud endpoints. However, it's still not quite representative enough, so we enable developers to deploy the application and all supporting infrastructure in AWS; they can also attach to their environment for debugging. The "deploy" is a wrap-up of a few steps, including Terraform apply, setting up database users for the app, and seeding the database with baseline data. The teardown then does the reverse of that.
The individual AWS stacks are mostly for infrastructure engineers working on the Terraform code, where local really isn't an option. We then made it available to the devs too, who were asking about being able to test in AWS.

The tools we use (in order of priority):
  1. Terraform
  2. Make
  3. Kitchen
  4. Concourse CI (but could use any other CI tool)
Number 4 also allows Devs to create AWS environments without any AWS credentials.

Our approach

I gave up on my original plan to use Docker Compose. Instead, CI/CD for our Postgres-oriented app is entirely in the cloud and looks like this:
  1. Upon each raised PR, a test environment is built using Terraform inside GitHub Actions.
  2. GitHub Actions checks out the code, runs the Alembic DB scripts to prepare the Postgres database, and then runs the integration tests against this DB (a rough sketch of this step follows the list).
  3. Postgres is torn down after the tests. This is partly to save money (the DB is unused except during PRs) and partly to ensure nobody messes up the Alembic scripts. That is, we ensure that the DB schema can always be built from scratch.
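
As a rough illustration (not our exact code), step 2 can be expressed as a pytest fixture that drives Alembic's Python API; the DATABASE_URL variable and the file names below are placeholders:

# conftest.py (sketch): apply the Alembic migrations to the freshly
# provisioned Postgres instance before the integration tests run.
# DATABASE_URL is assumed to be exported by the workflow after
# `terraform apply` has created the database.
import os

import pytest
from alembic import command
from alembic.config import Config


@pytest.fixture(scope="session", autouse=True)
def migrated_database():
    cfg = Config("alembic.ini")
    cfg.set_main_option("sqlalchemy.url", os.environ["DATABASE_URL"])
    command.upgrade(cfg, "head")  # build the schema from scratch on every run
    yield
    # No downgrade needed: Terraform tears the whole database down afterwards.
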
Our Athena-oriented app is slightly different. We opted for a shared dev environment as hubt does (I'd have liked separate dev environments, but with 4 developers the juice does not seem worth the squeeze). However, we also employed dan_hill2802's idea of "seeding the database with baseline data".

Our approach for this app did not involve creating an Athena instance (as I had inherited somebody else's ecosystem). Rather, we:
  1. Created some Python code to generate synthetic test data. This also uses pyarrow.parquet to read the schema of a file used in manual tests so the synthetic data can be padded with columns that are not used as part of the test. Athena will complain if you don't do this.
  2. Uploaded the synthetic data to an S3 bucket that Athena can see by registering it with AWS Glue (think Hive metastore). The files only have to be registered once and that was done manually, since this Athena instance was built manually. Automating this was left as tech debt.
  3. Now, the integration tests can run SQL queries through Athena with boto3.client(service_name="athena", ...). This data will be the same on every run of this test suite (a sketch of all three steps follows below).
These three steps happen upon each PR push. Failure prevents any promotion. The environment is shared, so somebody running this test on their laptop could in theory interfere with a PR, but we're too small a team to worry unduly about this.
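
For the curious, here is a rough, hypothetical sketch of those three steps with pyarrow and boto3; the file, bucket, Glue database, table and column names are made up and the real code differs in the details.

# Hypothetical sketch of the three steps; the file, bucket, Glue database,
# table and column names below are made up for illustration.
import time

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Generate synthetic rows, padded with every column Athena expects.
reference_schema = pq.read_schema("reference_from_manual_tests.parquet")
rows = {name: [None] * 10 for name in reference_schema.names}  # pad unused columns
rows["customer_id"] = list(range(10))  # assumes customer_id is a column under test
table = pa.Table.from_pydict(rows, schema=reference_schema)
pq.write_table(table, "synthetic.parquet")

# 2. Upload to the S3 prefix that was registered with AWS Glue (done once, manually).
boto3.client("s3").upload_file(
    "synthetic.parquet", "my-test-bucket", "test_data/synthetic.parquet"
)

# 3. Run a query through Athena and poll until it finishes.
athena = boto3.client(service_name="athena", region_name="eu-west-2")
execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM my_glue_db.synthetic_table",
    ResultConfiguration={"OutputLocation": "s3://my-test-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
results = athena.get_query_results(QueryExecutionId=query_id)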

Automation, automation, automation

I almost never log in to the AWS web console as this is the antithesis of automation. Using the AWS CLI makes scripting the work so much easier.

If you're going to have scripts that automate everything, they need to create a new environment on a regular basis. If they don't, people will start putting things into the scripts that assume an environment (or part of it) is already there. Worse, they'll make changes to the environment by hand (that's why we added the requirement to build the DB from scratch - see above). Think of this as a regression test for your build scripts.

10 second builds? Ha!

Deploying the environment is slow and can take well over 10 minutes. 

  • It takes approximately 4 minutes to create a Postgres DB in AWS and another 4 minutes to tear it down when done. 
  • It also takes a few minutes to create a Python environment (that's fresh each time we run the tests) since we need to download and install all the dependencies.
  • Finally, it appears that GitHub Actions runs in the US while our DBs are in the EU, and this cross-Atlantic traffic is slow (think loading hundreds of megs of baseline data into a DB).

Conclusion

Although our approach is not perfect, the perfect is the enemy of the good. Our approach is yielding results and there is an appetite to make it even better, but later. 

Baby steps.

Monday, October 10, 2022

AWS DevOps cheat sheet

Miscellaneous notes that I keep needing to refer to. 

Read a file in S3 from the command line

If you want to, say, look at the number of lines in a file, run:

aws s3 cp s3://BUCKET/PATH - | wc -l

Note the critical hyphen: it tells the CLI to stream the file to stdout rather than to a local file.

Avoid Pagination

The --no-cli-pager switch is the daddy here. If I want to list all my RDS databases in a script, I don't want them paginated or truncated, so run:

aws rds describe-db-instances --no-cli-pager

Logging

Run something like this to get the logs:

aws logs tail YOUR_LOG_GROUP_NAME --since 5d

where YOUR_LOG_GROUP_NAME is the awslogs-group value under container_definitions/logConfiguration/options of your resource "aws_ecs_task_definition" "backend_task" if you're using Terraform.

The 5d is the last 5 days, but it could be, say, 1h (the last hour), or --follow if you want to tail it.

Caveat

Make sure you configure the health checks of your services correctly. One of our services was returning an HTTP 404 for the page the health checker was trying to hit. Everything was fine other than that one missing page, but AWS saw the 404 and kept killing the service. Oops.

Terraform

This isn't specifically related to AWS but since I use the two together, I'll post it here.

Create new workspaces to isolate yourself from breaking other people's work with:

terraform workspace new staging

Then, if you really foul something up, you can delete the staging environment with:

terraform apply -destroy

Bad boys and girls might change the environment by hand. In that case, you have to tell Terraform to import it with something like:

terraform import -var-file="vars/production.tfvars" -var="x=y" ... aws_db_subnet_group.my_db my-db

You can see the dependency graph of your infrastructure with:

terraform graph | dot -Tsvg > ~/Pictures/graph.svg

Docker

This has helped me a few times when trying to work out locally why my Docker container does not work when deployed in ECS. Say your container dies immediately. You can debug the Docker daemon itself (on Ubuntu) with:

journalctl -f -u docker.service

Then you can see the IDs of short-lived containers and get their logs (see the Docker docs).

Wednesday, October 5, 2022

SOTA MLOps Tools

Another post on the proliferation of tools. I'm writing this just to keep up to date on what the cool kids are talking about.

Feature Store

The most popular offering is Feast (FOSS and mostly Python), but it's batch only. There is also Tecton (closed source, managed platform), which claims to do streaming.

Feature stores basically store features that are hard or expensive to compute. For example, in Edge ML (where smart devices process data locally), some data may be derived from an aggregation of data the device does not otherwise have access to.
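
For a flavour of what "using a feature store" looks like in code, here is a minimal sketch of an online lookup with Feast's Python SDK; the feature view, feature and entity names are quickstart-style placeholders rather than anything real.

# Minimal Feast sketch: look up precomputed features for one entity at serving time.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml
features = store.get_online_features(
    features=["driver_stats:avg_daily_trips", "driver_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)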

"Feature store is an increasingly loaded term that can be used by different people to refer to very different things... There are three main problems that a feature store can help address
  • Feature management: "features used for one model can be used in another"
  • Feature computation: "feature engineering logic, after being defined, needs to be computed... If the computation of this feature isn't too expensive, it might be acceptable to compute this feature each time [or] you might want to execute it only once the first time it is required, then store it"
  • Feature consistency: "feature definitions written in Python during development might need to be converted into the language used in production... Modern feature stores... unify the logic for both batch features and streaming."
"Some features stores only manage feature definitions without computing features from data; some feature stores do both.

"Out of 95 companies I surveyed in January 2022, only around 40% of them use a features store. Out of those who use a feature store, half of them build their own" [1]

However, note that some say Feature Stores are not worth the complexity [TowardsDataScience].

Model Serving

MLFlow is an open source project from Databricks. By putting its dependencies in your code, you can serve models via HTTP. However, you must still create a Docker container for its server.
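
As a rough sketch (not a recommendation), logging a scikit-learn model with MLFlow and serving it over HTTP looks something like this; the toy model and port are arbitrary:

# Sketch: log a scikit-learn model with MLFlow, then serve it over HTTP.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model")

# The logged model can then be served over HTTP, e.g.:
#   mlflow models serve -m runs:/<run_id>/model -p 5000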

Metaflow was open sourced by Netflix. It's a framework that creates a DAG of the pipeline a data scientist may use.
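
A hypothetical minimal flow gives the idea: each @step becomes a node in the DAG and Metaflow handles the orchestration.

# hello_flow.py: a tiny Metaflow DAG (run with `python hello_flow.py run`).
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        self.message = "hello"
        self.next(self.train)

    @step
    def train(self):
        # Pretend a model is trained here; artifacts like self.message are persisted.
        print(self.message)
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    HelloFlow()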

Kubeflow is an open source offering from Google.

Seldon Core helps to deploy pickled SKLearn models. Because the models are pickled, any pre-processing of the input must be done before calling the served model. For example, you'd call it with something like:

curl -s -d '{"data": {"ndarray":[[1.0, 2.0, 5.0, 6.0]]}}' \
  -X POST http://localhost:8003/seldon/seldon/sklearn/api/v1.0/predictions \
  -H "Content-Type: application/json"

(from here).

Development Tools

Apache TVM is an "end-to-end Machine Learning Compiler Framework for CPUs, GPUs and accelerators".

DVC is a Git-based tool for versioning data.
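
For example (with a made-up repository URL and tag), DVC's Python API lets you read a versioned file pinned to a Git revision:

# Sketch: read a DVC-tracked file at a specific Git tag (repo URL and tag are made up).
import dvc.api

with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/my-repo",
    rev="v1.0",
) as f:
    print(f.readline())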

[1] Designing Machine Learning Systems, Chip Huyen