Thursday, September 22, 2022

Rage against the Markup

Markup is pervasive in DevOps. But markup is also:

  • hard to refactor
  • limited in its control flow
  • not type-safe
  • hard to test

Refactoring

I see in our codebase something like this:

      - name: Set env to Develop
        if: endsWith(github.ref, '/develop')
        run: |
          echo "ENVIRONMENT=develop" >> $GITHUB_ENV
      - name: Set env to Staging
        if: endsWith(github.ref, '/main')
        run: |
          echo "ENVIRONMENT=staging" >> $GITHUB_ENV
      - name: Set env to Production
        if: endsWith(github.ref, '/production')
        run: |
          echo "ENVIRONMENT=production" >> $GITHUB_ENV

Not only is this ugly, it's copy-and-pasted everywhere. I can't refactor it and what's more, there is a...

(Lack) of Control

Imagine I want to create an AWS Postgres instance with Terraform, then provision the DB with a schema using the Python library Alembic, all via GitHub Actions. GHA can call the Terraform file and create a DB, but how do I get the URL of that Postgres instance so I can give it to Alembic for it to connect and create the tables? Weaving different technologies together in a Turing-complete language is easy; with markup, less so. I had to hack some calls to the AWS CLI and parse the JSON it returned, all in bash.
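
For contrast, here's roughly what that glue amounts to in a general-purpose language. This is a sketch in Python using boto3 rather than the bash I actually wrote; the instance identifier, region, user and database names are all made up:

# Sketch: fetch the endpoint of the RDS instance Terraform just created
# and hand it to Alembic. Identifiers and names are placeholders.
import os
import subprocess

import boto3

rds = boto3.client("rds", region_name="eu-west-2")
instance = rds.describe_db_instances(DBInstanceIdentifier="my-app-db")["DBInstances"][0]
endpoint = instance["Endpoint"]["Address"]
port = instance["Endpoint"]["Port"]

# Build the URL Alembic expects and run the migration.
os.environ["DATABASE_URL"] = (
    f"postgresql://app_user:{os.environ['DB_PASSWORD']}@{endpoint}:{port}/app"
)
subprocess.run(["alembic", "upgrade", "head"], check=True)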

Type-safety issue #1

An example of a lack of type safety can be found in any Terraform script. We had something like this:

resource "aws_ecs_task_definition" "compute_task" {
  family                   = var.task_name
  container_definitions    = <<DEFINITION
  [
    {
      "name": "${var.task_name}",
      "image": "${aws_ecr_repository.docker_container_registry.repository_url}:${var.tag}",
...

Notice the raw AWS JSON (the ECS container definitions) being injected into Terraform. Now, trying to add a reference to the aws_iam_role here (as is suggested on numerous websites - see a previous post here) is silently ignored. This wouldn't happen using, say, an SDK in a type-safe language, as you can obviously only access the methods it offers you.
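
Compare that with going through an SDK. Even boto3, which isn't statically typed, validates parameters at call time, so an unknown field blows up immediately instead of being silently dropped. A sketch (the ARN, image and names are placeholders):

# Sketch: registering the same task definition through boto3 instead of injected JSON.
import boto3

ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="compute_task",
    taskRoleArn="arn:aws:iam::123456789012:role/compute-task-role",  # a known field: accepted
    containerDefinitions=[
        {
            "name": "compute_task",
            "image": "123456789012.dkr.ecr.eu-west-2.amazonaws.com/compute:latest",
            "essential": True,
            "memory": 512,
        }
    ],
    # An unknown top-level field here (say, task_role="...") raises
    # botocore.exceptions.ParamValidationError rather than being silently ignored.
)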

Type-safety issue #2

Secrets in GitHub Actions can apparently only contain uppercase alphanumerics and underscores. AWS identifiers can include alphanumerics and a dash. Mix the two up and you spend time fixing tedious bugs. Tiny types would help with this.
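
What I mean by tiny types, sketched in Python for illustration (the validation rules below are indicative, not the platforms' exact rules): wrap each kind of identifier in its own type that refuses to be constructed from an invalid value.

# Sketch: 'tiny types' that fail fast on construction.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class GitHubSecretName:
    value: str
    def __post_init__(self):
        if not re.fullmatch(r"[A-Z0-9_]+", self.value):
            raise ValueError(f"not a valid GitHub secret name: {self.value}")

@dataclass(frozen=True)
class AwsIdentifier:
    value: str
    def __post_init__(self):
        if not re.fullmatch(r"[A-Za-z0-9-]+", self.value):
            raise ValueError(f"not a valid AWS identifier: {self.value}")

GitHubSecretName("DB_PASSWORD_STAGING")  # fine
GitHubSecretName("DB_PASSWORD-staging")  # blows up here, not in a pipeline hours later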

Type-safety issue #3

Another example: for creating a DB, we used the password M3dit8at!0n$ - seems OK, right? The DB built fine and the GitHub Actions script then created the schema fine, but we could not log in to the DB. Cue hours of frantic checking that there were no network or permission issues. The problem? That bloody password includes characters that need to be escaped on the Linux CLI, and that's how Terraform and Alembic were invoked! They were at least consistent - that is, the infrastructure was Terraformed and Alembic built the schema - but for the rest of us, the password didn't work.

In Java, a String is just a String and its content isn't going to break the program's flow. Not so in markup land.
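
The dull fix is to quote the password before it ever reaches a shell. In Python, the standard library will do it for you:

# Sketch: shlex.quote produces a shell-safe version of the password, so characters
# like '!' and '$' can't be interpreted by the shell that invokes Terraform or Alembic.
import shlex

password = "M3dit8at!0n$"
print(shlex.quote(password))
# => 'M3dit8at!0n$'  (single-quoted, so the shell passes it through untouched)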

Testing times

Which leads to testing. I made my changes in my GitHub Actions file to use the password in the secret ${{ secrets.DB_PASSWORD-staging }} and ran my mock GHA like so:

act -j populate -s DB_PASSWORD-staging=...

and the whole thing worked wonderfully. Only when I tried to create the secret in-situ was I told that DB_PASSWORD-staging was an invalid name.

And how do you test this abomination?

[Image: spot the errors - the result of some hasty copy-and-paste.]

Solutions...?

What you can do in Scala 3 with metaprogramming is truly amazing. For example, proto-quill builds ASTs that are passed around and checked against the DB at compile time! That might be a bit overkill here; a more practical approach is ip4s, which checks your URLs, ports, etc. at compile time. I have a couple of much-neglected projects that at least give the gist of a solution (here's one) that I'll expand on soon.

Thursday, September 15, 2022

Python webapps for Java Devs

Unlike Java, Python doesn't have true multithreading built in. In The Quick Python Book, we read: "Python doesn’t use multiple processors well... the standard implementation of Python is not designed to use multiple cores. This is due to a feature called the global interpreter lock, or GIL... if you need concurrency out of the box, Python may not be for you." Multithreaded code is not a first-class citizen in Python the way it is in Java, so things are somewhat complicated.

If you're running a FastAPI application, you're probably using Docker. But if you want to debug it on your laptop, you might need to run it by hand with uvicorn. A similar technique can be employed with Flask [SO].

[Uvicorn is an ASGI compliant web server. ASGI is the Asynchronous Server Gateway Interface, "a standard interface between async-capable Python web servers, frameworks, and applications".]

Uvicorn allows hot deployment of code, much like Tomcat. You can run it with something like:

env $(grep -v "#" .env | xargs) poetry run uvicorn src.api.main:app --reload

where the module src.api.main (that is, src/api/main.py) has a field called app that's a fastapi.FastAPI.
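
For the curious, src/api/main.py needs little more than this (the route is made up):

# src/api/main.py - a minimal FastAPI app that the uvicorn command above can serve.
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # An illustrative endpoint; hit http://127.0.0.1:8000/health once uvicorn is up.
    return {"status": "ok"}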

[Poetry is a dependency management tool that's a bit like SBT in that it's interactive. You start it off with poetry shell]

Another web server, supporting another standard (WSGI, "a specification that describes how a web server communicates with web applications"), is Gunicorn. It's a pre-fork server, that is to say, it forks its worker processes up front via the POSIX fork call and farms requests out to them.
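
The WSGI contract itself is tiny: a callable that takes the request environment and a start_response function. A minimal sketch that Gunicorn could serve (module name and worker count are made up):

# wsgi_app.py - the smallest possible WSGI application.
# Run with something like: gunicorn wsgi_app:app --workers 4
def app(environ, start_response):
    # 'environ' describes the request; start_response sets the status and headers.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from a pre-forked worker\n"]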

Simple Python webapps often use Alembic for DB creation and migration. To get yourself going, you call

alembic init db


to create a directory called 'db' that holds all the revisions, tell the autogenerated env.py file where to find the classes that represent your tables etc., then run 

alembic revision --autogenerate -m "snapshot"


(where snapshot is essentially a comment) and you'll find a file somewhere under db that contains Python code to create SQL tables. Now, just run it with 

alembic upgrade head

and you'll see the tables in your database if you've correctly populated your .env file with a DATABASE_URL. (DotEnv seems to be the standard Python way of reading properties files, falling back to environment variables if it can't find what it's looking for.) 

There's more to Alembic than that but you get the idea. It's like Java's Flyway.
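
To give a flavour, the autogenerated revision file looks roughly like this (revision ids, table and column names are placeholders):

# db/versions/a1b2c3d4e5f6_snapshot.py - roughly what --autogenerate produces.
import sqlalchemy as sa
from alembic import op

revision = "a1b2c3d4e5f6"
down_revision = None

def upgrade() -> None:
    # Create the tables described by your SQLAlchemy models.
    op.create_table(
        "users",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("email", sa.String(255), nullable=False),
    )

def downgrade() -> None:
    op.drop_table("users")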

Alembic uses the SQLAlchemy library, which is Python's answer to Java's Hibernate. I was pleasantly surprised to see that I had to call session.flush() to populate my POJO-equivalents with keys generated by the DB. Takes me back to the mid-noughties :)
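
In code, that looks something like this (the model and connection string are made up):

# Sketch: the DB-generated primary key only appears on the object after a flush.
import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = sa.Column(sa.Integer, primary_key=True)      # generated by the database
    email = sa.Column(sa.String(255), nullable=False)

engine = sa.create_engine("postgresql://app_user:secret@localhost:5432/app")  # placeholder

with Session(engine) as session:
    user = User(email="someone@example.com")
    session.add(user)
    print(user.id)    # None - nothing has hit the database yet
    session.flush()   # emits the INSERT inside the current transaction
    print(user.id)    # now populated with the DB-generated key
    session.commit()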

For streaming large files from places like S3, Python has the smart-open library. Think of this as the Pythonic equivalent of FS2.
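
A sketch (bucket and key are made up):

# Sketch: stream a (possibly huge, possibly gzipped) S3 object line by line,
# never holding the whole thing in memory.
from smart_open import open as s_open

with s_open("s3://my-bucket/big-file.txt.gz", "r") as lines:
    for n, line in enumerate(lines, start=1):
        print(n, line.rstrip())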

Finally, there's a Python equivalent of Java's Lombok in that it auto-generates the boilerplate for you (constructors, equality, string representations). It's the dataclasses module in the standard library.
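
For example:

# Sketch: a dataclass generates __init__, __repr__ and __eq__ for you,
# much as Lombok's annotations do in Java.
from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    email: str
    loyalty_points: int = 0

c = Customer("Ada", "ada@example.com")
print(c)                                          # Customer(name='Ada', email='ada@example.com', loyalty_points=0)
print(c == Customer("Ada", "ada@example.com"))    # True - structural equality for free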

Sunday, September 4, 2022

Kafka/FS2 troubleshooting

I had FS2 and Kafka working quite nicely together in example code ... until I started using explicit transactions. Then my code was pausing and finally failing with:

org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000ms while awaiting EndTxn(false)

Cranking up the logging on the client side to TRACE revealed a tight loop that output something like this (simplified):

2022-09-02 16:50:41,898 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Request TxnOffsetCommitRequestData(...) dequeued for sending
2022-09-02 16:50:41,998 | DEBUG | o.a.k.c.producer.internals.Sender - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending transactional request TxnOffsetCommitRequestData(...) to node 127.0.0.1:9092 (id: 1001 rack: null) with correlation ID 527
2022-09-02 16:50:41,998 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Sending TXN_OFFSET_COMMIT request with header RequestHeader(apiKey=TXN_OFFSET_COMMIT, apiVersion=3, clientId=CLIENT_ID, correlationId=527) and timeout 30000 to node 1001: TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.apache.kafka.clients.NetworkClient - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TXN_OFFSET_COMMIT response from node 1001 for request with header RequestHeader(...)
2022-09-02 16:50:41,999 | TRACE | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received transactional response TxnOffsetCommitResponseData(...) for request TxnOffsetCommitRequestData(...)
2022-09-02 16:50:41,999 | DEBUG | o.a.k.c.p.i.TransactionManager - [Producer clientId=CLIENT_ID, transactionalId=TX_ID] Received TxnOffsetCommit response for consumer group MY_GROUP: {test_topic-1=UNKNOWN_TOPIC_OR_PARTITION}


The topic, at least, certainly did exist:

$ kafka-topics.sh  --bootstrap-server localhost:9092  --topic test_topic --describe
Topic: test_topic PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: test_topic Partition: 0 Leader: 1001 Replicas: 1001 Isr: 1001

The 'solution' was to use more partitions than replicas. "I made the replication factor less than the number of partitions and it worked for me. It sounds odd to me but yes, it started working after it." [SO]

But why? Putting breakpoints in the Kafka code for all references to UNKNOWN_TOPIC_OR_PARTITION and running my client code again led me to KafkaApis.handleTxnOffsetCommitRequest (which seems reasonable since the FS2-Kafka client is trying to handle the offsets manually). There I could see that my partition in org.apache.kafka.common.TopicPartition was 1 when the Kafka server was expecting 0. Oops. I had guessed this number to make the client compile and forgotten to go back and fix it. 

So, the topic did exist but the partition did not. Creating more partitions just means that there is something there to commit, which addresses the symptom rather than the cause. 

The takeaway point is that committing the transactions by hand requires knowledge of the structure of the topic.
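
Had I been disciplined, the client would have looked the partitions up rather than hard-coding them. A sketch of that check, written with the kafka-python library purely for illustration (the FS2-Kafka code in question is Scala, but the principle is the same):

# Sketch: ask the broker which partitions a topic actually has before committing
# offsets against them.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.partitions_for_topic("test_topic"))
# e.g. {0} - committing offsets for partition 1 is what triggered UNKNOWN_TOPIC_OR_PARTITION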