Saturday, December 23, 2023

Cloud native

A cloud native approach to writing code assumes that the instance in which it lives can die at any time.

"Users sometimes explicitly send the SIGKILL signal to a process using kill -KILL or kill -9. However, this is generally a mistak. A well-designed application will have a handler for SIGTERM that causes the application to exit gracefully, cleaning up temporary files and realeasing other resources beforehand. Killing a process with SIGKILL bypasses the SIGTERM handler." - The Linux Programming Interface (Micahel Kerrisk)
Using docker stop sends SIGTERM.
Using docker kill sends SIGKILL.

The latter does not give the JVM a chance to clean up. In fact, no process in any language gets the chance to clean up after a SIGKILL. (SIGTERM delivered to the JVM - whichever thread receives it, not just main - causes the whole process to end and shutdown hooks to execute.)
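For example, something like this minimal sketch (just an illustration, not any real service) will run its shutdown hook on SIGTERM - that is, on docker stop or a plain kill - but never on SIGKILL:

import java.nio.file.Files;
import java.nio.file.Path;

public class GracefulShutdown {
    public static void main(String[] args) throws Exception {
        Path scratch = Files.createTempFile("app", ".tmp");
        // Runs on SIGTERM (docker stop) or a normal exit; never on SIGKILL (docker kill).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                Files.deleteIfExists(scratch);  // release temporary files and other resources
            } catch (Exception ignored) {
            }
        }));
        Thread.currentThread().join();          // simulate a long-running service
    }
}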

A Tini problem...

If the JVM process creates another process and is then killed with SIGKILL, that child process carries on living but its parent becomes (on Ubuntu 20.04.6 LTS) systemd, which in turn is owned by init (PID 1).

Running your JVM directly in a Docker container has some issues. These revolve around Linux treating PID 1 as special - and the ENTRYPOINT of any Docker container runs as PID 1.

In Linux, PID 1 should be init. On my Linux machine, I see:

$ ps -ef | head -2
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Oct21 ?        00:18:23 /sbin/init splash

This process serves a special purpose. It handles SIGnals and zombie processes. Java is not built with that in mind, so it's best to bootstrap it with a small process called tini. There's a good discussion of why this is important here on GitHub. Basically, Tini forwards signals such as the one that killed the JVM on to the process it spawned (and, with its -g flag, to that process's whole group), giving children the chance to clean up too, and it reaps any zombie processes that are left behind. 
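A minimal sketch of what this might look like in a Dockerfile (the base image and jar path are placeholders, not from any real project):

FROM eclipse-temurin:17-jre
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
COPY app.jar /app.jar
# tini becomes PID 1; it forwards signals to the java process and reaps zombies
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["java", "-jar", "/app.jar"]

Alternatively, docker run --init injects Tini as PID 1 without changing the image at all.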

It also passes the JVM's exit code on so we can know how it failed. Exit codes 0-127 are reserved [SO] and the number of the signal that killed the process (kill -l lists them) is added to 128. If you want to set the exit code in a shutdown hook, note that you need to call Runtime.halt rather than Runtime.exit (to which System.exit delegates). The exit method will cause the JVM to hang in this situation [SO].
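Something like this sketch (the status 143, that is 128 + 15 for SIGTERM, is just for illustration):

public class ExitWithStatus {
    public static void main(String[] args) throws Exception {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // Clean up here, then set the exit status explicitly.
            // Runtime.exit (and System.exit, which delegates to it) would block at this
            // point because the shutdown sequence is already running; halt terminates
            // immediately with the given status, though any other hooks will not finish.
            Runtime.getRuntime().halt(143);  // 143 = 128 + 15 (SIGTERM)
        }));
        Thread.currentThread().join();       // wait to be stopped
    }
}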

Tuesday, December 12, 2023

ML and Logs (pt2)

Further to my attempt to use machine learning to make sense of huge amounts of logs, I've been looking at the results. My PoC can:

Find log entries with the highest information

When debugging my Kafka cluster, these lines had the highest average entropy:

kafka1: 2023-07-04 14:14:18,861 [RaftManager id=1] Connection to node 3 (kafka3/172.31.0.4:9098) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka1: 2023-07-04 14:17:32,605 [RaftManager id=1] Connection to node 2 (kafka2/172.31.0.3:9098) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka2: 2023-07-04 14:17:31,957 [TransactionCoordinator id=2] Connection to node 3 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka1: 2023-07-04 14:17:32,605 [RaftManager id=1] Node 2 disconnected. (org.apache.kafka.clients.NetworkClient)

kafka2: 2023-07-04 14:17:31,957 [TransactionCoordinator id=2] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)

As it happened, this correctly highlighted my problem (Docker Compose networking was misconfigured). But I don't know if I got lucky.

Bucket similar-but-different lines

Using the same algorithm as Twitter, we can bucket similar but lexically different lines, for example:

2023-07-04 14:14:21,480 [QuorumController id=3] ConfigResource(type=TOPIC, name='__consumer_offsets'): set configuration cleanup.policy to compact (org.apache.kafka.controller.ConfigurationControlManager)

2023-07-04 14:14:21,489 [QuorumController id=3] ConfigResource(type=TOPIC, name='__consumer_offsets'): set configuration compression.type to producer (org.apache.kafka.controller.ConfigurationControlManager)

This means that we can:

    • discard boilerplate lines of little value like those above
    • check the distribution of all nodes in a given bucket (for example, if one node is under-represented within a bucket - that is, not logging the same as its peers - this might be an issue).

There's one slight gotcha here: in the Kafka example above, we're using the Raft protocol, so it's not too surprising that only N-1 nodes appear in some buckets, as one node has been elected leader and the others are followers.

Trace high information tokens through the system

Words with high entropy can be traced across my cluster. For instance, my PoC classified wUi1RthMRPabI8rHS_Snig as possessing high information. This happens to be an internal Kafka UUID for a topic, and tracing its occurrence through the logs shows that despite the Docker network issues, all nodes agreed on the topic ID, as did the client. So, clearly some communication was happening despite the misconfiguration.

Investigation

I finally solved my Kafka problem. The Kafka client was running on the host OS and could see the individual Kafka containers, but these brokers could not talk to each other. The reason was that they needed to advertise themselves both as localhost (for the sake of the Kafka client that lives outside Docker) and under their internal names (so they could talk to each other within the Docker network).

My PoC could not tell me exactly what the problem was but it successfully highlighted the suspects.

The PoC

So, how does the PoC work? For the entropy, we train the model on a dictionary of English words so it can learn what a "normal" word looks like, rather than, say, wUi1RthMRPabI8rHS_Snig. We disregard lines with fewer than 6 words (including the FQNs of the classes - each package element counting as one word); take the average entropy per line and present the lines that look the most informative.
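Roughly, the idea looks something like this sketch (the PoC's actual model may well differ; here I assume a character bigram model trained on a word list, which is one simple way of learning what a "normal" English word looks like):

import java.util.*;

/** A sketch of the "highest information" idea: a character bigram model
 *  trained on an English word list gives random IDs like wUi1RthMRPabI8rHS_Snig
 *  a much higher average surprisal than dictionary words. */
public class TokenEntropy {

    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> unigramCounts = new HashMap<>();

    /** Count character unigrams and bigrams over the dictionary. */
    public void train(List<String> dictionaryWords) {
        for (String word : dictionaryWords) {
            String w = word.toLowerCase();
            for (int i = 0; i < w.length(); i++) {
                unigramCounts.merge(w.charAt(i), 1, Integer::sum);
                if (i + 1 < w.length()) {
                    bigramCounts.merge(w.substring(i, i + 2), 1, Integer::sum);
                }
            }
        }
    }

    /** Average surprisal (bits per character) of one token under the bigram model. */
    public double surprisal(String token) {
        String t = token.toLowerCase();
        if (t.length() < 2) return 0.0;
        double bits = 0.0;
        for (int i = 0; i + 1 < t.length(); i++) {
            int bigram = bigramCounts.getOrDefault(t.substring(i, i + 2), 0);
            int unigram = unigramCounts.getOrDefault(t.charAt(i), 0);
            double p = (bigram + 1.0) / (unigram + 256.0);  // add-one smoothing for unseen pairs
            bits += -Math.log(p) / Math.log(2);
        }
        return bits / (t.length() - 1);
    }

    /** Score a log line as the mean surprisal of its tokens;
     *  lines with fewer than six words are ignored, as in the PoC. */
    public OptionalDouble scoreLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 6) return OptionalDouble.empty();
        return OptionalDouble.of(
            Arrays.stream(tokens).mapToDouble(this::surprisal).average().orElse(0.0));
    }
}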

For the LSH, we use one-hot encoding of word shingles to create our vectors.
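As a sketch rather than the PoC's exact code (the shingle size and the hash family are my assumptions): hash each word 3-shingle into a large one-hot space and bucket lines by the signs of a few random projections, so lines sharing most of their shingles tend to share a bucket key (in practice you would band several such keys).

import java.util.*;

/** A sketch of the bucketing: word 3-shingles are one-hot encoded by hashing
 *  into a fixed-size space, and lines are keyed by the signs of random
 *  projections (cosine LSH), so near-identical lines tend to share a key. */
public class LogLineBuckets {

    private static final int DIMENSIONS = 1 << 16;  // size of the hashed one-hot space
    private static final int PLANES = 16;           // number of bits in the bucket key
    private final double[][] hyperplanes = new double[PLANES][DIMENSIONS];

    public LogLineBuckets(long seed) {
        Random random = new Random(seed);
        for (double[] plane : hyperplanes) {
            for (int i = 0; i < DIMENSIONS; i++) {
                plane[i] = random.nextGaussian();
            }
        }
    }

    /** Indices of the word 3-shingles of a line, hashed into DIMENSIONS. */
    private Set<Integer> shingleIndices(String line) {
        String[] words = line.trim().split("\\s+");
        Set<Integer> indices = new HashSet<>();
        for (int i = 0; i + 2 < words.length; i++) {
            String shingle = words[i] + " " + words[i + 1] + " " + words[i + 2];
            indices.add(Math.floorMod(shingle.hashCode(), DIMENSIONS));
        }
        return indices;
    }

    /** Bucket key: one bit per hyperplane, set when the shingle vector falls on
     *  its positive side. Lines sharing most shingles agree on most bits. */
    public int bucket(String line) {
        Set<Integer> indices = shingleIndices(line);
        int key = 0;
        for (int p = 0; p < PLANES; p++) {
            double dot = 0.0;
            for (int index : indices) {
                dot += hyperplanes[p][index];
            }
            if (dot > 0) key |= (1 << p);
        }
        return key;
    }
}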

Future plans

I'd like to show the graph of paths the high-entropy words take through the system (node and log line).

I'd also like to try other systems. Maybe I got lucky with Kafka as there are lovely, high-entropy UUIDs scattered throughout the logs (for example, consumer group IDs).

Thirdly, this PoC has been great for small amounts of data, but what about big data? It really needs to be rewritten in a JVM language and made to run in Spark.