Wednesday, August 30, 2023

Can we apply ML to logging?

Kibana is a TypeScript/JavaScript product for visualising logs. OpenSearch Dashboards is the Apache-licensed fork of it. Kibana is great when you know what you are looking for. But what if you don't?

Example

I have a small Kafka cluster of three nodes using the Raft protocol. I send messages and then check that a consumer has read them all. This integration test passes every time and there are no ERRORs in the logs. However, every so often, the test takes over 2 minutes when it should take about 20 seconds.

A good run produces about 4.5k lines of logs; a bad run, about 20k. Twenty thousand lines is a lot to go through when you don't know what you're looking for.

I slightly adapted the code here to turn my logs into TF-IDF vectors and used Locality Sensitive Hashing (LSH) to map them to a lower-dimensional space. Now we can visualise what's going on.
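
In outline, the pipeline looks something like this (a rough sketch using scikit-learn and numpy rather than the actual adapted code; the file name, number of bits and seed are all illustrative):

import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the Kafka log (the file name here is made up).
with open("kafka.log") as f:
    lines = [line.strip() for line in f if line.strip()]

# TF-IDF turns each log line into a sparse feature vector.
tfidf = TfidfVectorizer().fit_transform(lines)            # shape: (n_lines, n_terms)

# LSH by random projection: the sign of the projection onto each
# random vector contributes one bit of a signature per log line.
n_bits = 16
rng = np.random.default_rng(42)
planes = rng.normal(size=(tfidf.shape[1], n_bits))
bits = np.asarray(tfidf @ planes) > 0                     # shape: (n_lines, n_bits)

# Pack the bits into one integer per line: that integer is the line's bucket.
buckets = bits.dot(1 << np.arange(n_bits))

# Similar lines tend to land in the same bucket, so the bucket counts
# show which shapes of log line dominate a run.
for bucket, count in Counter(buckets).most_common(5):
    example = lines[int(np.argmax(buckets == bucket))]
    print(count, example[:120])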

The good run looks like this:

[visualisation of the good run]

Note that there are two dominant lines that map to:

[2023-07-04 14:13:10,089] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:13:10,089] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

repeated over and over for about 10 seconds.

The bad run looks like this:

[visualisation of the bad run]

Here, the dominant lines in the diagram come from:

[2023-07-04 14:16:21,755] WARN [TransactionCoordinator id=2] Connection to node 1 (localhost/127.0.0.1:9091) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] WARN [TransactionCoordinator id=2] Connection to node 3 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2023-07-04 14:16:21,805] INFO [TransactionCoordinator id=2] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)

again repeated over and over, but this time lasting about 2 minutes.

[The code in Ethen Lui's GitHub is really quite clever. Rather than using MinHashing, he projects the feature vectors onto some randomly generated vectors and builds a bitmap from the signs. This bitmap can be packed into a single integer that identifies the feature's bucket. Note that the number of random vectors does not really change the dimensionality of the space, but it does change how consistent different runs are: more vectors give greater repeatability.]
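
A toy version of that trick might look like the following (illustrative only, not his actual code; the function name and sizes are made up):

import numpy as np

def lsh_bucket(vec, planes):
    # The sign of the projection onto each random vector gives one bit...
    bits = (planes @ vec) > 0
    # ...and packing those bits gives a single integer naming the bucket.
    return int(bits.dot(1 << np.arange(len(bits))))

rng = np.random.default_rng(0)
planes = rng.normal(size=(16, 300))   # 16 random vectors in a 300-dimensional feature space
print(lsh_bucket(rng.normal(size=300), planes))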

Still at something of a loss, I checked out Bitnami's Kafka image (here) and changed the logging in bitnami/kafka/3.5/debian-11/rootfs/opt/bitnami/scripts/kafka/postunpack.sh by adding the line:

replace_in_file "${KAFKA_CONF_DIR}/log4j.properties" "INFO" "DEBUG"

and built the Docker image again. Now it gives me DEBUG statements.

Fix

The non-determinism was still foxing me, but the fix became clear with all these mentions of localhost. We need the client to communicate with the cluster on localhost because the client is unaware that the Kafka instances are hosted in Docker. However, each broker does need to know that it is talking to another Docker container, as the ports of its peers are not available within its own sandbox.

The solution was to use slightly different values for the listeners than for the advertised listeners (KAFKA_CFG_LISTENERS vs KAFKA_CFG_ADVERTISED_LISTENERS; note that Bitnami expects environment variables prefixed with KAFKA_CFG_ and with periods written as underscores, and it converts them into a Kafka-friendly server.properties file).

The listeners were of the form OUTSIDE://:9111 while the advertised listeners were of the form OUTSIDE://localhost:9111. The label OUTSIDE is apparently arbitrary. It's just used as a reference, say in listener.security.protocol.map (in Kafka-speak; munge it through the Bitnami naming convention above to make it appear in server.properties), where you'll see something like OUTSIDE:PLAINTEXT.
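
Putting that together, the broker environment ends up looking something like this (illustrative values only; the ports and listener names just follow the form described above, and the rest of the configuration is omitted):

# Bitnami maps KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP to listener.security.protocol.map, and so on.
KAFKA_CFG_LISTENERS=OUTSIDE://:9111
KAFKA_CFG_ADVERTISED_LISTENERS=OUTSIDE://localhost:9111
KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=OUTSIDE:PLAINTEXT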

Conclusion

Although I've fixed the Kafka issue I was facing, applying ML to the logs was only partially helpful. I still need to understand the Kafka Raft code better before it can truly be of use.