Tuesday, December 12, 2023

ML and Logs (pt2)

Further to my attempt to use machine learning to make sense of huge amounts of logs, I've been looking at the results. My PoC can:

Find log entries with the highest information

When debugging my Kafka cluster, these lines had the highest average entropy:

kafka1: 2023-07-04 14:14:18,861 [RaftManager id=1] Connection to node 3 (kafka3/172.31.0.4:9098) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka1: 2023-07-04 14:17:32,605 [RaftManager id=1] Connection to node 2 (kafka2/172.31.0.3:9098) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka2: 2023-07-04 14:17:31,957 [TransactionCoordinator id=2] Connection to node 3 (localhost/127.0.0.1:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

kafka1: 2023-07-04 14:17:32,605 [RaftManager id=1] Node 2 disconnected. (org.apache.kafka.clients.NetworkClient)

kafka2: 2023-07-04 14:17:31,957 [TransactionCoordinator id=2] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient)

As it happened, this correctly highlighted my problem (Docker Compose networking was misconfigured). But I don't know if I got lucky.

Bucket similar-but-different lines

Using locality-sensitive hashing (LSH) - the same technique Twitter uses to spot near-duplicate tweets - we can bucket lines that are similar but lexically different, for example:

2023-07-04 14:14:21,480 [QuorumController id=3] ConfigResource(type=TOPIC, name='__consumer_offsets'): set configuration cleanup.policy to compact (org.apache.kafka.controller.ConfigurationControlManager)

2023-07-04 14:14:21,489 [QuorumController id=3] ConfigResource(type=TOPIC, name='__consumer_offsets'): set configuration compression.type to producer (org.apache.kafka.controller.ConfigurationControlManager)

This means that we can:

    • discard boilerplate lines of little value, like those above
    • check the distribution of nodes within a given bucket (for example, if one node is under-represented in a bucket - that is, it isn't logging the same messages as its peers - that might indicate a problem); a rough sketch of this check follows below

There's one slight gotcha here: in the Kafka example above we're using the Raft protocol, so it's not too surprising that some buckets contain only N-1 of the nodes, since one node has been elected leader and the others are followers.
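Something like this per-node check (a simplified sketch; the bucket structure and threshold here are illustrative, not the PoC's actual API):

    from collections import Counter

    # Each bucket is assumed to be a list of (node, log_line) pairs that
    # hashed to the same LSH key; the names here are illustrative.
    def node_distribution(bucket):
        return Counter(node for node, _ in bucket)

    def underrepresented_nodes(bucket, all_nodes, threshold=0.5):
        # Flag nodes that log this kind of message far less often than average.
        counts = node_distribution(bucket)
        mean = sum(counts.values()) / len(all_nodes)
        return [n for n in all_nodes if counts.get(n, 0) < threshold * mean]

    # e.g. underrepresented_nodes(bucket, ["kafka1", "kafka2", "kafka3"])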

Trace high-information tokens through the system

Words with high entropy can be traced across my cluster. For instance, my PoC classified wUi1RthMRPabI8rHS_Snig as carrying high information. This happens to be an internal Kafka UUID for a topic, and tracing its occurrences through the logs shows that, despite the Docker network issues, all nodes agreed on the topic ID, as did the client. So clearly some communication was happening despite the misconfiguration.
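The tracing itself is nothing fancy; something like this sketch, assuming the logs are available as (node, line) pairs:

    def trace_token(token, entries):
        # Yield every (node, line) in which the high-entropy token appears, in log order.
        for node, line in entries:
            if token in line:
                yield node, line

    # for node, line in trace_token("wUi1RthMRPabI8rHS_Snig", entries):
    #     print(node, line)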

Investigation

I finally solved my Kafka problem. The Kafka client was running on the host OS and could see the individual Kafka containers, but the brokers could not talk to each other. The reason was that they needed to advertise themselves both as localhost (for the sake of the Kafka client living outside Docker) and under their internal names (so they could talk to each other within the Docker network).

My PoC could not tell me exactly what the problem was but it successfully highlighted the suspects.

The PoC

So, how does the PoC work? For the entropy, we train a model on a dictionary of English words so it learns what a "normal" word looks like, as opposed to, say, wUi1RthMRPabI8rHS_Snig. We disregard lines of fewer than six words (counting the FQNs of classes as one word per package), take the average entropy of each remaining line's tokens, and present the lines that look the most informative.
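In outline, the scoring looks something like this (a simplified sketch; the character-bigram model, dictionary path and tokenisation here are illustrative rather than the PoC's exact code):

    import math
    import re
    from collections import defaultdict

    def train_char_bigrams(words):
        # Learn character transition counts from ordinary English words.
        counts = defaultdict(lambda: defaultdict(int))
        for w in words:
            w = f"^{w.lower()}$"                      # word-boundary markers
            for a, b in zip(w, w[1:]):
                counts[a][b] += 1
        # Normalise the counts into probabilities.
        model = {}
        for a, nexts in counts.items():
            total = sum(nexts.values())
            model[a] = {b: n / total for b, n in nexts.items()}
        return model

    def token_entropy(model, token, floor=1e-6):
        # Average per-character surprise (in bits); random strings score high.
        t = f"^{token.lower()}$"
        bits = [-math.log2(model.get(a, {}).get(b, floor)) for a, b in zip(t, t[1:])]
        return sum(bits) / len(bits)

    def score_line(model, line, min_tokens=6):
        # Split on whitespace and dots so each package of an FQN counts as a word.
        tokens = [t for t in re.split(r"[\s.]+", line) if t]
        if len(tokens) < min_tokens:
            return None                               # disregard short lines
        return sum(token_entropy(model, t) for t in tokens) / len(tokens)

    # Usage: train on a dictionary, score every line, show the top scorers.
    # model = train_char_bigrams(open("/usr/share/dict/words").read().split())
    # scored = [(score_line(model, l), l) for l in log_lines]
    # print(sorted([s for s in scored if s[0] is not None], reverse=True)[:5])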

For the LSH, we use one-hot encoding of word shingles to create our vectors.
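A minimal sketch of that bucketing (word-bigram shingles and a random-hyperplane hash family here; the actual shingle size and hash family may differ):

    import hashlib
    from collections import defaultdict

    def shingles(line, k=2):
        # The one-hot vector is represented implicitly as a set of word shingles.
        words = line.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def signed_weight(shingle, plane):
        # Deterministic +1/-1 weight for a (shingle, hyperplane) pair.
        h = hashlib.md5(f"{plane}:{shingle}".encode()).digest()
        return 1 if h[0] % 2 == 0 else -1

    def lsh_key(line, n_planes=16):
        # Sign of the projection onto each random hyperplane; similar lines
        # produce the same bit pattern and so land in the same bucket.
        bits = []
        for plane in range(n_planes):
            total = sum(signed_weight(s, plane) for s in shingles(line))
            bits.append("1" if total >= 0 else "0")
        return "".join(bits)

    def bucket_lines(lines):
        buckets = defaultdict(list)
        for line in lines:
            buckets[lsh_key(line)].append(line)
        return buckets

With this, the two QuorumController lines above differ in only a handful of shingles, so they should hash to the same key and land in one bucket.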

Future plans

I'd like to show the graph of paths the high-entropy words take through the system (node and log line).

I'd also like to try other systems. Maybe I got lucky with Kafka as there are lovely, high-entropy UUIDs scattered throughout the logs (for example, consumer group IDs).

Thirdly, this PoC has been great for small amounts of data, but what about big data? It really needs to be rewritten in a JVM language and made to run in Spark.
