Monday, December 9, 2024

Diagnosing Kafka

The Problem

I have an integration test that creates a cluster of 4 Kafka brokers using the KRaft protocol (i.e., no ZooKeeper). It then kills one broker and, upon sending more messages, expects a consumer to receive them. Approximately 1 in 4 times, the test fails because no messages are received. This 1-in-4 figure seems suspicious...

Killing a node sometimes left the cluster in a bad state. The other brokers kept regularly barfing UnknownHostException as they tried to talk to the missing node. Looking at the topic shed some light on the problem:

$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic test_topic
Topic: test_topic TopicId: MaHwLf2jQ62VNUxtiFUGvw PartitionCount: 2 ReplicationFactor: 1 Configs: 
Topic: test_topic Partition: 0 Leader: 4 Replicas: 4 Isr: 4
Topic: test_topic Partition: 1 Leader: none Replicas: 1 Isr: 1

The Solution

After modifying the test so that it creates the NewTopic with a replication factor of 2, the same command gives:

Topic: test_topic TopicId: w37ZKZnqR-2AdmT76oiWsw PartitionCount: 2 ReplicationFactor: 2 Configs: 
Topic: test_topic Partition: 0 Leader: 4 Replicas: 4,1 Isr: 4,1
Topic: test_topic Partition: 1 Leader: 1 Replicas: 1,2 Isr: 1,2

The test now passes every time (so far). We appear to have fixed the problem.
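The test creates the topic programmatically via NewTopic, but the same fix can be sketched from the CLI. A minimal example, assuming the same bootstrap server and topic name as above (the partition count of 2 matches the describe output):

```shell
# Create the topic with 2 partitions and a replication factor of 2,
# so each partition survives the loss of a single broker.
$KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic test_topic \
  --partitions 2 \
  --replication-factor 2
```

With a replication factor of 2, killing any one of the four brokers leaves at least one in-sync replica per partition, so a new leader can be elected cleanly.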

The Reason

Kafka replicates data over nodes (the Replicas above). These replicas may or may not be in-sync replicas (the Isr column). An in-sync replica acknowledges the latest writes to the leader within a specified time, and only ISRs are considered for clean leader election.
"Producers write data to and consumers read data from topic partition leaders.  Synchronous data replication from leaders to followers ensures that messages are copied to more than one broker. Kafka producers can set the acks configuration parameter to control when a write is considered successful." [Disaster Recovery for Multi- Datacenter Apache Kafka Deployments, Confluent]
Conclusion

But what if this were a production topic, not an integration test - how would you fix it? Well, if the data is not replicated, then when the broker(s) hosting it die, the data is lost. You can set unclean.leader.election.enable on the topic using a Kafka CLI tool, but it's a trade-off: "If we allow out-of-sync replicas to become leaders, we will have data loss and data inconsistencies." [Conduktor]
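A sketch of how that topic-level config could be changed with the kafka-configs.sh tool (again assuming the bootstrap server and topic name used throughout this post):

```shell
# Allow an out-of-sync replica to become leader for this topic.
# This restores availability at the risk of data loss.
$KAFKA_HOME/bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name test_topic \
  --add-config unclean.leader.election.enable=true
```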
