Late to the party, I read this (Exactly Once Support in Kafka), where the author quotes from this Reddit thread.
"Another objection I’ve heard to this is that it isn’t really “exactly once” but actually “effectively once”. I don’t disagree that that phase is better (though less commonly understood) but I’d point out that we’re still debating the definitions of undefined terms!
"...To my mind this is what people mean by exactly-once delivery in the context of pub/sub messaging [my emphasis]: namely that you can publish messages and they will be delivered one time exactly by one or more receiving application." [Jay Kreps on Medium]
"Kafka did not solve the Two Generals Problem... To achieve exactly-once processing semantics, we must have a closed system with end-to-end support for modeling input, output, and processor state as a single, atomic operation... Kafka provides exactly-once processing semantics because it’s a closed system. There is still a lot of difficulty in ensuring those semantics are maintained across external services, but Confluent attempts to ameliorate this through APIs and tooling. But that’s just it: it’s not exactly-once semantics in a building block that’s the hard thing, it’s building loosely coupled systems that agree on the state of the world." [BraveNewGeek]
Idempotency is the degenerate case, i.e. an example that actually belongs to another, probably simpler class (a triangle with one side of zero length is degenerate in that it's really a much simpler geometric structure, namely a line segment). Duplicates can still arrive; they just have no further effect.
Like the Monty Hall problem, the answer depends on the exact semantics of the question.
Confluent have given Kafka "effectively once" semantics by giving each producer an ID and attaching a monotonically increasing sequence number to each message (much as TCP does), letting the broker discard duplicates. See "what exactly once semantics mean in Apache Kafka, why it is a hard problem, and how the new idempotence and transactions features in Kafka enable correct exactly once stream processing using Kafka’s Streams API." [Confluent Blog]
- The producer send operation is now idempotent.
- Kafka now supports atomic writes across multiple partitions through the new transactions API... on the Consumer side, you have two options for reading transactional messages, expressed through the “isolation.level” consumer config
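To make those two bullet points concrete, here is a minimal sketch of a transactional producer using the plain Java client. The broker address, topic names and transactional.id are placeholder assumptions, not anything from the quoted post.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TransactionalProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            // Idempotent send: the broker de-duplicates retried batches using the
            // producer id plus a per-partition sequence number.
            props.put("enable.idempotence", "true");
            // Setting a transactional.id enables atomic writes across partitions/topics.
            props.put("transactional.id", "my-tx-producer-1");   // assumed id

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    // Both records commit or abort together, even though they go
                    // to different topics (and therefore different partitions).
                    producer.send(new ProducerRecord<>("topic-a", "key", "value-1"));
                    producer.send(new ProducerRecord<>("topic-b", "key", "value-2"));
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();  // real code would treat fatal errors differently
                    throw e;
                }
            }
        }
    }

On the consumer side, the "isolation.level" option mentioned above is the other half: read_committed only returns records from committed transactions, while the default, read_uncommitted, returns everything.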
The problem:
"The broker can crash after writing a message but before it sends an ack back to the producer. It can also crash before even writing the message to the topic. Since there is no way for the producer to know the nature of the failure, it is forced to assume that the message was not written successfully and to retry it. In some cases, this will cause the same message to be duplicated in the Kafka partition log, causing the end consumer to receive it more than once... The producer send operation is now idempotent."
"Lastly, ordering is only guaranteed if max.in.flight.requests.per.connection == 1:
"Producer configuration settings from the Apache Kafka documentation: max.in.flight.requests.per.connection (default: 5): The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled)."
[SO]
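Expressed as producer configuration (values illustrative only), the conservative way to keep ordering through retries looks like this; note that newer clients with the idempotent producer enabled also preserve order with up to 5 in-flight requests.

    import java.util.Properties;

    public class OrderingConfigSketch {
        // Producer settings relevant to the ordering caveat quoted above.
        static Properties orderingSafeConfig() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
            props.put("retries", Integer.toString(Integer.MAX_VALUE)); // keep retrying transient failures
            props.put("max.in.flight.requests.per.connection", "1");   // no re-ordering on retry
            props.put("enable.idempotence", "true");                   // broker-side de-duplication
            return props;
        }
    }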
But although Confluent has given Kafka exactly once semantics, these semantics do not extend outside of its closed system.
From Confluent again: "Even when they use Kafka as a source for stream processing and need to recover from a failure, they can only rewind their Kafka offset to reconsume and reprocess messages, but cannot rollback the associated state in an external system, leading to incorrect results when the state update is not idempotent."
"a consumer reads messages from a Kafka topic, some processing logic transforms those messages [and] writes the resulting messages to another Kafka." This is a closed system.
Conclusion
TCP also detects and discards duplicates (using sequence numbers, much as Kafka's idempotent producer does), but just because your application runs over TCP does not mean the application itself is idempotent. The same goes for Kafka: its exactly-once guarantees stop at the boundary of its closed system.