Wednesday, October 28, 2020

Streaming architecture


Some miscellaneous notes on streaming.

Streaming Semantics are Difficult

"The assumption 'we receive all messages' is actually challenging for a streaming system (even for a batch processing system, though presumably the problem of late-arriving data is often simply ignored here). You never know whether you actually have received 'all messages' -- because of the possibility of late-arriving data. If you receive a late-arriving message, what do you want to happen? Re-process/re-sort 'all' the messages again (now including the late-arriving message), or ignore the late-arriving message (thus computing incorrect results)? In a sense, any such global ordering achieved by 'let's sort all of them' is either very costly or best effort." [StackOverflow, Michael G. Noll
Technologist, Office of the CTO at Confluent]
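
One common compromise is to wait a bounded grace period for stragglers and then drop anything later. Kafka Streams expresses this directly in its windowing API, sketched below; the topic name and window sizes are illustrative, and the serde/config wiring needed to actually run the topology is omitted.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LateDataSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");

        events.groupByKey()
              // Count per key in 5-minute windows, accepting records that
              // arrive up to 1 minute late...
              .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
              .count()
              // ...and emit each window only once it has closed. Records later
              // than the grace period are dropped: "incorrect results" is
              // traded away for bounded waiting, exactly the choice Noll describes.
              .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
              .toStream()
              .foreach((window, count) -> System.out.println(window + " -> " + count));
    }
}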

Kafka

"Brokers are not interchangeable, and clients will need to communicate directly with the broker that contains the lead replica of each partition they produce to or consume from. You can’t place all brokers behind a single load balancer address. You need a way to route messages to specific brokers." [Confluent Blog]

"But whatever you do, do not try to build an ESB around Kafka—it is an anti-pattern that will create inflexibility and unwanted dependencies." [Kai Waehner, Confluent]

Kafka as a Database

This comes from a talk by Martin Kleppmann (Kafka as a Database).

He takes the canonical example used in SQL databases and extends it to Kafka. Let's send the two parts of a financial transaction as a single message to Kafka. We then separate it into two messages: one to a credit topic, one to a debit topic. We now have atomicity and durability but not isolation. If you think about it, this is the same as read committed isolation.
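
A minimal sketch of the splitting step, assuming hypothetical topic names (credits, debits) and a broker on localhost:9092:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransferSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String txId = "tx-42"; // one logical transaction
            // Two messages derived from the single transfer message:
            producer.send(new ProducerRecord<>("credits", txId, "{\"account\":\"A\",\"amount\":100}"));
            producer.send(new ProducerRecord<>("debits",  txId, "{\"account\":\"B\",\"amount\":100}"));
            // If we crash between these two sends, only one topic has its
            // record -- hence the dedupe discussion below.
        }
    }
}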

Note that if you crash halfway between these two operations, you will need to dedupe or use Kafka's exactly-once semantics, but you still have durability.
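
Kafka's transactions API (the machinery behind its exactly-once semantics) can make the two sends atomic. A sketch, reusing the hypothetical topics above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class AtomicSplit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("transactional.id", "transfer-splitter-1"); // enables transactions

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("credits", "tx-42", "..."));
            producer.send(new ProducerRecord<>("debits",  "tx-42", "..."));
            producer.commitTransaction(); // both records become visible together
        } catch (KafkaException e) {
            // Simplified: fatal errors (e.g. a fenced producer) should close
            // the producer rather than abort.
            producer.abortTransaction(); // neither record is exposed
        } finally {
            producer.close();
        }
    }
}

Consumers only get this atomic visibility if they set isolation.level=read_committed, which ties back to the read committed analogy above.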

Example 2: imagine two users try to register the same username. Can we enforce uniqueness? Yes: partition the topic on username, and one registration will be ordered before the other, and all consumers will agree on that order. We've achieved serializability (i.e., an isolation level) but have lost parallelism.
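
A sketch of the consuming side, assuming a hypothetical username-claims topic whose records are keyed by username (the key is what routes all claims for one name to the same partition); in practice the claimed-names state would need to be durable rather than in-memory:

import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UsernameRegistrar {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "username-registrar");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Set<String> taken = new HashSet<>(); // toy in-memory state
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("username-claims"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    // All claims for a name share a partition, so the first
                    // claim we see here really is first; every consumer of
                    // this partition will agree.
                    boolean won = taken.add(rec.key());
                    System.out.printf("%s %s %s%n", rec.value(), won ? "claimed" : "lost", rec.key());
                }
            }
        }
    }
}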

Kafka Leaders and Followers - Gotcha

Jason Gustafson (Confluent) describes how Kafka logs can be truncated.

Say the leader writes some data locally but then dies before it can replicate that data to all of its ISRs (in-sync replicas). The remaining ISRs nominate a new leader [via the "controller"], but the new leader is an ISR that did not receive the data. Those ISRs that did receive it now have to remove it as they follow the leader, since the new leader never had it. Their logs are truncated to the high watermark.

Note that clients will never see this data, as there is the notion of a high watermark: the highest offset that has been replicated to all ISRs. (Slow ISRs can be dropped from the set, but this invariant still holds; a dropped ISR can rejoin later once it has the requisite data.) The leader defines the high watermark and propagates it to followers by piggybacking it on the responses to their fetch requests. Consequently, the high watermark on a follower is not necessarily the same as that on the leader.
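
A toy model of the failover described above (not real Kafka code, just the bookkeeping):

import java.util.ArrayList;
import java.util.List;

public class TruncationToy {
    public static void main(String[] args) {
        // Offsets 0..3 exist on the old leader, but only 0..2 reached every
        // ISR, so the high watermark is 3 (the first unreplicated offset).
        List<String> oldLeader = new ArrayList<>(List.of("m0", "m1", "m2", "m3"));
        List<String> followerA = new ArrayList<>(List.of("m0", "m1", "m2", "m3")); // received m3
        List<String> followerB = new ArrayList<>(List.of("m0", "m1", "m2"));       // did not
        int highWatermark = 3; // minimum log end offset across the ISR set

        // The old leader dies; the controller picks followerB as the new
        // leader. followerA must truncate to the high watermark to match it:
        while (followerA.size() > highWatermark) {
            followerA.remove(followerA.size() - 1); // m3 is discarded
        }
        // m3 was above the high watermark, so it was never exposed to
        // consumers -- nothing a client saw has been lost.
        System.out.println("followerA after truncation: " + followerA);
    }
}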

Committed data is never lost.

Terminology

Reading documentation, references to out-of-band data kept cropping up, so I thought it worth making a note here:

"Out-of-band data is the data transferred through a stream that is independent from the main in-band data stream. An out-of-band data mechanism provides a conceptually independent channel, which allows any data sent via that mechanism to be kept separate from in-band data. 

"But using an out-of-band mechanism, the sending end can send the message to the receiving end out of band. The receiving end will be notified in some fashion of the arrival of out-of-band data, and it can read the out of band data and know that this is a message intended for it from the sending end, independent of the data from the data source." [Wikipedia]

