In our topology, we have a Databricks Spark Structured Streaming (SSS) job reading from an HDInsight Kafka cluster that is VNet-injected into our subscription. So, disaster recovery has to take into account two things:
- the Kafka cluster
- the blob storage where the data is landed by SSS.
This Microsoft document outlines the usual ways of securing your data in Kafka (a high replication factor for disk writes; a high min.insync.replicas for in-memory writes; a high acknowledgement setting on the producer, etc.). In addition, HDInsight clusters are backed by managed disks that provide "three replicas of your data", each within its own availability zone that's "equipped with independent power, cooling, and networking" [Azure docs].
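To make those knobs concrete, here is a minimal sketch of the settings involved, expressed as plain Python dicts. The values are common recommendations for durability, not our actual cluster configuration, and the dict keys mirror the real Kafka topic/producer property names.

```python
# Illustrative durability settings for a Kafka topic and its producers.
# These are commonly recommended values, not our actual cluster config.

# Topic-level: each partition is kept on three brokers, and a write is
# only considered committed once at least two replicas have it.
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
    # Don't let an out-of-date replica become leader and silently lose data.
    "unclean.leader.election.enable": False,
}

# Producer-level: wait for all in-sync replicas to acknowledge each write.
producer_config = {
    "acks": "all",
    "enable.idempotence": True,  # avoid duplicates on producer retries
}
```

Note the interplay: `acks=all` only buys you durability if `min.insync.replicas` is greater than one, otherwise the leader alone can acknowledge the write.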
So, within a region, things look peachy. Now, how do we get these Kafka messages replicating across regions? The HDInsight documentation suggests using Apache MirrorMaker, but note one critical thing it says:
"Mirroring should not be considered as a means to achieve fault-tolerance. The offset to items within a topic are different between the primary and secondary clusters, so clients cannot use the two interchangeably."
This is worrying. Indeed, there is a KIP for MirrorMaker 2 to fix this problem and others like it (differences in partitions between topics of the same name; messages entering infinite loops, etc.). Confluent is pushing its Replicator, which (it claims) is a more complete solution (there's a Docker trial for it here). And there is Brooklin, but Azure says this would be self-managed.
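The offset problem quoted above can be made concrete with a toy model. This is purely illustrative (real clusters diverge for more reasons, e.g. retention kicking in, compaction, or mirroring starting mid-stream), but it shows why an offset saved against the primary is meaningless on the mirror:

```python
# Toy model of why offsets diverge between a primary topic and its mirror.
# Purely illustrative: a "log" is just a list with a base offset.

class ToyLog:
    def __init__(self, base_offset=0):
        self.base = base_offset
        self.records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self.records.append(record)
        return self.base + len(self.records) - 1

# Primary topic where retention has already deleted offsets 0-999,
# so the surviving records start at offset 1000.
primary = ToyLog(base_offset=1000)
primary_offsets = [primary.append(m) for m in ["a", "b", "c"]]

# The mirror only ever sees the surviving records and appends them
# to its own empty log, starting from offset 0.
mirror = ToyLog(base_offset=0)
mirror_offsets = [mirror.append(m) for m in ["a", "b", "c"]]

print(primary_offsets)  # [1000, 1001, 1002]
print(mirror_offsets)   # [0, 1, 2]
```

Same records, different offsets: a consumer that committed offset 1002 on the primary cannot simply resume from 1002 on the secondary, which is exactly why the docs warn against treating the two clusters as interchangeable.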
Spark and Blobs
At the moment, all the data goes into one region. The business is aware that in the event of, say, a terrorist attack on the data centres, data will be lost. But even though the infrastructure guys have the data replicated from one region to another, note this caveat in the Azure documentation (emphasis mine):
"Because data is written asynchronously from the primary region to the secondary region, there is always a delay before a write to the primary region is copied to the secondary region. If the primary region becomes unavailable, the most recent writes may not yet have been copied to the secondary region."
But let's say that nothing so dramatic happens. Let's assume our data is there once we get our region back on its feet. In the meantime, what has been landed by SSS in the backup region is incompatible with what came before. This is because Spark stores its Kafka offsets in a checkpoint folder in ADLS Gen2. It's just as well Spark is not writing to the directory that the erstwhile live region was using. If we had been writing to a directory common to both regions, some finagling would have been needed as we pointed the Spark job at another directory, affecting the RTO if not the RPO.
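One cheap way to avoid that finagling is to bake the region into the checkpoint path from day one, so a failover job never inherits offsets that were committed against the other region's Kafka cluster. A minimal sketch (the account, container, region, and job names here are made up for illustration):

```python
# Hypothetical sketch: keep Spark checkpoint directories region-specific
# so a failover job starts with a clean checkpoint rather than inheriting
# offsets that are meaningless against the secondary Kafka cluster.
# The storage account and container names below are invented.

def checkpoint_path(region: str, job: str) -> str:
    """Build a per-region ADLS Gen2 checkpoint path for a streaming job."""
    return f"abfss://checkpoints@ouraccount.dfs.core.windows.net/{region}/{job}"

primary = checkpoint_path("uksouth", "kafka_landing")
secondary = checkpoint_path("ukwest", "kafka_landing")

# The two regions never share a checkpoint directory.
assert primary != secondary
```

The Spark job would then pass the result through the standard `.option("checkpointLocation", ...)` on its `writeStream`, with the region supplied by deployment config rather than hard-coded.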
Aside: A disaster of a different kind
In the old days of on-prem Hadoop clusters, you might see a problem where too many files were created, with the consequence that the NameNode went down. Sometimes deciphering the Azure documentation is hard, but this link says the "Maximum number of blob containers, blobs, file shares, tables, queues, entities, or messages per storage account" has "No limit".
Hopefully (caveat: I have not tested this in the wild) this problem has gone away.