Kafka Reliability & Disaster Scenarios
Kafka Reliability & Disaster Scenarios
Kafka's reliability model is deceptively simple: replicate data across brokers, elect a leader per partition, acknowledge writes to enough replicas. The nuance — and the source of every production outage — lives in the gaps between those steps. This lesson covers the three scenarios that have caused the most real-world data loss and availability failures at production scale: unclean leader elections, the many paths to acknowledged-but-lost writes, and the operational complexity of cross-datacenter replication.
Unclean Leader Elections: Trading Safety for Availability
Every Kafka partition has exactly one leader at any moment. When that leader fails, the controller must elect a new one. The clean path: elect only from the In-Sync Replica (ISR) set — the subset of replicas that have fully caught up with the leader. Replicas in the ISR are guaranteed to hold all messages acknowledged by the previous leader, so a new leader elected from this set has no data gap.
The dangerous path is the unclean election: electing a replica that is not in the ISR because no ISR replica is available. This happens when the leader fails while its followers are lagging behind — a scenario that occurs more often than intuition suggests, especially during rolling restarts or network partitions that outlast replica.lag.time.max.ms (default: 30 seconds).
unclean.leader.election.enable=true is the default in some Kafka distributions prior to 3.x. With this on, a broker that is significantly behind can win a leader election, becoming authoritative while permanently discarding every message the previous leader acknowledged but this replica never received. Producers get no error. Consumers skip ahead silently. This is silent, permanent data loss.Set unclean.leader.election.enable=false cluster-wide for any workload where data loss is unacceptable. The trade-off: if the ISR shrinks to zero (all caught-up replicas fail simultaneously), the partition becomes unavailable until at least one ISR member recovers. This is the correct behavior — unavailability is recoverable; data loss is not. Reserve true only for high-throughput logging topics where you explicitly accept loss in exchange for continuous availability.
Data Loss Scenarios: The Full Taxonomy
Unclean elections are one path to data loss. Senior operators maintain a mental map of all paths, because each requires a different mitigation:
- Producer acks=0 or acks=1: With
acks=0, the producer fires and forgets — any broker failure before the write lands on disk loses the message. Withacks=1, the leader acknowledges after writing to its own log but before any follower replicates; if the leader crashes in that window, the message is lost even though the producer received a success response. Useacks=all(equivalent toacks=-1) for any topic that matters. - ISR shrinkage below min.insync.replicas: With
acks=allandmin.insync.replicas=2, a write requires acknowledgment from at least 2 replicas. If only one replica is in the ISR (due to broker failure or lag), the producer receivesNotEnoughReplicasException. This is the correct behavior — the cluster rejects the write rather than accepting it unsafely. Monitor ISR size via JMX metrickafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. - Log truncation on follower promotion: When a follower is elected leader, it truncates its log to the last committed offset — the high-water mark. Any messages the old leader wrote after the high-water mark but before crashing are discarded. With
acks=allthis cannot happen by definition (those messages would not have been acknowledged), but withacks=1it is a real loss scenario. - Consumer offset commits ahead of processing: A consumer can commit an offset before fully processing (or durably storing) the corresponding message. If the process crashes after commit but before the side effect completes, the message is effectively lost from the consumer's perspective — it will never be reprocessed. Commit offsets only after your processing side effect is durable.
- Compacted topic tombstone expiry: Log-compacted topics delete tombstone records after
delete.retention.ms(default: 24 hours). A downstream consumer that was offline longer than this interval misses the deletes and retains stale state indefinitely.
acks=all + min.insync.replicas=N-1 (where N is the replication factor) + unclean.leader.election.enable=false gives you the strongest durability guarantee Kafka can provide. This is the baseline for any financially or contractually sensitive topic at big-tech scale. Stripe, LinkedIn, and Confluent's own internal clusters all run with this configuration.Multi-Cluster Replication: MirrorMaker 2 in Production
A single Kafka cluster, no matter how well-tuned, is a single failure domain. Network partitions, availability zone failures, and datacenter-level events require a multi-cluster topology. Kafka's native answer is MirrorMaker 2 (MM2), introduced in KIP-382 and built on the Kafka Connect framework. MM2 replaces the original MirrorMaker — which had serious offset-mapping and consumer-group replication gaps — and is the production-standard tool at LinkedIn, Confluent, and AWS MSK.
MM2 provides three key capabilities that its predecessor lacked:
- Offset translation: Consumer group offsets are replicated between clusters with a translation layer that accounts for the fact that the same logical message may have different offsets in the source and target clusters. This is essential for failover — consumers can resume from the correct position in the target cluster without replaying the entire topic.
- Topic namespace isolation: Topics are prefixed with the source cluster alias (e.g.,
us-east.paymentson theeu-westcluster), preventing collisions and making topology explicit. - Heartbeat and checkpoint topics: MM2 writes synthetic records to
mm2-heartbeatsandmm2-checkpoints— used by failover tooling (like theRemoteClusterUtilsAPI) to calculate consumer offset translation on-demand.
Active-Active vs. Active-Passive: The Operational Trade-off
MM2 supports both topologies, and the choice has significant operational consequences:
Active-passive is operationally simpler. One cluster is the source of truth; the other is a warm standby. Failover is a deliberate operational action: redirect producers to the standby, verify consumer offset translation, cut DNS. The risk is that the standby cluster is never exercised under production load, meaning failover is slower and more error-prone when you actually need it. Aim for a regular DR drill — at minimum quarterly — to validate the runbook.
Active-active routes different topic namespaces to different clusters and allows each cluster to mirror the other's namespaces. This gives you zero-RPO for reads and automatic fan-out, but introduces the hardest problem in distributed systems: avoiding message cycles (a message replicated to cluster B must not be re-replicated back to cluster A). MM2 uses topic prefixing and the replication.policy.class setting to break cycles, but this must be explicitly tested.
kafka.connect:type=MirrorSourceConnector,target=(*),topic=(*),name=replication-latency-ms. Alert when lag exceeds your RTO budget. At LinkedIn's scale (trillions of messages per day cross-datacenter), this metric is on every SRE dashboard. A sustained lag spike is usually the first sign of a network brownout or a target-cluster broker failure before the cluster itself reports degradation.Disaster Recovery Runbook: The First 15 Minutes
When the primary cluster becomes unavailable, the sequence of decisions matters as much as the technical mechanics:
- Confirm scope (minutes 0-2): Is it one broker, the entire cluster, or the network path between clusters? Check
ActiveControllerCount,OfflinePartitionsCount, and MM2 lag metrics. A single broker failure in a properly configured cluster is self-healing — do not trigger failover. - Declare incident and freeze writes (minutes 2-5): If cluster-level failure is confirmed, coordinate with the on-call to freeze writes to the primary. This prevents producers from continuing to write to a cluster whose fate is uncertain, which complicates offset reconciliation later.
- Translate consumer offsets (minutes 5-10): Use
RemoteClusterUtils.translateOffsets()or thekafka-consumer-groups.sh --reset-offsetsworkflow to map consumer group checkpoints from the primary to the DR cluster. This is the step most teams under-practice and most often get wrong under pressure. - Redirect traffic and validate (minutes 10-15): Update bootstrap server configuration in producer and consumer clients (typically via service discovery or environment config), restart consumers on the DR cluster, and validate that consumer groups are progressing and that key business metrics (e.g., payment processing rate) are recovering.