Reliability, Availability & Resilience

Failover & Disaster Recovery

Redundancy (covered in the previous lesson) answers the question "do we have spare capacity?" Failover answers the follow-up: "how do we actually switch to that spare capacity when something breaks?" And disaster recovery (DR) answers the hardest question: "how do we survive a complete site, region, or data-centre failure and come back with minimal data loss?" These three concepts are tightly coupled, and getting them right is the difference between a system that says 99.9% available and one that actually is.

Active-Passive vs Active-Active Failover

There are two fundamental ways to arrange redundant components:

Active-Passive (a.k.a. Primary-Standby): One node handles all traffic at any given time (the active node). One or more additional nodes sit idle, continuously receiving replicated state from the active node, but serving no requests (the passive / standby nodes). When the active node fails, a promotion process designates the best standby as the new active node, and traffic is redirected.

Active-Active: All nodes handle live traffic simultaneously. Capacity, load, and (in some configurations) state are shared across nodes. When one node fails, the others absorb its traffic — no explicit "promotion" step is needed because every node is already active.

[Figure: Active-Passive vs Active-Active failover topology. Active-Passive panel: the client reaches the active node (serves all traffic) through a health monitor / VIP; the passive node is standby only and receives replication. On failure the passive node is promoted, so downtime = detection + promotion time (typically 10 s to 3 min). Active-Active panel: a load balancer sends 50% of traffic to Node A and 50% to Node B, which sync / replicate with each other. On failure the remaining node absorbs all traffic, so downtime ≈ 0 (no promotion step), but write conflicts are possible and complexity is higher.]
Active-Passive: one node serves traffic, the standby promotes on failure. Active-Active: all nodes serve traffic simultaneously; a failed node's load is absorbed instantly.
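
The detection and promotion steps can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it: the `Standby` class, the `replayed_position` field, and the threshold of three consecutive failed checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    name: str
    healthy: bool
    # Hypothetical counter: how much of the primary's log this standby has applied.
    replayed_position: int


def should_fail_over(recent_checks: List[bool], threshold: int = 3) -> bool:
    """Return True only when the last `threshold` health checks all failed.

    Requiring several consecutive failures avoids promoting a standby
    because of a single dropped health probe.
    """
    if len(recent_checks) < threshold:
        return False
    return not any(recent_checks[-threshold:])


def pick_standby_to_promote(standbys: List[Standby]) -> Optional[Standby]:
    """Choose the healthy standby that has applied the most of the primary's log."""
    candidates = [s for s in standbys if s.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.replayed_position)


standbys = [
    Standby("standby-1", healthy=True, replayed_position=100),
    Standby("standby-2", healthy=True, replayed_position=120),
    Standby("standby-3", healthy=False, replayed_position=130),
]
if should_fail_over([True, False, False, False]):
    chosen = pick_standby_to_promote(standbys)
    print(chosen.name if chosen else "no healthy standby")  # standby-2
```

The threshold is the main tuning knob: a higher value reduces false failovers but lengthens the detection part of the downtime window.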

Trade-offs: Choosing Between the Two Models

Neither model is universally better. The right choice depends on your RTO, RPO, and the nature of your data:

  • Active-Passive is simpler to reason about — writes go to exactly one node, so there is no risk of write conflicts. The downside is that the standby sits idle, wasting hardware, and there is a measurable failover window (typically 10 seconds to 3 minutes depending on detection and promotion speed). MySQL Group Replication in single-primary mode, PostgreSQL with Patroni, and most relational DB configurations default to this model.
  • Active-Active eliminates the promotion delay and puts every node to work, which can raise total throughput (read throughput scales most easily; write throughput improves only when writes can be partitioned across nodes). But if two nodes accept writes for the same data simultaneously, you need a conflict-resolution strategy: avoid conflicts by routing writes for the same key to the same node (consistent hashing), accept last-write-wins semantics, or use a consensus protocol (Raft, Paxos). Cassandra, DynamoDB global tables, and CockroachDB are built for multi-region deployments in which more than one region accepts writes.
Practical rule of thumb: Use active-passive for stateful databases where write consistency is paramount. Use active-active for stateless services (API servers, cache nodes) or for databases specifically designed for it. Mixing the two in a single service (e.g., active-active app tier, active-passive DB tier) is extremely common and usually optimal.
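
Two of the conflict strategies mentioned above can be sketched briefly. This is an illustration under stated assumptions: `route_write` uses plain hash-modulo routing rather than a full consistent-hash ring (so adding or removing a node remaps most keys), and the `Versioned` record with its `timestamp_ms` and `node_id` fields is hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def route_write(key: str, nodes: List[str]) -> str:
    """Send every write for the same key to the same node (hash-modulo routing)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # write time as recorded by the accepting node
    node_id: str       # tie-breaker when two timestamps are equal


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the version with the later timestamp.

    Ties are broken by node_id so that every node picks the same winner
    regardless of the order in which it sees the two versions.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.node_id))


nodes = ["node-a", "node-b"]
print(route_write("user:42", nodes) == route_write("user:42", nodes))  # True

older = Versioned("blue", timestamp_ms=100, node_id="node-a")
newer = Versioned("green", timestamp_ms=105, node_id="node-b")
print(lww_merge(older, newer).value)  # green
```

Last-write-wins silently discards the losing write and depends on reasonably synchronised clocks, which is why it suits data where an occasional lost update is acceptable.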

Recovery Objectives: RTO and RPO

Two metrics define what "disaster recovery" means for your business:

Recovery Time Objective (RTO) — the maximum acceptable duration of an outage. How long can the system be unavailable before the business is materially harmed? A payment processor might set RTO to 30 seconds. An internal analytics dashboard might tolerate 4 hours.

Recovery Point Objective (RPO) — the maximum acceptable amount of data loss, measured in time. How much work can we afford to lose? RPO = 0 means zero data loss (every committed write must be recoverable). RPO = 1 hour means we tolerate losing up to one hour of transactions.

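[Figure: RTO and RPO illustrated on a timeline. Markers in time order: Last Backup / Checkpoint, DISASTER, Service Restored. RPO (data loss) spans last backup to disaster; RTO (downtime) spans disaster to service restored. Phases: normal operation, outage window, recovered.]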
RPO measures the data-loss window (last backup to disaster). RTO measures the outage window (disaster to full service restoration). Both are business decisions, not technical ones.

RTO and RPO are business decisions first, technical decisions second. A business must decide what data loss and what downtime it can financially or reputationally tolerate. The engineering team then selects the replication mode, backup frequency, and standby topology that satisfies those targets.
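
As a worked example, the achieved RPO and RTO of an incident fall straight out of three timestamps. The dates and the targets below (RPO of 1 hour, RTO of 4 hours) are invented for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_point: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: everything written after the last recoverable point is gone."""
    return disaster - last_recoverable_point


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Outage window: time from the disaster until service is fully restored."""
    return service_restored - disaster


def meets_targets(rpo: timedelta, rto: timedelta,
                  rpo_target: timedelta, rto_target: timedelta) -> bool:
    """Both objectives must hold; missing either one is a miss."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: nightly backup at 02:00, disaster at 09:30, restored at 10:15.
last_backup = datetime(2024, 1, 10, 2, 0)
disaster = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 10, 15)

rpo = achieved_rpo(last_backup, disaster)
rto = achieved_rto(disaster, restored)
print(rpo)  # 7:30:00
print(rto)  # 0:45:00
print(meets_targets(rpo, rto, timedelta(hours=1), timedelta(hours=4)))  # False
```

Here the 45-minute outage is well inside the RTO target, but seven and a half hours of lost data misses the RPO target: a fast restore does not compensate for an infrequent backup.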

Typical targets by business type (rough guidance):

  • Financial transactions / payments: RTO < 1 min, RPO = 0 (zero data loss, synchronous replication mandatory)
  • E-commerce checkout: RTO < 5 min, RPO < 30 sec
  • SaaS application: RTO < 30 min, RPO < 5 min
  • Internal tooling: RTO < 4 hrs, RPO < 1 hr

DR Deployment Patterns

Three deployment patterns map roughly to increasing cost and decreasing RTO/RPO:

  1. Backup and Restore — Periodic snapshots (e.g., nightly) written to a separate location. RTO hours to days, RPO equal to the snapshot interval. Cheapest; suitable only for non-critical data.
  2. Pilot Light — A minimal version of the system (database replica, configuration) runs continuously in the DR site, but application servers are stopped. On disaster, start the app servers and redirect DNS. RTO 15–60 minutes, RPO near zero for the DB (continuous replication).
  3. Warm Standby / Multi-Site Active — A scaled-down but fully operational copy of the system runs at all times in a second region. RTO seconds to minutes. The "active-active multi-region" approach is the extreme end of this: full capacity in every region, RTO near zero, but highest cost and operational complexity.
Common pitfall (untested DR): Many teams set up a standby, declare an RPO/RTO target, and never test the actual failover. The first real failover then happens during a real incident, the worst possible moment to discover the process is broken. Schedule regular DR drills (game days or chaos-engineering experiments, not only tabletop walkthroughs) in which you actually promote the standby and verify that applications reconnect, that replication lag was within RPO, and that DNS propagation completed within RTO. Netflix famously automated a related form of failure testing with Chaos Monkey, which randomly terminates production instances.
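
The choice among the three patterns can be framed as "pick the cheapest pattern whose worst-case RTO still meets the target". The sketch below assumes concrete worst-case figures taken from the upper end of the ranges above (2 days, 60 minutes, 5 minutes); these numbers are illustrative assumptions, not measurements, and should be replaced with values from your own drills.

```python
from datetime import timedelta
from typing import Optional

# Ordered cheapest first. Worst-case RTOs are assumed for illustration only.
PATTERNS = [
    ("Backup and Restore", timedelta(days=2)),
    ("Pilot Light", timedelta(minutes=60)),
    ("Warm Standby", timedelta(minutes=5)),
]


def cheapest_pattern(required_rto: timedelta) -> Optional[str]:
    """Return the cheapest pattern whose assumed worst-case RTO meets the target.

    Returns None when even warm standby is too slow, which points towards
    full active-active multi-region.
    """
    for name, worst_case_rto in PATTERNS:
        if worst_case_rto <= required_rto:
            return name
    return None


print(cheapest_pattern(timedelta(hours=4)))     # Pilot Light
print(cheapest_pattern(timedelta(minutes=10)))  # Warm Standby
print(cheapest_pattern(timedelta(seconds=10)))  # None
```

A fuller version would also check RPO, since Backup and Restore loses up to one snapshot interval of data regardless of how quickly it restores.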

Failover at the Database Layer

Database failover is the hardest part because writes must not be lost or duplicated. The key question during promotion is: how far behind was the replica at the time of failure? This gap is the replication lag, and it directly determines whether your actual RPO met your target.

To meet an RPO = 0 requirement, use synchronous replication: the primary waits for at least one replica to acknowledge a write before confirming it to the client. This adds at least one network round-trip to every commit (roughly 1–5 ms between nearby data centres, and considerably more between distant regions) but means no acknowledged write is lost on failover. PostgreSQL with synchronous_commit = on and a synchronous standby configured, MySQL Group Replication in single-primary mode, and AWS Aurora's storage layer (six copies across three Availability Zones, with a write acknowledged once four of the six have it) all implement variations of this pattern.

When you can tolerate a small RPO in exchange for lower write latency, use asynchronous replication: the primary confirms to the client immediately, and the replica catches up in the background. The trade-off is explicit: if the primary crashes before the replica catches up, any writes the replica had not yet received are lost on failover.

Key idea: RPO = 0 requires synchronous replication; RPO > 0 permits asynchronous replication. There is no way to achieve zero data loss without paying the latency cost of waiting for at least one replica to confirm every write. This is a fundamental constraint, not a tooling problem.
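
A toy in-memory model makes the difference concrete. It is a sketch only: the `Primary` and `Replica` classes are hypothetical, there is no real network, and "waiting for the replica" is modelled as a direct method call.

```python
from typing import List


class Replica:
    def __init__(self) -> None:
        self.log: List[str] = []

    def apply(self, record: str) -> None:
        self.log.append(record)


class Primary:
    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.replica = replica
        self.synchronous = synchronous
        self.log: List[str] = []      # every write acknowledged to clients
        self.pending: List[str] = []  # acknowledged but not yet shipped (async only)

    def write(self, record: str) -> str:
        self.log.append(record)
        if self.synchronous:
            # Synchronous: the replica has the record before the client gets an ack.
            self.replica.apply(record)
        else:
            # Asynchronous: ack immediately, ship in the background later.
            self.pending.append(record)
        return "ack"

    def ship_pending(self) -> None:
        """Background replication catching up (async mode)."""
        for record in self.pending:
            self.replica.apply(record)
        self.pending.clear()


def lost_on_failover(primary: Primary) -> List[str]:
    """Acknowledged writes the replica would be missing if the primary died now."""
    return [r for r in primary.log if r not in primary.replica.log]


sync_primary = Primary(Replica(), synchronous=True)
for w in ["w1", "w2", "w3"]:
    sync_primary.write(w)
print(lost_on_failover(sync_primary))  # []

async_primary = Primary(Replica(), synchronous=False)
async_primary.write("w1")
async_primary.write("w2")
async_primary.ship_pending()  # replica catches up
async_primary.write("w3")     # primary crashes before this one ships
print(lost_on_failover(async_primary))  # ['w3']
```

In the asynchronous case the client was told "ack" for w3, yet the promoted replica never saw it. That gap is exactly the achieved RPO.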

Putting It Together

A well-designed DR strategy combines several elements: a clear RTO and RPO target agreed with the business, a topology (active-passive or active-active) matched to the data consistency requirements, a replication mode (synchronous or asynchronous) matched to the RPO, regular DR drills to validate that failover actually works within the stated targets, and monitoring of replication lag as a leading indicator of RPO health. The goal is not to prevent every failure — it is to ensure that every failure has a known, tested, bounded impact.
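
Monitoring replication lag against the RPO target can be as simple as a threshold check. The sketch below assumes lag is already available in seconds from your database's own metrics; the function name and the 50% warning fraction are arbitrary choices for the example.

```python
def rpo_health(lag_seconds: float, rpo_seconds: float, warn_fraction: float = 0.5) -> str:
    """Classify current replication lag against an RPO target.

    Only meaningful for asynchronous replication with RPO > 0; an RPO of zero
    relies on synchronous replication rather than on lag alerts.
    """
    if rpo_seconds <= 0:
        raise ValueError("lag-based monitoring assumes an RPO greater than zero")
    if lag_seconds > rpo_seconds:
        return "breach"   # a failover right now would lose more data than allowed
    if lag_seconds >= warn_fraction * rpo_seconds:
        return "warning"  # still within RPO, but the margin is shrinking
    return "ok"


# Hypothetical SaaS target: RPO of 5 minutes (300 s).
print(rpo_health(10, 300))   # ok
print(rpo_health(150, 300))  # warning
print(rpo_health(301, 300))  # breach
```

Alerting at the warning level gives the team time to act before a failover would actually violate the RPO.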