DR Strategies
Every disaster recovery strategy is, at its core, a trade-off between two variables you defined in Lesson 1: RTO and RPO. The market offers four canonical patterns (backup-restore, pilot light, warm standby, and active-active) that span the full cost-vs-recovery-time spectrum. Picking the wrong one either pays for standby capacity the business does not need or leaves you with an RTO you cannot actually meet when the region goes dark. This lesson maps all four patterns with precision, shows the infrastructure behind each, and gives you the judgment to choose.
The Four Strategies at a Glance
Before diving into each, internalize the spectrum: as you move from backup-restore toward active-active, cost increases and recovery time decreases. There is no free lunch — every minute shaved off RTO is paid for in standby compute, data replication, and operational complexity. The diagram below places all four on the spectrum.
Strategy 1: Backup-Restore
What it is: All production data is backed up to durable storage (S3, GCS, Azure Blob) on a schedule. The DR region has no running infrastructure. When a disaster strikes, you provision the entire stack from scratch — IaC templates, AMIs, container images — and restore data from the latest backup. Everything is re-created on demand.
When to use it: Dev/test environments, internal tools, batch-processing workloads, and any system where the business explicitly accepts hours of downtime. A startup with a 4-hour RTO and 1-hour RPO SLA pays a fraction of what active-active would cost.
Production failure modes to know: The most common disaster in backup-restore is discovering that your backups are corrupt or incomplete only when you need them. Another is Terraform/Pulumi state drift — the IaC that worked six months ago no longer matches the running environment. And the cold provisioning time is almost always underestimated in RTO calculations: bootstrapping a fleet of 500 EC2 instances, running migrations, and warming caches routinely takes 3–4 hours even with full automation.
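One way to keep cold provisioning time honest is to write the restore down as a phase-by-phase budget and add it up. The sketch below does only that arithmetic. The phase names and durations are hypothetical placeholders, not measurements, and the model assumes the phases run strictly in sequence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    """One step of a cold restore, with an estimated duration in minutes."""
    name: str
    minutes: float


# Hypothetical estimates for illustration only; replace them with timings
# measured during your own restore drills.
COLD_RESTORE_PHASES: List[Phase] = [
    Phase("detect outage and declare disaster", 15),
    Phase("provision infrastructure from IaC", 60),
    Phase("restore data from latest backup", 90),
    Phase("run schema migrations", 20),
    Phase("warm caches and run smoke tests", 30),
]


def cold_rto_minutes(phases: List[Phase]) -> float:
    """Sum the phases, assuming they run strictly one after another."""
    return sum(phase.minutes for phase in phases)


def meets_rto(phases: List[Phase], rto_minutes: float) -> bool:
    """True if the estimated end-to-end restore fits inside the RTO."""
    return cold_rto_minutes(phases) <= rto_minutes


if __name__ == "__main__":
    total = cold_rto_minutes(COLD_RESTORE_PHASES)
    print(f"estimated cold RTO: {total / 60:.1f} hours")
    print("fits a 4-hour RTO:", meets_rto(COLD_RESTORE_PHASES, 240))
```

If the budget only fits the RTO when every phase goes perfectly, treat that as a sign the RTO is not realistic for backup-restore.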
Strategy 2: Pilot Light
What it is: The minimum core of your system runs continuously in the DR region at minimal scale: typically just the database tier with live replication, plus container images already staged in a DR-region registry but not running. Application servers, load balancers, and auto-scaling groups exist as IaC templates but are not provisioned. Failover involves spinning up the dormant compute layer and pointing DNS at it.
Analogy: A pilot light in a gas heater — the flame is tiny, consuming almost nothing, but it can ignite the full furnace in seconds. Same idea: your data is always warm, your compute is cold.
Typical sizing: Pilot light typically runs the DR database at 1/8 to 1/4 the capacity of production. For a 32-vCPU primary RDS instance, that means a 4-vCPU read replica (1/8 of production) that is promoted on failover and then scaled up before it takes full production load. The delta in ongoing cost between backup-restore and pilot light is roughly the cost of that database replica, often $200–800/month, compared with the millions that a full active-active deployment can cost.
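The sizing rule can be written down as a small calculation. This is a minimal sketch of the 1/8 to 1/4 guideline only. The function name `pilot_light_vcpus` is made up for illustration, and real instance families come in fixed sizes, so you would round the result to the nearest size your provider actually offers.

```python
import math
from fractions import Fraction


def pilot_light_vcpus(primary_vcpus: int, fraction: Fraction) -> int:
    """Size a pilot-light replica as a fraction of the primary's vCPUs.

    The fraction is restricted to the 1/8 to 1/4 range described in the
    text. The result is rounded up and never drops below one vCPU.
    """
    if primary_vcpus <= 0:
        raise ValueError("primary_vcpus must be positive")
    if not Fraction(1, 8) <= fraction <= Fraction(1, 4):
        raise ValueError("fraction must be between 1/8 and 1/4")
    return max(1, math.ceil(primary_vcpus * fraction))


if __name__ == "__main__":
    # The example from the text: a 32-vCPU primary at 1/8 capacity.
    print(pilot_light_vcpus(32, Fraction(1, 8)))
```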
Strategy 3: Warm Standby
What it is: A fully functional, scaled-down replica of production runs continuously in the DR region. The database is fully replicated (synchronous or asynchronous depending on distance), and application servers run at reduced capacity — typically 20–25% of production. On failover, Auto Scaling groups scale out to full production capacity, and DNS flips. The key difference from pilot light: the application tier is already running, just not at full scale.
RTO in practice: The dominant factor in warm-standby RTO is DNS propagation + Auto Scaling scale-out time. With low TTLs (60 seconds) and pre-warmed ASGs, RTO of 5–10 minutes is consistently achievable. That is the SLA that Google Cloud's regional failover SRE playbooks target for Tier 1 services.
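A rough way to sanity-check a warm-standby RTO is to add up the pieces. The sketch below assumes a conservative model in which failure detection, DNS TTL expiry, and scale-out happen one after another; in practice they may overlap. The function names and the example inputs are illustrative assumptions, not figures from any provider.

```python
def warm_standby_rto_seconds(detection_s: int, dns_ttl_s: int, scale_out_s: int) -> int:
    """Conservative RTO estimate: detection, then DNS TTL expiry, then scale-out."""
    for value in (detection_s, dns_ttl_s, scale_out_s):
        if value < 0:
            raise ValueError("durations must be non-negative")
    return detection_s + dns_ttl_s + scale_out_s


def scale_out_factor(standby_percent: float) -> float:
    """How many times the standby fleet must grow to reach full capacity."""
    if not 0 < standby_percent <= 100:
        raise ValueError("standby_percent must be in (0, 100]")
    return 100 / standby_percent


if __name__ == "__main__":
    # Hypothetical inputs: 90 s to detect, 60 s TTL, 300 s to scale out.
    rto = warm_standby_rto_seconds(detection_s=90, dns_ttl_s=60, scale_out_s=300)
    print(f"estimated RTO: {rto / 60:.1f} minutes")
    print(f"scale-out needed from 25% standby: {scale_out_factor(25):.0f}x")
```

The second function makes the hidden cost of a small standby visible: a fleet at 20–25% of production has to grow four to five times during the failover itself.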
The warm standby catch: You must continuously validate that the standby is actually warm — that deployments are hitting both regions, that the DB replica lag is within SLA, and that the reduced-capacity environment can actually absorb a scale-out event. Teams that deploy to production but forget to deploy to the warm standby will discover the mismatch at the worst possible moment.
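The checks described here can run as a scheduled job. The sketch below assumes you already collect, for each region, the deployed release identifier and the current database replica lag; how you fetch those values depends on your platform and is not shown. `RegionState` and `standby_problems` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionState:
    """Snapshot of one region, gathered by your own monitoring (not shown)."""
    deployed_version: str
    replica_lag_seconds: float


def standby_problems(
    primary: RegionState, standby: RegionState, max_lag_seconds: float
) -> List[str]:
    """Return human-readable reasons the standby is not actually warm."""
    problems: List[str] = []
    if standby.deployed_version != primary.deployed_version:
        problems.append(
            f"version drift: primary={primary.deployed_version} "
            f"standby={standby.deployed_version}"
        )
    if standby.replica_lag_seconds > max_lag_seconds:
        problems.append(
            f"replica lag {standby.replica_lag_seconds:.0f}s "
            f"exceeds limit {max_lag_seconds:.0f}s"
        )
    return problems


if __name__ == "__main__":
    primary = RegionState(deployed_version="2024.06.1", replica_lag_seconds=0)
    standby = RegionState(deployed_version="2024.01.3", replica_lag_seconds=420)
    for problem in standby_problems(primary, standby, max_lag_seconds=300):
        print("ALERT:", problem)
```

An empty list means the two conditions checked here hold. It says nothing about whether the standby can absorb a scale-out event, which still needs a real failover exercise.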
Strategy 4: Active-Active
What it is: Production traffic is split across two (or more) regions simultaneously. Both regions serve live user traffic at all times. There is no "DR region"; every region is primary. Failure of one region causes the load balancer or DNS layer to route all traffic to the surviving region(s) with zero manual intervention. Failover time depends on the routing layer: sub-second for Anycast, and typically 30–60 seconds for DNS health checks. This is how AWS, Google, and Netflix run their most critical services.
The hard engineering problems: Active-active is not just "run two copies." The hard problem is data consistency. With writes happening in multiple regions, you need a globally distributed database (DynamoDB Global Tables, CockroachDB, Spanner, Cassandra with multi-datacenter replication) that can handle conflict resolution. You need to decide your conflict strategy (last-write-wins, CRDTs, application-level resolution), and you need to accept that strong consistency across regions requires cross-region round-trips that will dominate your p99 latency.
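To make the CRDT option concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so replicas converge regardless of the order in which they exchange state. This is an illustration of the idea, not how any of the databases named above implement conflict resolution, and the region names are placeholders.

```python
from typing import Dict

# A G-Counter is a mapping from region name to that region's local count.
GCounter = Dict[str, int]


def increment(counter: GCounter, region: str, amount: int = 1) -> GCounter:
    """Return a new counter with `region`'s own slot increased."""
    if amount < 0:
        raise ValueError("a grow-only counter cannot be decremented")
    updated = dict(counter)
    updated[region] = updated.get(region, 0) + amount
    return updated


def merge(a: GCounter, b: GCounter) -> GCounter:
    """Combine two replicas by taking the per-region maximum."""
    return {region: max(a.get(region, 0), b.get(region, 0)) for region in a.keys() | b.keys()}


def value(counter: GCounter) -> int:
    """The counter's total is the sum of all regional slots."""
    return sum(counter.values())


if __name__ == "__main__":
    us = increment(increment({}, "us-east"), "us-east")  # two writes in one region
    eu = increment({}, "eu-west", 3)                     # concurrent writes in another
    merged = merge(us, eu)
    # Merge order does not matter, and re-merging old state changes nothing.
    assert merged == merge(eu, us)
    assert merge(merged, us) == merged
    print(value(merged))
```

Last-write-wins is simpler to operate but silently discards one of two concurrent writes. A CRDT like this keeps both, at the cost of restricting which operations the data type supports.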
Traffic routing mechanisms: Active-active typically uses one of three routing strategies: (1) latency-based routing via Route 53 or Cloudflare, sending each user to the closest region; (2) weighted routing, splitting traffic 50/50 or 70/30 between regions; or (3) Anycast routing at the network layer, used by CDNs and DNS providers themselves. On failure, health checks remove the failing region from the routing pool within 30–60 seconds for DNS-based approaches, or sub-second for BGP-based Anycast.
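The interaction between weighted routing and health checks can be sketched in a few lines. The code below is a toy model of the selection logic only: it does not call Route 53, Cloudflare, or any real DNS API, and the region names and the 70/30 weights are placeholders.

```python
import random
from typing import Dict, Set


def healthy_weights(weights: Dict[str, int], unhealthy: Set[str]) -> Dict[str, int]:
    """Drop regions that failed health checks or carry no weight."""
    return {r: w for r, w in weights.items() if r not in unhealthy and w > 0}


def pick_region(weights: Dict[str, int], unhealthy: Set[str], rng: random.Random) -> str:
    """Choose a region for one request, in proportion to the remaining weights."""
    pool = healthy_weights(weights, unhealthy)
    if not pool:
        raise RuntimeError("no healthy region available")
    regions = list(pool)
    return rng.choices(regions, weights=[pool[r] for r in regions], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = {"us-east": 70, "eu-west": 30}

    # Normal operation: traffic is split roughly 70/30.
    sample = [pick_region(weights, set(), rng) for _ in range(1000)]
    print("us-east share:", sample.count("us-east") / len(sample))

    # us-east fails its health check: every request goes to eu-west.
    failover = {pick_region(weights, {"us-east"}, rng) for _ in range(1000)}
    print("regions serving after failure:", failover)
```

Note what the model implies for capacity: once a region leaves the pool, the survivor receives 100% of traffic, so each region must be able to carry the full load on its own.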
Choosing the Right Strategy
The decision is driven by four factors: your RTO/RPO contractual SLA, the cost of downtime per minute, regulatory requirements (some financial regulations mandate warm standby or better), and your team's operational maturity. A useful framework from AWS Well-Architected:
- RTO ≥ 4 hours, RPO ≥ 1 hour: Backup-restore. Invest the savings in robust backup testing automation.
- RTO 1–4 hours, RPO 15–60 min: Pilot light. Keep your database replica and images warm; accept cold compute provisioning time.
- RTO < 30 min, RPO < 5 min: Warm standby. Pay for the reduced-capacity replica fleet and automate DNS failover + ASG scale-out.
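- RTO < 60 seconds or zero data loss (RPO = 0): Active-active. Accept the architectural constraints and cost; of the four patterns, this is the only one that can meet these targets. Note that RPO = 0 additionally requires synchronous cross-region writes, because asynchronous replication can lose writes that were in flight when the region failed.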
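The thresholds above can be encoded as a small lookup, which is handy for tagging services in an inventory. This sketch adds two assumptions of its own: requirements that fall between the listed bands (for example, a 45-minute RTO) are rounded toward the stronger strategy, and a requirement of exactly 4 hours RTO with 1 hour RPO is treated as backup-restore, matching the startup example earlier in the lesson. It considers only RTO and RPO, not cost of downtime, regulation, or team maturity.

```python
from datetime import timedelta


def choose_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Map required RTO/RPO to the cheapest of the four strategies that fits.

    Checks run from the strictest requirement to the loosest, so a
    requirement that sits in a gap between bands gets the stronger option.
    """
    if rto < timedelta(0) or rpo < timedelta(0):
        raise ValueError("RTO and RPO must be non-negative")
    if rto < timedelta(seconds=60) or rpo == timedelta(0):
        return "active-active"
    if rto < timedelta(hours=1) or rpo < timedelta(minutes=15):
        return "warm standby"
    if rto < timedelta(hours=4) or rpo < timedelta(hours=1):
        return "pilot light"
    return "backup-restore"


if __name__ == "__main__":
    examples = [
        (timedelta(hours=4), timedelta(hours=1)),
        (timedelta(hours=2), timedelta(minutes=30)),
        (timedelta(minutes=10), timedelta(minutes=1)),
        (timedelta(seconds=30), timedelta(minutes=1)),
    ]
    for rto, rpo in examples:
        print(f"RTO={rto}, RPO={rpo}: {choose_strategy(rto, rpo)}")
```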
One Critical Metric Most Teams Miss
Across all four strategies, the metric that determines whether your DR actually works is MTTR-DR: mean time to recover in a real DR event, measured from the moment the decision to fail over is made to the moment the system is serving production traffic from the DR region at full capacity. Most teams only ever measure this in a lab. Every DR strategy on paper looks better than it is in practice, because the first time you execute a real failover you discover that your Terraform state is in the wrong S3 bucket, your Route 53 TTLs were never lowered, or your warm-standby ASG is running a six-month-old AMI. Test MTTR-DR with real game days, covered in Lesson 8.
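Computing MTTR-DR is simple once you record two timestamps per exercise: when the failover decision was made and when the DR region was serving production traffic at full capacity. The sketch below averages those intervals. The function name `mttr_dr` and the sample timestamps are illustrative, not real game-day data.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each event is (failover decision time, time DR region served at full capacity).
FailoverEvent = Tuple[datetime, datetime]


def mttr_dr(events: List[FailoverEvent]) -> timedelta:
    """Mean time from the failover decision to full-capacity serving."""
    if not events:
        raise ValueError("at least one failover event is required")
    durations = []
    for decided_at, serving_at in events:
        if serving_at < decided_at:
            raise ValueError("serving time precedes decision time")
        durations.append(serving_at - decided_at)
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical game-day results.
    events = [
        (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 10)),
        (datetime(2024, 6, 1, 14, 0), datetime(2024, 6, 1, 14, 20)),
    ]
    print("MTTR-DR:", mttr_dr(events))
```

Tracking this number across successive game days shows whether your failover is getting faster or quietly regressing.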