Disaster Recovery & Multi-Region

DR Strategies

18 min Lesson 2 of 27

Every disaster recovery strategy is, at its core, a trade-off between cost and the two recovery objectives you defined in Lesson 1: RTO and RPO. There are four canonical patterns (backup-restore, pilot light, warm standby, and active-active) that span the full cost-vs-recovery-time spectrum. Picking the wrong one either wastes money on standby capacity you do not need or leaves you with an RTO you cannot actually meet when the region goes dark. This lesson maps all four patterns, shows the infrastructure behind each, and gives you the judgment to choose.

The Four Strategies at a Glance

Before diving into each, internalize the spectrum: as you move from backup-restore toward active-active, cost increases and recovery time decreases. There is no free lunch — every minute shaved off RTO is paid for in standby compute, data replication, and operational complexity. The diagram below places all four on the spectrum.

The four DR strategies mapped against cost and recovery time. Moving right and down costs more but achieves near-zero RTO/RPO.

Strategy 1: Backup-Restore

What it is: All production data is backed up to durable storage (S3, GCS, Azure Blob) on a schedule. The DR region has no running infrastructure. When a disaster strikes, you provision the entire stack from scratch — IaC templates, AMIs, container images — and restore data from the latest backup. Everything is re-created on demand.

When to use it: Dev/test environments, internal tools, batch-processing workloads, and any system where the business explicitly accepts hours of downtime. A startup with a 4-hour RTO and 1-hour RPO SLA pays a fraction of what active-active would cost.

Production failure modes to know: The most common disaster in backup-restore is discovering that your backups are corrupt or incomplete only when you need them. Another is Terraform/Pulumi state drift — the IaC that worked six months ago no longer matches the running environment. And the cold provisioning time is almost always underestimated in RTO calculations: bootstrapping a fleet of 500 EC2 instances, running migrations, and warming caches routinely takes 3–4 hours even with full automation.
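One way to keep the cold-provisioning estimate honest is to add up every recovery phase and compare the total with the RTO target. The Python sketch below does that; the phase names and durations are hypothetical placeholders chosen for illustration, not measurements.

```python
from datetime import timedelta


def estimate_rto(phases: dict[str, timedelta]) -> timedelta:
    """Sum sequential recovery phases into one end-to-end estimate."""
    return sum(phases.values(), timedelta())


def rto_shortfall(phases: dict[str, timedelta], target: timedelta) -> timedelta:
    """Return how far the estimate exceeds the target (zero if it fits)."""
    return max(estimate_rto(phases) - target, timedelta())


# Hypothetical phase durations for a cold backup-restore recovery.
phases = {
    "detect_and_decide": timedelta(minutes=20),
    "provision_infrastructure": timedelta(minutes=90),
    "restore_database": timedelta(minutes=75),
    "run_migrations": timedelta(minutes=15),
    "warm_caches_and_verify": timedelta(minutes=30),
}

total = estimate_rto(phases)                     # 230 minutes in total
gap = rto_shortfall(phases, timedelta(hours=3))  # 50 minutes over a 3-hour target
```

Listing the phases explicitly makes it harder to forget the ones that happen before and after provisioning, such as the decision to fail over and cache warm-up.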

Production pitfall: Teams that pick backup-restore because it is "cheap" often forget to budget for regular restore drills. If you have never actually restored from your backups in the DR region, you do not have a DR plan; you have a hope. A reasonable cadence for Tier 1 workloads is a restore test at least monthly. Automated restore validation (restore to a test account, run a smoke test, send a Slack alert on failure) should be a scheduled job, not a quarterly manual exercise.
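The restore-validate-alert loop can be sketched as a small Python driver. This is a minimal sketch, not a finished tool: the three steps are injected as callables, so `restore`, `smoke_test`, and `alert` are hypothetical hooks you would back with your own backup tooling and chat webhook.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    restored: bool
    smoke_test_passed: bool
    duration_seconds: float

    @property
    def ok(self) -> bool:
        return self.restored and self.smoke_test_passed


def run_restore_drill(
    restore: Callable[[], bool],
    smoke_test: Callable[[], bool],
    alert: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Restore, smoke-test, and send one alert if anything failed."""
    started = clock()
    restored = False
    passed = False
    error = ""
    try:
        restored = restore()
        # Only smoke-test a restore that reported success.
        passed = restored and smoke_test()
    except Exception as exc:  # a crashed drill counts as a failed drill
        error = f" ({exc!r})"
    result = DrillResult(restored, passed, clock() - started)
    if not result.ok:
        alert(f"restore drill FAILED after {result.duration_seconds:.0f}s{error}")
    return result
```

Recording `duration_seconds` on every run also gives you a measured restore time to compare against the RTO, rather than an estimate.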

# Example: AWS Backup plan + cross-region copy using Terraform.
# Backs up an RDS cluster hourly, copies to the DR region with 7-day retention.

resource "aws_backup_plan" "rds_hourly" {
  name = "rds-hourly-cross-region"

  rule {
    rule_name         = "hourly"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 * ? * * *)"

    lifecycle {
      delete_after = 7
    }

    copy_action {
      destination_vault_arn = "arn:aws:backup:us-west-2:${var.account_id}:backup-vault:dr-vault"
      lifecycle {
        delete_after = 7
      }
    }
  }
}

resource "aws_backup_selection" "rds" {
  name         = "rds-cluster"
  plan_id      = aws_backup_plan.rds_hourly.id
  iam_role_arn = aws_iam_role.backup.arn

  resources = [
    aws_rds_cluster.primary.arn,
  ]
}

Strategy 2: Pilot Light

What it is: The minimum core of your system runs continuously in the DR region at minimal scale: typically just the database tier with live replication, plus container images already pushed to the DR registry but not running. Application servers, load balancers, and auto-scaling groups exist as IaC templates but are not provisioned. Failover involves spinning up the dormant compute layer and pointing DNS at the DR region.

Analogy: A pilot light in a gas heater — the flame is tiny, consuming almost nothing, but it can ignite the full furnace in seconds. Same idea: your data is always warm, your compute is cold.

Scale numbers: Pilot light typically runs the DR database at 1/8 to 1/4 the capacity of production. For a 32-vCPU primary RDS instance, the pilot light runs a 4-vCPU read replica (1/8 of production) that is promoted, and usually resized, on failover. The delta in ongoing cost between backup-restore and pilot light is roughly the cost of that database replica: often $200–800/month, versus millions for active-active.

Pro practice: Keep your application AMIs and container images refreshed in the DR region continuously, not just when disaster strikes. An image that was last pushed to the DR ECR registry six months ago will not have the latest security patches. Add a step to your CI/CD pipeline that pushes images to both the primary and DR region registries on every merge to main.
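As a rough sketch of that pipeline step, the Python helper below builds the `docker tag` and `docker push` commands for every region's ECR registry. The account ID, repository name, tag, and the choice of `us-east-1` as the primary region are all hypothetical placeholders.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build a private ECR image URI for one region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def push_commands(local_image: str, account_id: str, regions: list[str],
                  repository: str, tag: str) -> list[str]:
    """Return the docker commands needed to publish one image to every region."""
    commands = []
    for region in regions:
        uri = ecr_image_uri(account_id, region, repository, tag)
        commands.append(f"docker tag {local_image} {uri}")
        commands.append(f"docker push {uri}")
    return commands


# Hypothetical values: primary in us-east-1, DR in us-west-2.
commands = push_commands(
    local_image="app:abc123",
    account_id="123456789012",
    regions=["us-east-1", "us-west-2"],
    repository="app",
    tag="abc123",
)
```

Each regional registry needs its own `docker login` before the push, which is omitted here. Driving both pushes from one list of regions means the DR registry cannot be skipped by accident.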

Strategy 3: Warm Standby

What it is: A fully functional, scaled-down replica of production runs continuously in the DR region. The database is fully replicated (synchronous or asynchronous depending on distance), and application servers run at reduced capacity — typically 20–25% of production. On failover, Auto Scaling groups scale out to full production capacity, and DNS flips. The key difference from pilot light: the application tier is already running, just not at full scale.

RTO in practice: The dominant factor in warm-standby RTO is DNS propagation plus Auto Scaling scale-out time. With low TTLs (60 seconds) and pre-warmed ASGs, an RTO of 5–10 minutes is consistently achievable, which is the range Tier 1 services typically target.

The warm standby catch: You must continuously validate that the standby is actually warm — that deployments are hitting both regions, that the DB replica lag is within SLA, and that the reduced-capacity environment can actually absorb a scale-out event. Teams that deploy to production but forget to deploy to the warm standby will discover the mismatch at the worst possible moment.
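Those three checks (deployment parity, replica lag, and enough running capacity) can be expressed as one small Python function. This is a sketch under assumptions: `RegionStatus` and its fields are hypothetical, and in practice you would fill them from your deploy system and monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionStatus:
    deployed_version: str        # e.g. a git SHA or image tag
    replica_lag_seconds: float   # 0.0 for the primary
    healthy_instances: int


def standby_problems(primary: RegionStatus, standby: RegionStatus,
                     max_lag_seconds: float, min_instances: int) -> list[str]:
    """Return every reason the standby is not actually warm (empty if fine)."""
    problems = []
    if standby.deployed_version != primary.deployed_version:
        problems.append(
            f"version drift: primary={primary.deployed_version} "
            f"standby={standby.deployed_version}"
        )
    if standby.replica_lag_seconds > max_lag_seconds:
        problems.append(
            f"replica lag {standby.replica_lag_seconds:.0f}s exceeds "
            f"{max_lag_seconds:.0f}s"
        )
    if standby.healthy_instances < min_instances:
        problems.append(
            f"only {standby.healthy_instances} healthy instances, "
            f"need {min_instances}"
        )
    return problems
```

Running a check like this on a schedule, and alerting on any non-empty result, turns "is the standby warm?" into something you measure continuously.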

# Terraform: warm standby Auto Scaling group in DR region.
# Runs at 25% capacity normally; failover bumps desired to 100%.

resource "aws_autoscaling_group" "app_dr" {
  provider = aws.dr_region

  name                = "app-warm-standby"
  min_size            = 2
  max_size            = 40
  desired_capacity    = 4           # 25% of production's 16-instance fleet
  vpc_zone_identifier = var.dr_private_subnets
  target_group_arns   = [aws_lb_target_group.app_dr.arn]

  launch_template {
    id      = aws_launch_template.app_dr.id
    version = "$Latest"
  }

  tag {
    key                 = "Role"
    value               = "warm-standby"
    propagate_at_launch = true
  }

  lifecycle {
    ignore_changes = [desired_capacity]   # failover automation overrides this
  }
}

# Failover step: scale the standby to full production capacity (16 instances).
# Run from a Lambda or runbook on failover trigger. Shifting the Route 53
# record to the DR region is a separate step that is not shown here.
# aws autoscaling set-desired-capacity \
#   --auto-scaling-group-name app-warm-standby \
#   --desired-capacity 16 \
#   --region us-west-2

Strategy 4: Active-Active

What it is: Production traffic is split across two (or more) regions simultaneously. Both regions serve live user traffic at all times. There is no "DR region"; every region is primary. When one region fails, the load balancer or DNS layer routes all traffic to the surviving region(s) with no manual intervention, within seconds for network-layer routing and within about a minute for DNS-based routing. This is how AWS, Google, and Netflix run their most critical services.

The hard engineering problems: Active-active is not just "run two copies." The hard problem is data consistency. With writes happening in multiple regions, you need a globally distributed database (DynamoDB Global Tables, CockroachDB, Spanner, or Cassandra with multi-datacenter replication) that can replicate writes across regions and resolve conflicts. You need to decide your conflict strategy (last-write-wins, CRDTs, application-level resolution), and you need to accept that strong consistency across regions requires cross-region round-trips that will dominate your p99 latency.
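To make two of those conflict strategies concrete, here is a minimal Python sketch of a last-write-wins register and a grow-only counter, one of the simplest CRDTs. The class and function names are illustrative and are not taken from any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwwRegister:
    """Last-write-wins register: the newest (timestamp, region) pair wins."""
    value: str
    timestamp: float
    region: str  # tie-breaker so every region picks the same winner

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.region))


def merge_g_counter(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Grow-only counter CRDT: keep the per-region maximum of each count."""
    return {region: max(a.get(region, 0), b.get(region, 0))
            for region in a.keys() | b.keys()}


# Two regions accept concurrent writes to the same key.
east = LwwRegister("blue", timestamp=100.0, region="us-east-1")
west = LwwRegister("green", timestamp=100.5, region="us-west-2")
winner = east.merge(west)  # "green" wins; the "blue" write is silently dropped

# Each region only increments its own slot, so merging never loses a count.
merged = merge_g_counter({"us-east-1": 3}, {"us-east-1": 2, "us-west-2": 5})
total = sum(merged.values())
```

Both merges give the same answer regardless of argument order, which is what lets regions converge. The difference is that last-write-wins converges by discarding one of the writes, while the counter keeps every increment.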

Traffic routing mechanisms: Active-active typically uses one of three routing strategies: (1) latency-based routing via Route 53 or Cloudflare, sending each user to the closest region; (2) weighted routing, splitting traffic 50/50 or 70/30 between regions; or (3) Anycast routing at the network layer, used by CDNs and DNS providers themselves. On failure, health checks remove the failing region from the routing pool within 30–60 seconds for DNS-based approaches, or typically within seconds for BGP-based Anycast once the failing region's routes are withdrawn.
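The effect of health checks on weighted routing can be modelled in a few lines of Python. This is a simplified model of the behaviour described above, not the implementation of Route 53 or any other DNS provider; the region names are placeholders.

```python
def traffic_shares(weights: dict[str, int], healthy: set[str]) -> dict[str, float]:
    """Weighted routing: unhealthy regions drop out, the rest are renormalised."""
    live = {region: weight for region, weight in weights.items()
            if region in healthy and weight > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region left to route to")
    return {region: weight / total for region, weight in live.items()}


weights = {"us-east-1": 70, "us-west-2": 30}

normal = traffic_shares(weights, healthy={"us-east-1", "us-west-2"})
failover = traffic_shares(weights, healthy={"us-west-2"})
```

In the failover case the surviving region receives all of the traffic, which is why each region in an active-active pair must be able to scale to the full production load, not just its normal share.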

Key idea: active-active is an architecture, not a feature. You cannot retrofit active-active onto a system designed for a single region. Session state must live in a store that replicates across regions (for example, DynamoDB Global Tables or Redis with cross-region replication), user-uploaded content must be in a globally replicated object store, and every service must be designed to handle requests from any region without assuming local-only data. The re-architecture cost is why active-active is reserved for systems where downtime cost exceeds $10,000–$100,000 per minute.

Infrastructure architecture for all four DR strategies. Active-active is the only pattern where both regions serve live traffic simultaneously, so failover is automatic and needs no cold start.

Choosing the Right Strategy

The decision is driven by four factors: your contractual RTO/RPO SLA, the cost of downtime per minute, regulatory requirements (some financial regulations mandate warm standby or better), and your team's operational maturity. A useful rule of thumb, loosely based on the DR guidance in AWS Well-Architected:

RTO > 4 hours, RPO > 1 hour: Backup-restore. Invest the savings in robust backup testing automation.
RTO 1–4 hours, RPO 15–60 min: Pilot light. Keep your database replica and images warm; accept cold compute provisioning time.
RTO < 30 min, RPO < 5 min: Warm standby. Pay for the reduced-capacity replica fleet and automate DNS failover + ASG scale-out.
RTO < 60 seconds or near-zero data loss (RPO ≈ 0): Active-active. Accept the architectural constraints and cost; this is the only option for mission-critical services at scale. RPO is strictly zero only when writes are replicated synchronously across regions.
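The thresholds above can be encoded as a small Python lookup. This is a sketch with three assumptions that go beyond the list: the stricter of the two objectives decides the tier, targets that fall between two listed ranges (for example an RTO of 45 minutes) are rounded up to the more capable tier, and the exact 4-hour and 1-hour boundaries are treated as pilot light.

```python
def choose_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Map RTO/RPO targets (in minutes) to the cheapest tier that meets both."""
    if rto_minutes < 1 or rpo_minutes <= 0:
        return "active-active"
    # Gaps between the listed ranges are rounded up to the stricter tier.
    if rto_minutes < 60 or rpo_minutes < 15:
        return "warm standby"
    if rto_minutes <= 240 or rpo_minutes <= 60:
        return "pilot light"
    return "backup-restore"


choose_strategy(rto_minutes=480, rpo_minutes=120)  # "backup-restore"
choose_strategy(rto_minutes=120, rpo_minutes=30)   # "pilot light"
choose_strategy(rto_minutes=10, rpo_minutes=1)     # "warm standby"
choose_strategy(rto_minutes=0.5, rpo_minutes=1)    # "active-active"
```

The function covers only the RTO/RPO factor. Downtime cost, regulation, and team maturity can still push the answer one tier in either direction.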

Pro practice (the upgrade path): These strategies are not permanent. Most mature platforms start at backup-restore, prove out their IaC automation discipline, then graduate to pilot light when an SLA tightens. Many large SaaS companies run warm standby for years before the business justifies active-active. Design each tier so it can evolve: if your pilot-light DR database is already a Multi-AZ, cross-region RDS read replica, upgrading to warm standby is mostly an ASG addition, not a re-architecture.

One Critical Metric Most Teams Miss

Across all four strategies, the metric that determines whether your DR actually works is MTTR-DR: mean time to recover in a real DR event, measured from the moment the decision to fail over is made to the moment the system is serving production traffic from the DR region at full capacity. Most teams only ever measure this in a lab. Every DR strategy on paper looks better than it is in practice, because the first time you execute a real failover you discover that your Terraform state is in the wrong S3 bucket, your Route 53 TTLs were never lowered, or your warm-standby ASG is running a six-month-old AMI. Test MTTR-DR with real game days, covered in Lesson 8.
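Computing MTTR-DR from recorded failovers is simple arithmetic, sketched below in Python. The timestamps are hypothetical game-day records, each a pair of (failover decision made, DR region serving at full capacity).

```python
from datetime import datetime, timedelta


def mttr_dr(events: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean of (full capacity reached - failover decision) across DR events."""
    if not events:
        raise ValueError("need at least one recorded failover")
    durations = [recovered - decided for decided, recovered in events]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical game-day records.
game_days = [
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 5, 10, 12)),   # 12 minutes
    (datetime(2024, 6, 11, 14, 0), datetime(2024, 6, 11, 14, 30)), # 30 minutes
]

average = mttr_dr(game_days)  # mean of 12 and 30 minutes
```

Keeping the raw pairs, rather than only the mean, also lets you see whether recovery time is improving from one game day to the next.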