Disaster Recovery & Multi-Region

Project: A DR Plan

Everything in this tutorial — RTO/RPO derivation, backup architecture, cross-region replication, failover mechanics, GitOps-driven recovery, game-day testing, cost modelling — converges here. This lesson walks you through writing a complete, production-grade DR plan for a realistic sample system. You will produce the two artefacts that actually matter in a real incident: a DR Strategy Document (the "what and why") and a Runbook (the "how, exactly, right now"). Everything else is commentary.

The Sample System

The target system is OrderFlow, a multi-tenant SaaS order-management platform. Characteristics:

150,000 daily-active businesses, peak 80,000 req/s at checkout.
Aurora PostgreSQL (PostgreSQL 16-compatible) primary cluster (db-primary.us-east-1) with two read replicas; Aurora Global Database cross-region standby in us-west-2.
Apache Kafka (MSK) for event streaming — order events, inventory updates, payment signals.
Redis 7 cluster for session state and rate-limiting counters.
Kubernetes 1.30 (EKS) in us-east-1 primary, us-west-2 warm-standby cluster (scaled to 30% capacity; scales to full in ~8 min via Karpenter).
Terraform-managed infrastructure; all K8s manifests in a GitOps repo (ArgoCD).
Revenue impact: $120,000 per minute of checkout downtime. $2,400/min for non-checkout degradation.

Derive RTO/RPO from revenue impact, not intuition. At $120k/min, even a 2-minute checkout outage costs $240k. Keeping an Aurora Global Database secondary that can be promoted in under 60 seconds costs roughly $18k/month in cross-region replication traffic, which is what 9 seconds of checkout downtime costs. At that ratio the architecture pays for itself the first time it shortens an outage.
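The arithmetic behind that claim can be checked in a few lines. This is a minimal sketch that uses only the two figures quoted in this lesson; the function names are illustrative.

```python
# Figures taken from the OrderFlow sample system described above.
CHECKOUT_COST_PER_MIN = 120_000      # USD lost per minute of checkout downtime
REPLICATION_COST_PER_MONTH = 18_000  # USD per month of cross-region replication


def downtime_cost(minutes: float, cost_per_min: float = CHECKOUT_COST_PER_MIN) -> float:
    """Revenue lost for an outage of the given length."""
    return minutes * cost_per_min


def break_even_seconds(monthly_cost: float, cost_per_min: float) -> float:
    """Seconds of avoided downtime per month that pay for monthly_cost."""
    return monthly_cost * 60 / cost_per_min


if __name__ == "__main__":
    print(downtime_cost(2))  # 240000
    print(break_even_seconds(REPLICATION_COST_PER_MONTH, CHECKOUT_COST_PER_MIN))  # 9.0
```

Running the same two functions with the $2,400/min non-checkout figure shows why lower tiers get cheaper, slower recovery designs.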

Part 1 — The DR Strategy Document

Section 1: Objectives and Tier Classification. Every service is assigned a tier, signed off by VP Engineering and CFO:

Tier 0 (Checkout & Payments): RTO < 60 s, RPO near zero (target 0). Aurora Global Database replicates asynchronously at the storage layer, typically with sub-second lag, so an unplanned failover can still lose the last in-flight writes. Active-passive with automated promotion.
Tier 1 (Order Read API, Auth, Inventory): RTO < 5 min, RPO < 30 s. Pilot-light with async replication; promoted on PagerDuty alert.
Tier 2 (Reporting, Analytics Ingestion, Webhooks): RTO < 30 min, RPO < 5 min. Warm standby; Kafka MirrorMaker 2 bridges both regions.
Tier 3 (Internal Tooling, Data Science Notebooks): RTO < 4 hr, RPO < 1 hr. Restore from S3 cross-region backup.
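One way to keep these objectives testable is to encode them as data and compare each drill result against them. The sketch below is illustrative only: the Tier class and meets_objectives helper are hypothetical names, and Tier 0's RPO is modelled as a strict zero target, the strictest reading of the objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_s: int  # recovery time objective in seconds (exclusive upper bound)
    rpo_s: int  # recovery point objective in seconds (0 means no data loss)


# Objectives copied from the tier list above.
TIERS = {
    0: Tier("Checkout & Payments", rto_s=60, rpo_s=0),
    1: Tier("Order Read API, Auth, Inventory", rto_s=5 * 60, rpo_s=30),
    2: Tier("Reporting, Analytics Ingestion, Webhooks", rto_s=30 * 60, rpo_s=5 * 60),
    3: Tier("Internal Tooling, Data Science Notebooks", rto_s=4 * 3600, rpo_s=3600),
}


def meets_objectives(tier_id: int, measured_rto_s: float, measured_rpo_s: float) -> bool:
    """Compare one drill measurement against the tier's stated objectives."""
    tier = TIERS[tier_id]
    rto_ok = measured_rto_s < tier.rto_s
    # A zero RPO target is only met by zero measured loss; otherwise use "<".
    rpo_ok = measured_rpo_s == 0 if tier.rpo_s == 0 else measured_rpo_s < tier.rpo_s
    return rto_ok and rpo_ok


if __name__ == "__main__":
    print(meets_objectives(1, measured_rto_s=240, measured_rpo_s=12))  # True
    print(meets_objectives(0, measured_rto_s=75, measured_rpo_s=0))    # False
```

Storing drill measurements next to this table makes a regression visible as a failed check rather than a surprise during an incident.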

Section 2: Architecture Overview. The diagram below shows the dual-region topology with data-flow paths and the promotion boundary.

Figure: OrderFlow dual-region DR topology. Aurora Global Database replication (asynchronous, typically sub-second lag) protects Tier 0; asynchronous Kafka MirrorMaker 2 covers Tiers 1-2; S3 Cross-Region Replication (CRR) covers Tier 3. Route 53 health-check failover DNS cuts over automatically on primary failure.

Section 3: Failure Scenarios and Declared Scope. The DR plan covers four declared disaster classes:

AZ failure — handled by multi-AZ within us-east-1; no region failover triggered. RTO: <2 min (pod rescheduling + Aurora AZ switchover).
Full region failure — us-east-1 unreachable or SLO-breaching. Triggers full DR failover to us-west-2. This is what the runbook covers.
Data corruption — logical corruption from a bad deploy or application bug. Region failover does NOT help; response is point-in-time restore from Aurora automated backups or WAL-archived S3.
Ransomware / security incident — isolate; do not failover to a potentially-infected replica. Restore from immutable S3 backup with Object Lock.

Data corruption is the most dangerous scenario. Replication faithfully copies every corrupt write to the standby, so by the time you detect the issue the replica is usually just as corrupted. Your only clean recovery path is a point-in-time restore to a timestamp before the corruption began, which is why you must test PITR quarterly, not just annually.
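Choosing the restore timestamp is the step most often improvised under pressure. The sketch below shows one way to pick it, assuming you already know when the corruption started and have read the cluster's EarliestRestorableTime and LatestRestorableTime from describe-db-clusters. The function name and the one-minute safety margin are assumptions for illustration, not values from this plan.

```python
from datetime import datetime, timedelta, timezone


def pick_restore_time(
    corruption_start: datetime,
    earliest_restorable: datetime,
    latest_restorable: datetime,
    safety_margin: timedelta = timedelta(minutes=1),  # assumed margin
) -> datetime:
    """Return a restore timestamp just before the corruption window.

    Raises ValueError when the corruption began before the oldest
    restorable point, because no clean PITR target exists in that case.
    """
    target = corruption_start - safety_margin
    if target < earliest_restorable:
        raise ValueError("corruption predates the backup retention window")
    # Never ask for a point newer than the cluster can restore to.
    return min(target, latest_restorable)


if __name__ == "__main__":
    utc = timezone.utc
    print(
        pick_restore_time(
            corruption_start=datetime(2025, 1, 10, 14, 30, tzinfo=utc),
            earliest_restorable=datetime(2025, 1, 3, 0, 0, tzinfo=utc),
            latest_restorable=datetime(2025, 1, 10, 15, 0, tzinfo=utc),
        )
    )  # 2025-01-10 14:29:00+00:00
```

The ValueError branch is the case to rehearse: it means retention was too short for the detection delay, and the plan has to fall back to older snapshots or backups.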

Part 2 — The Runbook: Full Region Failover

A runbook is a numbered, command-level procedure an on-call engineer executes at 3 AM under pressure. Every step must be unambiguous, self-contained, and reference its rollback action. The following is the OrderFlow Region Failover Runbook at its executable core.

Pre-conditions (verify before declaring DR):

PagerDuty incident created and acknowledged; Incident Commander assigned.
Confirmed: us-east-1 is unreachable or SLO-breaching for >3 consecutive minutes per Alertmanager.
Confirmed: the failure is NOT data corruption (check Aurora error logs before proceeding).
Exec sponsor notified in #incidents-exec.
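The pre-conditions above amount to a gate that should be evaluated the same way every time. The following sketch expresses that gate as a function that returns whatever is still unmet. It assumes one SLO sample per minute, with True meaning the minute breached; all names are hypothetical.

```python
from typing import List, Sequence


def trailing_breach_minutes(per_minute_breach: Sequence[bool]) -> int:
    """Length of the most recent unbroken run of breaching minutes."""
    run = 0
    for breached in reversed(per_minute_breach):
        if not breached:
            break
        run += 1
    return run


def unmet_preconditions(
    incident_acked: bool,
    commander_assigned: bool,
    per_minute_breach: Sequence[bool],
    corruption_ruled_out: bool,
    exec_notified: bool,
) -> List[str]:
    """Return the names of pre-conditions that are not yet satisfied."""
    checks = {
        "incident acknowledged": incident_acked,
        "incident commander assigned": commander_assigned,
        "breach sustained for >3 consecutive minutes":
            trailing_breach_minutes(per_minute_breach) > 3,
        "data corruption ruled out": corruption_ruled_out,
        "exec sponsor notified": exec_notified,
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    # Three breaching minutes is not "more than three", so the gate stays closed.
    print(unmet_preconditions(True, True, [True, True, True], True, True))
```

An empty list means DR may be declared; anything else tells the Incident Commander exactly what is missing.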

# ── STEP 1: Promote Aurora Global DB secondary (Tier 0) ─────────────────────
# Promotes the us-west-2 secondary to primary writer of the global cluster.
# --allow-data-loss selects an unplanned failover; without it the CLI attempts
# a switchover, which needs a healthy primary and fails when us-east-1 is down.
# Expected: 60-90 s. Do NOT skip even if primary region is fully down.

aws rds failover-global-cluster \
  --global-cluster-identifier orderflow-global \
  --target-db-cluster-identifier \
    arn:aws:rds:us-west-2:123456789012:cluster:orderflow-west \
  --allow-data-loss \
  --region us-west-2

# Poll every 5 s until status == "available"
until [ "$(aws rds describe-db-clusters \
  --db-cluster-identifier orderflow-west \
  --region us-west-2 \
  --query 'DBClusters[0].Status' --output text)" = "available" ]; do
  sleep 5
done

# ── STEP 2: Update the DB endpoint Secret (apps read from Secrets Manager) ───
aws secretsmanager update-secret \
  --secret-id orderflow/db/writer-endpoint \
  --secret-string '{"host":"orderflow-west.cluster-xyz.us-west-2.rds.amazonaws.com","port":5432}' \
  --region us-west-2

# ── STEP 3: Scale EKS standby cluster to full production capacity ────────────
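export DR_CTX="arn:aws:eks:us-west-2:123456789012:cluster/orderflow-west"

# NOTE: if replica counts are pinned in Git, the Step 4 sync reverts this scale-up;
# confirm the DR replica count is committed or that ArgoCD ignores spec.replicas.
kubectl --context "$DR_CTX" scale deployment \
  checkout-api order-read-api auth-service inventory-api payment-worker \
  --replicas=20 -n orderflow

# Karpenter will auto-provision nodes; watch until all pods are Running
kubectl --context "$DR_CTX" get pods -n orderflow -w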

# ── STEP 4: Verify ArgoCD is in sync on the standby cluster ──────────────────
argocd context dr-west
argocd app sync orderflow-checkout --prune
argocd app sync orderflow-api --prune
argocd app wait orderflow-checkout orderflow-api --health --timeout 300

# ── STEP 5: Flip Route 53 DNS (automated health-check may have done this) ────
# AliasTarget.HostedZoneId is the ELB service zone for the ALB's region
# (us-west-2 here), not the ID of your own hosted zone.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.orderflow.io",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z1H1FL5HABSF5",
          "DNSName": "orderflow-west-alb.us-west-2.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

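# ── STEP 6: Verify checkout smoke test ────────────────────────────────────────
# jq -e exits non-zero unless the expression is true, so a bad response fails loudly.
curl -sf https://api.orderflow.io/health \
  | jq -e '.status == "ok" and .region == "us-west-2" and .db == "connected" and .kafka == "connected"'
# Expected body: {"status":"ok","region":"us-west-2","db":"connected","kafka":"connected"}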

Post-failover actions are as important as the failover itself. After the above steps confirm green:

Declare the failover complete in PagerDuty and Slack; update the status page.
Reset Kafka consumer group offsets for services that could not drain cleanly. Before resuming consumers, confirm that duplicate-processing safeguards (idempotency keys) are in place.
Warm the Redis cluster: rate-limit counters reset to zero (acceptable), but session keys must be repopulated. Most apps handle this by treating a cold Redis as an empty cache; verify that the auth service tolerates a session miss gracefully (force re-login).
Monitor the promoted Aurora cluster's write throughput and the replication lag to its read replicas for the first 30 minutes.
Schedule the failback window — typically off-peak, at least 48 hours after primary-region recovery is confirmed stable.
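The idempotency safeguard mentioned in the offset-reset action can be sketched as follows. This is a simplified illustration: the idempotency_key field name is hypothetical, and the in-memory set stands in for a durable store (for example a table with a unique constraint) that a real consumer would need to survive restarts.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Wraps an event handler so replayed events are processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # stand-in for a durable dedupe store

    def process(self, event: dict) -> bool:
        """Handle the event; return False if it was a duplicate and skipped."""
        key = event["idempotency_key"]  # hypothetical field name
        if key in self._seen:
            return False
        self._handler(event)
        # Record the key only after the handler succeeded, so a crash
        # mid-handler leads to a retry instead of a silently lost event.
        self._seen.add(key)
        return True


if __name__ == "__main__":
    handled = []
    consumer = IdempotentConsumer(handled.append)
    replay = [
        {"idempotency_key": "order-1", "qty": 2},
        {"idempotency_key": "order-2", "qty": 1},
        {"idempotency_key": "order-1", "qty": 2},  # replayed after offset reset
    ]
    print([consumer.process(e) for e in replay])  # [True, True, False]
```

With this in place, rewinding offsets further than strictly necessary costs only extra reads, which is the safer direction to err in after a failover.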

Write the runbook before you need it, then automate as much as possible. Steps 1-4 above can be packaged as a single AWS Systems Manager Automation document and triggered with one click (or one PagerDuty webhook). The operator still confirms each phase, but the commands, polling logic, and error handling are pre-encoded and version-controlled — not typed from memory at 3 AM. This is the difference between a 4-minute RTO and a 40-minute one.
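The shape of such an automation (ordered phases, an operator confirmation before each one, stop on first failure, rollback hint on hand) can be illustrated in plain Python. This is not an SSM Automation document; it is a hypothetical stand-in that models the same control flow, with stub callables where the real AWS, kubectl, and argocd commands would run.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True on success
    rollback_hint: str       # what the operator should do if this step fails


def run_phases(
    steps: List[Step], confirm: Callable[[str], bool]
) -> List[Tuple[str, str, float]]:
    """Run steps in order; stop at the first declined or failed step.

    Returns a log of (step name, status, elapsed seconds).
    """
    log: List[Tuple[str, str, float]] = []
    for step in steps:
        if not confirm(step.name):
            log.append((step.name, "declined", 0.0))
            break
        started = time.monotonic()
        ok = step.run()
        elapsed = time.monotonic() - started
        log.append((step.name, "ok" if ok else "failed", elapsed))
        if not ok:
            print(f"{step.name} failed; rollback: {step.rollback_hint}")
            break
    return log


if __name__ == "__main__":
    demo = [
        Step("promote-db", lambda: True, "re-check global cluster state"),
        Step("update-secret", lambda: False, "restore previous secret version"),
        Step("scale-eks", lambda: True, "scale deployments back down"),
    ]
    for name, status, _ in run_phases(demo, confirm=lambda name: True):
        print(name, status)
```

The per-step timings in the log are what let you compare one game day with the next and see where the RTO is actually spent.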

Part 3 — Measuring Readiness Continuously

A DR plan that is only validated at annual audits is a liability document, not an operational tool. Wire these three Prometheus alerts to your existing Alertmanager stack. They keep the DR posture visible every hour of every day without requiring a human to check:

# prometheus/rules/dr-readiness.yml
groups:
  - name: dr_readiness
    rules:

      # Alert if Aurora Global DB replication lag breaches the RPO budget
      - alert: AuroraGlobalReplicationLag
        expr: aws_rds_aurora_global_db_replication_lag_milliseconds / 1000 > 10
        for: 2m
        labels:
          severity: critical
          tier: "0"
        annotations:
          summary: "Aurora Global DB lag {{ $value }}s > 10s RPO threshold"

      # Alert if Kafka MirrorMaker 2 offset lag approaches the Tier-1 RPO (30 s).
      # The threshold is a message count, so derive it from your produce rate.
      - alert: KafkaMirrorMaker2Lag
        expr: kafka_consumer_lag_sum{consumer_group="mm2-orderflow"} > 15000
        for: 5m
        labels:
          severity: warning
          tier: "1"
        annotations:
          summary: "MirrorMaker2 offset lag {{ $value }} may breach 30s RPO"

      # Alert if the DR runbook has not been re-validated within the last 90 days
      - alert: DRRunbookStaleness
        expr: time() - dr_runbook_last_validated_timestamp > 7776000  # 90 days
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "DR runbook has not been re-validated in 90 days"
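The KafkaMirrorMaker2Lag rule above compares an offset count with an RPO stated in seconds, so the threshold only means something relative to a produce rate. The sketch below shows the conversion in both directions. The produce rates and the 50% headroom factor are assumptions for illustration; this lesson does not state OrderFlow's actual message rate.

```python
def lag_seconds(offset_lag: int, produce_rate_per_s: float) -> float:
    """Approximate how far behind the mirror is, in seconds of production."""
    if produce_rate_per_s <= 0:
        raise ValueError("produce rate must be positive")
    return offset_lag / produce_rate_per_s


def lag_threshold(rpo_s: float, produce_rate_per_s: float, headroom: float = 0.5) -> int:
    """Offset count at which to alert, leaving headroom below the RPO."""
    return int(rpo_s * produce_rate_per_s * headroom)


if __name__ == "__main__":
    # At an assumed 500 msg/s, a lag of 15000 messages is the full 30 s RPO.
    print(lag_seconds(15_000, 500))  # 30.0
    # At an assumed 1000 msg/s, alerting at half the RPO gives 15000.
    print(lag_threshold(30, 1_000))  # 15000
```

If the produce rate changes substantially, for example after onboarding large tenants, the alert threshold needs to be recomputed or it silently stops tracking the RPO.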

The third alert, runbook staleness, deserves special attention. Back it with the gauge dr_runbook_last_validated_timestamp, set to the current Unix time (via the Pushgateway or a synthetic exporter) every time you complete a game day or DR drill. If the gauge ages past 90 days without a refresh, the alert fires, forcing a re-validation. Because the expression returns nothing when the series is missing, pair it with an absent() check so a never-published metric does not pass silently. This prevents the silent drift where the runbook describes a system that no longer exists.
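Publishing that timestamp at the end of a drill can be as small as the sketch below, which uses only the standard library and the Pushgateway's HTTP endpoint (PUT to /metrics/job/<job>). The gateway URL and the job name dr_drill are placeholders, and is_stale simply mirrors the alert expression so the logic can be checked offline.

```python
import time
import urllib.request
from typing import Optional

METRIC = "dr_runbook_last_validated_timestamp"  # name used by the alert rule
MAX_AGE_S = 90 * 24 * 3600                      # 7776000, as in the rule


def build_payload(validated_at: Optional[float] = None) -> bytes:
    """Render the gauge in the Prometheus text exposition format."""
    ts = int(validated_at if validated_at is not None else time.time())
    lines = [f"# TYPE {METRIC} gauge", f"{METRIC} {ts}"]
    return ("\n".join(lines) + "\n").encode("utf-8")  # body must end with a newline


def push(gateway_url: str, job: str = "dr_drill",
         validated_at: Optional[float] = None) -> int:
    """PUT the gauge to a Pushgateway; returns the HTTP status code."""
    req = urllib.request.Request(
        f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=build_payload(validated_at),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def is_stale(validated_at: float, now: float, max_age_s: int = MAX_AGE_S) -> bool:
    """Same condition as the DRRunbookStaleness alert expression."""
    return now - validated_at > max_age_s


if __name__ == "__main__":
    print(build_payload(1_700_000_000).decode())
```

Calling push("http://pushgateway.example:9091") as the last step of the game-day checklist keeps the refresh tied to an actual drill, not to a calendar reminder.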

The Completed DR Plan Checklist

Before you call a DR plan "done," verify it includes all of the following. This is the list to work through in a design review:

RTO and RPO per tier, signed off by business stakeholders.
Architecture diagram showing replication paths, DNS failover, and promotion boundaries.
Explicit scope: which failure scenarios are covered, which are out-of-scope and why.
Numbered runbook with real commands, expected outputs, and per-step rollback actions.
Data corruption recovery path (PITR procedure) separate from the region-failover runbook.
Continuous monitoring alerts tied to the RTO/RPO thresholds.
Game-day schedule (quarterly minimum for Tier 0), with results stored and compared over time.
Communication templates (status page, exec Slack, customer email) pre-written and reviewed.
Cost model: what does this architecture cost at steady state vs. during a declared DR?
Owner and review cadence: who updates this document, and when does it expire?

A DR plan is a living contract between your engineering team and the business. The architecture, the runbook, and the test results form the proof that the contract is honoured. Write it, test it, automate it, and keep it current — in that order.