Project: A DR Plan
Everything in this tutorial — RTO/RPO derivation, backup architecture, cross-region replication, failover mechanics, GitOps-driven recovery, game-day testing, cost modelling — converges here. This lesson walks you through writing a complete, production-grade DR plan for a realistic sample system. You will produce the two artefacts that actually matter in a real incident: a DR Strategy Document (the "what and why") and a Runbook (the "how, exactly, right now"). Everything else is commentary.
The Sample System
The target system is OrderFlow, a multi-tenant SaaS order-management platform. Characteristics:
- 150,000 daily-active businesses, peak 80,000 req/s at checkout.
- PostgreSQL 16 primary (db-primary.us-east-1) with two read replicas; Aurora Global Database cross-region standby in us-west-2.
- Apache Kafka (MSK) for event streaming: order events, inventory updates, payment signals.
- Redis 7 cluster for session state and rate-limiting counters.
- Kubernetes 1.30 (EKS) in us-east-1 primary, us-west-2 warm-standby cluster (scaled to 30% capacity; scales to full in ~8 min via Karpenter).
- Terraform-managed infrastructure; all K8s manifests in a GitOps repo (ArgoCD).
- Revenue impact: $120,000 per minute of checkout downtime. $2,400/min for non-checkout degradation.
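The warm-standby figure above can be turned into a quick capacity check. This is a minimal sketch that assumes request capacity scales linearly with cluster size, which the system description does not state; the function and constant names are illustrative only.

```python
PEAK_RPS = 80_000        # peak checkout load, from the system description
STANDBY_PERCENT = 30     # warm-standby size relative to full capacity


def standby_capacity_rps(peak_rps: int, standby_percent: int) -> int:
    """Requests/s the standby can absorb before scale-up (linear-scaling assumption)."""
    if not 0 <= standby_percent <= 100:
        raise ValueError("standby_percent must be between 0 and 100")
    return peak_rps * standby_percent // 100


def shortfall_rps(peak_rps: int, standby_percent: int) -> int:
    """Peak load that cannot be served until the standby finishes scaling."""
    return peak_rps - standby_capacity_rps(peak_rps, standby_percent)


if __name__ == "__main__":
    print("available now:", standby_capacity_rps(PEAK_RPS, STANDBY_PERCENT), "req/s")
    print("shortfall at peak:", shortfall_rps(PEAK_RPS, STANDBY_PERCENT), "req/s")
```

Under that assumption, a failover at peak leaves a gap between what the standby can serve immediately and what checkout demands during the ~8 min scale-up, which is worth weighing when setting recovery objectives.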
Part 1 — The DR Strategy Document
Section 1: Objectives and Tier Classification. Every service is assigned a tier, signed off by VP Engineering and CFO:
- Tier 0 (Checkout & Payments): RTO < 60 s, RPO = 0 as the target. Aurora Global DB replicates across regions asynchronously (typically under a second of lag), so a zero RPO holds for a planned switchover but must be verified for an unplanned failover. Active-passive with automated promotion.
- Tier 1 (Order Read API, Auth, Inventory): RTO < 5 min, RPO < 30 s. Pilot-light with async replication; promoted on PagerDuty alert.
- Tier 2 (Reporting, Analytics Ingestion, Webhooks): RTO < 30 min, RPO < 5 min. Warm standby; Kafka MirrorMaker 2 bridges both regions.
- Tier 3 (Internal Tooling, Data Science Notebooks): RTO < 4 hr, RPO < 1 hr. Restore from S3 cross-region backup.
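To make the tier objectives concrete for a sign-off conversation, the RTOs can be multiplied by the per-minute revenue figures from the system description. This is a rough sketch: mapping Tier 1 and Tier 2 to the non-checkout degradation rate is an assumption, and Tier 3 is left out because no revenue figure is given for internal tooling.

```python
from dataclasses import dataclass

CHECKOUT_COST_PER_MIN = 120_000   # USD per minute of checkout downtime
DEGRADED_COST_PER_MIN = 2_400     # USD per minute of non-checkout degradation


@dataclass(frozen=True)
class Tier:
    name: str
    rto_seconds: int
    cost_per_min: int  # assumed revenue rate while this tier is down


TIERS = [
    Tier("Tier 0", 60, CHECKOUT_COST_PER_MIN),
    Tier("Tier 1", 5 * 60, DEGRADED_COST_PER_MIN),    # assumption: degradation rate
    Tier("Tier 2", 30 * 60, DEGRADED_COST_PER_MIN),   # assumption: degradation rate
]


def worst_case_exposure(tier: Tier) -> float:
    """Revenue at risk (USD) if recovery takes the full RTO."""
    return tier.rto_seconds / 60 * tier.cost_per_min


if __name__ == "__main__":
    for tier in TIERS:
        print(f"{tier.name}: up to ${worst_case_exposure(tier):,.0f} per incident")
```

The numbers are upper bounds per incident, not forecasts; their main use is comparing the cost of a tighter RTO against the exposure it removes.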
Section 2: Architecture Overview. The diagram below shows the dual-region topology with data-flow paths and the promotion boundary.
Section 3: Failure Scenarios and Declared Scope. The DR plan covers four declared disaster classes:
- AZ failure: handled by multi-AZ within us-east-1; no region failover triggered. RTO: <2 min (pod rescheduling + Aurora AZ switchover).
- Full region failure: us-east-1 unreachable or SLO-breaching. Triggers full DR failover to us-west-2. This is what the runbook covers.
- Data corruption: logical corruption from a bad deploy or application bug. Region failover does NOT help; response is point-in-time restore from Aurora automated backups or WAL-archived S3.
- Ransomware / security incident: isolate; do not failover to a potentially-infected replica. Restore from immutable S3 backup with Object Lock.
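The four classes above map to four different responses, and only one of them is a region failover. A small lookup table makes that explicit. This is an illustrative sketch; the enum and function names are hypothetical and not part of any OrderFlow tooling.

```python
from enum import Enum


class Disaster(Enum):
    AZ_FAILURE = "AZ failure"
    REGION_FAILURE = "Full region failure"
    DATA_CORRUPTION = "Data corruption"
    SECURITY_INCIDENT = "Ransomware / security incident"


class Response(Enum):
    MULTI_AZ_RECOVERY = "Recover inside us-east-1 via multi-AZ"
    REGION_FAILOVER = "Fail over to us-west-2 (region failover runbook)"
    POINT_IN_TIME_RESTORE = "Point-in-time restore from backups"
    ISOLATE_AND_RESTORE = "Isolate, then restore from immutable S3 backup"


# One declared response per disaster class, mirroring the scope list.
PLAYBOOK = {
    Disaster.AZ_FAILURE: Response.MULTI_AZ_RECOVERY,
    Disaster.REGION_FAILURE: Response.REGION_FAILOVER,
    Disaster.DATA_CORRUPTION: Response.POINT_IN_TIME_RESTORE,
    Disaster.SECURITY_INCIDENT: Response.ISOLATE_AND_RESTORE,
}


def response_for(disaster: Disaster) -> Response:
    """Look up the declared response for a disaster class."""
    return PLAYBOOK[disaster]


def triggers_region_failover(disaster: Disaster) -> bool:
    """True only for the class the region failover runbook covers."""
    return response_for(disaster) is Response.REGION_FAILOVER
```

Encoding the mapping this way keeps the "do not fail over" cases as visible as the failover case, which matters when the on-call engineer's first instinct is to switch regions.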
Part 2 — The Runbook: Full Region Failover
A runbook is a numbered, command-level procedure an on-call engineer executes at 3 AM under pressure. Every step must be unambiguous, self-contained, and reference its rollback action. The following is the OrderFlow Region Failover Runbook at its executable core.
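The requirement that every step carries its own rollback can be modelled directly. The sketch below is one possible shape, not the OrderFlow implementation: the `Step` fields and `run_runbook` are hypothetical names, and rolling back automatically on a failed verification is a design choice that many teams would instead leave to the Incident Commander.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]     # the command the engineer runs
    verify: Callable[[], bool]     # check that the expected result was observed
    rollback: Callable[[], None]   # how to undo this step


def run_runbook(steps: List[Step]) -> bool:
    """Run steps in order; on a failed verification, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        step.action()
        completed.append(step)
        if not step.verify():
            for done in reversed(completed):
                done.rollback()
            return False
    return True
```

A usage example with stand-in actions that only record what happened:

```python
log: List[str] = []
demo = [
    Step("freeze writes", lambda: log.append("do freeze"), lambda: True,
         lambda: log.append("undo freeze")),
    Step("promote standby", lambda: log.append("do promote"), lambda: False,
         lambda: log.append("undo promote")),
]
ok = run_runbook(demo)  # second step fails verification, so both steps are undone
```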
Pre-conditions (verify before declaring DR):
- PagerDuty incident created and acknowledged; Incident Commander assigned.
- Confirmed: us-east-1 is unreachable or SLO-breaching for >3 consecutive minutes per Alertmanager.
- Confirmed: the failure is NOT data corruption (check Aurora error logs before proceeding).
- Exec sponsor notified in #incidents-exec.
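The pre-conditions above work as a gate: all must hold before DR is declared. A minimal sketch of that gate follows, assuming breach samples arrive as (unix timestamp, is_breaching) pairs sorted by time; the sample format and all names are illustrative, not an Alertmanager API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

BREACH_THRESHOLD_SECONDS = 3 * 60  # ">3 consecutive minutes"


def trailing_breach_seconds(samples: Sequence[Tuple[float, bool]]) -> float:
    """Length of the unbroken breaching run that ends at the most recent sample."""
    if not samples or not samples[-1][1]:
        return 0.0
    end = samples[-1][0]
    start = end
    for timestamp, breaching in reversed(samples):
        if not breaching:
            break
        start = timestamp
    return end - start


@dataclass
class Preconditions:
    incident_acknowledged: bool
    commander_assigned: bool
    breach_seconds: float
    corruption_ruled_out: bool
    exec_notified: bool


def may_declare_dr(p: Preconditions) -> bool:
    """True only when every pre-condition in the list holds."""
    return (
        p.incident_acknowledged
        and p.commander_assigned
        and p.breach_seconds > BREACH_THRESHOLD_SECONDS
        and p.corruption_ruled_out
        and p.exec_notified
    )
```

Keeping the corruption check as an explicit field means the gate refuses to open when nobody has looked at the Aurora error logs, rather than defaulting to "probably fine".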
Post-failover actions are as important as the failover itself. After the above steps confirm green:
- Declare the failover complete in PagerDuty and Slack; update the status page.
- Reset Kafka consumer group offsets for services that could not drain cleanly — check for duplicate-processing safeguards (idempotency keys) before resuming consumers.
- Warm the Redis cluster: rate-limit counters reset to zero (acceptable), but session keys must be repopulated. Most apps handle this by treating a cold Redis as an empty cache; verify that the auth service tolerates a session miss gracefully (by forcing a re-login).
- Monitor the newly promoted Aurora cluster in us-west-2: watch write throughput and replication lag to its new read replicas for the first 30 minutes.
- Schedule the failback window — typically off-peak, at least 48 hours after primary-region recovery is confirmed stable.
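The idempotency-key safeguard mentioned for Kafka consumers can be sketched as follows. This assumes each event carries an `idempotency_key` field (a hypothetical name) and uses an in-memory set for brevity; a real consumer would need a durable store shared across instances.

```python
from typing import Callable, Dict, Iterable, Set

Event = Dict[str, str]


def consume_idempotently(
    events: Iterable[Event],
    handler: Callable[[Event], None],
    seen_keys: Set[str],
) -> int:
    """Apply handler once per idempotency key; return how many events were processed."""
    processed = 0
    for event in events:
        key = event["idempotency_key"]
        if key in seen_keys:
            continue  # duplicate delivered after an offset reset; skip it
        handler(event)
        seen_keys.add(key)
        processed += 1
    return processed
```

With this guard in place, rewinding a consumer group's offsets after failover replays events without applying them twice, which is the condition to confirm before resuming consumers.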
Part 3 — Measuring Readiness Continuously
A DR plan that is only validated at annual audits is a liability document, not an operational tool. Wire these three Prometheus alerts to your existing Alertmanager stack. They keep the DR posture visible every hour of every day without requiring a human to check:
The third alert — runbook staleness — deserves special attention. Wire it to a metric you update in Prometheus (via a push gateway or a synthetic exporter) every time you complete a game day or DR drill. If the metric ages past 90 days without a refresh, the alert fires, forcing a re-validation. This prevents the silent drift where the runbook describes a system that no longer exists.
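The staleness rule reduces to a timestamp comparison. The sketch below shows the logic only, in Python rather than as a Prometheus rule; the function name is illustrative, and the 90-day limit is the one stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_DRILL_AGE = timedelta(days=90)


def runbook_is_stale(last_drill: datetime, now: Optional[datetime] = None) -> bool:
    """True when the last completed drill is more than 90 days old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_drill > MAX_DRILL_AGE


if __name__ == "__main__":
    last = datetime(2025, 1, 1, tzinfo=timezone.utc)
    check_time = datetime(2025, 6, 1, tzinfo=timezone.utc)
    print("stale:", runbook_is_stale(last, check_time))
```

Both timestamps should be timezone-aware (or both naive); mixing the two raises a `TypeError` in Python, which is a useful failure mode for a check like this.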
The Completed DR Plan Checklist
Before you call a DR plan "done," verify it includes all of the following. This is the list a principal engineer at a top-tier company would use during a design review:
- RTO and RPO per tier, signed off by business stakeholders.
- Architecture diagram showing replication paths, DNS failover, and promotion boundaries.
- Explicit scope: which failure scenarios are covered, which are out-of-scope and why.
- Numbered runbook with real commands, expected outputs, and per-step rollback actions.
- Data corruption recovery path (PITR procedure) separate from the region-failover runbook.
- Continuous monitoring alerts tied to the RTO/RPO thresholds.
- Game-day schedule (quarterly minimum for Tier 0), with results stored and compared over time.
- Communication templates (status page, exec Slack, customer email) pre-written and reviewed.
- Cost model: what does this architecture cost at steady state vs. during a declared DR?
- Owner and review cadence: who updates this document, and when does it expire?
A DR plan is a living contract between your engineering team and the business. The architecture, the runbook, and the test results form the proof that the contract is honoured. Write it, test it, automate it, and keep it current — in that order.