DR Testing & Game Days
A DR plan that has never been executed is a hypothesis. Every untested runbook, every undrained failover script, every backup set that has never been restored carries a hidden assumption: that it will work under pressure, on the day it matters most. The only way to convert that assumption into a known property is to run the plan — deliberately, repeatedly, and with enough realism that gaps surface before a real incident does.
This lesson covers the full spectrum of DR validation: restore drills (proving backups are intact, restorable, and recent enough to meet your RPO), failover exercises (proving the full promotion sequence executes within your RTO), and structured game days, the big-tech practice of injecting controlled failures into production-like environments with an observer team, a clear hypothesis, and a post-game review. You have been running chaos experiments since the chaos engineering tutorial; game days extend that thinking to multi-system, time-bounded DR scenarios with explicit RTO/RPO pass/fail criteria.
The Testing Pyramid for DR
Just as unit tests run cheaply and frequently while end-to-end tests run slowly, DR testing has a pyramid: component-level restore drills at the base (frequent, automated), integration-level failover rehearsals in the middle (monthly, semi-automated), and full game days at the apex (quarterly, manual, with leadership involvement). Running only the apex is the most common mistake: teams simulate a full region failover once a year and discover on the day that the DNS TTL was never lowered, or that the IAM role for the DR automation expired.
Restore Drills: Proving Backups Are Real
A backup that has never been restored is not a backup — it is a file. Restore drills are the automated, scheduled process of taking a recent backup, restoring it to an isolated environment, running integrity checks, and measuring how long the full restore took. The output feeds directly into your RPO and RTO dashboards.
The following script runs a nightly Postgres restore drill in CI/CD (adapt the S3 paths and DB name for your stack). It restores the latest pg_dump snapshot to an ephemeral RDS instance, runs row-count sanity checks, and records elapsed time to a metrics endpoint:
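A minimal Python sketch of that drill's control flow, offered as an illustration and not as the pipeline's actual script. The restore and row-count steps are passed in as callables so the timing and pass/fail logic can be exercised without a database; in a real drill they would wrap something like a `pg_restore` subprocess call and a `SELECT count(*)` query. The names `run_restore_drill`, `DrillResult`, and `min_rows` are hypothetical, and pushing the duration to a metrics endpoint is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DrillResult:
    duration_seconds: float
    row_counts: Dict[str, int]
    failures: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


def run_restore_drill(
    restore: Callable[[], None],
    count_rows: Callable[[str], int],
    min_rows: Dict[str, int],
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Time a restore, then run row-count sanity checks on the restored copy."""
    started = clock()
    restore()  # assumed to block until the restore has finished
    duration = clock() - started

    row_counts: Dict[str, int] = {}
    failures: List[str] = []
    for table, minimum in min_rows.items():
        actual = count_rows(table)
        row_counts[table] = actual
        if actual < minimum:
            failures.append(f"{table}: {actual} rows, expected at least {minimum}")
    return DrillResult(duration, row_counts, failures)


if __name__ == "__main__":
    # Stand-ins for the real restore and query steps (hypothetical values).
    fake_counts = {"orders": 1200, "users": 0}
    result = run_restore_drill(
        restore=lambda: None,
        count_rows=lambda table: fake_counts[table],
        min_rows={"orders": 1000, "users": 1},
    )
    print(f"dr_restore_duration_seconds={result.duration_seconds:.3f}")
    print("passed" if result.passed else f"failed: {result.failures}")
```

Injecting the clock keeps the measurement testable; a monotonic clock is used by default so wall-clock adjustments during a long restore do not distort the duration.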
Track dr_restore_duration_seconds as a direct SLI for the restore portion of your RTO. If that budget is 30 minutes and restore time is trending toward 28 minutes, a violation is close, and you will see it in the graph before a real incident forces the issue. Set a Grafana alert on dr_restore_duration_seconds > (rto_seconds * 0.8). RPO is a separate check: it is bounded by the age of the newest restorable backup, not by how long the restore takes.
Failover Exercises: Timing the Full Sequence
A failover exercise is more invasive than a restore drill. You are executing the entire promotion sequence — DNS cutover, database promotion, load-balancer re-pointing, readiness checks — against a staging or shadow environment that mirrors production topology. The goal is to measure end-to-end elapsed time from "failure declared" to "traffic successfully serving from DR region," then compare it against your RTO contract.
For a Kubernetes-based stack with Argo CD managing GitOps state, a failover exercise typically involves the following sequence. Automate it with a runbook script that records timestamps at each step:
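A minimal sketch of such a timestamping runbook runner, in Python. It assumes each step (DNS cutover, database promotion, load-balancer re-pointing, readiness checks) is wrapped in a callable that blocks until the step completes; the function and type names are hypothetical, and the no-op steps in the demo stand in for real automation.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepTiming:
    name: str
    started_at: float       # seconds after "failure declared"
    elapsed_seconds: float  # how long this step took


def run_failover_runbook(
    steps: List[Tuple[str, Callable[[], None]]],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[StepTiming], float, bool]:
    """Run steps in order, timing each; return (timings, total, within_rto)."""
    timings: List[StepTiming] = []
    declared_at = clock()  # the moment "failure declared"
    for name, action in steps:
        step_start = clock()
        action()  # an exception here aborts the exercise, which is itself a finding
        step_end = clock()
        timings.append(StepTiming(name, step_start - declared_at, step_end - step_start))
    total = clock() - declared_at
    return timings, total, total <= rto_seconds


if __name__ == "__main__":
    noop = lambda: None  # placeholder for real step automation
    timings, total, ok = run_failover_runbook(
        [("dns_cutover", noop), ("db_promotion", noop),
         ("lb_repoint", noop), ("readiness_checks", noop)],
        rto_seconds=600,
    )
    for t in timings:
        print(f"{t.name}: +{t.started_at:.3f}s, took {t.elapsed_seconds:.3f}s")
    print(f"total={total:.3f}s within_rto={ok}")
```

Per-step timings matter as much as the total: they show which step consumes the RTO budget and therefore where automation effort pays off first.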
Game Days: Structured Chaos Under Observation
A game day is a bounded, observed, hypothesis-driven failure experiment at system scope. It differs from a failover drill in three ways: (1) the scenario is injected into a production-like environment, often without every participant knowing the exact failure mode in advance; (2) an observer team (engineers not directly on the response team) documents timeline, decisions, and gaps; (3) there is a formal hypothesis — "We believe our system will achieve RTO < 10 minutes during a full us-east-1 AZ failure, because our failover runbook was validated last quarter" — and a clear pass/fail criterion evaluated after the exercise.
A well-run game day follows this structure:
- Pre-game brief (30 min): Publish the scenario (or a sanitized version of it), confirm the blast radius is contained to the test environment, brief the observer team, assign a timekeeper, and agree on the abort condition (a specific observable that means "stop immediately").
- Failure injection (varies): Inject the failure using your chaos tooling (chaos-mesh, AWS Fault Injection Simulator, or manual action). The injection team does not help the response team.
- Response window: The on-call team responds exactly as they would during a real incident: Slack war room, runbooks, escalation paths. The observer team records every action with a wall-clock timestamp.
- Halt and measure: At the agreed end condition (service restored, RTO window expired, or abort trigger hit), the injection team stops the experiment. Measure actual RTO and RPO from the observer timeline.
- Post-game review (60–90 min same day): Work through the observer notes as a group. Identify: what worked, what was slow, what was missing (runbook gaps, missing automation, undocumented dependencies), and what surprised everyone. File action items with owners and due dates before the session ends.
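The "halt and measure" step can be sketched as a small calculation over the observer timeline. This Python example assumes three timestamps were recorded: when the failure was injected, when service was restored, and the time of the last write that survived into the recovered system. The key names are hypothetical, and starting the RTO clock at injection (and not at detection or declaration) is an assumption to agree on in the pre-game brief.

```python
from datetime import datetime
from typing import Dict


def measure_exercise(timeline: Dict[str, datetime]) -> Dict[str, float]:
    """Derive actual RTO and RPO in seconds from three observer timestamps."""
    failure = timeline["failure_injected"]
    # RTO: how long the service was unavailable.
    rto = (timeline["service_restored"] - failure).total_seconds()
    # RPO: writes made after the last recovered write and before the failure are lost.
    rpo = (failure - timeline["last_recovered_write"]).total_seconds()
    return {"rto_actual_s": rto, "rpo_actual_s": rpo}


if __name__ == "__main__":
    # Illustrative timestamps, not data from a real exercise.
    observed = {
        "last_recovered_write": datetime(2024, 1, 15, 13, 56, 53),
        "failure_injected": datetime(2024, 1, 15, 14, 0, 0),
        "service_restored": datetime(2024, 1, 15, 14, 12, 23),
    }
    print(measure_exercise(observed))
```

With these illustrative timestamps the measured RTO is 743 seconds and the measured RPO is 187 seconds, which can then be compared against the targets stated in the hypothesis.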
k6 for Load-Testing the DR Region Under Exercise
A failover exercise that only checks health endpoints proves availability but not capacity. The DR region may serve traffic correctly but collapse under production load if it was undersized. As part of every failover drill, run a 5-minute load test against the DR endpoint immediately after it passes the health check:
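The following is not a k6 script. It is a minimal Python sketch of the pass/fail evaluation that such a load test's thresholds encode: given per-request latency and success samples collected against the DR endpoint, compute the 95th-percentile latency and the error rate and compare them with limits. The function name and the default limits (500 ms, 1% errors) are hypothetical and should be replaced with the figures your production SLOs actually require.

```python
from typing import Dict, List, Tuple, Union


def evaluate_load_test(
    samples: List[Tuple[float, bool]],
    p95_limit_ms: float = 500.0,
    max_error_rate: float = 0.01,
) -> Dict[str, Union[float, bool]]:
    """Evaluate (latency_ms, succeeded) samples against capacity thresholds."""
    if not samples:
        raise ValueError("no samples recorded")
    latencies = sorted(latency for latency, _ in samples)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed in integers.
    rank = (95 * len(latencies) + 99) // 100
    p95 = latencies[rank - 1]
    error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "passed": p95 <= p95_limit_ms and error_rate <= max_error_rate,
    }


if __name__ == "__main__":
    # Synthetic samples: 100 successful requests with latencies 1..100 ms.
    samples = [(float(i), True) for i in range(1, 101)]
    print(evaluate_load_test(samples))
```

Evaluating both latency and error rate matters here: an undersized DR region often keeps answering health checks while tail latency and failures climb under real load.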
Tracking Findings and Closing the Loop
A game day that generates findings but no tracked action items is a ceremony, not an improvement loop. A common practice at large engineering organizations is for each DR exercise to produce a Corrective Action Plan (CAP): for example, a Jira or Linear epic with child tickets for each gap found. The CAP is reviewed at the next quarterly DR review. RTO and RPO measurements from each exercise are recorded as time-series data points; a regression, meaning measured RTO increasing quarter over quarter, triggers an immediate investigation, just as an SLO burn-rate alert would.
Emit a structured result record (for example, { "exercise_id": "...", "scenario": "...", "rto_target_s": 600, "rto_actual_s": 743, "rpo_target_s": 300, "rpo_actual_s": 187, "pass": false, "gaps": [...] }) at the end of every run. Ingest this into your observability platform so you can graph RTO/RPO trends across exercises. When you present to leadership, showing a chart of five consecutive game days where RTO improved from 18 minutes to 7 minutes is far more persuasive than a narrative report.
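A minimal Python sketch of building that record and flagging a regression between consecutive exercises. The field names follow the example record; the helper names `build_result` and `rto_regressed`, the sample identifiers, and the rule that "pass" requires both targets to be met are assumptions made for illustration.

```python
import json
from typing import Dict, List


def build_result(
    exercise_id: str,
    scenario: str,
    rto_target_s: float,
    rto_actual_s: float,
    rpo_target_s: float,
    rpo_actual_s: float,
    gaps: List[str],
) -> Dict[str, object]:
    """Assemble one exercise record; it passes only if both targets are met."""
    return {
        "exercise_id": exercise_id,
        "scenario": scenario,
        "rto_target_s": rto_target_s,
        "rto_actual_s": rto_actual_s,
        "rpo_target_s": rpo_target_s,
        "rpo_actual_s": rpo_actual_s,
        "pass": rto_actual_s <= rto_target_s and rpo_actual_s <= rpo_target_s,
        "gaps": list(gaps),
    }


def rto_regressed(history: List[Dict[str, object]]) -> bool:
    """True when the latest exercise's RTO is worse than the previous one's."""
    if len(history) < 2:
        return False
    return history[-1]["rto_actual_s"] > history[-2]["rto_actual_s"]


if __name__ == "__main__":
    # Hypothetical identifiers and gap text, for illustration only.
    record = build_result("example-exercise", "example scenario",
                          600, 743, 300, 187, ["example gap"])
    print(json.dumps(record))
```

Emitting one JSON line per exercise keeps ingestion simple, and a check like `rto_regressed` can run on the accumulated records right after each game day.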
DR testing is the discipline that converts the earlier lessons in this tutorial — replication, failover mechanics, GitOps-driven recovery — from architectural diagrams into operationally proven capabilities. Without it, every RTO/RPO claim is a promise. With it, those claims are measurements. Run restore drills daily, failover exercises monthly, and game days quarterly. Treat each exercise as an investment: the cost is a few engineering-hours of controlled disruption; the return is eliminating the catastrophic surprise that a real disaster would otherwise deliver.