Disaster Recovery & Multi-Region

DR Testing & Game Days

18 min Lesson 8 of 27

A DR plan that has never been executed is a hypothesis. Every untested runbook, every unexercised failover script, every backup set that has never been restored carries a hidden assumption: that it will work under pressure, on the day it matters most. The only way to convert that assumption into a known property is to run the plan deliberately, repeatedly, and with enough realism that gaps surface before a real incident does.

This lesson covers the full spectrum of DR validation: restore drills (proving backups are intact, recent enough to meet your RPO, and restorable in a known amount of time), failover exercises (proving the full promotion sequence executes within your RTO), and structured game days, the big-tech practice of injecting controlled failures into production-like environments with an observer team, a clear hypothesis, and a post-game review. You have been running chaos experiments since the chaos engineering tutorial; game days extend that thinking to multi-system, time-bounded DR scenarios with explicit RTO/RPO pass/fail criteria.

The Testing Pyramid for DR

Just as unit tests run cheaply and frequently while end-to-end tests run slowly and less often, DR testing has a pyramid: component-level restore drills at the base (frequent, automated), integration-level failover rehearsals in the middle (monthly, semi-automated), and full game days at the apex (quarterly, manual, with leadership involvement). Running only the apex is the most common mistake: teams simulate a full region failover once a year and discover on the day that the DNS TTL was never lowered, or that the credentials behind the DR automation's IAM role had expired.

Figure: The DR testing pyramid, with high-frequency automated restore drills at the base, monthly failover exercises in the middle, and quarterly full-scenario game days at the apex.
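
The cadence in the pyramid can be enforced mechanically. The following is a minimal sketch, not a prescribed tool: the function names (max_age_days, drill_status) and the day limits (1, 31, 92) are assumptions chosen to match "daily, monthly, quarterly", and the "days since last run" input is assumed to come from wherever you record exercise history.

```shell
#!/usr/bin/env bash
# Hypothetical cadence table: maximum allowed days between exercises per tier.
max_age_days() {
  case "$1" in
    restore)  echo 1 ;;    # restore drills: daily
    failover) echo 31 ;;   # failover exercises: monthly
    gameday)  echo 92 ;;   # game days: quarterly
    *)        return 1 ;;
  esac
}

# drill_status TIER DAYS_SINCE_LAST_RUN
# Prints "overdue" if the tier has gone longer than its limit, else "ok".
drill_status() {
  local tier="$1" days_since="$2" limit
  limit=$(max_age_days "$tier") || { echo "unknown tier: $tier" >&2; return 2; }
  if [ "$days_since" -gt "$limit" ]; then
    echo "overdue"
  else
    echo "ok"
  fi
}

drill_status restore 0     # prints: ok
drill_status failover 45   # prints: overdue
drill_status gameday 92    # prints: ok
```

A check like this could run in the same scheduler as the drills, so that a skipped tier shows up as a failing job rather than as a surprise during the annual exercise.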

Restore Drills: Proving Backups Are Real

A backup that has never been restored is not a backup — it is a file. Restore drills are the automated, scheduled process of taking a recent backup, restoring it to an isolated environment, running integrity checks, and measuring how long the full restore took. The output feeds directly into your RPO and RTO dashboards.

The following script sketches a nightly Postgres restore drill for a CI/CD scheduler (adapt the S3 path and database name to your stack). It restores the latest gzip-compressed, plain-format pg_dump backup into an ephemeral Postgres container, runs a row-count sanity check, and pushes the elapsed time and the pass/fail result to a metrics endpoint:

#!/usr/bin/env bash
# dr-restore-drill.sh — nightly Postgres restore drill
# Usage: called by a scheduled GitHub Actions job or Jenkins cron

set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_BUCKET="s3://my-dr-backups/postgres"
DB_NAME="appdb"
METRICS_URL="https://pushgateway.internal/metrics/job/dr_drill"

echo "[drill] Starting restore drill at $TIMESTAMP"
START=$(date +%s)

# 1. Find the most recent backup manifest
LATEST=$(aws s3 ls "$BACKUP_BUCKET/" \
  | sort \
  | tail -1 \
  | awk '{print $4}')
echo "[drill] Restoring $LATEST"

# 2. Download
aws s3 cp "$BACKUP_BUCKET/$LATEST" /tmp/restore.dump.gz

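# 3. Spin up an ephemeral Postgres container (or RDS snapshot restore)
CONTAINER="dr-drill-pg-$TIMESTAMP"
docker run -d \
  --name "$CONTAINER" \
  -e POSTGRES_PASSWORD=drillpass \
  -e POSTGRES_DB="$DB_NAME" \
  postgres:16
# Always remove the container, even if a later step fails under `set -e`
trap 'docker rm -f "$CONTAINER" >/dev/null 2>&1 || true' EXIT

# Wait (up to ~60s) until Postgres accepts TCP connections instead of
# sleeping a fixed time; the image's init-phase server does not listen on TCP
for _ in $(seq 1 60); do
  docker exec "$CONTAINER" pg_isready -q -h 127.0.0.1 -U postgres && break
  sleep 1
done

# 4. Restore (ON_ERROR_STOP makes psql exit non-zero on the first SQL error)
zcat /tmp/restore.dump.gz \
  | docker exec -i "$CONTAINER" \
      psql -v ON_ERROR_STOP=1 -U postgres -d "$DB_NAME"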

# 5. Sanity checks
ROW_COUNT=$(docker exec "dr-drill-pg-$TIMESTAMP" \
  psql -U postgres -d "$DB_NAME" -tAc "SELECT COUNT(*) FROM orders;")

if [ "$ROW_COUNT" -lt 1000 ]; then
  echo "[drill] FAIL: orders table has $ROW_COUNT rows (expected >= 1000)"
  EXIT_CODE=1
else
  echo "[drill] PASS: $ROW_COUNT rows in orders"
  EXIT_CODE=0
fi
# The success gauge is the inverse of the exit code: 1 = pass, 0 = fail
SUCCESS=$((1 - EXIT_CODE))

# 6. Measure and push metric
END=$(date +%s)
ELAPSED=$((END - START))
echo "[drill] Restore elapsed: ${ELAPSED}s"

cat <<EOF | curl --data-binary @- "$METRICS_URL"
# HELP dr_restore_duration_seconds Time to restore latest DB backup
# TYPE dr_restore_duration_seconds gauge
dr_restore_duration_seconds{db="$DB_NAME"} $ELAPSED

# HELP dr_restore_success 1 if last drill passed, 0 if failed
# TYPE dr_restore_success gauge
dr_restore_success{db="$DB_NAME"} $SUCCESS
EOF
EOF

# 7. Teardown
docker rm -f "dr-drill-pg-$TIMESTAMP"

exit $EXIT_CODE
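
The script above times the restore, but it does not check how old the restored backup is, and backup age is what bounds data loss. A minimal sketch of that check follows, assuming the backup's creation time is available as epoch seconds (for example from the object listing); backup_age_check and its arguments are hypothetical names, and the timestamps are made-up fixed values so the example is reproducible.

```shell
#!/usr/bin/env bash
# backup_age_check BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Prints the backup age in seconds; returns 0 if age <= RPO, 1 otherwise.
backup_age_check() {
  local backup_epoch="$1" now_epoch="$2" rpo_s="$3"
  local age=$(( now_epoch - backup_epoch ))
  echo "$age"
  [ "$age" -le "$rpo_s" ]
}

# Example: a backup taken 20 minutes before "now", checked against a 30-minute RPO.
if age=$(backup_age_check 1700000000 1700001200 1800); then
  echo "RPO ok: newest backup is ${age}s old"
else
  echo "RPO breach: newest backup is ${age}s old"
fi
```

In a real drill the second argument would be $(date +%s), and the age could be pushed as its own gauge next to the restore duration.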

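Pipe restore drill metrics into your SLO dashboards. Treat dr_restore_duration_seconds as an SLI for your RTO, because restore time is recovery time; RPO is governed by how old the newest restorable backup is, which is worth tracking as a separate metric. If your restore budget is 30 minutes and restore time is trending toward 28 minutes, you are close to a violation, and you will see it in the graph before a real incident forces the issue. In Grafana, alert when the duration exceeds 80% of that budget, for example dr_restore_duration_seconds > (rto_seconds * 0.8).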
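
The 80% threshold is simple arithmetic, and it can also be evaluated inside the drill itself so that the job log shows a warning before the dashboard alert fires. This is a sketch under assumptions: budget_status is a hypothetical helper, and the budget is whatever recovery target you measure the restore against.

```shell
#!/usr/bin/env bash
# budget_status ELAPSED_SECONDS BUDGET_SECONDS
# Prints "ok", "warn" (above 80% of budget), or "breach" (above budget).
budget_status() {
  local elapsed="$1" budget="$2"
  # Integer math: elapsed > 0.8 * budget  <=>  elapsed * 10 > budget * 8
  if [ "$elapsed" -gt "$budget" ]; then
    echo "breach"
  elif [ $(( elapsed * 10 )) -gt $(( budget * 8 )) ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

budget_status 1200 1800   # prints: ok      (20 min against a 30 min budget)
budget_status 1680 1800   # prints: warn    (28 min against a 30 min budget)
budget_status 1900 1800   # prints: breach
```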

Failover Exercises: Timing the Full Sequence

A failover exercise is more invasive than a restore drill. You are executing the entire promotion sequence — DNS cutover, database promotion, load-balancer re-pointing, readiness checks — against a staging or shadow environment that mirrors production topology. The goal is to measure end-to-end elapsed time from "failure declared" to "traffic successfully serving from DR region," then compare it against your RTO contract.

For a Kubernetes-based stack with Argo CD managing GitOps state, a failover exercise typically involves the following sequence. Automate it with a runbook script that records timestamps at each step:

#!/usr/bin/env bash
# failover-drill.sh — Kubernetes + ArgoCD + Route53 failover exercise
# Designed to run in a DR staging environment, NOT production.

set -euo pipefail

DR_REGION="us-west-2"
PRIMARY_CLUSTER="prod-us-east-1"
DR_CLUSTER="dr-us-west-2"
HOSTED_ZONE_ID="Z1234567890"
RECORD_NAME="api.myapp.internal"
DR_LB_DNS="k8s-dr-lb-abc123.us-west-2.elb.amazonaws.com"
START=$(date +%s)
log() { echo "[$(date -u +%T)] $*"; }

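# Step 1: Isolate primary (simulate region failure)
# cordon acts on nodes, which are cluster-scoped, so it takes no namespace flag.
# It only blocks new scheduling; pods that are already running keep serving.
log "STEP 1: Isolating primary cluster"
kubectl --context "$PRIMARY_CLUSTER" cordon \
  --selector tier=backend 2>&1 || true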

# Step 2: Promote the DR Postgres replica to primary
log "STEP 2: Promoting Postgres replica in $DR_REGION"
aws rds promote-read-replica \
  --db-instance-identifier appdb-dr \
  --region $DR_REGION
# Wait for promotion
aws rds wait db-instance-available \
  --db-instance-identifier appdb-dr \
  --region $DR_REGION
T2=$(date +%s); log "DB promotion done in $((T2 - START))s"

# Step 3: Update the ArgoCD app to point at DR cluster
log "STEP 3: Syncing ArgoCD to DR cluster"
argocd app set myapp \
  --dest-server "https://$(kubectl --context $DR_CLUSTER config view \
    --minify -o jsonpath='{.clusters[0].cluster.server}')" \
  --revision main
argocd app sync myapp --timeout 300
T3=$(date +%s); log "ArgoCD sync done in $((T3 - START))s"

# Step 4: DNS cutover
log "STEP 4: Cutting over Route 53"
aws route53 change-resource-record-sets \
  --hosted-zone-id $HOSTED_ZONE_ID \
  --change-batch "{
    \"Changes\": [{
      \"Action\": \"UPSERT\",
      \"ResourceRecordSet\": {
        \"Name\": \"$RECORD_NAME\",
        \"Type\": \"CNAME\",
        \"TTL\": 30,
        \"ResourceRecords\": [{\"Value\": \"$DR_LB_DNS\"}]
      }
    }]
  }"
T4=$(date +%s); log "DNS cutover done in $((T4 - START))s"

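# Step 5: Validate (give up after a deadline so the drill cannot hang forever)
RTO_TARGET_S=900
HEALTH_DEADLINE=$((START + 2 * RTO_TARGET_S))
log "STEP 5: Health-checking DR endpoint"
until curl -sf "https://$RECORD_NAME/healthz" | grep -q '"status":"ok"'; do
  if [ "$(date +%s)" -ge "$HEALTH_DEADLINE" ]; then
    log "ABORT: DR endpoint still unhealthy after $((2 * RTO_TARGET_S))s"
    exit 1
  fi
  sleep 5
done
T5=$(date +%s)
TOTAL=$((T5 - START))
log "COMPLETE. Total failover time: ${TOTAL}s"

if [ "$TOTAL" -le "$RTO_TARGET_S" ]; then
  log "RTO PASS: ${TOTAL}s <= ${RTO_TARGET_S}s target"
else
  log "RTO FAIL: ${TOTAL}s > ${RTO_TARGET_S}s target"
  exit 1
fi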
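
The total alone does not say where the time went. As a sketch, assuming each step's completion time is captured as an epoch timestamp (as T2 through T5 are above), the following hypothetical helpers turn those timestamps into per-step durations and name the slowest step. The numbers in the example are invented.

```shell
#!/usr/bin/env bash
# step_durations START T1 T2 ...   (epoch seconds, in chronological order)
# Prints one "step N: Ds" line per step.
step_durations() {
  local prev="$1" n=1 t
  shift
  for t in "$@"; do
    echo "step $n: $(( t - prev ))s"
    prev="$t"
    n=$(( n + 1 ))
  done
}

# slowest_step START T1 T2 ...   prints the number of the longest step.
slowest_step() {
  local prev="$1" n=1 best=0 best_n=0 t d
  shift
  for t in "$@"; do
    d=$(( t - prev ))
    if [ "$d" -gt "$best" ]; then
      best="$d"
      best_n="$n"
    fi
    prev="$t"
    n=$(( n + 1 ))
  done
  echo "$best_n"
}

step_durations 1000 1010 1430 1520 1525 1600
echo "slowest: step $(slowest_step 1000 1010 1430 1520 1525 1600)"
```

With these example timestamps the second step (420 seconds) dominates, which is the kind of result that tells you which part of the runbook to automate or parallelize first.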

Pre-warm your DNS TTLs before a drill, not during it. If your current TTL is 300 seconds (5 minutes) and you drop it to 30 seconds at failover time, resolvers that already cached the record can keep serving the stale answer for up to the full 5 minutes. Lower the TTL at least one full old-TTL period before any planned failover exercise. In production DR planning, this means keeping critical DNS records at a TTL of 30–60 seconds permanently for Tier 0 services, accepting slightly higher resolver load in exchange for propagation within about a minute during a real event.
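
The timing rule can be written down as a guard that a drill script runs before it starts. This is a sketch with hypothetical function names; it assumes you record when the TTL was lowered and what the old TTL was, and it assumes resolvers honor TTLs, which some do not.

```shell
#!/usr/bin/env bash
# earliest_safe_drill LOWERED_AT_EPOCH OLD_TTL_SECONDS
# Prints the earliest epoch second by which any answer cached under the
# old TTL should have expired.
earliest_safe_drill() {
  echo $(( $1 + $2 ))
}

# ttl_prewarmed LOWERED_AT_EPOCH OLD_TTL_SECONDS NOW_EPOCH
# Returns 0 if the drill may start, 1 if old cached answers may still exist.
ttl_prewarmed() {
  [ "$3" -ge "$(earliest_safe_drill "$1" "$2")" ]
}

# Example: TTL lowered from 300s at t=1700000000; is t+120s late enough?
if ttl_prewarmed 1700000000 300 1700000120; then
  echo "safe to start"
else
  echo "wait until $(earliest_safe_drill 1700000000 300)"
fi
```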

Game Days: Structured Chaos Under Observation

A game day is a bounded, observed, hypothesis-driven failure experiment at system scope. It differs from a failover drill in three ways: (1) the scenario is injected into a production-like environment, often without every participant knowing the exact failure mode in advance; (2) an observer team (engineers not directly on the response team) documents timeline, decisions, and gaps; (3) there is a formal hypothesis, such as "We believe our system will achieve RTO < 10 minutes during the failure of an entire us-east-1 availability zone, because our failover runbook was validated last quarter", and a clear pass/fail criterion evaluated after the exercise.

A well-run game day follows this structure:

Pre-game brief (30 min): Publish the scenario (or a sanitized version of it), confirm the blast radius is contained to the test environment, brief the observer team, assign a timekeeper, and agree on the abort condition (a specific observable that means "stop immediately").
Failure injection (varies): Inject the failure using your chaos tooling (Chaos Mesh, AWS Fault Injection Service, formerly Fault Injection Simulator, or a manual action). The injection team does not help the response team.
Response window: The on-call team responds exactly as they would during a real incident — Slack war room, runbooks, escalation paths. The observer team records every action with a wall-clock timestamp.
Halt and measure: At the agreed end condition (service restored, RTO window expired, or abort trigger hit), the injection team stops the experiment. Measure actual RTO and RPO from the observer timeline.
Post-game review (60–90 min same day): Work through the observer notes as a group. Identify: what worked, what was slow, what was missing (runbook gaps, missing automation, undocumented dependencies), and what surprised everyone. File action items with owners and due dates before the session ends.
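
Measuring RTO "from the observer timeline" is easier if the observers record events in a machine-readable form. The sketch below assumes a hypothetical format of one event per line, "EPOCH_SECONDS EVENT_NAME", with the event names failure_declared and service_restored; none of this is a standard, and the timestamps are invented.

```shell
#!/usr/bin/env bash
# rto_from_timeline: reads "EPOCH_SECONDS EVENT_NAME" lines on stdin and prints
# the seconds between failure_declared and service_restored.
# Exits non-zero if either event is missing.
rto_from_timeline() {
  awk '
    $2 == "failure_declared" { start = $1 }
    $2 == "service_restored" { end = $1 }
    END {
      if (start == "" || end == "") { exit 1 }
      print end - start
    }'
}

timeline='1700000000 failure_injected
1700000090 failure_declared
1700000400 db_promoted
1700000833 service_restored'

rto=$(printf '%s\n' "$timeline" | rto_from_timeline)
echo "measured RTO: ${rto}s"   # prints: measured RTO: 743s
```

Starting the clock at failure_declared rather than failure_injected matches the RTO definition used for failover exercises; the gap between the two events is detection time, which is worth reporting separately.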

The most valuable game day output is not the pass/fail verdict — it is the surprises. Teams that run game days consistently report that the most impactful findings are not "the failover took 18 minutes instead of 10" but "we did not know that service X has a hard dependency on the primary-region secret manager, so it never came up in the DR cluster." These hidden dependencies are invisible to architecture diagrams and only surface under actual failure conditions.

k6 for Load-Testing the DR Region Under Exercise

A failover exercise that only checks health endpoints proves availability but not capacity. The DR region may serve traffic correctly but collapse under production load if it was undersized. As part of every failover drill, run a 5-minute load test against the DR endpoint immediately after it passes the health check:

// dr-smoke-load.js — k6 script for DR region load validation
// Usage: k6 run --out influxdb=http://influx:8086/k6 dr-smoke-load.js

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '1m', target: 200 },   // ramp to 200 VUs
    { duration: '3m', target: 200 },   // hold at 200 (10% of peak prod traffic)
    { duration: '1m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95th pct must be < 500 ms
    errors: ['rate<0.01'],             // error rate must be < 1%
  },
};

const DR_BASE = __ENV.DR_BASE_URL || 'https://dr-api.myapp.internal';

export default function () {
  const res = http.get(`${DR_BASE}/api/v1/products`, {
    headers: { Authorization: `Bearer ${__ENV.TEST_TOKEN}` },
  });

  check(res, {
    'status 200': (r) => r.status === 200,
    'latency ok':  (r) => r.timings.duration < 500,
  });

  errorRate.add(res.status !== 200);
  sleep(0.5);
}
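
Choosing the virtual-user count for a test like the one above is a capacity calculation: the number of concurrent VUs needed is roughly the target request rate multiplied by the time one iteration takes (response time plus the sleep). The helper below is a sketch of that arithmetic only; vus_for_rps is a hypothetical name, and the 100 ms response time in the second example is an assumption, not a measurement.

```shell
#!/usr/bin/env bash
# vus_for_rps TARGET_RPS ITERATION_MS
# Prints the VU count needed to sustain TARGET_RPS when one iteration
# (request + sleep) takes ITERATION_MS milliseconds, rounded up.
vus_for_rps() {
  local rps="$1" iter_ms="$2"
  echo $(( (rps * iter_ms + 999) / 1000 ))
}

vus_for_rps 400 500   # prints: 200  (sleep only, response time ignored)
vus_for_rps 333 600   # prints: 200  (0.5s sleep + assumed 100ms response)
```

Read in reverse, 200 VUs with a 0.5 s sleep generate at most about 400 requests per second, so the VU target should be derived from the request rate you actually expect the DR region to absorb.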

Tracking Findings and Closing the Loop

A game day that generates findings but no tracked action items is a ceremony, not an improvement loop. Mature DR programs, including those at large technology companies, have each exercise produce a Corrective Action Plan (CAP): an epic in the team's tracker (Jira, Linear, or similar) with a child ticket for each gap found. The CAP is reviewed at the next quarterly DR review. RTO and RPO measurements from each exercise are recorded as time-series data points; a regression, meaning measured RTO increasing quarter over quarter, triggers an immediate investigation, just as an SLO burn-rate alert would.
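
A regression check of this kind needs very little machinery. The sketch below compares the two most recent RTO measurements from a chronological list; rto_trend is a hypothetical helper and the numbers are illustrative, not real results.

```shell
#!/usr/bin/env bash
# rto_trend RTO_1 RTO_2 ... RTO_N   (seconds, oldest first)
# Compares the latest measurement with the one before it and prints
# "regression", "improved", "flat", or "insufficient-data".
rto_trend() {
  if [ "$#" -lt 2 ]; then
    echo "insufficient-data"
    return 0
  fi
  local args=("$@")
  local last="${args[$(( $# - 1 ))]}"
  local prev="${args[$(( $# - 2 ))]}"
  if [ "$last" -gt "$prev" ]; then
    echo "regression"
  elif [ "$last" -lt "$prev" ]; then
    echo "improved"
  else
    echo "flat"
  fi
}

rto_trend 1080 900 640 420   # prints: improved
rto_trend 640 420 743        # prints: regression
```

A single noisy exercise can trip a two-point comparison, so a production version might compare against a rolling median instead; the two-point form is kept here for clarity.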

Automate the game-day evidence package. Configure your DR drill scripts to emit a structured JSON log at the end of every run, for example:

{ "exercise_id": "...", "scenario": "...", "rto_target_s": 600, "rto_actual_s": 743, "rpo_target_s": 300, "rpo_actual_s": 187, "pass": false, "gaps": [...] }

Ingest this into your observability platform so you can graph RTO/RPO trends across exercises. When you present to leadership, showing a chart of five consecutive game days where RTO improved from 18 minutes to 7 minutes is far more persuasive than a narrative report.
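
A drill script can produce that record with a few lines of shell. This is a sketch, not a complete implementation: emit_evidence is a hypothetical function, the exercise ID and scenario values are placeholders, the gaps array is left out because gaps are written up by people after the review, and the strings are not JSON-escaped, so they must not contain quotes or backslashes.

```shell
#!/usr/bin/env bash
# emit_evidence ID SCENARIO RTO_TARGET_S RTO_ACTUAL_S RPO_TARGET_S RPO_ACTUAL_S
# Prints a one-line JSON record; "pass" is true only if both targets were met.
emit_evidence() {
  local id="$1" scenario="$2" rto_t="$3" rto_a="$4" rpo_t="$5" rpo_a="$6"
  local pass=false
  if [ "$rto_a" -le "$rto_t" ] && [ "$rpo_a" -le "$rpo_t" ]; then
    pass=true
  fi
  printf '{"exercise_id":"%s","scenario":"%s","rto_target_s":%d,"rto_actual_s":%d,"rpo_target_s":%d,"rpo_actual_s":%d,"pass":%s}\n' \
    "$id" "$scenario" "$rto_t" "$rto_a" "$rpo_t" "$rpo_a" "$pass"
}

# Same numbers as the example record: RTO missed (743 > 600), RPO met (187 <= 300).
emit_evidence "gd-example" "region-failover" 600 743 300 187
```

Because the output is a single line per run, it can be appended to a log file or shipped by whatever log collector the drill environment already uses.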

DR testing is the discipline that converts the earlier lessons in this tutorial — replication, failover mechanics, GitOps-driven recovery — from architectural diagrams into operationally proven capabilities. Without it, every RTO/RPO claim is a promise. With it, those claims are measurements. Run restore drills daily, failover exercises monthly, and game days quarterly. Treat each exercise as an investment: the cost is a few engineering-hours of controlled disruption; the return is eliminating the catastrophic surprise that a real disaster would otherwise deliver.