Chaos Engineering & Resilience

Game Days

A game day is a structured, time-boxed event in which an engineering team deliberately induces failure in a controlled environment to test whether its systems, runbooks, and people respond as expected. The term comes from American football: the week of practice ends, and now you play under real conditions. At Amazon, the GameDay program, modeled on firefighter drills, was institutionalized before AWS existed. At Google, the equivalent is DiRT (Disaster Recovery Testing), an exercise run annually at the infrastructure layer. At Netflix, they evolved from quarterly chaos days into the continuous Chaos Monkey pipeline. In every organization that runs them well, game days do something automated chaos tooling alone cannot: they test the entire sociotechnical system, not just the software.

Game days test people and process, not just software. Automated chaos tools (Chaos Monkey, Gremlin, Litmus) can inject faults 24/7. What they cannot do is reveal that your on-call engineer does not know where the runbook lives, that your incident bridge line requires a PIN no one has, or that your blast-radius assumptions were wrong because a dependency was undocumented. Game days expose these gaps before a real incident does.

The Four Phases of a Well-Run Game Day

Phase 1: Planning (1-2 Weeks Before)

Start with a hypothesis, not a script. The hypothesis follows the chaos engineering format: "We believe that if failure condition X occurs, system Y will respond with behavior Z, and customer impact will be limited to W." A vague goal ("let's see what breaks") produces vague learning. A sharp hypothesis produces a pass/fail verdict and a clear action item.
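One way to keep the hypothesis honest is to record it as data with explicit limits before the game day, so the verdict is mechanical. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the example thresholds are assumptions made for this example, not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One game day hypothesis in the X / Y / Z / W format (hypothetical schema)."""
    failure_condition: str        # X: the fault to inject
    system: str                   # Y: the system under test
    expected_behavior: str        # Z: what we believe will happen
    max_error_rate: float         # W: customer-impact ceiling, as a fraction of requests
    max_recovery_seconds: float   # W: longest acceptable time to recover

    def verdict(self, observed_error_rate: float, observed_recovery_seconds: float) -> str:
        """Return "pass" only if both observed values stay within the stated limits."""
        within_impact = observed_error_rate <= self.max_error_rate
        within_time = observed_recovery_seconds <= self.max_recovery_seconds
        return "pass" if within_impact and within_time else "fail"


h = Hypothesis(
    failure_condition="500 ms added latency from order-service to inventory-service",
    system="order-service",
    expected_behavior="circuit breaker opens and a fallback response is served",
    max_error_rate=0.01,
    max_recovery_seconds=30,
)
print(h.verdict(observed_error_rate=0.004, observed_recovery_seconds=240))  # prints "fail"
```

Here the observed recovery of 240 seconds exceeds the 30-second limit, so the verdict is `fail`, and that gap becomes the action item.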

Define the blast radius before you touch anything. Enumerate which services, which data stores, and which user populations could be affected. For each one, define a rollback procedure and a halt condition — the observable signal that tells you to stop the experiment immediately. Without a halt condition written down in advance, you will hesitate to stop when you should, and that hesitation costs customers.
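Writing halt conditions down as thresholds makes the stop decision a lookup rather than a debate. The following is a minimal sketch under stated assumptions: the `HaltCondition` structure, the metric names, and the `./teardown-latency.sh` command are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaltCondition:
    metric: str       # name of the observable signal
    threshold: float  # halt as soon as the signal exceeds this value
    rollback: str     # the single command that removes the fault (hypothetical)


def fired_halt_conditions(conditions, current_values):
    """Return every halt condition that has fired.

    A metric missing from current_values counts as fired: if the signal
    cannot be observed, the experiment cannot be shown to be safe.
    """
    fired = []
    for condition in conditions:
        value = current_values.get(condition.metric)
        if value is None or value > condition.threshold:
            fired.append(condition)
    return fired


conditions = [
    HaltCondition("checkout_error_rate", 0.02, "./teardown-latency.sh"),
    HaltCondition("order_p99_seconds", 3.0, "./teardown-latency.sh"),
]
fired = fired_halt_conditions(
    conditions, {"checkout_error_rate": 0.035, "order_p99_seconds": 1.2}
)
for condition in fired:
    print(f"HALT: {condition.metric} over {condition.threshold}; run {condition.rollback}")
```

Treating a missing metric as a fired condition is a deliberate design choice in this sketch: losing visibility during an experiment is itself a reason to stop.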

Decide on scope:

Production game day — highest fidelity, highest risk. Appropriate for mature teams with solid observability and rollback automation. Run in a low-traffic window (e.g., Tuesday 10:00 AM when traffic is 60% of peak, not Friday afternoon).
Staging game day — safe, low fidelity. Good for new teams or new failure modes. The limitation is that staging traffic patterns rarely match production, so some failure modes will not manifest.
Shadow / dark launch game day — inject faults only into a shadow traffic copy. Excellent for validating resilience without any customer risk, but requires shadow infrastructure.

Assign roles explicitly before the day: Chaos Operator (injects the fault), Incident Commander (leads the response, same role as in real incidents), Observer/Scribe (documents timeline and decisions — a different person from the responders), and optionally a Customer Advocate who watches external SLO dashboards and can call a halt.

Phase 2: Dry Run

One week before the game day, run the fault injection in staging with the same team and timeline. This catches configuration mistakes (wrong cluster targeted, wrong fault type), validates that your monitoring fires the expected alerts, and confirms that your rollback procedure actually works. Discovering that your rollback script has a syntax error on game day is embarrassing; discovering it in a dry run is just a Tuesday.

Phase 3: Execution

Begin with a steady-state validation: confirm all SLOs are green, error rates are at baseline, and dashboards are correctly loaded before touching anything. This is your pre-fault baseline — screenshot it.
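The steady-state check can be made explicit by comparing current readings against a recorded baseline. This is a sketch, assuming metrics are already available as plain numbers; the function name, metric names, and tolerance values are illustrative assumptions.

```python
def steady_state_violations(baseline, current, rel_tolerance=0.10, abs_tolerance=0.001):
    """Return the metrics in `baseline` that have drifted out of tolerance.

    A metric passes if |current - baseline| is within the larger of
    rel_tolerance * |baseline| and abs_tolerance. Missing metrics fail.
    """
    violations = []
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None:
            violations.append(name)
            continue
        allowed = max(abs(expected) * rel_tolerance, abs_tolerance)
        if abs(observed - expected) > allowed:
            violations.append(name)
    return violations


# Made-up readings for illustration
baseline = {"error_rate": 0.002, "p99_seconds": 0.40}
before_fault = {"error_rate": 0.0021, "p99_seconds": 0.42}
degraded = {"error_rate": 0.0020, "p99_seconds": 0.95}

print(steady_state_violations(baseline, before_fault))  # empty list: safe to proceed
print(steady_state_violations(baseline, degraded))      # non-empty: not at steady state
```

The absolute tolerance keeps near-zero baselines, such as a very low error rate, from failing on noise alone.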

Inject the fault. The Chaos Operator announces each action on the incident bridge in real time: "Injecting 500 ms latency on all calls from order-service to inventory-service — starting now." The Observer timestamps this in the running log.

Let the team respond exactly as they would in a real incident. Do not coach, do not hint, do not step in unless the halt condition is triggered. The simulation has zero value if you rescue the responders — the point is to expose gaps while the stakes are artificial.

After a pre-agreed dwell time (typically 15-30 minutes for a single fault), remove the fault, confirm the system returns to steady state, and then optionally inject the next fault in the scenario. Complex game days chain multiple simultaneous or sequential faults to test compound failure handling.
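The dwell period can be sketched as a loop that holds the fault for the agreed time, polls the halt condition, and always removes the fault on the way out. This is an illustration, not a real chaos tool's API: `run_dwell`, `injected_fault`, and the callables passed in are hypothetical, and the teardown is assumed to be idempotent.

```python
import time
from contextlib import contextmanager


@contextmanager
def injected_fault(inject, teardown):
    """Run inject(), then always run teardown() on exit, even if something raises."""
    try:
        inject()
        yield
    finally:
        teardown()  # assumed idempotent: safe even if inject() only partly ran


def run_dwell(inject, teardown, should_halt, dwell_seconds, poll_seconds=5.0,
              sleep=time.sleep, clock=time.monotonic):
    """Hold the fault for dwell_seconds, ending early if should_halt() returns True.

    Returns "completed" or "halted". teardown() runs in both cases.
    """
    with injected_fault(inject, teardown):
        deadline = clock() + dwell_seconds
        while clock() < deadline:
            if should_halt():
                return "halted"
            sleep(min(poll_seconds, max(0.0, deadline - clock())))
    return "completed"


log = []
outcome = run_dwell(
    inject=lambda: log.append("inject"),
    teardown=lambda: log.append("teardown"),
    should_halt=lambda: True,  # pretend the halt condition fires on the first poll
    dwell_seconds=1800,
)
print(outcome, log)  # prints: halted ['inject', 'teardown']
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with a fake clock instead of waiting for a real 30-minute dwell.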

Announce game days broadly, but do not brief the responders on the exact fault. The on-call team should know a game day is happening (so they do not escalate unnecessarily), but they should not know whether the injected fault is a zone outage, a slow database, or a malformed config deploy. Knowing the answer in advance turns a resilience test into a rehearsal, which is less valuable. Security red team exercises follow the same principle: the defenders are not told the attack plan in advance.

Phase 4: Retrospective (Same Day or Next Day)

The retrospective is where the learning lives. Run a blameless postmortem format — the same format you use for real incidents. Work from the Observer timeline. For each decision point, ask: "What information did the responder have? What did they decide? What would have made that decision faster or more accurate?" This surfaces documentation gaps, missing dashboards, and unclear runbook steps — all fixable before the next real incident.

Produce a written report with three sections: what we expected (the hypothesis), what actually happened (the timeline), and action items with owners and due dates. Without owners and dates, action items do not get done. File them as engineering tickets in the same sprint cycle.

Coordinating a Multi-Team Game Day

A single-service game day is straightforward. A multi-team game day — simulating a zone outage that affects five services owned by four teams — requires additional coordination infrastructure.

Use a dedicated Slack channel for game day only, separate from your normal incident channels, so observers can watch without polluting the real-time responder conversation. The Chaos Operator posts all fault injection and rollback actions as timestamped messages there; the Observer copies them into a shared doc in real time.

Agree on a shared all-stop signal that any participant can invoke, such as a specific emoji in Slack or a code word on the bridge line. Anyone who sees real customer harm forming should be able to halt the exercise without negotiation. This is the most important procedural rule of a production game day.

Never run a production game day without a pre-approved change freeze window. If another team deploys a config change during your game day, it contaminates the experiment — you cannot tell whether the anomaly you see is your fault injection or their deploy. Coordinate a freeze window with your change management process, and post it in your engineering calendar at least 48 hours in advance.

Game Day Scenario Design

Scenarios should be drawn from your actual incident history plus theoretical failure modes in your dependency graph. A good starting portfolio:

Single dependency latency: add 2 s of latency to one downstream call and observe whether circuit breakers trip and fallbacks engage within the expected time.
Leader election loss: kill the primary in a replicated system (database primary, Kafka controller, etcd leader) and measure actual failover time against your SLO budget.
Memory pressure: run a memory-hungry process on one node and observe whether GC pauses cause timeout cascades or whether memory limits (cgroup limits and the OOM killer) contain the blast.
Certificate expiry simulation: swap in an expired certificate in staging and validate that your mTLS setup fails closed (refuses connections) rather than silently passing traffic.
Runbook accuracy test: give a new engineer the runbook for a known failure class and watch them follow it during a simulated incident. If they get stuck, the runbook is incomplete.
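For the leader election scenario, failover time can be estimated from a stream of health probes taken during the exercise. The sketch below assumes probes are recorded as `(timestamp_seconds, ok)` pairs; the function name, the sample data, and the 2-second target are made up for illustration.

```python
def longest_outage_seconds(probes):
    """Return the longest unavailable span seen in a series of health probes.

    probes: list of (timestamp_seconds, ok) tuples sorted by timestamp.
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end is measured up to the last probe.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in probes:
        last_ts = ts
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


probes = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True), (6, True)]
failover = longest_outage_seconds(probes)
target_seconds = 2.0  # hypothetical failover target
print(f"failover took {failover:.0f}s against a {target_seconds:.0f}s target")
```

The measurement is only as precise as the probe interval, so probe more frequently than the failover time you expect to see.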

Figure: a game day execution timeline showing the steady-state snapshot, fault injection, alert detection lag (MTTD), responder mitigation (MTTM), and recovery. The SLO breach window reveals a gap against the target recovery time.

Measuring Game Day Effectiveness

Track these metrics across game days to demonstrate maturity over time:

Mean Time to Detect (MTTD) — how long from fault injection until the first alert fired. Your SLO alert windows define the target; anything longer is a gap.
Mean Time to Mitigate (MTTM) — how long from alert until the fault was contained. Compare against your error budget burn rate to understand whether response speed is SLO-safe.
Hypothesis accuracy — did the system behave as predicted? If your hypothesis was right every time, your scenarios are not novel enough. A 70-80% accuracy rate is healthy; it means you are testing things you do not fully understand yet.
Action item closure rate — what percentage of action items from the last game day were completed before the next one? Below 80% signals a process failure, not an engineering failure.
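These four metrics fall out directly from the Observer's timeline and the action item tracker. The sketch below assumes each game day is recorded as a dictionary with the keys shown; the record layout, function name, and sample values are illustrative assumptions.

```python
from datetime import datetime
from statistics import mean


def summarize_game_days(game_days):
    """Aggregate detection, mitigation, hypothesis, and follow-through metrics."""
    detect = [(g["first_alert_at"] - g["injected_at"]).total_seconds() for g in game_days]
    mitigate = [(g["mitigated_at"] - g["first_alert_at"]).total_seconds() for g in game_days]
    opened = sum(g["action_items_opened"] for g in game_days)
    closed = sum(g["action_items_closed"] for g in game_days)
    return {
        "mttd_seconds": mean(detect),      # fault injection -> first alert
        "mttm_seconds": mean(mitigate),    # first alert -> fault contained
        "hypothesis_accuracy": sum(1 for g in game_days if g["hypothesis_held"]) / len(game_days),
        "action_item_closure_rate": closed / opened if opened else 1.0,
    }


# Made-up records for two game days
game_days = [
    {
        "injected_at": datetime(2024, 1, 9, 10, 0, 0),
        "first_alert_at": datetime(2024, 1, 9, 10, 2, 0),
        "mitigated_at": datetime(2024, 1, 9, 10, 10, 0),
        "hypothesis_held": True,
        "action_items_opened": 5,
        "action_items_closed": 4,
    },
    {
        "injected_at": datetime(2024, 4, 9, 10, 0, 0),
        "first_alert_at": datetime(2024, 4, 9, 10, 1, 0),
        "mitigated_at": datetime(2024, 4, 9, 10, 5, 0),
        "hypothesis_held": False,
        "action_items_opened": 5,
        "action_items_closed": 3,
    },
]
summary = summarize_game_days(game_days)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these sample records the closure rate comes out at 0.7, below the 80% line mentioned above, which would point to a follow-through problem rather than an engineering one.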

# Game day observability: PromQL checks to run before and after fault injection

# 1. Confirm steady state: error rate baseline (run BEFORE injecting)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# 2. Watch p99 latency in real time during the fault window
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[1m])) by (le, service)
)

# 3. Circuit breaker state (resilience4j metrics via Micrometer/Prometheus)
#    The gauge is 1 for a breaker's current state and 0 otherwise, so filter on == 1
resilience4j_circuitbreaker_state{state="open"} == 1

# 4. Post a Grafana annotation to pin the exact fault window on all dashboards
#    Run this immediately after injecting; it timestamps the event for post-game analysis
#    Note: %3N (milliseconds) requires GNU date
curl -s -X POST http://grafana:3000/api/annotations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  -d '{
    "time":  '"$(date +%s%3N)"',
    "text":  "Game day fault injected: db-latency 2s on order-service",
    "tags":  ["gameday", "chaos", "fault-inject"]
  }'

# 5. After fault removed: watch error rate return to baseline (poll every 15 s)
watch -n15 'curl -sg "http://prometheus:9090/api/v1/query" \
  --data-urlencode \
  "query=sum(rate(http_requests_total{status=~\"5..\"}[2m]))/sum(rate(http_requests_total[2m]))" \
  | jq .data.result'

Learning Without Breaking Customers

The central tension of production game days is fidelity versus safety. Here are the techniques the industry uses to maximize both:

Start in staging, graduate to production. Run the first three iterations of any new scenario type in staging. Only move to production once you have confirmed the experiment behaves as designed and the rollback procedure works reliably.
Use feature flags to limit blast radius. If your platform supports per-user or per-cohort feature flags, inject faults only for internal users (employees) before expanding to a small percentage of production traffic. Amplitude, LaunchDarkly, and homegrown flag systems all support this pattern.
Automate rollback, not just rollforward. Every fault injection script must have a paired teardown script. The Chaos Operator runs teardown if the halt condition fires. Never rely on manual steps during a halt — you will be stressed and under pressure; the script must be one command.
Track error budget consumption. Before every game day, calculate how many minutes of SLO breach remain in your monthly error budget. If you are already at 80% of budget consumed, postpone the game day — you do not have enough budget headroom to absorb a planned outage plus the risk of an unplanned one that week.
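The error budget check is simple arithmetic and worth scripting so it runs before every game day. This sketch assumes an availability SLO measured over a 30-day window; the function names and the sample breach minutes are illustrative, while the 80% cutoff follows the guidance above.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Total allowed minutes of SLO breach in the window, e.g. 0.999 over 30 days."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def should_postpone(slo_target, breach_minutes_so_far, window_days=30, max_consumed=0.80):
    """True if the share of budget already consumed is at or above max_consumed."""
    budget = error_budget_minutes(slo_target, window_days)
    return breach_minutes_so_far / budget >= max_consumed


budget = error_budget_minutes(0.999)
print(f"30-day budget at 99.9%: {budget:.1f} min")        # 43.2 min
print(should_postpone(0.999, breach_minutes_so_far=36))   # True: about 83% consumed
print(should_postpone(0.999, breach_minutes_so_far=20))   # False: about 46% consumed
```

A 99.9% SLO leaves roughly 43 minutes per 30 days, so a single 30-minute dwell that breaches the SLO could consume most of the month's budget on its own.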

The most valuable game days are the ones where the hypothesis is wrong. If the circuit breaker you thought would trip in 30 seconds actually takes 4 minutes, you have just discovered a real production risk under controlled conditions — which is exactly the point. A game day where everything works as expected is still valuable for confidence, but a game day that reveals a gap is what justifies the program to leadership.