Game Days
A game day is a structured, time-boxed event in which an engineering team deliberately induces failure in a controlled environment to test whether its systems, runbooks, and people respond as expected. The name borrows from sports: the week of practice ends, and now you play under real conditions. At Amazon, the GameDay program dates to the early 2000s, before AWS launched. At Google, the equivalent is DiRT (Disaster Recovery Testing), an annual company-wide exercise. At Netflix, the practice moved toward continuous, automated fault injection with tools such as Chaos Monkey. In every organization that runs them well, game days do something automated chaos tooling alone cannot: they test the entire sociotechnical system, not just the software.
The Four Phases of a Well-Run Game Day
Phase 1: Planning (1-2 Weeks Before)
Start with a hypothesis, not a script. The hypothesis follows the chaos engineering format: "We believe that if failure condition X occurs, system Y will respond with behavior Z, and customer impact will be limited to W." A vague goal ("let us see what breaks") produces vague learning. A sharp hypothesis produces a pass/fail verdict and a clear action item.
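One way to keep the hypothesis sharp is to record it as structured data before the day, so the verdict is mechanical rather than debated afterwards. The sketch below is illustrative only: the `Hypothesis` class, its field names, and the `verdict` helper are assumptions made for this example, not part of any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One game day hypothesis in the X / Y / Z / W form (hypothetical structure)."""
    failure_condition: str   # X: the fault to be injected
    system: str              # Y: the system under test
    expected_behavior: str   # Z: how the system should respond
    impact_limit: str        # W: the acceptable customer impact

    def statement(self) -> str:
        return (
            f"We believe that if {self.failure_condition} occurs, "
            f"{self.system} will respond with {self.expected_behavior}, "
            f"and customer impact will be limited to {self.impact_limit}."
        )


def verdict(behavior_observed: bool, impact_within_limit: bool) -> str:
    """The hypothesis passes only if both predictions held."""
    return "pass" if behavior_observed and impact_within_limit else "fail"


h = Hypothesis(
    failure_condition="500 ms of added latency from order-service to inventory-service",
    system="order-service",
    expected_behavior="an open circuit breaker and cached inventory reads",
    impact_limit="stale stock counts for the duration of the fault",
)
print(h.statement())
print(verdict(behavior_observed=True, impact_within_limit=False))
```

A "fail" here is not a bad outcome. It is the clear action item the hypothesis format is meant to produce.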
Define the blast radius before you touch anything. Enumerate which services, which data stores, and which user populations could be affected. For each one, define a rollback procedure and a halt condition — the observable signal that tells you to stop the experiment immediately. Without a halt condition written down in advance, you will hesitate to stop when you should, and that hesitation costs customers.
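A halt condition written down in advance can also be written as a check. The following is a minimal sketch, assuming metrics arrive as a plain dictionary of current readings; the metric names and thresholds are invented for illustration, and a real setup would read them from its own monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaltCondition:
    """An observable signal and the threshold at which the experiment stops."""
    metric: str
    threshold: float

    def breached(self, observed: float) -> bool:
        return observed >= self.threshold


def should_halt(conditions, readings):
    """Return the names of breached metrics; a non-empty result means stop.

    A metric with no reading counts as breached: if the signal is not
    visible, there is no evidence that the experiment is still safe.
    """
    breached = []
    for condition in conditions:
        value = readings.get(condition.metric)
        if value is None or condition.breached(value):
            breached.append(condition.metric)
    return breached


# Hypothetical halt conditions agreed during planning.
conditions = [
    HaltCondition("checkout_error_rate", 0.02),
    HaltCondition("p99_latency_ms", 1500),
]
readings = {"checkout_error_rate": 0.005, "p99_latency_ms": 1800}
print(should_halt(conditions, readings))
```

Treating a missing reading as a breach is a design choice in this sketch, not a universal rule. It errs toward stopping when observability itself fails.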
Decide on scope:
- Production game day: highest fidelity, highest risk. Appropriate for mature teams with solid observability and rollback automation. Run it in a lower-traffic window when the full team is on hand (e.g., Tuesday at 10:00 AM with traffic at 60% of peak, not Friday afternoon).
- Staging game day: lower risk, lower fidelity. Good for new teams or new failure modes. The limitation is that staging traffic patterns rarely match production, so some failure modes will not manifest.
- Shadow / dark launch game day: inject faults only into a copy of production traffic. This validates resilience with no customer risk, but it requires shadow infrastructure.
Assign roles explicitly before the day: Chaos Operator (injects the fault), Incident Commander (leads the response, same role as in real incidents), Observer/Scribe (documents timeline and decisions — a different person from the responders), and optionally a Customer Advocate who watches external SLO dashboards and can call a halt.
Phase 2: Dry Run
One week before the game day, run the fault injection in staging with the same team and timeline. This catches configuration mistakes (wrong cluster targeted, wrong fault type), validates that your monitoring fires the expected alerts, and confirms that your rollback procedure actually works. Discovering that your rollback script has a syntax error on game day is embarrassing; discovering it in a dry run is just a Tuesday.
Phase 3: Execution
Begin with a steady-state validation: confirm all SLOs are green, error rates are at baseline, and dashboards are correctly loaded before touching anything. This is your pre-fault baseline — screenshot it.
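Alongside the screenshot, the baseline can be captured as numbers and compared automatically, both before injection and after the fault is removed. This is a sketch under stated assumptions: the metric names, the sample values, and the 10% relative tolerance are all illustrative choices, not recommendations.

```python
def steady_state_violations(baseline, current, tolerance=0.10):
    """Return metrics that deviate from baseline by more than `tolerance`.

    `tolerance` is relative (0.10 means 10% of the baseline value).
    A metric missing from `current` is reported as "missing".
    """
    violations = {}
    for name, expected in baseline.items():
        observed = current.get(name)
        if observed is None:
            violations[name] = "missing"
            continue
        if abs(observed - expected) > abs(expected) * tolerance:
            violations[name] = observed
    return violations


# Hypothetical baseline recorded during planning, and readings taken now.
baseline = {"error_rate": 0.002, "p99_latency_ms": 240.0, "requests_per_s": 1200.0}
current = {"error_rate": 0.0021, "p99_latency_ms": 310.0}

problems = steady_state_violations(baseline, current)
print(problems or "steady state confirmed")
```

If the result is non-empty, the system is not in steady state and the fault should not be injected yet.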
Inject the fault. The Chaos Operator announces each action on the incident bridge in real time: "Injecting 500 ms latency on all calls from order-service to inventory-service — starting now." The Observer timestamps this in the running log.
Let the team respond exactly as they would in a real incident. Do not coach, do not hint, do not step in unless the halt condition is triggered. The simulation has zero value if you rescue the responders — the point is to expose gaps while the stakes are artificial.
After a pre-agreed dwell time (typically 15-30 minutes for a single fault), remove the fault, confirm the system returns to steady state, and then optionally inject the next fault in the scenario. Complex game days combine several faults, either at the same time or in sequence, to test compound failure handling.
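The inject, dwell, remove cycle maps naturally onto a context manager, which guarantees that the fault is removed even if something goes wrong in between. The sketch below assumes the inject and teardown actions are plain callables supplied by the Chaos Operator; `fault` and `dwell` are hypothetical helper names, and the in-memory `state` dictionary stands in for a real fault injection tool.

```python
import time
from contextlib import contextmanager


@contextmanager
def fault(name, inject, teardown, log):
    """Inject a fault and always run its teardown, even on an exception.

    Teardown also runs if `inject` itself fails, so it must be idempotent.
    """
    log.append(f"inject {name}")
    try:
        inject()
        yield
    finally:
        teardown()
        log.append(f"remove {name}")


def dwell(seconds, halt_requested, poll_interval=1.0, sleep=time.sleep):
    """Wait out the dwell time; return True early if a halt is requested."""
    waited = 0.0
    while waited < seconds:
        if halt_requested():
            return True
        sleep(poll_interval)
        waited += poll_interval
    return False


log = []
state = {"latency_ms": 0}  # stand-in for the real system under test

with fault(
    "latency",
    inject=lambda: state.update(latency_ms=500),
    teardown=lambda: state.update(latency_ms=0),
    log=log,
):
    # 15-minute dwell, polled once a minute; sleep is stubbed out here
    # so the example finishes instantly.
    halted = dwell(15 * 60, halt_requested=lambda: False,
                   poll_interval=60, sleep=lambda _: None)

print(log, state, halted)
```

Sequential faults become consecutive `with` blocks, and simultaneous faults become nested ones.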
Phase 4: Retrospective (Same Day or Next Day)
The retrospective is where the learning lives. Use a blameless postmortem, in the same format you use for real incidents, and work from the Observer's timeline. For each decision point, ask: "What information did the responder have? What did they decide? What would have made that decision faster or more accurate?" This surfaces documentation gaps, missing dashboards, and unclear runbook steps, all of which can be fixed before the next real incident.
Produce a written report with three sections: what we expected (the hypothesis), what actually happened (the timeline), and action items with owners and due dates. Without owners and dates, action items do not get done. File them as engineering tickets in the same sprint cycle.
Coordinating a Multi-Team Game Day
A single-service game day is straightforward. A multi-team game day — simulating a zone outage that affects five services owned by four teams — requires additional coordination infrastructure.
Use a dedicated Slack channel for game day only, separate from your normal incident channels, so observers can watch without polluting the real-time responder conversation. The Chaos Operator posts all fault injection and rollback actions as timestamped messages there; the Observer copies them into a shared doc in real time.
Agree on a shared all-stop signal that any participant can invoke, for example a specific emoji in Slack or a code word on the bridge line. Anyone who sees real customer harm forming should be able to halt the exercise without negotiation. This is the most important procedural rule of a production game day.
Game Day Scenario Design
Scenarios should be drawn from your actual incident history plus theoretical failure modes in your dependency graph. A good starting portfolio:
- Single dependency latency: inject enough delay to push p99 latency on one downstream call to 2 s, and observe whether circuit breakers trip and fallbacks engage within the expected time.
- Leader election loss: kill the primary in a replicated system (database primary, Kafka controller, etcd leader) and measure actual failover time against your SLO budget.
- Memory pressure: run a memory-hungry process on one node and observe whether GC pauses cause timeout cascades or whether memory limits (container limits, OOM kills) contain the blast.
- Certificate expiry simulation: swap in an expired certificate in staging and validate that your mTLS setup fails closed (refuses connections) rather than silently passing traffic.
- Runbook accuracy test: give a new engineer the runbook for a known failure class and watch them follow it during a simulated incident. If they get stuck, the runbook is incomplete.
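For the leader election scenario, failover time can be estimated from a log of periodic health probes. This is a rough sketch: the `(timestamp, ok)` probe format, the sample values, and the 5-second budget are assumptions for illustration, and the result is only as precise as the probe interval.

```python
def failover_seconds(probes, kill_time):
    """Seconds from killing the primary to the first successful probe
    that follows at least one failed probe.

    `probes` is a list of (timestamp_seconds, ok) tuples.
    Returns None if the service never recovered, and 0.0 if no probe
    failed at all (the failover was invisible at this probe interval).
    """
    seen_failure = False
    for timestamp, ok in sorted(probes):
        if timestamp < kill_time:
            continue
        if not ok:
            seen_failure = True
        elif seen_failure:
            return timestamp - kill_time
    return None if seen_failure else 0.0


# Hypothetical probe log: primary killed at t = 100.0 s, probes every second.
probes = [
    (99.0, True),
    (100.5, False),
    (101.5, False),
    (102.5, False),
    (103.5, True),
    (104.5, True),
]
budget_seconds = 5.0  # assumed failover budget derived from the SLO

elapsed = failover_seconds(probes, kill_time=100.0)
print(elapsed, elapsed is not None and elapsed <= budget_seconds)
```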
Measuring Game Day Effectiveness
Track these metrics across game days to demonstrate maturity over time:
- Mean Time to Detect (MTTD): how long from fault injection until the first alert fires. Your SLO alert windows define the target; anything longer is a gap.
- Mean Time to Mitigate (MTTM): how long from the first alert until the fault is contained. Compare against your error budget burn rate to understand whether response speed is SLO-safe.
- Hypothesis accuracy: did the system behave as predicted? If your hypothesis was right every time, your scenarios are not novel enough. A 70-80% accuracy rate is healthy; it means you are testing things you do not fully understand yet.
- Action item closure rate: what percentage of action items from the last game day were completed before the next one? Below 80% signals a process failure, not an engineering failure.
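All four metrics fall out of the Observer's timeline plus a count of closed action items. The sketch below assumes each game day run is recorded as a dictionary with `injected`, `alerted`, and `mitigated` timestamps and a `hypothesis_held` flag; those field names and all sample values are made up for this example.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    return (end - start).total_seconds() / 60


def summarize(runs, items_closed, items_total):
    """Compute the four game day metrics from recorded runs."""
    return {
        "mttd_minutes": mean([minutes_between(r["injected"], r["alerted"]) for r in runs]),
        "mttm_minutes": mean([minutes_between(r["alerted"], r["mitigated"]) for r in runs]),
        "hypothesis_accuracy": sum(r["hypothesis_held"] for r in runs) / len(runs),
        "closure_rate": items_closed / items_total,
    }


def t(hhmm):
    """Parse a clock time on an arbitrary example date."""
    return datetime.strptime(f"2024-01-09 {hhmm}", "%Y-%m-%d %H:%M")


# Hypothetical timeline data for four faults.
runs = [
    {"injected": t("10:00"), "alerted": t("10:04"), "mitigated": t("10:16"), "hypothesis_held": True},
    {"injected": t("10:40"), "alerted": t("10:42"), "mitigated": t("10:50"), "hypothesis_held": False},
    {"injected": t("11:20"), "alerted": t("11:26"), "mitigated": t("11:42"), "hypothesis_held": True},
    {"injected": t("12:00"), "alerted": t("12:04"), "mitigated": t("12:16"), "hypothesis_held": True},
]

summary = summarize(runs, items_closed=9, items_total=12)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With this sample data, hypothesis accuracy is 75%, inside the healthy range above, while a closure rate of 75% falls below the 80% line.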
Learning Without Breaking Customers
The central tension of production game days is fidelity versus safety. Here are the techniques the industry uses to maximize both:
- Start in staging, graduate to production. Run the first three iterations of any new scenario type in staging. Only move to production once you have confirmed the experiment behaves as designed and the rollback procedure works reliably.
- Use feature flags to limit blast radius. If your platform supports per-user or per-cohort feature flags, inject faults only for internal users (employees) before expanding to a small percentage of production traffic. Amplitude, LaunchDarkly, and homegrown flag systems all support this pattern.
- Automate rollback, not just rollforward. Every fault injection script must have a paired teardown script. The Chaos Operator runs teardown if the halt condition fires. Never rely on manual steps during a halt: people make mistakes under pressure, so teardown must be a single command.
- Track error budget consumption. Before every game day, calculate how many minutes of SLO breach remain in your monthly error budget. If you are already at 80% of budget consumed, postpone the game day — you do not have enough budget headroom to absorb a planned outage plus the risk of an unplanned one that week.
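The error budget check in the last item is simple arithmetic. For a 99.9% availability SLO over a 30-day window, the budget is

(1 - 0.999) x 30 x 24 x 60 = 43.2 minutes.

A minimal sketch of the go/no-go decision follows. The 80% cutoff comes from the text above; the function names, the 30-day window, and the planned-impact parameter are assumptions made for this example.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Total minutes of allowed SLO breach in the window."""
    return (1 - slo_target) * window_days * 24 * 60


def can_run_game_day(slo_target, breach_minutes_so_far, planned_impact_minutes,
                     max_consumed_fraction=0.80, window_days=30):
    """Go only if consumption is under the cutoff and the planned impact
    fits within the budget that remains."""
    budget = error_budget_minutes(slo_target, window_days)
    consumed_fraction = breach_minutes_so_far / budget
    remaining = budget - breach_minutes_so_far
    return consumed_fraction < max_consumed_fraction and planned_impact_minutes <= remaining


print(round(error_budget_minutes(0.999), 1))
# 36 of 43.2 minutes already burned is about 83% consumed: postpone.
print(can_run_game_day(0.999, breach_minutes_so_far=36, planned_impact_minutes=5))
# 10 minutes burned, 5 planned: there is headroom.
print(can_run_game_day(0.999, breach_minutes_so_far=10, planned_impact_minutes=5))
```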