Incident Management & On-Call

Runbooks & Playbooks

It is 3:07 AM. Your phone wakes you. PagerDuty says checkout is down. Your brain is at 30% capacity. The last thing you want is a blank screen and the expectation that you will reconstruct weeks of institutional knowledge from memory while under pressure. This is exactly why runbooks exist — not as bureaucratic documentation exercises, but as survival tools that let a half-awake engineer resolve a P1 incident without making it worse.

At companies like Google, Netflix, and Stripe, runbooks are treated as first-class production artifacts: version-controlled alongside service code, reviewed in pull requests, and tested in fire drills. The gap between a team that handles incidents gracefully and one that thrashes through them often comes down to the quality of their runbooks.

Runbook vs. Playbook: A runbook is a procedure for a specific, known failure mode — "what to do when Redis runs out of memory." A playbook is a higher-level response guide for a class of incidents — "how to manage a database outage." Playbooks orchestrate multiple runbooks. Both are essential; neither replaces the other.

What Makes a Runbook Usable at 3 AM

Most runbooks fail not because they are inaccurate but because they are written for the author, not the reader. The author already knows the context. The reader at 3 AM does not. A usable runbook has these properties:

A single, concrete trigger. The runbook is linked directly from the alert that fires it. An engineer should never have to guess which runbook applies. The Alertmanager annotation or PagerDuty incident description contains the URL.
Symptom confirmation first. Before taking any action, the engineer confirms they have the right problem. Every runbook starts with verification steps: "You should see X in the dashboard. If you do not, stop and escalate."
Numbered, atomic steps. Not paragraphs. Not prose. Numbered steps, each small enough to do and verify independently. Engineers under stress skip, lose their place, and misread long sentences. They handle numbered lists reliably.
Commands that can be copy-pasted. Every command in a runbook should be runnable without modification, or with a clearly marked variable like ${SERVICE_NAME} or ${CLUSTER} that the engineer fills in. Ambiguity at 3 AM is dangerous.
Expected output after each command. "You should see: PONG. If you see a connection error, proceed to step 7." Engineers need confirmation that a step worked before moving on.
Explicit decision trees. "If step 4 resolves the alert within 5 minutes, proceed to the validation section. If not, escalate to the database on-call." Decisions must be made explicit — not left to judgment when judgment is impaired.
An escape hatch. Every runbook should tell you when to stop following it and call for help. Blind adherence to a runbook in an unexpected failure mode can make things worse.
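
One way to make the "expected output" and "explicit decision tree" properties concrete is to treat each step as data: a command, the output that signals success, and where to go next in either case. The sketch below is illustrative only. The Step class, the next_step helper, and the step numbers are assumptions made up for this example, not part of any runbook tooling.

```python
from dataclasses import dataclass

ESCALATE = 0  # hypothetical sentinel: stop following the runbook and call for help


@dataclass(frozen=True)
class Step:
    number: int
    command: str      # copy-pasteable command shown to the engineer
    expected: str     # substring that signals the step worked
    on_match: int     # step to run next when the expected output appears
    on_mismatch: int  # step to run next otherwise (ESCALATE means stop)


def next_step(step: Step, output: str) -> int:
    """Return the number of the step to run next, given what the command printed."""
    return step.on_match if step.expected in output else step.on_mismatch


# "You should see: PONG. If you see a connection error, proceed to step 7."
ping = Step(3, "redis-cli -h ${REDIS_HOST} PING", "PONG", on_match=4, on_mismatch=7)

print(next_step(ping, "PONG"))                # prints 4
print(next_step(ping, "Connection refused"))  # prints 7
```

Writing steps this way forces the author to state both branches up front, which is the point of the decision-tree property: the reader never has to invent the "otherwise" case at 3 AM.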

Anatomy of a Production Runbook

Here is a structure that mature SRE teams commonly use. Treat it as a working template: keep the section layout as-is and adapt the content per service:

# Runbook: Redis Memory Pressure
**Alert:** RedisMemoryUsageHigh
**Severity:** P2 (degrades to P1 if eviction begins)
**Owner:** Platform SRE
**Last tested:** 2025-04-18 | **Last updated:** 2025-05-01
**Escalation:** #sre-oncall → platform-lead@company.com

---

## 1. Confirm the Alert
Check that Redis memory usage is > 80% on the affected instance:

```
redis-cli -h ${REDIS_HOST} -p 6379 INFO memory | grep -E '^(used_memory|maxmemory)(_human)?:'
```
Expected: used_memory is more than 80% of maxmemory. Both are reported in bytes, with _human variants alongside.

If used_memory is below 70% of maxmemory, this alert fired spuriously. Page the observability team.

---

## 2. Check Eviction Policy & Active Eviction
```
redis-cli -h ${REDIS_HOST} CONFIG GET maxmemory-policy
redis-cli -h ${REDIS_HOST} INFO stats | grep evicted_keys
```
Run the second command twice, about 30 seconds apart. If evicted_keys is increasing between runs, data loss is already occurring. Severity escalates to P1. Notify product on-call immediately.

---

## 3. Identify the Largest Keys
```
redis-cli -h ${REDIS_HOST} --bigkeys --count 200
```
Note the top 5 key patterns by size and their TTL:
```
redis-cli -h ${REDIS_HOST} MEMORY USAGE <key>
redis-cli -h ${REDIS_HOST} TTL <key>
```

---

## 4. Mitigation Options (choose one based on findings)

**Option A: Key has no TTL and should have one**
```
redis-cli -h ${REDIS_HOST} EXPIRE <key> 3600
```

**Option B: Runaway session accumulation**
```
# Check session key count in the sessions DB (DB 1)
redis-cli -h ${REDIS_HOST} -n 1 DBSIZE
# Flush only the sessions DB (DB 1) if confirmed safe with team lead
redis-cli -h ${REDIS_HOST} -n 1 FLUSHDB ASYNC
```
⚠️ Confirm with lead before flushing. This logs out all active users.

**Option C: Increase maxmemory temporarily (buys 30 min)**
```
redis-cli -h ${REDIS_HOST} CONFIG SET maxmemory 8gb
```
This is a temporary fix. File a capacity ticket immediately.

---

## 5. Validate Resolution
```
redis-cli -h ${REDIS_HOST} INFO memory | grep used_memory_human
```
Alert should clear within 5 minutes if usage is below 70%.

---

## 6. Post-Incident
- Update the incident timeline in the war room channel.
- File a capacity ticket if Option C was used.
- Add findings to the postmortem draft.
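
The 80% and 70% thresholds in the template are ratios of used memory to maxmemory, while used_memory_human is an absolute size such as 3.00G. A small helper can do the arithmetic from raw INFO memory output. This is a minimal sketch: the function name and the sample values are made up for illustration, and it assumes the used_memory and maxmemory fields are present in the output.

```python
from typing import Optional


def memory_usage_percent(info_memory: str) -> Optional[float]:
    """Return used_memory as a percentage of maxmemory, or None if no limit is set."""
    fields = {}
    for line in info_memory.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    used = int(fields["used_memory"])
    limit = int(fields["maxmemory"])
    if limit == 0:  # maxmemory 0 means Redis has no configured limit
        return None
    return 100.0 * used / limit


# Illustrative sample, shaped like `redis-cli INFO memory` output (values invented).
SAMPLE = """# Memory
used_memory:3221225472
used_memory_human:3.00G
maxmemory:4294967296
maxmemory_human:4.00G
"""

print(memory_usage_percent(SAMPLE))  # prints 75.0
```

At 75% this sample sits between the two thresholds: below the 80% trigger, but not low enough to call the alert spurious.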

Link runbooks to alerts, not wikis. In the Prometheus alerting rules that Alertmanager routes, set annotations.runbook_url on every rule. In PagerDuty, include the runbook link in the incident details or attach it through a response play. Engineers should never have to search for the runbook; it should appear in the pager notification itself.
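
A rule like "every alert has a runbook link" is easy to enforce mechanically. The sketch below assumes the rule file has already been parsed into a dictionary with the usual groups / rules / annotations layout. The function name, the second alert name, and the example URL are hypothetical.

```python
def alerts_missing_runbook(rule_file: dict) -> list[str]:
    """Return the names of alert rules that have no usable runbook_url annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:  # recording rules are not paged on; skip them
                continue
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.startswith("https://"):
                missing.append(rule["alert"])
    return missing


# Hypothetical parsed rule file; the expr values are placeholders.
RULES = {
    "groups": [
        {
            "name": "redis",
            "rules": [
                {
                    "alert": "RedisMemoryUsageHigh",
                    "expr": "<promql expression>",
                    "annotations": {
                        "runbook_url": "https://runbooks.example.com/redis-memory-pressure"
                    },
                },
                {
                    "alert": "CheckoutErrorRateHigh",
                    "expr": "<promql expression>",
                    "annotations": {"summary": "Checkout error rate is elevated"},
                },
            ],
        }
    ]
}

print(alerts_missing_runbook(RULES))  # prints ['CheckoutErrorRateHigh']
```

Run as a CI step, a non-empty result would fail the build, so an alert cannot ship without its runbook link.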

Runbook as Code: Version Control and Testing

Runbooks that live in Confluence or Google Docs rot. Engineers fix the system, forget to update the doc, and the next responder follows stale instructions. Treat runbooks like code:

Store them in the service repository under docs/runbooks/ or in a dedicated runbooks repo.
Require a runbook update in the PR checklist for any change that affects the alert threshold or remediation path.
Add a "Last tested" field. During chaos engineering or fire drills, actually run through the runbook and update this date.
Use CI to validate links — broken URLs to dashboards and escalation contacts are common and catastrophic during incidents.
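
The checks above lend themselves to a small lint step. The following is a sketch under assumptions, not a finished tool: it assumes runbooks use the **Field:** header style from the template, the function names are invented, and the dashboard URL is a placeholder. It only collects URLs; actually requesting each one is left to the CI job.

```python
import re

REQUIRED_FIELDS = ("Alert", "Severity", "Owner", "Last tested", "Escalation")
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def missing_fields(runbook_md: str) -> list[str]:
    """Return header fields (written as **Field:**) that the runbook lacks."""
    return [name for name in REQUIRED_FIELDS if f"**{name}:**" not in runbook_md]


def extract_urls(runbook_md: str) -> list[str]:
    """Collect every http(s) URL so a CI job can request each one."""
    return sorted(set(URL_PATTERN.findall(runbook_md)))


# Illustrative runbook with the "Last tested" field deliberately left out.
RUNBOOK = """# Runbook: Redis Memory Pressure
**Alert:** RedisMemoryUsageHigh
**Severity:** P2
**Owner:** Platform SRE
**Escalation:** #sre-oncall
Dashboard: https://grafana.example.com/d/redis-memory
"""

print(missing_fields(RUNBOOK))  # prints ['Last tested']
print(extract_urls(RUNBOOK))    # prints ['https://grafana.example.com/d/redis-memory']
```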

# .github/PULL_REQUEST_TEMPLATE.md (excerpt)
## Runbook & Alert Checklist
- [ ] If this changes service behavior, is the relevant runbook updated?
- [ ] If this adds a new alert, does the alert annotation include a runbook_url?
- [ ] If this changes thresholds, are the "Expected output" sections in the runbook still accurate?
- [ ] Runbook link: ___________

Playbooks: Orchestrating the Response

Where a runbook handles one failure mode, a playbook handles an incident class. A "Database Outage Playbook" does not tell you how to fix a specific database error — it tells you how to run the incident: who takes Incident Command, which runbooks to invoke in parallel, how to communicate with stakeholders, when to declare SEV1, and what the rollback criteria are. Playbooks are referenced by Incident Commanders, not necessarily by the engineers executing remediation steps.

Alerts link to runbooks; playbooks orchestrate the wider incident response; postmortems feed improvements back into both.

The Runbook Lifecycle: Writing, Validating, Retiring

A runbook has a lifecycle. It is created when a new alert is added. It is validated during the next fire drill or real incident. It is updated after every postmortem that reveals a gap. It is retired when the underlying system changes and the failure mode no longer exists. Teams that do not retire stale runbooks accumulate dangerous noise — engineers stop trusting the documentation and start improvising, which defeats the purpose entirely.

Build a quarterly audit into your team calendar. Walk through every runbook: does the command still work? Does the dashboard URL resolve? Is the escalation contact still on the team? A two-hour audit every quarter prevents hours of confusion during incidents.
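
Part of that audit can be automated by flagging runbooks whose "Last tested" date is older than the audit interval. The sketch below assumes the **Last tested:** YYYY-MM-DD header format from the template and a 90-day threshold. Both the threshold and the function name are illustrative choices.

```python
import re
from datetime import date, timedelta

LAST_TESTED = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")


def is_stale(runbook_md: str, today: date, max_age_days: int = 90) -> bool:
    """True when the runbook has no 'Last tested' date or the date is too old."""
    match = LAST_TESTED.search(runbook_md)
    if match is None:
        return True  # an untested runbook is treated as stale
    tested = date.fromisoformat(match.group(1))
    return today - tested > timedelta(days=max_age_days)


HEADER = "**Last tested:** 2025-04-18 | **Last updated:** 2025-05-01"

print(is_stale(HEADER, today=date(2025, 6, 1)))  # prints False (44 days old)
print(is_stale(HEADER, today=date(2025, 9, 1)))  # prints True (136 days old)
```

This only narrows the audit list. Someone still has to run the commands and open the dashboards, because a recent date proves the runbook was exercised, not that it is still correct.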

The most dangerous runbook is one that is 90% correct. An engineer following a mostly-right runbook will execute the first eight steps confidently, hit step nine that no longer applies, and either make a bad judgment call or freeze. Out-of-date runbooks are worse than no runbooks because they create false confidence. If you cannot commit to keeping them current, write fewer runbooks and keep them pristine.

Automation: The Runbook's Final Form

Every manual step in a runbook is a bug. The long-term goal is to automate runbook steps until the runbook becomes a reference document for edge cases rather than a primary response mechanism. When your Redis eviction runbook has been run 10 times and the fix was always "flush DB 1," that step should become an automated remediation triggered by the alert. Tools like AWS Systems Manager Automation, Rundeck, and custom Lambda/Cloud Function responses make this possible. The runbook still exists, but its job changes: it describes what the automation does, the conditions under which it runs, and how to intervene manually if the automation fails.
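
An automated remediation should keep the same shape as the manual runbook: confirm the symptom, act, validate, and escalate when something does not fit. The sketch below shows that control flow only. The function name, the run-count guard, and the in-memory state standing in for Redis are all assumptions for illustration, and the real confirm, remediate, and verify callables would wrap your own tooling.

```python
from typing import Callable


def auto_remediate(
    confirm: Callable[[], bool],
    remediate: Callable[[], None],
    verify: Callable[[], bool],
    runs_in_last_hour: int,
    max_runs_per_hour: int = 2,
) -> str:
    """Run one guarded remediation attempt and report what happened."""
    if not confirm():
        return "escalate: symptom not confirmed"
    if runs_in_last_hour >= max_runs_per_hour:
        # A fix that keeps firing is hiding a deeper problem; hand it to a human.
        return "escalate: remediation keeps recurring"
    remediate()
    if not verify():
        return "escalate: remediation did not clear the symptom"
    return "remediated"


# Fake state standing in for a Redis instance (values invented).
state = {"memory_pct": 91.0}


def confirm() -> bool:
    return state["memory_pct"] > 80.0


def flush_sessions() -> None:
    state["memory_pct"] = 40.0  # stand-in for the real flush of DB 1


def verify() -> bool:
    return state["memory_pct"] < 70.0


first = auto_remediate(confirm, flush_sessions, verify, runs_in_last_hour=0)
state["memory_pct"] = 91.0  # the problem comes back
second = auto_remediate(confirm, flush_sessions, verify, runs_in_last_hour=2)

print(first)   # prints remediated
print(second)  # prints escalate: remediation keeps recurring
```

The run-count guard is the automated equivalent of the escape hatch: past a certain point the automation stops acting and pages someone.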