Incident Management & On-Call

Project: Run a Mock Incident

Everything in this tutorial (incident anatomy, on-call discipline, severity levels, incident command, runbooks, communication, blameless postmortems, learning systems, and tooling) converges here. This project walks you through a complete, realistic incident from the moment the pager fires to the completed postmortem document. Follow every step as if it were real. The goal is muscle memory: when a P0 lands at 03:00 on a Tuesday, you do not want to be thinking about process; you want the process running automatically while you think about the system.

How to use this project: Run this alone as a tabletop exercise using the commands below against a local or staging environment, or run it with two or three colleagues: one as Incident Commander (IC), one as tech lead, one as communications lead. The scenario is a realistic payment-service outage triggered by a bad deploy. The commands target kubectl, Prometheus, the PagerDuty CLI, and the Statuspage API; replace the placeholder IDs, hostnames, and API keys with your own, and check CLI flags against the versions you have installed.

The Scenario

Your team operates payments-api, a Kubernetes-hosted service that handles checkout for an e-commerce platform. At 14:32 UTC a deploy goes out — a routine dependency bump of the database connection pool library from pgpool v2.3.1 to v2.4.0. At 14:38 UTC PagerDuty fires a critical alert: ErrorBudgetBurnHigh. The error rate has climbed from 0.03% to 4.2% and is still rising. Users are seeing HTTP 500s on the checkout endpoint. This is a P1 on your severity scale — significant user impact, no complete outage, but burning error budget fast. Your job is to run the response.

Figure: The complete mock incident timeline from deploy to resolution. Measured from the 14:32 deploy: TTD 6 min, TTM 21 min, TTR 27 min. A solid outcome driven by fast detection and a clear rollback path.
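ErrorBudgetBurnHigh is a burn-rate alert: it fires when errors consume the error budget faster than the SLO can sustain. The burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below shows that arithmetic. The 99.7% SLO target is an assumption, not something the scenario states, and the function name burn_rate is illustrative.

```shell
# burn_rate ERROR_RATIO SLO_TARGET
# Prints how many times faster than sustainable the error budget is burning.
# Both arguments are fractions: 0.042 means 4.2%, 0.997 means a 99.7% SLO.
burn_rate() {
  LC_ALL=C awk -v err="$1" -v slo="$2" 'BEGIN {
    budget = 1 - slo              # error ratio the SLO allows
    printf "%.1f\n", err / budget
  }'
}

# Scenario numbers, with the SLO target assumed (not given in the scenario)
burn_rate 0.042 0.997    # error ratio when the page fires
burn_rate 0.0003 0.997   # normal baseline of 0.03%
```

Under that assumed target, the 4.2% error ratio prints 14.0 and the 0.03% baseline prints 0.1. A burn rate of 1.0 would spend exactly the whole budget over the SLO window.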

Step 1 — The Page Arrives (T+0, 14:38 UTC)

PagerDuty wakes you. The alert is ErrorBudgetBurnHigh — severity critical, 14x burn rate. Before touching anything, acknowledge the page (stops escalation) and immediately open the incident channel.

# Acknowledge via the PagerDuty CLI (stops escalation to the next on-call)
# Flag names differ between CLI versions; confirm with: pd incident ack --help
pd incident ack --ids P4X9KLM

# Open a dedicated Slack channel (or use your incident tool's API)
# With an incident tool installed, use its slash command (for example /incident; the exact command depends on the tool)
# Manually: create #inc-2024-1211-payments and invite the responders

# Post the opening message immediately — do NOT wait until you understand the problem
# Template:
# [INCIDENT OPENED] P1 — payments-api elevated errors
# IC: @yourname | Tech lead: @jane | Comms: @bob
# Symptoms: HTTP 500s on /checkout, error rate 4.2% (normal 0.03%)
# First update in 10 minutes. Status page updated to Investigating.

# Update your status page (Statuspage.io CLI or API)
curl -s -X POST https://api.statuspage.io/v1/pages/PAGE_ID/incidents \
  -H "Authorization: OAuth YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "name": "Elevated checkout error rate",
      "status": "investigating",
      "body": "We are investigating elevated error rates on the checkout service. Impact: some users may experience failures at checkout. We will update in 10 minutes.",
      "impact_override": "minor",
      "component_ids": ["CHECKOUT_COMPONENT_ID"]
    }
  }'

Opening ritual matters: The first two minutes of an incident set the tone for everything that follows. A calm, structured opening message with clear role assignments stops the "who is handling this?" chaos before it starts. Many teams keep this opening message as a template in the runbook so the IC can copy it in under 60 seconds.
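One way to reduce the opening message to a copy-and-paste is a small shell function that fills in the Step 1 template. This is only a sketch: the function name and argument order are illustrative and not part of any incident tool.

```shell
# opening_message SEVERITY SERVICE IC TECH_LEAD COMMS SYMPTOMS UPDATE_MINUTES
# Prints the opening message so the IC can paste it into the incident channel.
opening_message() {
  cat <<EOF
[INCIDENT OPENED] $1: $2 elevated errors
IC: $3 | Tech lead: $4 | Comms: $5
Symptoms: $6
First update in $7 minutes. Status page updated to Investigating.
EOF
}

opening_message P1 payments-api @yourname @jane @bob \
  "HTTP 500s on /checkout, error rate 4.2% (normal 0.03%)" 10
```

Keeping the function in the runbook repository means the wording is reviewed once, in calm conditions, rather than improvised during the incident.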

Step 2 — Triage and Scope (T+2, 14:40)

With the channel open and the page acknowledged, you now have a few minutes to establish severity and scope before your first update is due. Pull the golden-signal dashboard and answer: how bad, how wide, when did it start?

# Check the service-wide 5xx error ratio in Prometheus (or Grafana via API)
# Run this against your cluster's Prometheus endpoint
# sum() is required on both sides: without it the division only matches series
# with identical labels, so every 5xx series is divided by itself and returns 1
curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="payments-api"}[5m]))' \
  | jq '.data.result[].value[1]'

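# Check pod status, restarts, and resource usage (are all replicas unhealthy, or just some?)
kubectl get pods -n prod -l app=payments-api -o wide
kubectl top pods -n prod -l app=payments-api
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -30

# Correlate with deploy history: was there a deploy near 14:32?
kubectl rollout history deployment/payments-api -n prod
# rollout history lists revisions but not timestamps;
# ReplicaSet creation times show when each revision rolled out
kubectl get rs -n prod -l app=payments-api --sort-by='.metadata.creationTimestamp'

# Check recent logs for the error pattern
# --tail=-1 is needed: with a label selector, kubectl returns only the last 10 lines per pod by default
kubectl logs -n prod -l app=payments-api --since=15m --tail=-1 --prefix | grep -E "ERROR|FATAL|panic" | tail -40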

# Typical finding: you will see something like:
# ERROR pgpool: max_connections exceeded, pool exhausted, rejecting new connections
# Pool exhaustion right after a pool-library bump points at a changed default.
# The v2.4.0 changelog confirms it: pool_max_conn dropped from 100 to 10.

At T+9 (14:47) you have your answer: pgpool v2.4.0 changed a default, and pool_max_conn dropped from 100 to 10. Under normal load (70 req/s) the pool is exhausted within seconds, causing DB connection errors that surface as 500s. Every replica is affected. The regression is deploy-caused, which means rollback is the fastest path to mitigation.
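Why does a pool of 10 collapse at 70 req/s when a pool of 100 did not? By Little's law, the number of connections in use at once is roughly the request rate multiplied by how long each request holds a connection. The sketch below runs that arithmetic. The 200 ms hold time is a hypothetical figure chosen for illustration; the scenario does not state it, so substitute a measured value from your own service. The sketch also assumes each request holds exactly one connection.

```shell
# required_connections REQ_PER_SEC HOLD_MS
# Average number of connections in use at once (Little's law), rounded up.
# Integer arithmetic in milliseconds avoids floating-point rounding surprises.
required_connections() {
  echo $(( ($1 * $2 + 999) / 1000 ))
}

# pool_verdict REQ_PER_SEC HOLD_MS POOL_SIZE
pool_verdict() {
  needed=$(required_connections "$1" "$2")
  if [ "$needed" -gt "$3" ]; then
    echo "exhausted: need $needed, pool has $3"
  else
    echo "ok: need $needed, pool has $3"
  fi
}

pool_verdict 70 200 100   # old default (hold time is an assumed value)
pool_verdict 70 200 10    # new default in v2.4.0
```

With these inputs the old default has headroom and the new default does not. More generally, at 70 req/s a pool of 10 saturates once the average hold time exceeds 10/70 of a second, about 143 ms.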

Step 3 — Declare Severity and Assign IC (T+2, 14:40)

Declare the severity formally in the incident channel as soon as your first look at the dashboards supports it; you do not need the root cause yet. In this scenario: error rate 4.2%, checkout partially broken, no complete outage, affecting ~35% of checkout attempts. This is P1 by your severity matrix. Assign roles explicitly; do not assume people know their function.

Post to the incident channel:

"Severity confirmed: P1. IC: @yourname. Tech lead: @jane (owns the investigation). Comms lead: @bob (owns status page and stakeholder updates). Leading suspect: the 14:32 deploy (pgpool v2.4.0 bump). Next update: 14:50."

Common mistake — skipping the IC declaration: In high-pressure moments, teams jump straight to debugging without explicitly declaring who is in command. Within minutes you have three engineers all running different kubectl commands, one person rolling back while another is still trying to reproduce, and the comms channel is silent. The IC declaration is a 15-second investment that prevents 20 minutes of chaos.

Step 4 — Mitigation: Execute the Rollback (T+9, 14:47)

Root cause confirmed. The runbook for "deploy-induced regressions" says: rollback first, investigate after. Execute it.

# Rollback to the previous Kubernetes deployment revision
# This re-deploys the image that was running before the bad deploy
kubectl rollout undo deployment/payments-api -n prod

# Watch the rollout in real time
kubectl rollout status deployment/payments-api -n prod --timeout=180s

# While the rollout runs, watch the error ratio drop in another terminal
# (re-runs the Prometheus instant query every 10 seconds; sum() on both sides
# so the ratio is computed across all series, not per label set)
watch -n 10 "curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service=\"payments-api\",status=~\"5..\"}[2m])) / sum(rate(http_requests_total{service=\"payments-api\"}[2m]))' \
  | jq '.data.result[].value[1]'"

# Verify with a smoke test against the checkout endpoint
for i in $(seq 1 5); do
  STATUS=$(curl -o /dev/null -sw "%{http_code}" -X POST https://api.example.com/checkout \
    -H "Content-Type: application/json" \
    -d '{"cart_id":"test-001","dry_run":true}')
  echo "$(date -u +%H:%M:%S) HTTP $STATUS"
  sleep 5
done

# Expected output once the rollback completes (14:53):
# 14:53:05 HTTP 200
# 14:53:10 HTTP 200
# 14:53:15 HTTP 200
# 14:53:20 HTTP 200
# 14:53:25 HTTP 200

# Post to incident channel: "Rollback deployed. Error rate falling.
# Confirming stability before declaring resolved. Next update in 5 minutes."

By 14:53 the rollout is complete, all pods are running the previous image, and the error rate has dropped to 0.04%, within normal range. The rollback worked. Now wait at least 5 minutes to confirm stability before declaring resolution, rather than jumping the gun.
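The stability wait can be made mechanical: sample the error ratio at a fixed interval and declare resolution only when every sample in the window is under a threshold. A minimal sketch follows. The function name, the 0.1% threshold, and the choice of five samples are assumptions for illustration, not values from the runbook.

```shell
# is_stable MAX_RATIO MIN_SAMPLES
# Reads one error ratio per line on stdin. Succeeds (exit 0) only if at least
# MIN_SAMPLES values were read and none of them exceeds MAX_RATIO.
is_stable() {
  LC_ALL=C awk -v max="$1" -v need="$2" '
    NF == 0 { next }                      # ignore blank lines
    { n++; if ($1 + 0 > max + 0) bad++ }
    END { exit (n >= need && bad == 0) ? 0 : 1 }
  '
}

# Five one-minute samples taken after the rollback (illustrative values)
if printf '%s\n' 0.0004 0.0004 0.0003 0.0004 0.0004 | is_stable 0.001 5; then
  echo "stable: safe to declare resolved"
else
  echo "not stable yet: keep watching"
fi
```

In practice the samples would come from the same Prometheus query used during the rollback, appended to a file once a minute and piped into the function.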

Step 5 — Declare Resolution (T+21, 14:59)

Error rate has been nominal for 6 minutes. SLO is restored. Declare the incident resolved — explicitly, with a timestamp, in the incident channel and in the ticketing system.

# Resolve in PagerDuty (flag names differ between CLI versions; confirm with: pd incident resolve --help)
pd incident resolve --ids P4X9KLM

# Add a resolution note with the root cause, mitigation, and follow-ups
pd incident notes --id P4X9KLM --note "RESOLVED 14:59 UTC. Root cause: pgpool v2.4.0 changed pool_max_conn default from 100 to 10, causing connection pool exhaustion under normal load. Mitigation: rolled back deployment to previous revision (pgpool v2.3.1). Error rate returned to baseline 14:53 UTC. Postmortem scheduled 2024-12-12 10:00 UTC. IC: @yourname. Action items: pin pool_max_conn explicitly in config; add integration test for pool exhaustion; pin library version until regression analysed."

# Update status page to resolved
curl -s -X PATCH https://api.statuspage.io/v1/pages/PAGE_ID/incidents/INCIDENT_ID \
  -H "Authorization: OAuth YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "status": "resolved",
      "body": "This incident has been resolved. Checkout is operating normally. We apologize for the disruption. A postmortem will be completed and shared within 48 hours."
    }
  }'

# Post to incident channel:
# [INCIDENT RESOLVED] 14:59 UTC
# TTD: 6 min | TTM: 21 min | TTR: 27 min
# Error budget consumed: ~4.8% of monthly budget in 27 minutes
# Postmortem: 2024-12-12 10:00 UTC, owner @jane
# Channel archiving in 1 hour.
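The TTD, TTM, and TTR figures in the resolution message are plain timestamp differences. In this scenario all three are measured from the 14:32 deploy: detection at 14:38, mitigation at 14:53, resolution at 14:59. The helper below is a small sketch for same-day HH:MM timestamps. Its name is illustrative, and it does not handle incidents that cross midnight.

```shell
# minutes_between START END   (both HH:MM, same day, UTC)
minutes_between() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":"); split(b, e, ":")
    # awk converts "08" to 8, so leading zeros are safe here
    print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
  }'
}

echo "TTD: $(minutes_between 14:32 14:38) min"
echo "TTM: $(minutes_between 14:32 14:53) min"
echo "TTR: $(minutes_between 14:32 14:59) min"
```

Whichever start point your team uses (deploy time or first user impact), state it next to the numbers so that figures from different incidents stay comparable.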

Step 6 — Write the Postmortem

The incident is resolved, but the work is not done. Schedule the postmortem within 24 hours while memory is still sharp. Below is the completed postmortem document for this incident. Notice the structure: timeline first (objective facts), contributing factors (plural, rather than a single "root cause"), and action items with concrete owners and dates.

# Postmortem: payments-api P1, 2024-12-11

- Duration: 14:32 – 14:59 UTC (27 minutes from deploy to resolution)
- Severity: P1 | TTD: 6 min | TTM: 21 min | TTR: 27 min
- Error budget consumed: ~4.8% of December monthly budget
- Incident commander: [your name] | Comms lead: [bob] | Tech lead: [jane]

## Impact
- 4.2% checkout error rate (normal: 0.03%) during 14:38 – 14:53 UTC
- Estimated ~35% of checkout attempts failed during peak window
- ~2,400 failed transactions; no data loss, no payment capture on failed checkouts
- No complete service outage (60% of requests succeeded through retry logic)

## Timeline (all UTC)
14:32 — Deploy payments-api v1.47.0 (pgpool v2.4.0 dependency bump, routine)
14:33 — Deploy completes; new pods running pgpool v2.4.0
14:34 — First pool exhaustion errors begin (not yet detected)
14:38 — ErrorBudgetBurnHigh alert fires (14x burn rate); PagerDuty pages on-call
14:40 — On-call acknowledges, opens #inc-2024-1211-payments; IC declared
14:40 — Status page updated to "Investigating"
14:47 — Root cause identified: pgpool v2.4.0 pool_max_conn default changed 100→10
14:47 — Rollback initiated: kubectl rollout undo deployment/payments-api -n prod
14:53 — Rollback complete; error rate drops to 0.04%
14:59 — 6-minute stability window confirmed; incident declared resolved
14:59 — Status page updated to "Resolved"; PagerDuty resolved

## Contributing Factors (not "blame")
1. The pgpool v2.4.0 changelog noted the pool_max_conn default change in a minor
   bullet point. Our dependency-bump process did not prompt a full changelog
   review before merging.
2. Our integration tests do not exercise connection pool limits — the regression
   was undetectable in CI.
3. The deploy happened at 14:32 UTC (peak traffic hour). Timing amplified impact.
4. We had no automated canary deployment; the change went to 100% of traffic
   immediately, maximising blast radius.

## What Went Well
- Alert fired 6 minutes after the deploy and 4 minutes after the first errors (TTD target: < 10 min) ✓
- IC declared within 2 minutes; roles assigned immediately ✓
- Root cause identified in under 10 minutes using structured log + deploy correlation ✓
- Rollback executed from runbook without hesitation; no extended investigation ✓
- Comms lead posted status page update within 3 minutes of incident opening ✓

## Action Items
| # | Action | Owner | Due |
|---|--------|-------|-----|
| 1 | Explicitly set pool_max_conn=100 in payments-api config (not rely on library default) | @jane | 2024-12-12 |
| 2 | Add connection pool exhaustion integration test to CI pipeline | @alex | 2024-12-18 |
| 3 | Implement canary deployment (10% traffic) for all payments-api deploys | @devops | 2024-12-25 |
| 4 | Add changelog review checklist to dependency bump PR template | @yourname | 2024-12-13 |
| 5 | Shift peak deploys to off-peak window (before 11:00 or after 18:00 UTC) | @yourname | 2024-12-13 |

The postmortem is the highest-leverage document you produce. Action item 3 (canary deployments) limits the blast radius of an entire category of future incidents. Action item 2 will catch this specific regression if it ever recurs. Good postmortems produce improvements that compound: each incident should make the system measurably safer than it was before.
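Action item 5 is easy to enforce mechanically. As a sketch, a deploy script could refuse to run inside the peak window. The function name is hypothetical, and treating exactly 11:00 as blocked and exactly 18:00 as allowed is an assumption about how the boundaries should be read.

```shell
# in_deploy_window HH:MM   (UTC)
# Succeeds if the time is before 11:00 or at/after 18:00, fails otherwise.
in_deploy_window() {
  awk -v t="$1" 'BEGIN {
    split(t, p, ":")
    m = p[1] * 60 + p[2]          # minutes since midnight
    exit (m < 11 * 60 || m >= 18 * 60) ? 0 : 1
  }'
}

for t in 09:15 14:32 18:30; do
  if in_deploy_window "$t"; then
    echo "$t UTC: deploy allowed"
  else
    echo "$t UTC: blocked (peak window)"
  fi
done
```

In a pipeline the argument would be the current time, for example "$(date -u +%H:%M)", with an explicit override for emergency rollbacks so the guard never blocks mitigation.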

Debrief: What Made This Response Effective

Walk back through the mock incident and score yourself against the practices from this tutorial:

Detection (Lesson 1-3): SLO burn-rate alerting fired 6 minutes after the deploy. Synthetic monitoring was not essential here because the alert was fast enough, but a synthetic checkout probe could have narrowed the gap between the first errors at 14:34 and the page at 14:38.
On-call hygiene (Lesson 2): Page acknowledged immediately, no escalation, IC declared within two minutes.
Incident command (Lesson 4): Single IC, roles assigned, no free-for-all debugging, 10-minute update cadence held.
Runbook (Lesson 5): "Rollback first" was in the runbook. The team did not debate; they executed.
Communication (Lesson 6): Status page updated within 3 minutes, stakeholder channel got structured updates, no silence.
Postmortem (Lesson 7-8): Blameless, timeline-first, contributing factors instead of a single scapegoat, five concrete action items with owners and dates.

Run this exercise regularly: Google, PagerDuty, and Stripe have all written publicly about running regular game days or failure drills, scheduled mock incidents where the team intentionally injects a failure and runs the full response procedure. The goal is to keep the process sharp so that the first time anyone encounters a new failure mode is not also the first time they have run the incident procedure. Teams that practise this way tend to respond faster and more calmly than teams that only ever respond to real incidents.

Your Incident Readiness Checklist

Before you are on-call for real, verify each item:

SLO burn-rate alerts configured for every user-facing service (not just raw CPU/memory thresholds)
PagerDuty (or equivalent) escalation policy tested end-to-end in staging
Incident channel creation automated (one command or one Slack shortcut)
Status page account shared with team; update procedure documented
Rollback procedure in the runbook and tested in staging in the last 30 days
Postmortem template in the wiki; process owner assigned
At least one game day run in the last quarter

Incident management is a skill. Like any skill, it degrades without practice and sharpens with repetition. This tutorial gave you the theory and the frameworks; running this mock incident — and eventually real ones — gives you the reps. The engineers who stay calm at 03:00 are the ones who have run the scenario enough times that calmness is the default.