Incident Management & On-Call

Anatomy of an Incident

Every production incident — from a brief latency spike that auto-healed in seconds to a multi-hour outage that made the news — follows the same underlying lifecycle. At Google, Amazon, Netflix, and every serious engineering organisation, the ability to understand exactly where you are in that lifecycle at any given moment is the difference between a team that resolves incidents predictably and a team that thrashes. This lesson maps that lifecycle in detail, connecting each phase to the tooling, human decisions, and failure modes you will encounter in production.

Why the lifecycle matters: Most engineers optimise for the "fix it" phase and neglect detection and scoping. At big-tech scale, a one-minute improvement in time-to-detect (TTD) is worth far more than a five-minute improvement in time-to-fix, because every minute of undetected impact multiplies across millions of users or thousands of downstream services.

The Six Phases of an Incident

The incident lifecycle is not a rigid checklist — it is a mental model that keeps everyone oriented. Phases can overlap, compress under pressure, or temporarily reverse (you thought you had root cause and then learned you did not). Knowing the model is what lets you notice when you are drifting.

Figure: The six phases of the incident lifecycle. TTD, TTM, and TTR are the primary incident health metrics; improving them drives systemic reliability gains.

Phase 1: Detection

An incident begins the moment user impact starts — not when the first alert fires. This distinction is critical. There is always a detection gap: the interval between when the system degraded and when someone noticed. Closing this gap is the first and most leveraged reliability investment you can make.

Detection sources fall into three categories, in rough order of reliability:

Synthetic monitoring: probes you control, running continuously from outside your system (the Prometheus Blackbox Exporter, Pingdom, Datadog Synthetics). These detect failures from the user's perspective and are not subject to the same failure modes as your infrastructure. If your load balancer crashes and takes your internal monitoring with it, synthetics still fire.
Metric-based alerts: threshold or anomaly alerts on the four golden signals (latency, traffic, errors, saturation), optionally complemented by the USE method (Utilisation, Saturation, Errors) for individual resources. Typical implementations are Prometheus alerting rules routed through Alertmanager, Datadog monitors, and CloudWatch alarms. These fire fast but require good SLO-based thresholds: alerting on CPU at 80% catches almost nothing meaningful, while alerting on error-rate SLO burn rate catches what users feel.
User reports: the worst detection mechanism. By the time a user reports an issue via support ticket, you have already missed your TTD target by minutes or hours. User reports are a signal that your monitoring has a gap.
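To make the synthetic-monitoring idea concrete, here is a minimal sketch of an outside-in probe in Python. The names `http_status` and `probe_failed`, the three-attempt policy, and the "2xx or 3xx means healthy" rule are illustrative assumptions, not part of any tool named above; a real deployment would normally use one of those tools instead.

```python
from typing import Callable
from urllib.error import URLError
from urllib.request import urlopen


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout_s) as response:
            return response.status
    except URLError as exc:
        # HTTPError (a URLError subclass) carries the code of a 4xx/5xx reply.
        return getattr(exc, "code", 0)
    except OSError:
        return 0  # timeouts, connection resets


def probe_failed(url: str,
                 fetch: Callable[[str], int] = http_status,
                 attempts: int = 3) -> bool:
    """Report failure only if every attempt fails, to avoid paging on one blip."""
    return all(not (200 <= fetch(url) < 400) for _ in range(attempts))
```

Passing `fetch` as a parameter keeps the retry policy testable without network access; the probe itself should run from a location that does not share infrastructure with the system it watches.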

# Prometheus alerting rule: SLO burn-rate alert (the right way to alert)
# Assumes a 99.9% availability SLO, i.e. an error budget of 0.001 over a 30-day window.
# Fires when the budget is burning 14x faster than the sustainable rate;
# if sustained, the 30-day budget is exhausted in roughly 2 days (30 / 14 ≈ 2.1).

groups:
  - name: slo_burn_rate
    rules:
      - alert: HighErrorBudgetBurn
        # Aggregate by service before dividing. Without sum by (...), each 5xx
        # series only matches itself on the right-hand side and the ratio is always 1.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum by (service) (rate(http_requests_total[1h]))
          ) > (14 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High error budget burn rate on {{ $labels.service }}"
          description: "Error rate {{ $value | humanizePercentage }}, burning budget 14x faster than target. Estimated budget exhaustion in about 2 days."
          runbook: "https://wiki.internal/runbooks/payments-high-error-rate"
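The arithmetic behind the threshold in the rule above can be checked with a few lines of Python. This is a sketch under two assumptions that the rule implies but does not state: a 99.9% availability SLO (error budget 0.001) and a 30-day budget window. The function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / rate


# An observed error ratio of 1.4% against a 99.9% SLO is a burn rate of about 14,
# which would use up a 30-day budget in a little over two days.
rate = burn_rate(error_ratio=0.014, slo_target=0.999)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```

A burn rate of 1 means the budget lasts exactly the length of the window; anything above 1 shortens it proportionally.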

Phase 2: Triage

Triage is the 2-to-5-minute window after detection where you answer three questions: How bad is it? How many users/systems are affected? Who owns the response? The output of triage is a severity level (your organisation's S0/S1/S2/S3 or P0/P1/P2 scale) and an assigned incident commander. Getting this wrong is expensive: under-triaging a critical outage wastes precious minutes, while over-triaging a minor blip burns people out and erodes response discipline.
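One way to keep triage consistent under pressure is to write the severity rules down as code. The sketch below is purely illustrative: the thresholds, the `core_flow_down` flag, and the mapping to S0 through S3 are hypothetical and would need to be replaced with your organisation's own severity definitions.

```python
def classify_severity(error_ratio: float,
                      affected_user_fraction: float,
                      core_flow_down: bool) -> str:
    """Map triage observations to a severity level (hypothetical thresholds)."""
    if core_flow_down and affected_user_fraction >= 0.5:
        return "S0"  # core user journey broken for most users
    if core_flow_down or affected_user_fraction >= 0.1:
        return "S1"  # core journey broken for some, or broad partial impact
    if error_ratio >= 0.01:
        return "S2"  # measurable degradation with limited reach
    return "S3"      # minor, handle in working hours


print(classify_severity(error_ratio=0.02, affected_user_fraction=0.2,
                        core_flow_down=False))
```

Encoding the rules this way makes both under-triage and over-triage visible in review, because the thresholds are explicit rather than living in one engineer's head.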

Effective triage uses your dashboards and logs, not intuition. A trained on-call engineer looks at the golden signals first — latency, error rate, throughput — and then at the scope: is this one region, one availability zone, one service, or a cascade? Cross-referencing the error rate spike with a recent deploy (git log --oneline -20 or your deployment tracking tool) takes thirty seconds and often gives you 80% of the answer.
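The deploy cross-reference described above can also be automated. The following sketch assumes you can obtain a list of recent deploys with finish times from your deployment tracking tool; the `Deploy` record, the `suspect_deploys` helper, and the 30-minute lookback are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deploy(NamedTuple):
    service: str
    revision: str
    finished_at: datetime


def suspect_deploys(deploys: List[Deploy],
                    spike_start: datetime,
                    lookback: timedelta = timedelta(minutes=30)) -> List[Deploy]:
    """Deploys that finished within `lookback` before the spike, newest first."""
    window_start = spike_start - lookback
    candidates = [d for d in deploys
                  if window_start <= d.finished_at <= spike_start]
    return sorted(candidates, key=lambda d: d.finished_at, reverse=True)
```

A deploy that finished shortly before the error rate rose is only a suspect, not a verdict, but it is usually the cheapest hypothesis to test first.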

Phase 3: Communication

Communication begins in parallel with triage, not after it. This surprises engineers who think "I should understand the problem before saying anything." At Google and most big-tech companies, the convention is the opposite: open an incident channel immediately, post a brief initial assessment ("investigating elevated error rates on payments API, likely related to 14:32 deploy, stand by for update in 10 minutes"), and update on a cadence. Silence in an incident channel is worse than an uncertain update — stakeholders fill silence with worst-case assumptions, which triggers escalation chains that distract your engineers from fixing the problem.

The 10-minute update rule: Set a timer. Every 10 minutes during an active incident, post an update to the incident channel, even if that update is "still investigating, no new information." This one practice heads off most "what is the status?" interruptions to the incident commander. Tools like PagerDuty, FireHydrant, and Rootly can automate reminder prompts.
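If your incident tooling does not provide reminders, the cadence check is simple enough to script. This is a minimal sketch; the function names are hypothetical and the 10-minute interval simply mirrors the rule above.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=10)


def next_update_due(last_update_at: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Time by which the next status update should be posted."""
    return last_update_at + interval


def update_overdue(last_update_at: datetime,
                   now: datetime,
                   interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once the interval has fully elapsed since the last update."""
    return now >= next_update_due(last_update_at, interval)
```

A bot could call `update_overdue` once a minute and nudge the incident commander in the channel when it returns True.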

Phase 4: Mitigation

Mitigation and root cause analysis are separate activities, and confusing them is one of the most common production mistakes. Mitigation means stopping user impact as fast as possible, by any means available. Root cause analysis comes later. A team that spends 40 minutes tracing the exact cause of a database deadlock while users cannot check out is making the wrong tradeoff — roll back the deploy first, then investigate.

The canonical mitigation toolkit in order of preference:

Rollback or revert: if a deploy caused the issue, roll it back. This is the fastest and most reliable mitigation for a large class of incidents. Your deployment pipeline must support this in under two minutes for it to be effective.
Feature flag / circuit breaker: disable the specific feature or dependency that is failing without a full redeploy. LaunchDarkly, Statsig, or a simple database-backed flag can cut scope in seconds.
Traffic shifting: redirect traffic away from the failing region or service version. Weighted backends in a Kubernetes Gateway API HTTPRoute, ALB weighted target groups, or Istio VirtualService route weights give you this at layer 7.
Horizontal scaling: if the cause is resource saturation (not a bug), adding capacity buys time, for example kubectl scale deployment payments --replicas=20. This is a temporising measure, not a fix.

# Kubernetes: fast rollback to the previous revision (ReplicaSet)
# Check current rollout history first
kubectl rollout history deployment/payments-api -n prod

# Roll back to the previous revision
kubectl rollout undo deployment/payments-api -n prod

# Wait for the rollback to finish; exits non-zero if it takes longer than 120s
kubectl rollout status deployment/payments-api -n prod --timeout=120s

# Quick smoke test. This only shows the endpoint is serving again;
# confirm in Prometheus that the error rate has actually dropped.
for i in $(seq 1 10); do
  curl -sf -o /dev/null --max-time 5 https://api.example.com/healthz && echo "OK" || echo "FAIL"
  sleep 2
done
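The circuit-breaker option from the mitigation toolkit can be illustrated with a small state machine. This is a simplified sketch, not the behaviour of any particular library: the class name, the consecutive-failure threshold, and the single cool-down timer are assumptions chosen to keep the example short.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency, then retries after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: allow a trial request only once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or restart the cool-down
```

Callers check `allow_request()` before each call to the dependency and report the outcome with `record_success()` or `record_failure()`. While the circuit is open the caller serves a fallback, which limits the blast radius without a redeploy.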

Phase 5: Resolution

An incident is resolved when two conditions are met: user-facing SLOs have returned to target levels, and the mitigation is stable (not just "seems OK for now"). Resolution is a deliberate declaration, not just the moment the alerts clear. The incident commander explicitly closes the incident, records the end time (critical for calculating MTTR and error budget consumption), and hands off any remaining work to normal engineering channels.
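The two resolution conditions can be expressed as a simple check over recent measurements. In this sketch, `error_ratios` is assumed to be a chronological list of error-ratio samples taken at a fixed interval, and "stable" is taken to mean that the last `stable_samples` readings all meet the SLO; both the function name and that definition of stability are assumptions.

```python
from typing import Sequence


def is_resolved(error_ratios: Sequence[float],
                slo_error_ratio: float,
                stable_samples: int) -> bool:
    """True if the most recent `stable_samples` readings all meet the SLO."""
    if stable_samples <= 0:
        raise ValueError("stable_samples must be positive")
    if len(error_ratios) < stable_samples:
        return False  # not enough evidence yet
    recent = error_ratios[-stable_samples:]
    return all(ratio <= slo_error_ratio for ratio in recent)
```

Requiring a run of healthy samples, rather than a single one, guards against declaring resolution the moment the alert clears.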

Avoid the antipattern of "soft closing" — leaving the incident channel open with no declared owner while engineers quietly keep working on it. This obscures your true MTTR metrics and leaves stakeholders uncertain about the system state.

# PagerDuty CLI: resolve an incident and record the resolution
# Requires the pd CLI (pagerduty-cli); see its README for installation.
# Flag names can differ between versions; confirm with
# `pd incident resolve --help` and `pd incident notes --help`.

pd incident resolve --ids P1A2B3C

# Log the resolution note (captured in postmortem timeline)
pd incident notes --id P1A2B3C \
  --note "Resolved 15:47 UTC. Rolled back deploy abc123. Error rate returned to baseline. Postmortem scheduled for tomorrow 10:00 UTC. Owner: @jane"

Phase 6: Postmortem

The postmortem is where the lifecycle closes the loop. A blameless postmortem — written within 48-72 hours while memory is fresh — documents the full timeline, the contributing factors (not a single "root cause," because complex systems rarely have one), and a set of action items with owners and due dates. The goal is not to assign fault but to make the system more resilient and the team better prepared. Postmortems are covered in depth in Lesson 7; what matters here is understanding that the postmortem is not optional overhead — it is the mechanism that converts incidents from pure cost into organisational learning.

Key Metrics: TTD, TTM, TTR

The boundaries between phases give you timestamps, and the intervals between those timestamps are the measurements your team should be tracking. The following incident health metrics are widely used across the industry:

Time to Detect (TTD) — from first user impact to when the on-call engineer is engaged. Target: under 5 minutes for P0/S0. Improved by better alerting and synthetic monitoring.
Time to Mitigate (TTM) — from engagement to when user impact stops. Target: under 30 minutes for P0. Improved by runbooks, fast rollback tooling, and circuit breakers.
Time to Resolve (TTR) — from engagement to full system health and incident closure. Can be hours or days if the mitigation was a workaround and the real fix takes time.
Mean Time Between Failures (MTBF) — average time between incidents of comparable severity. Improved by reliability engineering work surfaced in postmortems.
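Given the four timestamps these definitions rely on, the per-incident metrics follow directly. The `IncidentTimeline` record below is a hypothetical structure for illustration, and the example times are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    impact_start: datetime   # first user impact
    engaged_at: datetime     # on-call engineer engaged
    mitigated_at: datetime   # user impact stopped
    resolved_at: datetime    # incident declared resolved

    @property
    def ttd(self) -> timedelta:
        return self.engaged_at - self.impact_start

    @property
    def ttm(self) -> timedelta:
        return self.mitigated_at - self.engaged_at

    @property
    def ttr(self) -> timedelta:
        return self.resolved_at - self.engaged_at


incident = IncidentTimeline(
    impact_start=datetime(2024, 1, 1, 14, 32),
    engaged_at=datetime(2024, 1, 1, 14, 36),
    mitigated_at=datetime(2024, 1, 1, 14, 55),
    resolved_at=datetime(2024, 1, 1, 15, 47),
)
print(incident.ttd, incident.ttm, incident.ttr)
```

Because TTM and TTR share the same starting point, the gap between them shows how long the system ran on a workaround before it was declared healthy.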

Production pitfall — conflating mitigation with resolution: Many teams report TTR when they actually mean TTM. They declare the incident "resolved" the moment the workaround is in place, without ensuring the underlying system is healthy and the mitigation is not itself fragile. Six hours later, the workaround falls over and the incident re-opens. Track TTM and TTR separately; they tell different stories about your team's capability.

The Incident Lifecycle in Context

The phases above describe the mechanics of a single incident. In practice, your on-call rotation is managing the aggregate of all incidents over a rolling window, measured against your error budget. A team that can reliably detect in under 3 minutes, mitigate in under 20 minutes, and run blameless postmortems with completed action items will, over time, reduce incident frequency and severity — compounding the same way that good engineering compounds. This tutorial covers the tools and practices that make each phase faster and more reliable. The first step is knowing the map.