Incident Management & On-Call

Communication During Incidents

When a production system is on fire, the engineering work of fixing it is only half the job. The other half is communication — keeping stakeholders informed, coordinating the response team, and maintaining trust with users. Poor incident communication is one of the most common reasons post-mortems cite "confusion" and "delayed resolution." At Google, PagerDuty, and Cloudflare, incident comms is treated as a distinct discipline with clear ownership, structured cadence, and dedicated tooling.

The Three Communication Channels

Every incident runs three communication streams in parallel. Conflating them is a common pitfall that slows resolution:

Internal operational channel — the incident war room (Slack/Teams channel, Zoom bridge). This is where engineers share raw findings, debate hypotheses, run commands, and coordinate actions. It is noisy by design.
Internal stakeholder channel — regular updates to Engineering leadership, Product, Customer Success, and Legal. Concise, no jargon, action-oriented. They do not need to know which replica set lost quorum; they need to know impact, ETA, and what they should tell their contacts.
External customer channel — the public status page. Crafted language, no blame, factual about impact scope, updated on a fixed cadence.

The Incident Commander (IC) owns communications. The IC does not fix the system; the IC's job is to coordinate, keep stakeholders updated, and drive the response to resolution. For P0/P1 events, assign a dedicated Comms Lead: a person whose only job is writing updates.

Status Pages: Architecture and Content

A status page is your external communication contract. It must answer three questions: Is anything broken right now? What is the impact? When will it be fixed?

Hosted options (Statuspage.io, Better Stack) are generally preferred over self-hosted ones (such as the open-source Cachet) because they stay up when your infrastructure is down. Configure your status page to update component status automatically from your alerting pipeline; do not rely on humans to flip the toggle while they are already triaging.

# Statuspage.io API — programmatically create an incident from your alerting pipeline
# Called from an Alertmanager webhook receiver or a PagerDuty automation action

curl -X POST https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents \
  -H "Authorization: OAuth ${STATUSPAGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "name": "Elevated API error rate — checkout service",
      "status": "investigating",
      "impact_override": "major",
      "body": "We are investigating elevated error rates affecting the checkout API. Users may experience failed transactions. Engineers are actively investigating.",
      "component_ids": ["'${CHECKOUT_COMPONENT_ID}'"],
      "components": {
        "'${CHECKOUT_COMPONENT_ID}'": "degraded_performance"
      },
      "deliver_notifications": true
    }
  }'

# Update the incident as the situation evolves:
curl -X PATCH https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents/${INCIDENT_ID} \
  -H "Authorization: OAuth ${STATUSPAGE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "status": "identified",
      "body": "Root cause identified: a bad deploy at 14:32 UTC introduced a regression in payment validation. We are rolling back now. ETA to resolution: 15 minutes."
    }
  }'

External status updates work best when they follow a fixed formula. Each update should contain: the time (always in UTC), the current status (Investigating / Identified / Monitoring / Resolved), the impact scope (what percentage of users, which features, which regions), and the time of the next update. Never promise an ETA you cannot keep; if there is nothing new to report, post "continuing to investigate" rather than going silent.
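
The formula above can be enforced in code so that no update goes out with a missing part. The following is a minimal sketch, not a definitive implementation: the function name `format_external_update` and its parameters are hypothetical, and the output layout is only one possible rendering.

```python
from datetime import datetime, timedelta, timezone

# The four statuses named in the text above.
VALID_STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


def format_external_update(now, status, impact, next_update_in_min):
    """Render one external update: UTC time, status, impact, next-update time."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if now.tzinfo is None:
        # Refuse naive timestamps so the UTC conversion below is never a guess.
        raise ValueError("timestamp must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    lines = [f"{now_utc:%H:%M} UTC - {status}", f"Impact: {impact}"]
    # A resolved incident has no further scheduled update.
    if status != "Resolved":
        next_at = now_utc + timedelta(minutes=next_update_in_min)
        lines.append(f"Next update by {next_at:%H:%M} UTC")
    return "\n".join(lines)
```

Rejecting naive timestamps is a deliberate choice in this sketch: it keeps a responder's local clock from leaking into a public post.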

Stakeholder Update Cadence

Internal stakeholders need a different rhythm than the public. A commonly used cadence, keyed to severity:

P0 (site down / data loss risk) — immediate page to VP/CTO, then updates every 15 minutes until resolved.
P1 (major feature degraded) — update every 30 minutes to Engineering director and Customer Success lead.
P2 (minor degradation, workaround exists) — update every hour; Customer Success notified once at start and once at resolution.
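
The cadence above reduces to a lookup table plus a due-time calculation, which a reminder bot could poll. This is a sketch under that assumption; `UPDATE_INTERVAL_MIN`, `next_update_due`, and `is_overdue` are illustrative names, not part of any tool mentioned here.

```python
from datetime import datetime, timedelta, timezone

# Update intervals in minutes, taken from the severity list above.
UPDATE_INTERVAL_MIN = {"P0": 15, "P1": 30, "P2": 60}


def next_update_due(severity, last_update):
    """Return the time by which the next stakeholder update must be posted."""
    return last_update + timedelta(minutes=UPDATE_INTERVAL_MIN[severity])


def is_overdue(severity, last_update, now):
    """True once the cadence for this severity has been missed."""
    return now > next_update_due(severity, last_update)
```

For a P1 whose last update went out at 15:45 UTC, `next_update_due` returns 16:15 UTC, and `is_overdue` becomes true for any later `now`.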

Send stakeholder updates to a dedicated Slack channel (#incident-updates) with a consistent template so recipients can scan history quickly:

# Slack Block Kit message template — post via Slack API or /incident bot command
# Used by the Comms Lead on a timer; some orgs automate this from PagerDuty

{
  "blocks": [
    {
      "type": "header",
      "text": { "type": "plain_text", "text": ":fire: [P1] Incident Update — 15:45 UTC" }
    },
    {
      "type": "section",
      "fields": [
        { "type": "mrkdwn", "text": "*Incident:* INC-2847 — Checkout API errors" },
        { "type": "mrkdwn", "text": "*Status:* Identified" },
        { "type": "mrkdwn", "text": "*Impact:* ~12% of checkout attempts failing, EU-WEST region only" },
        { "type": "mrkdwn", "text": "*Started:* 15:10 UTC (35 min ago)" }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*What we know:* Bad config pushed to payment-service v2.4.1 breaks 3DS auth for cards issued in FR/DE/NL.\n*Action:* Rolling back to v2.4.0. Expected resolution: 16:00 UTC.\n*Workarounds:* Customers can retry with PayPal."
      }
    },
    {
      "type": "section",
      "text": { "type": "mrkdwn", "text": "*Next update:* 16:00 UTC or sooner if resolved.\n*IC:* @sarah | *Comms:* @james | *Runbook:* " }
    }
  ]
}
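
Filling in the template by hand invites mistakes such as a stale "min ago" value. One way to avoid that is to generate the header and fields from an incident record, sketched below. The record keys (`id`, `title`, `severity`, `status`, `impact`, `started`) and the function name are assumptions made for illustration, and only the first two blocks of the template are built.

```python
from datetime import datetime, timezone


def build_stakeholder_update(inc, now):
    """Return a Slack Block Kit payload (a dict) for one stakeholder update."""
    # Derive the elapsed time instead of asking a human to compute it.
    elapsed_min = int((now - inc["started"]).total_seconds() // 60)

    def field(label, value):
        return {"type": "mrkdwn", "text": f"*{label}:* {value}"}

    return {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f":fire: [{inc['severity']}] Incident Update - {now:%H:%M} UTC",
                },
            },
            {
                "type": "section",
                "fields": [
                    field("Incident", f"{inc['id']} - {inc['title']}"),
                    field("Status", inc["status"]),
                    field("Impact", inc["impact"]),
                    field("Started", f"{inc['started']:%H:%M} UTC ({elapsed_min} min ago)"),
                ],
            },
        ]
    }
```

The returned dict can be serialized with `json.dumps` and posted through whichever Slack integration the team already uses.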

Internal War Room Discipline

The operational channel gets chaotic fast. Three practices prevent it from becoming useless noise:

Thread every hypothesis. Do not paste 50-line stack traces in the main channel. Thread them. The main channel should read as a timeline of decisions.
Use a bot to pin actions. Every action taken gets logged: /inc action "rolling back payment-service to v2.4.0" @devops-oncall. This creates an audit trail and feeds the post-mortem timeline automatically.
Silence irrelevant voices. The IC enforces a "speak only if you have data or can take an action" rule. Managers asking "what is the ETA" in the war room should be redirected to #incident-updates.
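
To show how logged actions can feed the post-mortem timeline, here is a small sketch that parses the command form quoted above and stores timestamped entries. The `/inc` bot is hypothetical, as are the regular expression and the function names; a real bot would define its own grammar.

```python
import re
from datetime import datetime, timezone

# Matches the hypothetical command form: /inc action "text" @owner
ACTION_RE = re.compile(r'^/inc action "(?P<text>[^"]+)"\s+@(?P<owner>[\w.-]+)$')


def log_action(timeline, command, now):
    """Parse one action command and append a timestamped entry to the timeline."""
    m = ACTION_RE.match(command.strip())
    if m is None:
        raise ValueError(f"not an action command: {command!r}")
    entry = {"at": now, "owner": m["owner"], "text": m["text"]}
    timeline.append(entry)
    return entry


def render_timeline(timeline):
    """Render entries in chronological order, one line per action."""
    ordered = sorted(timeline, key=lambda e: e["at"])
    return "\n".join(
        f"{e['at']:%H:%M} UTC  @{e['owner']}: {e['text']}" for e in ordered
    )
```

Because each entry carries its own timestamp and owner, the rendered output can be pasted into the post-mortem without reconstructing the order from chat scrollback.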

Automate the first external update. Your Alertmanager or PagerDuty webhook should automatically post a "We are investigating" status page update and a stakeholder Slack message the moment a P0/P1 is declared. Humans are slow under pressure; automation gets the first update out in seconds rather than 20 minutes.
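
A webhook receiver for this could turn an Alertmanager notification into the request body for the Statuspage incident call shown earlier. The sketch below only builds the payload and performs no network I/O. It assumes alerts carry a `severity` label with values `P0` or `P1` and a `summary` annotation; both are conventions chosen for this example, since Alertmanager labels and annotations are user-defined.

```python
def first_update_from_alert(webhook, component_id):
    """Return a Statuspage incident payload for the first firing P0/P1 alert, else None."""
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        # Assumed convention: severity is carried in a label named "severity".
        if alert.get("status") != "firing" or labels.get("severity") not in ("P0", "P1"):
            continue
        annotations = alert.get("annotations", {})
        name = annotations.get("summary") or labels.get("alertname", "Service disruption")
        return {
            "incident": {
                "name": name,
                "status": "investigating",
                # Deliberately generic: no cause or ETA is known at this point.
                "body": "We are investigating a potential service disruption. "
                        "More information will follow shortly.",
                "component_ids": [component_id],
                "deliver_notifications": True,
            }
        }
    return None
```

Returning `None` for lower severities keeps the automation from publishing an incident for every alert; a human can still open one manually.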

Communication Architecture: End-to-End Flow

Figure: incident communication flow. An alert fires, and the IC coordinates three parallel channels (war room, stakeholder cadence, and public status page) until resolution.

Resolving and Closing the Comms Loop

When the incident is resolved, every channel needs a closure update — not just the status page. A common failure is resolving the incident in the war room but forgetting to post a "resolved" message in #incident-updates, leaving executives thinking the outage is still ongoing. The IC checklist at resolution:

Update status page to Resolved, summarize impact duration and root cause in plain language.
Post final message in #incident-updates with duration, user impact count, and next steps (post-mortem date).
Send a customer email if the SLA breach threshold was crossed (typically: P0 outage over 15 minutes affecting more than 1% of users).
Close the war room channel and archive it with an /inc close bot command that timestamps the resolution.
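
The checklist above can be generated rather than remembered, so that the conditional customer email is not forgotten. This is a sketch only: the threshold values are the example figures from the checklist, and the function names are hypothetical.

```python
def needs_customer_email(severity, outage_min, affected_pct):
    """Apply the example SLA threshold: P0, over 15 minutes, more than 1% of users."""
    return severity == "P0" and outage_min > 15 and affected_pct > 1.0


def closure_steps(severity, outage_min, affected_pct):
    """Return the ordered closure steps for a resolved incident."""
    steps = [
        "Status page: mark Resolved with impact duration and plain-language root cause",
        "#incident-updates: post final message with duration, user impact, post-mortem date",
    ]
    if needs_customer_email(severity, outage_min, affected_pct):
        steps.append("Customers: send SLA breach email")
    # Archiving always comes last so the log stays open while other steps run.
    steps.append("War room: archive the channel with /inc close")
    return steps
```

A 40-minute P0 affecting 12% of users yields four steps; a P1 of the same duration yields three, because the email threshold in this sketch applies only to P0.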

Never delete the war room channel; archive it. The raw Slack/Teams log is often the most accurate timeline you have for the post-mortem. Bots that automatically delete incident channels after 30 days have cost teams hours of reconstructing timelines from memory, which is why some SRE organizations ban automatic deletion outright.

Avoiding Communication Anti-Patterns

The anti-patterns that appear in post-mortems repeatedly:

Silent updates — going 45 minutes without a status page post because engineers are deep in triage. Users assume you do not know what is happening. Automate the "still investigating" post on a 15-minute cron if no human posts.
Technical jargon in external updates — "Cassandra compaction storm causing GC pressure" means nothing to a customer. Translate: "Database performance issues are slowing down search results."
Premature "resolved" status — declaring resolved before you have verified with synthetic monitoring and real user metrics. A false resolution update followed by a second "still investigating" destroys trust faster than a single long outage.
Over-communicating in the war room — managers pasting encouragement, non-responders asking for updates. Every message is a notification that pulls an engineer out of flow state.
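
The first anti-pattern in the list above has a mechanical fix: a scheduled job that posts a fallback message when no human has posted within the window. A minimal sketch of the decision logic follows; `MAX_SILENCE` and `auto_update_if_silent` are illustrative names, and the actual posting and scheduling are left to the caller.

```python
from datetime import datetime, timedelta, timezone

# Longest acceptable gap between status page posts, per the 15-minute rule above.
MAX_SILENCE = timedelta(minutes=15)


def auto_update_if_silent(last_post_at, now, status):
    """Return a fallback message if the page has been silent too long, else None."""
    if status == "Resolved":
        return None  # nothing left to report on a closed incident
    if now - last_post_at < MAX_SILENCE:
        return None  # a recent post exists, so stay out of the way
    next_at = now + MAX_SILENCE
    return (
        f"{now:%H:%M} UTC - We are continuing to investigate. "
        f"Next update by {next_at:%H:%M} UTC."
    )
```

Run on a one-minute schedule, this posts at most one fallback per window, provided the caller records each fallback as the new `last_post_at`.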

Production-grade incident communication is a learned skill. The cadence, the language, the channel discipline — these are as engineerable as any system component. Document your comms runbook alongside your technical runbooks, and practice it in game days.