Alerting Philosophy
An alert wakes someone up. That is a serious act. Every time an alert fires at 3 AM and the on-call engineer investigates only to find nothing actionable, you have made an implicit statement: our system's noise matters more than your sleep and your judgment. Multiply that by dozens of engineers and hundreds of alerts, and you have the most common failure mode of observability programs at scale — alert fatigue. Teams stop trusting alerts, start ignoring them, and the first time a real incident fires, nobody responds in time.
Google's Site Reliability Engineering handbook devotes an entire chapter to this problem. The core principle that emerges: every alert must be urgent, actionable, and customer-visible. If an alert does not meet all three criteria, it should be a dashboard warning or a ticket, not a page. This lesson teaches how to build an alerting system that your on-call team trusts and that catches real problems before users notice them.
Symptom-Based Alerting
The most important design decision in alerting is to alert on symptoms, not causes. A symptom is something the user experiences: slow responses, errors, unavailable features. A cause is an internal system state: high CPU, a disk filling up, a pod in CrashLoopBackOff. The mistake most teams make is alerting on causes, which leads to two failure modes:
- False positives: CPU at 95% for 2 minutes does not always mean users are impacted. Maybe it is a batch job. Maybe auto-scaling is already reacting. You page an engineer for nothing.
- False negatives: A silent memory leak causes the recommendation engine to serve stale results without any CPU or error rate spike. No cause-based alert fires. Users are degraded for hours.
Symptom-based alerting inverts this. Instead of "alert when CPU > 80%", you ask "alert when p99 latency exceeds the SLO threshold" or "alert when error budget burn rate is too high." The internal cause becomes an input to investigation, not the trigger for waking someone up.
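A minimal sketch of the symptom-based check described above, assuming latency samples are available as a plain list of milliseconds. The function names, the nearest-rank percentile method, and the 500 ms threshold are illustrative assumptions, not part of any monitoring product.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

def should_page(latencies_ms, slo_p99_ms):
    """Symptom check: page only when user-facing latency breaches the SLO."""
    return percentile(latencies_ms, 99) > slo_p99_ms

# CPU is deliberately not an input here: it is evidence for the
# investigation, not a trigger for the page.
healthy = [120] * 99 + [450]        # one slow request out of 100
degraded = [120] * 95 + [900] * 5   # 5% of requests are slow

print(should_page(healthy, slo_p99_ms=500))   # False
print(should_page(degraded, slo_p99_ms=500))  # True
```

In a real deployment the percentile would usually come from the metrics backend rather than raw samples, but the decision has the same shape: the input is what users experience, not an internal resource gauge.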
Severity Levels and the Escalation Contract
Not every alert deserves a 3 AM phone call. A robust alerting architecture defines a clear severity taxonomy and enforces a routing contract for each level. At most mature companies the tiers look like this:
- P1 — Critical / Page immediately: Users are completely unable to use the product, or data integrity is at risk. Revenue impact is active. Requires immediate human response regardless of time. Example: checkout service returning 100% errors, database primary is down with no automatic failover completing.
- P2 — High / Page during business hours, wake if prolonged: A significant portion of users are degraded. SLO breach is imminent or occurring. Requires response within 30 minutes. Example: p99 latency 3× SLO threshold for more than 10 minutes, payment success rate dropped 15%.
- P3 — Medium / Ticket, respond next business day: A non-critical path is degraded or a capacity limit is approaching. No immediate user impact, but left unaddressed it could escalate. Example: background job queue depth trending toward its limit, or a specific API endpoint showing elevated errors that affect less than 0.5% of users.
- P4 — Low / Dashboard warning: Informational. Worth watching but requires no action today. Never sends a notification — only visible to someone actively looking at dashboards.
The contract is sacred: if P1 fires and the on-call does not respond within 5 minutes, escalate automatically to the team lead. PagerDuty, OpsGenie, and VictorOps all support this escalation policy out of the box. The moment P1 ever fires for a non-emergency, the team will start routing P1 alerts to silence — and that is the beginning of an incident response breakdown.
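The routing contract can be sketched as a small lookup table plus an escalation check. This is an illustration only: the `Route` type, the channel names, and the `team-lead` target are hypothetical, the P2 escalation target is an assumption (the text only gives a 30-minute response window), and the business-hours nuance for P2 is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4

@dataclass(frozen=True)
class Route:
    channel: str                     # "page", "ticket", or "dashboard"
    ack_deadline_min: Optional[int]  # None: no acknowledgement expected
    escalate_to: Optional[str]       # who is notified after the deadline

ROUTES = {
    Severity.P1: Route("page", 5, "team-lead"),
    Severity.P2: Route("page", 30, "team-lead"),  # escalation target assumed
    Severity.P3: Route("ticket", None, None),
    Severity.P4: Route("dashboard", None, None),  # never notifies anyone
}

def next_action(severity, minutes_unacked):
    """Return the channel to use, or an escalation once the deadline passes."""
    route = ROUTES[severity]
    deadline = route.ack_deadline_min
    if deadline is not None and minutes_unacked >= deadline:
        return f"escalate:{route.escalate_to}"
    return route.channel

print(next_action(Severity.P1, 0))    # page
print(next_action(Severity.P1, 5))    # escalate:team-lead
print(next_action(Severity.P3, 600))  # ticket
```

Paging tools implement this with their own escalation policies; the point of the sketch is that each severity maps to exactly one channel and one deadline, so nobody has to decide at 3 AM.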
Runbooks: The Alert Is Not Complete Without One
An alert without a runbook is an alarm with no instructions. When an engineer is woken at 3 AM, their cognitive capacity is reduced, their stress is elevated, and they may be junior enough to have never seen this failure mode before. A runbook is the bridge between "alert fired" and "incident resolved."
Every production alert must link directly to its runbook. The link goes in the alert annotation. The runbook must answer, in order:
- What does this alert mean? Plain English: what user-visible symptom is occurring or at risk?
- What is the immediate triage step? The first command or dashboard to check. Answerable in under 2 minutes.
- What are the likely root causes? Ordered by historical frequency. Each cause links to its specific remediation.
- What are the safe remediations? Exact commands to run, flagged as safe vs. destructive. Destructive steps require explicit confirmation.
- When to escalate? If still unresolved after N minutes, who to call next, and what info to provide them.
Runbooks live in version control alongside the alert rules. When an alert rule changes, its runbook is updated in the same PR. Stale runbooks that reference old commands or non-existent services are worse than no runbook — they waste time and erode trust.
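Because rules and runbooks share a repository, the "every alert links to a runbook" rule can be checked in CI. The sketch below assumes the alert rules have already been parsed into dictionaries and that the link lives in an annotation named `runbook_url`. That key is a common convention rather than a requirement, and the rule names and paths are hypothetical.

```python
def missing_runbooks(rules):
    """Return names of alert rules that lack a non-empty runbook link."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        link = str(annotations.get("runbook_url", "")).strip()
        if not link:
            missing.append(rule.get("alert", "<unnamed>"))
    return missing

# Hypothetical rules, as they might look after parsing a rules file.
rules = [
    {"alert": "CheckoutErrorBudgetBurn",
     "annotations": {"runbook_url": "runbooks/checkout-burn.md"}},
    {"alert": "QueueDepthHigh",
     "annotations": {"summary": "queue is deep"}},
    {"alert": "LatencyP99High"},
]

print(missing_runbooks(rules))  # ['QueueDepthHigh', 'LatencyP99High']
```

Failing the build when this list is non-empty keeps an alert from reaching production without instructions. It does not detect stale runbooks, which still need human review.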
Alert Fatigue: The Silent Killer of On-Call Culture
Alert fatigue is not a metaphor — it is a measurable phenomenon. Studies of hospital alarm systems (where fatigue has killed patients) and cloud operations teams show the same pattern: when the ratio of actionable alerts to total alerts drops below roughly 50%, on-call engineers begin to develop automatic suppression behavior. They acknowledge alerts without reading them. They silence PagerDuty for the first hour after waking because they expect it to be noise. When the real incident fires, the response is slow.
The causes of alert fatigue in production systems:
- Alerting on causes, not symptoms — leads to high false-positive rates for non-user-impacting conditions.
- Thresholds set too aggressively — "alert if p99 ever exceeds 200ms" fires constantly in a system where the SLO is 500ms.
- No `for` duration — a single metric spike lasting 30 seconds pages an engineer at 3 AM.
- Alert proliferation without review — engineers add alerts when they find a new problem, but never remove alerts when the underlying issue is fixed or the service is decommissioned. Alert count grows monotonically.
- Flapping alerts — an alert that oscillates between firing and resolving every few minutes, sending multiple notifications.
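The effect of a hold duration on transient spikes and flapping can be seen in a small simulation. This is a simplified model, not Prometheus's actual evaluation logic: it assumes one boolean per evaluation and counts consecutive breaches, and the function name and sample data are made up for illustration.

```python
def firing_evaluations(breaches, for_evals):
    """Return indices of the evaluations at which the alert is firing.

    breaches:  one boolean per evaluation, True when the threshold is exceeded.
    for_evals: consecutive breaching evaluations required before firing.
    """
    firing = []
    streak = 0
    for i, breached in enumerate(breaches):
        streak = streak + 1 if breached else 0  # any healthy evaluation resets
        if streak >= for_evals:
            firing.append(i)
    return firing

# One isolated spike (index 1), then a sustained problem (indices 4-8).
breaches = [False, True, False, False, True, True, True, True, True, False]

print(firing_evaluations(breaches, for_evals=1))  # [1, 4, 5, 6, 7, 8]
print(firing_evaluations(breaches, for_evals=4))  # [7, 8]
```

With no hold duration the isolated spike pages someone. With a hold of four evaluations only the sustained breach fires, at the cost of detecting it a few evaluations later.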
The cure is a regular alert review process. Every quarter, pull the alert firing history and categorize each alert into: (a) fired and led to a user-impacting incident, (b) fired and was noise, (c) did not fire during an incident it should have caught. Category (b) alerts are candidates for deletion, threshold increase, or conversion to a ticket. Category (c) alerts reveal gaps in coverage.
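The review can be partly automated once each firing has been labeled. The sketch below assumes the history is a list of `(alert_name, category)` pairs using the (a)/(b)/(c) categories from the paragraph above. The alert names and counts are invented for illustration.

```python
from collections import defaultdict

def summarize(history):
    """Aggregate labeled alert history into per-alert review statistics."""
    counts = defaultdict(lambda: {"a": 0, "b": 0, "c": 0})
    for name, category in history:
        if category not in ("a", "b", "c"):
            raise ValueError(f"unknown category: {category!r}")
        counts[name][category] += 1
    report = {}
    for name, c in counts.items():
        fired = c["a"] + c["b"]  # (c) entries are misses, not firings
        report[name] = {
            "fired": fired,
            "noise": c["b"],
            "missed": c["c"],
            "actionable_rate": c["a"] / fired if fired else None,
        }
    return report

def noisiest(report, top=10):
    """Alert names ordered by number of noise firings, highest first."""
    return sorted(report, key=lambda n: report[n]["noise"], reverse=True)[:top]

history = (
    [("HighCPU", "b")] * 9 + [("HighCPU", "a")]
    + [("CheckoutErrors", "a")] * 3
    + [("LatencyBurn", "a"), ("LatencyBurn", "c")]
)
report = summarize(history)
print(report["HighCPU"]["actionable_rate"])  # 0.1
print(noisiest(report, top=1))               # ['HighCPU']
print(report["LatencyBurn"]["missed"])       # 1
```

The labeling itself remains a human judgment made during incident review; the script only turns those labels into a ranked list of candidates for deletion or retuning.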
Practical Alert Hygiene at Scale
Beyond the philosophy, production alerting requires operational discipline. These are the rules followed by SRE teams at hyperscalers:
- Every alert has an owner. An alert with no team label gets silenced and removed. Ownership is enforced by the routing config — if there is no route for a label, the alert goes to a catch-all that opens a ticket against the platform team to find the owner.
- Alerts must have a `for` duration. Never fire on a single evaluation. Minimum `for: 2m` for critical, `for: 5m` for everything else. This alone eliminates 60-70% of false positives from transient spikes.
- Use multi-window burn rate for SLO alerts. A single threshold on a 5-minute window misses slow burns. Alert when the 1-hour burn rate is fast AND the 5-minute burn rate confirms it is still ongoing. This is the algorithm in Google's SRE Workbook: it catches fast burns early and slow burns before budget exhaustion, with very low false-positive rates.
- Suppress during maintenance windows. Alertmanager `inhibit_rules` and `time_intervals` let you silence derivative alerts during planned downtime. Forgetting this causes 50+ alerts during a scheduled maintenance, training the team to ignore them.
- Track your alert signal-to-noise ratio. Export a metric: `alerts_total{actionable="true|false"}`. If actionable rate drops below 70%, schedule an alert audit sprint.
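The multi-window burn-rate rule from the list above can be sketched in a few lines. Burn rate here is the observed error ratio divided by the error budget (1 minus the SLO). The 14.4 threshold is the value commonly paired with a 1-hour window and a 30-day SLO period, since it corresponds to spending 2% of the budget in one hour. The function names and request counts are illustrative assumptions.

```python
def burn_rate(errors, requests, slo):
    """How many times faster than 'exactly on budget' errors are accruing."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)

def burn_rate_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both the long and the short window burn too fast.

    long_window / short_window: (errors, requests) tuples, for example
    the last hour and the last five minutes.
    """
    long_burn = burn_rate(*long_window, slo)
    short_burn = burn_rate(*short_window, slo)
    return long_burn >= threshold and short_burn >= threshold

SLO = 0.999  # 99.9% of requests succeed

# The last hour burned fast, but the last 5 minutes look healthy again.
print(burn_rate_page((2000, 100_000), (10, 8_000), SLO))   # False

# The last hour burned fast and the problem is still ongoing.
print(burn_rate_page((2000, 100_000), (200, 8_000), SLO))  # True
```

The long window establishes that a meaningful share of the budget is being spent; the short window stops the alert from firing, or from continuing to fire, after the problem has ended. A complete setup would add a second, slower pair of windows with a lower threshold to catch slow burns.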
Building the Culture, Not Just the Config
Alerting philosophy is ultimately a team agreement, not a Prometheus YAML file. The config enforces the philosophy, but the philosophy must be agreed upon first. This means having explicit conversations about: what counts as an emergency, what is acceptable to let burn through business hours, and what the on-call rotation's implicit social contract is. Teams that skip this conversation end up with engineers who individually tune their own alert sensitivity — some muting everything, others paging for every blip — and the system as a whole becomes unpredictable.
Institutionalize this as a quarterly alerting review: pull the last 90 days of alert history, calculate the actionable rate per alert, identify the top-10 noisiest alerts by volume, and spend one sprint reducing them. This feedback loop, applied consistently, produces an alert system that the team trusts — which means a team that responds when it matters.