Alertmanager
Prometheus fires alerts — it evaluates alerting rules and marks them as pending or firing. But Prometheus itself does not send emails, page on-call engineers, or post to Slack. That responsibility belongs to Alertmanager: a dedicated daemon that receives alert notifications from one or many Prometheus servers, applies routing logic, deduplicates, groups, suppresses, and fans out to the right people through the right channel at the right time. Understanding Alertmanager deeply is what separates a system that pages you fifty times during a single outage from one that sends a single, well-described ticket to the right team at 2 AM.
Core Concepts Before the Config
Alertmanager operates on alert notifications pushed by Prometheus over HTTP. Each notification carries a set of labels (the same labels on the alerting rule), annotations, a generator URL, and timing metadata. Alertmanager's job is to answer four questions for every batch of incoming alerts:
- Where does it go? — The routing tree maps label sets to receivers.
- When does it go? — Grouping and group_wait / group_interval / repeat_interval control timing to avoid notification storms.
- Should it be suppressed? — Silences and inhibition rules suppress redundant or expected noise.
- Who gets it? — Receivers define the actual integration (PagerDuty, Slack, email, OpsGenie, webhook).
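To make the label and annotation terminology concrete, here is a minimal sketch of a Prometheus alerting rule whose labels and annotations would travel to Alertmanager. The file name, alert name, expression, threshold, team label, and runbook URL are all hypothetical placeholders chosen for illustration.

```yaml
# rules/api.yml (hypothetical file name), loaded by Prometheus via rule_files
groups:
  - name: api-availability
    rules:
      - alert: HighErrorRate            # becomes the alertname label
        # Hypothetical expression: share of 5xx responses over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                        # alert stays pending for 10m before firing
        labels:
          severity: critical            # labels drive routing, grouping, inhibition
          team: payments
        annotations:                    # annotations carry human-readable context
          summary: "5xx error rate above 5% for 10 minutes"
          runbook_url: "https://runbooks.example.com/high-error-rate"
```

Labels are what Alertmanager matches on; annotations are only passed through to notification templates.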
The Routing Tree
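Alertmanager's routing configuration is a tree of route nodes. Each node carries match conditions (equality or regex label matchers), a receiver name, and optional timing overrides. Every incoming alert enters at the root and walks the tree depth-first: among sibling routes the first match wins unless continue: true is set to let later siblings match as well, and the deepest matching node decides the receiver. An alert that matches no child route is handled by its parent, so the root's receiver acts as the catch-all.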
The group_by key is critical and often misunderstood. It tells Alertmanager which label dimensions define a "group" for the purpose of batching. If you group by [alertname, cluster, env], all alerts sharing those three label values fire as a single notification, even if they differ on pod or instance. This prevents ten simultaneous pod restarts from generating ten separate pages.
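A routing tree along these lines might look like the following sketch. The receiver names, label values, and timing values are assumptions for illustration, and the receivers are left without integrations to keep the fragment short.

```yaml
# alertmanager.yml (fragment): receiver names and label values are hypothetical
route:
  receiver: default-slack          # catch-all for alerts no child route matches
  group_by: [alertname, cluster, env]
  group_wait: 30s                  # delay before the first notification of a new group
  group_interval: 5m               # delay before notifying about alerts added to a group
  repeat_interval: 4h              # delay before re-sending an unchanged, still-firing group
  routes:
    # Critical alerts go to the pager; continue: true lets the
    # sibling route below also be evaluated for the same alert.
    - matchers:
        - severity = "critical"
      receiver: oncall-pagerduty
      continue: true
    # Regex matcher: any alert owned by either team lands in this channel.
    - matchers:
        - team =~ "payments|billing"
      receiver: payments-slack
      group_by: [alertname, env]   # child routes may override inherited grouping
receivers:
  - name: default-slack
  - name: oncall-pagerduty
  - name: payments-slack
```

With this tree, a critical alert labelled team=payments would reach both oncall-pagerduty and payments-slack, while a warning with no team label would fall through to default-slack.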
group_wait is a buffer applied when a new group is created, giving Prometheus a chance to send related alerts before Alertmanager dispatches the first notification. group_interval governs subsequent flushes for that same group when new alerts join it, and repeat_interval sets how long Alertmanager waits before re-sending a notification for a group that is still firing but unchanged. Setting group_wait too short causes notification storms during cascading failures; setting it too long delays the first page. 30 seconds is a sane default for most production environments.
Inhibition Rules
Inhibition is the Alertmanager feature most engineers under-use and most wish they had configured earlier. An inhibition rule says: "if alert A is firing with these labels, suppress any alert B that matches these other labels." This is indispensable for preventing symptom noise when the root cause alert is already firing.
The canonical example: a NodeDown alert fires. Seconds later, twenty PodCrashLooping and HighLatency alerts fire from that same node. Without inhibition, your on-call engineer gets twenty-one pages. With an inhibition rule that says "suppress everything on the same node label when NodeDown is firing," they get one. The inhibition source and target must share matching label values (defined in equal) for the suppression to apply.
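The NodeDown case could be expressed roughly as follows. This is a sketch that assumes all three alerts carry a node label and that the Alertmanager version supports the source_matchers / target_matchers syntax (older releases use source_match / target_match instead); the target list is narrowed to the two alert names from the example rather than "everything".

```yaml
# alertmanager.yml (fragment): alert names and the node label are assumptions
inhibit_rules:
  - source_matchers:
      - alertname = "NodeDown"
    target_matchers:
      - alertname =~ "PodCrashLooping|HighLatency"
    # Suppress a target alert only when it carries the same value as the
    # source alert for every label listed here.
    equal: [node]
```

Inhibited alerts are still visible in Alertmanager; they are simply not notified while the source alert is firing.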
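A word of caution: choose the equal labels carefully. If they are too broad (for example, cluster alone rather than node), one source alert can suppress unrelated alerts across the whole cluster.
Silences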
Silences are temporary, label-based mutes applied through the Alertmanager UI or API. They are the correct tool for planned maintenance windows and known-bad periods: you silence a set of labels for a duration, and all matching alerts are suppressed without any changes to routing or inhibition config. Silences carry a creator, a comment, and an expiry — they are auditable.
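The amtool CLI is the production-grade way to manage silences programmatically, particularly inside maintenance automation scripts: amtool silence add creates a silence from a set of label matchers, a duration, and a comment; amtool silence query lists active silences; and amtool silence expire removes one by ID.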
The Alert Lifecycle: End-to-End Path
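Understanding the complete path an alert travels helps you debug missed pages and duplicate notifications, two of the most common Alertmanager complaints at scale. In outline: Prometheus evaluates the rule, holds the alert as pending for the rule's for duration, then marks it firing and pushes it to Alertmanager, which routes it, adds it to a group, checks inhibition rules and silences, and finally sends the notification through the matched receiver.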
On-Call Integrations in Production
At big-tech scale, the PagerDuty and OpsGenie integrations carry the most operational weight. A few patterns that matter:
- Severity mapping: Map Prometheus severity labels directly to PagerDuty severity levels (critical → P1 immediate, warning → P3 business hours). This is configured in the pagerduty_configs.severity template field.
- Deduplication keys: Alertmanager sends a stable dedup_key (derived from the alert group's key) so that repeated notifications for the same firing group update the existing PagerDuty incident rather than opening a new one. This is automatic, but it only works correctly if your group_by labels are stable across alert firings.
- Runbook links in annotations: Always include a runbook_url annotation on your alerting rules. Surface it in the notification template so the on-call engineer lands on the correct runbook from the first click.
- Webhook receivers: For custom routing logic beyond what Alertmanager's tree supports, a webhook receiver posting to a small Lambda or Cloud Run function gives you arbitrary routing power, which is useful for multi-tenant products where the alert needs to fan out to the correct customer-specific channel.
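The severity mapping and runbook patterns above might be wired into a PagerDuty receiver along these lines. This is a sketch only: the receiver name and routing key are placeholders, and it assumes alerts carry a severity label plus summary and runbook_url annotations that are common to the whole group.

```yaml
# alertmanager.yml (fragment): receiver name and key are hypothetical
receivers:
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: "<events-api-v2-integration-key>"   # placeholder, not a real key
        # Map the severity label: critical stays critical,
        # anything else is sent to PagerDuty as warning.
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          # Surface the runbook link from the alerting rule's annotations
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          firing: '{{ .Alerts.Firing | len }}'
```

Nothing is configured for the deduplication key here because Alertmanager sets it on its own; the template fields only shape what the on-call engineer sees.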
High Availability for Alertmanager
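A single Alertmanager is a single point of failure for your entire alerting chain. Alertmanager supports a gossip-based HA cluster: run three instances and point every Prometheus server at all three directly (in the alerting.alertmanagers section of prometheus.yml), not through a load balancer. The instances use the Memberlist protocol to share notification state and silences, so that in normal operation each notification is sent only once even though every instance receives every alert. The critical flag is --cluster.peer, which gives an instance the address of an initial peer to join and may be repeated.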
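A sketch of both halves of such a setup follows, shown as two separate fragments in one block. The host names alertmanager-1 to alertmanager-3 are hypothetical, the ports are the usual defaults (9093 for the API, 9094 for cluster gossip), and Docker Compose is used only as an assumed way to pass the flags; volume mounts and the other two services are omitted.

```yaml
# prometheus.yml (fragment): list every Alertmanager instance explicitly
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1:9093
            - alertmanager-2:9093
            - alertmanager-3:9093
---
# docker-compose.yml (fragment): one of three near-identical services
services:
  alertmanager-1:
    image: prom/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --cluster.listen-address=0.0.0.0:9094
      # Initial peers to join; the other two instances mirror this list
      - --cluster.peer=alertmanager-2:9094
      - --cluster.peer=alertmanager-3:9094
```

Each Prometheus server sends every alert to all three targets, and the cluster itself is responsible for avoiding duplicate notifications.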
Validating and Debugging
Two tools every Alertmanager operator should have muscle memory for: amtool check-config (validates the configuration file and any referenced templates before a reload) and the /api/v2/alerts endpoint (shows currently firing alerts, useful in runbooks). The Alertmanager UI at :9093 shows the loaded configuration and cluster state on its Status page. To see where an alert would land, amtool config routes test takes a label set and prints the receiver it would be routed to, and the Prometheus project hosts a visual routing tree editor that renders a pasted configuration. Both are invaluable when debugging why an alert went to the wrong channel.