Monitoring, Logging & Alerting
You cannot fix what you cannot see. When a production system degrades at 2 a.m., the three pillars of observability — metrics, logs, and traces — determine whether your on-call engineer restores service in five minutes or five hours. Observability is not a luxury you bolt on after launch; it is a first-class design requirement, as fundamental to a reliable system as redundancy or failover.
The Three Pillars of Observability
The term "observability" comes from control theory: a system is observable if you can infer its internal state purely from its external outputs. In distributed systems engineering, this translates to three complementary data types, each answering a different question:
- Metrics — What is happening right now? Numeric time-series data aggregated over time (e.g. request rate, error rate, CPU utilisation, p99 latency).
- Logs — What exactly happened? Immutable, timestamped records of discrete events (e.g. an individual HTTP request, a database query, an exception stack trace).
- Traces — Why is it slow or broken? End-to-end records of a single request's journey across multiple services, showing where time was spent and where errors occurred.
Pillar 1: Metrics
Metrics are cheap to store, fast to query, and ideal for dashboards and alerting. They are pre-aggregated numbers: you lose individual event detail but gain the ability to summarise billions of events with a query that returns in milliseconds. Two widely used frameworks tell you what to measure: the RED method (for services) and the USE method (for resources):
- RED (for every service): Rate (requests/sec), Errors (failed requests/sec), Duration (latency distribution).
- USE (for every resource): Utilisation (% busy), Saturation (queue depth), Errors (error events).
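As a rough illustration of the RED method, the sketch below summarises one time window of requests into rate, errors, and duration. The `Request` record, the `red_summary` function, and the nearest-rank percentile are assumptions made for this example; a real system would usually rely on a metrics library rather than computing these by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request failed


def percentile(sorted_values, q):
    """Nearest-rank percentile of a sorted, non-empty list (q is an integer 1..100)."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarise one time window of requests as Rate, Errors, and Duration."""
    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_sec": len(requests) / window_seconds,
        "errors_per_sec": failures / window_seconds,
        # Duration is reported as percentiles, not a mean, so outliers stay visible.
        "p50_ms": percentile(durations, 50) if durations else None,
        "p99_ms": percentile(durations, 99) if durations else None,
    }


# 100 requests in a 10-second window: durations 1..100 ms, the five slowest fail.
window = [Request(duration_ms=float(i), ok=i <= 95) for i in range(1, 101)]
print(red_summary(window, window_seconds=10))
```

Reporting percentiles instead of an average is a deliberate choice here: a handful of very slow requests barely moves the mean but shows up clearly in p99.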
Real-world numbers matter. Netflix's Hystrix dashboard showed circuit-breaker states in real time across hundreds of microservices. Google's internal Monarch system ingests over 1 trillion metric data points per day across its fleet. At this scale, choosing the right metric resolution matters: per-second granularity produces 60× as many raw samples as 1-minute granularity, so keeping it for 13 months multiplies storage cost accordingly. A common compromise is to keep high-resolution data for about 15 days (the default retention in Prometheus) and downsample older data.
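To make the resolution trade-off concrete, the following sketch counts raw samples per time series under different retention policies. It counts samples only; actual storage cost also depends on compression and pricing, which are not modelled. The 395-day figure for "13 months" and the function names are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def samples_per_series(retention_days, interval_seconds):
    """Raw samples one time series produces over the retention period."""
    return retention_days * SECONDS_PER_DAY // interval_seconds


def tiered_samples(total_days, high_res_days, high_res_interval, low_res_interval):
    """Samples kept when only the most recent days stay at high resolution."""
    recent = samples_per_series(high_res_days, high_res_interval)
    older = samples_per_series(total_days - high_res_days, low_res_interval)
    return recent + older


THIRTEEN_MONTHS = 395  # days; an approximation chosen for this example

per_second = samples_per_series(THIRTEEN_MONTHS, 1)
per_minute = samples_per_series(THIRTEEN_MONTHS, 60)
tiered = tiered_samples(THIRTEEN_MONTHS, 15, 1, 60)

print(f"1 s for 13 months:       {per_second:>12,}")
print(f"60 s for 13 months:      {per_minute:>12,}")
print(f"1 s for 15 d, then 60 s: {tiered:>12,}")
```

Under these assumptions the tiered policy keeps roughly one eighteenth of the samples that full per-second retention would, while still offering fine-grained data for recent incidents.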
The dominant open-source stack is Prometheus (pull-based scraping, powerful query language PromQL) with Grafana for visualisation. Cloud-managed equivalents include AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor. Datadog and New Relic offer SaaS alternatives with tighter integrations.
Pillar 2: Logs
A log is an append-only, immutable record of a discrete event. Logs provide the narrative that metrics cannot: which specific user's request failed, what the SQL query looked like, what error message the third-party API returned. The key engineering choice is structured logging vs plain text.
Plain text logs (e.g. Apache Combined Log Format) are human-readable but require fragile regex parsing to analyse programmatically. Structured logs emit JSON objects with consistent field names (user_id, trace_id, status_code, duration_ms), making them trivially indexable and queryable. Always prefer structured logging for production systems.
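A minimal sketch of structured logging with Python's standard `logging` module is shown below. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under an `extra={"fields": ...}` key are assumptions made for this example, not a standard API; dedicated libraries offer richer versions of the same idea.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record;
        # this sketch expects them grouped under a "fields" dict.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream, name="checkout"):
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


logger = build_logger(sys.stdout)
logger.info(
    "request completed",
    extra={"fields": {"user_id": "u-123", "trace_id": "abc123",
                      "status_code": 200, "duration_ms": 42}},
)
```

Because every line is a JSON object with stable field names, a log backend can index `status_code` or `trace_id` directly instead of extracting them with regular expressions.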
A common pipeline for log management is the ELK stack: Elasticsearch (index and query), Logstash (ingest and transform), and Kibana (visualise and search). Swapping Logstash for Fluentd gives the variant often called EFK. At large scale (petabytes/day), teams often switch to columnar stores like ClickHouse or managed services like AWS CloudWatch Logs, Google Cloud Logging, or Datadog Logs.
Pillar 3: Distributed Traces
In a monolith, a slow request is easy to pin down: look at the stack trace. In a microservices architecture with 20 services collaborating on a single user action, the culprit could be any one of them — or the network between them. Distributed tracing solves this by propagating a unique trace_id through every service call, then assembling a span timeline (a Gantt chart of time spent at each hop).
A trace is composed of spans. Each span records the start time, duration, service name, operation name, and any errors or metadata for one unit of work. The root span represents the user-facing request; child spans represent downstream calls (database queries, cache lookups, RPC calls to other services). Tools like Jaeger, Zipkin, and AWS X-Ray visualise these span trees.
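The span model described above can be sketched as a small data structure. The `Span` fields mirror the attributes listed in the text; the `self_times`, `slowest_span`, and `render` helpers are illustrative assumptions, and the self-time calculation assumes child spans run one after another and not in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    start_ms: float
    duration_ms: float
    error: bool = False


def self_times(spans):
    """Duration of each span minus the time spent in its direct children."""
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    return {s.span_id: s.duration_ms - child_total[s.span_id] for s in spans}


def slowest_span(spans):
    """Span with the largest self time, i.e. where the time was actually spent."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


def render(spans):
    """Indented text view of the span tree, children ordered by start time."""
    children = defaultdict(list)
    for s in sorted(spans, key=lambda s: s.start_ms):
        children[s.parent_id].append(s)
    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            flag = " ERROR" if s.error else ""
            lines.append(f"{'  ' * depth}{s.service}:{s.operation} {s.duration_ms:.0f}ms{flag}")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("t1", "a", None, "gateway", "GET /checkout", 0, 500),
    Span("t1", "b", "a", "orders", "create_order", 10, 450),
    Span("t1", "c", "b", "postgres", "INSERT orders", 30, 400),
    Span("t1", "d", "a", "cache", "get_cart", 465, 20),
]
print("\n".join(render(trace)))
print("slowest:", slowest_span(trace).service)
```

Looking at self time instead of total duration avoids blaming the root span, which always has the longest total simply because it contains everything else.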
The OpenTelemetry project (CNCF) provides vendor-neutral SDKs and a wire format (OTLP) that let you instrument once and export to any backend (Jaeger, Datadog, Honeycomb, etc.). It is rapidly becoming the industry standard; adopt it for new systems to avoid vendor lock-in.
Alerting: From Signal to Action
Metrics and logs are only useful if the right people are notified at the right time. Effective alerting is harder than it sounds. Many teams suffer from alert fatigue: too many noisy alerts train on-call engineers to ignore pages, including the ones that matter, which is the worst possible outcome.
The key principles of effective alerting are:
- Alert on symptoms, not causes. "User-facing error rate > 1%" is a symptom (something the user feels). "CPU > 80%" is a cause — it may or may not affect users. Alert on symptoms; let runbooks and traces identify causes.
- Alert on SLO burn rate. Instead of a fixed threshold, alert when you are burning your error budget faster than sustainable. If your monthly error budget would be exhausted in 2 hours at the current burn rate, page someone immediately — even if absolute error count seems small.
- Every alert must be actionable. If the response to an alert is "check and do nothing," remove the alert. Pages that have no defined response action are pure noise.
- Distinguish severity: page vs ticket. Reserve pager alerts for issues that need immediate human response (SLO breach imminent). Lower-severity issues go to a ticket queue for next-business-day handling.
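The burn-rate and severity principles above can be sketched as a small calculation. This version assumes a 30-day SLO window, a full error budget at the start, and an error ratio that stays constant; the function names and the paging and ticket thresholds are hypothetical values chosen for illustration, not recommended settings.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    budget = 1.0 - slo_target  # a 99.9% target leaves a 0.1% error budget
    return error_ratio / budget


def hours_until_exhausted(error_ratio, slo_target, window_hours=30 * 24):
    """Hours until a full budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo_target)
    return float("inf") if rate <= 0 else window_hours / rate


def classify(error_ratio, slo_target, page_below_hours=24, ticket_below_hours=30 * 24):
    """Map the projected exhaustion time to a severity: page, ticket, or ok."""
    hours_left = hours_until_exhausted(error_ratio, slo_target)
    if hours_left < page_below_hours:
        return "page"
    if hours_left < ticket_below_hours:
        return "ticket"
    return "ok"


# With a 99.9% SLO, a 36% error ratio spends a month of budget in about 2 hours.
for ratio in (0.36, 0.005, 0.0005):
    hours = hours_until_exhausted(ratio, 0.999)
    print(f"error ratio {ratio:.2%}: {hours:,.0f} h left -> {classify(ratio, 0.999)}")
```

A burn rate of 1 means the budget lasts exactly the SLO window; anything above 1 exhausts it early, which is why the same error ratio can be harmless for a loose SLO and urgent for a strict one.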
Popular alerting pipelines: Prometheus Alertmanager → PagerDuty / OpsGenie for paging; Slack / email for lower-severity notifications. Datadog and Grafana Cloud include built-in alerting engines.
Putting It Together: Incident Investigation Workflow
When an alert fires, observability data drives a structured investigation:
- Metrics dashboard — identify which service's RED signals degraded and when exactly (pinpoint the start of the incident).
- Logs — filter by the affected time window and service; find error messages, trace IDs of failed requests.
- Traces — use a trace ID from a failing log entry to pull up the full span tree; identify the slow or erroring span.
- Fix, verify, document — deploy the fix, confirm metrics recover, write a post-mortem, and improve the runbook or alert.
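The hand-off from logs to traces in the workflow above might look like the following sketch, which scans structured log lines for failed requests in the incident window and collects their trace IDs. The field names `service` and `timestamp` (epoch seconds) and the rule that a status of 500 or above means failure are assumptions for this example; a real investigation would normally run an equivalent query in the log backend.

```python
import json


def failing_trace_ids(log_lines, service, start_ts, end_ts, min_status=500):
    """Return unique trace IDs of failed requests for one service and time window."""
    trace_ids = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not structured logs
        if not isinstance(entry, dict) or entry.get("service") != service:
            continue
        ts = entry.get("timestamp")
        status = entry.get("status_code")
        if not isinstance(ts, (int, float)) or not start_ts <= ts <= end_ts:
            continue
        if not isinstance(status, int) or status < min_status:
            continue
        trace_id = entry.get("trace_id")
        if trace_id and trace_id not in trace_ids:
            trace_ids.append(trace_id)  # keep first-seen order
    return trace_ids


logs = [
    json.dumps({"service": "orders", "timestamp": 100, "status_code": 200, "trace_id": "t-ok"}),
    json.dumps({"service": "orders", "timestamp": 130, "status_code": 503, "trace_id": "t-bad-1"}),
    json.dumps({"service": "cart", "timestamp": 131, "status_code": 500, "trace_id": "t-other"}),
    "plain text line that is not JSON",
    json.dumps({"service": "orders", "timestamp": 140, "status_code": 500, "trace_id": "t-bad-2"}),
]
print(failing_trace_ids(logs, "orders", start_ts=120, end_ts=150))
```

Each returned ID can then be looked up in the tracing tool to open the full span tree for that request.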
This loop closes the observability cycle. Companies with mature observability typically report a mean time to detect (MTTD) under 5 minutes and a mean time to resolve (MTTR) under 30 minutes. Companies without structured observability often see MTTD of 30–60 minutes and MTTR of several hours, a stark difference in customer impact.
Summary
Observability rests on three pillars: metrics (aggregated time-series for dashboards and alerting), logs (structured event records for forensic investigation), and traces (end-to-end span timelines for root-cause analysis in distributed systems). Effective alerting means alarming on symptoms and SLO burn rate, not raw resource thresholds, and ensuring every alert is actionable. Together, these practices compress incident response times from hours to minutes — a direct multiplier on your system's effective reliability.