Monitoring, Logging & Alerting

You cannot fix what you cannot see. When a production system degrades at 2 a.m., the three pillars of observability — metrics, logs, and traces — determine whether your on-call engineer restores service in five minutes or five hours. Observability is not a luxury you bolt on after launch; it is a first-class design requirement, as fundamental to a reliable system as redundancy or failover.

The Three Pillars of Observability

The term "observability" comes from control theory: a system is observable if you can infer its internal state purely from its external outputs. In distributed systems engineering, this translates to three complementary data types, each answering a different question:

Metrics — What is happening right now? Numeric time-series data aggregated over time (e.g. request rate, error rate, CPU utilisation, p99 latency).
Logs — What exactly happened? Immutable, timestamped records of discrete events (e.g. an individual HTTP request, a database query, an exception stack trace).
Traces — Why is it slow or broken? End-to-end records of a single request's journey across multiple services, showing where time was spent and where errors occurred.

Key idea: Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where and why. You need all three; they are complements, not substitutes.

The three pillars of observability feed a unified platform that powers dashboards, alerting, and incident investigation.

Pillar 1: Metrics

Metrics are cheap to store, fast to query, and ideal for dashboards and alerting. They are pre-aggregated numbers: you lose individual event detail but gain the ability to summarise billions of events in milliseconds. Two widely used frameworks are the RED method (for services) and the USE method (for resources):

RED (for every service): Rate (requests/sec), Errors (failed requests/sec), Duration (latency distribution).
USE (for every resource): Utilisation (% busy), Saturation (queue depth), Errors (error events).
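
As a minimal sketch of how the RED signals fall out of raw request records, the Python below computes them for one time window. The `Request` record, the `red_summary` and `percentile` helpers, the nearest-rank percentile method, and the choice to count only 5xx responses as errors are all illustrative assumptions, not part of any particular monitoring library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Rate and Errors per second, plus Duration percentiles, for one window."""
    if not requests:
        return {"rate": 0.0, "errors": 0.0, "p50_ms": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    failed = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate": len(requests) / window_s,
        "errors": failed / window_s,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }
```

In a real system the metrics library would keep counters and histograms instead of raw records; the sketch only shows what the three numbers mean.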

Real-world numbers matter. Netflix's Hystrix dashboard showed circuit-breaker states in real time across hundreds of microservices. Google's internal Monarch system ingests over 1 trillion metric data points per day across its fleet. At this scale, choosing the right metric resolution matters: per-second granularity produces 60× as many raw samples as 1-minute granularity, so keeping it for 13 months is far more expensive even after compression. Many teams therefore keep high-resolution data for a short window (15 days is a common choice), then downsample.
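
The effect of resolution and retention on raw sample counts can be estimated with simple arithmetic. The sketch below counts samples per time series before any compression; the 395-day figure standing in for "13 months" and the function name are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400


def samples_per_series(retention_days: int, resolution_s: int) -> int:
    """Raw samples stored for one time series, ignoring compression."""
    return retention_days * SECONDS_PER_DAY // resolution_s


# Assumption: 13 months is approximated as 395 days.
full_resolution = samples_per_series(395, 1)    # per-second for the whole period
coarse_only = samples_per_series(395, 60)       # 1-minute for the whole period
# Tiered policy: 15 days at 1 s, the remaining 380 days downsampled to 1 min.
tiered = samples_per_series(15, 1) + samples_per_series(380, 60)

if __name__ == "__main__":
    print(f"per-second, 13 months: {full_resolution:,}")
    print(f"1-minute, 13 months:   {coarse_only:,}")
    print(f"tiered 15 d + 380 d:   {tiered:,}")
```

The raw ratio between the first two policies is exactly 60. Actual storage cost depends on the compression and pricing of the backend in use.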

The dominant open-source stack is Prometheus (pull-based scraping, powerful query language PromQL) with Grafana for visualisation. Cloud-managed equivalents include AWS CloudWatch, Google Cloud Monitoring, and Azure Monitor. Datadog and New Relic offer SaaS alternatives with tighter integrations.

Best practice (the four golden signals): Google's SRE book popularised four golden signals that every user-facing service should expose: Latency, Traffic, Errors, and Saturation. If you can only measure four things, measure these. They track user experience closely and are enough to drive most alert policies.

Pillar 2: Logs

A log is an append-only, immutable record of a discrete event. Logs provide the narrative that metrics cannot: which specific user's request failed, what the SQL query looked like, what error message the third-party API returned. The key engineering choice is structured logging vs plain text.

Plain text logs (e.g. Apache Combined Log Format) are human-readable but require fragile regex parsing to analyse programmatically. Structured logs emit JSON objects with consistent field names (user_id, trace_id, status_code, duration_ms), making them trivially indexable and queryable. Always prefer structured logging for production systems.

// Plain text (hard to parse at scale)
[2024-03-15 14:23:01] GET /api/orders/98 200 142ms user=47391

// Structured JSON (machine-readable, easy to query)
{
  "timestamp": "2024-03-15T14:23:01.342Z",
  "method": "GET",
  "path": "/api/orders/98",
  "status": 200,
  "duration_ms": 142,
  "user_id": 47391,
  "trace_id": "4bf92f3577b34da6"
}
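
One way to emit events like the JSON example above is a custom formatter on Python's standard `logging` module. This is a sketch under stated assumptions: `JsonFormatter`, `build_logger`, and the convention of passing structured fields through `extra={"fields": ...}` are hypothetical names chosen here, not an established API.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Assumption: callers pass structured fields via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info(
        "request completed",
        extra={"fields": {
            "method": "GET", "path": "/api/orders/98", "status": 200,
            "duration_ms": 142, "user_id": 47391, "trace_id": "4bf92f3577b34da6",
        }},
    )
```

Keeping field names identical across services is what makes the resulting events easy to index and join on `trace_id`.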

The standard pipeline for log management is the ELK stack: Elasticsearch (index and query), Logstash (ingest and transform), and Kibana (visualise and search). Swapping Logstash for Fluentd gives the variant usually called EFK. At large scale (petabytes/day), teams often switch to columnar stores like ClickHouse or managed services like AWS CloudWatch Logs, Google Cloud Logging, or Datadog Logs.

Pitfall — logging too much or too little: Over-logging floods storage and makes signal extraction impossible (a 10 MB/s log stream at DEBUG level generates 864 GB/day per service). Under-logging leaves you blind during incidents. Use log levels correctly: DEBUG in dev only, INFO for normal operations, WARN for recoverable anomalies, ERROR for failures requiring attention. Never log sensitive data (passwords, PII, tokens) — it creates compliance and security exposure.
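
The volume figure and the "never log sensitive data" rule can both be sketched in a few lines. The key list in `SENSITIVE_KEYS` and the helper names are assumptions for illustration; a real deny-list would come from your own compliance requirements.

```python
from typing import Any, Dict

SECONDS_PER_DAY = 86_400

# Assumption: an illustrative deny-list, not an exhaustive one.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "email"}


def daily_volume_gb(mb_per_second: float) -> float:
    """Daily log volume in GB (decimal units) for a steady stream."""
    return mb_per_second * SECONDS_PER_DAY / 1000


def redact(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with sensitive values masked, recursively."""
    clean: Dict[str, Any] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    print(daily_volume_gb(10))  # the 10 MB/s stream from the text
    print(redact({"user_id": 47391, "auth": {"token": "abc123"}}))
```

Matching on key names catches only fields that are labelled as sensitive; secrets embedded in free-text messages need separate handling.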

Pillar 3: Distributed Traces

In a monolith, a slow request is relatively easy to pin down: a profiler or a stack trace from the single process shows where the time is going. In a microservices architecture with 20 services collaborating on a single user action, the culprit could be any one of them, or the network between them. Distributed tracing solves this by propagating a unique trace_id through every service call, then assembling a span timeline (a Gantt chart of time spent at each hop).
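
Propagation usually happens through a request header. The sketch below uses the header layout of the W3C Trace Context `traceparent` field (version, 32-hex trace ID, 16-hex parent span ID, flags); the function names and the dictionary-of-headers model are assumptions, and a real service would normally let a tracing library do this.

```python
import secrets
from typing import Dict, Optional, Tuple


def new_traceparent(trace_id: Optional[str] = None) -> str:
    """Build a traceparent value, reusing trace_id if one is supplied."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters, new per hop
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> Tuple[str, str]:
    """Return (trace_id, parent_span_id) from a traceparent value."""
    _version, trace_id, parent_id, _flags = header.split("-")
    return trace_id, parent_id


def call_downstream(incoming_headers: Dict[str, str]) -> Dict[str, str]:
    """Build outgoing headers that continue the caller's trace, or start one."""
    header = incoming_headers.get("traceparent")
    trace_id = parse_traceparent(header)[0] if header else None
    return {"traceparent": new_traceparent(trace_id)}


if __name__ == "__main__":
    edge = call_downstream({})      # first service starts the trace
    hop = call_downstream(edge)     # second service continues it
    print(edge["traceparent"])
    print(hop["traceparent"])
```

Every hop keeps the same trace ID and mints a new span ID, which is what lets a backend stitch the spans into one timeline.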

A trace is composed of spans. Each span records the start time, duration, service name, operation name, and any errors or metadata for one unit of work. The root span represents the user-facing request; child spans represent downstream calls (database queries, cache lookups, RPC calls to other services). Tools like Jaeger, Zipkin, and AWS X-Ray visualise these span trees.
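
A span tree can be modelled with a small record type. The sketch below finds the span with the largest self time (its duration minus that of its direct children). The `Span` fields, the sample durations, and the assumption that child spans run sequentially without overlap are all illustrative; start times are omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Self time per span, assuming children do not overlap in time."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        s.span_id: max(s.duration_ms - child_total.get(s.span_id, 0.0), 0.0)
        for s in spans
    }


def bottleneck(spans: List[Span]) -> Span:
    """Span where the most time was spent outside of its children."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace: a gateway call into an order service with two child calls.
TRACE = [
    Span("a", None, "api-gateway", "GET /api/orders/98", 180.0),
    Span("b", "a", "order-service", "get_order", 165.0),
    Span("c", "b", "order-service", "postgres SELECT", 140.0),
    Span("d", "b", "order-service", "cache lookup", 5.0),
]

if __name__ == "__main__":
    worst = bottleneck(TRACE)
    print(worst.service, worst.operation, worst.duration_ms)
```

Ranking by self time instead of total duration avoids always blaming the root span, which by construction contains everything else.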

The OpenTelemetry project (CNCF) provides vendor-neutral APIs, SDKs, and a wire protocol (OTLP), so you instrument once and export to any compatible backend (Jaeger, Datadog, Honeycomb, etc.). It is rapidly becoming the industry standard; adopt it for new systems to avoid vendor lock-in.

A distributed trace shows the Gantt-chart of spans across services. The slow Postgres query (140 ms) is immediately visible as the bottleneck inside the Order Service.

Alerting: From Signal to Action

Metrics and logs are only useful if the right people are notified at the right time. Effective alerting is harder than it sounds — most teams suffer from alert fatigue, where too many noisy alerts cause on-call engineers to start ignoring them, which is the worst possible outcome.

The key principles of effective alerting are:

Alert on symptoms, not causes. "User-facing error rate > 1%" is a symptom (something the user feels). "CPU > 80%" is a cause — it may or may not affect users. Alert on symptoms; let runbooks and traces identify causes.
Alert on SLO burn rate. Instead of a fixed threshold, alert when you are burning your error budget faster than sustainable. If your monthly error budget would be exhausted in 2 hours at the current burn rate, page someone immediately — even if absolute error count seems small.
Every alert must be actionable. If the response to an alert is "check and do nothing," remove the alert. Pages that have no defined response action are pure noise.
Distinguish severity: page vs ticket. Reserve pager alerts for issues that need immediate human response (SLO breach imminent). Lower-severity issues go to a ticket queue for next-business-day handling.
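
The burn-rate principle above reduces to a short calculation. In this sketch, burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and the time to exhaustion assumes the full budget is still available and the rate stays constant. The function names, the 720-hour (30-day) window, and the 99.9% target are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_hours: float = 720.0) -> float:
    """Hours until a full budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    # Assumption: 99.9% SLO over 30 days, 36% of requests currently failing.
    rate = burn_rate(error_ratio=0.36, slo_target=0.999)
    print(f"burn rate: {rate:.0f}x")
    print(f"budget exhausted in: {hours_to_exhaustion(rate):.1f} h")
```

A burn rate of 1 spends the budget exactly over the window; the paging threshold is a policy choice and is often evaluated over more than one lookback window to reduce noise.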

Popular alerting pipelines: Prometheus Alertmanager → PagerDuty / OpsGenie for paging; Slack / email for lower-severity notifications. Datadog and Grafana Cloud include built-in alerting engines.

Best practice — the on-call engineer should be able to answer three questions from the alert alone: (1) What is broken? (2) Who is affected? (3) What is the immediate mitigation? If any of these require more than one click, your alert message and runbook need improvement.

Putting It Together: Incident Investigation Workflow

When an alert fires, observability data drives a structured investigation:

Metrics dashboard — identify which service's RED signals degraded and when exactly (pinpoint the start of the incident).
Logs — filter by the affected time window and service; find error messages, trace IDs of failed requests.
Traces — use a trace ID from a failing log entry to pull up the full span tree; identify the slow or erroring span.
Fix, verify, document — deploy the fix, confirm metrics recover, write a post-mortem, and improve the runbook or alert.
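
The log-to-trace steps of this workflow can be sketched as two small lookups. The event and span dictionaries below are hypothetical shapes, and a real investigation would run the equivalent queries against a log store and a tracing backend.

```python
from typing import Dict, List


def failing_trace_ids(
    logs: List[Dict], service: str, start: float, end: float
) -> List[str]:
    """Trace IDs of failed requests for one service inside a time window."""
    found: List[str] = []
    for event in logs:
        in_window = start <= event["timestamp"] < end
        failed = event["status"] >= 500  # assumption: 5xx marks a failure
        if event["service"] == service and in_window and failed:
            if event["trace_id"] not in found:
                found.append(event["trace_id"])
    return found


def slowest_leaf_span(spans: List[Dict]) -> Dict:
    """Slowest span that has no children, a first candidate for the culprit."""
    parents = {s["parent_id"] for s in spans if s["parent_id"] is not None}
    leaves = [s for s in spans if s["span_id"] not in parents]
    return max(leaves, key=lambda s: s["duration_ms"])


# Hypothetical sample data.
LOGS = [
    {"timestamp": 100.0, "service": "orders", "status": 200, "trace_id": "t1"},
    {"timestamp": 130.0, "service": "orders", "status": 503, "trace_id": "t2"},
    {"timestamp": 135.0, "service": "billing", "status": 500, "trace_id": "t3"},
    {"timestamp": 300.0, "service": "orders", "status": 500, "trace_id": "t4"},
]
SPANS_BY_TRACE = {
    "t2": [
        {"span_id": "a", "parent_id": None, "operation": "GET /api/orders/98", "duration_ms": 5020.0},
        {"span_id": "b", "parent_id": "a", "operation": "postgres SELECT", "duration_ms": 5000.0},
        {"span_id": "c", "parent_id": "a", "operation": "cache lookup", "duration_ms": 4.0},
    ],
}

if __name__ == "__main__":
    for trace_id in failing_trace_ids(LOGS, "orders", start=120.0, end=180.0):
        print(trace_id, slowest_leaf_span(SPANS_BY_TRACE[trace_id])["operation"])
```

The join key is the `trace_id` carried in every structured log line, which is why logging it consistently matters.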

This loop closes the observability cycle. As a rough guide, teams with mature observability tend to detect incidents in under 5 minutes (mean time to detect, MTTD) and resolve them in under 30 minutes (mean time to resolve, MTTR). Teams without structured observability often see MTTD of 30–60 minutes and MTTR of several hours, a stark difference in customer impact.
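
MTTD and MTTR are simple averages over incident timestamps. In the sketch below, the `Incident` record and the sample data are hypothetical, and MTTR is measured from incident start to resolution; some teams measure it from detection instead, so state the definition when comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when an alert fired or a human noticed
    resolved: datetime  # when service was restored


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time from fault start to detection, in minutes."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time from fault start to resolution, in minutes."""
    return mean((i.resolved - i.started).total_seconds() / 60 for i in incidents)


# Hypothetical incident history.
HISTORY = [
    Incident(datetime(2024, 3, 15, 2, 0), datetime(2024, 3, 15, 2, 4), datetime(2024, 3, 15, 2, 30)),
    Incident(datetime(2024, 4, 2, 10, 0), datetime(2024, 4, 2, 10, 6), datetime(2024, 4, 2, 10, 20)),
]

if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(HISTORY):.1f} min")
    print(f"MTTR: {mttr_minutes(HISTORY):.1f} min")
```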

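Real-world example: GitHub's October 2018 incident degraded service for about 24 hours. According to GitHub's public post-incident analysis, a 43-second network partition triggered a cross-data-center database failover, and most of the recovery time went into restoring and reconciling data rather than locating the fault. Internal monitoring raised alerts within minutes, which let responders start diagnosing the database topology quickly. The lesson: observability infrastructure is a prerequisite for fast incident resolution, not an optional add-on, although fast detection alone does not guarantee a short recovery.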

Summary

Observability rests on three pillars: metrics (aggregated time-series for dashboards and alerting), logs (structured event records for forensic investigation), and traces (end-to-end span timelines for root-cause analysis in distributed systems). Effective alerting means alarming on symptoms and SLO burn rate, not raw resource thresholds, and ensuring every alert is actionable. Together, these practices compress incident response times from hours to minutes — a direct multiplier on your system's effective reliability.