The Three Pillars of Observability
The three pillars (metrics, logs, and traces) are the raw signals that let you understand what a distributed system is doing at any moment. The framing is commonly traced to Cory Watson's 2013 work at Twitter and was popularised by Cindy Sridharan's writing on observability. By 2025, platform teams at companies such as Google, Meta, Uber, and Stripe commonly organise their production tooling around these three signal types, typically with a dedicated store for each: Prometheus or Thanos for metrics, Elasticsearch or Loki for logs, Jaeger or Tempo for traces.
No single pillar is sufficient on its own. They are complementary: metrics tell you that something is wrong, logs tell you what happened, and traces tell you where in the call chain the problem originated. Miss any one and your on-call engineer will be flying blind during an incident.
Pillar 1: Metrics
Metrics are numeric measurements aggregated over time. A counter increments every time an HTTP request is served; a gauge tracks current memory usage; a histogram buckets request durations and lets you estimate percentiles, with accuracy limited by where the bucket boundaries sit. They are cheap to store (a single float + labels + timestamp), cheap to query at scale, and the right foundation for dashboards and alerts.
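To make the three metric types concrete, here is a minimal in-memory sketch. It is not the Prometheus client library; the class names, the bucket bounds, and the linear interpolation inside a bucket are assumptions chosen for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing value, e.g. total HTTP requests served."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down, e.g. memory in use."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; upper bounds are inclusive."""
    bounds: list[float]                     # sorted upper bounds, in seconds
    counts: list[int] = field(init=False)   # one extra slot is the +Inf bucket

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1

    def quantile(self, q: float) -> float:
        """Estimate a quantile by linear interpolation inside one bucket."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        rank = q * total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.bounds):
                    # Rank falls in the +Inf bucket: the best available
                    # answer is the highest finite bound.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                return lower + (self.bounds[i] - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-1]


# Hypothetical request durations in seconds.
latency = Histogram(bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.3, 0.4, 2.0):
    latency.observe(seconds)
```

The estimate can never be more precise than the bucket layout: an observation of 2.0 s lands in the +Inf bucket, so a high quantile is reported as the top finite bound (1.0 s) rather than the true value.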
The Prometheus data model is the de facto industry standard. Every metric has a name and a set of key-value labels, for example http_requests_total{method="POST", status="500", service="checkout"}. Labels are what make metrics powerful: you can sum across all services, filter to a single endpoint, or compare error rates by region. But labels have a cost: every unique combination of label values is a separate time series. A label with unbounded cardinality (like user_id or request_id) creates a new series for every value it takes and can exhaust the memory of your Prometheus TSDB.
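The cardinality cost is simple multiplication, which a short sketch can show. The per-label value counts below are made-up assumptions, and the result is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric name:
    the product of the number of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts for http_requests_total.
bounded = {"method": 5, "status": 8, "service": 40}

# Same metric after someone adds a user_id label (assumed 1M users).
unbounded = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))    # 5 * 8 * 40 = 1600
print(series_count(unbounded))  # 1600 * 1_000_000 = 1600000000
```

One unbounded label turns a metric with at most 1,600 series into one with up to 1.6 billion, which is why identifiers like user_id belong in logs or trace attributes rather than in metric labels.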
Pillar 2: Logs
Logs are discrete, timestamped records of individual events. A log line captures what happened at an exact moment — an HTTP request was received with these headers, a database query timed out, a user was denied access. Logs give you the full context that metrics cannot: the exact query that failed, the exact user ID, the exact stack trace.
The shift from unstructured ("printf-style") logs to structured JSON logs is one of the highest-leverage improvements a team can make. Structured logs can be indexed, filtered, and aggregated by machines. An unstructured log string is a debugging artifact; a structured log record is data.
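As a rough illustration of the difference, the sketch below emits one JSON object per log line using only Python's standard logging module. The logger name, the field names, and the convention of passing structured data under a "fields" key are assumptions for this example, not a standard.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes an attribute
        # on the record; merge it in as top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")   # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Unstructured equivalent, for contrast:
#   log.error("upstream timeout calling inventory-service after 2000 ms")
log.error(
    "upstream timeout",
    extra={"fields": {"upstream": "inventory-service", "timeout_ms": 2000}},
)

# The emitted line parses back into queryable fields.
record = json.loads(buffer.getvalue())
```

With the structured form, a log store can filter on upstream or aggregate on timeout_ms directly instead of pattern-matching free text.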
Use INFO level for normal operations and ERROR only for actionable failures. Use dynamic log sampling: log 100% of errors and 1% of successful requests. Never log full request/response bodies at INFO level. Attach the trace_id to every log line so you can correlate with traces.
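The sampling rule and the trace_id convention can be sketched in a few lines. The function names and record shape here are hypothetical, and the random source is injectable only so the behaviour can be checked deterministically.

```python
import random
from typing import Callable


def should_log(
    level: str,
    success_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every ERROR; keep other records with probability success_rate."""
    if level == "ERROR":
        return True
    return rng() < success_rate


def make_record(level: str, message: str, trace_id: str, **fields) -> dict:
    """Build a structured record that always carries the trace_id."""
    return {"level": level, "message": message, "trace_id": trace_id, **fields}


def emit(records: list[dict], rng: Callable[[], float] = random.random) -> list[dict]:
    """Return only the records that survive sampling."""
    return [r for r in records if should_log(r["level"], rng=rng)]


batch = [
    make_record("INFO", "request served", trace_id="t1", status=200),
    make_record("ERROR", "upstream timeout", trace_id="t2", upstream="inventory-service"),
]
```

Independent per-line sampling like this can keep one line of a request and drop another; a common refinement is to make the decision once per trace_id so a sampled request keeps all of its lines.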
Pillar 3: Distributed Traces
A distributed trace is the story of a single request as it flows across every service that handles it. Each unit of work within a service is a span — it has a start time, a duration, a set of tags (key-value metadata), and a reference to its parent span. The collection of all spans for one request, linked by a shared trace_id, forms the trace.
Traces answer questions that neither metrics nor logs can resolve: "Which service in our 40-service mesh is responsible for the 2-second tail latency our customers see on checkout?" Metrics might tell you checkout p99 is slow. Logs might show individual slow requests. Only traces reveal that the bottleneck is a specific inventory-service → postgres span that is consistently slow for requests involving more than 10 SKUs.
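A small sketch of the span model shows how a trace exposes the bottleneck. The Span fields mirror the description above; the span names, timings, and the self-time heuristic (a span's duration minus its direct children, assuming children run sequentially) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span
    name: str
    start_ms: float
    duration_ms: float


def self_time(span: Span, spans: list[Span]) -> float:
    """Time spent in this span that its direct children do not account for.
    Assumes children run one after another, not in parallel."""
    in_children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return max(span.duration_ms - in_children, 0.0)


def bottleneck(spans: list[Span]) -> Span:
    """Span with the largest self time within one trace."""
    return max(spans, key=lambda s: self_time(s, spans))


# One hypothetical slow checkout request.
trace = [
    Span("t1", "a", None, "checkout.PlaceOrder", 0.0, 2000.0),
    Span("t1", "b", "a", "inventory.GetStock", 100.0, 1800.0),
    Span("t1", "c", "b", "postgres.query", 150.0, 1700.0),
]

slowest = bottleneck(trace)  # the postgres.query span
```

The root span is the longest by raw duration, but almost all of that time is inherited from its children; ranking by self time points at the database call instead.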
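Tail-based sampling defers the keep-or-drop decision until a trace is complete, which is how the OpenTelemetry Collector's tail_sampling processor works.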
How the Three Pillars Complement Each Other
The real power comes from using all three together in a workflow. Here is a representative on-call scenario at a production-scale company:
- Alert fires (metrics): A Prometheus alert triggers: checkout_error_rate > 2% for 3 minutes on the US-EAST region.
- Narrow the blast radius (metrics): The on-call engineer opens Grafana. Error rate is elevated only on pod/checkout-v2-*, not checkout-v1-*. A recent deploy is the suspect. Dashboard shows p99 latency also spiking.
- Understand what happened (logs): Filter Loki for service="checkout" level="error" in the affected window. Logs show "error": "upstream timeout", "upstream": "inventory-service", meaning the checkout service is timing out waiting on inventory.
- Find the root cause (traces): Open Tempo, filter for service=checkout status=error. A trace shows the inventory.GetStock span is consuming 1,800 ms of the total 2,000 ms request. Drill into that span: it is making 47 sequential database queries (an N+1 bug introduced in the new deploy).
- Correlate (cross-pillar linking): The trace_id embedded in each log line lets you jump directly from the relevant log record to the exact trace in Tempo. Grafana Explore supports this natively when logs and traces share the same trace_id field.
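The log-to-trace hop in this scenario amounts to a join on trace_id. The sketch below works on in-memory dictionaries rather than Loki or Tempo; the record shapes, the span name postgres.query, and the repetition threshold are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical log records, as a log store might return them.
logs = [
    {"level": "error", "service": "checkout", "error": "upstream timeout",
     "upstream": "inventory-service", "trace_id": "t1"},
    {"level": "info", "service": "checkout", "trace_id": "t2"},
]

# Hypothetical spans: one request whose GetStock span issues 47 queries.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "checkout.PlaceOrder"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "inventory.GetStock"},
] + [
    {"trace_id": "t1", "span_id": f"q{i}", "parent_id": "b", "name": "postgres.query"}
    for i in range(47)
]


def spans_for_error_logs(logs: list[dict], spans: list[dict]) -> dict[str, list[dict]]:
    """Join error log records to their spans via the shared trace_id."""
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    return {
        rec["trace_id"]: by_trace[rec["trace_id"]]
        for rec in logs
        if rec["level"] == "error"
    }


def repeated_children(trace_spans: list[dict], threshold: int = 10) -> dict:
    """Flag child spans with the same name repeated many times under one
    parent, a common signature of an N+1 query pattern."""
    counts = Counter(
        (s["parent_id"], s["name"]) for s in trace_spans if s["parent_id"]
    )
    return {key: n for key, n in counts.items() if n >= threshold}


error_traces = spans_for_error_logs(logs, spans)
suspects = repeated_children(error_traces["t1"])
```

Here suspects contains a single entry showing 47 postgres.query spans under the inventory.GetStock span, which matches the N+1 pattern described in the scenario.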
This workflow — alert on metrics, investigate with logs, locate with traces — is the canonical observability loop. Each pillar does one job well and hands off to the next. Trying to do this with logs alone is expensive and slow at scale. Trying to do it with metrics alone leaves you unable to understand causation.
Cost and Cardinality: The Engineering Trade-off
Each pillar has a different cost profile that determines how much data you can afford to keep at what resolution:
- Metrics: Very low cost per data point. Scraping a single counter from 1,000 pods every 15 seconds produces 5,760 samples per pod per day, or about 5.76 million samples per day in total, and that is still cheap in Prometheus. The main cost driver is label cardinality: with too many unique label combinations, Prometheus can run out of memory. Keep label values bounded and use recording rules to pre-aggregate expensive queries.
- Logs: High storage cost at scale. Ingest compression helps (Loki stores compressed chunks), but the volume is fundamentally proportional to request rate × log lines per request. Sampling, retention tiers (hot/warm/cold), and log levels are the primary levers.
- Traces: Medium cost, controlled by sampling rate. 100% trace capture is only feasible at low request rates. At >1,000 RPS, use tail-based sampling targeting 5–10% overall with 100% error/slow retention. Tempo stores trace data compressed and is significantly cheaper than Jaeger's Elasticsearch backend at scale.
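The tail-based policy described for traces (keep every error and slow trace, sample the rest) can be sketched as a single decision function. This is not the OpenTelemetry Collector's implementation; the span shape, the 1,000 ms slow threshold, and the 5% baseline are assumptions.

```python
import random
from typing import Callable


def keep_trace(
    spans: list[dict],
    slow_ms: float = 1000.0,
    baseline: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to retain a trace once all of its spans have arrived."""
    if not spans:
        return False
    # Always keep traces that contain an error.
    if any(span["status"] == "error" for span in spans):
        return True
    # Always keep slow traces; the longest span stands in for the
    # end-to-end request duration (normally the root span).
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    # Otherwise keep a small random share of healthy traffic.
    return rng() < baseline


# Hypothetical completed traces.
failed = [{"status": "ok", "duration_ms": 80.0}, {"status": "error", "duration_ms": 40.0}]
slow = [{"status": "ok", "duration_ms": 2000.0}, {"status": "ok", "duration_ms": 1800.0}]
healthy = [{"status": "ok", "duration_ms": 120.0}]
```

Because the decision needs the complete trace, spans have to be buffered until the request finishes, which is the main operational cost of tail-based sampling compared with deciding at the first span.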
The next lesson goes deeper into which specific metrics actually matter — the signals proven to predict user-visible failures, how to define your Golden Signals for each service tier, and the PromQL patterns that surface them in under 60 seconds during an incident.