Distributed Tracing & OpenTelemetry

Why Distributed Tracing?

You have Prometheus metrics. You have structured logs in Loki or Elasticsearch. Your Grafana dashboards are green, your SLOs are within budget, and yet — your senior engineers are spending hours in post-mortems tracing a 400ms latency spike that affected 0.3% of requests last Tuesday afternoon. The metrics showed a small blip. The logs showed some slow database queries. But nobody could answer the simple question: which service actually caused the slowdown, and why did only those requests hit it?

This is the problem distributed tracing was built to solve. It is not a replacement for metrics or logs — it is the third pillar that lets you ask a fundamentally different class of questions: what happened to this specific request as it traveled across your system?

The Latency Attribution Problem

In a microservices architecture, a single user-facing request fans out into dozens of downstream calls. A checkout request might hit an API gateway, an auth service, a cart service, a pricing service, a payment service, a fraud detection service, and an order service, each potentially calling its own databases, caches, or third-party APIs. The end-to-end latency the user experiences is set by the critical path through these hops: sequential calls add up, parallel calls cost as much as the slowest branch, and the network time between services is added on top.

Metrics give you aggregates: "the checkout service p99 latency is 380ms." But which 380ms? Is it 200ms in the payment service, 100ms in the fraud check, and 80ms in everything else? Or is it 350ms in the cart service on a cold cache miss? These are completely different problems with completely different solutions, and an aggregate metric cannot tell you which one you are dealing with.

Logs tell you what happened inside each service, but stitching together the story of one request across ten services from ten separate log streams — each with its own timestamp skew, its own log format, its own sampling rate — is genuinely painful at scale. It requires a human to manually correlate entries by request ID, often across different UI screens or grep pipelines.

Key idea: Distributed tracing attaches a single trace ID to every request when it enters your system. Every service that handles that request records its own work as a span — a timed, labeled unit with the same trace ID. A backend assembles all spans with the same ID into a flame graph showing the exact timeline and duration of every operation, across every service, for that one request. You go from "checkout is slow" to "fraud service call to a third-party API is adding 340ms, only for requests with cart value above $500" in minutes.

Traces vs Metrics vs Logs

Understanding when to reach for each signal type is a core SRE skill. They are not interchangeable — each has a distinct strength and a distinct cost profile.

Metrics are pre-aggregated numeric measurements sampled or counted over time windows. They are cheap to store (a counter is a few bytes), fast to query (time-series databases are optimized for range queries), and excellent for alerting on known conditions. Their weakness is low cardinality: a Prometheus counter has labels, but you cannot add a user_id label to a counter that fires millions of times per second without exploding your cardinality and destroying your scrape performance. Metrics answer: is something wrong, and how widespread is it?
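To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. The label names and value counts are invented for illustration, not measurements; the only point is that the number of time series for one metric is bounded by the product of the distinct values each label can take.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single counter such as http_requests_total.
bounded = {"service": 20, "status": 5, "method": 4}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))       # 400
print(max_series(with_user_id))  # 400000000
```

Adding one unbounded label multiplies the series count by the number of distinct values of that label, which is why per-user or per-request detail belongs in logs and traces rather than in metric labels.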

Logs are immutable event records emitted at arbitrary points in code. They carry unlimited context — any key/value you want to add. Modern structured logging (JSON lines, logfmt) has made logs queryable, and tools like Loki or Datadog Logs let you filter and aggregate across high-cardinality fields. Their weakness is cost and correlation: at high request rates, storing and indexing every log line is expensive, and correlating logs from multiple services for a single request requires an explicit shared identifier.

Traces are causally linked records of a request's journey. A trace is a tree of spans, where each span represents a unit of work in one service. Spans carry timing data, status, metadata, and links to their parent spans. Traces excel at latency attribution (which service took how long), dependency mapping (which services call which), and request-level debugging. Their weakness is data volume: storing every span for every request at high throughput is expensive, which is why sampling strategies (covered in lesson 7) are essential.

Figure: Traces vs Metrics vs Logs, signal types compared

  • METRICS
      Example: http_requests_total{status="200"}; checkout_latency_p99 = 380ms; error_rate = 0.04%
      Strengths: cheap, fast, great for alerting on known conditions
      Weakness: low cardinality, cannot answer "which user?"
      Answers: Is something wrong? How widespread?
  • LOGS
      Example: {"level":"error","svc":"cart","trace_id":"abc123","msg":"db timeout 320ms"}
      Strengths: unlimited context, queryable rich detail per event
      Weakness: expensive at scale; manual cross-service correlation
      Answers: What happened inside one service?
  • TRACES
      Example: api-gateway 0–12ms; cart-service 12–48ms; db-query 22–46ms; fraud-check 48–388ms (!); order-svc 388–400ms
      Strengths: end-to-end latency map, causal request graph
      Weakness: storage cost at volume; requires a sampling strategy
      Answers: Where did latency come from, for this request?

The three observability pillars: metrics detect breadth, logs explain depth, traces reveal causal latency across services.

Anatomy of a Trace: The Waterfall Diagram

A trace is usually visualized as a waterfall chart, a Gantt-style timeline that some tools also present as a flame graph. The horizontal axis is time. Each row is one span: one unit of work in one service. Spans are nested to show causality: if span B was initiated by span A, span B is indented under A, and for synchronous calls its time range falls within A's time range.

Every span carries a standard set of fields:

  • trace_id — the 128-bit identifier shared by every span in the same trace.
  • span_id — a 64-bit identifier unique to this span.
  • parent_span_id — the span_id of the span that initiated this one (empty for the root span).
  • name — a human-readable operation name, e.g. cart.GetItems or db.query.
  • start_time and end_time — nanosecond-precision timestamps.
  • status — OK, ERROR, or UNSET.
  • attributes — key/value pairs with arbitrary context: http.method, db.statement, user.id, cart.item_count.
  • events — timestamped annotations within the span lifetime (e.g. "cache miss", "retry attempt 2").

Figure: Trace waterfall, checkout request across five services (time axis 0ms to 400ms)

  api-gateway POST /checkout 400ms (root span)
    auth-service 27ms
    cart-service 60ms
      redis 12ms
    fraud-check, third-party API call 275ms ← bottleneck (slow / error span)
    order-service 30ms

A trace waterfall for a checkout request: each row is one span, width shows duration, nesting shows causality. The fraud-check span is immediately visible as the bottleneck.
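The span fields and the nesting rule above can be sketched in a few lines. This is a minimal illustration, not how any particular backend stores spans: the `Span` class, the shortened span IDs, and the start/end offsets are assumptions made up for the example (only the durations follow the checkout waterfall), and only a subset of the fields is modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str                   # shortened here; real span IDs are 16 hex chars
    parent_span_id: Optional[str]  # None for the root span
    name: str
    start_ms: int                  # offsets from request start (illustrative)
    end_ms: int
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render_waterfall(spans: list[Span]) -> list[str]:
    """Group spans by parent and return indented rows, root first."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)

    rows: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in sorted(children.get(parent_id, []), key=lambda x: x.start_ms):
            rows.append(f"{'  ' * depth}{s.name} {s.start_ms}-{s.end_ms}ms ({s.duration_ms}ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return rows


TID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TID, "a1", None, "api-gateway POST /checkout", 0, 400),
    Span(TID, "b1", "a1", "auth-service", 5, 32),
    Span(TID, "c1", "a1", "cart-service", 35, 95),
    Span(TID, "c2", "c1", "redis", 40, 52),
    Span(TID, "d1", "a1", "fraud-check", 95, 370, status="ERROR"),
    Span(TID, "e1", "a1", "order-service", 370, 400),
]

print("\n".join(render_waterfall(spans)))

# The longest non-root span is the first place to look for a bottleneck.
slowest = max((s for s in spans if s.parent_span_id), key=lambda s: s.duration_ms)
print(slowest.name, slowest.duration_ms)  # fraud-check 275
```

The only information needed to rebuild the tree is `trace_id` plus the `span_id`/`parent_span_id` pair on each span, which is why services can report spans independently and in any order.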

Why Metrics and Logs Cannot Do This Alone

This is the kind of scenario that pushed many large tech companies to invest heavily in distributed tracing. You have a p99 latency regression: checkout went from 120ms to 400ms overnight. Your metrics show the regression clearly. You know it is real. Now what?

With metrics alone: you check the downstream service dashboards one by one. Auth looks fine. Cart looks fine. Fraud does not: its p99 is 280ms, up from 95ms. But you only notice this after 25 minutes of dashboard hunting, and you still do not know whether fraud is the culprit or is itself a victim of a database slowdown.

With logs alone: you grep for requests with high latency, find a trace ID in the logs, open four different log search interfaces to find every log line with that trace ID, and manually calculate the time gaps between them. This is possible, but it takes 45 minutes and requires that every service actually logged the trace ID (which they often do not).

With traces: you open the tracing UI, filter by service=checkout AND duration>200ms, click one trace, and see the waterfall immediately. The fraud-check span is red and 275ms wide. You click it, read its attributes: fraud.provider=acme-fraud-api, http.url=https://api.acmefraud.com/v2/check. You check the fraud provider status page — they had a degradation starting at 23:47 last night. Total investigation time: 4 minutes.

Pro practice at scale: Google built Dapper, Uber built Jaeger, and Twitter built Zipkin to support this kind of trace-first latency investigation. The principle: never debug distributed latency by looking at per-service metrics in isolation. Start with a trace that represents the degraded request, then use the span attributes to drill into the specific service and operation that is misbehaving. For latency regressions, this can cut time-to-diagnosis from hours to minutes.

Context Propagation: The Glue That Holds It Together

For a trace to work across service boundaries, each service must pass the trace context to the next one. When service A calls service B over HTTP, it injects the trace ID and span ID into HTTP headers. When service A calls service B over gRPC, it injects them into gRPC metadata. When service A publishes a message to Kafka, it injects them into the message headers. Service B extracts the context on receipt, creates a new child span, and continues the trace.

The standard header format (defined by the W3C Trace Context specification, adopted by OpenTelemetry) is:

# W3C traceparent header — the standard for context propagation
# Format: traceparent: {version}-{trace-id}-{parent-span-id}-{flags}

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ version (always 00)
#               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 128-bit trace ID (hex)
#                                                ^^^^^^^^^^^^^^^^ 64-bit parent span ID (hex)
#                                                                 ^^ flags (01 = sampled)

# When service A (api-gateway) calls service B (cart-service):

# api-gateway creates root span:
#   trace_id = 4bf92f3577b34da6a3ce929d0e0e4736
#   span_id  = 00f067aa0ba902b7

# api-gateway injects into outgoing HTTP request:
#   GET /cart/items HTTP/1.1
#   Host: cart-service.internal
#   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#   tracestate: vendor-specific-data (optional)

# cart-service extracts the header, creates a child span:
#   trace_id  = 4bf92f3577b34da6a3ce929d0e0e4736 (same!)
#   parent_id = 00f067aa0ba902b7 (api-gateway span)
#   span_id   = a3ce929d0e0e4736 (new, unique to cart)

# The backend assembles all spans with the same trace_id into one trace.

Context propagation is what transforms a collection of per-service spans into a coherent distributed trace. Without it, you have disconnected local measurements. With it, you have a complete causal graph of a request's journey across your entire system.
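The extract, create-child, inject cycle can be sketched with the standard library alone. This is a simplified illustration of the mechanics, not a substitute for an OpenTelemetry propagator: the `SpanContext` type and the function names are hypothetical, header keys are assumed to be lower-cased already, `tracestate` is ignored, only the version-00 layout is accepted, and a request arriving without a header is assumed to start a new sampled trace.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def extract(headers: dict[str, str]) -> Optional[SpanContext]:
    """Parse an incoming traceparent header; None if absent or malformed."""
    m = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_child(parent: Optional[SpanContext]) -> SpanContext:
    """Keep the parent's trace ID (or start a new trace) and mint a new span ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True  # assumption: sample new traces
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def inject(ctx: SpanContext, headers: dict[str, str]) -> None:
    """Write the current span's context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


# cart-service receives a request from api-gateway and calls the next service.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
child = start_child(parent)
outgoing: dict[str, str] = {}
inject(child, outgoing)
print(outgoing["traceparent"])  # same trace ID, new span ID
```

The trace ID passes through unchanged while the span ID in the header is replaced at every hop, so each downstream span names its immediate caller as its parent.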

Production pitfall — silent context loss: Context propagation breaks silently in several common patterns: (1) any service that creates a new background goroutine/thread without propagating the context loses the trace from that point forward; (2) message queue consumers that do not extract trace headers from message metadata start a new disconnected trace; (3) any service that calls a third party and does not inject headers creates a gap in the trace. The result is a truncated trace that makes it look like your service is the bottleneck when it is actually a downstream call. Instrument and test your propagation paths explicitly — broken propagation is one of the top causes of misleading traces in production.
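Pitfall (1) can be illustrated with Python's `contextvars` module, which is a common place for tracing libraries to keep the active trace. This is a sketch under assumptions: `current_trace_id` is a hypothetical stand-in for a real library's context slot, and whether a plain thread inherits the caller's context depends on the Python version and build, which is exactly why passing the context explicitly is the safer habit.

```python
import contextvars
import threading

# Hypothetical stand-in for a tracing library's "current trace" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default="<no trace>")


def background_job(results: list) -> None:
    # Records whichever trace ID is visible in the context this job runs in.
    results.append(current_trace_id.get())


def run_without_propagation() -> str:
    """Start a plain thread. On many Python builds the thread begins with an
    empty context, so the job sees the default and the trace is lost."""
    results: list = []
    t = threading.Thread(target=background_job, args=(results,))
    t.start()
    t.join()
    return results[0]


def run_with_propagation() -> str:
    """Snapshot the caller's context and run the job inside that snapshot."""
    results: list = []
    ctx = contextvars.copy_context()
    t = threading.Thread(target=ctx.run, args=(background_job, results))
    t.start()
    t.join()
    return results[0]


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
print("plain thread:    ", run_without_propagation())
print("explicit context:", run_with_propagation())
```

A propagation test in CI can follow the same shape: set a known trace ID, run the code path that hands work to a thread, pool, or queue, and assert that the same ID is visible on the other side.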

The Production Case: When Tracing Pays for Itself

Distributed tracing has real costs: instrumentation work, collector infrastructure, storage for spans. At high throughput (10,000+ requests per second), storing every span naively is prohibitive. But the ROI calculation is straightforward for any organization running microservices at scale.

Consider a p99 latency regression that degrades checkout completion rate by 2% at a company processing $10M/day. That is $200K of lost revenue per day, or roughly $8.3K for every hour the regression persists (assuming revenue is spread evenly across the day). If traces cut the time-to-diagnosis from 3 hours (manual log correlation) to 10 minutes (trace waterfall), that is 2 hours and 50 minutes saved per incident, or approximately $24K per incident. The cost of running a Jaeger or Tempo cluster with 10% head-based sampling is a few thousand dollars per month, so a single incident like this can cover several months of tracing infrastructure.

This is why distributed tracing went from an internal system at Google (Dapper, described publicly in 2010) to a table-stakes requirement at any company operating microservices at scale. It is also why OpenTelemetry, the vendor-neutral standard for emitting traces, metrics, and logs, was created: to prevent the vendor lock-in of proprietary tracing SDKs. The remaining lessons in this tutorial build out the full OpenTelemetry stack, from instrumentation to backends to sampling to production debugging workflows.

Where we are going: This lesson established the why. Lesson 2 defines spans, traces, and context propagation in precise technical terms. Lesson 3 introduces OpenTelemetry as the standard SDK and wire format. Lessons 4–9 cover instrumentation, the OTel Collector, Jaeger/Tempo backends, sampling strategies, and production debugging. Lesson 10 puts it all together with a hands-on project tracing a real microservices request end-to-end.