Distributed Tracing & OpenTelemetry

Tracing in Practice: Debugging with Traces

18 min Lesson 8 of 28

Tracing in Practice: Debugging with Traces

Knowing that traces exist is one thing; knowing how to read them under pressure — at 2 AM when p99 latency just spiked to 8 seconds — is another. This lesson is about the practical craft: how to open a trace waterfall, identify the slow span in seconds, and attribute an error to its true origin rather than the first service to surface it.

The Anatomy of a Trace Waterfall

Every distributed tracing UI presents a waterfall diagram — a Gantt-style chart where the x-axis is wall-clock time and each row is one span. The root span sits at the top; its children are indented beneath it, each starting when the call was made and ending when the response returned. Spans may overlap when a service fans out to multiple upstreams concurrently.

Trace Waterfall: slow database span identified 0 ms 200 ms 400 ms 600 ms API Gateway HTTP POST /order [595 ms] Auth Svc JWT validate [47ms] Order Svc createOrder [525 ms] Inventory checkStock [55ms] Postgres INSERT orders + N+1 queries [440 ms] ⚠ Notify Svc Root span Fast / OK Slow / Error — root cause Root cause here
A trace waterfall for a 595 ms order request. The red Postgres span (440 ms) is immediately visible as the bottleneck.

Three things to read at a glance:

  • Width of a span bar — proportional to elapsed time. A bar that stretches nearly the full root span is your prime suspect.
  • Gaps between parent and first child — time spent in the parent itself (serialization, business logic, lock waits).
  • Concurrent fan-out vs. serial chains — parallel children (overlapping bars at the same depth) indicate intentional concurrency. Serial chains (each child starts after the previous ends) multiply latency; removing even one hop can halve total duration.

Finding the Slow Span

Most UIs (Jaeger, Grafana Tempo, Honeycomb) let you sort spans by duration in the span table. But at scale you arrive at a trace from a latency alert, not by browsing. The workflow is:

  1. Correlate the alert to a trace ID. Your alerting rule fires on p99 latency. The slow requests carry a trace_id in their logs (structured logging + OTel log correlation). Pull the trace ID from a slow log line.
  2. Open the trace in the UI. Sort the span table by duration descending — the slowest span rises to the top. The waterfall shows it as the widest bar.
  3. Inspect span attributes. Click the slow span. Look for: db.statement (the exact query), http.url, rpc.method, db.row_count. OTel semantic conventions mean these attributes are standardized across languages.
  4. Look at the span's parent. The parent tells you what triggered this span. If a single HTTP handler calls the slow database span 47 times, you have an N+1 query.
Use exemplars to jump from metrics to traces. Prometheus exemplars attach a trace_id label to histogram samples. When Grafana shows a latency spike on a heatmap, click a spike point — if exemplars are enabled, Grafana renders a direct link to the trace that caused that spike. This cuts triage from minutes to seconds.

Pulling a Trace from Logs

In a production incident you typically start in Grafana (Loki/CloudWatch) with a log-based alert. With OTel log correlation enabled, every log line carries trace_id and span_id fields. Use them to jump into Tempo:

# 1. Find slow requests in Loki (LogQL) {app="order-service"} | json | duration > 500ms # Example matching log line (structured JSON): # {"level":"error","message":"request failed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736", # "span_id":"00f067aa0ba902b7","duration_ms":601,"http.status_code":500} # 2. Copy trace_id, open Tempo in Grafana: # Explore → Data source: Tempo → TraceID → paste 4bf92f3577b34da6a3ce929d0e0e4736 # 3. Or query Tempo HTTP API directly: curl -s "http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736" \ | jq '.batches[].scopeSpans[].spans[] | {name, durationNanos, status}'

Error Attribution: Finding the True Origin

The hardest skill in trace debugging is distinguishing where an error was generated from where it was surfaced. In a microservices chain, a downstream database deadlock propagates as a 500 from the order service, which propagates as a 502 from the API gateway, which is what the user sees. Three services appear to have errors. Only one caused them.

OTel encodes this in span status. The OTel spec says: only set ERROR status on a span if that span itself caused the error — not if it merely propagated a downstream failure. A well-instrumented service that catches a downstream 500 and re-throws it should record the remote call span as ERROR (because that call failed), but its own parent span should also be ERROR only if the business operation itself failed.

The rule for error attribution: Find the deepest span marked ERROR with no ERROR child spans. That is the origin. Every ERROR span above it is a propagated failure — real for alerting purposes, but not the root cause. Fixing the origin silences all the downstream errors automatically.

In Jaeger, errors are shown as red spans. In Grafana Tempo, the span table has a status column. In both UIs, collapse all spans and look for the deepest red node. The span's exception.message and exception.stacktrace attributes hold the original exception.

# OTel Python SDK — correct error attribution pattern from opentelemetry import trace from opentelemetry.trace import StatusCode tracer = trace.get_tracer(__name__) def create_order(order_data): with tracer.start_as_current_span("createOrder") as order_span: try: # This inner span will be marked ERROR if the DB call fails with tracer.start_as_current_span("db.insert_order") as db_span: db.execute("INSERT INTO orders ...", order_data) except Exception as exc: # Mark THIS span (db.insert_order) as the error origin db_span.set_status(StatusCode.ERROR, str(exc)) db_span.record_exception(exc) # Propagate — the parent span (createOrder) will also mark ERROR # because the business operation failed, but the trace clearly shows origin raise

Reading N+1 Patterns and Parallel Fan-Out

Two structural patterns are worth recognizing on sight:

  • N+1 query: In the waterfall, you see dozens of near-identical short spans from a database client, each starting right after the previous one ends, all children of the same parent span. The total duration is the sum of N small queries. Fix: batch the queries or use a join. The trace makes the problem undeniable — you can count the spans.
  • Unnecessary serial chain: Service A calls B, waits for the full response, then calls C. If B and C are independent, this doubles latency unnecessarily. The waterfall shows B and C as sequential bars at the same indent level. Fix: issue both calls concurrently (Promise.all, asyncio.gather, goroutines). After the fix, the waterfall shows B and C overlapping — and total duration drops to max(B, C) rather than B + C.
Do not trust the root span duration alone. A root span of 600 ms with a 440 ms DB child and a 55 ms auth child leaves ~105 ms unaccounted. That gap is spent in the service itself — serialization, business logic, or blocked I/O. If that gap grows over time without any child span growing, you have a CPU or lock-contention problem in the service that no downstream trace can reveal. Add an internal span around the suspected code path.

A Real Triage Workflow

This is the sequence an SRE at a big-tech company actually follows when an alert fires:

  1. Alert fires: p99 latency for POST /order exceeded 500 ms for 5 minutes.
  2. Open Grafana dashboard. Confirm the metric is elevated. Check the exemplar link on the spike.
  3. Exemplar opens the slowest trace. Sort spans by duration. Top span: db.insert_order, 440 ms.
  4. Click span. db.statement attribute shows SELECT * FROM products WHERE id = ? — executed 47 times. Classic N+1.
  5. Check parent span: createOrder → code path is order_service/handlers/create.py:83. Navigate there directly.
  6. Fix, deploy, watch p99 drop. Close the incident.

Total time from alert to root cause: under 4 minutes. Without tracing, this same diagnosis requires log grepping, adding instrumentation, redeploying, and hoping to reproduce — easily 30-60 minutes.