Tracing in Practice: Debugging with Traces
Tracing in Practice: Debugging with Traces
Knowing that traces exist is one thing; knowing how to read them under pressure — at 2 AM when p99 latency just spiked to 8 seconds — is another. This lesson is about the practical craft: how to open a trace waterfall, identify the slow span in seconds, and attribute an error to its true origin rather than the first service to surface it.
The Anatomy of a Trace Waterfall
Every distributed tracing UI presents a waterfall diagram — a Gantt-style chart where the x-axis is wall-clock time and each row is one span. The root span sits at the top; its children are indented beneath it, each starting when the call was made and ending when the response returned. Spans may overlap when a service fans out to multiple upstreams concurrently.
Three things to read at a glance:
- Width of a span bar — proportional to elapsed time. A bar that stretches nearly the full root span is your prime suspect.
- Gaps between parent and first child — time spent in the parent itself (serialization, business logic, lock waits).
- Concurrent fan-out vs. serial chains — parallel children (overlapping bars at the same depth) indicate intentional concurrency. Serial chains (each child starts after the previous ends) multiply latency; removing even one hop can halve total duration.
Finding the Slow Span
Most UIs (Jaeger, Grafana Tempo, Honeycomb) let you sort spans by duration in the span table. But at scale you arrive at a trace from a latency alert, not by browsing. The workflow is:
- Correlate the alert to a trace ID. Your alerting rule fires on p99 latency. The slow requests carry a
trace_idin their logs (structured logging + OTel log correlation). Pull the trace ID from a slow log line. - Open the trace in the UI. Sort the span table by duration descending — the slowest span rises to the top. The waterfall shows it as the widest bar.
- Inspect span attributes. Click the slow span. Look for:
db.statement(the exact query),http.url,rpc.method,db.row_count. OTel semantic conventions mean these attributes are standardized across languages. - Look at the span's parent. The parent tells you what triggered this span. If a single HTTP handler calls the slow database span 47 times, you have an N+1 query.
trace_id label to histogram samples. When Grafana shows a latency spike on a heatmap, click a spike point — if exemplars are enabled, Grafana renders a direct link to the trace that caused that spike. This cuts triage from minutes to seconds.
Pulling a Trace from Logs
In a production incident you typically start in Grafana (Loki/CloudWatch) with a log-based alert. With OTel log correlation enabled, every log line carries trace_id and span_id fields. Use them to jump into Tempo:
Error Attribution: Finding the True Origin
The hardest skill in trace debugging is distinguishing where an error was generated from where it was surfaced. In a microservices chain, a downstream database deadlock propagates as a 500 from the order service, which propagates as a 502 from the API gateway, which is what the user sees. Three services appear to have errors. Only one caused them.
OTel encodes this in span status. The OTel spec says: only set ERROR status on a span if that span itself caused the error — not if it merely propagated a downstream failure. A well-instrumented service that catches a downstream 500 and re-throws it should record the remote call span as ERROR (because that call failed), but its own parent span should also be ERROR only if the business operation itself failed.
ERROR with no ERROR child spans. That is the origin. Every ERROR span above it is a propagated failure — real for alerting purposes, but not the root cause. Fixing the origin silences all the downstream errors automatically.
In Jaeger, errors are shown as red spans. In Grafana Tempo, the span table has a status column. In both UIs, collapse all spans and look for the deepest red node. The span's exception.message and exception.stacktrace attributes hold the original exception.
Reading N+1 Patterns and Parallel Fan-Out
Two structural patterns are worth recognizing on sight:
- N+1 query: In the waterfall, you see dozens of near-identical short spans from a database client, each starting right after the previous one ends, all children of the same parent span. The total duration is the sum of N small queries. Fix: batch the queries or use a join. The trace makes the problem undeniable — you can count the spans.
- Unnecessary serial chain: Service A calls B, waits for the full response, then calls C. If B and C are independent, this doubles latency unnecessarily. The waterfall shows B and C as sequential bars at the same indent level. Fix: issue both calls concurrently (Promise.all, asyncio.gather, goroutines). After the fix, the waterfall shows B and C overlapping — and total duration drops to max(B, C) rather than B + C.
A Real Triage Workflow
This is the sequence an SRE at a big-tech company actually follows when an alert fires:
- Alert fires: p99 latency for
POST /orderexceeded 500 ms for 5 minutes. - Open Grafana dashboard. Confirm the metric is elevated. Check the exemplar link on the spike.
- Exemplar opens the slowest trace. Sort spans by duration. Top span:
db.insert_order, 440 ms. - Click span.
db.statementattribute showsSELECT * FROM products WHERE id = ?— executed 47 times. Classic N+1. - Check parent span:
createOrder→ code path isorder_service/handlers/create.py:83. Navigate there directly. - Fix, deploy, watch p99 drop. Close the incident.
Total time from alert to root cause: under 4 minutes. Without tracing, this same diagnosis requires log grepping, adding instrumentation, redeploying, and hoping to reproduce — easily 30-60 minutes.