Mesh Observability
One of the most underappreciated gifts of a service mesh is what it gives you for free the moment you inject sidecars: a complete, consistent observability plane that requires zero application changes. Every pod that gains an Envoy proxy automatically emits the four Google SRE golden signals — latency, traffic, errors, and saturation — for every service-to-service edge in your topology. No SDK instrumentation. No per-team OTel configuration. No code reviews for forgotten metric registrations. The mesh sees every TCP byte and HTTP exchange and knows who sent it, who received it, whether it succeeded, and how long it took.
In production, this matters enormously. At Lyft, the company that originally created Envoy, the mesh telemetry layer was one of the primary justifications for the multi-year investment. Before the mesh, each team instrumented metrics differently. After, every service exposed the same proxy-generated metrics with a consistent label schema; in Istio these are istio_requests_total, istio_request_duration_milliseconds, and istio_tcp_sent_bytes_total. P99 cross-service latency became a first-class observable with no per-engineer work.
The Golden Signals from the Sidecar
Envoy (Istio) and the linkerd2-proxy (Linkerd) both emit the four golden signals at the L4 and L7 layers automatically. The key Prometheus metrics you will use every day in production:
- Traffic (request rate): istio_requests_total is a counter with labels source_workload, destination_workload, response_code, and request_protocol. Derive RPS with rate(istio_requests_total[1m]).
- Errors: Filter istio_requests_total by response_code=~"[45].." for client/server error rates. The mesh also reports transport-layer failures (connection refused, upstream timeout, circuit-breaker open), which the response_flags label distinguishes from errors generated by the application itself.
- Latency: istio_request_duration_milliseconds_bucket is a histogram exposing the full distribution. Use histogram_quantile(0.99, sum by (le, source_workload, destination_workload) (rate(istio_request_duration_milliseconds_bucket[5m]))) for p99 per workload pair. This is the latency of the proxied request as seen by the sidecar, which includes application processing time.
- Saturation: Derived from envoy_cluster_upstream_cx_active (active connections) versus envoy_cluster_upstream_cx_overflow (connection-pool overflow). Rising overflow is an early warning sign of overload, often visible before latency degrades.
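To make the p99 query in the Latency bullet less of a black box, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach histogram_quantile takes. The function name and the bucket values are illustrative assumptions, not real mesh data, and the sketch ignores edge cases that Prometheus handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must include a final (math.inf, total_count) entry, mirroring
    the le="+Inf" bucket of a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket: the best
                # available answer is the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket counts (milliseconds) for one workload pair.
example_buckets = [(50, 900), (100, 980), (250, 995), (math.inf, 1000)]
p99_ms = estimate_quantile(0.99, example_buckets)
```

With these made-up counts the 990th of 1000 requests falls two thirds of the way through the 100 to 250 ms bucket, so the estimate is about 200 ms. The practical consequence is that p99 accuracy depends on where the bucket boundaries sit.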
Every Istio request metric carries the same label set: source_workload, destination_workload, source_namespace, destination_namespace, destination_service, request_protocol, response_code, response_flags. The response_flags label is Envoy-specific and invaluable: it encodes the reason for a non-2xx response, such as UH (no healthy upstream), UT (upstream request timeout), UC (upstream connection termination), URX (upstream retry limit exceeded), or DC (downstream connection termination). Filtering on response_flags in a PromQL alert can distinguish "your service is returning 500s" from "Envoy is timing out waiting for your service."
The standard Prometheus scrape configuration for Istio uses pod annotations. The istio-agent in each sidecar exposes merged metrics on port 15020 (path /stats/prometheus). The Istio Prometheus integration (or a PodMonitor if you use the Prometheus Operator) scrapes this port.
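As an illustration of the response_flags distinction described above, the following sketch classifies one metric sample as an application error or a proxy-level failure. The function name and return strings are hypothetical, and the handling of "-" (no flag) and comma-separated flags is an assumption about how the label is populated. The flag descriptions follow Envoy's access-log documentation.

```python
# Meanings of the Envoy response flags mentioned in the text.
RESPONSE_FLAGS = {
    "UH": "no healthy upstream",
    "UT": "upstream request timeout",
    "UC": "upstream connection termination",
    "URX": "upstream retry limit exceeded",
    "DC": "downstream connection termination",
}


def classify_sample(response_code, response_flags):
    """Label a sample by where the failure originated (illustrative only)."""
    # Assumption: "-" means no flag, and multiple flags are comma-separated.
    flags = [f for f in response_flags.split(",") if f and f != "-"]
    if flags:
        reasons = ", ".join(
            RESPONSE_FLAGS.get(f, f"unrecognized flag {f}") for f in flags
        )
        return f"proxy: {reasons}"
    if response_code.startswith("5"):
        return "application: upstream returned 5xx"
    if response_code.startswith("4"):
        return "client error"
    return "ok"
```

For example, `classify_sample("504", "UT")` points at the proxy timing out, while `classify_sample("500", "-")` points at the service itself. The same split can be expressed in an alert by matching on the response_flags label.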
For Linkerd, the equivalent is the built-in Prometheus scrape annotations that linkerd inject writes onto each pod: prometheus.io/scrape: "true" and prometheus.io/port: "4191". The linkerd-viz extension ships a pre-built Grafana dashboard that renders the golden signals out of the box.
The Service Graph and Kiali
Golden-signal metrics give you numbers; Kiali gives you a live service graph that maps those numbers onto your topology. Kiali is the official Istio observability UI. It queries Prometheus, Jaeger/Zipkin, and the Kubernetes API simultaneously to render a real-time dependency graph where every edge is annotated with RPS, error rate, p99 latency, and the mTLS status of the connection.
Distributed Tracing Integration
The mesh handles the hardest part of distributed tracing: it automatically generates trace spans for every proxied request and reads and writes the W3C traceparent (or Zipkin B3) headers at each hop. You do not need to instrument your application code to get inter-service spans, because the Envoy sidecars create them. What you do need from the application is a single behavior: propagate the incoming trace headers on every outgoing call. Envoy reads the headers on ingress, creates a server span, then injects updated headers into the outgoing request. If your service code swallows the headers (does not forward them), Envoy on the next hop starts a new, disconnected trace.
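The forwarding behavior the application owes the mesh can be sketched in a few lines. This is a framework-agnostic illustration, assuming request headers are available as a plain dict; the function name is hypothetical, and in a real service this logic would usually live in HTTP client middleware.

```python
# B3 / Envoy tracing headers plus the W3C alternative.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
    "traceparent",
)


def propagation_headers(incoming):
    """Return only the trace headers from an incoming request's headers.

    Header names are matched case-insensitively and returned lowercased.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Usage: merge the result into the headers of every outgoing call.
incoming_headers = {
    "X-Request-Id": "abc",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "Authorization": "secret",
}
outgoing_headers = {
    "content-type": "application/json",
    **propagation_headers(incoming_headers),
}
```

Only the trace context is copied; unrelated headers such as Authorization are deliberately left out, so the helper is safe to apply to every outbound call.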
The headers to propagate are x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags, and x-ot-span-context (or just traceparent if you switch to W3C mode). The fix is simple, but it requires touching every service. Do it systematically during the mesh rollout; leaving it until after means a retrofit project across dozens of teams.
Configuring Istio to send traces to a Jaeger (or OTel Collector) backend is done via a Telemetry API resource. The MeshConfig in the istio ConfigMap sets the global sampling rate and tracing provider.
In production at 10,000 RPS, a 1% head-based sampling rate produces 100 traces per second, which is more than enough for latency analysis and incident debugging. The critical upgrade for production is tail-based sampling: keep 100% of error traces and slow traces (duration above a latency threshold) regardless of the global rate, and sample fast, successful traces at 1%. A tail sampler such as the OTel Collector's tail_sampling processor implements this, ensuring you never miss a failure trace while keeping storage costs bounded.
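The sampling arithmetic and the tail-sampling rule can be sketched as follows. This is an illustration of the decision logic only, not how any particular collector implements it; the function names, the 500 ms threshold, and the hash-based baseline selection are all assumptions made for the example.

```python
import hashlib


def head_sampled_traces_per_second(rps, sample_rate):
    """Expected trace volume under head-based sampling."""
    return rps * sample_rate


def keep_trace(trace_id, has_error, duration_ms,
               slow_threshold_ms=500.0, baseline_rate=0.01):
    """Decide after a trace completes whether to store it.

    Errors and slow traces are always kept; everything else is kept at
    `baseline_rate`, chosen deterministically from the trace ID so that
    every collector replica makes the same decision.
    """
    if has_error or duration_ms > slow_threshold_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < baseline_rate


traces_per_second = head_sampled_traces_per_second(10_000, 0.01)
```

The key difference from head-based sampling is that the decision happens after the outcome is known, which is why a tail sampler has to buffer complete traces before deciding.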
Grafana Dashboards: the Canonical Stack
The Istio project ships pre-built Grafana dashboards that you import from the grafana.com catalog (or directly from the istio/istio GitHub repo under samples/addons/grafana). The four dashboards you will use in production:
- Istio Mesh Dashboard (ID 7639) — global view: total RPS, global error rate, p50/p90/p99 latency across the entire mesh. The first screen you open during an incident.
- Istio Service Dashboard (ID 7636) — per-service drilldown: inbound RPS by source, outbound RPS by destination, error rate broken down by response code and Envoy response flag, latency histograms. Sufficient for most incident root-cause investigations.
- Istio Workload Dashboard (ID 7630) — pod-level view: useful when multiple deployments serve the same service (canary analysis, multi-version traffic splits).
- Istio Performance Dashboard (ID 11829) — control-plane health: istiod CPU/memory, xDS push rate, config distribution latency. Essential for diagnosing mesh-layer problems as opposed to application problems.
The reporter="destination" filter is important: Envoy emits duplicate metrics from both the source sidecar (reporter=source) and the destination sidecar (reporter=destination). Using destination avoids double-counting and gives you the latency as seen by the receiving end, which is the correct view for SLO compliance. Use reporter=source only when you specifically need the client-perceived latency including network transit time.
For Linkerd, the equivalent proxy metrics are request_total, response_latency_ms_bucket, and tcp_open_connections. The Linkerd Viz extension ships a set of Grafana dashboards and a CLI: linkerd viz stat deploy gives you a live terminal table of success rate, RPS, and p99 latency per deployment. linkerd viz edges shows the mTLS status of every pod-to-pod edge. linkerd viz tap deploy/payments --to deploy/database streams live request details (method, path, response code, latency) without a full trace backend. This tap feature is the Linkerd equivalent of Envoy's access logs and is invaluable during development.
Connecting Metrics, Traces, and Logs: Exemplars
The final piece of production mesh observability is exemplar linkage — the ability to click on a spike in a Grafana latency panel and jump directly to a representative trace. Prometheus 2.43+ supports native exemplars: a histogram bucket can carry a trace_id label alongside the observation. Envoy 1.24+ emits exemplars when tracing is enabled.
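To show what "a bucket carrying a trace_id" looks like on the wire, here is a small sketch that parses one exemplar-bearing sample in the OpenMetrics text format, where the exemplar follows a " # " separator. The sample line and the function name are made up for illustration, and the parser assumes the sample itself has no timestamp and that label values contain no " # " sequence.

```python
import re


def parse_exemplar_line(line):
    """Split an OpenMetrics sample line into its sample and exemplar parts.

    Returns None when the line carries no exemplar.
    """
    sample, separator, exemplar = line.partition(" # ")
    if not separator:
        return None
    series, value = sample.rsplit(None, 1)
    # Exemplar shape: {label="value",...} observed_value [timestamp]
    match = re.fullmatch(r"\{(.*)\}\s+(\S+)(?:\s+\S+)?", exemplar.strip())
    if match is None:
        return None
    labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(1)))
    return {
        "series": series,
        "value": float(value),
        "exemplar_labels": labels,
        "exemplar_value": float(match.group(2)),
    }


# Hypothetical scrape line: the 250 ms bucket with one attached trace.
example_line = (
    'istio_request_duration_milliseconds_bucket{le="250"} 995 '
    '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 212.0'
)
parsed = parse_exemplar_line(example_line)
```

The exemplar pairs one concrete observation (212 ms here) with the trace that produced it, which is the link a dashboard needs to jump from a latency spike to a trace.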
With Grafana 9+, a latency panel backed by a Prometheus histogram with exemplars shows scatter dots overlaid on the quantile line. Each dot is a real request with a real trace ID. Clicking a dot opens Jaeger or Tempo filtered to that trace. This closes the loop: you see the p99 spike in your SLO dashboard, click the worst exemplar, land on the trace waterfall that shows exactly which service and which downstream call drove the spike. No grep, no log correlation, no context switching between four tools.
At organizations running hundreds of services — Airbnb, Pinterest, Shopify — the exemplar linkage pattern is considered mandatory for any latency SLO dashboard. The mesh provides the trace IDs for free; the only requirement is Prometheus scraping with exemplar support (--enable-feature=exemplar-storage flag on Prometheus) and a Grafana data source configured to use them.