Observability Foundations

Dashboards That Work


A dashboard is a conversation between your system and the on-call engineer. When that conversation is filled with noise (counters that always go up, gauges that look fine even during outages, graphs nobody has looked at in six months), it stops being useful and starts creating false confidence. The engineers at Google, Netflix, and Stripe don't measure dashboard quality by the number of panels. They measure it by how quickly a dashboard lets you move from "something is wrong" to "I know what and where". That is the only metric that matters.

This lesson is about the design decisions that separate dashboards that accelerate incident response from those that make engineers click past them to the raw query interface.

The Overview-to-Detail Hierarchy

Every production system should have at least two tiers of dashboards: an overview (service health at a glance) and detail dashboards (per-component deep-dives). A well-designed hierarchy means you should never need to look at a detail dashboard when everything is healthy, and you should always know which detail dashboard to open when something is wrong.

Dashboard hierarchy: Overview to Detail (drill down from Tier 0 to Tier 1 to Tier 2)

  • Tier 0, Service Overview: SLO burn rates · Error % · p99 latency · Saturation
  • Tier 1, API Service: RPS · Latency · DB pool · Cache hit
  • Tier 1, Worker Service: Queue depth · Job duration · Error rate
  • Tier 1, Data Layer: Query latency · Connections · Replication lag
  • Tier 2, Pod / Node: CPU throttling · OOM events · Restarts
  • Tier 2, Queue Detail: Per-queue depth · DLQ size · Throughput
  • Tier 2, DB Internals: Slow queries · Lock waits · Index usage

Navigation rule: Alert fires → open Tier 0 to confirm blast radius → open Tier 1 of affected service → Tier 2 for root cause. No panel in Tier 0 should require a legend to interpret; if you need one, it belongs in Tier 1 or Tier 2.

Figure: Three-tier dashboard hierarchy: Tier 0 shows SLO health for the whole system; Tier 1 shows per-service detail; Tier 2 digs into infrastructure and internal metrics for root-cause analysis.

The Tier 0 dashboard must answer one question in under five seconds: is the user experience acceptable right now? If an on-call engineer has to think, the dashboard is too complex. Every panel on Tier 0 should have a visual threshold — a red/yellow/green zone — so the cognitive load of deciding "good or bad?" is zero.

Avoiding Vanity Graphs

A vanity graph is any panel that makes the system look active without helping you make a decision. The most common offenders in production dashboards:

  • Total requests ever: a counter that only goes up. It proves the service is running. It never changes shape during an incident.
  • CPU utilization with no context: 60% CPU is either fine (headroom exists) or terrifying (you are at the limit of what the scheduler will do before throttling). Without showing the request rate alongside, it is meaningless.
  • p50 latency only: the median hides tail latency. Your slowest 1% of users — the ones most likely to churn — are invisible. Always show p95, p99, or p99.9 alongside p50.
  • "Deployed 14 hours ago" type annotations: deployment markers that fall outside the window you are investigating are clutter. Only annotate deployments that fall inside the current SLO error-budget burn window.
  • Graphs nobody owns: if no team member can explain why a panel exists and what action it triggers, delete it. Dashboards should be owned and reviewed quarterly.

What Every Service Dashboard Needs

Brendan Gregg's USE method (Utilization, Saturation, Errors) and Tom Wilkie's RED method (Rate, Errors, Duration) give you the minimum viable set of panels for any service. In practice, combine them:

  • Rate: requests/sec, jobs/sec — the throughput. Use a rate, not a counter.
  • Errors: error percentage (not count). 100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) sits at a flat 0% during normal operation, so a spike is immediately obvious.
  • Duration: p50, p95, p99 latency on a single panel with different colors.
  • Saturation: how close to the limit are you? CPU throttle rate, memory working set vs. limit, connection pool utilization, queue depth vs. capacity.
  • SLO burn rate: is the current error rate consuming the error budget faster than the SLO allows? A burn rate above 1 means the budget runs out before the SLO window ends. This is the most actionable single number on any service dashboard.
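
The error-percentage and burn-rate panels reduce to simple arithmetic. The following is a minimal sketch in Python, assuming the SLO is expressed as a success-ratio target; the function names and the 99.9% target are illustrative and do not come from any library.

```python
def error_ratio(error_rate: float, total_rate: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    if total_rate <= 0:
        return 0.0
    return error_rate / total_rate


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    slo_target is the success target, e.g. 0.999 for a 99.9% SLO.
    A value of 1.0 spends the whole budget exactly over the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return observed_error_ratio / budget


# Example (assumed numbers): 2 errors/sec out of 400 requests/sec, 99.9% SLO.
ratio = error_ratio(2.0, 400.0)   # 0.005, i.e. 0.5% errors
rate = burn_rate(ratio, 0.999)    # roughly 5: the budget burns about 5x too fast
```

With a 30-day SLO window, a sustained burn rate of about 5 would exhaust the budget in roughly six days, which is why this number deserves a place on Tier 0.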

Building the Grafana Dashboard in Code

Dashboards defined in the UI and saved by clicking "Save" are a trap: they exist only in Grafana's database, they cannot be code-reviewed, and they are lost if that database is lost or restored from an older backup. Every production dashboard must be managed as code: stored in Git, reviewed, and applied via CI.

The two main approaches are to commit Grafana's JSON model directly or to generate it with the Grafonnet Jsonnet library. For teams already using Terraform, the grafana Terraform provider manages dashboards as HCL resources. Here is the Terraform approach for a minimal service overview panel group:

# dashboards.tf
resource "grafana_dashboard" "api_service_overview" {
  folder = grafana_folder.observability.id

  config_json = jsonencode({
    title   = "API Service — Overview"
    uid     = "api-svc-overview"
    refresh = "30s"
    time    = { from = "now-1h", to = "now" }

    templating = {
      list = [
        {
          name = "env"
          type = "custom"
          options = [
            { value = "prod", text = "Production" },
            { value = "staging", text = "Staging" }
          ]
          current = { value = "prod" }
        }
      ]
    }

    panels = [
      {
        title   = "Request Rate (RPS)"
        type    = "timeseries"
        gridPos = { x = 0, y = 0, w = 8, h = 8 }
        targets = [{
          expr         = "sum(rate(http_requests_total{env=\"$env\"}[5m])) by (route)"
          legendFormat = "{{route}}"
        }]
        fieldConfig = {
          defaults = {
            unit  = "reqps"
            color = { mode = "palette-classic" }
          }
        }
      },
      {
        title   = "Error Rate (%)"
        type    = "timeseries"
        gridPos = { x = 8, y = 0, w = 8, h = 8 }
        targets = [{
          expr         = "100 * sum(rate(http_requests_total{env=\"$env\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{env=\"$env\"}[5m]))"
          legendFormat = "error %"
        }]
        fieldConfig = {
          defaults = {
            unit = "percent"
            thresholds = {
              mode = "absolute"
              steps = [
                { color = "green", value = null },
                { color = "yellow", value = 0.1 },
                { color = "red", value = 1 }
              ]
            }
          }
        }
      },
      {
        title   = "Latency — p50 / p95 / p99"
        type    = "timeseries"
        gridPos = { x = 16, y = 0, w = 8, h = 8 }
        targets = [
          {
            expr         = "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{env=\"$env\"}[5m])) by (le))"
            legendFormat = "p50"
          },
          {
            expr         = "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{env=\"$env\"}[5m])) by (le))"
            legendFormat = "p95"
          },
          {
            expr         = "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{env=\"$env\"}[5m])) by (le))"
            legendFormat = "p99"
          }
        ]
        fieldConfig = { defaults = { unit = "s" } }
      }
    ]
  })
}

Always use a template variable for environment. A dashboard with $env as a drop-down can be used for both staging and production incident response without maintaining two separate copies. Add a second variable for $cluster if you run multiple Kubernetes clusters. Grafana evaluates these at query time — there is no duplication in the JSON model.
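
One way to enforce the template-variable rule in CI is to scan the dashboard JSON before it is applied. The following is a rough sketch in Python, assuming the dashboard is available as a parsed dict in Grafana's JSON model with queries under panels, targets, expr. The function name and the sample dashboard are hypothetical, and panels nested inside collapsed rows are not handled.

```python
import re


def targets_missing_variable(dashboard: dict, variable: str = "env") -> list[str]:
    """Return 'panel title: expr' for every query that ignores the variable."""
    name = re.escape(variable)
    # Matches $env, ${env} and ${env:format}, but not $environment.
    pattern = re.compile(r"\$(?:%s\b|\{%s(?::[^}]*)?\})" % (name, name))
    missing = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if not pattern.search(expr):
                missing.append(f"{panel.get('title', '<untitled>')}: {expr}")
    return missing


# Hypothetical dashboard: the second panel forgot the environment filter.
sample = {
    "panels": [
        {"title": "Request Rate (RPS)",
         "targets": [{"expr": 'sum(rate(http_requests_total{env="$env"}[5m]))'}]},
        {"title": "Queue Depth",
         "targets": [{"expr": "sum(queue_depth)"}]},
    ]
}
problems = targets_missing_variable(sample)  # flags only the "Queue Depth" query
```

A CI job could fail the build whenever the returned list is non-empty, so a panel that silently mixes staging and production data never reaches Grafana.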

Linking Dashboards Together

A dashboard hierarchy only works if the links between tiers are explicit. In Grafana, data links and panel links let you click a metric spike and jump directly to the detail dashboard with the same time window and the same environment filter pre-populated. Configure them in the panel JSON:

# Inside fieldConfig.defaults of a panel definition (Grafana JSON).
# The __field variables resolve only in data links, not in panel-level links.
"links": [
  {
    "title": "Open pod detail for this service",
    "url": "/d/pod-detail?var-env=${__field.labels.env}&var-service=${__field.labels.service}&from=${__from}&to=${__to}",
    "targetBlank": false
  }
]

# In Terraform (grafana provider >= 1.x supports dataLinks in fieldConfig)
# "$${" is the HCL escape for a literal "${", so Terraform leaves the Grafana variables alone.
fieldConfig = {
  defaults = {
    links = [
      {
        title = "Open DB Detail Dashboard"
        url   = "/d/db-detail?var-env=$${__field.labels.env}&from=$${__from}&to=$${__to}"
      }
    ]
  }
}

The time-range variables ${__from} and ${__to} are critical — they preserve the incident time window as you drill down. Without them, the detail dashboard opens at "last 1 hour" and you lose context about exactly when the anomaly happened.
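
When dashboards are generated from code, a small helper can make it hard to forget the time-range parameters. The following is a minimal sketch in Python; the helper name is hypothetical, and it assumes the target dashboard reads its variables from var-prefixed query parameters as in the links above.

```python
from urllib.parse import urlencode


def drilldown_url(dashboard_uid: str, variables: dict[str, str]) -> str:
    """Build a relative Grafana link that carries variables and the time range."""
    params = {f"var-{name}": value for name, value in variables.items()}
    # Grafana substitutes these placeholders when the link is clicked.
    params["from"] = "${__from}"
    params["to"] = "${__to}"
    # Keep "$", "{" and "}" unescaped so Grafana can still see the placeholders.
    return f"/d/{dashboard_uid}?" + urlencode(params, safe="${}")


url = drilldown_url("db-detail", {"env": "${__field.labels.env}"})
# url == "/d/db-detail?var-env=${__field.labels.env}&from=${__from}&to=${__to}"
```

Because the helper always appends from and to, every generated link preserves the incident window. If the URL is embedded in Terraform HCL, each "${" still needs the "$${" escape.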

Production Failure Modes in Dashboards

  • Dashboard refresh rate too high: a 5-second refresh on 20 panels with expensive Prometheus queries can itself cause a query storm, degrading the metrics database during the incident you are trying to debug. Default to 30s refresh; use 10s only on Tier 0 during active incidents.
  • Undefined "normal": without a reference showing the same period last week, engineers cannot tell if the current spike is unusual or just Tuesday traffic. Add an offset 1w query on every RPS and latency panel as a faint background reference line.
  • Missing data vs. zero: a panel showing a flat zero can mean "all is quiet" or "metrics stopped being emitted." Use absent() or a separate "last scrape" timestamp panel to distinguish silence from failure.
  • Dashboard rot: services are renamed, metrics are changed, and old panels silently return "No data." Instrument your dashboards: periodically run a query against the Grafana API to detect panels with zero data points and alert on them.

Never create an alert directly from a dashboard panel. Alerts defined via the Grafana UI panel editor are stored in Grafana's database, not in your alerting config-as-code. If that database is lost, restored from an older backup, or left behind in a migration to a new instance, those alerts vanish. Define all alerts as Prometheus alerting rules in YAML, route them through Alertmanager, commit the configuration to Git, and let the dashboard panel be a visualization only, with a link to the rule in the annotation.
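
The dashboard-rot check described in the list above can be sketched without committing to a particular API client. This Python example assumes the dashboard JSON has already been fetched and parsed, and that the caller supplies a count_points function that runs one query against the datasource and returns the number of data points. Both the function names and the sample data are hypothetical.

```python
from typing import Callable


def find_dead_panels(dashboard: dict, count_points: Callable[[str], int]) -> list[str]:
    """Return titles of panels whose queries all come back empty."""
    dead = []
    for panel in dashboard.get("panels", []):
        exprs = [t["expr"] for t in panel.get("targets", []) if t.get("expr")]
        # Text and row panels have no queries; skip them instead of flagging them.
        if exprs and all(count_points(expr) == 0 for expr in exprs):
            dead.append(panel.get("title", "<untitled>"))
    return dead


# Stand-in for a real datasource call: pretend one metric was renamed away.
fake_results = {"sum(rate(http_requests_total[5m]))": 120, "sum(old_metric_name)": 0}
dashboard = {
    "panels": [
        {"title": "Request Rate", "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]},
        {"title": "Legacy Queue", "targets": [{"expr": "sum(old_metric_name)"}]},
        {"title": "Notes", "type": "text"},
    ]
}
dead = find_dead_panels(dashboard, lambda expr: fake_results.get(expr, 0))
# dead contains only "Legacy Queue"
```

Run on a schedule, a check like this turns silent "No data" panels into a ticket for the owning team. Panels that are legitimately empty most of the time would need an allow-list.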

The Dashboard Review Checklist

Before merging a dashboard into production, every panel should pass this review:

  1. Actionable: if this panel shows a bad value, what do I do? If you cannot answer, remove the panel.
  2. Correctly aggregated: are you using rate() for counters and histogram_quantile() for latency? Are you summing rates across pods, and aggregating histogram buckets before taking a quantile, instead of averaging per-pod values?
  3. Threshold set: does the panel have green/yellow/red thresholds, or does it require the engineer to remember what "normal" looks like?
  4. Labeled: would a new engineer on their first on-call shift understand what this panel shows without asking? Title, Y-axis unit, and legend format must all be explicit.
  5. Linked: does clicking a spike take you to the relevant detail dashboard or log query?
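
Parts of this checklist can be checked mechanically from the panel JSON. The following is a rough sketch in Python, assuming the Grafana JSON model fields used earlier in this lesson (title, fieldConfig.defaults.unit, thresholds.steps, links). The function name is hypothetical, and the "actionable" and "correctly aggregated" items still need a human reviewer.

```python
def review_panel(panel: dict) -> list[str]:
    """Return the checklist items a panel fails, judged from its JSON alone."""
    defaults = panel.get("fieldConfig", {}).get("defaults", {})
    failures = []
    if not panel.get("title"):
        failures.append("labeled: missing title")
    if not defaults.get("unit"):
        failures.append("labeled: missing unit")
    # A single step is just the base color, so it does not count as a threshold.
    if len(defaults.get("thresholds", {}).get("steps", [])) < 2:
        failures.append("threshold: no thresholds set")
    if not (panel.get("links") or defaults.get("links")):
        failures.append("linked: no panel or data links")
    return failures


# A panel with a title and unit but no thresholds and no links.
latency_panel = {
    "title": "Latency p50 / p95 / p99",
    "fieldConfig": {"defaults": {"unit": "s"}},
}
failures = review_panel(latency_panel)
# failures lists the missing thresholds and the missing links
```

Running this over every panel in a pull request gives reviewers a short list of gaps to discuss, rather than a replacement for the review itself.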

Dashboards that survive this checklist become institutional knowledge. They encode the team's collective understanding of how the system behaves, and they make every future incident faster to resolve. That is the standard to build toward.