Grafana Dashboards
Prometheus is a data engine — it collects, stores, and evaluates time-series metrics with ruthless efficiency. But a terminal full of PromQL results is not how humans detect degrading SLOs or communicate system health to an incident bridge. Grafana is the visualization and operational intelligence layer that turns raw Prometheus data into the dashboards that engineering teams actually use under pressure. This lesson covers how Grafana is architected, how to build panels and variables that scale across hundreds of services, and — critically — how to manage dashboards as code so they survive beyond the engineer who created them.
How Grafana Connects to Prometheus
Grafana treats data sources as first-class, pluggable objects. A Prometheus data source is configured once — pointing at your Prometheus or Thanos/Cortex query frontend — and then referenced by every panel across all dashboards. Each panel fires a PromQL query against that data source and renders the result as a time-series graph, stat, gauge, table, heatmap, or bar chart.
The critical thing to understand is that Grafana is a pure query layer: it sends PromQL to Prometheus on demand (or on the refresh interval you set) and renders whatever comes back. It stores no metric data itself. This matters for architecture: Grafana can safely be restarted, scaled horizontally, or replaced without touching your metric storage. Multiple Grafana instances can point at the same Prometheus and render identical dashboards.
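To make the "pure query layer" idea concrete, the sketch below builds roughly the kind of range query that a time-series panel results in, using Prometheus's /api/v1/query_range HTTP endpoint. The address, the checkout job label, and the helper name build_range_query_url are assumptions for illustration, and the code only constructs the URL without sending it.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own query endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_range_query_url(expr: str, start: int, end: int, step_seconds: int) -> str:
    """Build a Prometheus /api/v1/query_range URL for one panel query.

    start and end are Unix timestamps in seconds; step_seconds is the
    spacing between the samples Prometheus returns.
    """
    params = urlencode({
        "query": expr,
        "start": start,
        "end": end,
        "step": step_seconds,
    })
    return f"{PROMETHEUS_URL}/api/v1/query_range?{params}"


# One hour of data at 30-second resolution for an illustrative job label.
url = build_range_query_url(
    'rate(http_requests_total{job="checkout"}[5m])',
    start=1_700_000_000,
    end=1_700_003_600,
    step_seconds=30,
)
print(url)
```

Nothing in this request depends on state held by Grafana, which is why any number of Grafana instances can issue it against the same Prometheus.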
Anatomy of a Dashboard
A Grafana dashboard is a JSON document containing an ordered list of panels, a set of variables, time range settings, and metadata. Each panel has:
- Query targets — one or more PromQL expressions, each producing a set of time series.
- Visualization type — time series, stat (single big number), gauge, bar chart, table, heatmap, logs (with Loki), flame graph (with Pyroscope).
- Transformations: manipulation of the query results, performed by Grafana rather than by Prometheus, such as merge, join, group by, and calculated fields. Applied after the query returns and before rendering.
- Overrides and field config — unit (bytes, seconds, percent), decimal places, color thresholds, axis scale (linear vs log).
Most engineers focus entirely on queries and ignore field config. That is a mistake. A panel showing 0.0023 when it should show 2.3 ms, or a graph without unit labels, is a dashboard that causes misreads during incidents. Always set the unit field — Grafana's unit list covers bytes, bits/s, percent (0–100 and 0.0–1.0 are different units), nanoseconds through hours, RPM, and dozens more.
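As a sketch of what explicit field config looks like in the dashboard JSON, the snippet below builds one panel as a Python dictionary and prints it. The panel title, the metric name http_request_duration_seconds_bucket, and the threshold values are illustrative assumptions; the unit id "s" tells Grafana the values are seconds, so a raw 0.0023 is displayed scaled with a unit label instead of as a bare number.

```python
import json

# One time-series panel as it might appear in a dashboard's "panels" list.
# Title, metric name, and threshold values are assumptions for illustration.
latency_panel = {
    "type": "timeseries",
    "title": "p99 request latency",
    "targets": [
        {
            "refId": "A",
            "expr": (
                "histogram_quantile(0.99, sum by (le) "
                '(rate(http_request_duration_seconds_bucket{job="$job"}[5m])))'
            ),
        }
    ],
    "fieldConfig": {
        "defaults": {
            "unit": "s",  # raw values are seconds; Grafana scales the display
            "decimals": 1,
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step, no lower bound
                    {"color": "yellow", "value": 0.2},
                    {"color": "red", "value": 0.3},
                ],
            },
        },
        "overrides": [],
    },
}

print(json.dumps(latency_panel, indent=2))
```

For ratios, the same field takes "percentunit" when the query returns 0.0 to 1.0 and "percent" when it returns 0 to 100.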
Variables: The Engine of Reusable Dashboards
A dashboard with hardcoded service names or instance labels is a dashboard that requires copying for every new service. Template variables are the mechanism that makes one dashboard cover your entire fleet.
Variables are declared in the dashboard settings under Variables. The most useful types are:
- Query variable: populated by a Prometheus query at load time. Example: label_values(http_requests_total, job) returns every distinct value of the job label, giving you a dropdown of all services that have ever emitted that metric.
- Custom variable: a static comma-separated list. Use for environment (prod, staging, dev) or severity.
- Interval variable: lets the user pick a step interval (1m, 5m, 15m, 1h) and reference it in queries by its name. The built-in $__interval, which Grafana computes automatically from the time range, is the alternative.
- Datasource variable: lets a single dashboard switch between multiple Prometheus data sources (e.g., different regions or clusters).
Once a variable $job is defined, every panel query can reference it: rate(http_requests_total{job="$job"}[5m]). Grafana substitutes the current selection before sending the query. Variables also support multi-value selection and an All option. Selecting All expands to a regex that matches every value, such as job=~"service-a|service-b|service-c", so any query used with multi-value or All must use the regex matcher =~ instead of =.
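The substitution step can be approximated in a few lines. This is a simplified model, not Grafana's implementation: the function name interpolate is hypothetical, and it assumes the selected values contain no regex metacharacters, whereas Grafana also escapes them.

```python
import re


def interpolate(query: str, selections: dict[str, list[str]]) -> str:
    """Replace $name placeholders with the selected value(s).

    A single selected value is inserted as-is. Several values are joined
    into an alternation group, which is only valid inside a =~ matcher.
    Assumes values contain no regex metacharacters (a simplification).
    """
    def substitute(match: re.Match) -> str:
        values = selections.get(match.group(1))
        if not values:
            return match.group(0)  # unknown variable: leave the placeholder
        if len(values) == 1:
            return values[0]
        return "(" + "|".join(values) + ")"

    return re.sub(r"\$(\w+)", substitute, query)


template = 'rate(http_requests_total{job=~"$job"}[5m])'
single = interpolate(template, {"job": ["service-a"]})
multi = interpolate(template, {"job": ["service-a", "service-b", "service-c"]})
print(single)
print(multi)
```

Writing the panel query with =~ from the start means the same expression works for a single selection, a multi-value selection, and All.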
Chain variables so that each one filters the next. Define $cluster first, then a $namespace query variable whose PromQL filters by cluster="$cluster", then a $job filtered by namespace. Users drill down naturally, and each dropdown shows only the relevant choices for the selections above it, reducing cognitive load and preventing "query returns nothing" confusion.
Provisioning: Dashboards as Code
Clicking "Save" in the Grafana UI stores the dashboard JSON in Grafana's internal database. That is fine for experiments. It is not acceptable for production because it means your dashboards are not in version control, they differ between environments, and they vanish if the Grafana database is lost.
Provisioning solves this. Grafana reads YAML manifests and JSON dashboard files from a directory on startup (and watches for changes if updateIntervalSeconds is set). Any dashboard file on disk takes precedence over and overwrites the database version — making disk the source of truth.
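The provisioning directory (typically /etc/grafana/provisioning) holds a datasources/ subdirectory for data source manifests and a dashboards/ subdirectory for dashboard provider manifests. Each provider manifest points at a directory containing the dashboard JSON files.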
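A minimal sketch of that layout, written as a script that scaffolds it into a temporary directory so the structure and file contents are visible in one place. The file names, the data source URL, the 30-second update interval, and the /var/lib/grafana/dashboards mount path are assumptions; adjust them to your deployment.

```python
import tempfile
from pathlib import Path

DATASOURCE_YAML = """\
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.internal:9090  # hypothetical address
    isDefault: true
"""

DASHBOARD_PROVIDER_YAML = """\
apiVersion: 1
providers:
  - name: default
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards  # assumed mount point for the JSON files
"""

DASHBOARD_JSON = '{"uid": "service-overview", "title": "Service Overview", "panels": []}\n'


def scaffold(root: Path) -> list[str]:
    """Create the provisioning tree under root and return the relative paths."""
    files = {
        "provisioning/datasources/prometheus.yaml": DATASOURCE_YAML,
        "provisioning/dashboards/default.yaml": DASHBOARD_PROVIDER_YAML,
        "dashboards/service-overview.json": DASHBOARD_JSON,
    }
    for relative, content in files.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    for path in scaffold(Path(tmp)):
        print(path)
```

In a real deployment the two provisioning/ directories sit under Grafana's provisioning path and the dashboards/ directory is whatever the provider's options.path points at.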
With this in place, you check your .json dashboard files into Git alongside your Kubernetes manifests or Terraform code. A ConfigMap mounts them into the Grafana pod; a GitOps pipeline (ArgoCD, Flux) ensures every environment has the same dashboards. Engineers open PRs to add panels — the same review workflow as code.
If you set disableDeletion: true but leave dashboards editable ("editable": true in the JSON), users can still make changes in the UI, but those changes will be overwritten on the next provisioning cycle. This causes confusion and lost work. Either set "editable": false in provisioned JSON, or teach your team that the UI is read-only for provisioned dashboards and changes must go through Git. Most big-tech shops enforce the latter.
Dashboard JSON and the Grafana CLI
The Grafana UI exports dashboards as JSON via Dashboard settings → JSON model. Two fields to always set before committing:
- "uid": a stable identifier (e.g., "service-overview"). Without a stable UID, Grafana creates a new dashboard on import instead of updating the existing one, breaking all bookmarks and alert links.
- "version": set to 1 when committing a new dashboard and let Grafana increment it; do not manually bump it, or concurrent edits will conflict.
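The HTTP API (POST /api/dashboards/db) lets you import dashboards programmatically, which is useful in CI pipelines to check that a dashboard's JSON is well-formed and loads cleanly before merging. The Grafana CLI (grafana-cli) is aimed at plugin management and admin tasks, so dashboard import goes through the API or through provisioning.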
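A CI check along these lines might look like the following sketch. The required-field list is an assumed repository policy rather than a Grafana rule, and the function names are hypothetical; the payload shape (a "dashboard" object plus "overwrite" and "message") follows the body accepted by POST /api/dashboards/db.

```python
import json

# Assumed repository policy: every committed dashboard carries these fields.
REQUIRED_FIELDS = ("uid", "title", "panels")


def validate_dashboard(raw: str) -> dict:
    """Parse dashboard JSON and enforce the repository's conventions."""
    dashboard = json.loads(raw)  # raises json.JSONDecodeError if malformed
    missing = [field for field in REQUIRED_FIELDS if field not in dashboard]
    if missing:
        raise ValueError(f"dashboard is missing required fields: {missing}")
    if not isinstance(dashboard["uid"], str) or not dashboard["uid"]:
        raise ValueError("uid must be a non-empty string")
    return dashboard


def build_import_payload(dashboard: dict) -> str:
    """Build the JSON body for POST /api/dashboards/db."""
    body = {
        # A null numeric id lets Grafana match the dashboard by its uid.
        "dashboard": {**dashboard, "id": None},
        "overwrite": True,
        "message": "Imported from Git",
    }
    return json.dumps(body)


raw = '{"uid": "service-overview", "title": "Service Overview", "version": 1, "panels": []}'
payload = build_import_payload(validate_dashboard(raw))
print(payload)
```

Running only validate_dashboard on every changed file is enough for a pre-merge gate; the payload builder matters only for pipelines that push to a live Grafana instead of relying on provisioning.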
Effective Panel Design for On-Call Use
Dashboards built for on-call use have different requirements than dashboards built for quarterly reviews. A few big-tech patterns that survive production incidents:
- Separate overview and drill-down dashboards. The overview shows the four golden signals (latency, traffic, errors, saturation) at the service level with one row per service. Drill-down dashboards expose per-endpoint, per-pod, or per-dependency detail. Link them with Grafana's data link feature: clicking a spike on the overview opens the drill-down filtered to that time range.
- Use thresholds and color deliberately. Green/yellow/red on a stat panel should map to your actual SLO thresholds, not arbitrary defaults. A panel that is green at 490 ms when your SLO is 300 ms is actively misleading.
- Set the refresh interval per dashboard purpose. Operational dashboards: 15–30 s. Capacity planning: 5 min. Historical trend: no auto-refresh. An aggressive refresh interval on a high-cardinality query taxes both Prometheus and Grafana — profile with the Grafana query inspector before setting low intervals.
- Annotate deployments. Use Grafana's Annotations API or a webhook from your CD pipeline to add vertical lines at deploy times. The most common root cause of production degradation is "we deployed something" — annotated dashboards make that correlation instantaneous.
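The deployment-annotation pattern can be sketched as a small step in a CD job that posts to Grafana's /api/annotations endpoint. The Grafana address, the token placeholder, the service and version values, and the function name are assumptions; the code builds the request without sending it.

```python
import json
import time
import urllib.request
from typing import Optional

GRAFANA_URL = "http://grafana.example.internal:3000"  # hypothetical address


def build_deploy_annotation(service: str, version: str, token: str,
                            timestamp_ms: Optional[int] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /api/annotations request for a deploy."""
    body = {
        # Annotation times are epoch milliseconds; default to "now".
        "time": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        url=f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_deploy_annotation("checkout", "v1.4.2", token="REPLACE_ME",
                                  timestamp_ms=1_700_000_000_000)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# A CD job would send it with urllib.request.urlopen(request).
```

Because no dashboard is named in the body, this creates an organization-level annotation; a dashboard shows it once it has an annotation query that filters on the deploy tag.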
Grafana Alerting vs Prometheus Alerting
Grafana has its own alerting engine (Grafana Alerting, formerly Grafana Unified Alerting). For teams already using Prometheus alerting rules and Alertmanager, the recommendation is to keep alert logic in Prometheus — it runs even when Grafana is down and keeps your alert logic in version-controlled YAML alongside your recording rules. Use Grafana alerts only for data sources that Prometheus cannot reach (Loki, Tempo, SQL databases). Never duplicate the same alert in both systems.
The completed provisioning pipeline (dashboards in Git, loaded via ConfigMap, watched by Grafana, data source pointing at a Thanos query frontend, alert rules in Prometheus) is a pattern commonly used on large-scale Kubernetes clusters. It is fully reproducible, reviewable, and survives Grafana restarts or full cluster rebuilds without manual intervention.