Prometheus & Grafana

Grafana Dashboards

Prometheus is a data engine — it collects, stores, and evaluates time-series metrics with ruthless efficiency. But a terminal full of PromQL results is not how humans detect degrading SLOs or communicate system health to an incident bridge. Grafana is the visualization and operational intelligence layer that turns raw Prometheus data into the dashboards that engineering teams actually use under pressure. This lesson covers how Grafana is architected, how to build panels and variables that scale across hundreds of services, and — critically — how to manage dashboards as code so they survive beyond the engineer who created them.

How Grafana Connects to Prometheus

Grafana treats data sources as first-class, pluggable objects. A Prometheus data source is configured once — pointing at your Prometheus or Thanos/Cortex query frontend — and then referenced by every panel across all dashboards. Each panel fires a PromQL query against that data source and renders the result as a time-series graph, stat, gauge, table, heatmap, or bar chart.

The critical thing to understand is that Grafana is a pure query layer: it sends PromQL to Prometheus on demand (or on the refresh interval you set) and renders whatever comes back. It stores no metric data itself. This matters for architecture: Grafana can safely be restarted, scaled horizontally, or replaced without touching your metric storage. Multiple Grafana instances can point at the same Prometheus and render identical dashboards.

Key idea: In large-scale setups the Grafana query path usually goes through a query frontend such as Thanos Query or Cortex, not directly to a single Prometheus. This is invisible to Grafana: the data source URL just points at the frontend. Knowing this matters when you tune query timeouts and cache headers.
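To make the "pure query layer" idea concrete, here is a minimal sketch of the kind of request a panel ends up issuing: a range query against the Prometheus HTTP API endpoint /api/v1/query_range. The helper name build_range_query_url and the example host are illustrative assumptions; the sketch only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, expr: str, start: int, end: int, step: str) -> str:
    """Build a Prometheus range-query URL (hypothetical helper, not a Grafana API).

    start and end are Unix timestamps in seconds; step is the resolution, e.g. "15s".
    """
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed frontend address; swap in your own Prometheus or Thanos URL.
    print(
        build_range_query_url(
            "http://thanos-query:9090",
            'rate(http_requests_total{job="api"}[5m])',
            start=1700000000,
            end=1700003600,
            step="15s",
        )
    )
```

Because the request is stateless, any Grafana replica pointed at the same URL would get the same answer, which is why Grafana itself can be restarted or replaced freely.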

Anatomy of a Dashboard

A Grafana dashboard is a JSON document containing an ordered list of panels, a set of variables, time range settings, and metadata. Each panel has:

Query targets — one or more PromQL expressions, each producing a set of time series.
Visualization type — time series, stat (single big number), gauge, bar chart, table, heatmap, logs (with Loki), flame graph (with Pyroscope).
Transformations — data manipulation performed by Grafana itself, not by Prometheus: merge, join, group by, calculate fields. Applied after the query returns, before rendering.
Overrides and field config — unit (bytes, seconds, percent), decimal places, color thresholds, axis scale (linear vs log).

Most engineers focus entirely on queries and ignore field config. That is a mistake. A panel showing 0.0023 when it should show 2.3 ms, or a graph without unit labels, is a dashboard that causes misreads during incidents. Always set the unit field — Grafana's unit list covers bytes, bits/s, percent (0–100 and 0.0–1.0 are different units), nanoseconds through hours, RPM, and dozens more.
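As a rough illustration of what "set the unit field" means in the dashboard JSON, the sketch below builds a fieldConfig fragment for a latency panel. The unit id "s" (seconds), the 300 ms SLO, and the warning band at 80% of the SLO are assumptions chosen for the example; the helper name latency_field_config is hypothetical.

```python
import json


def latency_field_config(slo_seconds: float) -> dict:
    """Return a fieldConfig fragment for a latency panel (illustrative sketch)."""
    return {
        "defaults": {
            "unit": "s",  # values are seconds; Grafana picks a readable scale such as ms
            "decimals": 1,
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step has no lower bound
                    # Assumed convention: warn at 80% of the SLO.
                    {"color": "yellow", "value": round(slo_seconds * 0.8, 3)},
                    {"color": "red", "value": slo_seconds},
                ],
            },
        },
        "overrides": [],
    }


if __name__ == "__main__":
    # Assumed SLO of 300 ms, expressed in seconds to match the unit.
    print(json.dumps(latency_field_config(0.300), indent=2))
```

Keeping the thresholds in the same unit as the query result (seconds here) avoids the classic mismatch where a threshold of 300 is compared against values measured in seconds.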

Variables: The Engine of Reusable Dashboards

A dashboard with hardcoded service names or instance labels is a dashboard that requires copying for every new service. Template variables are the mechanism that makes one dashboard cover your entire fleet.

Variables are declared in the dashboard settings under Variables. The most useful types are:

Query variable — populated by a Prometheus query at load time. Example: label_values(http_requests_total, job) returns every distinct value of the job label on that metric, giving you a dropdown of the services Prometheus has series for in the queried time window.
Custom variable — a static comma-separated list. Use for environment (prod,staging,dev) or severity.
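Interval variable — lets the user pick a step interval (1m,5m,15m,1h) and reference it in queries by its name (for example $interval). This is distinct from the built-in $__interval and $__rate_interval variables, which Grafana computes automatically from the time range and panel width.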
Datasource variable — lets a single dashboard switch between multiple Prometheus data sources (e.g., different regions or clusters).

Once a variable $job is defined, every panel query can reference it: rate(http_requests_total{job="$job"}[5m]). Grafana substitutes the current selection before sending the query. Variables also support multi-value selection and an All option. Both expand to a regex alternation, so the query must use the regex matcher: job=~"$job" becomes job=~"(service-a|service-b|service-c)". With the equality matcher job="$job", a multi-value selection matches nothing.
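The substitution step can be sketched in a few lines. This is a simplified model, not Grafana's implementation: the interpolate function is hypothetical, it handles only the $name syntax, and it skips the regex escaping that Grafana applies to selected values.

```python
import re


def interpolate(query: str, selections: dict[str, list[str]]) -> str:
    """Replace $name placeholders with the selected values (simplified model).

    A single selection is inserted as-is; several selections become a
    regex alternation, which only works with the =~ matcher in PromQL.
    """

    def replace(match: re.Match) -> str:
        values = selections.get(match.group(1))
        if not values:
            return match.group(0)  # unknown variable: leave the placeholder untouched
        return values[0] if len(values) == 1 else "(" + "|".join(values) + ")"

    return re.sub(r"\$(\w+)", replace, query)


if __name__ == "__main__":
    template = 'rate(http_requests_total{job=~"$job"}[5m])'
    print(interpolate(template, {"job": ["service-a"]}))
    print(interpolate(template, {"job": ["service-a", "service-b", "service-c"]}))
```

Writing panel queries with =~ from the start means the same query keeps working when someone later enables multi-value or All on the variable.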

Pro practice: Always chain variables. Define $cluster first, then a $namespace query variable whose PromQL filters by cluster="$cluster", then a $job filtered by namespace. Users drill down naturally, and each dropdown shows only the relevant choices for the selections above it — reducing cognitive load and preventing "query returns nothing" confusion.
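A chained variable set might look roughly like the following when generated as dashboard JSON. The metric kube_pod_info carrying a cluster label, the data source UID prometheus-prod, and the helper name query_variable are assumptions for illustration; field names follow the dashboard JSON model as commonly exported, so compare against an export from your own Grafana version before relying on them.

```python
import json


def query_variable(name: str, query: str) -> dict:
    """Build one query-type template variable (illustrative sketch)."""
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "prometheus-prod"},  # assumed UID
        "query": query,
        "refresh": 1,  # re-run the variable query on dashboard load
    }


# Order matters: each variable filters on the ones declared before it.
templating = {
    "list": [
        query_variable("cluster", "label_values(kube_pod_info, cluster)"),
        query_variable(
            "namespace",
            'label_values(kube_pod_info{cluster="$cluster"}, namespace)',
        ),
        query_variable(
            "job",
            'label_values(http_requests_total{cluster="$cluster", namespace="$namespace"}, job)',
        ),
    ]
}

if __name__ == "__main__":
    print(json.dumps(templating, indent=2))
```

Changing $cluster causes Grafana to re-run the $namespace query, which in turn refreshes $job, so each dropdown only ever lists values that exist under the current parent selection.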

Grafana resolves template variables into PromQL, queries Prometheus (or a Thanos/Cortex frontend), and renders results as typed panels. Dashboards and data sources are loaded from disk at startup via provisioning.

Provisioning: Dashboards as Code

Clicking "Save" in the Grafana UI stores the dashboard JSON in Grafana's internal database. That is fine for experiments. It is not acceptable for production because it means your dashboards are not in version control, they differ between environments, and they vanish if the Grafana database is lost.

Provisioning solves this. Grafana reads YAML manifests and JSON dashboard files from a directory on startup (and watches for changes if updateIntervalSeconds is set). Any dashboard file on disk takes precedence over and overwrites the database version — making disk the source of truth.

The provisioning directory layout:

# Provisioning config lives under /etc/grafana/provisioning/:
#   datasources/prometheus.yaml   ← define data sources
#   dashboards/default.yaml       ← tell Grafana WHERE to find JSON files
# Dashboard JSON lives in a separate directory:
#   /var/lib/grafana/dashboards/  ← the actual JSON dashboard files

# datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus-prod          # stable UID — panels reference this
    url: http://thanos-query:9090
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: "15s"         # match your scrape interval
      queryTimeout: "60s"
      httpMethod: POST            # POST supports larger queries (Thanos)

# dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Services"
    type: file
    disableDeletion: true         # keep a dashboard in Grafana even if its file is removed from disk
    updateIntervalSeconds: 30     # hot-reload when JSON files change on disk
    options:
      path: /var/lib/grafana/dashboards

With this in place, you check your .json dashboard files into Git alongside your Kubernetes manifests or Terraform code. A ConfigMap mounts them into the Grafana pod; a GitOps pipeline (ArgoCD, Flux) ensures every environment has the same dashboards. Engineers open PRs to add panels — the same review workflow as code.
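One way to produce that ConfigMap from the dashboard files in Git is a small generator script run in CI. The sketch below is an assumption about how such a step could look, not a required tool: the function name dashboards_configmap and the monitoring namespace are hypothetical, and the ConfigMap name matches the grafana-dashboards example used in this lesson.

```python
import json


def dashboards_configmap(
    dashboards: dict[str, dict],
    name: str = "grafana-dashboards",
    namespace: str = "monitoring",  # assumed namespace
) -> dict:
    """Wrap dashboard JSON documents in a Kubernetes ConfigMap manifest.

    Keys of `dashboards` are file names as they should appear under the
    mount path; values are the parsed dashboard JSON.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            # Sorted keys and files keep the output stable, so Git diffs stay small.
            filename: json.dumps(dashboard, indent=2, sort_keys=True)
            for filename, dashboard in sorted(dashboards.items())
        },
    }


if __name__ == "__main__":
    manifest = dashboards_configmap(
        {"service-overview.json": {"uid": "service-overview", "title": "Service Overview"}}
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

ConfigMaps are capped at roughly 1 MiB, so large dashboard collections usually need to be split across several ConfigMaps.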

Production pitfall: disableDeletion: true only stops Grafana from deleting a dashboard when its file disappears from disk; it does not govern edits made in the UI. With the provider default (allowUiUpdates: false), users can change a provisioned dashboard in the browser but cannot save it back. With allowUiUpdates: true, saved changes are overwritten the next time the file on disk is reloaded. Both cases cause confusion and lost work. Either set "editable": false in provisioned JSON, or teach your team that the UI is read-only for provisioned dashboards and changes must go through Git. Most big-tech shops enforce the latter.

Dashboard JSON and the Grafana CLI

The Grafana UI exports dashboards as JSON via Dashboard settings → JSON model. Two fields to always set before committing:

"uid" — a stable identifier (e.g., "service-overview"). Without a stable UID, Grafana creates a new dashboard on import instead of updating the existing one, breaking all bookmarks and alert links.
"version" — set to 1 when committing a new dashboard and let Grafana increment it; do not manually bump it, or concurrent edits will conflict.

The HTTP API (/api/dashboards/db) lets you import dashboards programmatically, which is useful in CI pipelines to confirm that a throwaway Grafana instance accepts the JSON before merging. The Grafana CLI (grafana-cli) covers plugin and admin tasks; it does not import dashboards.

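# Validate dashboard JSON (CI step) by importing it into a throwaway Grafana via the API.
# The endpoint expects the dashboard wrapped in a {"dashboard": ..., "overwrite": ...} envelope.
set -o pipefail   # without this, a curl failure is masked by the trailing jq
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboards/my-service.json \
  | curl -sf \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
      -X POST http://grafana:3000/api/dashboards/db \
      --data-binary @- \
  | jq '.status'
# → "success" when Grafana accepts the dashboard; non-zero exit on malformed JSON or an HTTP error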

# Export a dashboard for version control (replace UID and token):
curl -s \
  -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  "http://grafana:3000/api/dashboards/uid/service-overview" \
  | jq '.dashboard' > dashboards/service-overview.json

# Kubernetes ConfigMap mounting dashboards into the Grafana pod:
# (excerpt from a Helm values file or raw manifest)
#
# extraConfigmapMounts:
#   - name: dashboards
#     configMap: grafana-dashboards
#     mountPath: /var/lib/grafana/dashboards
#     readOnly: true
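The two fields called out in this section, "uid" and "version", can also be checked offline before any Grafana instance is involved. The following is a minimal sketch of such a CI check; the function name check_dashboard and the exact UID pattern (letters, digits, hyphen, underscore, at most 40 characters) are assumptions to adapt to your own conventions.

```python
import json
import re
import sys

# Assumed project convention for stable UIDs.
UID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,40}$")


def check_dashboard(dashboard: dict) -> list[str]:
    """Return a list of problems found in a dashboard JSON document."""
    problems = []
    uid = dashboard.get("uid")
    if not isinstance(uid, str) or not UID_PATTERN.match(uid):
        problems.append("uid must be a stable string such as 'service-overview'")
    version = dashboard.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        problems.append("version must be a positive integer")
    return problems


if __name__ == "__main__":
    # Usage: python check_dashboards.py dashboards/*.json
    failed = False
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for problem in check_dashboard(json.load(handle)):
                print(f"{path}: {problem}")
                failed = True
    sys.exit(1 if failed else 0)
```

A check like this catches the most common export mistake, a missing or auto-generated UID, without needing network access in the CI job.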

Effective Panel Design for On-Call Use

Dashboards built for on-call use have different requirements than dashboards built for quarterly reviews. A few big-tech patterns that survive production incidents:

Separate overview and drill-down dashboards. The overview shows the four golden signals (latency, traffic, errors, saturation) at the service level with one row per service. Drill-down dashboards expose per-endpoint, per-pod, or per-dependency detail. Link them with Grafana's data link feature: clicking a spike on the overview opens the drill-down filtered to that time range.
Use thresholds and color deliberately. Green/yellow/red on a stat panel should map to your actual SLO thresholds, not arbitrary defaults. A panel that is green at 490 ms when your SLO is 300 ms is actively misleading.
Set the refresh interval per dashboard purpose. Operational dashboards: 15–30 s. Capacity planning: 5 min. Historical trend: no auto-refresh. An aggressive refresh interval on a high-cardinality query taxes both Prometheus and Grafana — profile with the Grafana query inspector before setting low intervals.
Annotate deployments. Use Grafana's Annotations API or a webhook from your CD pipeline to add vertical lines at deploy times. The most common root cause of production degradation is "we deployed something" — annotated dashboards make that correlation instantaneous.
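For the deployment annotations mentioned above, a CD pipeline step could post to Grafana's annotations endpoint (/api/annotations). The sketch below only constructs the request so it can be inspected or tested; the function name build_deploy_annotation, the tag scheme, and the service and version values are illustrative assumptions.

```python
import json
import time
import urllib.request
from typing import Optional


def build_deploy_annotation(
    grafana_url: str,
    token: str,
    service: str,
    version: str,
    deployed_at: Optional[float] = None,
) -> urllib.request.Request:
    """Build (but do not send) a request that records a deploy annotation."""
    when = time.time() if deployed_at is None else deployed_at
    payload = {
        "time": int(when * 1000),  # the API expects epoch milliseconds
        "tags": ["deploy", service],  # assumed tag scheme for filtering on dashboards
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_deploy_annotation("http://grafana:3000", "example-token", "checkout", "v1.2.3")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request, timeout=10)
```

A dashboard then needs an annotation query that filters on the deploy tag for the vertical lines to appear on its panels.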

Pro practice: Keep a "dashboard health" review in your quarterly SRE rotation. Prune panels nobody looks at, update queries that reference deprecated metric names, and verify that every alert in Alertmanager links to the correct dashboard panel. Dashboards rot faster than application code — unused panels accumulate, queries break when metric names change, and on-call engineers stop trusting dashboards they cannot read under pressure.

Grafana Alerting vs Prometheus Alerting

Grafana has its own alerting engine (Grafana Alerting, formerly Grafana Unified Alerting). For teams already using Prometheus alerting rules and Alertmanager, the recommendation is to keep alert logic in Prometheus — it runs even when Grafana is down and keeps your alert logic in version-controlled YAML alongside your recording rules. Use Grafana alerts only for data sources that Prometheus cannot reach (Loki, Tempo, SQL databases). Never duplicate the same alert in both systems.

The completed provisioning pipeline (dashboards in Git, loaded via ConfigMap, watched by Grafana, data source pointing at a Thanos query frontend, alert rules in Prometheus) is a common pattern for large Kubernetes deployments. It is fully reproducible, reviewable, and survives Grafana restarts or full cluster rebuilds without manual intervention.