Prometheus & Grafana

PromQL for Real Questions

18 min Lesson 4 of 32

The previous lesson introduced PromQL syntax and operators in isolation. This lesson is different: it starts from the questions an on-call engineer or SRE actually asks during an incident or capacity review, and shows the PromQL that answers them. The three categories covered here (error ratios, latency percentiles from histograms, and capacity and saturation queries) account for roughly 80% of the dashboards and alerting rules you will write in production. Master these patterns and you can build most of the observability queries you need.

Pattern 1: Error Ratios

One of the most common SLOs is an error-rate SLO: for example, "99.9% of requests must succeed." Services typically expose request counts through a counter, often named http_requests_total, with labels such as status, job, and handler (the exact metric and label names depend on your instrumentation). To compute the error ratio you need two rates, errors over the window and total requests over the same window, and then you divide one by the other.

# 5-minute error ratio per job (values 0-1; multiply by 100 for a percentage)
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))

# SLO burn-rate: how fast are we spending the error budget?
# Error budget = 1 - SLO target (e.g. 0.001 for 99.9%)
# Burn rate > 1 means you will exhaust the budget before the SLO period ends.
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001
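
The arithmetic behind the burn-rate query can be checked by hand. The following is a minimal Python sketch, not part of Prometheus; the function names, the 0.5% error ratio, and the 30-day SLO period are assumptions chosen for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def hours_to_exhaustion(rate: float, slo_period_hours: float) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    return slo_period_hours / rate


# Assumed inputs: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2))  # 5.0

# Assumed SLO period of 30 days (720 hours).
print(round(hours_to_exhaustion(rate, 30 * 24), 1))  # 144.0
```

At a sustained burn rate of 5, a 30-day budget would be gone in about 6 days, which is why a value above 1 deserves attention.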

The =~"5.." label matcher takes an RE2 regular expression. Prometheus anchors label regexes at both ends, so the pattern matches exactly three-character values that start with 5 (500, 503, 504, and so on) without requiring you to enumerate every code. This is a widely used pattern for HTTP server errors in Prometheus.
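
To see which status values a pattern like this selects, here is a small Python sketch. It uses Python's re module rather than RE2, and re.fullmatch stands in for Prometheus's fully anchored matching; for a pattern this simple the two engines agree, but treat the helper name and the sample values as illustrative assumptions.

```python
import re


def prom_regex_match(pattern: str, value: str) -> bool:
    # Prometheus anchors label regexes at both ends,
    # so fullmatch is the closest Python analogue.
    return re.fullmatch(pattern, value) is not None


statuses = ["200", "404", "500", "503", "5xx", "5000", "50"]
matched = [s for s in statuses if prom_regex_match("5..", s)]
print(matched)  # ['500', '503', '5xx']
```

Note that "5000" and "50" are rejected because the whole value must match, and that a literal "5xx" label value would be included because the dot matches any character.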

A critical production habit: always use the same range window in the numerator and the denominator. If you use a 5-minute window for errors but a 1-hour window for total requests, the two rates describe different periods and the ratio will produce phantom alerts. Also guard against no-data gaps during healthy periods. When a job has served no 5xx responses, the numerator may contain no series for it at all, so the division returns nothing rather than 0 and dashboards show gaps. For an unlabeled sum(...) you can append or vector(0) to the numerator; for a query grouped by job, vector(0) has no job label to match on, so fill the numerator with the denominator multiplied by zero instead:

# Robust error ratio — returns 0 (not no-data) when there are zero errors
(
  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
  or
  (sum by (job) (rate(http_requests_total[5m])) * 0)
)
/
sum by (job) (rate(http_requests_total[5m]))
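
The effect of the zero-fill can be imitated with plain dictionaries keyed by job. This is only a sketch of the label-matching behaviour, not how Prometheus evaluates queries; the job names and rates are made up, and the helper assumes non-zero totals.

```python
def ratio_by_job(errors, totals, zero_fill):
    """Divide per-job error rates by per-job total rates."""
    numerator = dict(errors)
    if zero_fill:
        # Mimics `errors or (totals * 0)`: keep existing error series and
        # add a 0-valued series for every job that only appears in totals.
        for job in totals:
            numerator.setdefault(job, 0.0)
    # Like one-to-one vector matching, only jobs present on both sides survive.
    return {job: numerator[job] / totals[job] for job in numerator if job in totals}


# Hypothetical per-second rates: "search" has served no 5xx responses at all.
errors = {"checkout": 0.5}
totals = {"checkout": 50.0, "search": 200.0}

plain = ratio_by_job(errors, totals, zero_fill=False)
robust = ratio_by_job(errors, totals, zero_fill=True)
print(plain)   # {'checkout': 0.01}
print(robust)  # {'checkout': 0.01, 'search': 0.0}
```

Without the fill, the healthy job simply disappears from the result; with it, the job reports an explicit 0.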

Pattern 2: p95 Latency from Histograms

A histogram metric, for example http_request_duration_seconds, is exposed as three sets of series: _bucket (cumulative counts per configured le boundary), _sum (running total of all observed values), and _count (total number of observations). The average latency is simply _sum / _count, but averages hide tail latency entirely. In production you need percentiles.

The function histogram_quantile(φ, le_buckets) interpolates the φ-th quantile from the bucket boundaries. Wrapping _bucket series in rate() converts cumulative counters into a per-second rate of change per bucket, which lets histogram_quantile answer "what is the p95 over the past 5 minutes?" rather than "since process start?"

# p95 latency per job over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (job, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# p99 latency broken down by handler (for per-endpoint dashboards)
histogram_quantile(
  0.99,
  sum by (handler, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Average latency (always show alongside p95/p99 for context)
sum by (job) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (job) (rate(http_request_duration_seconds_count[5m]))

Bucket coverage is everything. histogram_quantile can only interpolate between the bucket boundaries you have defined. If the requested quantile falls into the +Inf bucket, the function returns the upper bound of the highest finite bucket (e.g. le="10") rather than the true percentile, so a p99 that sits exactly on your highest boundary is a sign that the panel is artificially capped. Always configure buckets that span your actual latency distribution; +Inf is only the catch-all. The Prometheus client default buckets (.005 through 10 seconds) suit general web APIs; very low-latency services often need finer, sub-millisecond resolution at the low end.
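
The interpolation and the capping behaviour can be illustrated with a simplified Python re-implementation. This is a sketch of the idea, assuming classic cumulative buckets, a quantile strictly between 0 and 1, and at least one finite bucket; it is not the Prometheus source, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) including (math.inf, total)."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    # The lowest bucket is assumed to start at 0.
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical 5-minute bucket increases: 2 of 100 requests took longer than 1 s.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(round(p95, 4))  # 0.8125
print(p99)            # 1.0
```

Here the p95 is interpolated inside the 0.5 to 1.0 bucket, while the p99 lands in +Inf and is reported as 1.0 no matter how slow those two requests really were.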

When aggregating across multiple instances (pods, nodes), always use sum by (job, le) before calling histogram_quantile. Computing percentiles per instance and then averaging the results ("averaging percentiles") is statistically incorrect: it ignores how much traffic each instance served, and it typically underestimates tail latency when a slow instance handles a meaningful share of requests.

Always aggregate all pod buckets with sum by(job, le) before calling histogram_quantile — never average per-pod percentiles.
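
A toy example makes the gap concrete. The sketch below works on raw samples with a nearest-rank percentile instead of histogram buckets, purely to keep it short; the pod names, request counts, and latencies are invented for illustration.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of raw latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical traffic: a healthy pod and a degraded pod.
pod_a = [0.05] * 1000  # 1000 requests at 50 ms
pod_b = [2.0] * 100    # 100 requests at 2 s

# Wrong: compute p95 per pod, then average the two numbers.
averaged = (percentile(pod_a, 0.95) + percentile(pod_b, 0.95)) / 2

# Right: merge the observations first, then take the percentile.
merged = percentile(pod_a + pod_b, 0.95)

print(round(averaged, 3))  # 1.025
print(merged)              # 2.0
```

More than 5% of all requests took 2 seconds, so the true p95 is 2 seconds, yet the averaged figure reports roughly half of that.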

Pattern 3: Capacity and Saturation Queries

Brendan Gregg's USE method (Utilization, Saturation, Errors) frames capacity questions around two dimensions: how close is a resource to its limit, and is it already queuing work it cannot keep up with? The following queries cover the most critical capacity signals in a typical production environment running node_exporter and kube-state-metrics.

# CPU utilization per node (0-1; alert > 0.80 to preserve headroom)
1 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)

# Memory utilization per node
1 - (
  node_memory_MemAvailable_bytes
  /
  node_memory_MemTotal_bytes
)

# Disk utilization per mount point (alert before 85%)
1 - (
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
)

# Kubernetes: containers near their CPU limit (saturation signal)
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="cpu"}
)

# Disk I/O saturation: average I/O queue depth per device (sustained values above 1 mean requests are queuing)
rate(node_disk_io_time_weighted_seconds_total[5m])

Forecast remaining capacity. Use predict_linear() to project when a resource will be exhausted. The following query fits a linear trend to the last 6 hours of available bytes and returns every filesystem whose projected free space 4 hours from now is below zero, which is extremely useful for catching slow disk leaks before they become incidents:

predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0

Any series the query returns is a filesystem projected to fill within 4 hours. Wire this into an alert with a 30-minute for clause to filter transient spikes.
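
The projection itself is a simple linear regression over the samples in the range. The following Python sketch reproduces that idea under stated assumptions: it is not Prometheus code, the hourly samples are invented, and values are in GiB rather than bytes for readability.

```python
def predict_linear(samples, eval_time, horizon_s):
    """Least-squares line through (timestamp_s, value) samples,
    evaluated horizon_s seconds after eval_time."""
    n = len(samples)
    xs = [t - eval_time for t, _ in samples]  # time relative to "now"
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x  # fitted value at eval_time
    return intercept + slope * horizon_s


HOUR = 3600
# Hypothetical disk losing 1 GiB of free space per hour: 9 GiB six hours ago, 3 GiB now.
samples = [(h * HOUR, 9.0 - h) for h in range(7)]

projected = predict_linear(samples, eval_time=6 * HOUR, horizon_s=4 * HOUR)
print(round(projected, 3))  # -1.0
```

A projected value below zero is what the `< 0` comparison in the query filters for: with 3 GiB left and 1 GiB disappearing per hour, the disk would run out in about 3 hours, well inside the 4-hour horizon.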

Combining Patterns: The Golden Signals Dashboard

In production, these three patterns live together in a single "Golden Signals" dashboard with four rows: Traffic (total request rate), Errors (error ratio), Latency (p50/p95/p99 histograms), and Saturation (CPU + memory + disk). Every service should have this dashboard before it goes to production. The queries above are the building blocks; the SLO burn-rate query belongs in the alerting layer (Lesson 6) on top of them.

Many organizations treat the golden signals dashboard as a prerequisite for launch readiness: a team cannot pass a production-readiness review without demonstrating that these four panels are populated, thresholds are set, and on-call runbooks reference them. Build this habit early; it is one of the highest-leverage observability investments per hour of engineering time.