Prometheus & Grafana

PromQL for Real Questions

The previous lesson introduced PromQL syntax and operators in isolation. This lesson starts from the questions an on-call engineer or SRE actually asks during an incident or capacity review, and shows PromQL that answers them. The three categories covered here (error ratios, latency percentiles from histograms, and capacity and saturation queries) underpin a large share of the dashboards and alerting rules you will write in production. Master these patterns and you can adapt them to most of the observability queries you need.

Pattern 1: Error Ratios

One of the most common SLOs is an error rate SLO: for example, "99.9 % of requests must succeed." Instrumented services expose request counts through a counter, typically http_requests_total with labels such as status, job, and handler (exact metric and label names vary by client library and framework). To compute the error ratio you need two rates, errors over the window and total requests over the same window, and then you divide the first by the second.

# 5-minute error ratio per job (values 0-1; multiply by 100 for a percentage)
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))

# SLO burn-rate: how fast are we spending the error budget?
# Error budget = 1 - SLO target (e.g. 0.001 for 99.9%)
# Burn rate > 1 means you will exhaust the budget before the window closes.
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001

The =~"5.." label matcher uses an RE2 regex. Prometheus anchors label regexes at both ends, so it matches any three-character status value that begins with 5 (500, 503, 504, and so on) without requiring you to enumerate every code. This is a common pattern for HTTP server errors in Prometheus.
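
To make the burn-rate comment in the query above concrete, here is a minimal Python sketch of the same arithmetic. The function names, the observed error ratio, and the 30-day SLO window are illustrative assumptions, not part of Prometheus.

```python
# Hypothetical helpers; the 30-day SLO window below is an assumption.
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def hours_until_budget_exhausted(rate: float, slo_window_hours: float) -> float:
    """Time needed to spend a full budget if the current burn rate persists."""
    return slo_window_hours / rate


observed_error_ratio = 0.0144  # made-up value: 1.44 % of requests failed in the last hour
current_rate = burn_rate(observed_error_ratio, slo_target=0.999)
hours_left = hours_until_budget_exhausted(current_rate, slo_window_hours=30 * 24)

print(round(current_rate, 1), round(hours_left, 1))
```

With a 99.9 % target, a 1.44 % error ratio corresponds to a burn rate of about 14.4, which would spend a full 30-day budget in roughly 50 hours if it persisted. A burn rate of exactly 1 spends the budget in exactly the SLO window.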

A critical production habit: always keep the same range window in both the numerator and denominator. If you use a 5-minute window for error counts but a 1-hour window for total requests, the ratio compares two different time periods and will produce misleading values and phantom alerts. Also protect against no-data gaps during healthy periods: if no 5xx series exists, the numerator is empty and the whole ratio returns nothing rather than 0. Give the numerator a zero-valued fallback, either or vector(0) when the query aggregates away all labels, or a label-preserving fallback built from the denominator when grouping by job, so dashboards and alert rules see an explicit 0 instead of missing data.

# Robust error ratio: returns 0 (not no-data) when there are zero errors
(
  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
  or
  (sum by (job) (rate(http_requests_total[5m])) * 0)
)
/
sum by (job) (rate(http_requests_total[5m]))
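
Why the fallback matters is easier to see with instant vectors modelled as plain dictionaries. The sketch below is a deliberately simplified model of PromQL's or operator and one-to-one matching on a single job label; the helper names and sample values are hypothetical, and it assumes non-zero denominators.

```python
# Instant vectors modelled as dicts from a job label to a sample value.
def or_fallback(lhs: dict, rhs: dict) -> dict:
    """Model of `lhs or rhs`: keep lhs, add rhs entries whose labels lhs lacks."""
    merged = dict(rhs)
    merged.update(lhs)
    return merged


def divide(numerator: dict, denominator: dict) -> dict:
    """Model of one-to-one matching: only jobs present on both sides survive."""
    return {
        job: numerator[job] / denominator[job]
        for job in numerator
        if job in denominator
    }


error_rate = {"api": 0.5}                  # "web" has not served a single 5xx
total_rate = {"api": 100.0, "web": 40.0}

naive = divide(error_rate, total_rate)
zero_filled = {job: value * 0 for job, value in total_rate.items()}
robust = divide(or_fallback(error_rate, zero_filled), total_rate)

print(naive)   # no entry for "web"
print(robust)  # "web" is present with a ratio of 0.0
```

In the naive version the healthy job simply disappears from the result; with the fallback it reports an explicit zero.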

Pattern 2: p95 Latency from Histograms

A histogram metric, for example http_request_duration_seconds, is exposed as three kinds of time series: _bucket (cumulative counts per configured le boundary), _sum (running total of all observed values), and _count (total number of observations). The average latency over a window is simply rate(_sum) / rate(_count), but averages hide tail latency entirely. In production you need percentiles.

The function histogram_quantile(φ, le_buckets) interpolates the φ-th quantile from the bucket boundaries. Wrapping _bucket series in rate() converts cumulative counters into a per-second rate of change per bucket, which lets histogram_quantile answer "what is the p95 over the past 5 minutes?" rather than "since process start?"
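
The interpolation step can be sketched in plain Python. This is a simplified illustration of the approach for classic histograms, not the Prometheus source: it assumes the buckets are already sorted by upper bound, end with +Inf, start at 0, and that 0 < q <= 1. The sample bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The quantile lies in the +Inf bucket: fall back to the highest finite bound.
        return buckets[-2][0]
    lower_bound = buckets[index - 1][0] if index > 0 else 0.0
    lower_count = buckets[index - 1][1] if index > 0 else 0.0
    upper_bound, upper_count = buckets[index]
    # Linear interpolation: observations are assumed evenly spread inside the bucket.
    fraction = (rank - lower_count) / (upper_count - lower_count)
    return lower_bound + (upper_bound - lower_bound) * fraction


# Hypothetical per-second bucket rates, as rate(..._bucket[5m]) might return them.
buckets = [(0.1, 50.0), (0.25, 80.0), (0.5, 94.0), (1.0, 99.0), (math.inf, 100.0)]

print(round(histogram_quantile(0.95, buckets), 3))
print(histogram_quantile(0.995, buckets))
```

Here the p95 rank falls one fifth of the way into the 0.5 to 1.0 second bucket, so the estimate is about 0.6 seconds. The 99.5th percentile falls in the +Inf bucket, so the sketch can only report the highest finite boundary, 1.0.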

# p95 latency per job over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (job, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# p99 latency broken down by handler (for per-endpoint dashboards)
histogram_quantile(
  0.99,
  sum by (handler, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Average latency (always show alongside p95/p99 for context)
sum by (job) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (job) (rate(http_request_duration_seconds_count[5m]))

Bucket coverage is everything. histogram_quantile can only interpolate between the bucket boundaries you configured. If the requested quantile falls above your highest finite bucket (e.g. le="10"), the function returns that boundary value, not the true 99th percentile, so the line looks artificially capped on dashboards. Always configure buckets that span your actual latency distribution; the +Inf bucket acts as the catch-all. The default buckets in the official Prometheus client libraries (.005 through 10 seconds) suit general web APIs; very fast services often need finer resolution, down to sub-millisecond boundaries, at the low end.

When aggregating across multiple instances (pods, nodes), always use sum by (job, le) before calling histogram_quantile, keeping the le label so the bucket structure survives. Computing percentiles per instance and then averaging the results ("averaging percentiles") is statistically incorrect: it ignores how many requests each instance served, and it underestimates tail latency whenever the slow instances carry most of the traffic.

[Figure: histogram_quantile aggregation order. Wrong (averaging percentiles): rate(_bucket[5m]) and histogram_quantile per pod (Pod A, Pod B), then avg(); the resulting p95 underestimates the tail. Correct (aggregate buckets first): rate() over the _bucket series of all pods, sum by(job, le), then histogram_quantile(0.95, aggregated buckets); the resulting p95 reflects true tail latency.]
Always aggregate all pod buckets with sum by(job, le) before calling histogram_quantile — never average per-pod percentiles.
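
A small numeric experiment shows how far the two approaches can diverge. The sketch below works on raw latency samples with a nearest-rank percentile rather than on Prometheus buckets, and the two pods and their traffic are hypothetical.

```python
import math


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile of raw samples (not bucket interpolation)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical traffic: pod A is idle and fast, pod B serves most requests and is degraded.
pod_a = [0.05] * 100                   # 100 requests, all 50 ms
pod_b = [0.50] * 900 + [2.00] * 100    # 1000 requests with a 2 s tail

avg_of_p95 = (percentile(pod_a, 0.95) + percentile(pod_b, 0.95)) / 2
fleet_p95 = percentile(pod_a + pod_b, 0.95)

print(avg_of_p95, fleet_p95)
```

Averaging the per-pod values reports roughly 1.0 s, while the p95 over all requests is 2.0 s. The unweighted average gives the nearly idle pod the same vote as the pod serving ten times the traffic.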

Pattern 3: Capacity and Saturation Queries

Brendan Gregg's USE method (Utilization, Saturation, Errors) frames capacity questions around two dimensions: how close is a resource to its limit, and is it already queuing work it cannot keep up with? The following queries cover the most critical capacity signals in a typical production environment running node_exporter, cAdvisor (via the kubelet), and kube-state-metrics.

# CPU utilization per node (0-1; alert > 0.80 to preserve headroom)
1 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)

# Memory utilization per node
1 - (
  node_memory_MemAvailable_bytes
  /
  node_memory_MemTotal_bytes
)

# Disk utilization per mount point (alert before 85%)
1 - (
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
)

# Kubernetes: containers near their CPU limit (saturation signal)
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="cpu"}
)

# Disk I/O saturation: average number of in-flight I/Os per device
# (sustained values well above 1 suggest queuing; SSDs and NVMe tolerate more)
rate(node_disk_io_time_weighted_seconds_total[5m])

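The CPU utilization query can look backwards at first, because it is built from the idle counter. A minimal Python sketch of the same arithmetic follows; the four-core node and its counter values are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that rate() performs.

```python
def cpu_utilization(idle_start: list[float], idle_end: list[float], window_seconds: float) -> float:
    """1 minus the average per-core idle rate, mirroring 1 - avg(rate(idle[5m]))."""
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    return 1.0 - sum(idle_rates) / len(idle_rates)


# Idle CPU-seconds per core, sampled 300 s apart (made-up values).
idle_at_start = [1000.0, 1200.0, 900.0, 1100.0]
idle_at_end = [1060.0, 1290.0, 1020.0, 1130.0]

print(round(cpu_utilization(idle_at_start, idle_at_end, 300.0), 2))
```

Each core spent between 10 % and 40 % of the window idle, 25 % on average, so the node was about 75 % utilized.
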
Forecast remaining capacity. Use predict_linear() to project when a resource will be exhausted: it fits a linear regression to the samples in the range and extrapolates the value a given number of seconds ahead. The following query returns every filesystem whose available bytes are projected to drop below zero within the next 4 hours, based on the trend over the past 6 hours, which is extremely useful for catching slow disk leaks before they become incidents:

predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0

Because of the < 0 filter, any series this query returns is a filesystem projected to fill within 4 hours. Wire this into an alert with a 30-minute for clause to filter transient spikes.
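
The projection itself is ordinary least-squares regression. The following Python sketch reproduces the idea on a hypothetical disk that loses 2.5 GiB per hour; the function mirrors the name predict_linear only for readability and is not the Prometheus implementation.

```python
def predict_linear(samples: list[tuple[float, float]], now: float, seconds_ahead: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and evaluate it at now + seconds_ahead."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + seconds_ahead)


GIB = 1024 ** 3
# Hourly samples over 6 hours: 20 GiB free at the start, shrinking by 2.5 GiB per hour.
samples = [(hour * 3600.0, (20 - 2.5 * hour) * GIB) for hour in range(7)]
now = samples[-1][0]

projected = predict_linear(samples, now, 4 * 3600)
print(round(projected / GIB, 2), projected < 0)
```

The disk has 5 GiB free now, and the fitted trend puts it at about -5 GiB in four hours, so the alert expression would return this filesystem.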

Combining Patterns: The Golden Signals Dashboard

In production, these three patterns live together in a single "Golden Signals" dashboard with four rows: Traffic (total request rate), Errors (error ratio), Latency (p50/p95/p99 histograms), and Saturation (CPU + memory + disk). Every service should have this dashboard before it goes to production. The queries above are the building blocks; the SLO burn-rate query belongs in the alerting layer (Lesson 6) on top of them.

Many organizations treat a golden signals dashboard as a prerequisite for launch readiness: a service does not pass its production-readiness review until these four panels are populated, thresholds are set, and on-call runbooks reference them. Build this habit early; it is one of the highest-leverage observability investments per hour of engineering time.