PromQL for Real Questions
The previous lesson introduced PromQL syntax and operators in isolation. This lesson is different: it starts from the questions an on-call engineer or SRE actually asks during an incident or capacity review, and shows the PromQL that answers them. The three categories covered here (error ratios, latency percentiles from histograms, and capacity and saturation queries) account for the bulk of the dashboards and alerting rules you will write in production. Master these patterns and you can construct most of the observability queries you need.
Pattern 1: Error Ratios
One of the most common SLOs is an error-rate SLO: for example, "99.9 % of requests must succeed." Prometheus exposes request counts through a counter, typically http_requests_total with labels such as status, job, and handler. To compute the error ratio you need two rates, errors over the window and total requests over the same window, and then you divide the first by the second.
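A minimal sketch of that division, assuming the counter is named http_requests_total, carries a status label, and that a 5-minute window and per-job grouping suit your service (all three are illustrative choices, not requirements):

```promql
# Fraction of requests that returned a 5xx status over the last 5 minutes, per job.
# Assumes a `status` label; some instrumentation libraries call it `code` instead.
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
```

A result of 0.001 corresponds to 0.1 % of requests failing, which is the entire error budget of a 99.9 % success SLO.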
The =~"5.." label matcher uses an RE2 regular expression. Prometheus anchors label regexes at both ends, so it matches any three-character status value beginning with 5 (500, 503, 504, and so on) without requiring you to enumerate every code. This is the canonical pattern for HTTP server errors in Prometheus.
A critical production habit: always use the same range window in both the numerator and the denominator. If you use a 5-minute window for error counts but a 1-hour window for total requests, the two rates describe different periods, so the quotient is no longer the fraction of requests that failed and can produce phantom alerts. Also plan for no-data gaps during healthy periods. If no 5xx series exists yet, the numerator is empty and the division returns no result at all; using or vector(0) on a label-free numerator makes the ratio read 0 instead. If there is no traffic at all, the ratio is 0/0 (NaN) or absent, so decide explicitly how your alert should treat that case rather than letting missing data be mistaken for real errors.
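One possible way to apply the or vector(0) guard is sketched below. It assumes both sides are aggregated down to a single label-free series, because vector(0) carries no labels and would not match a numerator grouped by job:

```promql
# Service-wide error ratio that reads 0 (instead of "no data")
# when no 5xx series has been recorded yet.
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  or vector(0)
)
/
sum(rate(http_requests_total[5m]))
```

If you need the ratio per job, this label-free guard does not carry over directly; treat it as a starting point rather than a general recipe.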
Pattern 2: p95 Latency from Histograms
A histogram metric — for example http_request_duration_seconds — is actually three metric families: _bucket (cumulative counts per configured le boundary), _sum (running total of all observed values), and _count (total number of observations). The average latency is trivially _sum / _count, but averages hide tail latency entirely. In production you need percentiles.
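As an illustration of the _sum / _count relationship, the average latency over a recent window can be sketched like this (the 5-minute window and the grouping by job are assumptions made for the example):

```promql
# Mean request duration in seconds over the last 5 minutes, per job.
# Both sides are wrapped in rate() so the result reflects recent traffic
# rather than everything observed since the process started.
sum by (job) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (job) (rate(http_request_duration_seconds_count[5m]))
```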
The function histogram_quantile(φ, le_buckets) interpolates the φ-th quantile from the bucket boundaries. Wrapping _bucket series in rate() converts cumulative counters into a per-second rate of change per bucket, which lets histogram_quantile answer "what is the p95 over the past 5 minutes?" rather than "since process start?"
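A hedged sketch of the p95 query described above, assuming the histogram is named http_request_duration_seconds and that a 5-minute window with per-job aggregation fits your service:

```promql
# 95th-percentile request duration over the last 5 minutes, per job.
# The `le` label must survive the aggregation, because histogram_quantile
# needs the bucket boundaries to interpolate.
histogram_quantile(
  0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Changing 0.95 to 0.5 or 0.99 yields the median or the p99 from the same buckets.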
histogram_quantile can only interpolate inside the bucket boundaries you configured. If the requested quantile falls in the +Inf bucket, the function returns the upper bound of the highest finite bucket (e.g. le="10"), not the true 99th percentile, so a p99 that consistently equals your highest configured bucket is artificially capped on dashboards. Always configure buckets that span your actual latency distribution, with +Inf as the catch-all. The Prometheus client default buckets (.005 through 10 seconds) suit general web APIs; very fast microservices often need finer sub-millisecond resolution at the low end.
When aggregating across multiple instances (pods, nodes), always use sum by (job, le) before calling histogram_quantile. Computing percentiles per instance and then averaging the results ("averaging percentiles") is statistically incorrect and can badly misstate tail latency, for example by hiding one slow instance behind several fast ones.
Pattern 3: Capacity and Saturation Queries
Brendan Gregg's USE method (Utilization, Saturation, Errors) frames capacity questions around two dimensions: how close is a resource to its limit, and is it already queuing work it cannot keep up with? The following queries cover the most critical capacity signals in a typical production environment running node_exporter and kube-state-metrics.
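The queries below are a sketch of typical utilization and saturation signals built on standard node_exporter metrics. The 5-minute window, the fstype filter, and the choice of load average as a saturation proxy are assumptions for illustration; each numbered item is a separate query, not one expression.

```promql
# 1. CPU utilization per instance: fraction of time not spent idle.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# 2. Memory utilization per instance.
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# 3. Filesystem utilization, skipping pseudo filesystems.
1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}

# 4. CPU saturation: 1-minute load average per core.
#    Values persistently above 1 suggest runnable work is queuing.
node_load1
/ on (instance)
count by (instance) (node_cpu_seconds_total{mode="idle"})
```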
Use predict_linear() to project when a resource will be exhausted. The following query fits a linear trend to the past 6 hours of available bytes, projects each filesystem's value 4 hours ahead, and keeps only the filesystems whose projected value is below zero. This is extremely useful for catching slow disk leaks before they become incidents:
predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
Any series this query returns belongs to a filesystem projected to fill within 4 hours. Wire this into an alert with a 30-minute for clause to filter transient spikes.
Combining Patterns: The Golden Signals Dashboard
In production, these three patterns live together in a single "Golden Signals" dashboard with four rows: Traffic (total request rate), Errors (error ratio), Latency (p50/p95/p99 histograms), and Saturation (CPU + memory + disk). Every service should have this dashboard before it goes to production. The queries above are the building blocks; the SLO burn-rate query belongs in the alerting layer (Lesson 6) on top of them.
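The Traffic row is the only one without a query earlier in this lesson. A minimal sketch, again assuming the http_requests_total counter and a 5-minute window:

```promql
# Requests per second over the last 5 minutes, per job.
sum by (job) (rate(http_requests_total[5m]))
```

This is the same expression used as the denominator of the error ratio, so the Traffic and Errors rows stay consistent as long as they share a range window.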