Project: An SLO & Error Budget Policy
Everything in this tutorial has been building to this lesson. You know how to write SLIs and SLOs, compute error budgets, build burn-rate alerts, eliminate toil, and run production readiness reviews. Now you will put all of that into a single, coherent artifact — an SLO and error budget policy document — for a realistic production service. This is the deliverable that an SRE team presents to engineering leadership, product management, and on-call engineers. It is the contract that governs how fast you ship, when you freeze deployments, and how you recover reliability after incidents.
We will work through a concrete example: PaymentService, a critical API that processes financial transactions. This is deliberately the hardest class of service — low latency requirements, strong correctness guarantees, regulatory audit trails, and near-zero tolerance for data loss. If you can write an SLO policy for this, you can write one for anything.
Step 1: Define the Service and Its Users
Start with context. Every SLO policy opens by answering: who uses this service, what does it do for them, and what does failure actually cost? For PaymentService:
- Service: PaymentService — processes charge, refund, and authorization requests via REST and gRPC APIs.
- Users: External customers (checkout flow), internal services (fraud engine, order service), and third-party merchants via API key.
- Failure impact: A failed payment means lost revenue, a risk of duplicate charges, and customer churn. A slow checkout means cart abandonment. Every 100ms of added latency at the 99th percentile correlates with a measurable revenue drop (documented by Amazon, Google, Shopify).
- Measurement window: Rolling 28 days. (A 28-day window always contains exactly four of each weekday, so it smooths weekly seasonality better than calendar months, which vary in length and in how many weekends they contain.)
Step 2: Select SLIs
For an API like PaymentService, the canonical SLI categories are availability, latency, and correctness. You do not need to measure everything — pick the SLIs that most directly represent what users care about.
- Availability SLI: Proportion of non-5xx responses over all requests:
  sum(rate(http_requests_total{service="payment",code!~"5.."}[5m])) / sum(rate(http_requests_total{service="payment"}[5m]))
- Latency SLI: Proportion of requests completing within 200ms. (Requiring this proportion to be at least 99% is the same as requiring a p99 latency of 200ms or better.) Use histogram buckets, not averages. Averages hide tail latency; p99 is what users feel.
- Correctness SLI: Proportion of charge requests that result in exactly one ledger entry (idempotency check). Measured via a background reconciliation job that compares the payment DB to the downstream ledger service every 5 minutes.
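To make the first two ratios concrete, here is a minimal Python sketch that computes them from raw counts. The function names and sample numbers are illustrative assumptions, not PaymentService data. The latency function expects cumulative bucket counts keyed by upper bound, the same shape as Prometheus `le` histogram buckets.

```python
import math
from typing import Mapping


def availability_sli(requests_by_code: Mapping[str, int]) -> float:
    """Fraction of requests whose status code is not 5xx."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as fully good
    good = sum(n for code, n in requests_by_code.items() if not code.startswith("5"))
    return good / total


def latency_sli(cumulative_buckets: Mapping[float, int], threshold_s: float) -> float:
    """Fraction of requests that completed at or under threshold_s.

    cumulative_buckets maps an upper bound in seconds to the number of
    requests at or below that bound; math.inf holds the total. threshold_s
    must be one of the bucket bounds, which is why histogram boundaries
    should include every SLO threshold.
    """
    total = cumulative_buckets[math.inf]
    if total == 0:
        return 1.0
    return cumulative_buckets[threshold_s] / total


# Hypothetical counts for one evaluation window.
codes = {"200": 9_990, "404": 5, "500": 3, "503": 2}
buckets = {0.08: 9_600, 0.2: 9_920, math.inf: 10_000}

print(f"availability: {availability_sli(codes):.4%}")
print(f"<= 200ms:     {latency_sli(buckets, 0.2):.2%}")
print(f"<= 80ms:      {latency_sli(buckets, 0.08):.2%}")
```

Note that 4xx responses count as good here, matching the non-5xx definition above. Whether that is right for a given endpoint is a judgment call to record in the policy.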
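Health-check probes hitting /health 60 times per minute can dilute your real availability signal significantly. Exclude them with a label filter: code!~"5..", path!="/health" in the numerator, and path!="/health" in the denominator.
Step 3: Set SLO Targets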
SLO targets are not aspirational — they are calibrated to what users actually need and what your system can realistically deliver. The process: start from historical data, identify the reliability level below which users demonstrably complain (look at support tickets, churn data, SLA breach history), then set your internal SLO slightly above that threshold. Leave headroom between your SLO and any external SLA you have committed to.
For PaymentService, after analyzing 12 months of production data and correlating error rates with support volume:
- Availability SLO: 99.95% of requests succeed (non-5xx) over a rolling 28-day window. Error budget: 0.05%, equivalent to about 20.2 minutes of total downtime per window (0.0005 × 40,320 minutes).
- Latency SLO: 99% of requests complete within 200ms; 95% within 80ms. (Two latency targets capture both tail and bulk experience.)
- Correctness SLO: 99.999% of charge requests result in exactly one ledger entry. (Five nines here because data integrity violations are catastrophic and must be reported to regulators.)
- External SLA (customer-facing): 99.9% monthly availability. The internal SLO at 99.95% permits half the error rate the SLA does, leaving 0.05 percentage points of buffer before you breach the SLA.
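The budget arithmetic behind these targets is simple enough to check by hand or in a few lines of Python. This is a sketch only: the 30-day month used for the SLA and the 50 million request volume are assumptions chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def error_budget_requests(slo: float, requests_in_window: int) -> float:
    """Failed requests the SLO tolerates for a given request volume."""
    return (1.0 - slo) * requests_in_window


slo_budget = error_budget_minutes(0.9995, 28)  # internal SLO, rolling 28 days
sla_budget = error_budget_minutes(0.999, 30)   # external SLA, 30-day month (assumed)

print(f"SLO budget: {slo_budget:.2f} min per 28 days")
print(f"SLA budget: {sla_budget:.2f} min per 30 days")
# Hypothetical volume: 50 million requests in the window.
print(f"SLO budget: {error_budget_requests(0.9995, 50_000_000):,.0f} failed requests")
```

For a 28-day window, 0.05% works out to 20.16 minutes; the same percentage over a 30-day month would be 21.6 minutes. Always state the window next to the number.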
Step 4: Write the Burn-Rate Alert Rules
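SLO-based alerts fire when you are consuming your error budget faster than is sustainable. The multi-window, multi-burn-rate pattern (from the "Alerting on SLOs" chapter of Google's Site Reliability Workbook) gives you fast detection of catastrophic burns and slower detection of chronic burns, with a low false-positive rate. The Prometheus alert rules for PaymentService availability apply this pattern to the 99.95% SLO: each rule pairs a burn-rate threshold with a long and a short lookback window, and fires only when both windows exceed the threshold.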
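The logic that such rules encode can be sketched in plain Python. This is an illustration under stated assumptions, not the PaymentService rule file. The two tiers (2% of budget in 1 hour, 5% in 6 hours, each with a short window one twelfth of the long one) follow the SRE Workbook's commonly cited defaults, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, alert_window_h: float,
                        slo_window_days: int) -> float:
    """Burn rate that spends budget_fraction of the budget in alert_window_h."""
    return budget_fraction * (slo_window_days * 24) / alert_window_h


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window_h: float
    short_window_h: float
    threshold: float

    def fires(self, long_error_rate: float, short_error_rate: float, slo: float) -> bool:
        # Both windows must exceed the threshold: the long window shows the
        # burn is significant, the short window shows it is still happening.
        return (burn_rate(long_error_rate, slo) > self.threshold
                and burn_rate(short_error_rate, slo) > self.threshold)


SLO = 0.9995
WINDOW_DAYS = 28

# Assumed tiers: 2% of budget in 1h (5m short window), 5% in 6h (30m short window).
fast = BurnAlert("fast-burn", 1, 1 / 12, burn_rate_threshold(0.02, 1, WINDOW_DAYS))
slow = BurnAlert("slow-burn", 6, 0.5, burn_rate_threshold(0.05, 6, WINDOW_DAYS))

# Hypothetical incident: 1% of requests failing in both windows.
print(fast.name, round(fast.threshold, 2), fast.fires(0.01, 0.01, SLO))
print(slow.name, round(slow.threshold, 2), slow.fires(0.01, 0.01, SLO))
```

With a 28-day window these tiers give thresholds of 13.44 and 5.6. The familiar 14.4 and 6 come from the same formula applied to a 30-day window, so copying them unchanged into a 28-day policy makes the alerts slightly less sensitive than intended.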
Step 5: Write the Error Budget Policy
The alert rules tell you when something is wrong. The error budget policy tells your organization what to do about it — and who has authority to make decisions. This is the governance half of your SLO document. Without it, individual engineers make different calls in incidents, product teams negotiate exceptions ad hoc, and the SLO has no teeth.
A production-grade error budget policy has four sections: budget thresholds and their triggers, the escalation and decision authority matrix, the process for requesting exceptions, and the review cadence. The PaymentService policy follows this four-part structure.
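A threshold-and-authority matrix is easier to enforce when it is also expressed as data. The sketch below shows one possible shape. Every cutoff, action, and role in it is a hypothetical placeholder rather than the PaymentService policy, and the four status names are assumptions as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyLevel:
    status: str
    min_remaining: float  # level applies while budget remaining exceeds this
    deploys: str          # what the policy allows at this level
    authority: str        # who may approve exceptions (hypothetical roles)


# Hypothetical cutoffs, ordered from healthiest to worst. Replace them with
# the values your leadership actually signed off on.
POLICY = (
    PolicyLevel("green", 0.50, "normal release cadence", "service team"),
    PolicyLevel("yellow", 0.25, "releases need SRE review", "SRE on-call"),
    PolicyLevel("red", 0.0, "feature freeze, reliability fixes only", "engineering director"),
    PolicyLevel("black", float("-inf"), "full freeze until budget recovers", "VP engineering"),
)


def policy_level(budget_remaining: float) -> PolicyLevel:
    """Return the first level whose lower bound the remaining budget exceeds."""
    for level in POLICY:
        if budget_remaining > level.min_remaining:
            return level
    return POLICY[-1]


for remaining in (0.8, 0.3, 0.1, 0.0, -0.2):
    level = policy_level(remaining)
    print(f"{remaining:+.0%} remaining -> {level.status}: {level.deploys}")
```

Keeping the matrix in one structure means the dashboard, the deploy gate, and the written policy can all read the same thresholds instead of drifting apart.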
Step 6: Wire It Into Your Dashboard and Runbook
The policy is only useful if the on-call engineer can see budget consumption at a glance and knows exactly what to do when a threshold is crossed. Two final artifacts complete the package:
Grafana SLO dashboard panels — one panel per SLI showing: current 28-day SLO attainment vs. target, budget remaining (as percentage and as minutes), burn rate over the last 1h/6h/24h, and a color-coded status indicator (green/yellow/red/black matching the four policy thresholds). Use recording rules to pre-compute the 28-day range queries — raw 28-day PromQL range queries are expensive on large clusters.
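The two "budget remaining" numbers on such a panel come from the same pair of 28-day counters. A minimal sketch of the calculation follows, assuming the good and total request counts have already been pre-computed (for example by a recording rule). The function names and sample totals are hypothetical.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total == 0:
        return 1.0  # assumption: no traffic means no budget spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def remaining_minutes(remaining_fraction: float, slo: float, window_days: int = 28) -> float:
    """Express the remaining budget as minutes of full outage."""
    return remaining_fraction * (1.0 - slo) * window_days * 24 * 60


# Hypothetical 28-day totals.
total, good = 50_000_000, 49_985_000
remaining = budget_remaining(good, total, slo=0.9995)

print(f"attainment:       {good / total:.4%}")
print(f"budget remaining: {remaining:.1%}")
print(f"as minutes:       {remaining_minutes(remaining, 0.9995):.2f}")
```

Letting the value go negative rather than clamping it at zero makes an overspent budget visible on the panel, which matters when deciding how long a freeze should last.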
Runbook links in every alert: The runbook annotation on each alert should point to a page that answers, in order: what is this alert, what is the user impact, what are the first three diagnostic commands, what are the common causes and their fixes, and when to escalate. The runbook is the policy made operational — it translates the governance decision ("freeze deploys") into the specific Slack message to send, the incident channel to open, and the people to page.
GET /slo/paymentsvc/budget returns the budget percentage; the pipeline fails (or creates a manual approval gate) when below threshold.
The Complete Artifact Checklist
When you submit an SLO and error budget policy for a service, it should contain:
- Service context: Who uses it, what failure costs, why these SLIs were chosen over alternatives.
- SLI definitions: Exact PromQL or SQL queries for each SLI. No ambiguity — the same query must reproduce the same number every time.
- SLO targets: Percentage, window, and the data-driven rationale (historical attainment, user complaint threshold).
- Error budget table: Budget in minutes and in request count at median traffic, for each SLO.
- Burn-rate alert rules: YAML for fast-burn and slow-burn alerts, with severity labels and runbook links.
- Budget policy matrix: The four thresholds, the action each triggers, and who has authority.
- Exception process: How teams request a deployment exception during a freeze.
- Review schedule: Weekly, monthly, quarterly cadence with named owners.
- Dashboard link: URL to the live Grafana SLO panel so any stakeholder can verify the current state.
This is the deliverable. Nine sections, one to three pages. It is the difference between an SRE practice and an SRE culture — because once leadership has signed off on the policy, every deploy freeze has organizational authority, every exception is traceable, and every SLO review produces actionable improvements rather than arguments.