Site Reliability Engineering (SRE)

Project: An SLO & Error Budget Policy

Everything in this tutorial has been building to this lesson. You know how to write SLIs and SLOs, compute error budgets, build burn-rate alerts, eliminate toil, and run production readiness reviews. Now you will put all of that into a single, coherent artifact — an SLO and error budget policy document — for a realistic production service. This is the deliverable that an SRE team presents to engineering leadership, product management, and on-call engineers. It is the contract that governs how fast you ship, when you freeze deployments, and how you recover reliability after incidents.

We will work through a concrete example: PaymentService, a critical API that processes financial transactions. This is deliberately the hardest class of service — low latency requirements, strong correctness guarantees, regulatory audit trails, and near-zero tolerance for data loss. If you can write an SLO policy for this, you can write one for anything.

What a policy document actually is: An SLO policy is not a monitoring config file. It is a governance document — typically one to three pages — that defines what you promise, how you measure it, what happens when you violate it, and who has the authority to make exceptions. It turns reliability from a feeling into an agreement. Every production service at a big-tech company should have one; most do not until an SRE team forces the conversation.

Step 1: Define the Service and Its Users

Start with context. Every SLO policy opens by answering: who uses this service, what does it do for them, and what does failure actually cost? For PaymentService:

Service: PaymentService — processes charge, refund, and authorization requests via REST and gRPC APIs.
Users: External customers (checkout flow), internal services (fraud engine, order service), and third-party merchants via API key.
Failure impact: A failed payment = lost revenue, possible duplicate charge risk, customer churn. A slow checkout = cart abandonment. Every 100ms of added latency at the 99th percentile correlates with measurable revenue drop (documented by Amazon, Google, Shopify).
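Measurement window: Rolling 28 days. (A 28-day window is exactly four weeks, so every window contains the same number of each weekday. Calendar months vary in length and in how many weekends they contain, which makes month-to-month comparisons noisier.)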

Step 2: Select SLIs

For an API like PaymentService, the canonical SLI categories are availability, latency, and correctness. You do not need to measure everything — pick the SLIs that most directly represent what users care about.

Availability SLI: sum(rate(http_requests_total{service="paymentsvc",code!~"5.."}[5m])) / sum(rate(http_requests_total{service="paymentsvc"}[5m])), the proportion of non-5xx responses over all requests.
Latency SLI: Proportion of requests completing within 200ms, computed from histogram buckets rather than averages. Averages hide tail latency, and the tail is what users feel. (The percentile belongs in the SLO target, for example "99% of requests within 200ms", not in the SLI definition.)
Correctness SLI: Proportion of charge requests that result in exactly one ledger entry (idempotency check). Measured via a background reconciliation job that compares the payment DB to the downstream ledger service every 5 minutes.
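The latency and correctness SLIs above can be sketched in a few lines of Python. This is a minimal illustration, not PaymentService's actual reconciliation job: the bucket layout, the sample counts, and the function names `latency_sli` and `correctness_sli` are assumptions made for the example.

```python
from collections import Counter

INF = float("inf")

def latency_sli(cumulative_buckets, threshold_s):
    """Fraction of requests that completed within threshold_s seconds.

    cumulative_buckets maps each bucket's upper bound (seconds) to the
    cumulative request count, mirroring a Prometheus histogram. The
    threshold must be one of the bucket bounds.
    """
    total = cumulative_buckets[INF]
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return cumulative_buckets[threshold_s] / total

def correctness_sli(charge_ids, ledger_charge_ids):
    """Fraction of charges that have exactly one ledger entry."""
    charge_ids = list(charge_ids)
    if not charge_ids:
        return 1.0
    entries = Counter(ledger_charge_ids)
    good = sum(1 for cid in charge_ids if entries[cid] == 1)
    return good / len(charge_ids)

# Hypothetical sample data for illustration only.
buckets = {0.08: 9_600, 0.2: 9_930, 0.5: 9_990, INF: 10_000}
print(latency_sli(buckets, 0.2))   # 0.993
print(latency_sli(buckets, 0.08))  # 0.96

# c2 was written twice (duplicate charge risk) and c3 is missing entirely.
print(correctness_sli(["c1", "c2", "c3", "c4"], ["c1", "c2", "c2", "c4"]))  # 0.5
```

Note that both a duplicated and a missing ledger entry count against the correctness SLI, since either one breaks the "exactly one" guarantee.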

Pro practice: exclude known non-user traffic. Scope your SLI numerators and denominators carefully. Health check probes from load balancers, synthetic canary traffic, and internal cronjob requests should be excluded from SLI calculations. A load balancer hitting /health 60 times per minute adds a steady stream of successful requests that can dilute your real availability signal significantly, especially on low-traffic services. Add the same label filter, for example path!="/health", to both the numerator and the denominator of the SLI query.
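To see how much probe traffic can distort the number, here is a small sketch that computes availability with and without the exclusion. The `source` label, the excluded values, and the request counts are hypothetical; a real deployment would apply the equivalent filter in the PromQL selector.

```python
EXCLUDED_PATHS = {"/health"}                      # probe endpoints to ignore
EXCLUDED_SOURCES = {"synthetic-canary", "cron"}   # hypothetical source labels

def availability_sli(samples, exclude_non_user=True):
    """samples: iterable of dicts with 'path', 'source', 'code', 'count'."""
    good = total = 0
    for s in samples:
        non_user = s["path"] in EXCLUDED_PATHS or s["source"] in EXCLUDED_SOURCES
        if exclude_non_user and non_user:
            continue
        total += s["count"]
        if not 500 <= s["code"] <= 599:
            good += s["count"]
    return good / total if total else 1.0

samples = [
    {"path": "/v1/charge", "source": "user", "code": 200, "count": 9_900},
    {"path": "/v1/charge", "source": "user", "code": 503, "count": 100},
    {"path": "/health", "source": "lb", "code": 200, "count": 90_000},
]

print(availability_sli(samples, exclude_non_user=False))  # 0.999
print(availability_sli(samples))                          # 0.99
```

With the probes included, a 1% user-facing error rate looks like 0.1%, which is the dilution effect described above.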

Step 3: Set SLO Targets

SLO targets are not aspirational — they are calibrated to what users actually need and what your system can realistically deliver. The process: start from historical data, identify the reliability level below which users demonstrably complain (look at support tickets, churn data, SLA breach history), then set your internal SLO slightly above that threshold. Leave headroom between your SLO and any external SLA you have committed to.

For PaymentService, after analyzing 12 months of production data and correlating error rates with support volume:

Availability SLO: 99.95% of requests succeed (non-5xx) over a rolling 28-day window. Error budget: 0.05% of requests, equivalent to about 20.16 minutes of total downtime (0.0005 × 40,320 minutes in 28 days).
Latency SLO: 99% of requests complete within 200ms; 95% within 80ms. (Two latency targets capture both tail and bulk experience.)
Correctness SLO: 99.999% of charge requests result in exactly one ledger entry. (Five nines here because data integrity violations are catastrophic and may have to be reported to regulators.)
External SLA (customer-facing): 99.9% monthly availability. Your internal SLO at 99.95% leaves a buffer of 0.05 percentage points before you breach the SLA.
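The error budget arithmetic behind these targets is simple enough to check by hand or in a few lines of Python. The sketch below assumes a steady 500 requests per second, which is an illustrative figure and not a measured PaymentService number; the function name `error_budget` is likewise hypothetical.

```python
def error_budget(slo_percent, window_days, requests_per_second):
    """Translate an SLO into downtime-equivalent minutes and failed requests."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    window_minutes = window_days * 24 * 60
    window_requests = requests_per_second * window_days * 24 * 60 * 60
    return {
        "downtime_minutes": round(budget_fraction * window_minutes, 2),
        "failed_requests": round(budget_fraction * window_requests),
    }

# Internal SLO: 99.95% over a rolling 28-day window.
print(error_budget(99.95, 28, 500))
# {'downtime_minutes': 20.16, 'failed_requests': 604800}

# The same SLO over a 30-day window, for comparison.
print(error_budget(99.95, 30, 500))
# {'downtime_minutes': 21.6, 'failed_requests': 648000}

# External SLA: 99.9% over a 30-day month.
print(error_budget(99.9, 30, 500)["downtime_minutes"])  # 43.2
```

The budget scales with the window length, so a figure quoted in minutes is only meaningful alongside the window it was computed for.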

[Figure: PaymentService SLO policy at a glance, showing the three SLIs, their targets, the resulting error budgets, and the policy action triggered at each budget threshold.]

Step 4: Write the Burn-Rate Alert Rules

SLO-based alerts fire when you are consuming your error budget faster than is sustainable. The multi-window, multi-burn-rate pattern (from the "Alerting on SLOs" chapter of Google's Site Reliability Workbook) gives you fast detection of catastrophic burns and slower detection of chronic burns, with a low false-positive rate. Here are the Prometheus alert rules for PaymentService availability:

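# file: alerts/paymentsvc-slo.yaml
# PaymentService SLO burn-rate alerts (Prometheus + Alertmanager)
# SLO: 99.95% availability over 28 days (672 hours)
# Error budget: 0.05% of requests (0.0005 as a ratio)
# Burn rate = observed error ratio / 0.0005. Time to exhaust = 672h / burn rate.
#   14.4x burn = ~2.1% of budget per hour, exhausted in ~47 hours   (page immediately)
#    6.0x burn = ~5.4% of budget per 6 hours, exhausted in ~4.7 days (warning: ticket + Slack)
#    3.0x burn = exhausted in ~9.3 days   (low-priority ticket; no rule defined in this file)
#    1.0x burn = budget lasts exactly the 28-day window   (no alert)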

groups:
  - name: paymentsvc.slo.alerts
    rules:

      # --- Fast burn: page now ---
      - alert: PaymentSvcAvailabilityBurnCritical
        expr: |
          (
            sum(rate(http_requests_total{service="paymentsvc",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="paymentsvc"}[1h]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{service="paymentsvc",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="paymentsvc"}[5m]))
          ) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          team: sre-payments
          slo: availability
        annotations:
          summary: "PaymentSvc: critical availability burn rate"
          description: "Burning error budget at more than 14.4x. At this rate the 28-day budget is gone in under 2 days. Error rate (1h): {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/paymentsvc-availability"

      # --- Slow burn: ticket + slack ---
      - alert: PaymentSvcAvailabilityBurnWarning
        expr: |
          (
            sum(rate(http_requests_total{service="paymentsvc",code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{service="paymentsvc"}[6h]))
          ) > (6.0 * 0.0005)
          and
          (
            sum(rate(http_requests_total{service="paymentsvc",code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{service="paymentsvc"}[30m]))
          ) > (6.0 * 0.0005)
        for: 15m
        labels:
          severity: warning
          team: sre-payments
          slo: availability
        annotations:
          summary: "PaymentSvc: elevated availability burn rate"
          description: "Burning at more than 6x. At this rate the 28-day budget is gone in under 5 days if sustained."

      # --- Budget exhaustion tracker (severity is critical, so route it away from the pager if it is meant for dashboards only) ---
      - alert: PaymentSvcBudgetExhausted
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{service="paymentsvc",code!~"5.."}[28d]))
              /
              sum(rate(http_requests_total{service="paymentsvc"}[28d]))
            )
          ) >= 0.0005
        labels:
          severity: critical
          team: sre-payments
        annotations:
          summary: "PaymentSvc: 28-day SLO breached — error budget exhausted"
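The numbers behind these rules follow from three small formulas: the alert threshold is the burn rate times the error budget, the time to exhaustion is the window length divided by the burn rate, and the budget consumed by an alert window is the burn rate times that window's share of the SLO window. The sketch below works them through for the 28-day, 99.95% SLO used here. The function names are illustrative, and `should_fire` only mirrors the `and` of the long and short windows, not Prometheus's `for:` hold period.

```python
SLO = 0.9995
WINDOW_HOURS = 28 * 24  # 672 hours in the 28-day SLO window

def threshold(burn_rate, slo=SLO):
    """Error ratio above which a burn-rate condition is true."""
    return burn_rate * (1.0 - slo)

def hours_to_exhaustion(burn_rate, window_hours=WINDOW_HOURS):
    """How long the whole budget lasts if this burn rate is sustained."""
    return window_hours / burn_rate

def budget_consumed(burn_rate, alert_window_hours, window_hours=WINDOW_HOURS):
    """Fraction of the budget spent during one alert window at this burn rate."""
    return burn_rate * alert_window_hours / window_hours

def should_fire(long_ratio, short_ratio, burn_rate, slo=SLO):
    """Multi-window condition: both windows must exceed the same threshold."""
    limit = threshold(burn_rate, slo)
    return long_ratio > limit and short_ratio > limit

for rate, window_h in [(14.4, 1), (6.0, 6)]:
    print(
        rate,
        round(threshold(rate), 4),
        round(hours_to_exhaustion(rate), 1),
        round(100 * budget_consumed(rate, window_h), 1),
    )
# 14.4 0.0072 46.7 2.1
# 6.0 0.003 112.0 5.4

# Long window still elevated, short window already recovered: no page.
print(should_fire(0.01, 0.0001, 14.4))  # False
print(should_fire(0.01, 0.02, 14.4))    # True
```

The short window is what lets the alert resolve quickly once the incident is over, instead of staying red until the long window drains.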

Step 5: Write the Error Budget Policy

The alert rules tell you when something is wrong. The error budget policy tells your organization what to do about it — and who has authority to make decisions. This is the governance half of your SLO document. Without it, individual engineers make different calls in incidents, product teams negotiate exceptions ad hoc, and the SLO has no teeth.

A production-grade error budget policy has four sections: budget thresholds and their triggers, the escalation and decision authority matrix, the process for requesting exceptions, and the review cadence. Here is the PaymentService policy:

# PaymentService — Error Budget Policy
# Owner: SRE Payments Team | Effective: 2025-Q3 | Review: Quarterly
# SLO: 99.95% availability, rolling 28 days
# Total budget per 28d: ~20.16 minutes of equivalent downtime

## Budget Thresholds & Actions

| Budget Remaining | Condition    | Action                                               | Authority      |
|------------------|--------------|------------------------------------------------------|----------------|
| > 50%            | Healthy      | Normal deployment cadence. Ship freely.              | Dev teams      |
| 25%–50%          | Caution      | Review pending deploys for blast radius. Prefer      | SRE + Dev lead |
|                  |              | canary / progressive rollout for risky changes.      |                |
| 10%–25%          | Warning      | Freeze non-critical feature deploys. Only            | SRE team       |
|                  |              | reliability fixes and Sev-1 hotfixes allowed.        |                |
| 0%–10%           | Critical     | Full deployment freeze. All changes require SRE      | SRE lead       |
|                  |              | sign-off. Incident declared if not already open.     |                |
| Breached (0%)    | SLO Violated | Deployment freeze remains. Postmortem required.      | Engineering VP |
|                  |              | SLA credit process initiated. Board-level if         |                |
|                  |              | correctness SLO also breached.                       |                |

## Exception Process
Any team wishing to deploy during a freeze must:
  1. File an exception request in the SRE on-call queue.
  2. Provide: change description, estimated blast radius, rollback plan.
  3. Obtain sign-off from on-call SRE lead.
  4. Deploy with SRE present in the deploy channel.
  5. Monitor burn rate for 30 minutes post-deploy before stand-down.

## Budget Reset & Recovery
Budget does not reset early. A breached SLO means the 28-day window
must roll past the incident before budget is restored. Recovery actions
(improved circuit breakers, reduced blast radius of deploys, improved
canary analysis) are tracked in the postmortem action items list and
reviewed in the next SLO quarterly review.

## Review Cadence
- Weekly: on-call lead reviews burn rate trend (5-minute standing meeting)
- Monthly: SRE team reviews 28-day SLO attainment vs. target
- Quarterly: Full SLO + budget policy review with product leadership
  - Adjust SLO targets if user expectations or system capability changed
  - Promote/demote services between SLO tiers based on criticality
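The threshold table in the policy can be encoded as a small lookup so that dashboards, bots, and pipelines all derive the same state from the same number. The sketch below is one possible encoding. The table's ranges share their boundary values (25% appears in two rows), so this version assumes that a value exactly on a boundary falls into the stricter state; that tie-break is an assumption, not something the policy text specifies.

```python
from typing import NamedTuple

class PolicyState(NamedTuple):
    condition: str
    deploys: str
    authority: str

# (exclusive lower bound in % of budget remaining, state), checked top down.
_STATES = [
    (50.0, PolicyState("Healthy", "normal cadence", "Dev teams")),
    (25.0, PolicyState("Caution", "review blast radius, prefer canary", "SRE + Dev lead")),
    (10.0, PolicyState("Warning", "freeze non-critical feature deploys", "SRE team")),
    (0.0, PolicyState("Critical", "full freeze, SRE sign-off required", "SRE lead")),
]
_BREACHED = PolicyState("SLO Violated", "freeze remains, postmortem required", "Engineering VP")

def policy_state(budget_remaining_pct):
    """Map remaining error budget (percent) to the policy row that applies."""
    for lower_bound, state in _STATES:
        if budget_remaining_pct > lower_bound:
            return state
    return _BREACHED

print(policy_state(72.0).condition)   # Healthy
print(policy_state(25.0).condition)   # Warning (boundary goes to the stricter state)
print(policy_state(3.5).authority)    # SRE lead
print(policy_state(-12.0).condition)  # SLO Violated
```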

Production pitfall — stale SLOs: SLO policies written once and never reviewed are dangerous. A service that moved from 100 req/s to 50,000 req/s has a completely different failure profile. A team that shipped a major refactor changed what "correct" means. Review SLOs quarterly as a hard rule. If you have not touched your SLO document in six months, assume it is wrong.

Step 6: Wire It Into Your Dashboard and Runbook

The policy is only useful if the on-call engineer can see budget consumption at a glance and knows exactly what to do when a threshold is crossed. Two final artifacts complete the package:

Grafana SLO dashboard panels: one panel per SLI showing current 28-day SLO attainment vs. target, budget remaining (as a percentage and as minutes), burn rate over the last 1h/6h/24h, and a color-coded status indicator with one color per state in the policy's threshold table (for example green/yellow/orange/red/black). Use recording rules to pre-compute the 28-day range queries, because raw 28-day PromQL range queries are expensive on large clusters.
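The "budget remaining" panel is derived from two counters over the 28-day window: good requests and total requests. A minimal sketch of that calculation follows, assuming the counts have already been fetched (in practice from pre-computed recording rules). The function name and the sample counts are hypothetical.

```python
def budget_remaining(good, total, slo=0.9995, window_days=28):
    """Return (percent of budget left, downtime-equivalent minutes left).

    Both values go negative once the SLO has been breached.
    """
    budget_minutes = (1.0 - slo) * window_days * 24 * 60
    if total == 0:
        return 100.0, round(budget_minutes, 2)  # no traffic: nothing spent
    allowed_bad = (1.0 - slo) * total
    bad = total - good
    remaining = 1.0 - bad / allowed_bad
    return round(100 * remaining, 1), round(remaining * budget_minutes, 2)

# 300,000 failures out of one billion requests, against 500,000 allowed.
print(budget_remaining(999_700_000, 1_000_000_000))  # (40.0, 8.06)

# 600,000 failures: the budget is overspent by 20%.
print(budget_remaining(999_400_000, 1_000_000_000))  # (-20.0, -4.03)
```

Showing the negative value rather than clamping at zero makes it visible how far past the SLO the service has drifted.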

Runbook links in every alert: The runbook annotation on each alert should point to a page that answers, in order: what is this alert, what is the user impact, what are the first three diagnostic commands, what are the common causes and their fixes, and when to escalate. The runbook is the policy made operational — it translates the governance decision ("freeze deploys") into the specific Slack message to send, the incident channel to open, and the people to page.

Pro practice: make the policy self-enforcing. A common pattern at organizations with mature SRE practices is to wire SLO budget state into the CI/CD pipeline. The deploy system queries the current error budget before allowing a production push. If budget is below 10%, the pipeline requires human approval from the SRE on-call. This removes the need for anyone to remember to check: the policy runs automatically on every deploy. Implement this with a small API that your CI system calls: GET /slo/paymentsvc/budget returns the budget percentage; the pipeline fails (or creates a manual approval gate) when the budget is below the threshold.
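The decision logic of such a gate can be very small. The sketch below covers only the step after the CI job has fetched the response body from the budget endpoint. The JSON shape, the `budget_remaining_pct` field name, and the decision strings are assumptions for illustration, since the text only says the endpoint returns a percentage. It fails closed: a malformed or missing value requires manual approval rather than letting the deploy through.

```python
import json

APPROVAL_THRESHOLD_PCT = 10.0  # below this, a human must approve the deploy

def gate_decision(response_body, threshold_pct=APPROVAL_THRESHOLD_PCT):
    """Return 'allow' or 'manual_approval' for a production push."""
    try:
        payload = json.loads(response_body)
        remaining = float(payload["budget_remaining_pct"])  # hypothetical field
    except (ValueError, KeyError, TypeError):
        # Fail closed: if budget state is unknown, do not deploy unattended.
        return "manual_approval"
    if remaining < threshold_pct:
        return "manual_approval"
    return "allow"

print(gate_decision('{"budget_remaining_pct": 42.5}'))  # allow
print(gate_decision('{"budget_remaining_pct": 7.0}'))   # manual_approval
print(gate_decision("not json"))                        # manual_approval
```

A CI job would map `manual_approval` to whatever approval mechanism the pipeline offers, or simply exit non-zero so the push stops.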

The Complete Artifact Checklist

When you submit an SLO and error budget policy for a service, it should contain:

Service context: Who uses it, what failure costs, why these SLIs were chosen over alternatives.
SLI definitions: Exact PromQL or SQL queries for each SLI. No ambiguity — the same query must reproduce the same number every time.
SLO targets: Percentage, window, and the data-driven rationale (historical attainment, user complaint threshold).
Error budget table: Budget in minutes and in request count at median traffic, for each SLO.
Burn-rate alert rules: YAML for fast-burn and slow-burn alerts, with severity labels and runbook links.
Budget policy matrix: The four thresholds, the action each triggers, and who has authority.
Exception process: How teams request a deployment exception during a freeze.
Review schedule: Weekly, monthly, quarterly cadence with named owners.
Dashboard link: URL to the live Grafana SLO panel so any stakeholder can verify the current state.

This is the deliverable. Nine sections, one to three pages. It is the difference between an SRE practice and an SRE culture — because once leadership has signed off on the policy, every deploy freeze has organizational authority, every exception is traceable, and every SLO review produces actionable improvements rather than arguments.

Congratulations — you have completed the SRE tutorial. You can now define SLIs and SLOs from first principles, compute error budgets, write multi-window burn-rate alerts, run production readiness reviews, and produce a complete governance policy that an engineering organization can actually use. These are the skills that distinguish a senior SRE from an on-call firefighter. Bring them to every service you touch.