Observability Foundations

SLIs, SLOs & SLAs

When a production system goes down at 3 AM, the first question every engineer asks is: "Are we violating our SLA?" The second question — which separates senior engineers from the rest — is: "Was our error budget already spent before this incident?" Understanding the full vocabulary of reliability, from the raw measurement all the way to the legal contract, is foundational to operating services at Google, Amazon, and Stripe scale. These are not abstract concepts; they are the operational contracts that determine engineering priorities, on-call escalation, and whether a company issues a refund.

Service Level Indicators (SLIs)

A Service Level Indicator is a quantitative measure of some aspect of the level of service being provided. An SLI answers: "How is the service performing right now, expressed as a ratio?" The canonical form of an SLI is a ratio of good events to valid events over a rolling window, where valid events are all the events that count toward the SLI:

SLI = (good events) / (valid events)

Good SLIs are carefully chosen — not every metric is an SLI. The four categories that matter at production scale are:

Availability: The fraction of requests served successfully. For HTTP APIs: (requests with status < 500) / (total requests). At Google, a request that returns an HTTP 5xx status or times out counts as bad.
Latency: The fraction of requests served faster than a threshold. Example: (requests completed in < 200ms) / (total requests). Note that latency SLIs measure a ratio, not a raw p99 value — this makes them composable with the SLO framework.
Throughput: The fraction of time the service is handling sufficient request volume. Used for batch pipelines: (minutes processing >= target_rate) / (total minutes).
Quality / Correctness: The fraction of responses that are correct. Used for search ranking, recommendation engines, data pipelines. Harder to measure automatically — often requires golden datasets or sampling with human validation.

SLIs are ratios, not raw numbers. "p99 latency = 450ms" is a metric. "92% of requests completed in < 200ms" is an SLI. The ratio form is essential because it makes the SLI directly comparable to the SLO target, and it makes the error budget calculation trivial. Never set an SLO on a raw percentile — set it on the ratio SLI that measures what fraction of users had a good experience.
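
To make the ratio form concrete, here is a minimal sketch of a latency SLI computed from raw request durations. The function name latency_sli and the five sample durations are hypothetical; a real service would derive the same ratio from its metrics pipeline rather than from a list of numbers.

```shell
# latency_sli THRESHOLD_MS
# Reads one request duration (in ms) per line on stdin and prints the
# fraction of requests that finished strictly faster than the threshold.
latency_sli() {
  awk -v threshold="$1" '
    { total++; if ($1 < threshold) good++ }
    END {
      if (total == 0) { print "NaN"; exit }   # no valid events: SLI undefined
      printf "%.3f\n", good / total
    }'
}

# Five hypothetical requests; three of them finish in under 200 ms.
printf '120\n180\n450\n90\n210\n' | latency_sli 200   # prints 0.600
```

The result (0.600) can be compared directly against an SLO target such as 0.999, which a raw p99 value cannot.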

Service Level Objectives (SLOs)

A Service Level Objective is the target value for an SLI, over a measurement window. An SLO is the internal agreement about how reliable a service must be. The canonical form is:

SLO = an SLI target plus a measurement window. For example: "99.9% of requests will return in < 200ms, measured over a rolling 28-day window."

The measurement window matters enormously. A 28-day rolling window is preferred over calendar month at companies like Google because it has consistent length and rolls forward continuously (no "reset at midnight on the 1st" behaviour that causes perverse incentives to burn budget early in a month). The standard windows are: 28-day rolling, 7-day rolling, or trailing 90 days for quarterly reviews.

An SLO of 99.9% over 28 days allows 28 × 24 × 60 × (1 - 0.999) = 40.32 minutes of full unavailability, or equivalently 0.1% of valid requests in the window, before the service is considered out of compliance. This allowance is the error budget, and it drives engineering decisions:

If error budget is healthy (>50% remaining): ship features aggressively, run chaos experiments, do risky schema migrations.
If error budget is running low (around 25% remaining): slow down risky deployments, freeze non-critical changes, prioritize reliability work.
If error budget is exhausted: freeze releases, halt experiments, all hands on reliability until the window rolls forward.
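
The policy tiers above can be sketched as a small lookup. This is an illustration under stated assumptions, not a standard: the function name release_policy and the tier labels are hypothetical, "at 25%" is read as "25% or less remaining", and the range between 25% and 50%, which the tiers do not cover, is given an assumed "caution" label.

```shell
# release_policy BUDGET_REMAINING
# BUDGET_REMAINING is the fraction of the error budget still unspent
# (1.0 = untouched, 0 = exhausted, negative = SLO already breached).
release_policy() {
  awk -v remaining="$1" 'BEGIN {
    if      (remaining <= 0)    print "freeze"     # budget exhausted
    else if (remaining <= 0.25) print "slow-down"  # assumed cutoff: 25% or less left
    else if (remaining >  0.50) print "ship"       # healthy budget
    else                        print "caution"    # assumed label for the 25-50% gap
  }'
}

release_policy 0.80   # prints ship
release_policy 0.25   # prints slow-down
release_policy 0      # prints freeze
```

Encoding the policy as data that both product and engineering have agreed on in advance is what keeps the release decision out of case-by-case debate.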

Set SLOs tighter than your SLA, looser than your current performance. If your service actually runs at 99.97% availability but your SLA promises 99.5%, set your SLO at 99.9%. This gives you a warning (SLO breach) before you hit a business consequence (SLA breach), and leaves slack between current performance and the objective so that normal variance does not constantly trigger alerts.

The Error Budget

The error budget is the quantity of unreliability you are allowed to spend before breaching the SLO. It is the most powerful tool the SRE discipline introduced to software engineering. Error budget converts the abstract concept of "reliability" into a finite, spendable resource. Teams that understand this stop having the endless "reliability vs. features" argument, because the budget answers it objectively.

# Error budget calculations for common SLO targets
# Window: 28 days = 28 * 24 * 60 * 60 = 2,419,200 seconds = 40,320 minutes

echo "SLO 99.9%  -> error budget = $(echo 'scale=2; 40320 * 0.001' | bc) minutes = ~40 min/28 days"
echo "SLO 99.95% -> error budget = $(echo 'scale=2; 40320 * 0.0005' | bc) minutes = ~20 min/28 days"
echo "SLO 99.99% -> error budget = $(echo 'scale=2; 40320 * 0.0001' | bc) minutes = ~4 min/28 days"
echo "SLO 99.999% -> error budget = $(echo 'scale=2; 40320 * 0.00001' | bc) minutes = ~24 sec/28 days"

# Burn rate: how fast you are consuming the budget relative to the allowed rate
# If SLO = 99.9% and current 1-hour error rate = 5%, burn rate = 5% / 0.1% = 50x
# At 50x burn rate, you will exhaust the 28-day budget in 28 days / 50 = 13.4 hours
echo "Burn rate 50x exhausts 28-day budget in $(echo 'scale=1; 672 / 50' | bc) hours"

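Burn rate alerting is the production-grade approach to SLO-based alerting. Instead of alerting on individual metric thresholds, you alert when you are consuming error budget faster than the SLO allows. Google's SRE Workbook describes multiwindow, multi-burn-rate alerts. The two tiers used in this lesson are a fast alert (1-hour window, burn rate above 14x, which catches rapid outages) and a slow alert (6-hour window, burn rate above 6x, which catches slow burns you would otherwise miss). The Workbook's own example uses 14.4x for the 1-hour tier on a 30-day window, and it pairs each long window with a shorter one so that alerts stop firing soon after the problem is fixed.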
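
One way to reason about these thresholds is to ask how much of the budget is already gone by the time each alert can fire. The sketch below is plain arithmetic under the lesson's 28-day window; the function name budget_consumed_pct is hypothetical.

```shell
# budget_consumed_pct BURN_RATE HOURS [WINDOW_DAYS]
# Percentage of the window's error budget used up by burning at BURN_RATE
# for HOURS. A burn rate of 1 spends the budget exactly over the full window.
budget_consumed_pct() {
  awk -v rate="$1" -v hours="$2" -v days="${3:-28}" \
    'BEGIN { printf "%.2f\n", 100 * rate * hours / (days * 24) }'
}

budget_consumed_pct 14 1   # fast-burn threshold sustained for 1 hour: 2.08
budget_consumed_pct 6 6    # slow-burn threshold sustained for 6 hours: 5.36
```

Under these assumptions the fast alert pages after roughly 2% of the budget is spent and the slow alert after roughly 5%, which leaves most of the budget available for the response.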

Figure: the reliability contract stack. SLIs feed SLOs, which define the error budget; the error budget drives release policy and acts as a buffer below the SLA.

Service Level Agreements (SLAs)

An SLA is the external, legally binding contract between a service provider and its customers, specifying the consequences of failing to meet a target. SLAs typically include: the metric being promised (availability, latency), the measurement period, the measurement methodology, and the remedy (service credits, refunds, contract termination). For example, at the time of writing AWS's published SLAs commit to 99.99% monthly uptime for EC2 at the Region level and 99.9% for S3 Standard, with service credits that start at 10% of the monthly bill and grow with the severity of the breach.

The most dangerous mistake in SLA design: setting your SLA equal to your actual operating performance. If you run at 99.95% and promise 99.95%, any normal variance — a bad deploy, a dependent service degradation, a datacenter power event — puts you immediately in breach. The SLA should always be set at a level you are confident you can meet even in bad months. The gap between your SLO (99.9%) and your SLA (99.5%) is what absorbs normal operational variance without creating refund obligations.
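
The size of that gap is easy to quantify. The sketch below assumes, for simplicity, that the SLO and the SLA are both measured over the same 28-day window and that bad events are counted as minutes of full outage; real SLAs are usually measured per calendar month. The function name allowed_bad_minutes is hypothetical.

```shell
# allowed_bad_minutes TARGET_PERCENT [WINDOW_DAYS]
# Minutes of full outage a target tolerates over the window.
allowed_bad_minutes() {
  awk -v target="$1" -v days="${2:-28}" \
    'BEGIN { printf "%.2f\n", days * 24 * 60 * (1 - target / 100) }'
}

slo_minutes=$(allowed_bad_minutes 99.9)   # 40.32
sla_minutes=$(allowed_bad_minutes 99.5)   # 201.60
# Minutes between breaching the SLO and breaching the SLA.
buffer=$(awk -v a="$sla_minutes" -v b="$slo_minutes" 'BEGIN { printf "%.2f\n", a - b }')
echo "SLO budget: ${slo_minutes} min, SLA allowance: ${sla_minutes} min, buffer: ${buffer} min"
```

With these targets an SLO breach still leaves about 161 minutes of outage before the SLA is at risk.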

Defining SLIs in Practice: Prometheus and OpenTelemetry

At big-tech scale, SLIs are computed from metrics instrumentation in the service itself. The industry standard is to instrument request counters and histograms, then compute the SLI ratio in your monitoring system. Here is how this looks end-to-end with Prometheus:

# Prometheus recording rules — compute SLI ratios from raw metrics
# File: /etc/prometheus/rules/sli-rules.yaml

groups:
  - name: sli_availability
    interval: 30s
    rules:
      # Raw ratio: fraction of HTTP requests with status < 500
      - record: job:http_requests_success:rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

  - name: sli_latency
    interval: 30s
    rules:
      # Fraction of requests that completed in <= 200ms
      # Reads the cumulative le="0.2" bucket directly (no histogram_quantile);
      # requires the histogram to define a bucket boundary at 0.2s
      - record: job:http_request_latency_sli:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (job)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (job)

  - name: slo_compliance
    interval: 30s
    rules:
      # 28-day availability SLI (rolling window via long-range rate)
      - record: job:slo_availability_28d:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d])) by (job)
          /
          sum(rate(http_requests_total[28d])) by (job)

      # Error budget remaining (SLO = 0.999)
      - record: job:slo_error_budget_remaining:ratio
        expr: |
          (job:slo_availability_28d:ratio - 0.999) / (1 - 0.999)
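
The arithmetic in the error-budget recording rule can be checked by hand. A minimal sketch in shell, using hypothetical SLI values and a hypothetical function name budget_remaining:

```shell
# budget_remaining SLI SLO
# Mirrors (SLI - SLO) / (1 - SLO): 1 means the budget is untouched,
# 0 means it is exhausted, negative means the SLO is already breached.
budget_remaining() {
  awk -v sli="$1" -v slo="$2" \
    'BEGIN { printf "%.2f\n", (sli - slo) / (1 - slo) }'
}

budget_remaining 0.9995 0.999   # half the budget left: 0.50
budget_remaining 0.9980 0.999   # overspent by one full budget: -1.00
```
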

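# Prometheus alerting rules: SLO burn rate alerts (fast and slow tiers)
# Prometheus evaluates these rules; Alertmanager only routes the resulting alerts.
# Keep them in a separate rules file, or merge them under the groups: key above.
# The fast alert fires when the error budget is burning >14x faster than allowed,
# a rate that would exhaust the 28-day budget in < 48 hours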

groups:
  - name: slo_burn_rate
    rules:
      - alert: AvailabilitySLOFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
            /
            sum(rate(http_requests_total[1h])) by (job)
          ) > (14 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Fast burn: {{ $labels.job }} is exhausting its error budget"
          description: |
            Current 1h error rate exceeds 14x the SLO budget burn rate.
            At this rate the 28-day error budget will be exhausted in < 48h.
            SLO: 99.9% availability over 28 days.
            Current error rate: {{ $value | humanizePercentage }}

      - alert: AvailabilitySLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) by (job)
            /
            sum(rate(http_requests_total[6h])) by (job)
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Slow burn: {{ $labels.job }} slow error budget drain"
          description: |
            6h error rate exceeds 6x the SLO budget burn rate.
            At this rate the 28-day error budget exhausts in < 5 days.

Common Pitfalls at Production Scale

Health-check traffic in the SLI: Synthetic health checks from your load balancer or uptime monitor almost always succeed, so counting them inflates both the numerator and the denominator and makes your availability SLI look better than it is for real users. Exclude them by filtering on a label such as source=synthetic or by using separate counter metrics.
Using availability as your only SLI: A service that responds with HTTP 200 containing an error JSON body is "available" but broken. Include a correctness SLI for critical paths. Stripe monitors whether payments actually clear, not just whether the endpoint responds.
Setting SLOs on individual instances: An SLI that measures one pod will be noisy. SLIs should be aggregated across the entire service fleet: if one pod is bad but 99 are healthy, roughly 1% of requests are affected (assuming evenly balanced traffic), and the SLI should reflect exactly that, no more and no less.
Not distinguishing user-impacting errors from infrastructure errors: A timeout caused by a misconfigured health check probe should not count against your user-facing availability SLI. Classify events by user impact, not by what the load balancer logged.
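
The first pitfall is easy to demonstrate with a toy request log. The sketch below assumes a hypothetical two-column format (HTTP status, traffic source) and a hypothetical function name availability; it is meant only to show how synthetic traffic shifts the ratio.

```shell
# availability [EXCLUDED_SOURCE]
# Reads "STATUS SOURCE" pairs from stdin and prints the fraction of requests
# with status < 500, skipping rows whose source matches EXCLUDED_SOURCE.
availability() {
  awk -v skip="${1:-}" '
    skip != "" && $2 == skip { next }
    { total++; if ($1 < 500) good++ }
    END {
      if (total == 0) { print "NaN"; exit }
      printf "%.3f\n", good / total
    }'
}

# Hypothetical sample: 4 user requests (one fails) and 4 health checks.
sample='200 user
200 user
500 user
200 user
200 synthetic
200 synthetic
200 synthetic
200 synthetic'

printf '%s\n' "$sample" | availability             # all traffic: 0.875
printf '%s\n' "$sample" | availability synthetic   # real users only: 0.750
```

With health checks included, the SLI reads 87.5% while real users see 75%. The same distortion appears, at a smaller scale, in any production SLI that counts synthetic probes.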

The SLO is a negotiation, not a decree. At Google, SLOs are set collaboratively between the service team and its stakeholders. Product managers need to understand that a higher SLO means more engineering time on reliability and less on features. The conversation "do we need four nines or three nines?" is a product decision backed by engineering cost analysis, not a unilateral technical choice. Most internal services at Google operate at 99.9% — not 99.99% — because the marginal cost of the last nine is enormous.