Defining Reliability & Availability
Every large-scale system eventually fails. Hardware dies, networks partition, bugs surface under load, and third-party dependencies go dark. The goal is never to eliminate failure — it is to build systems that fail gracefully, recover quickly, and keep users satisfied. Before you can do that, you need precise language: SLA, SLO, and SLI. These three terms are the vocabulary of reliability engineering, and confusing them leads to misaligned expectations, missed incidents, and wasted capacity investment.
SLI — Service Level Indicator
An SLI (Service Level Indicator) is a measured number — a real, quantifiable signal you collect about your system's behaviour. Think of it as a single gauge on your dashboard. Common SLIs include:
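- Availability: fraction of valid requests that return a successful response (typically non-5xx).
- Latency: the proportion of requests served below a threshold (e.g. "99% of requests in under 200 ms", often abbreviated as "p99 under 200 ms").
- Error rate: fraction of requests resulting in an error.
- Throughput: requests processed per second. This is a raw rate rather than a ratio, so it is usually tracked alongside the ratio-based SLIs rather than given a percentage target.
- Durability: for storage, the probability that data written is still readable later (e.g. Amazon S3 is designed for 99.999999999% durability of objects over a given year).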
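An SLI is most useful when expressed as a ratio: good events / valid events. That form always falls between 0% and 100%, which makes it easy to compare against a target. If your API received 1,000,000 requests and 9,950 were 5xx errors, your availability SLI for that window is (1,000,000 − 9,950) / 1,000,000 = 990,050 / 1,000,000 = 99.005%.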
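The ratio form can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function names `ratio_sli` and `latency_sli` are hypothetical, and the latency samples in the usage note are made up.

```python
def ratio_sli(good_events: int, valid_events: int) -> float:
    """Return good / valid as a fraction between 0 and 1."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served faster than threshold_ms."""
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return ratio_sli(good, len(latencies_ms))


# The worked example from the text: 9,950 errors out of 1,000,000 requests.
total_requests = 1_000_000
server_errors = 9_950
availability = ratio_sli(total_requests - server_errors, total_requests)
print(f"{availability:.3%}")  # 99.005%
```

The same helper covers a latency SLI: `latency_sli([50, 150, 250, 90], 200)` counts three of four requests as good and returns 0.75.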
SLO — Service Level Objective
An SLO (Service Level Objective) is a target for an SLI over a measurement window. It is an internal engineering commitment — a threshold you design and operate towards. For example:
- Availability SLO: 99.9% of requests succeed, measured over a rolling 30-day window.
- Latency SLO: 95% of requests complete in under 100 ms; 99% in under 500 ms.
The SLO defines your error budget: the allowed failure headroom before the objective is breached. At 99.9% over 30 days, your budget is 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime (or equivalent bad requests). Teams use the error budget to decide how aggressively to ship: plenty of budget left → ship features fast; budget nearly exhausted → freeze risky releases, focus on reliability.
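The error-budget arithmetic and the ship-or-freeze decision can be sketched as follows. The function names and the 10% freeze threshold are assumptions chosen for illustration; real teams set their own policy.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Negative values mean the budget is overspent.
    """
    allowed_bad = (1.0 - slo) * total_requests
    if allowed_bad <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - bad_requests / allowed_bad


def release_policy(remaining: float, freeze_below: float = 0.10) -> str:
    """Hypothetical policy: freeze when less than 10% of the budget is left."""
    return "freeze risky releases" if remaining < freeze_below else "ship features"


print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
```

With a 99.9% SLO and 1,000,000 requests in the window, the budget is roughly 1,000 bad requests. After 250 bad requests about 75% of the budget remains and `release_policy` returns "ship features"; after 950 only about 5% remains and it returns "freeze risky releases".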
SLA — Service Level Agreement
An SLA (Service Level Agreement) is a contractual promise to customers. It specifies a service level and what happens when you miss it (refunds, credits, termination rights). SLAs are negotiated between a business and its customers; SLOs are engineering targets. In practice the SLA threshold is usually set looser than the internal SLO, so that engineers are alerted and can react before a contractual breach occurs. Google Cloud, AWS, and Azure all publish SLAs with explicit uptime percentages and credit schedules.
The relationship is a hierarchy: SLI → SLO → SLA. You measure with SLIs, you aim for SLOs, and you commit externally via SLAs. Violating an SLA has financial and legal consequences; violating an SLO is an internal engineering alarm.
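One way to picture the chain is a function that compares a measured SLI with both thresholds. This is a sketch under two assumptions: that the SLA threshold is no stricter than the SLO, and that the 99.9% and 99.5% values are purely hypothetical and not taken from any provider's contract.

```python
def evaluate(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA."""
    if sla > slo:
        # Assumption: the external promise should not be stricter than the internal target.
        raise ValueError("SLA threshold should not exceed the SLO")
    if sli < sla:
        return "SLA breached: contractual consequences"
    if sli < slo:
        return "SLO missed: internal alarm"
    return "within SLO"


SLO = 0.999  # hypothetical internal target
SLA = 0.995  # hypothetical contractual threshold

for measured in (0.9995, 0.997, 0.990):
    print(f"{measured:.2%} -> {evaluate(measured, SLO, SLA)}")
```

The gap between the two thresholds is the margin in which engineers can react before customers are owed anything.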
What the Nines Really Mean
Availability is almost always expressed as a percentage of uptime, and colloquially described in "nines". The table below shows exactly how much downtime each tier allows per year — the numbers are small enough to surprise most engineers.
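Availability   Downtime / year   Downtime / month   Downtime / week
─────────────────────────────────────────────────────────────────────
90%            36.5 days         73.1 hours         16.8 hours
99%            3.65 days         7.31 hours         1.68 hours
99.5%          1.83 days         3.65 hours         50.4 min
99.9%          8.77 hours        43.8 min           10.1 min
99.95%         4.38 hours        21.9 min           5.04 min
99.99%         52.6 min          4.38 min           1.01 min
99.999%        5.26 min          26.3 sec           6.05 sec

Yearly figures assume a 365.25-day year; monthly figures assume an average month of one twelfth of a year (about 30.44 days). A fixed 30-day window gives slightly smaller numbers, for example 43.2 minutes at 99.9%.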
Moving from three nines (99.9%) to four nines (99.99%) cuts your allowed downtime from ~8.8 hours/year to ~53 minutes/year. That leap often requires active–active redundancy, automated failover, and chaos testing, all of which are significant engineering investments. Moving to five nines (99.999%, ~5 min/year) is the domain of telecommunications carriers. Five minutes is less time than a person typically needs to acknowledge a page and diagnose a fault, so at this level incident response has to run with almost no human in the loop.
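Downtime budgets of this kind follow from simple arithmetic, sketched below. The period lengths are assumptions (a 365.25-day year and a month of one twelfth of a year); other conventions, such as a 30-day month, shift the results slightly.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Assumed period lengths, in seconds.
SECONDS_PER = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_pct: float, period: str) -> float:
    """Seconds of downtime permitted per period at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * SECONDS_PER[period]


for pct in (99.9, 99.99, 99.999):
    minutes_per_year = allowed_downtime_seconds(pct, "year") / 60
    print(f"{pct}% -> {minutes_per_year:.1f} min/year")
```

Each extra nine divides the budget by ten, which is why the jump from three to four nines takes the yearly allowance from hours down to minutes.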
Reliability vs Availability — They Are Not the Same
Availability measures the fraction of time (or requests) a system is up and responding correctly. Reliability is a broader property: a reliable system not only stays up but also produces correct results consistently. A system can be highly available (rarely down) yet unreliable (frequently returns wrong data). For example, a caching layer that serves stale data 20% of the time is available but unreliable. In practice, SLIs for reliability include correctness metrics such as "fraction of orders processed without a data integrity error."
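The stale-cache case can be made concrete by computing two SLIs over the same traffic. This is an illustrative sketch: the `Response` record and its `correct` flag are hypothetical, and in a real system correctness would have to come from some verification step such as a freshness or integrity check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int    # HTTP status code
    correct: bool  # hypothetical flag: payload was fresh and accurate


def availability_sli(responses: list[Response]) -> float:
    """Fraction of requests that returned a non-5xx response."""
    if not responses:
        raise ValueError("no responses to measure")
    return sum(1 for r in responses if r.status < 500) / len(responses)


def correctness_sli(responses: list[Response]) -> float:
    """Fraction of requests that succeeded and returned correct data."""
    if not responses:
        raise ValueError("no responses to measure")
    good = sum(1 for r in responses if r.status < 500 and r.correct)
    return good / len(responses)


# A cache that always answers but serves stale data 20% of the time.
responses = [Response(200, True)] * 8 + [Response(200, False)] * 2
print(availability_sli(responses))  # 1.0
print(correctness_sli(responses))   # 0.8
```

An availability dashboard alone would show this service as perfect; only the correctness SLI exposes the problem.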
Choosing the Right SLOs
Not every metric deserves a tight SLO. Over-engineering reliability burns engineering budget and increases system complexity. Use these heuristics:
- Start from the user journey. What does a user directly experience? Latency and error rate on the critical path deserve tight SLOs; background jobs can tolerate looser ones.
- Match the SLO to commercial risk. A payment service, where every failed request is lost revenue, may justify four nines or more; an internal analytics dashboard can run at 99.5%.
- Measure first, then set targets. Never commit to an SLO you have not first measured. Your historical baseline is your starting point.
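- Fewer, well-chosen SLOs beat many loose ones. Google's SRE guidance is to have as few SLOs as possible while still covering the attributes users care about; three to five per service is a common rule of thumb.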
Summary
SLIs, SLOs, and SLAs form a chain: you measure real behaviour with SLIs, set internal targets with SLOs, and commit externally via SLAs. The "nines" are not marketing badges — they represent precise, calculated downtime budgets that govern everything from your deployment strategy to your on-call rotations. In the next lessons you will learn the concrete architectural techniques — redundancy, failover, circuit breakers, and more — that let you actually hit those targets.