Reliability, Availability & Resilience

Defining Reliability & Availability

Every large-scale system eventually fails. Hardware dies, networks partition, bugs surface under load, and third-party dependencies go dark. The goal is never to eliminate failure — it is to build systems that fail gracefully, recover quickly, and keep users satisfied. Before you can do that, you need precise language: SLA, SLO, and SLI. These three terms are the vocabulary of reliability engineering, and confusing them leads to misaligned expectations, missed incidents, and wasted capacity investment.

SLI — Service Level Indicator

An SLI (Service Level Indicator) is a measured number — a real, quantifiable signal you collect about your system's behaviour. Think of it as a single gauge on your dashboard. Common SLIs include:

Availability — fraction of valid requests that return a successful response (typically non-5xx).
Latency — the proportion of requests served faster than a threshold (e.g. "fraction of requests completed in under 200 ms").
Error rate — fraction of requests resulting in an error.
Throughput — requests processed per second.
Durability — for storage, the probability that data written is still intact and readable over a stated period (e.g. S3 is designed for 99.999999999% durability of objects over a given year).

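An SLI is most useful when expressed as a ratio: good events / valid events. Raw measures such as throughput fit this form once you count the intervals in which they met a target. If your API received 1,000,000 requests and 9,950 were 5xx errors, your availability SLI for that window is (1,000,000 − 9,950) / 1,000,000 = 99.005%.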
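The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation; the function name availability_sli and its parameters are assumptions made for this example.

```python
def availability_sli(valid_requests: int, failed_requests: int) -> float:
    """Return the availability SLI as a fraction: good events / valid events."""
    if valid_requests <= 0:
        raise ValueError("valid_requests must be positive")
    if not 0 <= failed_requests <= valid_requests:
        raise ValueError("failed_requests must be between 0 and valid_requests")
    good_requests = valid_requests - failed_requests
    return good_requests / valid_requests


# The worked example from the text: 1,000,000 requests, 9,950 of them 5xx.
sli = availability_sli(valid_requests=1_000_000, failed_requests=9_950)
print(f"{sli:.3%}")  # 99.005%
```

Counting only valid requests in the denominator is a deliberate choice: malformed or unauthenticated traffic would otherwise distort the signal.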

Key idea: An SLI is objective and measured. It is a number from your monitoring system, not a promise written in a contract. You can have multiple SLIs per service; pick only the ones that actually reflect user happiness.

SLO — Service Level Objective

An SLO (Service Level Objective) is a target for an SLI over a measurement window. It is an internal engineering commitment — a threshold you design and operate towards. For example:

Availability SLO: 99.9% of requests succeed, measured over a rolling 30-day window.
Latency SLO: 95% of requests complete in under 100 ms; 99% in under 500 ms.
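As a rough sketch, the two-threshold latency SLO above could be checked against a window of measured latencies like this. The helper names, the LATENCY_SLO structure, and the sample data are hypothetical and exist only for illustration.

```python
from typing import Sequence, Tuple


def fraction_under(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Latency SLI: fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: Sequence[float],
    targets: Sequence[Tuple[float, float]],
) -> bool:
    """True if every (threshold_ms, required_fraction) target is met."""
    return all(
        fraction_under(latencies_ms, threshold_ms) >= required
        for threshold_ms, required in targets
    )


# The SLO from the text: 95% under 100 ms, 99% under 500 ms.
LATENCY_SLO = [(100, 0.95), (500, 0.99)]

# Made-up sample: 960 fast requests, 35 medium, 5 slow.
sample = [20.0] * 960 + [250.0] * 35 + [900.0] * 5
print(meets_latency_slo(sample, LATENCY_SLO))  # True
```

Expressing each target as a threshold plus a required fraction keeps the latency SLI in the same good/valid form as the availability SLI.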

The SLO defines your error budget: the allowed failure headroom before the objective is breached. At 99.9% over 30 days, your budget is 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime (or equivalent bad requests). Teams use the error budget to decide how aggressively to ship: plenty of budget left → ship features fast; budget nearly exhausted → freeze risky releases, focus on reliability.
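The error budget arithmetic can be sketched as follows. The function names are assumptions for this example, and the 10% freeze threshold in release_policy is a hypothetical value chosen for illustration; the text does not prescribe one.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Downtime allowed in the window before the SLO is breached, in minutes."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, valid_requests: int, bad_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Negative values mean the budget has been overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if valid_requests <= 0:
        raise ValueError("valid_requests must be positive")
    allowed_bad = (1 - slo) * valid_requests
    return 1 - bad_requests / allowed_bad


def release_policy(remaining: float, freeze_below: float = 0.10) -> str:
    """Map remaining budget to a decision; freeze_below is a hypothetical threshold."""
    return "freeze risky releases" if remaining < freeze_below else "ship features"


print(round(error_budget_minutes(0.999, 30), 1))  # 43.2

remaining = budget_remaining(0.999, valid_requests=1_000_000, bad_requests=250)
print(f"{remaining:.0%} of budget left: {release_policy(remaining)}")
```

With 1,000,000 valid requests at 99.9%, roughly 1,000 bad requests are allowed, so 250 failures leave about three quarters of the budget unspent.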

Best practice: Set SLOs slightly tighter than your SLA. If your SLA promises 99.9%, aim for an internal SLO of 99.95%. This gives you a buffer to catch and fix regressions before they breach the external commitment.

SLA — Service Level Agreement

An SLA (Service Level Agreement) is a contractual promise to customers — it specifies the SLO and what happens when you miss it (refunds, credits, termination rights). SLAs are negotiated between a business and its customers; SLOs are engineering targets. Google Cloud, AWS, and Azure all publish SLAs with explicit uptime percentages and credit schedules.

The relationship is a hierarchy: SLI → SLO → SLA. You measure with SLIs, you aim for SLOs, and you commit externally via SLAs. Violating an SLA has financial and legal consequences; violating an SLO is an internal engineering alarm.

Figure: The SLI → SLO → SLA hierarchy: measured signals inform internal targets, which back external contracts.

What the Nines Really Mean

Availability is almost always expressed as a percentage of uptime, and colloquially described in "nines". The table below shows exactly how much downtime each tier allows per year — the numbers are small enough to surprise most engineers.

Availability   Downtime / year    Downtime / month   Downtime / week
               (365 days)         (30 days)          (7 days)
─────────────────────────────────────────────────────────────────────
90%            36.5 days          72 hours           16.8 hours
99%            3.65 days          7.2 hours          1.68 hours
99.5%          1.83 days          3.6 hours          50.4 min
99.9%          8.76 hours         43.2 min           10.1 min
99.95%         4.38 hours         21.6 min           5.04 min
99.99%         52.6 min           4.32 min           1.01 min
99.999%        5.26 min           25.9 sec           6.05 sec

Moving from three nines (99.9%) to four nines (99.99%) cuts your allowed downtime from ~8.8 hours/year to ~53 minutes/year. That leap often requires active–active redundancy, automated failover, and chaos testing, all of which add up to a significant engineering investment. Moving to five nines (99.999%, ~5 min/year) is the domain of telecommunications carriers and demands near-zero human-in-the-loop incident response.
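The downtime allowances can be derived from the availability percentage directly. The sketch below assumes a 365-day year, a 30-day month, and a 7-day week; other conventions (for example a 365.25-day year) shift the figures slightly. The function name is an assumption for this example.

```python
SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Seconds of downtime permitted in a period at a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * period_days * SECONDS_PER_DAY


# Assumed period lengths: 365-day year, 30-day month, 7-day week.
PERIODS = {"year": 365, "month": 30, "week": 7}

for pct in (99.9, 99.99, 99.999):
    row = ", ".join(
        f"{name}: {allowed_downtime_seconds(pct, days) / 60:.2f} min"
        for name, days in PERIODS.items()
    )
    print(f"{pct}% -> {row}")
```

Each extra nine divides the allowance by ten, which is why the step from three to four nines removes roughly nine tenths of the yearly downtime budget.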

Common pitfall: "High availability" is not a number — it is marketing language. Always ask: "What is the SLO, measured over which window, for which user-facing operation?" A service can have 99.99% availability for read requests and 99.5% for writes; they require separate SLIs and SLOs.
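To illustrate why reads and writes need separate SLIs, here is a minimal sketch that groups request outcomes by operation before computing availability. The event shape (operation name, HTTP status), the function name, and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def availability_by_operation(events: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Compute one availability SLI per operation from (operation, status) pairs."""
    valid: Counter = Counter()
    good: Counter = Counter()
    for operation, status in events:
        valid[operation] += 1
        if status < 500:  # treat non-5xx responses as successful
            good[operation] += 1
    return {operation: good[operation] / valid[operation] for operation in valid}


# Made-up traffic: 10,000 reads with 1 failure, 1,000 writes with 5 failures.
events = (
    [("read", 200)] * 9_999 + [("read", 503)]
    + [("write", 201)] * 995 + [("write", 500)] * 5
)

slis = availability_by_operation(events)
for operation, sli in sorted(slis.items()):
    print(f"{operation}: {sli:.2%}")
```

A single blended number for this traffic would sit close to the read figure and hide the weaker write path, which is the situation the pitfall warns about.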

Figure: The error budget is the allowed failure headroom. Teams use it to balance feature velocity against reliability investment.

Reliability vs Availability — They Are Not the Same

Availability measures the fraction of time (or requests) a system is up and responding correctly. Reliability is a broader property: a reliable system not only stays up but also produces correct results consistently. A system can be highly available (rarely down) yet unreliable (frequently returns wrong data). For example, a caching layer that serves stale data 20% of the time is available but unreliable. In practice, SLIs for reliability include correctness metrics such as "fraction of orders processed without a data integrity error."
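The stale-cache example can be made concrete with two SLIs computed over the same responses. This is only a sketch: the Response type and its correct flag are hypothetical, and in practice determining correctness requires comparing against a source of truth.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Response:
    status: int
    correct: bool  # hypothetical flag: payload matched the source of truth


def availability(responses: Sequence[Response]) -> float:
    """Fraction of responses that were served successfully (non-5xx)."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if r.status < 500) / len(responses)


def correctness(responses: Sequence[Response]) -> float:
    """Fraction of successful responses whose content was actually correct."""
    served = [r for r in responses if r.status < 500]
    if not served:
        raise ValueError("no successful responses to evaluate")
    return sum(1 for r in served if r.correct) / len(served)


# A cache that always answers, but serves stale data 20% of the time.
cache_responses = [Response(200, True)] * 800 + [Response(200, False)] * 200

print(f"availability: {availability(cache_responses):.1%}")  # 100.0%
print(f"correctness:  {correctness(cache_responses):.1%}")   # 80.0%
```

The availability SLI alone would report a perfect service here; only the correctness SLI exposes the reliability problem.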

Choosing the Right SLOs

Not every metric deserves a tight SLO. Over-engineering reliability burns engineering budget and increases system complexity. Use these heuristics:

Start from the user journey. What does a user directly experience? Latency and error rate on the critical path deserve tight SLOs; background jobs can tolerate looser ones.
Match the SLO to commercial risk. A payment service may justify four nines or more; an internal analytics dashboard can run at 99.5%.
Measure first, then set targets. Never commit to an SLO you have not first measured. Your historical baseline is your starting point.
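Fewer, well-chosen SLOs beat many loose ones. Google's SRE guidance is to choose just enough SLOs to cover the behaviour users care about; a handful per service (three to five is a common rule of thumb) is usually plenty.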
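One possible way to apply "measure first, then set targets" is to compute the historical SLI and see which standard tier it already meets. The heuristic below is an assumption made for illustration, not an established method, and the tier list, function names, and sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Candidate tiers, taken from the common "nines" levels.
TIERS = (0.90, 0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999)


def baseline_sli(daily_counts: Sequence[Tuple[int, int]]) -> float:
    """Aggregate (valid, good) request counts per day into one SLI for the window."""
    valid = sum(v for v, _ in daily_counts)
    good = sum(g for _, g in daily_counts)
    if valid <= 0:
        raise ValueError("need at least one valid request")
    return good / valid


def tightest_tier_met(sli: float, tiers: Sequence[float] = TIERS) -> Optional[float]:
    """Return the strictest tier the measured SLI already satisfies, if any."""
    met = [tier for tier in tiers if sli >= tier]
    return max(met) if met else None


# Made-up history: 30 days, 100,000 valid requests per day, 40 failures per day.
history = [(100_000, 99_960)] * 30

measured = baseline_sli(history)
print(f"baseline {measured:.4%}, candidate SLO {tightest_tier_met(measured)}")
```

The result is a starting point for discussion rather than a commitment: the final target still has to reflect the user journey and commercial risk described above.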

Real-world example: the Amazon S3 SLA offers service credits when monthly uptime for S3 Standard falls below 99.9%, while AWS separately states that S3 is designed for 99.999999999% (eleven nines) durability of objects over a given year. Availability and durability are different SLIs requiring different engineering approaches: storing each object redundantly handles durability; multi-AZ deployment handles availability.

Summary

SLIs, SLOs, and SLAs form a chain: you measure real behaviour with SLIs, set internal targets with SLOs, and commit externally via SLAs. The "nines" are not marketing badges — they represent precise, calculated downtime budgets that govern everything from your deployment strategy to your on-call rotations. In the next lessons you will learn the concrete architectural techniques — redundancy, failover, circuit breakers, and more — that let you actually hit those targets.