Site Reliability Engineering (SRE)

What Is SRE?

In 2003, Google had a problem. The company was scaling faster than any operations team could reasonably keep up with. Systems were growing more complex, deployment frequency was increasing, and the traditional separation between "development" and "operations" was producing the classic dysfunction: developers wanted to ship fast, operators wanted stability, and the two groups had structurally opposite incentives. The result was slow deploys, fragile releases, and an ops team drowning in toil.

Ben Treynor Sloss, then a Google engineering director, was handed a small team of software engineers and told to run production. His solution was not to hire more traditional sysadmins — it was to approach the operations problem the same way engineers approach any other hard problem: with software, measurement, automation, and feedback loops. Site Reliability Engineering was born.

The founding insight: Operations is a software problem. Every task a human does repeatedly on a computer is a candidate for automation. Every failure mode should be codified into monitoring. Every on-call decision should, over time, become a runbook — and every runbook should, over time, become code. SRE is the discipline of systematically turning ops work into engineering work.

The Google SRE Model: Core Principles

The Google model, codified in the 2016 book Site Reliability Engineering, rests on a small number of powerful ideas that fundamentally reframe how you think about reliability:

1. Hire Software Engineers to Do Operations

Google SREs are software engineers first. They write production code, own services end-to-end, participate in code review, and are held to the same engineering bar as product developers. This is not a cosmetic change — it has structural consequences. An SRE who writes code can automate their own toil, build internal tooling that scales, and have a meaningful conversation with product teams about system design and failure modes. A traditional sysadmin running scripts cannot.

According to the Google SRE book, roughly half of Google's SREs were hired as standard software engineers. The rest were candidates who came very close to that bar (85 to 99 percent of the required skill set) and who also brought systems knowledge that is rare among software engineers, such as operating systems internals, networking, distributed systems, and comfort reasoning about complex failure modes at scale. In practice at big-tech companies today, SRE roles require strong coding ability in at least one general-purpose language (Go, Python, Java), deep Linux proficiency, and hands-on distributed systems experience. These are the skills you have been building throughout this course.

2. Reliability Is Measured, Not Assumed

One of the most important contributions of the SRE model is the introduction of Service Level Objectives (SLOs) as the primary currency of reliability conversations. Instead of vague commitments like "the system should be highly available," SRE demands precision: "99.9% of homepage requests will return a successful response within 300ms, measured over a rolling 28-day window."

This target is expressed in terms of a Service Level Indicator (SLI), which is the actual measurement (here, the fraction of homepage requests that succeed within 300ms). A Service Level Agreement (SLA) is the external contract that defines the consequence if you miss a promised level of service. The SLO is your internal target on the SLI, and it is set stricter than the SLA so that you have a buffer before breaching the contract.
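As a minimal sketch of how the homepage SLO above could be evaluated, the following Python computes an SLI from a batch of request records and compares it with the 99.9% target. The `Request` record, the function names, and the rule that any status below 500 counts as a success are assumptions made for illustration; a real service would define "good" in its own SLI specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to respond, in milliseconds


SLO_TARGET = 0.999          # 99.9% of requests must be "good"
LATENCY_THRESHOLD_MS = 300  # "within 300ms" from the example SLO


def is_good(req: Request) -> bool:
    """Assumed rule: a request is good if it is not a server error and is fast enough."""
    return req.status < 500 and req.latency_ms <= LATENCY_THRESHOLD_MS


def compute_sli(requests: list[Request]) -> float:
    """SLI = good requests / total requests over the measurement window."""
    if not requests:
        return 1.0  # no traffic: treated as meeting the objective (a policy choice)
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


def meets_slo(sli: float, target: float = SLO_TARGET) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= target
```

Calling `compute_sli` on the requests collected over the rolling 28-day window and passing the result to `meets_slo` gives the yes/no answer that the SLO conversation depends on.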

Why does precision matter? Because it turns reliability from a qualitative argument into a quantitative one. When a product manager wants to ship a risky feature and an SRE has concerns, the conversation shifts from "I think this might break things" to "we have consumed 60% of our monthly error budget in two weeks — shipping this increases our breach probability to 85%." One of those conversations is actionable. The other is not.
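The "60% of the budget in two weeks" figure can be turned into a burn rate, which is one common way to quantify this kind of conversation. The sketch below assumes a 28-day window, so two weeks is half the window; the function names are hypothetical. It does not estimate a breach probability, which would need a model of future failures that this lesson does not specify.

```python
def burn_rate(budget_consumed: float, window_elapsed: float) -> float:
    """Budget fraction spent divided by window fraction elapsed.

    A value of 1.0 means the budget is being spent exactly on pace;
    above 1.0 means it will run out before the window ends.
    """
    if not 0 < window_elapsed <= 1:
        raise ValueError("window_elapsed must be in (0, 1]")
    return budget_consumed / window_elapsed


def projected_exhaustion_day(budget_consumed: float, days_elapsed: float) -> float:
    """Day, counted from the window start, on which the budget reaches zero
    if the current pace holds. Returns infinity if nothing has been spent."""
    if budget_consumed <= 0:
        return float("inf")
    return days_elapsed / budget_consumed


# 60% of the budget consumed after 14 of 28 days (assumed window length).
rate = burn_rate(budget_consumed=0.60, window_elapsed=14 / 28)
day = projected_exhaustion_day(budget_consumed=0.60, days_elapsed=14)
print(f"burn rate: {rate:.2f}, budget exhausted around day {day:.1f}")
```

With these inputs the burn rate is about 1.2, so at the current pace the budget runs out around day 23, several days before the window closes.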

3. Error Budgets: Reliability as a Shared Resource

The error budget is the SRE model's most elegant invention. If your SLO is 99.9% availability, then your error budget is the complement: 0.1% of requests can fail per measurement window without breaching the SLO. Over a 28-day window, that is roughly 40 minutes of total downtime (about 43 minutes over a 30-day month), or 1 million failed requests per billion requests.
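The budget arithmetic is simple enough to check directly. The following sketch, with hypothetical function names, converts an SLO into allowed downtime and allowed failed requests.

```python
def error_budget_fraction(slo: float) -> float:
    """The error budget is the complement of the SLO (0.999 -> about 0.001)."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Total minutes of full outage the budget permits over the window."""
    return error_budget_fraction(slo) * window_days * 24 * 60


def allowed_failures(slo: float, total_requests: int) -> int:
    """Failed requests the budget permits, rounded to absorb float error."""
    return round(error_budget_fraction(slo) * total_requests)


print(f"{allowed_downtime_minutes(0.999, 28):.2f} minutes over 28 days")
print(f"{allowed_downtime_minutes(0.999, 30):.2f} minutes over 30 days")
print(f"{allowed_failures(0.999, 1_000_000_000):,} failures per billion requests")
```

For a 99.9% SLO this gives about 40.3 minutes over a 28-day window, about 43.2 minutes over a 30-day month, and 1,000,000 allowed failures per billion requests.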

The error budget is shared between development and SRE. Development spends it by deploying new features (which sometimes break things). SRE protects it by enforcing release gates and promoting reliability work. If the budget is healthy, development can move fast. If the budget is nearly exhausted, SRE has the organizational authority to slow down or halt releases until reliability is restored. This is not an arbitrary rule — it is a mathematically derived consequence of the SLO both teams agreed to.
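One way to picture the release gate is as a function of the remaining budget. The sketch below is illustrative only: the `ReleaseDecision` states and the 25% caution threshold are assumptions, not part of the Google model, and a real policy would be whatever development and SRE agreed to in writing.

```python
from enum import Enum


class ReleaseDecision(Enum):
    PROCEED = "proceed"  # budget is healthy: ship at normal pace
    CAUTION = "caution"  # budget is low: prefer low-risk changes
    FREEZE = "freeze"    # budget is exhausted: reliability work only


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO is breached.
    """
    allowed = (1.0 - slo) * total_requests
    if allowed <= 0:
        raise ValueError("an SLO of 100% (or zero traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed


def release_decision(remaining: float, caution_below: float = 0.25) -> ReleaseDecision:
    """Map remaining budget to a gate state. The threshold is a hypothetical default."""
    if remaining <= 0:
        return ReleaseDecision.FREEZE
    if remaining < caution_below:
        return ReleaseDecision.CAUTION
    return ReleaseDecision.PROCEED


# 1,000,000 requests at a 99.9% SLO allow about 1,000 failures.
remaining = budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(release_decision(remaining))
```

Because the decision is computed from the same SLO both teams signed off on, neither side has to argue about whether a freeze is justified.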

SLI is the raw measurement; SLO is your internal target; SLA is the external contract. The error budget is the gap between perfect and your SLO — it is a shared resource development and SRE manage together.

SRE vs DevOps: Related But Distinct

The relationship between SRE and DevOps is one of the most frequently misunderstood in the industry. They are not competing philosophies — they are complementary, and understanding the distinction matters for how you structure teams and responsibilities.

DevOps is a cultural and organizational movement. It emerged from the same dysfunction that motivated SRE: the wall between development and operations that slows delivery and degrades reliability. DevOps prescribes a set of cultural values (collaboration, shared ownership, automation, fast feedback) and a set of practices (CI/CD, infrastructure as code, blameless postmortems) that embody those values. DevOps does not specify how to implement these things. It is a philosophy, not an implementation.

SRE is an opinionated implementation of DevOps principles. As Ben Treynor Sloss puts it in Google's SRE book: "SRE is what happens when you ask a software engineer to design an operations team." SRE provides specific mechanisms: SLOs, error budgets, the 50% toil cap, production readiness reviews, blameless postmortems with structured timelines, and defined engagement models between SRE teams and product teams. Where DevOps says "automate everything," SRE says "cap toil at 50% of engineering time and track it quarterly."

How top companies apply this: At Google, the SRE team owns production for a service once it passes a Production Readiness Review (PRR). If the service becomes too unreliable, SRE can hand it back to the product team until they fix it. At Spotify, Netflix, and Stripe, the pattern is different — product teams own their own reliability ("you build it, you run it"), with an SRE platform team providing tooling, standards, and embedded specialists for the most critical services. Neither model is universally correct; the right choice depends on org size, service criticality, and team maturity.

The 50% Toil Cap

One of the most concrete and enforced policies in the Google SRE model is the toil cap: SREs should spend no more than 50% of their time on toil — manual, repetitive, automatable operational work. The other 50% must go to engineering work that reduces future toil or improves service reliability.

Toil has a precise definition in SRE. It is work that is manual (requires a human to do it), repetitive (happens again and again), automatable (a machine could do it), reactive (triggered by an event rather than planned), and adds no enduring value (the system is not more reliable after you do it than before). Restarting a service because it leaks memory is toil. Writing the code to detect and auto-restart the leaking service is not toil; it is engineering. Responding by hand to the same recurring page is toil. Fixing the underlying cause, or tuning the alert so that it stops firing needlessly, is engineering.

Why does this matter at big-tech scale? Because toil is self-compounding. Every new service added to an SRE team's portfolio brings new toil. If the team does not continually automate, toil grows faster than the team can hire, and eventually every engineer is 100% toil and zero engineering — at which point the organization has an operations team, not an SRE team, and reliability degrades while costs soar.

# Measuring toil vs engineering time — a practical tracking pattern
# Many SRE teams run a weekly or quarterly toil audit.
# Simple approach: tag time in your incident/ticket system.

# Example: query PagerDuty + Jira to classify work
# (pseudocode — adapt to your tooling)

# Toil indicators in incident data:
#   - Alerts that required manual intervention (no auto-remediation)
#   - Runbook steps that are "click this button" or "run this command"
#   - Tickets labeled "ops-task" with no code change attached
#   - Pages that resolve without a postmortem (implying known/routine)

# Engineering indicators:
#   - PRs merged to reliability tooling, alerting rules, runbook automation
#   - Postmortem action items completed
#   - SLO definition or dashboards updated
#   - Load tests, chaos experiments, capacity models produced

# Target: toil <= 50% of eng-hours per quarter
# Red flag: toil trending up quarter-over-quarter despite team growth
# Action: freeze new service onboarding until toil ratio recovers
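The indicators above can be turned into a small classifier once work items have been exported from the ticket and incident systems. The sketch below assumes that export has already happened and makes no PagerDuty or Jira API calls. The `WorkItem` shape and the label names `ops-task` and `manual-intervention` are hypothetical and should be adapted to your own tagging scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    hours: float            # engineer-hours spent on the item
    labels: frozenset[str]  # labels copied from the ticket or incident
    has_code_change: bool   # True if a PR or commit is attached


TOIL_CAP = 0.50  # target from the lesson: toil <= 50% of engineering hours


def is_toil(item: WorkItem) -> bool:
    """Heuristic: manual alert handling, or an ops task with no code change."""
    if "manual-intervention" in item.labels:
        return True
    return "ops-task" in item.labels and not item.has_code_change


def toil_ratio(items: list[WorkItem]) -> float:
    """Share of total hours spent on toil; 0.0 when no hours were logged."""
    total = sum(i.hours for i in items)
    if total == 0:
        return 0.0
    return sum(i.hours for i in items if is_toil(i)) / total


def over_cap(ratio: float, cap: float = TOIL_CAP) -> bool:
    """True when the quarter's toil share exceeds the cap."""
    return ratio > cap


quarter = [
    WorkItem(30.0, frozenset({"ops-task"}), has_code_change=False),
    WorkItem(20.0, frozenset({"reliability-tooling"}), has_code_change=True),
]
ratio = toil_ratio(quarter)
print(f"toil ratio: {ratio:.0%}, over cap: {over_cap(ratio)}")
```

Running the same calculation every quarter and comparing the ratios is enough to spot the red flag of toil trending upward despite team growth.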

Why This Model Works (and Where It Fails)

The SRE model works because it aligns incentives. Before SRE, developers were incentivized to ship fast (features = success) and operations was incentivized to block (stability = success). Error budgets break this deadlock: both teams share a single number, both teams lose when it is exhausted, and both teams benefit when it is healthy. The error budget converts a political negotiation into an engineering conversation.

The model also works because it respects engineer time. The toil cap is not just a productivity measure — it is a retention measure. SREs who are 100% on-call firefighting burn out and leave. SREs who spend half their time building tools that make on-call better stay, grow, and produce compounding reliability improvements.

Where the model struggles: organizations that lack the cultural maturity or executive support for SRE teams to actually push back on product teams when error budgets are exhausted. If leadership overrides the SRE brake on releases, the model collapses — engineers spend their engineering time building a reliability system nobody enforces, and they burn out anyway. SRE requires organizational authority, not just engineering practice.

Production pitfall: Many companies rename their ops teams "SRE" without changing the underlying incentive structure or giving the team engineering time. If your "SRE" team has no toil cap, no SLOs, and no authority to slow down releases, you have an ops team with a better job title. The engineering rigor — measured SLOs, error budgets, toil tracking, PRRs — is what makes SRE different, not the name.

What the Rest of This Tutorial Covers

This tutorial systematically builds your SRE practice from first principles. The next lesson goes deep on SLIs and SLOs: how to choose the right indicators for different service types, how to set realistic targets, and the common mistakes that produce SLOs nobody trusts. Later lessons cover error budgets, toil measurement, release engineering, capacity planning, and production readiness reviews. In the capstone, you will write a complete SLO and error budget policy for a realistic production service. Every lesson ties back to the model you have just learned: operations as a software problem, reliability as a measured, shared responsibility.