Site Reliability Engineering (SRE)

SRE Team Models & Engagement

Knowing what SRE does is only half the story. The other half is knowing how SRE teams are structured relative to product engineering — and how they engage with services over time. These decisions directly determine whether SRE is a strategic reliability multiplier or just a renamed ops team perpetually drowning in pages.

At Google scale, the model is well-defined: a central SRE organization owns production for a curated set of services, engages via formal Production Readiness Reviews, and can hand services back when reliability degrades. But at Stripe, Netflix, Shopify, and most modern big-tech companies, the model has evolved. Understanding the spectrum — and when to choose each point on it — is a core SRE competency.

The Spectrum: Embedded vs. Platform SRE

SRE team models differ along two axes: how closely the SRE works with the product team, and who holds the pager for the service.

SRE team models span a spectrum from centralized pager ownership (Google classic) to fully distributed dev-on-call (Amazon). Most mature orgs land in the middle — Platform SRE or Embedded SRE.

Central SRE: The Google Classic Model

In the original Google model, SRE is a separate engineering organization. Product teams develop services and hand them to SRE for production operation, but only after passing a rigorous Production Readiness Review. Once accepted, SRE holds the pager. The product team is freed from on-call but is obligated to respond to SRE escalations and to fix reliability issues that put the agreed SLOs at risk.

The central model creates deep specialization. SREs develop expertise across many services, recognize cross-cutting failure modes, and build tooling used company-wide. The cost is coordination overhead: every change in a service that SRE owns requires SRE involvement. At Google scale, with thousands of SREs and mature engagement processes, this works. At a 200-person company, the overhead usually kills velocity.

Why Google can run this model: Google built Borg (the internal cluster manager that later inspired Kubernetes), Spanner, Monarch, and dozens of other internal reliability platforms over decades. The platform is so mature that SREs can context-switch between services without rebuilding tribal knowledge each time. Without that infrastructure foundation, the central model fragments into organizational silos.

Platform SRE: Scaling Without Owning

The Platform SRE model is the dominant pattern at hyperscale non-Google companies. A dedicated SRE team owns the reliability infrastructure — the observability stack, the incident management platform, the deployment pipelines, the chaos tooling, the on-call scheduling system, the SLO dashboard framework — but does not own individual service pagers. Product teams are on-call for their own services, but they use standardized, SRE-built tooling to do it.

The key insight: reliability at scale is primarily a tooling and standards problem, not a staffing problem. If every product team deploys through the same pipeline, alerts in the same system, and writes SLOs in the same format, one SRE team can serve hundreds of product teams. Netflix's Spinnaker, Shopify's internal reliability platform, and Stripe's observability infrastructure all follow this pattern.

Platform SRE deliverables: A healthy Platform SRE team should be able to hand any product team: a golden-signal dashboard template, a runbook scaffold, an on-call rotation setup guide, an SLO definition format and burn-rate alerting rules, and a deployment pipeline with automated rollback hooks. If product teams are building these from scratch, Platform SRE has failed its mission.
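
One way to picture the "same SLO format" idea is a small shared schema that every team fills in, with a burn-rate helper built on top of it. The sketch below is illustrative only: the `SloSpec` class, its field names, and the `should_page` helper are hypothetical, and the 14.4 default is a commonly cited fast-burn paging threshold for a 1-hour window against a 30-day budget, not a value taken from this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloSpec:
    """Hypothetical shared SLO format that every product team fills in."""
    service: str
    sli: str              # human-readable description of the indicator
    target: float         # e.g. 0.999 means 99.9% of events must be good
    window_days: int = 30

    @property
    def error_budget(self) -> float:
        """Fraction of events that may fail over the window."""
        return 1.0 - self.target


def burn_rate(spec: SloSpec, good: int, total: int) -> float:
    """Observed error rate divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    observed_error_rate = 1.0 - good / total
    return observed_error_rate / spec.error_budget


def should_page(spec: SloSpec, good: int, total: int,
                threshold: float = 14.4) -> bool:
    """Page when the short-window burn rate reaches the (assumed) threshold."""
    return burn_rate(spec, good, total) >= threshold


# Usage: a 99.9% SLO with 1% of recent requests failing burns at about 10x.
checkout = SloSpec("checkout", "fraction of HTTP requests answered without 5xx", 0.999)
print(round(burn_rate(checkout, good=9_900, total=10_000), 2))
print(should_page(checkout, good=9_900, total=10_000))
```

Because every team expresses its SLO through the same structure, a single alerting rule written by the platform team can, in principle, be reused unchanged across services.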

Embedded SRE: The Feedback-Loop Model

In the embedded model, SREs are assigned to specific product organizations — one or two SREs per product area — and sit alongside product engineers. They co-own the pager, participate in architecture reviews, write reliability requirements into design docs, and act as the reliability conscience of the team. Their dotted-line reporting may go to a central SRE org for standards and career development, but their day-to-day work is with the product team.

The embedded model produces the fastest reliability improvement feedback loop because the SRE directly observes product decisions in real time. When a product engineer designs a synchronous call chain to three downstream services with no fallback, the embedded SRE is in the room to flag the blast radius before any code is written. In a centralized model, that conversation often happens weeks later in a PRR — if at all.
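
The blast-radius concern in that example can be made concrete with a little arithmetic. This is a rough sketch under two simplifying assumptions: the three downstream calls fail independently, and the request needs all of them to succeed. The 99.9% availability figures are made up for illustration.

```python
from math import prod


def serial_availability(dependency_availabilities):
    """Availability of a request that needs every dependency to succeed."""
    return prod(dependency_availabilities)


def minutes_down_per_30_days(availability):
    """Expected unavailable minutes in a 30-day window."""
    return (1.0 - availability) * 30 * 24 * 60


chain = serial_availability([0.999, 0.999, 0.999])
print(round(chain, 6))                              # 0.997003
print(round(minutes_down_per_30_days(0.999), 1))    # 43.2
print(round(minutes_down_per_30_days(chain), 1))    # 129.5
```

Under these assumptions, chaining three 99.9% dependencies with no fallback roughly triples the expected downtime, which is the kind of number an embedded SRE can put in front of the team during design review.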

Isolation risk: Embedded SREs who go too long without contact with their SRE peers drift. They adopt the product team's culture, priorities, and blind spots. After six months without cross-SRE review, they may not realize their team's SLOs are defined incorrectly, their runbooks are outdated, or their alert thresholds were never calibrated. Successful embedded programs run quarterly SRE syncs, shared incident retrospectives, and regular rotations through the Platform SRE team to prevent this.

The Engagement Lifecycle

Regardless of the team model, SRE engagement with a service follows a predictable lifecycle. Understanding this lifecycle — and engineering the transitions deliberately — is what separates mature SRE practices from ad-hoc operations.

# Typical SRE engagement lifecycle (formalized as stages)

Stage 0: Pre-engagement
  - Product team builds service, owns everything
  - SRE may consult informally on architecture
  - No formal reliability contract

Stage 1: Engagement kickoff (triggered by scale or criticality)
  - Product team requests SRE engagement
  - SRE conducts a Production Readiness Review (PRR):
      * Is there an SLO with an SLI and measurement pipeline?
      * Is there an on-call rotation with documented runbooks?
      * Is the deployment process automated with rollback capability?
      * Have load tests been run to 2x expected peak?
      * Are dependency SLOs known and accounted for?
      * Is there a capacity plan for the next 6 months?
  - PRR result: PASS, CONDITIONAL PASS (with action items), or FAIL

Stage 2: Active SRE support
  - SRE co-owns on-call or takes full pager (model dependent)
  - Weekly reliability review: error budget burn rate, toil trend, incident count
  - SRE reviews all production changes above a defined blast-radius threshold
  - Quarterly reliability retrospective with product team lead

Stage 3: Stable operation
  - SLOs consistently met
  - Error budget rarely strained
  - SRE operational load is sustainable (toil at or below the 50% cap)
  - Product team is reliability-capable and self-sufficient

Stage 4: Handback (centralized and embedded models)
  - Product team trained to own production
  - Runbooks, dashboards, alerts fully documented
  - Formal handback ceremony: product team takes pager
  - SRE remains available for consultation (no more on-call responsibility)
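
The PRR step in Stage 1 can be sketched as a small evaluation function. The lesson names the three outcomes but not the rule that produces them, so the rule here is an assumption: any failed blocking check means FAIL, failed non-blocking checks mean CONDITIONAL PASS with action items, and otherwise PASS. The `PrrCheck` type and the choice of which checks are blocking are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class PrrResult(Enum):
    PASS = "PASS"
    CONDITIONAL_PASS = "CONDITIONAL PASS"
    FAIL = "FAIL"


@dataclass(frozen=True)
class PrrCheck:
    question: str
    passed: bool
    blocking: bool  # assumption: the lesson does not say which checks block


def evaluate_prr(checks):
    """Return (result, action_items) for a list of PrrCheck objects."""
    failed = [c for c in checks if not c.passed]
    action_items = [c.question for c in failed]
    if any(c.blocking for c in failed):
        return PrrResult.FAIL, action_items
    if failed:
        return PrrResult.CONDITIONAL_PASS, action_items
    return PrrResult.PASS, []


checks = [
    PrrCheck("SLO with an SLI and measurement pipeline", passed=True, blocking=True),
    PrrCheck("Automated deployment with rollback", passed=True, blocking=True),
    PrrCheck("Capacity plan for the next 6 months", passed=False, blocking=False),
]
result, todo = evaluate_prr(checks)
print(result.value, todo)
```

Returning the action items alongside the result keeps the review outcome actionable: a conditional pass always comes with the list of things the product team still owes.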

Handing Back the Pager: The Most Underrated SRE Skill

In centralized and embedded models, the handback — transferring on-call ownership back to the product team — is one of the most operationally sensitive transitions in SRE. Done poorly, it leaves the product team holding a pager for a system they do not understand, runbooks they have never tested, and alerts they cannot interpret. Done well, it is the moment an SRE's investment in a service pays the highest return: a product team that is now self-sufficient in production.

Google's model explicitly anticipates handbacks as a reliability incentive. When a product team repeatedly exhausts its error budget and SRE bears the on-call burden, SRE has organizational authority to hand the service back — forcing the product team to live with the reliability they have built. This is not punitive; it is structural. The team that wakes up at 2am for their own service has much stronger incentives to invest in reliability than the one that knows SRE will handle it.
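
To make "repeatedly exhausts its error budget" concrete, here is a minimal sketch of the bookkeeping. The function names are hypothetical, and the trigger of two exhausted windows is an assumption chosen for illustration; the actual threshold belongs in each organization's error budget policy.

```python
def budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in a window; negative means overspent."""
    allowed_failures = (1.0 - target) * total
    if allowed_failures == 0:
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def handback_candidate(target: float, windows, max_exhausted: int = 2) -> bool:
    """True if the budget was exhausted in at least `max_exhausted` windows.

    `windows` is a list of (total_requests, failed_requests) tuples,
    one per SLO window.
    """
    exhausted = sum(
        1 for total, failed in windows
        if budget_remaining(target, total, failed) <= 0
    )
    return exhausted >= max_exhausted


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures per window.
history = [(1_000_000, 1_500), (1_000_000, 400), (1_000_000, 2_000)]
print(handback_candidate(0.999, history))
```

Keeping the trigger in code, rather than in someone's memory, makes the handback conversation about agreed numbers instead of about who is more frustrated.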

# Handback readiness checklist (YAML format — can be embedded in a PRR template)

handback_readiness:
  documentation:
    - runbooks_complete: true          # Every alert has a linked runbook
    - architecture_diagram_current: true
    - dependency_map_documented: true
    - on_call_guide_reviewed_by_team: true

  tooling:
    - slo_dashboard_accessible: true
    - alert_routing_configured: true   # Alerts go to product team rotation
    - deployment_pipeline_tested: true # Team has run a rollback drill
    - log_queries_bookmarked: true     # Common triage queries saved

  training:
    - team_shadowed_oncall: true       # Min 2 sprints shadowing SRE on-call
    - incident_drills_conducted: 2     # Gameday exercises run with product team
    - escalation_path_documented: true # Who to call when team is stuck

  agreement:
    - error_budget_policy_signed: true # Team lead has acknowledged policy
    - escalation_sla_agreed: true      # SRE will respond within N min if called
    - review_cadence_set: true         # Quarterly reliability review scheduled
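
A checklist like this is most useful when something reads it. The sketch below assumes the YAML has already been loaded into plain Python structures (for example by a YAML parser) with the same shape as above: sections mapping to lists of single-key items. The `unmet_items` helper is hypothetical, and treating the drill count of 2 as a minimum is an assumption.

```python
def unmet_items(checklist, min_drills=2):
    """Return 'section.item' names that are false or below the drill minimum."""
    unmet = []
    for section, items in checklist.items():
        for item in items:                      # each item is a single-key mapping
            for name, value in item.items():
                if isinstance(value, bool):     # check bool first: bool is an int subtype
                    ok = value
                else:                           # numeric entries are treated as counts
                    ok = value >= min_drills
                if not ok:
                    unmet.append(f"{section}.{name}")
    return unmet


handback_readiness = {
    "documentation": [{"runbooks_complete": True}, {"dependency_map_documented": True}],
    "tooling": [{"deployment_pipeline_tested": False}],
    "training": [{"team_shadowed_oncall": True}, {"incident_drills_conducted": 1}],
    "agreement": [{"error_budget_policy_signed": True}],
}
print(unmet_items(handback_readiness))
# ['tooling.deployment_pipeline_tested', 'training.incident_drills_conducted']
```

An empty list means the handback can proceed; anything else is the remaining work, already named in the checklist's own vocabulary.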

SRE Engagement Anti-Patterns to Avoid

At big-tech companies, the following failure modes appear repeatedly across SRE programs:

Shadow operations: Product teams bypass SRE because the engagement process is too slow. They deploy directly, skip PRRs, and manage production informally. SRE engagement must be faster than the alternative, or teams route around it.
Permanent engagement: A service never reaches the handback stage. SRE owns it forever, the product team atrophies in reliability skills, and the SRE team cannot take on new services without hiring. Every engagement must have a stated exit condition.
PRR as a gate, not a partnership: If the PRR is purely a checklist that product teams resent, it will be gamed. PRRs work when they are collaborative — SRE brings patterns from other services, product team brings domain knowledge, and both leave with a better system.
On-call as ownership: Teams conflate "SRE is on-call" with "SRE is responsible for reliability." In a healthy model, SRE is a reliability partner, not a reliability owner. The product team always owns the system; SRE provides the expertise and bandwidth to operate it safely.

Choosing your model: For organizations under 200 engineers, use embedded SRE or "you build it, you run it" (YBIYRI) with a small platform team. From 200 to 2,000 engineers, use Platform SRE plus embedded specialists for tier-0 services. Above 2,000 engineers, consider central SRE for critical infrastructure (payments, auth, data plane) with Platform SRE for application teams. The model should evolve as the org scales: what works at Series A will not work at Series D.
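
The sizing guidance can be written down as a literal lookup, which is sometimes handy in planning docs or internal tooling. This is only an encoding of the rule of thumb above, not a decision procedure; the function name is hypothetical, and placing an org of exactly 2,000 engineers in the middle tier is an assumption, since the stated ranges overlap at that point.

```python
def suggest_sre_model(engineers: int) -> str:
    """Map engineering headcount to the lesson's suggested starting model."""
    if engineers < 0:
        raise ValueError("engineers must be non-negative")
    if engineers < 200:
        return "embedded SRE or YBIYRI with a small platform team"
    if engineers <= 2000:  # assumption: 2,000 itself falls in the middle tier
        return "Platform SRE plus embedded specialists for tier-0 services"
    return "central SRE for critical infrastructure, Platform SRE for application teams"


for size in (50, 800, 5000):
    print(size, "->", suggest_sre_model(size))
```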

The engagement lifecycle and team model are not static. As your organization matures, your SRE model should evolve with it — toward more platform leverage, more product-team ownership, and fewer humans standing between code and production. The goal is not more SREs; it is a culture where every engineer thinks like an SRE and a platform team that makes that easy.