Site Reliability Engineering (SRE)

Reliability Reviews & Production Readiness

Every outage has a pre-history. Somewhere before the pager fires, a team made a decision — or skipped one — that made the failure possible. The Production Readiness Review (PRR) is SRE's answer to that pattern: a structured, pre-launch gate that forces engineering teams to answer the hard questions before a service goes live, not after users are impacted. At Google, no service moves into SRE on-call coverage without passing a PRR. At Netflix, it is embedded in the deployment pipeline under the name "LaunchReady." Stripe runs a "Production Readiness Checklist" review as part of every significant release. The specifics differ; the intent is identical — make the implicit explicit and get it into a document before a single production request is handled.

This lesson covers the anatomy of a PRR, how to write and run one, the failure modes that kill launches, and the checklist patterns used at big-tech scale.

What a PRR Is — and Is Not

A PRR is not a security audit, a design review, or a performance test — though it references all three. It is a reliability conversation between the team building the service and the team that will operate it (often SRE). Its job is to answer: "If this service fails at 3 a.m. on a Sunday, can we detect it, contain it, and recover — without heroics?"

A PRR typically produces two artifacts: a PRR document (the completed questionnaire with evidence) and a launch blocking list (gaps that must be resolved before launch). Items on the launch blocking list are not optional suggestions. P0 items are hard gates: a service with P0 blockers does not go live. P1 items may ship only with a named owner and a due date.

PRRs are a forcing function, not a ceremony. The value is not in the document — it is in the conversation that the document forces. A team that has to answer "what is our RTO for a complete database failure?" will either discover they have not thought about it (good: fix it before launch) or confirm they have (good: confidence). Both outcomes improve reliability. Skip the PRR and you get neither.

The PRR Lifecycle

A PRR is not a one-time event. It follows the service through its lifecycle:

  1. Pre-launch PRR — before the first production request. The most important one. Catches fundamental gaps in observability, runbooks, and failure planning.
  2. Significant change PRR — triggered by major architectural changes (new database, new dependency, 10x traffic projection). Not every release; only changes that materially change the risk profile.
  3. Annual PRR refresh — services evolve. The runbook written 18 months ago may no longer reflect reality. Annual reviews catch drift between documentation and implementation.

Figure: PRR lifecycle. Design review (architecture and risk) feeds the pre-launch PRR (full checklist review); P0/P1 blockers loop back for fix and re-review; once launch is approved, SRE coverage begins. Significant changes (new DB, 10x scale, architecture shift) and annual refreshes (runbook drift, new risk) re-enter the PRR gate, so the checklist never expires.

The PRR Checklist: What Big Tech Actually Asks

The checklist is the heart of the PRR. Every item has an owner, a status (Met / Not Met / N/A), and evidence. Below are the domains and the specific questions that block launch when unanswered.

1. Observability

  • Are the four golden signals (latency, traffic, errors, saturation) instrumented and dashboarded?
  • Is there a Tier-0 dashboard that shows SLO burn rate in real time?
  • Are structured logs being emitted and queryable in the log aggregation platform?
  • Are distributed traces enabled and sampled at a rate that preserves p99 coverage?
  • Is there a synthetic canary or health-check endpoint that is hit from outside the cluster?

2. Alerting

  • Are SLO-burn-rate alerts defined in code (not in the UI)?
  • Is the on-call rotation configured and tested (have all members received a test page)?
  • Are alert severity levels defined (P0 wakes someone; P1 is next-business-day)?
  • Is alert fatigue tested: have you estimated the expected page volume and confirmed it stays below 5 actionable pages per on-call shift?
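
To make the burn-rate questions above concrete, here is a minimal sketch of the arithmetic behind an SLO burn-rate alert. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the default threshold of 14.4 is an assumption: it is a commonly cited value for a one-hour/five-minute window pair against a 30-day SLO, not something this lesson prescribes.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how many times faster than budgeted the error budget is burning.

    1.0 means the budget lasts exactly the SLO period; 10.0 means it is
    gone in a tenth of that time.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return (errors / requests) / error_budget


def should_page(long_window: float, short_window: float, threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold (assumed default).

    The long window shows the burn is significant; the short window shows
    it is still happening, so a blip that already recovered pages no one.
    """
    return long_window >= threshold and short_window >= threshold


# Hypothetical request counts for a 99.9% SLO.
last_hour = burn_rate(errors=200, requests=10_000, slo_target=0.999)  # roughly 20x
last_5_min = burn_rate(errors=1, requests=1_000, slo_target=0.999)    # roughly 1x
print(should_page(last_hour, last_5_min))  # False: the burst has already subsided
```

In practice the same expression would live in the alerting system's rule language and be committed to the repository; the Python form only illustrates the calculation a reviewer should expect to see.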

3. Runbooks & Incident Response

  • Is there a runbook for every P0 alert? Does it pass the "3 a.m. test" — would an engineer unfamiliar with the service be able to follow it?
  • Is the incident management process documented (who to page, escalation path, communication channel)?
  • Has a game-day exercise been run against at least one failure scenario?
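
The "runbook for every P0 alert" question lends itself to a mechanical check. The sketch below assumes alert definitions can be loaded into simple records with a name, a severity, and a runbook link; the `Alert` type, its field names, and the example URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str          # "P0" or "P1"
    runbook_url: str = ""  # empty string means no runbook is linked


def p0_alerts_missing_runbooks(alerts):
    """Return the sorted names of P0 alerts that have no runbook link."""
    return sorted(
        a.name for a in alerts
        if a.severity == "P0" and not a.runbook_url.strip()
    )


# Hypothetical alert inventory for illustration.
alerts = [
    Alert("HighErrorBudgetBurn", "P0", "https://wiki.internal/runbooks/burn"),
    Alert("DatabaseUnreachable", "P0"),
    Alert("DiskUsageHigh", "P1"),
]
print(p0_alerts_missing_runbooks(alerts))  # ['DatabaseUnreachable']
```

A check like this only proves a link exists. Whether the runbook passes the "3 a.m. test" still needs a human reader.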

4. Capacity & Traffic Management

  • Is the launch traffic profile defined (expected RPS, p99 latency target, data volume)?
  • Has load testing been run at 2x the expected launch traffic? At 10x?
  • Are there load shedding or circuit breaker mechanisms? What happens when an upstream dependency fails?
  • Is autoscaling configured, tested, and capped (what is the ceiling, and have you modelled cost at that ceiling)?
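
For reviewers who want to see what "a circuit breaker with a fallback" means in code, here is a minimal, single-threaded sketch. It is not the API of any particular library; the class name, parameters, and defaults are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock      # injectable so tests can control time
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The PRR question is less about the mechanism than about the fallback: what does the caller return while the circuit is open, and is that behaviour acceptable to users?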

5. Rollback & Recovery

  • Is there a tested rollback procedure? How long does it take?
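  • Are database migrations reversible? If not, is there a tested recovery path, such as a pre-migration snapshot or a roll-forward fix? (A read replica replays the same migration, so promoting one does not undo it.)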
  • Is the RTO (Recovery Time Objective) defined, and does it match the SLA?
  • Has a DR test been conducted within the last six months?

6. Security & Compliance

  • Are secrets managed via a secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager), never hardcoded in images or in environment variables baked into them?
  • Is mTLS enabled between services, or is the network boundary otherwise secured?
  • Has a threat model been completed?
  • Are audit logs enabled for all data-plane operations?

Automate the evidence collection. The most painful part of a PRR is gathering evidence ("show me the dashboard," "link the runbook"). Wire your PRR template to your tooling: a GitHub Actions workflow that opens a PRR issue pre-populated with links to the Grafana dashboard, the on-call schedule in PagerDuty, and the load-test results in Datadog means the team spends time fixing gaps, not hunting for links.
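
As a rough sketch of the pre-population step, the function below renders an issue body from a mapping of evidence items to links and flags anything missing. Creating the issue itself is left to whatever CI integration you use. The function name, item labels, and URLs are hypothetical.

```python
def render_prr_issue(service, evidence):
    """Build a plain-text issue body listing each evidence item and its link.

    `evidence` maps an item label to a URL; an empty URL marks a gap.
    """
    lines = [f"PRR evidence for {service}", ""]
    for item, url in evidence.items():
        box = "[x]" if url else "[ ]"
        lines.append(f"- {box} {item}: {url or 'MISSING'}")
    return "\n".join(lines)


# Hypothetical links for illustration only.
body = render_prr_issue(
    "payment-processor",
    {
        "Golden signals dashboard": "https://grafana.internal/d/payment-overview",
        "On-call schedule": "",
        "Load-test results": "https://example.invalid/load-test/latest",
    },
)
print(body)
```

Unticked boxes show the team where the gaps are before the review meeting starts.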

Writing the Launch Checklist as Code

The checklist lives in version control alongside the service. A simple, durable format is a YAML file committed to the service repository — parsed by CI to enforce launch gates:

    # .prr/checklist.yaml: committed alongside service code
    service: payment-processor
    version: "1.4.0"
    prr_status: approved  # draft | in_review | approved | blocked
    approved_by: sreteam@company.com
    approved_at: "2025-09-15"

    observability:
      golden_signals_dashboard:
        status: met
        evidence: "https://grafana.internal/d/payment-overview"
      slo_burn_alert:
        status: met
        evidence: "https://github.com/org/payment/blob/main/alerts/slo.yaml"
      distributed_tracing:
        status: met
        evidence: "Tempo sampling at 10% base, 100% on error"
      synthetic_canary:
        status: met
        evidence: "Blackbox exporter probing /healthz every 30s from us-east-1, eu-west-1"

    alerting:
      oncall_rotation_tested:
        status: met
        evidence: "Test page sent to all 4 rotation members 2025-09-12"
      alert_volume_within_threshold:
        status: met
        evidence: "Projected 2.1 pages/week based on staging burn rate"

    capacity:
      load_test_2x:
        status: met
        evidence: "k6 run at 1,200 RPS (2x 600 baseline); p99 <= 180ms, 0 errors"
      load_test_10x:
        status: not_met  # BLOCKER
        blocker_severity: P1
        blocker_owner: platform-team
        due: "2025-09-22"
        notes: "10x test shows DB connection pool exhaustion at 6,000 RPS. PgBouncer config update in progress."
      circuit_breaker:
        status: met
        evidence: "Resilience4j configured; fallback returns cached response"

    rollback:
      rollback_procedure_tested:
        status: met
        evidence: "Rollback drill completed 2025-09-10, RTO measured at 4 min"
      database_migrations_reversible:
        status: met
        evidence: "All migrations use up/down; tested against prod-clone"

    security:
      secrets_in_vault:
        status: met
        evidence: "All env vars read from Vault dynamic secrets; no secrets in image"
      threat_model_complete:
        status: met
        evidence: "STRIDE model in Confluence: https://wiki.internal/payment-threat-model"

A CI job fails the merge if any item has status: not_met and blocker_severity: P0. P1 blockers trigger a warning annotation on the pull request but do not block — they must be resolved before the post-launch review.

    #!/usr/bin/env python3
    # scripts/prr_gate.py: run in CI on every release branch push
    import pathlib
    import sys

    import yaml  # PyYAML

    checklist = yaml.safe_load(pathlib.Path(".prr/checklist.yaml").read_text())

    p0_blockers = []
    for domain, items in checklist.items():
        # Top-level metadata (service, version, prr_status, ...) is not a domain.
        if not isinstance(items, dict):
            continue
        for item_name, item in items.items():
            if not isinstance(item, dict):
                continue
            if item.get("status") == "not_met" and item.get("blocker_severity") == "P0":
                p0_blockers.append(f"  - {domain}.{item_name}: {item.get('notes', 'no notes')}")

    if p0_blockers:
        print("PRR GATE FAILED: P0 blockers present:")
        print("\n".join(p0_blockers))
        sys.exit(1)

    print("PRR gate passed. No P0 blockers.")
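
The gate script above only checks for P0 blockers. One way to produce the P1 warning annotations is sketched below, assuming the CI system is GitHub Actions, whose `::warning` workflow command turns a printed line into an annotation. The helper names are hypothetical, and the checklist is passed in as an already-parsed dictionary so the sketch has no dependencies.

```python
def find_blockers(checklist, severity):
    """Return (dotted_key, item) pairs for unmet items of the given severity."""
    found = []
    for domain, items in checklist.items():
        if not isinstance(items, dict):
            continue  # skip top-level metadata such as service or version
        for name, item in items.items():
            if (
                isinstance(item, dict)
                and item.get("status") == "not_met"
                and item.get("blocker_severity") == severity
            ):
                found.append((f"{domain}.{name}", item))
    return found


def p1_warning_lines(checklist):
    """Format each P1 blocker as a GitHub Actions warning command."""
    lines = []
    for key, item in find_blockers(checklist, "P1"):
        owner = item.get("blocker_owner", "unassigned")
        due = item.get("due", "no due date")
        lines.append(f"::warning title=PRR P1 blocker::{key} (owner: {owner}, due: {due})")
    return lines


# A trimmed copy of the checklist shown earlier.
sample = {
    "service": "payment-processor",
    "capacity": {
        "load_test_2x": {"status": "met"},
        "load_test_10x": {
            "status": "not_met",
            "blocker_severity": "P1",
            "blocker_owner": "platform-team",
            "due": "2025-09-22",
        },
    },
}
for line in p1_warning_lines(sample):
    print(line)
```

Printing the owner and due date in the annotation keeps the follow-up visible on every pull request until the item is closed.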

Taking a Service to Production: The Launch Day Sequence

PRR approval does not mean "deploy immediately." Big-tech launches follow a controlled rollout sequence that limits blast radius and confirms reliability at each step before expanding exposure.

  1. Internal canary (0.1% of production traffic): deploy to one shard and confirm golden signals are nominal for 24 hours. Any SLO burn above 10x triggers automatic rollback.
  2. Expanded canary (1–5%): 48-hour soak. The SLO burn alert threshold drops from 10x to 2x. Confirm that the p99 latency target is met under real user load, not synthetic load.
  3. Region-by-region rollout (10% → 25% → 50% → 100%): each stage is gated by a manual approval step in the deployment pipeline. The on-call engineer signs off; the pipeline records who approved and when.
  4. Post-launch review (48–72 hours after full rollout): confirm the SLO is being met, burn rate is nominal, and there are no unexpected long-tail errors.

The most dangerous time is right after a successful launch. Teams relax — the service "works" — and deferred PRR items stay deferred. P1 blockers from the PRR must have an owner and a due date enforced by your project management system. A service that goes live with unresolved P1s and no follow-up calendar entry will carry those risks indefinitely. Close the loop within two sprints.
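
The staged rollout described in this section can be expressed as a small decision table. The sketch below takes the traffic percentages, soak times, and burn thresholds from the sequence above. Two things are assumptions: the 2x threshold is carried over to the regional stages, and those stages have no soak time beyond manual approval. The `Stage` type and `decide` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_percent: float
    soak_hours: int
    max_burn_rate: float
    needs_manual_approval: bool = False


STAGES = [
    Stage("internal canary", 0.1, soak_hours=24, max_burn_rate=10.0),
    Stage("expanded canary", 5.0, soak_hours=48, max_burn_rate=2.0),
    # Assumed: regional stages keep the 2x threshold and gate on approval only.
    Stage("rollout 10%", 10.0, soak_hours=0, max_burn_rate=2.0, needs_manual_approval=True),
    Stage("rollout 25%", 25.0, soak_hours=0, max_burn_rate=2.0, needs_manual_approval=True),
    Stage("rollout 50%", 50.0, soak_hours=0, max_burn_rate=2.0, needs_manual_approval=True),
    Stage("rollout 100%", 100.0, soak_hours=0, max_burn_rate=2.0, needs_manual_approval=True),
]


def decide(stage, burn_rate, hours_soaked, approved=False):
    """Return 'rollback', 'hold', or 'promote' for the current stage."""
    if burn_rate > stage.max_burn_rate:
        return "rollback"  # budget is burning too fast: back out automatically
    if hours_soaked < stage.soak_hours:
        return "hold"      # healthy so far, but the soak period is not over
    if stage.needs_manual_approval and not approved:
        return "hold"      # waiting for the on-call engineer to sign off
    return "promote"


print(decide(STAGES[0], burn_rate=12.0, hours_soaked=3))  # rollback
```

Checking the burn rate first means a stage that is both unhealthy and unapproved rolls back rather than waiting on a human.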

Production Failure Modes That PRRs Catch

  • Missing runbooks for known failure modes: the team knows the database can fail but has never written down what to do. The PRR checklist question "is there a runbook for every P0 alert?" makes this visible.
  • Untested rollback: a rollback that has never been drilled can take 45 minutes under incident pressure, while the same procedure drilled three times can take 4. PRRs require evidence, not just documentation.
  • Connection pool misconfiguration at scale: the service works fine at 100 RPS in staging but exhausts the database connection pool at 800 RPS in production. The 10x load-test requirement surfaces this before launch.
  • Silent dependency on a single availability zone: the service is "multi-region" in the architecture diagram but all caching is in one AZ. The PRR's "DR test" item catches this drift.
  • Alert-to-runbook gaps: an alert fires but the on-call engineer does not know what it means or what to do. Every P0 alert must map 1:1 to a runbook section that explains the signal, its causes, and the remediation steps.
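
The connection pool failure mode above can be estimated before any load test, using Little's law: average connections in use equals request rate multiplied by the time each request holds a connection. The sketch below uses the 100 and 800 RPS figures from the example; the 50 ms hold time and the pool size of 20 are assumptions picked for illustration.

```python
import math


def connections_needed(rps, db_ms_per_request):
    """Average connections in use: arrival rate times hold time (Little's law)."""
    return math.ceil(rps * db_ms_per_request / 1000)


def max_rps_for_pool(pool_size, db_ms_per_request):
    """Request rate at which the pool is, on average, fully occupied."""
    return pool_size * 1000 / db_ms_per_request


POOL_SIZE = 20   # assumed pool size
HOLD_MS = 50     # assumed time each request holds a connection

print(connections_needed(100, HOLD_MS))      # 5: comfortable in staging
print(connections_needed(800, HOLD_MS))      # 40: double the assumed pool
print(max_rps_for_pool(POOL_SIZE, HOLD_MS))  # 400.0: saturation point
```

These are averages, so real traffic with bursts and slow queries will saturate the pool earlier. The load test remains the evidence; the arithmetic only tells you where to look.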

Production Readiness Reviews are the difference between a service that survives its first incident and one that creates one. The checklist is not bureaucracy — it is the codification of every production failure mode your organization has ever suffered, turned into a gate that stops the next one before it starts.