Site Reliability Engineering (SRE)

Reliability Reviews & Production Readiness

Every outage has a pre-history. Somewhere before the pager fires, a team made a decision — or skipped one — that made the failure possible. The Production Readiness Review (PRR) is SRE's answer to that pattern: a structured, pre-launch gate that forces engineering teams to answer the hard questions before a service goes live, not after users are impacted. At Google, no service moves into SRE on-call coverage without passing a PRR. At Netflix, it is embedded in the deployment pipeline under the name "LaunchReady." Stripe runs a "Production Readiness Checklist" review as part of every significant release. The specifics differ; the intent is identical — make the implicit explicit and get it into a document before a single production request is handled.

This lesson covers the anatomy of a PRR, how to write and run one, the failure modes that kill launches, and the checklist patterns used at big-tech scale.

What a PRR Is — and Is Not

A PRR is not a security audit, a design review, or a performance test — though it references all three. It is a reliability conversation between the team building the service and the team that will operate it (often SRE). Its job is to answer: "If this service fails at 3 a.m. on a Sunday, can we detect it, contain it, and recover — without heroics?"

A PRR typically produces two artifacts: a PRR document (the completed questionnaire with evidence) and a launch blocking list (gaps that must be resolved, each with a severity). P0 items on the launch blocking list are not optional suggestions; they are hard gates. A service with open P0 blockers does not go live. P1 items may ship with the launch, but only with an owner and a due date.

PRRs are a forcing function, not a ceremony. The value is not in the document; it is in the conversation that the document forces. A team that has to answer "what is our RTO (recovery time objective) for a complete database failure?" will either discover they have not thought about it (good: fix it before launch) or confirm they have (good: confidence). Both outcomes improve reliability. Skip the PRR and you get neither.

The PRR Lifecycle

A PRR is not a one-time event. It follows the service through its lifecycle:

Pre-launch PRR: held before the first production request. The most important one. It catches fundamental gaps in observability, runbooks, and failure planning.
Significant change PRR: triggered by major architectural changes (new database, new dependency, 10x traffic projection). Not every release qualifies; only changes that materially alter the risk profile.
Annual PRR refresh: services evolve, and the runbook written 18 months ago may no longer reflect reality. Annual reviews catch drift between documentation and implementation.

PRR lifecycle: design flows into a pre-launch review, blockers loop back for remediation, and the approved service re-enters the PRR cycle for significant changes and annual refreshes.

The PRR Checklist: What Big Tech Actually Asks

The checklist is the heart of the PRR. Every item has an owner, a status (Met / Not Met / N/A), and evidence. Below are the domains and the specific questions that block launch when unanswered.

1. Observability

Are the four golden signals (latency, traffic, errors, saturation) instrumented and dashboarded?
Is there a Tier-0 dashboard that shows SLO burn rate in real time?
Are structured logs being emitted and queryable in the log aggregation platform?
Are distributed traces enabled and sampled so that slow (p99) and failed requests are still captured?
Is there a synthetic canary or health-check endpoint that is hit from outside the cluster?

2. Alerting

Are SLO-burn-rate alerts defined in code (not in the UI)?
Is the on-call rotation configured and tested (have all members received a test page)?
Are alert severity levels defined (P0 wakes someone; P1 is next-business-day)?
Has alert fatigue been assessed: have you estimated the expected alert volume and confirmed it is below 5 actionable pages per on-call shift?
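
The burn-rate alerts mentioned in this list rest on a simple ratio: the observed error ratio divided by the error budget. The sketch below assumes an availability SLO measured as good requests over total requests. The function names and the two-window paging policy are illustrative assumptions, not the API of any particular monitoring tool.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget lasts exactly the SLO window;
    10.0 means it would be exhausted in a tenth of the window.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - slo_target       # e.g. 0.001 for a 99.9% SLO
    observed_error_ratio = errors / total
    return observed_error_ratio / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float) -> bool:
    """Page only when both windows exceed the threshold.

    Requiring both windows is an assumed policy: the short window shows
    the problem is still happening, the long window shows it is not a blip.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    # 50 failed requests out of 10,000 against a 99.9% SLO
    print(round(burn_rate(50, 10_000, 0.999), 2))
```

Keeping the calculation in a reviewed function (or the equivalent alert rule file) is what "defined in code, not in the UI" buys you: the threshold change shows up in a diff.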

3. Runbooks & Incident Response

Is there a runbook for every P0 alert? Does it pass the "3 a.m. test" — would an engineer unfamiliar with the service be able to follow it?
Is the incident management process documented (who to page, escalation path, communication channel)?
Has a game-day exercise been run against at least one failure scenario?
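
The "runbook for every P0 alert" question can be checked mechanically. This is a minimal sketch that assumes alerts are available as dictionaries with `name` and `severity` keys and that runbook links live in a separate name-to-URL mapping; both shapes and all sample names are hypothetical.

```python
def alerts_missing_runbooks(alerts: list[dict], runbooks: dict[str, str]) -> list[str]:
    """Return the names of P0 alerts that have no runbook link."""
    missing = []
    for alert in alerts:
        if alert.get("severity") != "P0":
            continue  # only P0 alerts are required to have a runbook
        if not runbooks.get(alert["name"]):
            missing.append(alert["name"])
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical sample data for illustration only
    alerts = [
        {"name": "HighErrorBudgetBurn", "severity": "P0"},
        {"name": "DatabaseUnreachable", "severity": "P0"},
        {"name": "DiskUsageWarning", "severity": "P1"},
    ]
    runbooks = {"HighErrorBudgetBurn": "https://wiki.internal/runbooks/burn"}
    print(alerts_missing_runbooks(alerts, runbooks))
```

A check like this confirms that a link exists, not that the runbook passes the "3 a.m. test"; that still takes a human reader or a game day.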

4. Capacity & Traffic Management

Is the launch traffic profile defined (expected RPS, p99 latency target, data volume)?
Has load testing been run at 2x the expected launch traffic? At 10x?
Are there load shedding or circuit breaker mechanisms? What happens when an upstream dependency fails?
Is autoscaling configured, tested, and capped (what is the ceiling, and have you modelled cost at that ceiling)?

5. Rollback & Recovery

Is there a tested rollback procedure? How long does it take?
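Are database migrations reversible? If not, is there a tested roll-forward or restore-from-backup plan? (Promoting a read replica does not help here, because the replica has already received the same schema change.)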
Is the RTO (Recovery Time Objective) defined, and does it match the SLA?
Has a DR test been conducted within the last six months?

6. Security & Compliance

Are secrets managed via a secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager) and never hardcoded into images or baked-in environment variables?
Is mTLS enabled between services, or is the network boundary otherwise secured?
Has a threat model been completed?
Are audit logs enabled for all data-plane operations?

Automate the evidence collection. The most painful part of a PRR is gathering evidence ("show me the dashboard," "link the runbook"). Wire your PRR template to your tooling: a GitHub Actions workflow that opens a PRR issue pre-populated with links to the Grafana dashboard, the on-call schedule in PagerDuty, and the load-test results in Datadog means the team spends time fixing gaps, not hunting for links.
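
As a rough illustration of the pre-populated issue described above, the sketch below renders an issue body from a mapping of checklist item to evidence link. The function name, the item names, and the text layout are assumptions made for this example; actually opening the issue (for instance from a CI workflow) is left out.

```python
from typing import Optional


def render_prr_issue(service: str, evidence: dict[str, Optional[str]]) -> str:
    """Build a plain-text PRR issue body with one checkbox line per item."""
    lines = [f"PRR evidence for {service}", ""]
    for item, link in sorted(evidence.items()):
        mark = "x" if link else " "          # ticked only when a link exists
        lines.append(f"- [{mark}] {item}: {link or 'MISSING'}")
    missing = sum(1 for link in evidence.values() if not link)
    lines.append("")
    lines.append(f"{missing} item(s) still need evidence.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_prr_issue(
        "payment-processor",
        {
            "golden_signals_dashboard": "https://grafana.internal/d/payment-overview",
            "load_test_results": None,  # not collected yet
        },
    ))
```

The missing-evidence count at the bottom gives reviewers a quick read on how far the team is from a reviewable PRR.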

Writing the Launch Checklist as Code

The checklist lives in version control alongside the service. A simple, durable format is a YAML file committed to the service repository — parsed by CI to enforce launch gates:

# .prr/checklist.yaml  — committed alongside service code
service: payment-processor
version: "1.4.0"
prr_status: approved          # draft | in_review | approved | blocked
approved_by: sreteam@company.com
approved_at: "2025-09-15"

observability:
  golden_signals_dashboard:
    status: met
    evidence: "https://grafana.internal/d/payment-overview"
  slo_burn_alert:
    status: met
    evidence: "https://github.com/org/payment/blob/main/alerts/slo.yaml"
  distributed_tracing:
    status: met
    evidence: "Tempo sampling at 10% base, 100% on error"
  synthetic_canary:
    status: met
    evidence: "Blackbox exporter probing /healthz every 30s from us-east-1, eu-west-1"

alerting:
  oncall_rotation_tested:
    status: met
    evidence: "Test page sent to all 4 rotation members 2025-09-12"
  alert_volume_within_threshold:
    status: met
    evidence: "Projected 2.1 pages/week based on staging burn rate"

capacity:
  load_test_2x:
    status: met
    evidence: "k6 run at 1,200 RPS (2x 600 baseline) — p99 <= 180ms, 0 errors"
  load_test_10x:
    status: not_met          # P1 blocker: warns in CI, does not block launch
    blocker_severity: P1
    blocker_owner: platform-team
    due: "2025-09-22"
    notes: "10x test shows DB connection pool exhaustion at 6,000 RPS. PgBouncer config update in progress."
  circuit_breaker:
    status: met
    evidence: "Resilience4j configured; fallback returns cached response"

rollback:
  rollback_procedure_tested:
    status: met
    evidence: "Rollback drill completed 2025-09-10, RTO measured at 4 min"
  database_migrations_reversible:
    status: met
    evidence: "All migrations use up/down; tested against prod-clone"

security:
  secrets_in_vault:
    status: met
    evidence: "All env vars read from Vault dynamic secrets; no secrets in image"
  threat_model_complete:
    status: met
    evidence: "STRIDE model in Confluence: https://wiki.internal/payment-threat-model"

A CI job fails the merge if any item has status: not_met and blocker_severity: P0. P1 blockers trigger a warning annotation on the pull request but do not block — they must be resolved before the post-launch review.

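#!/usr/bin/env python3
# scripts/prr_gate.py: run in CI on every release branch push.
# Requires PyYAML (pip install pyyaml).
import pathlib
import sys

import yaml

checklist = yaml.safe_load(pathlib.Path(".prr/checklist.yaml").read_text())
blockers = {"P0": [], "P1": []}

for domain, items in checklist.items():
    if not isinstance(items, dict):
        continue  # skip top-level metadata (service, version, prr_status, ...)
    for item_name, item in items.items():
        if not isinstance(item, dict) or item.get("status") != "not_met":
            continue
        severity = item.get("blocker_severity")
        if severity in blockers:
            blockers[severity].append(
                f"{domain}.{item_name}: {item.get('notes', 'no notes')}"
            )

# P1 blockers warn but do not fail the build. The ::warning:: prefix is the
# GitHub Actions annotation syntax; other CI systems print it verbatim.
for line in blockers["P1"]:
    print(f"::warning::PRR P1 blocker open: {line}")

if blockers["P0"]:
    print("PRR GATE FAILED. P0 blockers present:")
    print("\n".join(f"  - {line}" for line in blockers["P0"]))
    sys.exit(1)

print("PRR gate passed. No P0 blockers.")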
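
The gate script only looks for P0 blockers, so it would pass a checklist in which an item is marked met with no evidence, or not_met with no owner. A complementary check could look like the sketch below. It works on the already-parsed checklist dictionary, and it assumes `n/a` as the spelling of the N/A status, which the example file does not show.

```python
REQUIRED_WHEN_NOT_MET = ("blocker_severity", "blocker_owner", "due")
KNOWN_STATUSES = ("met", "not_met", "n/a")  # "n/a" spelling is an assumption


def validate_checklist(checklist: dict) -> list[str]:
    """Return human-readable problems; an empty list means the file is well formed."""
    problems = []
    for domain, items in checklist.items():
        if not isinstance(items, dict):
            continue  # top-level metadata such as service or version
        for name, item in items.items():
            if not isinstance(item, dict):
                continue
            status = item.get("status")
            if status not in KNOWN_STATUSES:
                problems.append(f"{domain}.{name}: unknown status {status!r}")
            elif status == "met" and not item.get("evidence"):
                problems.append(f"{domain}.{name}: met without evidence")
            elif status == "not_met":
                for field in REQUIRED_WHEN_NOT_MET:
                    if not item.get(field):
                        problems.append(f"{domain}.{name}: not_met without {field}")
    return problems
```

Running this before the gate keeps "evidence, not just documentation" enforceable: a status cannot flip to met without a link or note attached.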

Taking a Service to Production: The Launch Day Sequence

PRR approval does not mean "deploy immediately." Big-tech launches follow a controlled rollout sequence that limits blast radius and confirms reliability at each step before expanding exposure.

Internal canary (0.1% of production traffic): deploy to one shard and confirm the golden signals are nominal for 24 hours. Any SLO burn rate above 10x triggers automatic rollback.
Expanded canary (1–5%): 48-hour soak. The SLO burn-rate threshold drops from 10x to 2x. Confirm that the p99 latency target is met under real user load, not just synthetic load.
Region-by-region rollout (10% → 25% → 50% → 100%): each stage is gated by a manual approval step in the deployment pipeline. The on-call engineer signs off; the pipeline records who approved and when.
Post-launch review (48–72 hours after full rollout): confirm the SLO is being met, the burn rate is nominal, and no unexpected errors or latency outliers are hiding in the long tail.
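
The decision at each canary stage can be written down as a small function. The sketch below encodes only the two canary stages, using the soak times and burn-rate thresholds given in the sequence above. The `Stage` type, the `next_action` name, and the three-way rollback/hold/promote outcome are assumptions for illustration. For the later region-by-region stages, "promote" would mean "eligible for manual approval", not an automatic push.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    soak_hours: int
    max_burn_rate: float  # a burn rate above this triggers rollback


# Soak times and thresholds for the two canary stages come from the text;
# later stages would need their own values.
INTERNAL_CANARY = Stage("internal canary", soak_hours=24, max_burn_rate=10.0)
EXPANDED_CANARY = Stage("expanded canary", soak_hours=48, max_burn_rate=2.0)


def next_action(stage: Stage, burn_rate: float, hours_elapsed: float) -> str:
    """Return 'rollback', 'hold', or 'promote' for the current stage."""
    if burn_rate > stage.max_burn_rate:
        return "rollback"   # budget is burning too fast for this stage
    if hours_elapsed < stage.soak_hours:
        return "hold"       # healthy so far, but the soak is not finished
    return "promote"        # healthy for the full soak period
```

Note that the same 3x burn rate is tolerated during the internal canary but forces a rollback during the expanded canary, which is the tightening the sequence describes.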

The most dangerous time is right after a successful launch. Teams relax — the service "works" — and deferred PRR items stay deferred. P1 blockers from the PRR must have an owner and a due date enforced by your project management system. A service that goes live with unresolved P1s and no follow-up calendar entry will carry those risks indefinitely. Close the loop within two sprints.
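
One way to keep deferred items from going stale is to have a scheduled job report every open blocker whose due date has passed. The sketch below reads the `status`, `due`, and `blocker_owner` fields used in the checklist file. The function name and the message format are hypothetical, and wiring the output into a ticketing or chat system is left out.

```python
from datetime import date


def overdue_blockers(checklist: dict, today: date) -> list[str]:
    """List open (not_met) items that are past due or have no due date."""
    overdue = []
    for domain, items in checklist.items():
        if not isinstance(items, dict):
            continue  # top-level metadata such as service or version
        for name, item in items.items():
            if not isinstance(item, dict) or item.get("status") != "not_met":
                continue
            due = item.get("due")
            owner = item.get("blocker_owner", "unassigned")
            if due is None:
                overdue.append(f"{domain}.{name}: no due date, owner {owner}")
            # str() handles both quoted YAML strings and parsed date objects
            elif date.fromisoformat(str(due)) < today:
                overdue.append(f"{domain}.{name}: due {due}, owner {owner}")
    return overdue
```

An item due today is not yet reported; it shows up the day after, which keeps the report aligned with "past due" rather than "due soon".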

Production Failure Modes That PRRs Catch

Missing runbooks for known failure modes: the team knows the database can fail but has never written down what to do. The PRR checklist question "is there a runbook for every P0 alert?" makes this visible.
Untested rollback: a rollback that has never been drilled can take 45 minutes under incident pressure. The same procedure, drilled three times, can take 4 minutes. PRRs require evidence of a drill, not just documentation.
Connection pool misconfiguration at scale: the service works fine at 100 RPS in staging but exhausts the database connection pool at 800 RPS in production. The 10x load-test requirement surfaces this before launch.
Silent dependency on a single availability zone: the service is "multi-region" in the architecture diagram but all caching is in one AZ. The PRR's "DR test" item catches this drift.
Alert-to-runbook gaps: an alert fires but the on-call engineer does not know what it means or what to do. Every P0 alert must map 1:1 to a runbook section that explains the signal, its causes, and the remediation steps.
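
The connection pool failure mode above can be estimated before any load test with Little's law: average concurrency equals arrival rate multiplied by the time each request holds a resource. The sketch below applies it to database connections. The pool size of 20 and the 25 ms hold time are assumed numbers chosen to reproduce the 100 RPS and 800 RPS figures in the example; real values have to be measured.

```python
import math


def required_connections(rps: float, mean_hold_seconds: float,
                         queries_per_request: float = 1.0) -> int:
    """Little's law: connections in use = arrival rate x time each is held."""
    return math.ceil(rps * queries_per_request * mean_hold_seconds)


def max_rps_for_pool(pool_size: int, mean_hold_seconds: float,
                     queries_per_request: float = 1.0) -> float:
    """Highest steady request rate the pool can serve before requests queue."""
    return pool_size / (mean_hold_seconds * queries_per_request)


if __name__ == "__main__":
    hold = 0.025  # assumed: each query holds a connection for 25 ms on average
    print(required_connections(100, hold))   # staging load
    print(max_rps_for_pool(20, hold))        # ceiling for an assumed pool of 20
```

This is an average-case estimate. Latency usually rises as the database saturates, which lengthens hold times and lowers the real ceiling, so the load test is still needed to confirm the number.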

Production Readiness Reviews are the difference between a service that survives its first incident and one that creates one. The checklist is not bureaucracy — it is the codification of every production failure mode your organization has ever suffered, turned into a gate that stops the next one before it starts.