Reliability Reviews & Production Readiness
Every outage has a pre-history. Somewhere before the pager fires, a team made a decision — or skipped one — that made the failure possible. The Production Readiness Review (PRR) is SRE's answer to that pattern: a structured, pre-launch gate that forces engineering teams to answer the hard questions before a service goes live, not after users are impacted. At Google, no service moves into SRE on-call coverage without passing a PRR. At Netflix, it is embedded in the deployment pipeline under the name "LaunchReady." Stripe runs a "Production Readiness Checklist" review as part of every significant release. The specifics differ; the intent is identical — make the implicit explicit and get it into a document before a single production request is handled.
This lesson covers the anatomy of a PRR, how to write and run one, the failure modes that kill launches, and the checklist patterns used at big-tech scale.
What a PRR Is — and Is Not
A PRR is not a security audit, a design review, or a performance test — though it references all three. It is a reliability conversation between the team building the service and the team that will operate it (often SRE). Its job is to answer: "If this service fails at 3 a.m. on a Sunday, can we detect it, contain it, and recover — without heroics?"
A PRR typically produces two artifacts: a PRR document (the completed questionnaire with evidence) and a launch blocking list (gaps that must be resolved before launch). Items on the launch blocking list are not optional suggestions — they are hard gates. A service with P0 blockers does not go live.
The PRR Lifecycle
A PRR is not a one-time event. It follows the service through its lifecycle:
- Pre-launch PRR — before the first production request. The most important one. Catches fundamental gaps in observability, runbooks, and failure planning.
- Significant change PRR — triggered by major architectural changes (new database, new dependency, 10x traffic projection). Not every release; only changes that materially change the risk profile.
- Annual PRR refresh — services evolve. The runbook written 18 months ago may no longer reflect reality. Annual reviews catch drift between documentation and implementation.
The PRR Checklist: What Big Tech Actually Asks
The checklist is the heart of the PRR. Every item has an owner, a status (Met / Not Met / N/A), and evidence. Below are the domains and the specific questions that block launch when unanswered.
1. Observability
- Are the four golden signals (latency, traffic, errors, saturation) instrumented and dashboarded?
- Is there a Tier-0 dashboard that shows SLO burn rate in real time?
- Are structured logs being emitted and queryable in the log aggregation platform?
- Are distributed traces enabled and sampled at a rate that preserves p99 coverage?
- Is there a synthetic canary or health-check endpoint that is hit from outside the cluster?
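As a rough illustration of what "instrumented" means for the first question, the sketch below derives the four golden signals from one window of request samples. The `Request` record, the function names, and the choice of in-flight requests over capacity as the saturation measure are all assumptions made for this example; a real service would normally get these numbers from its metrics library, not compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    latency_ms: float
    ok: bool


def nearest_rank_percentile(values, pct):
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise one observation window as latency, traffic, errors, saturation."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r.latency_ms for r in requests]
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": nearest_rank_percentile(latencies, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": failed / len(requests),
        # Assumption: saturation is modelled as in-flight work over capacity.
        "saturation": in_flight / capacity,
    }
```

A dashboard panel per key in the returned dictionary is one plausible way to present evidence for this checklist item.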
2. Alerting
- Are SLO-burn-rate alerts defined in code (not in the UI)?
- Is the on-call rotation configured and tested (have all members received a test page)?
- Are alert severity levels defined (P0 wakes someone; P1 is next-business-day)?
- Has alert fatigue been assessed: have you estimated the expected alert volume and confirmed it stays below 5 actionable pages per on-call shift?
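To make "SLO-burn-rate alerts defined in code" concrete, here is a minimal sketch of the underlying arithmetic. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The two-window check and the function names are assumptions for illustration, and the threshold is whatever the team chooses (for example, 14.4 sustained for one hour consumes about 2% of a 30-day budget).

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold):
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is sustained; the short window shows it
    is still happening, so the page clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Keeping a function like this (or its equivalent alerting rule) in the service repository means changes to thresholds go through code review like any other change.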
3. Runbooks & Incident Response
- Is there a runbook for every P0 alert? Does it pass the "3 a.m. test" — would an engineer unfamiliar with the service be able to follow it?
- Is the incident management process documented (who to page, escalation path, communication channel)?
- Has a game-day exercise been run against at least one failure scenario?
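The "runbook for every P0 alert" question can be checked mechanically. The sketch below assumes alerts are available as dictionaries with `name` and `severity` keys and that runbooks are indexed by alert name; both shapes are hypothetical and would need adapting to wherever alert definitions actually live.

```python
def p0_alerts_missing_runbooks(alerts, runbooks):
    """Return the sorted names of P0 alerts with no non-empty runbook entry.

    alerts:   iterable of {"name": str, "severity": str} (assumed shape)
    runbooks: mapping of alert name -> runbook location or section
    """
    return sorted(
        alert["name"]
        for alert in alerts
        if alert["severity"] == "P0" and not runbooks.get(alert["name"])
    )
```

An empty result is the evidence the reviewer wants to see; any name in the list is a candidate for the launch blocking list. Note that this only proves a runbook exists, not that it passes the 3 a.m. test.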
4. Capacity & Traffic Management
- Is the launch traffic profile defined (expected RPS, p99 latency target, data volume)?
- Has load testing been run at 2x the expected launch traffic? At 10x?
- Are there load shedding or circuit breaker mechanisms? What happens when an upstream dependency fails?
- Is autoscaling configured, tested, and bounded (what is the ceiling, and have you modelled cost at that ceiling)?
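For the circuit breaker question, a minimal sketch of the pattern follows. It is illustrative only: the class name, defaults, and the simple consecutive-failure rule are assumptions, and production systems usually rely on a maintained library or service mesh feature instead of hand-rolled code.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The PRR answer to "what happens when an upstream dependency fails?" is then observable behaviour: callers get a fast `CircuitOpenError` and can serve a fallback, not wait on timeouts.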
5. Rollback & Recovery
- Is there a tested rollback procedure? How long does it take?
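- Are database migrations reversible? If not, is there a tested roll-forward or restore-from-backup plan? (Promoting a read replica does not undo a migration that has already replicated to it.)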
- Is the RTO (Recovery Time Objective) defined, and does it match the SLA?
- Has a DR test been conducted within the last six months?
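One way to sanity-check whether an RTO "matches the SLA" is to compare it with the downtime the SLA allows. The sketch below assumes an availability SLA measured over a 30-day month and treats the RTO as the full outage duration per incident; both are simplifying assumptions, and the function names are hypothetical.

```python
def monthly_downtime_budget_minutes(sla_target, days_in_month=30):
    """Minutes of full downtime an availability SLA allows in one month."""
    if not 0.0 < sla_target < 1.0:
        raise ValueError("sla_target must be between 0 and 1, e.g. 0.999")
    return (1.0 - sla_target) * days_in_month * 24 * 60


def rto_fits_sla(rto_minutes, sla_target, expected_incidents_per_month=1):
    """True if the expected number of RTO-length outages stays within budget."""
    budget = monthly_downtime_budget_minutes(sla_target)
    return rto_minutes * expected_incidents_per_month <= budget
```

Under these assumptions a 99.9% SLA allows roughly 43 minutes of downtime per month, so a 60-minute RTO cannot honour it even with a single incident.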
6. Security & Compliance
- Are secrets managed via a secrets manager (Vault, AWS Secrets Manager, GCP Secret Manager) and never hardcoded in images, whether in files or in baked-in environment variables?
- Is mTLS enabled between services, or is the network boundary otherwise secured?
- Has a threat model been completed?
- Are audit logs enabled for all data-plane operations?
Writing the Launch Checklist as Code
The checklist lives in version control alongside the service. A simple, durable format is a YAML file committed to the service repository and parsed by CI to enforce launch gates, with one entry per checklist item recording its owner, status, blocker severity, and evidence.
A CI job fails the merge if any item has status: not_met and blocker_severity: P0. P1 blockers trigger a warning annotation on the pull request but do not block — they must be resolved before the post-launch review.
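The gate logic described above might look like the following sketch. Only the `status` and `blocker_severity` fields and their `not_met`, `P0`, and `P1` values come from the text; the other field names, the item ids, and the sample entries are hypothetical. The items are written inline as Python dictionaries to keep the example self-contained; in a real repository they would presumably be loaded from the YAML file, for instance with PyYAML's `yaml.safe_load`.

```python
# Hypothetical checklist entries; evidence values are placeholders.
CHECKLIST = [
    {"id": "obs-golden-signals", "owner": "service-team", "status": "met",
     "blocker_severity": "P0", "evidence": "placeholder: dashboard link"},
    {"id": "rollback-tested", "owner": "service-team", "status": "not_met",
     "blocker_severity": "P0", "evidence": ""},
    {"id": "dr-test-recent", "owner": "sre", "status": "not_met",
     "blocker_severity": "P1", "evidence": ""},
    {"id": "mtls-enabled", "owner": "security", "status": "n/a",
     "blocker_severity": "P1", "evidence": "placeholder: design note"},
]


def evaluate_launch_gate(items):
    """Split unmet items into hard blockers (P0) and warnings (P1)."""
    unmet = [item for item in items if item["status"] == "not_met"]
    blockers = [i["id"] for i in unmet if i["blocker_severity"] == "P0"]
    warnings = [i["id"] for i in unmet if i["blocker_severity"] == "P1"]
    return {"passed": not blockers, "blockers": blockers, "warnings": warnings}


def gate_exit_code(items):
    """Exit code for the CI job: non-zero fails the merge."""
    return 0 if evaluate_launch_gate(items)["passed"] else 1
```

With the sample entries, the job would fail on `rollback-tested` and surface `dr-test-recent` as a warning. A CI wrapper would call `sys.exit(gate_exit_code(items))` and print the two lists for the pull request annotation.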
Taking a Service to Production: The Launch Day Sequence
PRR approval does not mean "deploy immediately." Big-tech launches follow a controlled rollout sequence that limits blast radius and confirms reliability at each step before expanding exposure.
- Internal canary (0.1% of production traffic) — deploy to one shard, confirm golden signals are nominal for 24 hours. Any SLO burn above 10x triggers automatic rollback.
- Expanded canary (1–5%) — 48-hour soak. The SLO burn alert threshold drops from 10x to 2x. Confirm that the p99 latency target is met under real user load, not synthetic load.
- Region-by-region rollout (10% → 25% → 50% → 100%) — each stage gated by a manual approval step in the deployment pipeline. The on-call engineer signs off; the pipeline records who approved and when.
- Post-launch review (48–72 hours after full rollout): confirm the SLO is being met, burn rate is nominal, and there are no unexpected errors in the long tail of the latency histogram.
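The rollout sequence above can be sketched as a small decision function that a deployment pipeline might evaluate at each stage. The canary percentages, soak times, and burn-rate thresholds for the first two stages come from the list; treating a threshold breach as a rollback at every stage, the 2x threshold and zero soak time for the regional stages, and all names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_percent: float
    min_soak_hours: float
    max_burn_rate: float
    needs_manual_approval: bool = False


STAGES = [
    Stage("internal-canary", 0.1, 24, 10.0),
    Stage("expanded-canary", 5.0, 48, 2.0),
    # Assumed values: the text gives no soak time or threshold for these.
    Stage("rollout-10", 10.0, 0, 2.0, needs_manual_approval=True),
    Stage("rollout-25", 25.0, 0, 2.0, needs_manual_approval=True),
    Stage("rollout-50", 50.0, 0, 2.0, needs_manual_approval=True),
    Stage("rollout-100", 100.0, 0, 2.0, needs_manual_approval=True),
]


def decide(stage, burn_rate, hours_soaked, approved=False):
    """Return 'rollback', 'hold', or 'promote' for the current stage."""
    if burn_rate > stage.max_burn_rate:
        return "rollback"                # reliability regression: back out
    if hours_soaked < stage.min_soak_hours:
        return "hold"                    # healthy, but soak not finished
    if stage.needs_manual_approval and not approved:
        return "hold"                    # waiting for on-call sign-off
    return "promote"
```

Recording who set `approved` and when, as the rollout step requires, is left to the pipeline itself.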
Production Failure Modes That PRRs Catch
- Missing runbooks for known failure modes: the team knows the database can fail but has never written down what to do. The PRR checklist question "is there a runbook for every P0 alert?" makes this visible.
- Untested rollback: a rollback that has never been drilled can take 45 minutes under incident pressure, while the same procedure drilled three times might take 4 minutes. PRRs require evidence, not just documentation.
- Connection pool misconfiguration at scale: the service works fine at 100 RPS in staging but exhausts the database connection pool at 800 RPS in production. The 10x load-test requirement surfaces this before launch.
- Silent dependency on a single availability zone: the service is "multi-region" in the architecture diagram but all caching is in one AZ. The PRR's "DR test" item catches this drift.
- Alert-to-runbook gaps: an alert fires but the on-call engineer does not know what it means or what to do. Every P0 alert must map 1:1 to a runbook section that explains the signal, its causes, and the remediation steps.
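The connection pool failure mode above comes down to simple arithmetic: by Little's law, the average number of connections in use is roughly the request rate multiplied by how long each request holds a connection. The sketch below uses the 100 and 800 RPS figures from the list; the 50 ms hold time and the pool size of 20 are hypothetical numbers chosen for illustration.

```python
import math


def connections_needed(rps, hold_time_ms):
    """Average concurrent connections: arrival rate x time each is held."""
    return math.ceil(rps * hold_time_ms / 1000)


def pool_is_sufficient(rps, hold_time_ms, pool_size):
    """True if the pool covers average demand (no headroom for bursts)."""
    return connections_needed(rps, hold_time_ms) <= pool_size


# Hypothetical figures: each request holds a connection for 50 ms,
# and the pool is capped at 20 connections.
staging = connections_needed(100, 50)      # about 5 connections in use
production = connections_needed(800, 50)   # about 40 connections in use
```

With these assumed numbers, staging never comes close to the limit while production needs twice the pool, which is why the problem stays invisible until load testing at production-like rates. Hold times usually grow as the database slows under load, so the real requirement tends to be higher than this average suggests.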
Production Readiness Reviews are the difference between a service that survives its first incident and one that creates one. The checklist is not bureaucracy — it is the codification of every production failure mode your organization has ever suffered, turned into a gate that stops the next one before it starts.