Reliability, Availability & Resilience

Project: Make a System Resilient

Throughout this tutorial you have studied each resilience pattern in isolation: redundancy, failover, circuit breakers, bulkheads, rate limiting, retries with backoff, health checks, monitoring, and graceful degradation. Real systems require all of them applied together, in the right places, with the right trade-offs. This capstone lesson walks through a realistic e-commerce platform, identifies every single point of failure, and then applies the full toolkit to produce a production-grade resilient architecture, with concrete numbers and rationale at each step.

The Starting System: A Fragile Monolith

Imagine a mid-size online retailer: ShopFast. It serves 50,000 daily active users, processes 200 orders per minute at peak, and has an SLO of 99.9% availability (43 minutes downtime budget per month). The current architecture looks like this:

One web application server (Node.js) behind a single Nginx reverse proxy.
One primary PostgreSQL database. No replicas.
One Redis instance for session and cart data.
One third-party payment API (Stripe), called synchronously on every checkout.
One SMTP server for order-confirmation emails, also called synchronously.
No health checks. No circuit breakers. No rate limiting. No monitoring beyond OS-level CPU alerts.

Every component is a single point of failure (SPOF). The database goes down → the whole site is down. Stripe times out → checkout hangs indefinitely → the web process pool is exhausted → the whole site is down. A spike of traffic → no protection → database connection pool saturated → 500 errors for all users. The system is not resilient; it is brittle.

ShopFast before hardening: every component is a single point of failure, and synchronous external calls create cascading failure paths.

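Cascading failure anatomy: Stripe's latency climbs from 200 ms to 5 s and beyond, and with no timeout configured nothing bounds the wait. Each checkout holds its web worker for the full duration of the call. By Little's law (requests in flight = arrival rate × latency), the peak rate of 200 orders per minute keeps less than one worker busy at 200 ms, about 17 at 5 s, and all 200 once calls hang for a minute; users who retry get there sooner. New requests queue, then time out. The whole site becomes unavailable, not because Stripe is down, but because one slow dependency was called synchronously with no timeout, no bulkhead, and no circuit breaker. This is one of the most common real-world outage patterns.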
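The arithmetic behind worker-pool exhaustion can be sketched with Little's law. This is a back-of-the-envelope model, assuming the peak rate of 200 orders per minute and the pool of 200 workers from the scenario; the function names are illustrative.

```python
def busy_workers(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average requests in flight = arrival rate x time in system."""
    return arrival_rate_per_s * latency_s


PEAK_RATE = 200 / 60  # 200 orders per minute, expressed per second
POOL_SIZE = 200       # worker count from the scenario

for latency_s in (0.2, 5.0, 60.0):
    in_flight = busy_workers(PEAK_RATE, latency_s)
    share = min(1.0, in_flight / POOL_SIZE)
    print(f"latency {latency_s:>5.1f} s -> {in_flight:6.1f} workers busy ({share:.0%} of pool)")
```

The point of the model is that pool occupancy grows linearly with dependency latency, so an unbounded wait eventually consumes any pool, however large.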

Step 1: Eliminate Single Points of Failure with Redundancy

Apply N+1 redundancy to every stateless component and active-passive replication to every stateful one:

Load Balancer: Replace single Nginx with two load balancers in active-passive behind a virtual IP (keepalived / cloud NLB). Health-check every 5 s; failover in < 10 s.
App Servers: Run a minimum of 3 app servers, sized so that any 2 can carry peak traffic; the third provides the headroom that absorbs a failure. The load balancer distributes with least-connections.
PostgreSQL: Add one synchronous standby in the same AZ (RPO ≈ 0), plus one asynchronous cross-AZ replica for reads and disaster recovery (RPO ≈ 1–5 s). Automated failover via Patroni; promotion SLA < 30 s.
Redis: Sentinel mode with one primary, two replicas, and at least three Sentinel processes so that a majority can agree on a failover. Automatic promotion on primary failure; client-side retry with exponential backoff.

Step 2: Add a Message Queue — Decouple External Calls

The synchronous Stripe and SMTP calls are the biggest reliability risk after the DB. Decouple them using an async queue (RabbitMQ or a managed alternative like SQS):

Checkout handler: validate the cart, deduct inventory, write an order row with status pending to PostgreSQL, and publish a checkout.requested event to the queue. Return HTTP 202 to the client immediately.
A separate Payment Worker consumes checkout.requested, calls Stripe, and updates the order record. If Stripe fails, the message stays in the queue for retry; it does not block a web worker thread. Send an idempotency key (for example, the order ID) with each Stripe call so that a retried message cannot charge the customer twice.
An Email Worker consumes order.confirmed and calls SMTP. A transient SMTP failure retries the job; it never touches the HTTP request path.

Result: a Stripe outage no longer exhausts web worker threads. The checkout endpoint remains available even if Stripe is degraded; orders queue and process once Stripe recovers.
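The decoupling described above can be sketched in a few lines. This is a minimal in-process model, not a production design: queue.Queue stands in for the broker, a dict stands in for the orders table, and charge is a hypothetical placeholder for the payment call.

```python
import queue
import threading
import uuid
from typing import Dict, List, Tuple

orders: Dict[str, str] = {}                             # order_id -> status (stand-in for the orders table)
checkout_requested: "queue.Queue[str]" = queue.Queue()  # stand-in for the message broker


def handle_checkout(cart: List[str]) -> Tuple[int, str]:
    """HTTP handler: record the order, publish the event, answer immediately."""
    if not cart:
        return 400, ""
    order_id = str(uuid.uuid4())
    orders[order_id] = "PENDING"
    checkout_requested.put(order_id)
    return 202, order_id


def charge(order_id: str) -> bool:
    """Hypothetical payment call; a real worker would call the provider here."""
    return True


def payment_worker(stop: threading.Event) -> None:
    """Consume events off the request path; a failed charge is re-queued."""
    while not stop.is_set():
        try:
            order_id = checkout_requested.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            if charge(order_id):
                orders[order_id] = "CONFIRMED"
            else:
                checkout_requested.put(order_id)  # leave it for a later retry
        finally:
            checkout_requested.task_done()
```

The handler never waits on the payment provider, so its latency depends only on the database write and the publish. A real broker adds acknowledgements, redelivery, and persistence that this sketch does not model.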

Step 3: Circuit Breakers on Every External Dependency

Even with the queue, the payment worker still calls Stripe. Add a circuit breaker (e.g., Hystrix-style or Resilience4j) with these parameters:

Threshold: open after 5 failures in a 10-second window.
Open state duration: 30 seconds — do not attempt Stripe calls. Increment a payment.circuit_open metric and alert on-call.
Half-open probe: allow 1 request; close circuit on success, re-open on failure.
Fallback: when open, mark the order as PAYMENT_DEFERRED and schedule a retry job for 60 seconds later. Notify the user by email that payment processing is slightly delayed.

Apply the same pattern to your SMTP provider: if 3 consecutive sends fail, open the circuit and route emails through a secondary provider (e.g., SendGrid as the primary with Mailgun as the fallback).
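The state machine behind these parameters can be sketched as follows. This is a simplified, single-threaded illustration with the thresholds from the list above as defaults; it is not the Hystrix or Resilience4j implementation, and the class and method names are assumptions. A production breaker would also need locking and a limit on concurrent half-open probes.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window_s: float = 10.0,
                 open_s: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock
        self.failures: Deque[float] = deque()   # timestamps of recent failures
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency calls suspended")
        try:
            result = fn()
        except Exception:
            self._record_failure(state)
            raise
        if state == "half-open":  # successful probe closes the circuit
            self.opened_at = None
            self.failures.clear()
        return result

    def _record_failure(self, state: str) -> None:
        now = self.clock()
        if state == "half-open":  # failed probe re-opens for another open_s
            self.opened_at = now
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()  # drop failures outside the window
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

A payment worker would wrap each provider call in breaker.call(...) and treat CircuitOpenError as the signal to apply the fallback, for example marking the order PAYMENT_DEFERRED.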

Step 4: Bulkheads — Isolate Failure Domains

Partition thread pools and connection pools so that a spike in one feature cannot starve others:

Checkout pool: 50 dedicated DB connections. Payment workers are capped at 50 concurrent Stripe calls.
Product catalogue pool: 30 dedicated DB connections for read-only browse queries. A checkout surge does not impact browsing.
Admin pool: 10 connections. Admin tools never compete with customer traffic.

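Use a second Redis instance for sessions, separate from the product cache. If the product cache is flushed or starts evicting under memory pressure, sessions are unaffected. A separate logical DB on the same instance is a weaker alternative: it survives a FLUSHDB of the cache, but both databases still share one process, one memory limit, and one failure.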
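The pool partitioning above can be illustrated with one bounded semaphore per feature. This is a sketch of the idea, assuming the pool sizes listed above; the Bulkhead class and POOLS mapping are illustrative names, and a real service would usually rely on its connection-pool library's own limits.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a pool has no free slot."""


class Bulkhead:
    def __init__(self, name: str, limit: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self, timeout_s: float = 0.0):
        # Fail fast (or after a short bounded wait) instead of queueing forever.
        if timeout_s > 0:
            acquired = self._slots.acquire(timeout=timeout_s)
        else:
            acquired = self._slots.acquire(blocking=False)
        if not acquired:
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


POOLS = {
    "checkout": Bulkhead("checkout", 50),
    "catalogue": Bulkhead("catalogue", 30),
    "admin": Bulkhead("admin", 10),
}
```

A handler would run its database work inside `with POOLS["checkout"].slot():`. When the checkout pool is exhausted, only checkout requests see BulkheadFull; catalogue and admin requests still find free slots.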

Step 5: Rate Limiting — Protect Against Traffic Spikes

Implement rate limiting at the load balancer / API gateway layer:

Per-user: 60 requests/minute for the checkout endpoint (token bucket, Redis-backed).
Global: If total requests/second exceed 1.5× the 95th-percentile baseline, return HTTP 429 with Retry-After: 5 for a percentage of requests (proportional shedding), preserving capacity for paying customers.
Bot / scraper protection: Challenge requests with suspicious user-agents or >100 req/s from a single IP.
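The per-user limit above can be sketched with a token bucket. This version keeps state in process memory purely for illustration; the Redis-backed variant described above would store the same two values (token count and last update time) per user in Redis. Class names are assumptions.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, capacity: int, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = float(capacity)  # start full so short bursts are allowed
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserLimiter:
    """60 requests per minute per user: capacity 60, refilled at 1 token per second."""

    def __init__(self, capacity: int = 60, refill_per_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._capacity = capacity
        self._refill = refill_per_s
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill, self._clock)
            self._buckets[user_id] = bucket
        return bucket.allow()
```

When allow returns False, the gateway would answer HTTP 429 with a Retry-After header. One user exhausting their bucket has no effect on any other user's bucket.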

Step 6: Timeouts at Every Layer

Set explicit timeouts everywhere, because no timeout means an unbounded wait that can cascade into a full outage:

Load balancer → App server: connect 1 s, read 10 s.
App server → PostgreSQL: connect 2 s, query 5 s (override to 60 s for admin report queries).
App server → Redis: connect 0.5 s, command 1 s.
Payment worker → Stripe: connect 3 s, read 15 s.
Email worker → SMTP: connect 5 s, read 10 s.

Pair timeouts with exponential backoff and jitter on retries: base 500 ms, multiplier 2×, max 30 s, jitter ±20%. Cap retries at 3 attempts for user-facing paths. Background workers can afford more: allow up to 10 attempts, then move the message to a dead-letter queue for inspection.
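The retry schedule above can be sketched as a small generator plus a retry wrapper. The defaults mirror the stated parameters; reading "3 attempts" as three calls in total, including the first, is an assumption, and the function names are illustrative.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base_s: float = 0.5, multiplier: float = 2.0,
                   max_s: float = 30.0, jitter: float = 0.2,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield the sleep before each retry: capped exponential with +/- jitter."""
    for n in range(retries):
        delay = min(max_s, base_s * multiplier ** n)
        yield delay * (1 + jitter * (2 * rng() - 1))  # rng() in [0, 1) maps to [-jitter, +jitter)


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to max_attempts times in total, sleeping between attempts."""
    delays = backoff_delays(max_attempts - 1)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:  # attempts exhausted: surface the last error
                raise
            sleep(delay)
```

Without jitter the nominal delays are 0.5 s, 1 s, 2 s, 4 s, 8 s, 16 s, then 30 s for every later retry. Jitter spreads clients out so that they do not all retry at the same instant after a shared failure.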

Step 7: Health Checks & Graceful Degradation

Every component exposes a structured health endpoint:

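GET /health/live: returns 200 if the process is running and can answer a request (liveness). It checks no dependencies. The load balancer polls every 5 s; after 2 consecutive failures it stops routing to that instance.
GET /health/ready: returns 200 only if the dependencies the instance cannot serve without (the PostgreSQL primary) answer within 500 ms (readiness). Redis and the queue broker are probed and reported in the response body, but a failure there marks the instance as degraded rather than unready, because the degradation strategy below keeps serving without them; failing readiness on a shared dependency would pull every instance out of rotation at once. A new app server instance does not receive traffic until this check passes, which eliminates cold-start errors during deploys.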
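A readiness handler along these lines might look like the sketch below. The probes are hypothetical callables supplied by the application, and the per-dependency "required" flag is an assumption of this sketch: which dependencies should fail readiness is a design decision for each system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Probe = Callable[[], bool]


def readiness(checks: Dict[str, Tuple[Probe, bool]],
              timeout_s: float = 0.5) -> Tuple[int, Dict[str, str]]:
    """Run all probes concurrently; return (HTTP status, per-dependency report).

    checks maps a dependency name to (probe, required). Only a failing
    required dependency turns the status into 503.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(probe) for name, (probe, _required) in checks.items()}
    deadline = time.monotonic() + timeout_s  # one shared deadline for all probes
    report: Dict[str, str] = {}
    ready = True
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            ok = bool(future.result(timeout=remaining))
        except Exception:  # the probe raised, or did not answer before the deadline
            ok = False
        report[name] = "ok" if ok else "fail"
        if not ok and checks[name][1]:
            ready = False
    pool.shutdown(wait=False)  # do not hold the response for a hung probe
    return (200 if ready else 503), report
```

The report body tells an operator which dependency is failing, while the status code alone is what the load balancer acts on.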

Graceful degradation strategy for ShopFast:

Redis down: Serve product catalogue from PostgreSQL (slower, but accurate). Disable personalised "Recently Viewed" widget (feature flag off). Sessions fall back to signed cookies. Cart persists to PostgreSQL.
Read replica down: Route all queries to the primary. Alert on-call. Accept higher primary load as a temporary trade-off rather than returning errors.
Stripe degraded: Circuit open → orders marked PAYMENT_DEFERRED → user sees "Your order is placed; payment will be processed within 5 minutes." Queue retries automatically.
Queue broker down: Fall back to a synchronous Stripe call with a 4 s timeout and 1 retry, so that the worst case stays under the load balancer's 10 s read timeout. If that fails too, show an error. This synchronous path is the last resort before showing errors.

ShopFast after hardening: redundant load balancers and app servers, DB with synchronous standby and async replica, Redis Sentinel, async queue decoupling external calls, circuit breakers on workers, and a monitoring plane covering the whole system.

Step 8: Monitoring, Alerting, and Dashboards

Instrument every layer with Prometheus metrics scraped every 15 s and visualised in Grafana. The minimum set of metrics for ShopFast:

http_request_duration_seconds: p50, p95, p99 per endpoint. A latency alert fires when p99 exceeds 500 ms for more than 5 minutes.
http_requests_total{status="5xx"}: error rate, measured as the share of all requests that return a 5xx status. Alert at >1% over 2 minutes.
payment_circuit_state (0=closed, 1=open): page on-call immediately when open.
queue_depth: alert if the checkout queue exceeds 5,000 unprocessed messages (indicates worker starvation).
postgres_replication_lag_bytes: alert if streaming lag exceeds 10 MB (RPO risk).
redis_connected_clients: alert if near the maxclients limit.

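Define SLO burn rate alerts using two windows: a fast window (1 h, burn rate > 14.4×) and a slow window (6 h, burn rate > 6×). A burn rate of 1× spends the error budget exactly over the 30-day SLO period. At 14.4× the system spends 2% of the monthly budget in a single hour and would exhaust it in about 50 hours; at 6× it spends 5% in six hours and would exhaust it in about 5 days. Either alert means the budget will be gone well before the month ends, so page someone now.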
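The relationship between a burn-rate threshold and the error budget is simple arithmetic, sketched below for a 30-day SLO period and the 99.9% target. The function names are illustrative.

```python
PERIOD_H = 30 * 24  # 30-day SLO period, in hours


def burn_rate(error_ratio: float, slo: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, window_h: float, period_h: float = PERIOD_H) -> float:
    """Fraction of the period's error budget spent if `rate` holds for `window_h`."""
    return rate * window_h / period_h


def hours_to_exhaustion(rate: float, period_h: float = PERIOD_H) -> float:
    """Time until the whole budget is gone at a constant burn rate."""
    return period_h / rate


for rate, window_h in ((14.4, 1), (6.0, 6)):
    print(f"{rate:>4}x over {window_h} h: "
          f"{budget_consumed(rate, window_h):.0%} of budget spent, "
          f"exhausted in {hours_to_exhaustion(rate):.0f} h")
```

For example, a 1.44% error ratio against a 99.9% SLO is a burn rate of 14.4×, which is the fast-window threshold.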

Step 9: Measuring the Improvement

After applying all patterns, estimate the improvement in theoretical availability. The system availability is the product of each component's availability (for components in series) and improves significantly when components are parallelised:

Before: Single app server at 99.5% × single DB at 99.9% × single Redis at 99.9% × synchronous Stripe at 99.9% ≈ 99.2% composite availability. That is ~70 hours of downtime per year, about 8× the 8.8 hours per year that a 99.9% SLO allows.
After: Three app servers (at least 1 of 3 must be up: 1 − 0.005³ ≈ 99.99999%) × DB with synchronous standby (failover < 30 s; monthly availability ≈ 99.97%) × Redis Sentinel (≈ 99.95%) ≈ 99.92% composite availability. Stripe drops out of the product because the checkout path no longer depends on it synchronously. That is about 35 minutes of downtime per month, within the 43-minute monthly budget. These figures assume that component failures are independent.
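The series and parallel formulas can be checked with a few lines of arithmetic. This sketch uses the per-component figures quoted above and assumes failures are independent, which real systems only approximate; the helper names are illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must be up: multiply their availabilities."""
    return prod(availabilities)


def parallel(availability: float, n: int) -> float:
    """At least one of n identical replicas must be up."""
    return 1 - (1 - availability) ** n


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    return (1 - availability) * days * 24 * 60


before = series(0.995, 0.999, 0.999, 0.999)           # app, DB, Redis, synchronous Stripe
after = series(parallel(0.995, 3), 0.9997, 0.9995)    # 3 app servers, DB with standby, Redis Sentinel

for label, a in (("before", before), ("after", after)):
    print(f"{label}: {a:.4%} available, "
          f"{downtime_minutes_per_month(a):.1f} min of downtime per month")
```

The redundant app tier contributes almost nothing to the remaining downtime; after hardening, the database and Redis figures dominate the result.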

Key insight: The biggest reliability gain came from two changes: (1) adding redundancy to stateless app servers (trivial, cheap), and (2) decoupling the Stripe call via a queue (eliminates the most common cascading-failure path). The database failover and circuit breakers are important, but they address rarer events. Always fix the most frequent failure modes first.

The Resilience Checklist

When reviewing any system design for resilience, walk through this checklist:

SPOF scan: Draw the architecture. Circle every component that, if it fails, takes the whole system down. Eliminate each one with redundancy or graceful degradation.
External dependency audit: List every third-party API. Each one should be called asynchronously (queue) or guarded by a circuit breaker with a tested fallback.
Timeout map: Every network call has a timeout. Every timeout has been tested (inject artificial latency).
Retry policy: Every retry has a cap, uses exponential backoff with jitter, and writes to a dead-letter queue after exhaustion.
Bulkhead verification: A load spike on one feature cannot starve another. Confirm with separate connection pools and queue consumers.
Health check coverage: Both liveness and readiness probes exist and are accurate — not trivially returning 200 without actually checking dependencies.
Degradation modes documented: For each SPOF, the degraded user experience is explicitly defined, tested, and communicated (feature flags, fallback routes).
Error budget tracked: SLI data is collected, SLO burn rate is alerting, and the on-call team reviews budget consumption weekly.
Chaos tested: Kill an app server in staging. Saturate the DB connection pool. Inject 5-second latency on the Stripe API. Verify the system degrades gracefully, not catastrophically.

Build incrementally: You do not need to implement all nine steps at once. Start with the highest-impact changes: redundant app servers and an async queue for the most error-prone external dependency. Instrument monitoring immediately so you can measure improvement. Add circuit breakers and bulkheads in the next iteration. Resilience is a continuous investment, not a one-time project.

Summary

You have walked through a complete resilience hardening exercise: identified every SPOF in a realistic system, applied redundancy, decoupled synchronous external calls with a queue, layered on circuit breakers and bulkheads, set timeouts at every hop, established health checks with graceful degradation, and wired up monitoring with SLO burn rate alerts. The result is a system that meets its 99.9% SLO target — and, crucially, a system whose failure modes are known, tested, and handled rather than discovered in a 3 AM production incident. Resilience engineering is not about making systems that never fail. It is about making failure cheap, visible, and recoverable.
