Reliability, Availability & Resilience

Graceful Degradation & Load Shedding

Every system has a breaking point. When traffic surges past your design capacity — a viral product launch, a flash sale, a DDoS wave — you face a binary choice: let the system collapse under the weight and return errors to everyone, or deliberately shed work so that the most important users and features keep functioning. Graceful degradation and load shedding are the engineering disciplines that make that second option real and controlled rather than accidental and catastrophic.

The core insight is deceptively simple: a partially working system is almost always better than a completely broken one. Amazon found that during high-load events, disabling product recommendations kept checkout functional — losing some revenue is far better than losing all of it. Netflix, under severe failures, continues streaming cached content and hides the "Top Picks" row rather than showing a blank screen. Google Search drops spelling suggestions before it drops search results. Each of these is a deliberate, pre-engineered decision about what to sacrifice first.

Graceful Degradation: A Taxonomy of Sacrifices

Degradation is not a single action — it is a spectrum. Design your system with an explicit feature priority hierarchy before an incident, not during one. Typical tiers look like this:

Tier 1 — Core (never shed): The revenue-critical or safety-critical path. For an e-commerce site, this is product display, cart, and checkout. For a bank, this is account balance and payment transfer. These must survive even if everything else is disabled.
Tier 2 — Important (shed under extreme load): Features that significantly affect UX but are not strictly transactional. Search autocomplete, personalised recommendations, live inventory counts, comment counts.
Tier 3 — Enhancements (shed first): Real-time analytics dashboards, A/B experiment tracking, social proof widgets, non-critical notifications. These are expensive to compute and low-impact if missing.

Key idea: Write your tier list down and agree on it with product and engineering leadership before an incident. During an outage is the worst time to debate which features matter.

Practically, you implement degradation through feature flags (kill switches that disable a feature instantly), cached fallbacks (serve a 5-minute-old recommendations list instead of computing a fresh one), and static fallbacks (return a pre-rendered HTML page from a CDN when the origin is overloaded). The key is that every non-Tier-1 service call is wrapped in a fallback path — a try/catch or a circuit-breaker that returns a default response instead of propagating the failure upstream.
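
The fallback path described above can be sketched in a few lines of Python. This is a minimal illustration, not a production circuit breaker: the `FLAGS` dictionary, the `CachedFallback` class, and the 300-second staleness limit are hypothetical names and values chosen to mirror the kill switch, the 5-minute cached fallback, and the static default.

```python
import time
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical in-process kill switches; a real system would read these
# from a feature-flag service.
FLAGS: Dict[str, bool] = {"recommendations": True}


class CachedFallback(Generic[T]):
    """Wrap a non-Tier-1 call so that failures degrade instead of propagating."""

    def __init__(self, flag: str, fetch: Callable[[], T], default: T,
                 max_stale_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.flag = flag
        self.fetch = fetch
        self.default = default          # static fallback
        self.max_stale_s = max_stale_s  # how old a cached value may be
        self.clock = clock
        self._cached: Optional[Tuple[float, T]] = None  # (stored_at, value)

    def _fallback(self) -> T:
        # Prefer a recent cached value; otherwise use the static default.
        if self._cached is not None:
            stored_at, value = self._cached
            if self.clock() - stored_at <= self.max_stale_s:
                return value
        return self.default

    def get(self) -> T:
        if not FLAGS.get(self.flag, False):
            return self._fallback()   # kill switch is off: skip the call
        try:
            value = self.fetch()
        except Exception:
            return self._fallback()   # dependency failed: degrade
        self._cached = (self.clock(), value)
        return value
```

A caller would construct one wrapper per non-core dependency, for example `CachedFallback("recommendations", fetch_recommendations, default=editorial_picks)`, where both arguments are placeholders for whatever the application already has.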

Feature tier hierarchy: the API gateway routes requests, and non-core tiers fall back to cached or static responses under load rather than failing hard.

Load Shedding: Protecting the Core by Refusing Work

Load shedding is a more aggressive technique: instead of serving a degraded response, the system actively refuses incoming requests to protect itself. The key distinction from a simple error is intentionality — shedding is done before a server runs out of memory or threads, not after it falls over. Done correctly, the requests that do get through are served correctly at normal latency; only the excess is rejected, typically with HTTP 429 Too Many Requests or HTTP 503 Service Unavailable.

Load shedding strategies differ in what they reject:

Queue-length shedding: Measure the length of the work queue. When it exceeds a threshold (e.g., 1,000 pending jobs), reject new arrivals immediately. This caps the latency tail — a request that would sit in a 10,000-item queue for 45 seconds is better rejected fast than answered slowly.
CPU / memory threshold shedding: Monitor host utilisation. At 85% CPU, start rejecting low-priority traffic classes. At 95%, reject all non-Tier-1 traffic. This prevents the kernel from spending more time context-switching than doing real work.
Latency-based shedding (deadline-aware): Track request age. If a request has been waiting longer than its expected deadline (e.g., a user will abandon a page load after 3 seconds), discard it — serving a 4-second response to a request whose user already left wastes resources and helps no one.
Priority-based admission: Assign traffic classes (paying customers, internal health checks, free-tier users). Under load, admit traffic in priority order and shed the lowest class first. Google uses this extensively with criticality labels on RPC calls.
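
As a rough sketch of how these strategies combine, the Python below pairs a CPU-threshold admission check with a bounded queue that rejects on overflow and drops expired requests at dequeue time. The class names, the three priority levels, and the injected `clock` are assumptions made for illustration; the 85%/95% thresholds, the 1,000-item limit, and the 3-second deadline are the example values from the list above.

```python
import time
from collections import deque
from enum import IntEnum
from typing import Callable, Deque, Optional, Tuple


class Priority(IntEnum):
    TIER_1 = 0   # core path, shed last
    NORMAL = 1
    LOW = 2      # shed first


def admit_by_cpu(priority: Priority, cpu: float) -> bool:
    """Priority-based admission; cpu is host utilisation in [0.0, 1.0]."""
    if cpu >= 0.95:
        return priority == Priority.TIER_1   # only the core path survives
    if cpu >= 0.85:
        return priority != Priority.LOW      # drop the lowest class
    return True


class SheddingQueue:
    """Bounded FIFO: rejects on overflow, drops expired work on dequeue."""

    def __init__(self, max_len: int = 1000, deadline_s: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_len = max_len
        self.deadline_s = deadline_s
        self.clock = clock
        self._items: Deque[Tuple[float, object]] = deque()
        self.shed_full = 0      # rejected because the queue was full
        self.shed_expired = 0   # dropped because the deadline had passed

    def offer(self, request: object) -> bool:
        if len(self._items) >= self.max_len:
            self.shed_full += 1
            return False        # caller should answer 429/503 right away
        self._items.append((self.clock(), request))
        return True

    def poll(self) -> Optional[object]:
        while self._items:
            enqueued_at, request = self._items.popleft()
            if self.clock() - enqueued_at > self.deadline_s:
                self.shed_expired += 1   # the user has likely given up
                continue
            return request
        return None
```

Keeping separate counters per rejection reason is a deliberate choice here: it makes the difference between "queue full" and "deadline expired" visible later when tuning thresholds.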

Best practice: Shed at the edge, not at the origin. A load balancer or API gateway that rejects at the network layer uses almost no origin-server resources. If you wait until a database query fails to reject the request, you have wasted threads, connections, and query capacity on a request you were going to fail anyway.

Admission control at the gateway: high-priority requests pass through to healthy origin servers; excess traffic is rejected fast with a Retry-After header, protecting the origin from collapse.

Implementing Load Shedding: Practical Patterns

The machinery behind load shedding typically involves a few composable primitives:

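Token bucket / leaky bucket: The admission controller holds a bucket of tokens. Each admitted request consumes one token; tokens refill at a steady rate up to a fixed capacity. When the bucket is empty, new requests are rejected. A token bucket therefore caps the sustained rate while still allowing short bursts up to the bucket capacity; a leaky bucket instead drains a queue at a constant rate, smoothing bursty traffic into a steady stream the origin can handle (covered in the Rate Limiting lesson; here it applies globally, not per-user).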
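
A minimal token bucket might look like the following. The `TokenBucket` name, its parameters, and the injected `clock` are illustrative assumptions; the sketch is single-threaded, so a real gateway would need a lock or an atomic counter around `try_acquire`.

```python
import time
from typing import Callable


class TokenBucket:
    """Global admission bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill lazily, based on the time elapsed since the last call.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True                 # admit
        return False                    # bucket empty: shed this request
```

With `rate` set near the origin's sustainable throughput and a small `capacity`, the bucket admits brief bursts but rejects sustained excess.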

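Adaptive concurrency limits (Netflix concurrency-limits / TCP Vegas-inspired): Instead of a fixed queue limit, the system compares the current latency with the minimum observed latency, much as TCP Vegas compares round-trip times. When latency rises above a multiple of the minimum, it infers queuing and reduces the concurrency limit dynamically, without manual tuning. Little's Law (requests in flight = throughput × latency) explains why concurrency is the right quantity to limit: once throughput has plateaued, admitting more concurrent requests only adds latency. Netflix open-sourced this approach as the concurrency-limits library.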
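
To make the idea concrete, here is a deliberately simplified limiter that grows the limit by one while latency stays near the observed minimum and shrinks it multiplicatively when latency exceeds a multiple of that minimum. It is not the algorithm used by the concurrency-limits library; the class name, the `tolerance` and `backoff` parameters, and their defaults are assumptions for illustration, and the code is not thread-safe.

```python
class AdaptiveLimiter:
    """Latency-driven concurrency limit (simplified additive-increase,
    multiplicative-decrease sketch)."""

    def __init__(self, initial_limit: int = 20, min_limit: int = 1,
                 max_limit: int = 200, tolerance: float = 2.0,
                 backoff: float = 0.9) -> None:
        self.limit = float(initial_limit)
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.tolerance = tolerance      # allowed multiple of the minimum latency
        self.backoff = backoff          # shrink factor applied on congestion
        self.min_rtt = float("inf")     # lowest latency seen so far
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= int(self.limit):
            return False                # over the limit: shed
        self.in_flight += 1
        return True

    def release(self, rtt_s: float) -> None:
        """Call when a request finishes, passing its measured latency."""
        self.in_flight -= 1
        self.min_rtt = min(self.min_rtt, rtt_s)
        if rtt_s > self.tolerance * self.min_rtt:
            # Latency well above the baseline: assume queuing, back off.
            self.limit = max(self.min_limit, self.limit * self.backoff)
        else:
            # Latency near the baseline: probe for more capacity.
            self.limit = min(self.max_limit, self.limit + 1.0)
```

One known weakness of this sketch is that `min_rtt` never resets, so a single unusually fast sample can bias the limit downward; real implementations typically re-probe the baseline periodically.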

Back-pressure propagation: When a downstream service (e.g., a database) is slow, the upstream service should propagate that back-pressure: slow down its own accept rate rather than buffering unlimited work. This turns a local overload into a system-wide flow-control signal. Without it, buffers fill up, latency balloons, and timeouts cascade (covered in the Timeouts & Retries lesson — here the point is that the same signal should also trigger shedding, not just waiting).
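
One simple way to express this in Python is a bounded `queue.Queue`: a blocking `put` slows the caller down when the downstream consumer falls behind, and a timeout converts prolonged back-pressure into shedding. The `submit` helper and its 50 ms default wait are hypothetical choices for this sketch.

```python
import queue


def submit(work_queue: queue.Queue, item: object, wait_s: float = 0.05) -> bool:
    """Block briefly while the downstream queue is full, then shed."""
    try:
        # A blocking put on a bounded queue is the back-pressure signal:
        # the caller cannot accept new work faster than it is drained.
        work_queue.put(item, timeout=wait_s)
        return True
    except queue.Full:
        # Still full after waiting: reject instead of buffering without bound.
        return False
```

The queue must be created with a finite `maxsize` (for example `queue.Queue(maxsize=1000)`); with the default unbounded queue, `put` never blocks and no back-pressure is applied.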

Common pitfall: the thundering herd on recovery. When shedding stops and the system returns to normal, every client that received a 503 tends to retry at the same moment. This "thundering herd" can immediately re-overload a recovering system. Always include a Retry-After header with a randomised delay (e.g., 5–30 seconds with jitter) in shed responses, and implement exponential backoff with jitter on the client side.
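
Both halves of that advice can be sketched briefly. The function names and the default base and cap values are assumptions; the 5–30 second range comes from the example above. The client side honours the server's hint and then adds a jittered exponential delay on top, so clients that received the same Retry-After value still spread out.

```python
import random
from typing import Dict, Tuple


def shed_response(rng: random.Random, min_delay_s: int = 5,
                  max_delay_s: int = 30) -> Tuple[int, Dict[str, str]]:
    """Server side: a 503 whose Retry-After is randomised per response."""
    retry_after = rng.randint(min_delay_s, max_delay_s)   # whole seconds
    return 503, {"Retry-After": str(retry_after)}


def backoff_delay(attempt: int, retry_after_s: float, rng: random.Random,
                  base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Client side: wait at least Retry-After, plus jittered exponential backoff.

    `attempt` starts at 0 for the first retry.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return retry_after_s + rng.uniform(0.0, ceiling)
```

Passing the random generator in explicitly keeps the functions deterministic under test; production code could simply use a module-level `random.Random()`.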

Putting It Together: A Worked Example

Consider an e-commerce site that expects peaks of 50,000 requests/second on Black Friday but has capacity for only 35,000. Here is the degradation playbook:

Up to 35,000 req/s: Normal operation. All features enabled.
At 40,000 req/s: Feature flag flips: disable real-time inventory count (serve cached value, updated every 2 minutes). Disable personalised recommendations (serve editorial picks from CDN). CPU drops back below threshold.
At 48,000 req/s: Admission control kicks in: shed 25% of incoming traffic (lowest-priority session types: logged-out browsers, bots). Return 503 with Retry-After: 10. Core checkout continues at full capacity for authenticated users.
At 52,000 req/s (extreme): Expand shedding to 40% of traffic. All non-Tier-1 features are disabled. An emergency static landing page is served from the CDN for shed requests instead of a bare 503 — it shows a "We are experiencing high demand" message with an estimated wait time, preserving brand trust even while rejecting work.

The result: some users are delayed, but no user sees a full system crash, no database is overloaded, and the most valuable traffic (authenticated checkout) keeps flowing at full speed.
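
The playbook above can also be written down as data, which makes it reviewable and testable before the event. The sketch below is one possible encoding: the `Step` type and the feature names are hypothetical, and it assumes that traffic below 40,000 req/s is treated as normal operation, which the playbook does not state explicitly for the band between 35,000 and 40,000.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    min_rps: int                 # step applies at or above this load
    shed_fraction: float         # share of incoming traffic to reject
    disabled: Tuple[str, ...]    # features switched off at this step


# Thresholds and percentages are taken from the worked example.
PLAYBOOK = (
    Step(0, 0.0, ()),
    Step(40_000, 0.0, ("live_inventory", "recommendations")),
    Step(48_000, 0.25, ("live_inventory", "recommendations")),
    Step(52_000, 0.40, ("all_non_tier_1",)),
)


def plan(rps: int) -> Step:
    """Return the highest playbook step whose threshold has been reached."""
    current = PLAYBOOK[0]
    for step in PLAYBOOK:
        if rps >= step.min_rps:
            current = step
    return current


def admitted_rps(rps: int) -> float:
    """Traffic that still reaches the origin after shedding."""
    return rps * (1.0 - plan(rps).shed_fraction)
```

Working through the numbers: at 48,000 req/s with 25% shed, 48,000 × 0.75 = 36,000 req/s still reach the origin, and at 52,000 req/s with 40% shed, 52,000 × 0.60 = 31,200 req/s do. The first figure is slightly above the nominal 35,000 capacity, so the plan only holds if disabling features has reduced the cost per request; that is an assumption worth validating in a load test.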

Real-world scale: During peak load events, Twitter has reportedly shed up to 30% of write traffic (tweets) while preserving timeline reads. Google's serving infrastructure sheds lower-criticality RPC calls before dropping search results. LinkedIn uses a "degraded mode" that disables who-viewed-your-profile counts before disabling the feed. These are not accidents: they are explicitly engineered degradation paths, of the kind you should exercise before the event with load-testing tools such as Gatling or k6.

Observability During Degradation

You cannot manage what you cannot see. Every degradation state must emit clear signals:

A gauge metric for the current degradation tier (0 = normal, 1 = partial, 2 = severe).
A counter for shed requests per second, broken down by reason code (queue-full, CPU-limit, priority-class).
A log line for each feature flag flip with a timestamp and the trigger condition.
An alert when the shed rate exceeds 5% for more than 60 seconds. Sustained shedding at that level points to a capacity-planning problem, not to degradation working as intended.
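
The signals listed above could be tracked with something as small as the following. This is an in-memory sketch with hypothetical names; a real deployment would export the gauge and counters through its metrics system and define the alert there. The 5% threshold and 60-second window are the values from the list.

```python
import time
from collections import Counter
from typing import Callable, Optional


class ShedMonitor:
    """Tracks degradation tier, shed counts by reason, and a sustained-shed alert."""

    def __init__(self, threshold: float = 0.05, sustain_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.sustain_s = sustain_s
        self.clock = clock
        self.degradation_tier = 0           # gauge: 0 normal, 1 partial, 2 severe
        self.shed_by_reason: Counter = Counter()
        self._breach_started: Optional[float] = None

    def record_shed(self, reason: str) -> None:
        # e.g. "queue-full", "CPU-limit", "priority-class"
        self.shed_by_reason[reason] += 1

    def observe(self, shed_rate: float) -> bool:
        """Feed the latest shed rate (0.0 to 1.0); True means the alert fires."""
        now = self.clock()
        if shed_rate <= self.threshold:
            self._breach_started = None     # back under the threshold: reset
            return False
        if self._breach_started is None:
            self._breach_started = now      # first sample over the threshold
        return now - self._breach_started > self.sustain_s
```

Resetting the timer as soon as the rate dips below the threshold means brief spikes never alert; only shedding that stays above 5% for the full window does.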

Run regular degradation drills: inject synthetic load in staging, verify that the correct features degrade in the correct order, and verify that the expected metrics and alerts fire. Without drills, your fallback paths bit-rot silently: the feature flag still exists, but the fallback code stopped working six months ago when the recommendations service was refactored.

Summary

Graceful degradation and load shedding transform a binary "up or down" system into a spectrum of service levels. By pre-designing a feature priority hierarchy, wrapping non-critical calls in fallback paths, and deploying admission control at the edge, you ensure that when capacity runs out, your system fails in the way that causes the least harm to the most important users. This is not a technique for the day something breaks — it is infrastructure you build, test, and maintain before that day arrives.