Load Shedding & Backpressure
Load Shedding & Backpressure
Every system has a capacity ceiling. The question is not whether you will hit it, but what happens when you do. Systems that lack deliberate shedding and backpressure mechanisms respond to overload in the worst possible way: they accept every request, exhaust memory or CPU, cascade latency to every caller, and eventually crash — taking minutes or hours to recover. Systems built with shedding in mind fail gracefully, shed the cheapest work first, protect the most critical paths, and recover in seconds once load subsides. This lesson covers the patterns, tooling, and production judgment that separate the two outcomes.
Why Overload Without Shedding Is Self-Defeating
When an HTTP service is saturated, adding more load does not proportionally increase throughput — it collapses it. Thread pools fill, GC pressure spikes, lock contention grows, and p99 latency balloons. Every new request in-flight costs memory and thread time that could have served existing requests. The correct response is to reject marginal requests early and cheaply, preserving capacity for the work already accepted. This is load shedding.
Backpressure is the complementary mechanism: rather than silently dropping work at the boundary, you signal upstream producers to slow down. The two patterns address different topologies — shedding is boundary-side, backpressure is pipeline-internal — and production systems need both.
Concurrency Limits as the Primary Shed Mechanism
The most reliable load shed is a hard concurrency limit — a maximum number of requests processed simultaneously. When the limit is reached, new arrivals are rejected with 503 Service Unavailable immediately, before any work begins. Netflix's Concurrency Limit library (now part of resilience4j) and Envoy's circuit breaker connection limits both implement this model. The key insight: latency under load is dominated by queue depth, and a tight concurrency limit keeps the queue at or near zero for admitted requests.
In Kubernetes the primary control point is Envoy (via Istio or as a standalone sidecar). The following snippet sets per-upstream connection and pending-request limits for a downstream service:
Rate Limiting vs. Concurrency Limiting
Rate limiting (e.g., 500 RPS per tenant, enforced via a token-bucket in Redis or Envoy's local rate-limit filter) protects against bursty clients and quota abuse. Concurrency limiting protects against overload regardless of source. They solve different problems. At Google and Meta scale, rate limits are enforced at the edge/API gateway per authenticated caller, while concurrency limits are enforced per microservice instance. Run both: rate limits facing out, concurrency limits facing in.
AdaptiveBulkhead implements this. For latency-sensitive services, adaptive limits outperform static ones significantly during partial failures.
Backpressure in Async Pipelines
Backpressure is the producer-consumer contract that says: do not produce faster than the consumer can process. In synchronous HTTP this happens implicitly — the caller blocks until you respond. In async pipelines (Kafka, SQS, internal channels) you must engineer it explicitly.
Kafka consumer-side backpressure is controlled via max.poll.records and the processing loop: if your consumer processes 100 records and commits offsets before fetching the next batch, the broker naturally throttles delivery to match processing rate. The mistake is fire-and-forget async processing where you spawn goroutines/threads per record with no bound — lag accumulates invisibly until the consumer OOMs.
For internal Go/Java pipelines, the idiomatic backpressure primitive is a bounded channel or blocking queue. The producer blocks (or sheds) when the channel is full. In Go: ch := make(chan Job, 500) — a full channel blocks the sender, propagating pressure upstream. Never use unbounded queues in latency-sensitive pipelines.
Brownouts: Degraded Service Over Hard Failure
A brownout is a deliberate partial degradation: you keep the service running but shed non-essential features to protect core functionality. Common brownout strategies at big-tech companies include:
- Feature flags with load gates: Disable recommendation panels, social proof widgets, or personalization calls when CPU utilization exceeds 80% — serving a stripped-but-functional page instead of a timeout.
- Static fallbacks: Return a cached or static version of content (stale CDN edge cache, last-known-good snapshot) when the origin is overloaded.
- Priority queuing: SRE-level triage — drop background jobs (analytics ingest, cache warming) before shedding foreground user requests. Separate Kafka topics or SQS queues per priority tier, with consumers biased toward the high-priority queue.
- Reduced consistency: Temporarily serve reads from a replica with slightly stale data rather than blocking on the primary under write pressure.
requests_shed_total counter broken down by reason and endpoint, you cannot distinguish "working as designed" from "we are hemorrhaging revenue." Instrument before you deploy the shedding logic.
Implementing a Shed Decision in an Nginx / HAProxy Edge
For services running behind Nginx, the simplest overload protection is the connection limit and request queue depth. HAProxy's maxconn per-backend + timeout connect provides similar protection at L4. Below is an Nginx example that limits in-flight connections to 200 per upstream, returns 503 immediately when the queue of 50 pending requests is full, and emits a header so clients can retry after a sensible delay:
The Overload Diagram: Request Fate at Each Boundary
Measuring Shedding Effectiveness
Three metrics decide whether your shedding strategy is working:
- Shed rate (
requests_shed_total / requests_total): Should spike during overload events and return to near-zero afterwards. A non-zero baseline shed rate means your normal capacity is too close to the limit. - p99 latency of admitted requests during shed: Should remain flat. If p99 rises while you are shedding, the concurrency limit is too high — reduce it.
- Error budget consumption rate: Shedding 503s count against your SLO error budget. Tune the shed threshold so that errors stay within budget during your worst-case traffic spikes, not beyond them.
Production Checklist
- Every service has a maximum concurrency configured (Envoy, Nginx, or application middleware) — not just a timeout.
- Shed decisions emit
requests_shed_total{reason,endpoint}metrics to Prometheus. - Brownout gates are tested in staging under synthetic overload (k6 or Gatling) before production.
- Kafka consumers have bounded
max.poll.recordsand manual commit — never unbounded async processing. - Upstream callers receive machine-readable retry semantics:
Retry-Afterheader on 503, exponential backoff + jitter in client libraries. - Shed thresholds are reviewed in every capacity review (Lesson 9) — static limits drift stale as traffic patterns shift.