Capacity Planning & Autoscaling

Load Shedding & Backpressure

18 min Lesson 8 of 27

Every system has a capacity ceiling. The question is not whether you will hit it, but what happens when you do. Systems that lack deliberate shedding and backpressure mechanisms respond to overload in the worst possible way: they accept every request, exhaust memory or CPU, cascade latency to every caller, and eventually crash — taking minutes or hours to recover. Systems built with shedding in mind fail gracefully, shed the cheapest work first, protect the most critical paths, and recover in seconds once load subsides. This lesson covers the patterns, tooling, and production judgment that separate the two outcomes.

Why Overload Without Shedding Is Self-Defeating

When an HTTP service is saturated, adding more load does not proportionally increase throughput — it collapses it. Thread pools fill, GC pressure spikes, lock contention grows, and p99 latency balloons. Every new request in-flight costs memory and thread time that could have served existing requests. The correct response is to reject marginal requests early and cheaply, preserving capacity for the work already accepted. This is load shedding.

Backpressure is the complementary mechanism: rather than silently dropping work at the boundary, you signal upstream producers to slow down. The two patterns address different topologies — shedding is boundary-side, backpressure is pipeline-internal — and production systems need both.

Concurrency Limits as the Primary Shed Mechanism

The most reliable load-shedding mechanism is a hard concurrency limit: a maximum number of requests processed simultaneously. When the limit is reached, new arrivals are rejected immediately with 503 Service Unavailable, before any work begins. Netflix's open-source concurrency-limits library and Envoy's circuit-breaker connection and request limits both implement this model. The key insight: latency under load is dominated by queue depth, and a tight concurrency limit keeps the queue at or near zero for admitted requests.
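As a minimal sketch of the reject-before-work idea, the following Python class caps in-flight work with a non-blocking semaphore. The class name `ConcurrencyLimiter` and the `(status, result)` return shape are illustrative assumptions, not the API of any library named above.

```python
import threading


class ConcurrencyLimiter:
    """Admit at most `limit` requests at once; reject the rest immediately."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, work):
        """Run `work()` if a slot is free, otherwise shed with a 503."""
        # Non-blocking acquire: never queue, so rejected requests cost almost nothing.
        if not self._slots.acquire(blocking=False):
            return 503, None
        try:
            return 200, work()
        finally:
            # Always free the slot, even if `work` raises.
            self._slots.release()
```

Because the acquire never blocks, a rejected request holds no thread and no memory beyond the check itself, which is what keeps the queue for admitted requests near zero.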

Little's Law in practice: L = λ × W. If your service has average latency W of 50 ms and you want to cap in-flight requests L at 200, you can sustain throughput λ of 4,000 RPS. At 6,000 RPS the concurrency limit triggers — you shed 33% of requests but the 67% you admit see the same 50 ms, not a 5-second timeout.
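The arithmetic above can be checked with a few lines of Python. The function names are hypothetical helpers, and the model assumes latency stays at W for admitted requests, which is the whole point of the limit.

```python
def sustainable_rps(max_in_flight: int, avg_latency_ms: float) -> float:
    """Little's Law rearranged: lambda = L / W, with W given in milliseconds."""
    return max_in_flight * 1000 / avg_latency_ms


def shed_fraction(offered_rps: float, max_in_flight: int, avg_latency_ms: float) -> float:
    """Fraction of offered requests rejected once the limit is saturated."""
    capacity = sustainable_rps(max_in_flight, avg_latency_ms)
    if offered_rps <= capacity:
        return 0.0
    return 1.0 - capacity / offered_rps


print(sustainable_rps(200, 50))                 # 4000.0
print(round(shed_fraction(6000, 200, 50), 2))   # 0.33
```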

In Kubernetes the usual control point is Envoy (via Istio or as a standalone sidecar). The following DestinationRule sets connection-pool and pending-request limits that each client-side sidecar enforces on calls to the payments-svc upstream:

# Istio DestinationRule: limits enforced by each client-side Envoy on calls to payments-svc
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-svc
spec:
  host: payments-svc.payments.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 200          # TCP connection pool to the upstream
      http:
        http1MaxPendingRequests: 100  # shed here: requests queued waiting for a connection
        http2MaxRequests: 200         # shed here: active requests (applies to HTTP/1.1 and HTTP/2)
        maxRequestsPerConnection: 50  # drain and refresh connections
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s

Rate Limiting vs. Concurrency Limiting

Rate limiting (e.g., 500 RPS per tenant, enforced via a token bucket in Redis or Envoy's local rate-limit filter) protects against bursty clients and quota abuse. Concurrency limiting protects against overload regardless of source. They solve different problems. In large deployments, rate limits are typically enforced at the edge or API gateway per authenticated caller, while concurrency limits are enforced per microservice instance. Run both: rate limits facing out, concurrency limits facing in.
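To make the token-bucket idea concrete, here is a single-process sketch in Python. It is an assumption-laden illustration: a production per-tenant limiter would keep the bucket state in shared storage such as Redis, and the `TokenBucket` name and its injectable `clock` parameter exist only for this example.

```python
import time


class TokenBucket:
    """Allow `rate_per_s` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)   # start full so an idle client can burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # caller should answer 429 or 503
```

One bucket per tenant (for example, a dict keyed by tenant ID) gives the per-caller quota described above. Note that the bucket bounds arrival rate only; it says nothing about how many requests are in flight, which is why the concurrency limit is still needed.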

Adaptive concurrency limits: Static limits break in services with variable-latency upstreams (databases, LLM calls). Netflix's concurrency-limits library provides Gradient2 and TCP Vegas-inspired algorithms that measure round-trip latency, lower the concurrency limit as latency rises, and raise it again as latency falls. Resilience4j's AdaptiveBulkhead takes a similar approach. For latency-sensitive services, adaptive limits typically hold admitted-request latency steadier than static ones during partial failures.
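The feedback loop can be sketched with a much simpler rule than the algorithms named above. The class below is a deliberately simplified additive-increase, multiplicative-decrease controller; it is not Gradient2 or Vegas, and the `AdaptiveLimit` name, the `tolerance` factor, and the 0.9 backoff are assumptions chosen for illustration.

```python
class AdaptiveLimit:
    """Simplified latency-driven concurrency limit (AIMD), for illustration only."""

    def __init__(self, initial: int = 20, min_limit: int = 1,
                 max_limit: int = 200, tolerance: float = 2.0) -> None:
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.tolerance = tolerance   # how far above baseline latency may drift
        self.min_rtt = None          # best latency seen so far, used as baseline

    def on_sample(self, rtt_s: float) -> int:
        """Feed one observed request latency and return the updated limit."""
        if self.min_rtt is None or rtt_s < self.min_rtt:
            self.min_rtt = rtt_s
        if rtt_s > self.min_rtt * self.tolerance:
            # Latency is inflating, which suggests queueing: back off quickly.
            self.limit = max(self.min_limit, int(self.limit * 0.9))
        else:
            # Latency is healthy: probe for more capacity slowly.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

The asymmetry (slow increase, fast decrease) is the design choice that matters: the limiter gives up capacity quickly when an upstream degrades and reclaims it cautiously afterwards.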

Backpressure in Async Pipelines

Backpressure is the producer-consumer contract that says: do not produce faster than the consumer can process. In synchronous HTTP this happens implicitly — the caller blocks until you respond. In async pipelines (Kafka, SQS, internal channels) you must engineer it explicitly.

Kafka consumer-side backpressure is controlled via max.poll.records and the processing loop. Kafka is pull-based: if your consumer processes each batch of up to 50 records and commits offsets before calling poll() again, it fetches only as fast as it can process. The mistake is fire-and-forget async processing, where you spawn goroutines or threads per record with no bound: the consumer keeps polling at full speed, the in-memory backlog grows invisibly while consumer lag looks healthy, and eventually the consumer OOMs.

# Kafka consumer: bounded backpressure via max.poll.records + manual commit
# application.properties (Spring Boot)
spring.kafka.consumer.max-poll-records=50
spring.kafka.consumer.enable-auto-commit=false
spring.kafka.listener.ack-mode=MANUAL_IMMEDIATE

# Also set fetch.max.bytes and max.partition.fetch.bytes to limit
# the memory pinned per poll cycle when messages are large.
# Keep comments on their own lines: .properties files treat a trailing
# "# ..." as part of the value.
# 5 MB per fetch
spring.kafka.consumer.properties.fetch.max.bytes=5242880
# 1 MB per partition
spring.kafka.consumer.properties.max.partition.fetch.bytes=1048576

For internal Go/Java pipelines, the idiomatic backpressure primitive is a bounded channel or blocking queue. The producer blocks (or sheds) when the channel is full. In Go: ch := make(chan Job, 500) — a full channel blocks the sender, propagating pressure upstream. Never use unbounded queues in latency-sensitive pipelines.
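The same block-or-shed choice can be sketched in Python with the standard library's bounded `queue.Queue`, which plays the role of the bounded channel. The `submit` helper and its `timeout_s` parameter are hypothetical names for this example.

```python
import queue


def submit(q: queue.Queue, job, timeout_s: float = 0.0) -> bool:
    """Enqueue `job`; return False if the queue is still full after `timeout_s`.

    timeout_s == 0 sheds immediately; a positive value blocks the producer
    for up to that long, which is how pressure propagates upstream.
    """
    try:
        if timeout_s > 0:
            q.put(job, timeout=timeout_s)   # backpressure: producer waits
        else:
            q.put_nowait(job)               # shedding: producer is refused
        return True
    except queue.Full:
        return False


jobs = queue.Queue(maxsize=500)   # the bound is what makes backpressure possible
```

Whether the producer blocks or sheds depends on where it sits: an internal pipeline stage can usually afford to wait, while a request handler at the edge should refuse quickly.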

Brownouts: Degraded Service Over Hard Failure

A brownout is a deliberate partial degradation: you keep the service running but shed non-essential features to protect core functionality. Common brownout strategies at big-tech companies include:

Feature flags with load gates: Disable recommendation panels, social proof widgets, or personalization calls when CPU utilization exceeds 80% — serving a stripped-but-functional page instead of a timeout.
Static fallbacks: Return a cached or static version of content (stale CDN edge cache, last-known-good snapshot) when the origin is overloaded.
Priority queuing: SRE-level triage — drop background jobs (analytics ingest, cache warming) before shedding foreground user requests. Separate Kafka topics or SQS queues per priority tier, with consumers biased toward the high-priority queue.
Reduced consistency: Temporarily serve reads from a replica with slightly stale data rather than blocking on the primary under write pressure.
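A load gate of the kind described in the first strategy can be sketched as a pure function from CPU utilization to the set of enabled features. Everything except the 80% threshold is an assumption: the feature names, the second threshold, and the idea of marking the core path with a threshold it can never exceed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    disable_above_cpu: float   # feature is shed once CPU utilization exceeds this


# Hypothetical feature table; lower thresholds are shed first.
FEATURES = [
    Feature("recommendations", 0.80),
    Feature("social_proof", 0.80),
    Feature("personalization", 0.85),
    Feature("checkout", 1.01),   # core path: utilization cannot exceed 1.0
]


def enabled_features(cpu_utilization: float) -> list:
    """Return the names of features that stay on at this utilization (0.0 to 1.0)."""
    return [f.name for f in FEATURES if cpu_utilization <= f.disable_above_cpu]
```

In practice the gate would read a smoothed utilization signal and apply some hysteresis so features do not flap on and off around the threshold; this sketch omits both.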

Never shed silently without observability. Every shed decision must emit a metric and, for high-value traffic, a structured log. Without a requests_shed_total counter broken down by reason and endpoint, you cannot distinguish "working as designed" from "we are hemorrhaging revenue." Instrument before you deploy the shedding logic.
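As a dependency-free sketch of that instrumentation, the counters below mirror the shape of requests_shed_total broken down by reason and endpoint. In a real service these would be Prometheus counters; the in-memory `Counter` objects and the `record` and `shed_rate` helpers are stand-ins assumed for this example.

```python
from collections import Counter

requests_total = Counter()        # keyed by endpoint
requests_shed_total = Counter()   # keyed by (reason, endpoint)


def record(endpoint: str, shed_reason=None) -> None:
    """Count one request; pass `shed_reason` when the request was rejected."""
    requests_total[endpoint] += 1
    if shed_reason is not None:
        requests_shed_total[(shed_reason, endpoint)] += 1


def shed_rate(endpoint: str) -> float:
    """requests_shed_total / requests_total for one endpoint, across all reasons."""
    total = requests_total[endpoint]
    shed = sum(n for (_, ep), n in requests_shed_total.items() if ep == endpoint)
    return shed / total if total else 0.0
```

Keeping the reason label separate (for example "concurrency_limit" versus "rate_limit") is what lets you tell a misbehaving client apart from genuine overload.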

Implementing a Shed Decision in an Nginx / HAProxy Edge

For services running behind Nginx, the simplest overload protection is a connection limit. HAProxy's per-server maxconn combined with timeout queue provides similar protection. Below is an Nginx example that caps concurrent connections at 200 for the virtual server and 20 per client IP, returns 503 immediately when either cap is reached, and emits a Retry-After header on those 503 responses so clients can retry after a sensible delay:

# nginx.conf — upstream concurrency shed
upstream api_backend {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    keepalive 64;
}

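limit_conn_zone $binary_remote_addr zone=per_ip:10m;
limit_conn_zone $server_name       zone=per_server:10m;

# Send Retry-After only on 503 responses; add_header skips empty values
map $status $retry_after {
    default "";
    503     5;
}

server {
    listen 443 ssl http2;
    # ssl_certificate / ssl_certificate_key directives omitted here

    location /api/ {
        limit_conn per_ip     20;       # per-client concurrency
        limit_conn per_server 200;      # concurrency cap for this virtual server
        limit_conn_status 503;
        limit_conn_log_level warn;

        add_header Retry-After $retry_after always; # tell clients when to retry

        proxy_http_version 1.1;          # needed for upstream keepalive
        proxy_set_header   Connection "";
        proxy_pass         http://api_backend;
        proxy_read_timeout 10s;
        proxy_send_timeout 5s;
    }
}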

The Overload Diagram: Request Fate at Each Boundary

Request fate at each boundary: rate-limit shed at the edge, concurrency-limit shed at the sidecar, and brownout in the application layer — with backpressure flowing back from the downstream.

Measuring Shedding Effectiveness

Three metrics decide whether your shedding strategy is working:

Shed rate (requests_shed_total / requests_total): Should spike during overload events and return to near-zero afterwards. A non-zero baseline shed rate means your normal capacity is too close to the limit.
p99 latency of admitted requests during shed: Should remain flat. If p99 rises while you are shedding, the concurrency limit is too high — reduce it.
Error budget consumption rate: Shed 503s count against your SLO error budget. Tune the shed threshold so that errors stay within budget even during your worst-case traffic spikes.

Graduation path for load shedding at scale: Start with a static concurrency limit at the edge. Add per-client rate limiting next. Then instrument and tune adaptive concurrency. Finally, add brownout gates around non-critical features with feature-flag integration. Each layer protects further without requiring the next — start with what you can ship this week.

Production Checklist

Every service has a maximum concurrency configured (Envoy, Nginx, or application middleware) — not just a timeout.
Shed decisions emit requests_shed_total{reason,endpoint} metrics to Prometheus.
Brownout gates are tested in staging under synthetic overload (k6 or Gatling) before production.
Kafka consumers have bounded max.poll.records and manual commit — never unbounded async processing.
Upstream callers receive machine-readable retry semantics: Retry-After header on 503, exponential backoff + jitter in client libraries.
Shed thresholds are reviewed in every capacity review (Lesson 9) — static limits drift stale as traffic patterns shift.
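The retry semantics in the checklist can be sketched as a delay calculation for client libraries. This is one reasonable policy under stated assumptions, not a standard: the function name, the 100 ms base, the 10 s cap, and the choice to add jitter on top of Retry-After (so clients do not all return at the same instant) are illustrative.

```python
import random


def retry_delay_s(attempt: int, retry_after_s=None, base_s: float = 0.1,
                  cap_s: float = 10.0, rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Exponential backoff with full jitter; a server-supplied Retry-After is
    treated as a floor, with the jitter added on top to spread clients out.
    """
    backoff = min(cap_s, base_s * (2 ** attempt))
    jitter = rng() * backoff   # uniform in [0, backoff)
    return (retry_after_s or 0.0) + jitter
```

A client would call this after each 503, passing the parsed Retry-After value when the header is present, and give up after a bounded number of attempts so retries do not themselves become the overload.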