Circuit Breakers & Bulkheads
A single slow or failing downstream service can quietly poison the entire system. Without explicit protection, a thread pool fills up waiting for timeouts, memory accumulates unanswered requests, and latency spreads upstream — a cascading failure. Two complementary patterns stop this: the circuit breaker isolates a misbehaving dependency over time, and the bulkhead confines a failure to a limited resource pool so the rest of the system stays unaffected.
Cascading Failures: The Core Problem
Imagine Service A calls Service B, which calls Service C. Service C starts experiencing database contention and responding in 30 seconds instead of 50 ms. Service B's thread pool fills with threads blocked on C. Service A's calls to B now also block, exhausting A's pool. Within minutes, the entire call chain is down — and Service C was only slow, not down. This is the hallmark of a cascade.
The root cause is unbounded waiting: each caller blindly waits for the dependency to respond, consuming a scarce resource (threads, connections, memory) for the duration of that wait. The fix is to fail fast and explicitly when a dependency is unhealthy.
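To make the scenario above concrete, here is a rough back-of-the-envelope sketch. The pool size and arrival rate are hypothetical figures chosen for illustration; only the two latencies (50 ms and 30 s) come from the text.

```python
def in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Average number of calls in progress at steady state (rate x latency)."""
    return arrival_rate_per_s * latency_s


def seconds_until_exhausted(pool_size: int, arrival_rate_per_s: float) -> float:
    """Time to fill the pool, assuming no call returns in the meantime."""
    return pool_size / arrival_rate_per_s


# Hypothetical figures for Service B; not taken from the article.
POOL_SIZE = 200        # assumed worker threads in B
ARRIVAL_RATE = 50.0    # assumed calls per second from B to C

healthy = in_flight(ARRIVAL_RATE, 0.050)   # C answers in 50 ms
degraded = in_flight(ARRIVAL_RATE, 30.0)   # C answers in 30 s
time_to_full = seconds_until_exhausted(POOL_SIZE, ARRIVAL_RATE)

print(f"threads busy when healthy:  {healthy:.1f}")
print(f"threads wanted when slow:   {degraded:.0f} (pool has {POOL_SIZE})")
print(f"pool exhausted after about: {time_to_full:.0f} s")
```

Under these assumptions B normally keeps two or three threads busy, but the slowdown demands far more threads than the pool holds, and the pool fills long before the first slow call returns.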
The Circuit Breaker Pattern
Popularized by Michael Nygard in Release It!, a circuit breaker wraps every outbound call to a dependency and tracks its success and failure rate. It has three states:
- CLOSED — calls flow through normally. Failures are counted in a sliding window (e.g., last 60 seconds or last 100 calls). If the error rate crosses a threshold — say 50% — the breaker trips to OPEN.
- OPEN — all calls fail immediately without touching the dependency. The caller gets an error in microseconds rather than waiting for a 30-second timeout. This is the "fail fast" state.
- HALF-OPEN — after a configured sleep window (e.g., 10 seconds), the breaker lets a small number of probe requests through. If they succeed, the breaker resets to CLOSED. If they fail, it returns to OPEN and waits again.
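The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than any library's implementation: the class name, parameter names, and the choice to admit exactly one probe in HALF_OPEN are assumptions made for the example, and the clock is injectable so the behaviour can be exercised without sleeping.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    """Illustrative count-based breaker; not thread-safe."""

    def __init__(self, window_size=100, failure_rate=0.5, min_calls=20,
                 sleep_window_s=10.0, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.min_calls = min_calls
        self.sleep_window_s = sleep_window_s
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.results = deque(maxlen=window_size)  # True marks a failure

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.sleep_window_s:
                raise CircuitOpenError("failing fast")
            self.state = "HALF_OPEN"  # sleep window elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"  # probe succeeded: reset
            self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()  # probe failed: back to OPEN
            return
        self.results.append(True)
        enough = len(self.results) >= self.min_calls
        if enough and sum(self.results) / len(self.results) >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.results.clear()
```

A caller would route every request to the dependency through `breaker.call(fetch, ...)` and treat `CircuitOpenError` as the signal to use a fallback. A production breaker would also need locking and a limit on concurrent probes.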
Practical reference points: by default, Netflix Hystrix trips at a 50% error rate once at least 20 requests have been seen in a 10-second rolling window, and uses a 5-second sleep window. Resilience4j (a widely used Java successor now that Hystrix is in maintenance mode) evaluates the last 100 calls by default, using a ring bit buffer in early releases and a count-based sliding window in later ones, with the same 50% threshold. These numbers should be tuned per service based on its normal error baseline.
What to Return When the Circuit is Open
An open circuit should not just throw an exception. The caller has three good options:
- Return a cached/stale response — if the last known value is acceptable (e.g., a product price from 5 minutes ago), return it with a staleness indicator.
- Return a degraded/default response — return an empty list, a default recommendation set, or a "feature unavailable" flag. The user experience degrades gracefully instead of crashing.
- Enqueue for async retry — for write operations, buffer the request in a local queue and retry when the circuit closes.
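The first two options can be combined in a thin wrapper around the protected call. The following is a sketch under assumptions: `fetch` stands for any callable that raises when the circuit is open or the call fails, and the `Response` fields and class name are hypothetical names invented for this example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class Response:
    value: Any
    source: str    # "live", "cache", or "default"
    age_s: float   # seconds since the value was fetched (0.0 when live)


class StaleCacheFallback:
    """Serve the last known good value, then a default, when fetch fails."""

    def __init__(self, fetch: Callable[[str], Any], default: Any = None,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.default = default
        self.clock = clock
        self._cache: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Response:
        try:
            value = self.fetch(key)
        except Exception:
            if key in self._cache:
                value, fetched_at = self._cache[key]
                # Staleness indicator: tell the caller how old the value is.
                return Response(value, "cache", self.clock() - fetched_at)
            # Nothing cached: degrade to the configured default.
            return Response(self.default, "default", float("inf"))
        self._cache[key] = (value, self.clock())
        return Response(value, "live", 0.0)
```

Returning the source and age alongside the value lets the caller decide whether, say, a five-minute-old price is acceptable for display but not for checkout.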
The Bulkhead Pattern
A bulkhead (named after the watertight compartments in a ship's hull) limits the blast radius of a failure by giving each dependency or category of work its own isolated resource pool. The idea is simple: if Service A calls both Service B and Service C, and B becomes slow, B's calls should not be able to consume all of A's threads and starve calls to C.
The two common implementations of bulkheads are:
- Thread-pool bulkhead — each dependency gets its own fixed-size thread pool. Calls queue inside that pool; once full, new calls are rejected immediately. Used by Hystrix/Resilience4j with ExecutorService-backed command threads.
- Semaphore bulkhead — instead of a separate thread pool, a semaphore caps the number of concurrent in-flight calls. Lighter weight (no extra threads), but it cannot enforce a timeout on the call itself because the call still runs on the caller's thread.
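The semaphore variant is small enough to sketch in full. This is an illustrative version using Python's standard `threading.BoundedSemaphore`; the class and exception names are hypothetical, and rejecting immediately (rather than waiting briefly for a permit) is a choice made for the example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the concurrency limit is already reached."""


class SemaphoreBulkhead:
    def __init__(self, max_concurrent: int):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("no permits available")
        try:
            # Runs on the caller's thread, so the bulkhead itself cannot
            # interrupt a call that hangs; the call needs its own timeout.
            return fn(*args, **kwargs)
        finally:
            self._permits.release()
```

One instance per dependency (for example, one for B and one for C) keeps a slow B from taking more than its share of A's threads.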
Circuit Breakers and Bulkheads Together
These two patterns are complementary, not alternatives. Think of them as two lines of defence:
- The bulkhead limits how many resources a failing dependency can consume — it caps blast radius in space.
- The circuit breaker stops calling a dependency altogether once it is clearly unhealthy — it caps blast radius in time.
In a well-designed system, a bulkhead contains early degradation (B is slow but has not yet crossed the error threshold), and the circuit breaker kicks in if the situation worsens. Resilience4j's Spring Boot integration makes it straightforward to compose both: annotate a method with @Bulkhead and @CircuitBreaker, configure independent properties per dependency in application.yml, and let the framework handle state tracking.
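A language-neutral way to see how the two checks layer is the compact sketch below. It is deliberately simplified and is not how Resilience4j is implemented: the breaker counts consecutive failures instead of a windowed rate, it has no separate half-open state (calls simply resume after the sleep window), and counting bulkhead rejections as failures is an assumption of this example. All names are hypothetical.

```python
import threading
import time


class Rejected(Exception):
    """Call was refused locally; the dependency was never contacted."""


class GuardedDependency:
    """Semaphore bulkhead nested inside a simplified circuit breaker."""

    def __init__(self, max_concurrent=10, max_consecutive_failures=5,
                 sleep_window_s=10.0, clock=time.monotonic):
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._failures = 0
        self._open_until = 0.0
        self.max_consecutive_failures = max_consecutive_failures
        self.sleep_window_s = sleep_window_s
        self.clock = clock

    def call(self, fn, *args, **kwargs):
        # Breaker check first: while open, do not even take a permit.
        if self.clock() < self._open_until:
            raise Rejected("circuit open")
        # Bulkhead check second: cap concurrent calls to this dependency.
        if not self._permits.acquire(blocking=False):
            self._record(success=False)
            raise Rejected("bulkhead full")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(success=False)
            raise
        else:
            self._record(success=True)
            return result
        finally:
            self._permits.release()

    def _record(self, success):
        with self._lock:
            if success:
                self._failures = 0
                return
            self._failures += 1
            if self._failures >= self.max_consecutive_failures:
                # Trip: refuse calls until the sleep window has passed.
                self._open_until = self.clock() + self.sleep_window_s
                self._failures = 0
```

The bulkhead bounds how many threads can be stuck at once, while the breaker stops new attempts once failures (or rejections) pile up.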
Key Configuration Parameters to Tune
Getting these patterns right in production requires tuning, not just enabling:
- Minimum call volume — do not trip the breaker after 2 failures out of 2 calls. Require at least 20 calls in the window before evaluating error rate.
- Slow-call threshold — treat a call as "slow" if it exceeds a latency threshold (e.g., 2 seconds), and let slow calls count toward tripping the breaker just as errors do. Slow calls are just as dangerous as errors.
- Sleep window — the OPEN state duration before probing. Too short: you hammer an already-struggling service. Too long: you delay recovery unnecessarily. 5–30 seconds is a common range.
- Bulkhead size — size thread pools based on observed concurrency, not hope. If a call takes 200 ms on average and you want to handle 100 req/s, you need at least 20 threads (Little's Law: N = λ × W).
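The bulkhead sizing rule can be written as a one-line helper. This is a small sketch of the Little's Law arithmetic; the `headroom` multiplier is an assumption added for illustration and is not a figure from the text.

```python
import math


def bulkhead_size(arrival_rate_per_s: float, mean_latency_s: float,
                  headroom: float = 1.0) -> int:
    """Threads (or permits) needed: N = lambda * W, scaled by headroom."""
    needed = arrival_rate_per_s * mean_latency_s * headroom
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(needed, 9))


baseline = bulkhead_size(100, 0.200)          # the example in the text
with_margin = bulkhead_size(100, 0.200, 1.5)  # hypothetical 50% headroom

print(baseline)      # 20
print(with_margin)   # 30
```

Because W is an average, a pool sized at exactly N will reject calls during ordinary latency spikes, which is why some margin above the baseline is usually worth measuring and adding.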