Circuit Breakers & Bulkheads

A single slow or failing downstream service can quietly poison the entire system. Without explicit protection, a thread pool fills up waiting for timeouts, memory accumulates unanswered requests, and latency spreads upstream — a cascading failure. Two complementary patterns stop this: the circuit breaker isolates a misbehaving dependency over time, and the bulkhead confines a failure to a limited resource pool so the rest of the system stays unaffected.

Cascading Failures: The Core Problem

Imagine Service A calls Service B, which calls Service C. Service C starts experiencing database contention and responding in 30 seconds instead of 50 ms. Service B's thread pool fills with threads blocked on C. Service A's calls to B now also block, exhausting A's pool. Within minutes, the entire call chain is down — and Service C was only slow, not down. This is the hallmark of a cascade.

The root cause is unbounded waiting: each caller blindly waits for the dependency to respond, consuming a scarce resource (threads, connections, memory) for the duration of that wait. The fix is to fail fast and explicitly when a dependency is unhealthy.
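
To put rough numbers on the scenario above, the sketch below estimates how quickly a worker pool fills once a dependency slows down. The pool size (200) and arrival rate (100 req/s) are assumptions chosen for illustration, not figures from the scenario, and the function name is hypothetical.

```python
def seconds_until_exhausted(pool_size, arrivals_per_s, latency_s):
    """Time until every worker is blocked, or None if the pool keeps up.

    Steady-state demand is arrivals_per_s * latency_s workers (Little's Law).
    If that fits in the pool, the pool never fills. Otherwise it fills after
    pool_size / arrivals_per_s seconds, because no worker is released
    before latency_s has passed.
    """
    demand = arrivals_per_s * latency_s
    if demand <= pool_size:
        return None
    return pool_size / arrivals_per_s


# Assumed numbers: 200 workers, 100 requests per second.
print(seconds_until_exhausted(200, 100, 0.05))  # healthy: 50 ms per call
print(seconds_until_exhausted(200, 100, 30.0))  # degraded: 30 s per call
```

Under these assumptions the healthy dependency needs only about 5 workers, while the degraded one would need 3,000, so a pool of 200 is fully blocked about two seconds after the slowdown begins.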

Circuit breaker state machine: CLOSED (normal), OPEN (fast-fail), HALF-OPEN (probing recovery).

The Circuit Breaker Pattern

Popularized by Michael Nygard in Release It!, a circuit breaker wraps every outbound call to a dependency and tracks its success/failure rate. It has three states:

CLOSED — calls flow through normally. Failures are counted in a sliding window (e.g., last 60 seconds or last 100 calls). If the error rate crosses a threshold — say 50% — the breaker trips to OPEN.
OPEN — all calls fail immediately without touching the dependency. The caller gets an error in microseconds rather than waiting for a 30-second timeout. This is the "fail fast" state.
HALF-OPEN — after a configured sleep window (e.g., 10 seconds), the breaker lets a small number of probe requests through. If they succeed, the breaker resets to CLOSED. If they fail, it returns to OPEN and waits again.
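
The three states can be sketched as a small class. This is a minimal single-threaded illustration, not how Hystrix or Resilience4j implement it: the class and parameter names are hypothetical, it uses a count-based window, it admits one probe at a time, and it has no locking.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=100, min_calls=20,
                 sleep_window_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.sleep_window_s = sleep_window_s
        self.clock = clock  # injectable so tests can control time
        self.results = deque(maxlen=window_size)  # True means the call failed
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.sleep_window_s:
                raise CircuitOpenError("circuit is open")  # fail fast
            self.state = "HALF_OPEN"  # sleep window elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.results.clear()  # probe succeeded: start with a clean window
            self.state = "CLOSED"
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()  # probe failed: back to OPEN for another sleep window
            return
        self.results.append(True)
        enough_calls = len(self.results) >= self.min_calls
        if enough_calls and sum(self.results) / len(self.results) >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

A caller would wrap each outbound request, for example breaker.call(client.get, url), and treat CircuitOpenError as the signal to use a fallback. A production version would also need thread safety and a limit on concurrent probes.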

Key insight: A circuit breaker does not fix the downstream service — it protects the caller from wasting resources while the dependency is unhealthy. Recovery still needs to happen on the other end; the breaker just buys time without cascading.

Practical thresholds from production systems: Netflix Hystrix by default trips at a 50% error rate once at least 20 requests have been seen in a 10-second rolling window, and then waits out a 5-second sleep window. Resilience4j (widely used in Java since Hystrix entered maintenance mode) defaults to a count-based sliding window of the last 100 calls, implemented as a ring bit buffer in older releases, with the same 50% threshold. Tune these numbers per service based on its normal error baseline.

What to Return When the Circuit is Open

An open circuit should not just throw an exception. The caller has three good options:

Return a cached/stale response — if the last known value is acceptable (e.g., a product price from 5 minutes ago), return it with a staleness indicator.
Return a degraded/default response — return an empty list, a default recommendation set, or a "feature unavailable" flag. The user experience degrades gracefully instead of crashing.
Enqueue for async retry — for write operations, buffer the request in a local queue and retry when the circuit closes.

Best practice: Define a fallback for every circuit breaker at design time, not as an afterthought. The question to ask is: what is the least-bad user experience when this dependency is unavailable?
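
The first two fallback options can be combined in one wrapper. The sketch below is illustrative only: StaleCacheFallback and its parameters are hypothetical names, the cache is an in-process dictionary, and any exception from the fetch function (including an open-circuit error) triggers the fallback.

```python
import time


class StaleCacheFallback:
    """Wrap a fetch function; on failure, serve the last good value if it is fresh enough."""

    def __init__(self, fetch, max_stale_s=300.0, default=None, clock=time.monotonic):
        self.fetch = fetch
        self.max_stale_s = max_stale_s
        self.default = default
        self.clock = clock
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key):
        """Return (value, source), where source is 'live', 'stale' or 'default'."""
        try:
            value = self.fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is not None and self.clock() - cached[1] <= self.max_stale_s:
                return cached[0], "stale"  # staleness indicator for the caller
            return self.default, "default"  # degraded response
        self._cache[key] = (value, self.clock())
        return value, "live"
```

Returning the source alongside the value lets the caller decide how to present stale or default data, for example by hiding a price that is too old to trust.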

The Bulkhead Pattern

A bulkhead (named after the watertight compartments in a ship's hull) limits the blast radius of a failure by giving each dependency or category of work its own isolated resource pool. The idea is simple: if Service A calls both Service B and Service C, and B becomes slow, B's calls should not be able to consume all of A's threads and starve calls to C.

Bulkhead isolation: each downstream dependency gets its own thread pool, so one saturated pool cannot starve the others.

The two common implementations of bulkheads are:

Thread-pool bulkhead: each dependency gets its own fixed-size thread pool, optionally with a bounded queue. Once the pool and its queue are full, new calls are rejected immediately. Hystrix's thread isolation and Resilience4j's ThreadPoolBulkhead work this way, running calls on executor-backed worker threads.
Semaphore bulkhead: instead of a separate thread pool, a semaphore caps the number of concurrent in-flight calls. It is lighter weight (no extra threads), but it cannot enforce a timeout on the call itself because the call still runs on the caller's thread.
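
Both variants can be sketched with the Python standard library. The class names are hypothetical and this is not the Hystrix or Resilience4j implementation. Because ThreadPoolExecutor queues work without bound, the thread-pool variant here assumes a semaphore is an acceptable way to enforce capacity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free capacity."""


class SemaphoreBulkhead:
    """Caps concurrent calls; the call runs on the caller's thread."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of queueing.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("semaphore bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()


class ThreadPoolBulkhead:
    """Runs calls on a dedicated pool, so the caller can give up after a timeout."""

    def __init__(self, workers, queue_size=0):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # The executor's own queue is unbounded, so capacity is enforced here.
        self._slots = threading.BoundedSemaphore(workers + queue_size)

    def call(self, fn, *args, timeout_s, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("thread-pool bulkhead is full")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # The slot is freed when the work finishes, not when the caller times out.
        future.add_done_callback(lambda _f: self._slots.release())
        # Raises FutureTimeout if the result is not ready within timeout_s.
        return future.result(timeout=timeout_s)
```

Note the trade-off the sketch makes visible: when the thread-pool variant times out, the caller is released but the worker stays busy until the underlying call returns, so the dependency's own client timeout still matters.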

Pitfall: A shared HTTP client connection pool across all dependencies defeats bulkheads entirely. If Service B consumes all 200 connections, calls to Service C queue at the connection pool level before they even reach the bulkhead. Always configure per-dependency connection pools alongside your semaphore or thread-pool bulkheads.

Circuit Breakers and Bulkheads Together

These two patterns are complementary, not alternatives. Think of them as two lines of defence:

The bulkhead limits how many resources a failing dependency can consume — it caps blast radius in space.
The circuit breaker stops calling a dependency altogether once it is clearly unhealthy — it caps blast radius in time.

In a well-designed system, the bulkhead contains early degradation (B is slow but has not yet crossed the error threshold), and the circuit breaker takes over if the situation worsens. With Resilience4j's Spring Boot integration, composing both is straightforward: annotate a method with @Bulkhead and @CircuitBreaker, configure independent properties per dependency in application.yml, and let the framework handle state tracking.
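
Outside a framework, the same composition can be sketched with two decorators. Everything here is hypothetical and simplified: the breaker trips on consecutive failures rather than a windowed error rate, there is no locking around breaker state, and fetch_recommendations stands in for a real client call.

```python
import functools
import threading
import time


class Rejected(Exception):
    """Raised when a guard refuses a call without invoking the dependency."""


def bulkhead(max_concurrent):
    permits = threading.BoundedSemaphore(max_concurrent)

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not permits.acquire(blocking=False):
                raise Rejected("bulkhead full")
            try:
                return fn(*args, **kwargs)
            finally:
                permits.release()
        return wrapper
    return decorate


def circuit_breaker(max_failures, sleep_window_s, clock=time.monotonic):
    # Simplified: trips on consecutive failures, not a windowed error rate.
    state = {"failures": 0, "opened_at": None}

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            opened_at = state["opened_at"]
            if opened_at is not None and clock() - opened_at < sleep_window_s:
                raise Rejected("circuit open")
            try:
                result = fn(*args, **kwargs)  # a probe if the circuit was open
            except Rejected:
                raise  # bulkhead rejection: says nothing about dependency health
            except Exception:
                state["failures"] += 1
                if opened_at is not None or state["failures"] >= max_failures:
                    state["opened_at"] = clock()  # trip, or re-open after a failed probe
                raise
            state["failures"] = 0
            state["opened_at"] = None
            return result
        return wrapper
    return decorate


@circuit_breaker(max_failures=3, sleep_window_s=10.0)
@bulkhead(max_concurrent=40)
def fetch_recommendations(user_id):
    """Hypothetical dependency call; replace the body with a real client call."""
    raise ConnectionError("recommendations service unavailable")
```

One design choice in this sketch is that bulkhead rejections pass through the breaker without counting as failures. Whether a saturated bulkhead should count toward tripping the circuit is a judgment call that depends on how the service behaves under load.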

Real-world example: Netflix's API gateway uses both. Each downstream service (Recommendations, Ratings, Search) has its own thread pool (bulkhead) of roughly 40 threads. Each also has a circuit breaker with a 50% error threshold. During an incident in 2012, the Recommendations service degraded; its thread pool saturated, the circuit opened, and every homepage request received a pre-computed default recommendation list from cache. Browse was fully functional — only personalised recommendations were missing. Without these patterns, every homepage load would have timed out.

Key Configuration Parameters to Tune

Getting these patterns right in production requires tuning, not just enabling:

Minimum call volume: do not trip the breaker after 2 failures out of 2 calls. Require a minimum number of calls in the window (e.g., 20) before evaluating the error rate.
Slow-call threshold: treat a call as slow, and count it against the breaker, if it exceeds a latency threshold (e.g., 2 seconds). Slow calls are just as dangerous as errors because they hold resources for longer.
Sleep window: how long the breaker stays OPEN before probing. Too short and you hammer an already-struggling service; too long and you delay recovery unnecessarily. 5–30 seconds is a common range.
Bulkhead size: size thread pools from observed concurrency, not hope. If a call takes 200 ms on average and you want to handle 100 req/s, you need at least 20 threads (Little's Law: N = λ × W, so 100 × 0.2 = 20). Treat that as a floor and add headroom for latency spikes.
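
The bulkhead sizing rule can be written as a small helper. The function name and the headroom multiplier are assumptions for illustration; the right headroom depends on how spiky the dependency's latency is.

```python
import math


def min_pool_size(requests_per_s, avg_latency_ms, headroom=1.0):
    """Little's Law (N = lambda * W), scaled by an optional headroom factor."""
    in_flight = requests_per_s * avg_latency_ms / 1000
    return math.ceil(in_flight * headroom)


print(min_pool_size(100, 200))                # the example above: 20 threads
print(min_pool_size(100, 200, headroom=1.5))  # with 50% headroom: 30 threads
```

Averages hide tail latency, so it is worth repeating the calculation with a high-percentile latency before settling on a pool size.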