Retries, Timeouts & Bulkheads
A circuit breaker is only one weapon in the resilience arsenal. In practice you need at least three more patterns working in concert: retries to recover from transient failures automatically, timeouts to guarantee that a slow downstream never holds a thread hostage indefinitely, and bulkheads to isolate pools of resources so that a surge in one area cannot starve every other area. This lesson walks through all three in depth — the mechanics, the configuration knobs, the failure modes, and the distributed-systems trade-offs a production engineer must understand.
Why Transient Failures Exist
Network packets are dropped. DNS lookups return stale entries for a few seconds after a rolling restart. A database primary undergoes a leader election that lasts 300 ms. These events are transient: if you simply try the same request again a moment later, it succeeds. Without retry logic your service surfaces these blips as hard errors to its callers. With well-configured retries they become invisible.
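Retries help with transient failures (for example 503 Service Unavailable or connection-pool exhaustion). They are harmful for permanent failures (400 Bad Request, business-logic rejections), because repeating the request only adds load and delays the inevitable error. Always configure an exception predicate so you only retry on recoverable errors.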
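To make the idea of an exception predicate concrete, here is a minimal plain-Java sketch. It is not Resilience4j's implementation; the class name `PredicateRetry` and its `call` method are hypothetical, and the sketch deliberately omits any wait between attempts.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only.
final class PredicateRetry {
    /** Calls the task up to maxAttempts times, retrying only failures the predicate accepts. */
    static <T> T call(Callable<T> task, int maxAttempts, Predicate<Exception> retryable) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (!retryable.test(e)) {
                    throw e; // permanent failure: surface it immediately
                }
            }
        }
        throw last; // every attempt failed with a retryable error
    }
}
```

A caller could pass `e -> e instanceof java.io.IOException` as the predicate, so that an I/O blip is retried while an `IllegalArgumentException` (standing in for a business rejection) fails on the first attempt.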
Resilience4j Retry — Setup
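Add the Resilience4j Spring Boot starter (for Spring Boot 3, io.github.resilience4j:resilience4j-spring-boot3) and Spring's AOP starter (org.springframework.boot:spring-boot-starter-aop), which the annotations rely on, to your pom.xml.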
Configure a named retry instance in application.yml:
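This gives you three attempts in total, with a 500 ms wait before the second attempt and a 1 000 ms wait before the third (a fourth attempt, if you allowed one, would wait 2 000 ms). Retries happen only for I/O and Feign exceptions; business errors are never retried.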
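The back-off schedule can be worked out with a few lines of plain Java. This is a sketch under one assumption: the attempt count includes the first call (as Resilience4j's max-attempts does), so n attempts are separated by n − 1 waits. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class BackoffSchedule {
    /** Returns the waits in milliseconds between attempts: maxAttempts attempts need maxAttempts - 1 waits. */
    static List<Long> waits(int maxAttempts, long initialWaitMillis, double multiplier) {
        List<Long> waits = new ArrayList<>();
        double wait = initialWaitMillis;
        for (int retry = 1; retry < maxAttempts; retry++) {
            waits.add(Math.round(wait));
            wait *= multiplier; // exponential growth
        }
        return waits;
    }
}
```

With an initial wait of 500 ms and a multiplier of 2, `BackoffSchedule.waits(3, 500, 2.0)` yields the waits 500 and 1000; the 2 000 ms wait only appears once a fourth attempt is allowed.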
Applying the Retry Annotation
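Add jitter by setting randomized-wait-factor: 0.5 in the Resilience4j config. This spreads each retry randomly within 50 % of the wait duration, so that many clients which failed at the same moment do not all retry at the same moment too.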
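What a randomization factor does to a single wait can be sketched as follows. The assumption here is that jitter is applied symmetrically around the base wait; `Jitter.randomized` is a hypothetical helper, not a Resilience4j call.

```java
import java.util.Random;

// Hypothetical helper, for illustration only.
final class Jitter {
    /** Picks a wait uniformly between base - factor * base and base + factor * base. */
    static long randomized(long baseMillis, double factor, Random random) {
        double delta = factor * baseMillis;
        double min = baseMillis - delta;
        double max = baseMillis + delta;
        return Math.round(min + random.nextDouble() * (max - min));
    }
}
```

With a base wait of 500 ms and a factor of 0.5, every result falls between 250 ms and 750 ms, so clients that failed together spread their retries over a 500 ms window.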
Timeouts — The Mandatory Safety Net
Retries only help if the original call eventually returns. Without a timeout, a single hung downstream can occupy a thread from your web server's pool forever. In a service handling 200 concurrent requests with a 50-thread pool, just 50 hung calls saturate the pool and all subsequent requests queue — then time out for the caller — even though the rest of your service logic is perfectly healthy.
Resilience4j provides a TimeLimiter decorator, but for most Spring Boot 3 services the simplest approach is a TimeLimiter instance in YAML combined with @TimeLimiter:
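@TimeLimiter requires the method to return a CompletableFuture. If the future does not complete within 2 s, the decorator cancels it and fails the call with a TimeoutException. Keep cancel-running-future: true (the default) so that the future is cancelled once the caller has given up. Be aware that cancelling a CompletableFuture does not by itself interrupt the thread computing it (the JDK documents that the mayInterruptIfRunning flag has no effect for CompletableFuture), so blocking work may keep running unless the underlying client also enforces its own timeout.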
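The mechanics of a time limit can be sketched with the JDK alone. The example below is an assumption-laden illustration, not the TimeLimiter implementation: it uses a plain `Future` from an `ExecutorService`, where `cancel(true)` does interrupt the worker thread, and the name `TimeoutSketch.callWithTimeout` is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
final class TimeoutSketch {
    /** Runs the task on a worker thread and gives up after timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        // One executor per call keeps the sketch short; a real service would share a pool.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker so it stops waiting
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as `TimeoutSketch.callWithTimeout(() -> slowClient.fetch(), 2000)` (where `slowClient` is a placeholder) would return the result if it arrives within 2 s and throw `TimeoutException` otherwise, which frees the calling thread even when the downstream never answers.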
When you combine @Retry and @TimeLimiter on the same method, the timeout applies per attempt. With 3 attempts at 2 s each plus exponential back-off, the worst case is far longer than 2 s: three 2 s timeouts plus waits of 500 ms and 1 000 ms between attempts add up to about 7.5 s. Plan your SLAs accordingly and communicate the total worst-case latency to your callers.
Bulkheads — Isolating Failure Domains
A bulkhead is borrowed from naval engineering: a ship is divided into watertight compartments so that flooding one section does not sink the entire vessel. In software a bulkhead limits the number of concurrent calls to a particular downstream so that a slow or failing dependency cannot consume all available threads or connections.
Resilience4j offers two bulkhead flavours:
- Semaphore bulkhead — limits the number of concurrent calls. Lightweight, same thread. Suitable for non-blocking or fast operations.
- Thread-pool bulkhead — offloads calls to a dedicated bounded thread pool. Provides true thread isolation. Better for blocking I/O where you want to prevent thread-pool starvation in the shared web-server pool.
Semaphore Bulkhead
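The semaphore flavour boils down to a counting semaphore around the call. The sketch below is a plain-Java illustration with hypothetical names; it rejects immediately when no permit is free, whereas a real bulkhead can be configured to wait briefly for one.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical class, for illustration only.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call on the caller's thread if a permit is free; otherwise rejects at once. */
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because the call runs on the caller's own thread, this variant limits concurrency but does not protect the caller from a call that blocks for a long time; that is what the thread-pool flavour adds.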
Thread-Pool Bulkhead
The thread-pool bulkhead executes the lambda on its own pool (4–8 threads) with a queue of 50 tasks. If every thread is busy and the queue is full, the call is rejected immediately with BulkheadFullException. This shields your Tomcat/Undertow web-server threads from being blocked by slow reports.
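The same isolation can be sketched with a bounded `ThreadPoolExecutor` from the JDK. This is an illustration with hypothetical names, not the Resilience4j class; here a rejected call surfaces as the JDK's `RejectedExecutionException`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical class, for illustration only.
final class ThreadPoolBulkheadSketch {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkheadSketch(int coreThreads, int maxThreads, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                coreThreads, maxThreads,
                60, TimeUnit.SECONDS,                    // idle time before extra threads are retired
                new ArrayBlockingQueue<>(queueCapacity), // bounded queue: work cannot pile up without limit
                new ThreadPoolExecutor.AbortPolicy());   // reject instead of blocking the caller
    }

    /** Throws RejectedExecutionException when all threads are busy and the queue is full. */
    <T> Future<T> submit(Callable<T> call) {
        return pool.submit(call);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Mirroring the numbers above, `new ThreadPoolBulkheadSketch(4, 8, 50)` would run up to 8 calls in parallel, hold up to 50 more in the queue, and reject anything beyond that, so the web-server threads that submit the work are never blocked by it.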
Combining Patterns: The Correct Decorator Order
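When stacking multiple Resilience4j annotations on one method, the order of evaluation matters. By default the Resilience4j Spring aspects nest in this order (outermost first):
- Retry
- CircuitBreaker
- RateLimiter
- TimeLimiter
- Bulkhead
So Retry wraps everything else: each attempt first checks the circuit, then the rate limiter, then starts the timer, and finally acquires a bulkhead permit before the method runs. The consequence is that the timeout applies to each individual retry attempt, and the circuit breaker records the outcome of every attempt rather than one aggregated result per call. If you need a different nesting, for example a circuit breaker that only sees the final result after all retries, override the aspect order in configuration or compose the decorators programmatically.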
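Decorator nesting can be pictured as ordinary function composition: whichever wrapper is applied last sits on the outside and runs first. The sketch below uses hypothetical names and generic labels rather than real Resilience4j decorators, purely to show how nesting determines the order of entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical class, for illustration only.
final class DecoratorOrder {
    /** Wraps inner so that entering the wrapper is recorded before inner runs. */
    static <T> Supplier<T> wrap(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    /** Builds a three-level nesting and returns the order in which the wrappers were entered. */
    static List<String> demo() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> "result";
        // Built inside-out: the last wrapper applied is the outermost and runs first.
        Supplier<String> decorated =
                wrap("outer", trace, wrap("middle", trace, wrap("inner", trace, call)));
        decorated.get();
        return trace;
    }
}
```

`DecoratorOrder.demo()` records the wrappers in the order outer, middle, inner. Whatever sits outermost sees the combined behaviour of everything inside it, which is why swapping two decorators changes what a timeout or a failure count actually measures.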
Observing Retries, Timeouts and Bulkheads with Actuator
Resilience4j publishes metrics to Micrometer automatically. With Spring Boot Actuator on the classpath you can query the current state of any instance:
These feed directly into Prometheus + Grafana dashboards, letting your SRE team set alerts on retry rate (a leading indicator that a downstream is degraded) before errors start reaching end users.
Summary
Retries recover from transient failures automatically — but must be scoped to idempotent operations and configured with exponential back-off plus jitter to avoid thundering herds. Timeouts guarantee bounded latency per call and prevent thread-pool saturation. Bulkheads partition your concurrency budget so one slow dependency cannot monopolise all available threads. Together these three patterns form the second ring of your resilience defence, sitting below the circuit breaker to handle the failures that happen before a breaker would open. In the next lesson you will add rate limiting to protect your own service from being overwhelmed by callers.