Resilience, Messaging & Observability

Retries, Timeouts & Bulkheads

A circuit breaker is only one weapon in the resilience arsenal. In practice you need at least three more patterns working in concert: retries to recover from transient failures automatically, timeouts to guarantee that a slow downstream never holds a thread hostage indefinitely, and bulkheads to isolate pools of resources so that a surge in one area cannot starve every other area. This lesson walks through all three in depth — the mechanics, the configuration knobs, the failure modes, and the distributed-systems trade-offs a production engineer must understand.

Why Transient Failures Exist

Network packets are dropped. DNS lookups return stale entries for a few seconds after a rolling restart. A database primary undergoes a leader election that lasts 300 ms. These events are transient: if you simply try the same request again a moment later, it succeeds. Without retry logic your service surfaces these blips as hard errors to its callers. With well-configured retries they become invisible.

Transient vs. permanent failures: Retries are appropriate for transient faults (network hiccups, 503 Service Unavailable, connection-pool exhaustion). They are harmful for permanent failures (400 Bad Request, business-logic rejections). Always configure an exception predicate so you only retry on recoverable errors.
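To make the distinction concrete, here is a minimal sketch of such a predicate in plain Java. The class name RetryClassifier and the HttpStatusException type are hypothetical stand-ins for whatever your HTTP client throws. A predicate of this shape could be passed to Resilience4j's RetryConfig.custom().retryOnException(...).

```java
import java.io.IOException;
import java.util.function.Predicate;

/** Hypothetical classifier: decides whether a failure is worth retrying. */
final class RetryClassifier {

    /** Hypothetical exception carrying the HTTP status a downstream returned. */
    static final class HttpStatusException extends RuntimeException {
        final int status;

        HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** True only for faults that may succeed on a later attempt. */
    static final Predicate<Throwable> IS_TRANSIENT = ex -> {
        if (ex instanceof IOException) {
            return true; // network hiccup
        }
        if (ex instanceof HttpStatusException) {
            return ((HttpStatusException) ex).status == 503; // Service Unavailable
        }
        return false; // 400s, business rejections, anything unknown
    };
}
```

Treating unknown exceptions as permanent is a deliberately conservative choice: it avoids retrying failures nobody has classified yet.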

Resilience4j Retry — Setup

Add the Spring Boot starter and the AOP module to your pom.xml:

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

Configure a named retry instance in application.yml:

resilience4j:
  retry:
    instances:
      paymentService:
        max-attempts: 3
        wait-duration: 500ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2
        retry-exceptions:
          - java.io.IOException
          - feign.RetryableException
        ignore-exceptions:
          - com.example.exceptions.BusinessException

This gives you three attempts in total (the initial call plus two retries), waiting 500 ms before the second attempt and 1 000 ms before the third. Only I/O and Feign exceptions trigger a retry; business errors are never retried.
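The wait schedule is easy to tabulate by hand. The sketch below is plain Java rather than Resilience4j code, and the class and method names are made up for illustration. It assumes, as Resilience4j documents for max-attempts, that the attempt count includes the initial call, so N attempts produce N - 1 waits.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the pauses between retry attempts. */
final class BackoffSchedule {

    /** maxAttempts includes the first call, so there are maxAttempts - 1 waits. */
    static List<Duration> waits(int maxAttempts, Duration initialWait, double multiplier) {
        List<Duration> result = new ArrayList<>();
        double millis = initialWait.toMillis();
        for (int retry = 1; retry < maxAttempts; retry++) {
            result.add(Duration.ofMillis((long) millis));
            millis *= multiplier; // each wait is multiplier times the previous one
        }
        return result;
    }
}
```

With the configuration above, waits(3, Duration.ofMillis(500), 2) yields 500 ms and 1 000 ms; a fourth attempt would add a 2 000 ms wait.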

Applying the Retry Annotation

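import io.github.resilience4j.retry.annotation.Retry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

@Service
public class PaymentClient {

    private static final Logger log = LoggerFactory.getLogger(PaymentClient.class);

    // Assumed collaborator: the HTTP client wrapper used below (the type name is hypothetical)
    private final ExternalHttpClient externalHttpClient;

    public PaymentClient(ExternalHttpClient externalHttpClient) {
        this.externalHttpClient = externalHttpClient;
    }

    // Retrying a charge is only safe if the provider deduplicates repeated requests,
    // for example through an idempotency key.
    @Retry(name = "paymentService", fallbackMethod = "fallbackCharge")
    public PaymentResponse charge(ChargeRequest request) {
        // HTTP call to payment provider; may throw IOException
        return externalHttpClient.post("/charge", request);
    }

    // Fallback signature must match the original method plus a Throwable param
    public PaymentResponse fallbackCharge(ChargeRequest request, Throwable ex) {
        log.warn("Payment provider unavailable after retries: {}", ex.getMessage());
        return PaymentResponse.pending(request.getOrderId());
    }
}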

Exponential back-off with jitter: Plain exponential back-off can cause a thundering herd: all retrying clients wake up at the same instant and overwhelm the target again. Set enable-randomized-wait: true together with randomized-wait-factor: 0.5 in the Resilience4j config so that each wait is drawn at random from within ±50 % of the computed back-off interval.
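The effect of jitter can be sketched in a few lines of plain Java. This is an illustration of the idea, not Resilience4j's implementation; the class name and parameters are assumptions chosen to mirror the YAML keys.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: exponential back-off with a symmetric random spread. */
final class JitteredBackoff {

    /**
     * @param retry  1 for the first retry, 2 for the second, and so on
     * @param factor fraction of the base wait to spread over, e.g. 0.5 for plus or minus 50 %
     */
    static long waitMillis(int retry, long initialMillis, double multiplier, double factor) {
        double base = initialMillis * Math.pow(multiplier, retry - 1);
        double delta = base * factor;
        if (delta <= 0) {
            return Math.round(base); // no jitter requested
        }
        // Uniformly random point in [base - delta, base + delta)
        return Math.round(ThreadLocalRandom.current().nextDouble(base - delta, base + delta));
    }
}
```

For the second retry with a 500 ms initial wait, a multiplier of 2 and a factor of 0.5, the result lies between 500 ms and 1 500 ms, so clients that failed together no longer retry together.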

Timeouts — The Mandatory Safety Net

Retries only help if the original call eventually returns. Without a timeout, a single hung downstream can occupy a thread from your web server's pool forever. In a service handling 200 concurrent requests with a 50-thread pool, just 50 hung calls saturate the pool and all subsequent requests queue — then time out for the caller — even though the rest of your service logic is perfectly healthy.

Resilience4j provides a TimeLimiter decorator that you can apply programmatically, but for most Spring Boot 3 services the simplest approach is to declare a named instance in YAML and apply it with @TimeLimiter:

resilience4j:
  timelimiter:
    instances:
      inventoryService:
        timeout-duration: 2s
        cancel-running-future: true

import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;

@Service
public class InventoryClient {

    @TimeLimiter(name = "inventoryService")
    public CompletableFuture<StockLevel> getStock(String sku) {
        return CompletableFuture.supplyAsync(() -> externalClient.fetchStock(sku));
    }
}

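@TimeLimiter requires an asynchronous return type such as CompletableFuture. If the future does not complete within 2 s, the caller receives a TimeoutException. Do not rely on cancel-running-future: true (the default) to stop the underlying work, though: cancelling a CompletableFuture never interrupts the thread that is running the task, so the lambda above keeps running even though the caller has already given up. Give the underlying HTTP client its own connect and read timeouts so that the worker thread is actually released.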
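The following sketch shows the caller-side behaviour with nothing but the JDK. It uses CompletableFuture.orTimeout (Java 9+) instead of Resilience4j, and the class and method names are hypothetical. It illustrates that a timeout releases the caller, while the abandoned task carries on in the background.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {

    /** What the caller saw, and whether the abandoned work finished anyway. */
    static final class Outcome {
        final boolean callerTimedOut;
        final boolean workStillFinished;

        Outcome(boolean callerTimedOut, boolean workStillFinished) {
            this.callerTimedOut = callerTimedOut;
            this.workStillFinished = workStillFinished;
        }
    }

    static Outcome run(long workMillis, long timeoutMillis) {
        CountDownLatch workDone = new CountDownLatch(1);
        CompletableFuture<String> future = CompletableFuture
                .supplyAsync(() -> {
                    sleepQuietly(workMillis); // stands in for a slow downstream call
                    workDone.countDown();
                    return "stock-level";
                })
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);

        boolean timedOut = false;
        try {
            future.join();
        } catch (CompletionException e) {
            timedOut = e.getCause() instanceof TimeoutException;
        }

        boolean finished = false;
        try {
            // Wait generously to observe whether the task ran to completion.
            finished = workDone.await(workMillis * 5, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Outcome(timedOut, finished);
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling TimeoutSketch.run(300, 50) reports a timeout for the caller and still observes the task finishing, which is why a timeout on the client itself remains necessary.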

Timeout vs. back-off alignment: If you combine @Retry and @TimeLimiter on the same method, the timeout applies per attempt. With 3 attempts at 2 s each plus exponential back-off, the maximum elapsed time is far longer than 2 s: up to 3 × 2 s of timeouts plus 0.5 s and 1 s of waiting, roughly 7.5 s in total. Plan your SLAs accordingly and communicate the total worst-case latency to upstream callers.
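The worst case is simple arithmetic: every attempt runs into its timeout, and every pause between attempts is served in full. A small sketch, with hypothetical names and ignoring jitter and scheduling overhead:

```java
import java.time.Duration;

/** Hypothetical helper: upper bound on elapsed time for retries with per-attempt timeouts. */
final class WorstCaseLatency {

    static Duration worstCase(int maxAttempts, Duration perAttemptTimeout,
                              Duration initialWait, double multiplier) {
        // Every attempt is assumed to run into its timeout.
        long totalMillis = perAttemptTimeout.toMillis() * maxAttempts;
        double waitMillis = initialWait.toMillis();
        // There is one wait between each pair of consecutive attempts.
        for (int retry = 1; retry < maxAttempts; retry++) {
            totalMillis += (long) waitMillis;
            waitMillis *= multiplier;
        }
        return Duration.ofMillis(totalMillis);
    }
}
```

For 3 attempts, a 2 s timeout, a 500 ms initial wait and a multiplier of 2, this gives 6 000 + 500 + 1 000 = 7 500 ms.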

Bulkheads — Isolating Failure Domains

A bulkhead is borrowed from naval engineering: a ship is divided into watertight compartments so that flooding one section does not sink the entire vessel. In software a bulkhead limits the number of concurrent calls to a particular downstream so that a slow or failing dependency cannot consume all available threads or connections.

Resilience4j offers two bulkhead flavours:

Semaphore bulkhead — limits the number of concurrent calls. Lightweight, same thread. Suitable for non-blocking or fast operations.
Thread-pool bulkhead — offloads calls to a dedicated bounded thread pool. Provides true thread isolation. Better for blocking I/O where you want to prevent thread-pool starvation in the shared web-server pool.

Semaphore Bulkhead

resilience4j:
  bulkhead:
    instances:
      notificationService:
        max-concurrent-calls: 10
        max-wait-duration: 20ms

import io.github.resilience4j.bulkhead.annotation.Bulkhead;

@Service
public class NotificationClient {

    @Bulkhead(name = "notificationService", type = Bulkhead.Type.SEMAPHORE)
    public void sendPush(PushPayload payload) {
        // at most 10 concurrent calls; 11th waits up to 20 ms then throws BulkheadFullException
        pushGateway.send(payload);
    }
}
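Conceptually, a semaphore bulkhead is little more than a counting semaphore around the call. The sketch below uses only java.util.concurrent and is not how Resilience4j is implemented internally; the class name is hypothetical, and IllegalStateException stands in for BulkheadFullException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class SemaphoreBulkheadSketch {

    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkheadSketch(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true); // fair: longest waiter first
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call on the caller's own thread if a permit frees up in time. */
    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full"); // stand-in for BulkheadFullException
        }
        try {
            return call.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }
}
```

Because the work runs on the calling thread, a slow downstream still occupies that thread; the semaphore only caps how many threads can be occupied at once.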

Thread-Pool Bulkhead

resilience4j:
  thread-pool-bulkhead:
    instances:
      reportingService:
        core-thread-pool-size: 4
        max-thread-pool-size: 8
        queue-capacity: 50
        keep-alive-duration: 20ms

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import java.util.concurrent.CompletableFuture;

@Service
public class ReportingClient {

    @Bulkhead(name = "reportingService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<Report> generateReport(ReportRequest req) {
        // The aspect already runs this method on the bulkhead's pool, so do the work inline;
        // supplyAsync here would hand it to the common ForkJoinPool instead.
        return CompletableFuture.completedFuture(reportEngine.build(req));
    }
}

The thread-pool bulkhead executes the method body on its own pool (4 to 8 threads) with a queue of 50 tasks. If every thread is busy and the queue is full, the call is rejected immediately with BulkheadFullException. This shields your Tomcat/Undertow web-server threads from being blocked by slow reports.
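The same isolation can be sketched with a bounded ThreadPoolExecutor from the JDK. This is an illustration of the mechanism under assumed names, not Resilience4j's implementation; RejectedExecutionException plays the role of BulkheadFullException here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ThreadPoolBulkheadSketch {

    private final ThreadPoolExecutor pool;

    ThreadPoolBulkheadSketch(int coreSize, int maxSize, int queueCapacity) {
        this.pool = new ThreadPoolExecutor(
                coreSize, maxSize,
                20, TimeUnit.MILLISECONDS,               // keep-alive for threads above core size
                new ArrayBlockingQueue<>(queueCapacity), // bounded queue: no unbounded backlog
                runnable -> {
                    Thread t = new Thread(runnable, "bulkhead-worker");
                    t.setDaemon(true);                   // do not keep the JVM alive
                    return t;
                },
                new ThreadPoolExecutor.AbortPolicy());   // reject instead of blocking the caller
    }

    /** Throws RejectedExecutionException when all threads are busy and the queue is full. */
    <T> CompletableFuture<T> submit(Supplier<T> task) {
        return CompletableFuture.supplyAsync(task, pool);
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Note that a ThreadPoolExecutor only grows beyond its core size once the queue is full, so with a queue of 50 the extra threads appear late; size the core pool for your normal load.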

Combining Patterns: The Correct Decorator Order

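When stacking multiple Resilience4j annotations on one method, the order of evaluation matters, and it is determined by the aspect order rather than by the order in which you write the annotations. By default Resilience4j applies the decorators in this precedence (outermost first):

Retry
CircuitBreaker
RateLimiter
TimeLimiter
Bulkhead

So a call first enters the retry, then checks the circuit, then passes the rate limiter, then starts the timer, and finally acquires a bulkhead permit. Because Retry is outermost, every attempt passes through the inner decorators again: the timeout wraps each individual attempt, and the circuit breaker records each failed attempt rather than one aggregated result. If you need a different order, override it with the aspect-order properties (for example resilience4j.retry.retryAspectOrder and resilience4j.circuitbreaker.circuitBreakerAspectOrder) or use the functional decorator style instead of annotations.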

@Bulkhead(name = "paymentService", type = Bulkhead.Type.SEMAPHORE)
@TimeLimiter(name = "paymentService")
@CircuitBreaker(name = "paymentService", fallbackMethod = "fallbackCharge")
@Retry(name = "paymentService")
public CompletableFuture<PaymentResponse> charge(ChargeRequest request) {
    return CompletableFuture.supplyAsync(() -> externalClient.post("/charge", request));
}
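Nesting order is easier to see with plain function composition. The sketch below does not use Resilience4j; it wraps a supplier in tracing decorators following Resilience4j's documented default aspect order, Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) ), and records what a call that fails once passes through. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class DecoratorOrderSketch {

    /** Wraps inner so that each invocation first records the decorator's name. */
    static <T> Supplier<T> traced(String name, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add(name);
            return inner.get();
        };
    }

    /** Outermost wrapper: invokes inner up to maxAttempts times (assumes maxAttempts >= 1). */
    static <T> Supplier<T> retrying(int maxAttempts, List<String> trace, Supplier<T> inner) {
        return () -> {
            trace.add("Retry");
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last;
        };
    }

    static List<String> traceCallThatFailsOnce() {
        List<String> trace = new ArrayList<>();
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> call = () -> {
            trace.add("Function");
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("transient failure");
            }
            return "charged";
        };
        // Wrap from the inside out: the last wrapper applied is the outermost.
        for (String name : List.of("Bulkhead", "TimeLimiter", "RateLimiter", "CircuitBreaker")) {
            call = traced(name, trace, call);
        }
        retrying(2, trace, call).get();
        return trace;
    }
}
```

The trace shows Retry once, followed by two full passes through CircuitBreaker, RateLimiter, TimeLimiter, Bulkhead and Function, which is why timeouts and breaker statistics apply per attempt.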

Observing Retries, Timeouts and Bulkheads with Actuator

Resilience4j publishes metrics to Micrometer automatically. With Spring Boot Actuator on the classpath and the metrics endpoint exposed (management.endpoints.web.exposure.include: metrics), you can query the current state of any instance:

# Retry metrics
GET /actuator/metrics/resilience4j.retry.calls

# Bulkhead available concurrent calls
GET /actuator/metrics/resilience4j.bulkhead.available.concurrent.calls

# TimeLimiter timeout calls
GET /actuator/metrics/resilience4j.timelimiter.calls

These feed directly into Prometheus + Grafana dashboards, letting your SRE team set alerts on retry rate (a leading indicator that a downstream is degraded) before errors start reaching end users.

Summary

Retries recover from transient failures automatically — but must be scoped to idempotent operations and configured with exponential back-off plus jitter to avoid thundering herds. Timeouts guarantee bounded latency per call and prevent thread-pool saturation. Bulkheads partition your concurrency budget so one slow dependency cannot monopolise all available threads. Together these three patterns form the second ring of your resilience defence, sitting below the circuit breaker to handle the failures that happen before a breaker would open. In the next lesson you will add rate limiting to protect your own service from being overwhelmed by callers.