Building Microservices with Spring Boot

Handling Failures Between Services

18 min Lesson 5 of 12

In a distributed system, failure is not an exception — it is a design constraint. Any network call between two services can be slow, time out, or return an error. The question is not whether your system will experience partial failures, but how gracefully it handles them. This lesson covers three essential techniques every Spring Boot microservice needs: configuring timeouts, handling errors programmatically, and applying graceful degradation so that a failure in one service does not cascade through the entire system.

Why Distributed Failures Are Different

In a monolith, an in-process method call either returns or throws. A remote HTTP call introduces a third outcome: it can hang indefinitely. Without an explicit timeout, a thread waiting on a slow downstream service holds its resources (a thread from the servlet pool, plus a HikariCP database connection if the call runs inside an open transaction) until the server runs out of capacity. A single slow dependency can bring down an otherwise healthy service.

The two failure modes to defend against: (1) latency — the call eventually completes but too slowly; (2) error — the call returns a 5xx status or throws an exception. Timeouts address latency; error handling addresses errors. Both are required.
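
To make the thread-starvation scenario concrete, the following plain-JDK sketch stands in for a servlet pool with a small fixed thread pool. The class and method names are illustrative, and a CountDownLatch simulates a downstream service that has not answered yet; this is a model of the effect, not how Tomcat or Spring is wired internally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PoolStarvationDemo {

    /**
     * Occupies every worker with a call that waits on a silent "downstream",
     * then checks whether an unrelated fast task can still run.
     */
    static boolean fastTaskStarved(int workers, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch downstreamResponds = new CountDownLatch(1);
        try {
            for (int i = 0; i < workers; i++) {
                // No timeout: each worker blocks until the downstream answers.
                pool.submit(() -> {
                    downstreamResponds.await();
                    return "slow";
                });
            }
            Future<String> fast = pool.submit(() -> "fast");
            try {
                fast.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;            // a worker was free
            } catch (TimeoutException starved) {
                return true;             // queued behind the hung calls
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            downstreamResponds.countDown(); // let the hung calls finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast task starved: " + fastTaskStarved(2, 200));
    }
}
```

The fast task does no remote work at all, yet it cannot start because every worker is parked on the slow dependency. A timeout on the slow calls is what frees those workers.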

Configuring Timeouts with WebClient

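WebClient, the reactive HTTP client from Spring WebFlux, runs on Reactor Netty by default. There you can set timeouts at two levels: the TCP connection level (how long to wait for the TCP handshake) and the channel level (how long an established connection may go without receiving data, and how long a write may take to complete).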

import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

@Configuration
public class WebClientConfig {

    @Bean
    public WebClient inventoryClient() {
        HttpClient httpClient = HttpClient.create()
            // TCP-level: abort if the connection cannot be established within 2 s
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 2_000)
            // Read/write timeouts on the channel
            .doOnConnected(conn -> conn
                .addHandlerLast(new ReadTimeoutHandler(5, TimeUnit.SECONDS))
                .addHandlerLast(new WriteTimeoutHandler(5, TimeUnit.SECONDS)));

        return WebClient.builder()
            .baseUrl("http://inventory-service")
            .clientConnector(new ReactorClientHttpConnector(httpClient))
            .build();
    }
}

For the blocking RestClient (Spring Boot 3.2+), the equivalent is to configure the underlying request factory. SimpleClientHttpRequestFactory exposes setConnectTimeout and setReadTimeout; the Apache HttpClient and JDK HttpClient request factories offer comparable settings.

Always set BOTH connection and read timeouts. A connection timeout catches a dead host; a read timeout catches a host that accepted the connection but then stalled. Either alone leaves a gap.
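
The two gaps can be reproduced without any framework. The sketch below swaps in the JDK's own java.net.http.HttpClient (not WebClient or RestClient) so that it runs without dependencies; BothTimeoutsDemo and its method names are hypothetical. The local ServerSocket completes the TCP handshake but never sends a response, which is exactly the "accepted, then stalled" case.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

final class BothTimeoutsDemo {

    // Connection-level bound: how long establishing the connection may take.
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder().connectTimeout(connectTimeout).build();
    }

    // Response-level bound: how long to wait for a response to this request.
    static HttpRequest newRequest(URI uri, Duration responseTimeout) {
        return HttpRequest.newBuilder(uri).timeout(responseTimeout).GET().build();
    }

    // Calls a local socket that accepts connections but never answers.
    static String callStalledHost() throws IOException, InterruptedException {
        try (ServerSocket stalled = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + stalled.getLocalPort() + "/inventory/42");
            HttpClient client = newClient(Duration.ofSeconds(2));
            try {
                client.send(newRequest(uri, Duration.ofMillis(300)),
                        HttpResponse.BodyHandlers.ofString());
                return "completed";
            } catch (HttpTimeoutException e) {
                // The connect succeeded, so only the request timeout ends the wait.
                return "timed out";
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callStalledHost());
    }
}
```

Because the connection is established normally, the connect timeout never fires here. Remove the request timeout and the call has nothing left to bound it.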

Handling HTTP Errors Programmatically

When a downstream service returns a 4xx or 5xx status, retrieve() signals a generic WebClientResponseException by default. That exception carries only raw HTTP details, so to translate specific statuses into your own exception types you declare explicit handlers with onStatus:

import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class InventoryServiceClient {

    private final WebClient client;

    public InventoryServiceClient(WebClient inventoryClient) {
        this.client = inventoryClient;
    }

    public Mono<InventoryResponse> getStock(String productId) {
        return client.get()
            .uri("/inventory/{id}", productId)
            .retrieve()
            // treat 404 as a domain-level "not found" signal
            .onStatus(status -> status.value() == 404,
                resp -> Mono.error(new ProductNotFoundException(productId)))
            // treat any 5xx as a transient infrastructure failure
            .onStatus(status -> status.is5xxServerError(),
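                resp -> resp.bodyToMono(String.class)
                    // an empty body would otherwise produce an empty Mono, which suppresses the error
                    .defaultIfEmpty("")
                    .flatMap(body -> Mono.error(
                        new DownstreamServiceException("inventory-service returned 5xx: " + body))))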
            .bodyToMono(InventoryResponse.class);
    }
}

Mapping HTTP status codes to typed exceptions is the key pattern: it keeps error-handling logic close to the network boundary and lets the rest of your business code react to meaningful domain signals rather than raw HTTP codes.
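
The exception types named in the client above are not shown in this lesson. As a minimal sketch, assuming both are simple unchecked exceptions, the status-to-exception mapping can be isolated in a pure function that is easy to unit test. StatusMapping and toException are hypothetical names; in the real client this logic lives inside the onStatus handlers.

```java
import java.util.Optional;

final class StatusMapping {

    // Assumed shape: the lesson constructs it from the product ID.
    static final class ProductNotFoundException extends RuntimeException {
        ProductNotFoundException(String productId) {
            super("Product not found: " + productId);
        }
    }

    // Assumed shape: the lesson constructs it from a message string.
    static final class DownstreamServiceException extends RuntimeException {
        DownstreamServiceException(String message) {
            super(message);
        }
    }

    /**
     * Maps a status code to a typed exception.
     * An empty result means this client defines no specific mapping for the status.
     */
    static Optional<RuntimeException> toException(int status, String productId, String body) {
        if (status == 404) {
            return Optional.of(new ProductNotFoundException(productId));
        }
        if (status >= 500 && status <= 599) {
            return Optional.of(new DownstreamServiceException(
                    "inventory-service returned " + status + ": " + body));
        }
        return Optional.empty();
    }
}
```

Callers can then branch on the exception type: a ProductNotFoundException is a final answer, while a DownstreamServiceException is a candidate for a fallback.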

Graceful Degradation — the Fallback Pattern

Graceful degradation means that when a dependency fails, the system returns a useful partial result rather than propagating the error to the user. The onErrorReturn and onErrorResume operators on a Mono or Flux are the simplest way to express this:

public Mono<ProductDetailDto> getProductDetail(String productId) {
    Mono<ProductDto> product = productRepository.findById(productId);

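    // Call inventory service; fall back to "unknown" on infrastructure failures,
    // but let the domain-level "not found" signal propagate
    Mono<InventoryResponse> stock = inventoryClient.getStock(productId)
        .onErrorReturn(ex -> !(ex instanceof ProductNotFoundException),
            new InventoryResponse(productId, -1, "UNKNOWN"));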

    return Mono.zip(product, stock)
        .map(tuple -> new ProductDetailDto(tuple.getT1(), tuple.getT2()));
}

The user receives the product information with a "stock unknown" indicator instead of a 500 error. This is a deliberate design choice: the value of showing partial information must outweigh the risk of showing stale or absent data. Document these trade-offs explicitly in your code.
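
One way to keep that trade-off explicit is to make the fallback recognisable in the type itself, so no consumer can mistake it for real data. The sketch below assumes the field names of InventoryResponse (the lesson only shows its three-argument constructor) and uses hypothetical helper names.

```java
final class StockIndicator {

    // Field names are assumed; the lesson only shows the constructor arguments.
    record InventoryResponse(String productId, int quantity, String status) {

        // The lesson's fallbacks use quantity -1 with status UNKNOWN or UNAVAILABLE.
        boolean isFallback() {
            return quantity < 0;
        }
    }

    // Text for the product page: a fallback never reads as "Out of stock".
    static String stockLabel(InventoryResponse stock) {
        if (stock.isFallback()) {
            return "Stock unknown";
        }
        return stock.quantity() > 0 ? "In stock" : "Out of stock";
    }

    // A fallback is informational only and never confirms availability.
    static boolean canReserve(InventoryResponse stock, int requested) {
        return !stock.isFallback() && stock.quantity() >= requested;
    }
}
```

Displaying a product with unknown stock is usually harmless. Reserving against unknown stock is not, which is why the two decisions are separate methods.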

Timeouts on Individual Calls

Beyond the connector-level timeout, you can apply a per-request timeout directly on the reactive pipeline using the timeout operator. This is especially useful when the downstream SLA is known:

import java.time.Duration;

public Mono<InventoryResponse> getStockWithTimeout(String productId) {
    return inventoryClient.getStock(productId)
        // fail fast if inventory does not respond in 3 seconds
        .timeout(Duration.ofSeconds(3))
        // then fall back
        .onErrorResume(ex -> {
            log.warn("Inventory call timed out or failed for {}: {}", productId, ex.getMessage());
            return Mono.just(new InventoryResponse(productId, -1, "UNAVAILABLE"));
        });
}

Do not swallow errors silently. Always log the exception before returning a fallback value. In production, silent fallbacks mask real problems and make debugging nearly impossible. Structured logging with the product ID and service name is the minimum — ideally include a correlation ID (covered in lesson 8).

Bulkhead: Isolating Thread Pools

Even with timeouts in place, a burst of slow calls can exhaust the shared thread pool and starve other features. A bulkhead dedicates a limited resource (thread pool or semaphore) to each downstream dependency, so one slow service cannot consume all available capacity.

With Resilience4j (the standard Spring Cloud Circuit Breaker implementation), a thread-pool bulkhead is a few lines of configuration:

# application.yml
resilience4j:
  thread-pool-bulkhead:
    instances:
      inventory-service:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20

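import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

import java.util.concurrent.CompletableFuture;

@Service
public class InventoryServiceClient {

    private static final Logger log = LoggerFactory.getLogger(InventoryServiceClient.class);

    // THREADPOOL selects the thread-pool-bulkhead configuration; the annotation's
    // default type is SEMAPHORE. Thread-pool bulkheads require an asynchronous return type.
    @Bulkhead(name = "inventory-service", type = Bulkhead.Type.THREADPOOL,
            fallbackMethod = "stockFallback")
    public CompletableFuture<InventoryResponse> getStock(String productId) {
        // Runs on the bulkhead's thread pool, so the blocking call is isolated there.
        // fetchStockBlocking is a placeholder for the blocking call to inventory service.
        return CompletableFuture.completedFuture(fetchStockBlocking(productId));
    }

    public CompletableFuture<InventoryResponse> stockFallback(String productId, Throwable t) {
        log.warn("Bulkhead triggered for inventory-service: {}", t.getMessage());
        return CompletableFuture.completedFuture(
            new InventoryResponse(productId, -1, "UNAVAILABLE"));
    }
}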
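
The semaphore variant mentioned at the start of this section can be sketched with the JDK alone. This illustrates the idea only and is not Resilience4j's implementation; SemaphoreBulkhead is a hypothetical class.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreBulkhead {

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /**
     * Runs the action if a permit is free; otherwise returns the fallback
     * immediately instead of queueing behind calls that may be hung.
     */
    <T> T call(Supplier<T> action, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback;
        }
        try {
            return action.get();
        } finally {
            permits.release();   // always hand the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Unlike the thread-pool variant, the call runs on the caller's thread, so this limits concurrency without isolating threads. It is lighter, but a hung call still occupies the calling thread until a timeout releases it.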

Security Implications

Fallback values have a security dimension that is easy to overlook. If the fallback for an unavailable inventory service is a default "in stock" response, and your order service trusts that response, an attacker who can force the inventory service offline could trigger orders for out-of-stock items. Consider:

Whether the fallback value is safe to act upon, or whether it should only trigger a user-visible warning.
Whether fallback paths propagate the HTTP security context (Authorization header, correlation ID). A fallback that skips auth token forwarding may allow privilege escalation between services.
Whether retries are rate-limited. Without limits, automatic retries under failure can amplify load on an already struggling service (a retry storm).
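
On the last point, a retry policy is only safe if it is bounded. The following is a minimal sketch, assuming a blocking call that fails with unchecked exceptions; BoundedRetry and its parameters are illustrative names. Production code would normally add random jitter and rely on a maintained library rather than a hand-written loop.

```java
import java.util.function.Supplier;

final class BoundedRetry {

    /** Delay before the next try after failed attempt number `attempt` (1-based). */
    static long backoffMillis(int attempt, long baseMillis, long maxDelayMillis) {
        // Doubles per attempt; the shift is capped so large attempt counts cannot overflow
        // for typical base delays.
        long delay = baseMillis << Math.min(attempt - 1, 20);
        return Math.min(delay, maxDelayMillis);
    }

    /** Tries the call at most maxAttempts times, then rethrows the last failure. */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts,
                               long baseMillis, long maxDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;   // budget exhausted: stop adding load
                }
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxDelayMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
        throw last;
    }
}
```

The hard cap on attempts limits how much extra traffic one failing request can generate, and the growing delay spreads that traffic out instead of concentrating it while the downstream service is weakest.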

Summary

Robust inter-service communication requires three layers of defense: timeouts at both the TCP and response levels to prevent thread starvation, error handlers that convert HTTP status codes into typed exceptions, and graceful degradation via fallback values that let the system deliver useful partial results. Pair these with structured logging and, where needed, bulkhead isolation. In the next lesson you will apply these patterns while designing how each service owns and manages its own data store.