Handling Failures Between Services
In a distributed system, failure is not an exception — it is a design constraint. Any network call between two services can be slow, time out, or return an error. The question is not whether your system will experience partial failures, but how gracefully it handles them. This lesson covers three essential techniques every Spring Boot microservice needs: configuring timeouts, handling errors programmatically, and applying graceful degradation so that a failure in one service does not cascade through the entire system.
Why Distributed Failures Are Different
In a monolith, a method call either returns or throws an exception immediately. A remote HTTP call introduces a third outcome: it can hang indefinitely. Without an explicit timeout, a thread waiting for a slow downstream service holds its resources — a connection from HikariCP, a thread from the servlet pool — until the server runs out of capacity. A single slow dependency can bring down an otherwise healthy service.
Configuring Timeouts with WebClient
Spring Boot's reactive WebClient lets you set timeouts at two levels: the TCP connection level (how long to wait for the TCP handshake) and the response level (how long to wait for the full response).
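The two levels can be sketched as follows, assuming the default Reactor Netty connector is on the classpath. The class name InventoryWebClientFactory and the 2 s / 3 s values are hypothetical and should be tuned to the real latency profile of the downstream service.

```java
import io.netty.channel.ChannelOption;
import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

// Hypothetical factory for a WebClient that talks to an inventory service.
final class InventoryWebClientFactory {

    // Illustrative values only, not recommendations.
    private static final int CONNECT_TIMEOUT_MILLIS = 2_000;
    private static final Duration RESPONSE_TIMEOUT = Duration.ofSeconds(3);

    static WebClient create(String baseUrl) {
        HttpClient httpClient = HttpClient.create()
                // TCP level: fail fast if the connection cannot be established in time.
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, CONNECT_TIMEOUT_MILLIS)
                // Response level: fail if the server stays silent for too long.
                .responseTimeout(RESPONSE_TIMEOUT);

        return WebClient.builder()
                .baseUrl(baseUrl)
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```

Every request made through the returned client inherits both limits, so individual call sites cannot forget them.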
For the blocking RestClient (Spring Boot 3.2+), the equivalent is a ClientHttpRequestFactory with explicit limits: SimpleClientHttpRequestFactory exposes setConnectTimeout and setReadTimeout, and an Apache HttpClient-based factory can be configured with equivalent connection and read timeouts.
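A minimal sketch of the blocking variant, using SimpleClientHttpRequestFactory. The class name InventoryRestClientFactory and the timeout values are assumptions made for illustration.

```java
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestClient;

// Hypothetical factory for a blocking RestClient with explicit timeouts.
final class InventoryRestClientFactory {

    static RestClient create(String baseUrl) {
        SimpleClientHttpRequestFactory requestFactory = new SimpleClientHttpRequestFactory();
        // Both values are in milliseconds and purely illustrative.
        requestFactory.setConnectTimeout(2_000); // limit for establishing the connection
        requestFactory.setReadTimeout(3_000);    // limit for waiting on response data

        return RestClient.builder()
                .baseUrl(baseUrl)
                .requestFactory(requestFactory)
                .build();
    }
}
```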
Handling HTTP Errors Programmatically
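When a downstream service returns a 4xx or 5xx status, WebClient's retrieve() signals a generic WebClientResponseException by default; only the lower-level exchangeToMono and exchangeToFlux methods deliver the response as-is. To turn specific statuses into meaningful errors, declare explicit error handling with onStatus: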
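One way this might look is sketched below. StockLevel, ProductNotFoundException, InventoryUnavailableException, InventoryClient and the /inventory/{id} path are all hypothetical names introduced for the example, not part of any library.

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Hypothetical domain types, introduced only for this sketch.
record StockLevel(String productId, int quantity) {}

class ProductNotFoundException extends RuntimeException {
    ProductNotFoundException(String productId) {
        super("Unknown product: " + productId);
    }
}

class InventoryUnavailableException extends RuntimeException {
    InventoryUnavailableException(int status) {
        super("Inventory service failed with HTTP " + status);
    }
}

final class InventoryClient {

    private final WebClient webClient;

    InventoryClient(WebClient webClient) {
        this.webClient = webClient;
    }

    Mono<StockLevel> getStock(String productId) {
        return webClient.get()
                .uri("/inventory/{id}", productId) // assumed endpoint
                .retrieve()
                // 404 becomes a domain-level "not found" signal.
                .onStatus(status -> status.value() == 404,
                        response -> Mono.error(new ProductNotFoundException(productId)))
                // Any 5xx becomes "dependency unavailable".
                .onStatus(status -> status.is5xxServerError(),
                        response -> Mono.error(
                                new InventoryUnavailableException(response.statusCode().value())))
                .bodyToMono(StockLevel.class);
    }
}
```

Statuses not matched by either predicate still fall back to the default WebClientResponseException.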
Mapping HTTP status codes to typed exceptions is the key pattern: it keeps error-handling logic close to the network boundary and lets the rest of your business code react to meaningful domain signals rather than raw HTTP codes.
Graceful Degradation — the Fallback Pattern
Graceful degradation means that when a dependency fails, the system returns a useful partial result rather than propagating the error to the user. The onErrorReturn and onErrorResume operators on a Mono or Flux are the simplest way to express this:
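A minimal sketch of the fallback, assuming a hypothetical ProductView model in which a null stock level means "stock unknown". The inventory lookup is passed in as a function so the example does not depend on a particular client.

```java
import java.util.function.Function;
import reactor.core.publisher.Mono;

// Hypothetical view model: stockLevel is null when the inventory lookup failed.
record ProductView(String productId, String name, Integer stockLevel) {
    boolean stockKnown() {
        return stockLevel != null;
    }
}

final class ProductViewService {

    // Assumed abstraction over the remote inventory call.
    private final Function<String, Mono<Integer>> stockLookup;

    ProductViewService(Function<String, Mono<Integer>> stockLookup) {
        this.stockLookup = stockLookup;
    }

    Mono<ProductView> view(String productId, String name) {
        return stockLookup.apply(productId)
                .map(stock -> new ProductView(productId, name, stock))
                // Trade-off: showing the product without stock data is judged
                // better than failing the whole page when inventory is down.
                .onErrorReturn(new ProductView(productId, name, null));
    }
}
```

onErrorResume would be the better fit when the fallback has to be computed, for example by reading a cached value or logging the error first.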
The user receives the product information with a "stock unknown" indicator instead of a 500 error. This is a deliberate design choice: the value of showing partial information must outweigh the risk of showing stale or absent data. Document these trade-offs explicitly in your code.
Timeouts on Individual Calls
Beyond the connector-level timeout, you can apply a per-request timeout directly on the reactive pipeline using the timeout operator. This is especially useful when the downstream SLA is known:
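The idea can be sketched with Reactor alone. PerCallTimeout and the STOCK_UNKNOWN sentinel are hypothetical, and the deadline would normally come from the downstream SLA.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import reactor.core.publisher.Mono;

final class PerCallTimeout {

    // Hypothetical sentinel meaning "no answer within the deadline".
    static final int STOCK_UNKNOWN = -1;

    static Mono<Integer> stockWithDeadline(Mono<Integer> stockCall, Duration deadline) {
        return stockCall
                // Signals a TimeoutException if no item arrives within the deadline.
                .timeout(deadline)
                // Only timeouts are converted; other errors still propagate.
                .onErrorReturn(TimeoutException.class, STOCK_UNKNOWN);
    }
}
```

Restricting the fallback to TimeoutException keeps genuine failures, such as a mapped 404, visible to the caller.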
Bulkhead: Isolating Thread Pools
Even with timeouts in place, a burst of slow calls can exhaust the shared thread pool and starve other features. A bulkhead dedicates a limited resource (thread pool or semaphore) to each downstream dependency, so one slow service cannot consume all available capacity.
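To make the mechanism concrete, here is a hand-rolled semaphore bulkhead using only the JDK. It is an illustration of the concept rather than a replacement for a library, and the class name SemaphoreBulkhead is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps the number of concurrent calls to one downstream dependency.
final class SemaphoreBulkhead {

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> remoteCall, Supplier<T> whenFull) {
        // Reject immediately instead of queueing: waiting callers would tie up
        // the very threads the bulkhead is meant to protect.
        if (!permits.tryAcquire()) {
            return whenFull.get();
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release();
        }
    }
}
```

Creating one instance per dependency means a slow inventory service can occupy at most its own permits, leaving capacity for calls to other services.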
With Resilience4j (the standard Spring Cloud Circuit Breaker implementation), a thread-pool bulkhead is a few lines of configuration:
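A sketch of the programmatic form, assuming the resilience4j-bulkhead module is on the classpath. The instance name "inventory" and the pool sizes are illustrative assumptions, and InventoryBulkhead is a hypothetical helper.

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

final class InventoryBulkhead {

    static ThreadPoolBulkhead create() {
        ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
                .coreThreadPoolSize(4)  // threads kept ready for inventory calls
                .maxThreadPoolSize(8)   // hard ceiling for this dependency
                .queueCapacity(16)      // calls allowed to wait before rejection
                .build();
        return ThreadPoolBulkhead.of("inventory", config);
    }

    // Runs the call on the bulkhead's dedicated pool rather than the caller's thread.
    static <T> CompletionStage<T> submit(ThreadPoolBulkhead bulkhead, Supplier<T> call) {
        return bulkhead.executeSupplier(call);
    }
}
```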
Security Implications
Fallback values have a security dimension that is easy to overlook. If your inventory service returns a default "in stock" response when it is unavailable, and your order service trusts that response, an attacker who can force the inventory service offline could trigger orders for out-of-stock items. Consider:
- Whether the fallback value is safe to act upon, or whether it should only trigger a user-visible warning.
- Propagating the HTTP security context (Authorization header, correlation ID) through fallback paths: a fallback that skips auth token forwarding may allow privilege escalation between services.
- Rate-limiting retries: without limits, automatic retries under failure can amplify load on an already struggling service (retry storms).
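The retry concern from the list above can be sketched as a bounded retry with exponential backoff, using only the JDK. BoundedRetry is a hypothetical helper; production code would typically add jitter and retry only on errors known to be transient.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class BoundedRetry {

    static <T> T call(Supplier<T> remoteCall, int maxAttempts, Duration initialDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Duration delay = initialDelay;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return remoteCall.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // budget exhausted: stop adding load
                }
                sleep(delay);
                delay = delay.multipliedBy(2); // back off: 1x, 2x, 4x, ...
            }
        }
        throw last;
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

The hard cap on attempts is what prevents a retry storm: once the budget is spent, the failure surfaces to the caller instead of generating further traffic.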
Summary
Robust inter-service communication requires three layers of defense: timeouts at both the TCP and response levels to prevent thread starvation, error handlers that convert HTTP status codes into typed exceptions, and graceful degradation via fallback values that let the system deliver useful partial results. Pair these with structured logging and, where needed, bulkhead isolation. In the next lesson you will apply these patterns while designing how each service owns and manages its own data store.