Scaling & Load Balancing

Health Checks & Failover

Adding more servers solves the capacity problem, but it immediately raises a harder question: how does your load balancer know which servers are actually ready to serve traffic? A crashed process, an out-of-memory app, a database connection pool that ran dry, or a deadlocked thread — all of these leave the server "up" at the network layer while it is completely unable to serve requests. Without a systematic way to detect and route around such failures, your load balancer will keep sending traffic into a black hole, and your users will see errors.

Health checks and failover are the mechanisms that turn a collection of independent servers into a self-healing cluster. They are the operational backbone of any high-availability system.

What a Health Check Actually Tests

There are three progressively deeper levels of health check, and choosing the right level is a deliberate design decision:

  • L1 — TCP / Ping check: Can we open a TCP connection to port 8080? This tells you the OS is alive and the process is listening. It says nothing about whether the app can actually process requests. False-negative rate: very low. False-positive rate: high.
  • L2 — HTTP status check: Does GET /health return HTTP 200? This tells you the web server is responding. It still says nothing about whether downstream dependencies (database, cache, message queue) are reachable. False-positive rate: moderate.
  • L3 — Deep / dependency check: Does GET /health/ready return 200 AND verify that the DB connection is live, the cache is reachable, and the queue is draining? This is the most accurate signal of true readiness, but it adds latency to the check itself and can cause a cascade of health-check traffic under stress.
Liveness vs. Readiness (Kubernetes terminology, widely useful): A liveness probe answers "Is this process fundamentally alive? Should we restart it?" A readiness probe answers "Is this instance ready to receive traffic right now?" These are different questions. A server can be live (should not be killed) but not ready (should be removed from the load balancer pool because it is still warming up its cache).
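The liveness/readiness split can be illustrated with two small handler functions. This is a minimal sketch rather than any framework's API: the names `liveness` and `readiness`, the `checks` dictionary, and the (status, body) return shape are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Liveness: if this handler runs at all, the process is alive."""
    return 200, "alive"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Readiness: 200 only if every dependency check passes, else 503.

    `checks` maps a hypothetical dependency name (e.g. "db") to a
    zero-argument function that returns True when it is reachable.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(sorted(failed))
    return 200, "ready"
```

A server whose cache is still warming up would answer 200 on the liveness handler and 503 on the readiness handler, so it is left running but kept out of the pool.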

Health Check Configuration Parameters

Every load balancer exposes at least these four knobs, and tuning them correctly is the difference between a system that recovers in 5 seconds and one that takes 3 minutes:

  • Interval — How often the probe fires. Typical: 5–15 seconds. Shorter = faster detection, more check traffic.
  • Timeout — How long to wait for a response before counting it as a failure. Typical: 2–5 seconds. Must be less than interval.
  • Unhealthy threshold — How many consecutive failures before the instance is removed from rotation. Typical: 2–3. A threshold of 1 causes flapping on transient network blips.
  • Healthy threshold — How many consecutive successes before a recovered instance is re-added to rotation. Typical: 2–3. This prevents re-adding an instance that happens to pass one check during a degraded state.
Real-world tuning example: the AWS Application Load Balancer defaults are interval 30 s, timeout 5 s, unhealthy threshold 2, healthy threshold 5. With these defaults, detection takes about 60 seconds (2 failed probes × 30 s) and recovery takes 150 seconds (5 passing probes × 30 s). For a latency-sensitive API, tighten to interval 10 s / timeout 3 s / thresholds 2/2 to cut the detection window to roughly 20 seconds, or about 23 seconds in the worst case once the final probe timeout is included.
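The two threshold parameters amount to a small counter per backend. The sketch below shows one possible way to implement that counting; the class name `HealthTracker` and its method are hypothetical and do not correspond to any particular load balancer's internals.

```python
class HealthTracker:
    """Tracks one backend's pool membership from a stream of probe results."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_pool = True
        # Consecutive probe results that disagree with the current state.
        self._streak = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return whether the backend is in the pool."""
        if probe_ok == self.in_pool:
            # The result agrees with the current state: the streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            needed = (self.unhealthy_threshold if self.in_pool
                      else self.healthy_threshold)
            if self._streak >= needed:
                self.in_pool = not self.in_pool
                self._streak = 0
        return self.in_pool
```

With both thresholds at 2, a single failed probe followed by a success leaves the backend in the pool, which is the anti-flapping behaviour the thresholds exist to provide.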
[Figure: Health check cycle, detection and removal from pool. The load balancer probes each server every 10 s. Server A returns HTTP 200 and stays in the pool; Server B has timed out twice and is failing; Server C has been removed from the pool and receives no traffic. Detection timeline with interval = 10 s, timeout = 3 s, unhealthy threshold = 2: first failure at t = 0, second failure at t = 10 s, removal from the pool at t = 13 s when the second probe times out, for a worst case of about 23 s to detect. Recovery: 2 consecutive 200s are required, so the server is re-added after about 20 s of clean responses.]
Health probe cycle: the load balancer probes all servers on a fixed interval; servers that exceed the failure threshold are removed from the pool and traffic is redistributed.
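The timing figures quoted for interval 10 s, timeout 3 s and thresholds of 2 can be reproduced with a little arithmetic. The model below is a simplifying assumption: the failure happens just after a passing probe, every failing probe runs to its full timeout, and probes start exactly one interval apart. The function names are illustrative.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float,
                           unhealthy_threshold: int) -> float:
    """Seconds from the failure to removal from the pool, in the worst case.

    Up to one full interval passes before the first failing probe starts,
    each further failing probe starts one interval later, and the last one
    is only counted as failed once its timeout expires.
    """
    return unhealthy_threshold * interval_s + timeout_s


def recovery_s(interval_s: float, healthy_threshold: int) -> float:
    """Seconds of clean responses needed before the instance is re-added."""
    return healthy_threshold * interval_s
```

For the tuned settings, `worst_case_detection_s(10, 3, 2)` gives 23 and `recovery_s(10, 2)` gives 20. Under the same model the AWS defaults give 65 seconds to detect, slightly above the rounded 60-second figure because the final probe's timeout is counted.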

Passive vs. Active Health Checks

There are two complementary strategies for detecting failures, and production systems use both:

Active checks (proactive): the load balancer initiates a dedicated probe request to each backend on a schedule. The advantage is that you detect failures even when there is no real traffic. The disadvantage is that the health endpoint must be kept lightweight — a heavy dependency check run every 5 seconds on 50 servers generates significant overhead.

Passive checks (reactive): the load balancer monitors the real request responses as they flow through. If a server returns 5xx errors or times out on N consecutive requests, it is marked unhealthy. The advantage is zero extra traffic. The disadvantage is that passive checks only trigger when real users are being served errors — you cannot detect a failure before users hit it.

Best practice: Use active checks as the primary safety net (fast detection before users are affected), and passive checks as a backstop that can trigger instant removal the moment real traffic starts failing, even between scheduled probe cycles.
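A passive check can be sketched as a counter fed by real responses. The class below is illustrative only: the name `PassiveMonitor`, the choice to treat 4xx responses as healthy (they indicate a client error, not a backend fault), and the decision to leave recovery to active probes are all assumptions.

```python
from typing import Optional


class PassiveMonitor:
    """Marks a backend unhealthy after N consecutive failed real requests."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.healthy = True

    def observe(self, status: Optional[int]) -> bool:
        """Feed one real response: an HTTP status, or None for a timeout."""
        failed = status is None or 500 <= status <= 599
        if failed:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                self.healthy = False
        else:
            # Any non-5xx response resets the streak. Assumption: a backend
            # already marked unhealthy is only re-added by active probes.
            self.consecutive_failures = 0
        return self.healthy
```

Because the monitor reacts on every real request, it can eject a backend between two scheduled probes, which is the backstop role described above.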

Graceful Shutdown & Draining

Removing a server from the pool during a rolling deploy or a scale-in event is a failure scenario in disguise. If the instance is terminated the instant it receives the shutdown signal, in-flight requests, which might take anywhere from 0.5 to 30 seconds, are cut off mid-response. Users see errors that have nothing to do with bugs in your code.

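The solution is connection draining (AWS calls it "deregistration delay"; NGINX Plus provides it through the drain parameter on an upstream server, whereas proxy_next_upstream is a different feature that retries a failed request on another backend). The sequence is: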

  1. The instance signals it is shutting down (SIGTERM, or the health endpoint returns 503).
  2. The load balancer stops routing new requests to that instance.
  3. The load balancer waits up to a configurable drain timeout (typically 30–90 seconds) for in-flight requests to complete.
  4. After the timeout, any remaining connections are forcibly closed and the instance is deregistered.
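On the instance side, the drain sequence above can be sketched with an in-flight counter and a bounded wait. This is a simplified illustration, not a production server: the class name `Drainer` and its methods are hypothetical, and a real service would call `drain` from its SIGTERM handler before exiting.

```python
import threading


class Drainer:
    """Tracks in-flight requests and supports a bounded graceful drain."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin_request(self) -> bool:
        """Admit a request unless draining has started (steps 1 and 2)."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        """Mark one request as finished and wake any waiting drain."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float) -> int:
        """Stop admitting work and wait up to timeout_s (step 3).

        Returns the number of requests still in flight at the deadline;
        those are the ones that would be forcibly closed (step 4).
        """
        with self._cond:
            self._draining = True
            self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)
            return self._in_flight
```

A return value of 0 means every in-flight request completed within the drain timeout and the instance can exit without cutting anyone off.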

Failover: Automatic vs. Manual

When a server fails, traffic must be redistributed. This redistribution is failover, and it happens at two levels:

Instance-level failover is handled automatically by the load balancer. When Server C is removed from the pool, its share of the traffic is redistributed to the remaining healthy servers. This happens within seconds and requires no human action. The risk is that the survivors cannot absorb the extra load: if Servers A, B and C were each running at 70% capacity, absorbing C's traffic pushes A and B to 105%, so they start failing too, which triggers their own removal and leaves no healthy server to take the traffic. This is a cascading failure.

Zone/region-level failover handles the case where an entire availability zone or data centre becomes unreachable. Route 53 and other DNS-based global load balancers support health-check-based DNS failover: if the primary endpoint fails its check, DNS answers are automatically switched to the secondary region. The DNS TTL must be set low (60–120 seconds) for this to be effective: with a 300-second TTL, clients that cached the old record can keep hitting the failed region for up to 5 minutes.

[Figure: Cascading failure scenario vs. capacity-aware failover. Left panel, "Cascading Failure Risk": the load balancer distributes evenly across Server A (70% CPU), Server B (70% CPU) and Server C (failed); A and B absorb the extra load, both hit 105%, and both end up overloaded. Right panel, "Capacity-Aware Failover": a load balancer with an auto-scaling trigger fronts Server A (40% CPU) and Server B (40% CPU); when C fails, A and B rise to 60% CPU and stay healthy while the auto-scaler adds Server D within 90 s, which then warms up.]
Left: a cluster running at 70% capacity cascades when one node fails. Right: a cluster with headroom and auto-scaling absorbs the failure and replaces the lost node without overloading survivors.
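The two scenarios in the figure can be replayed with a toy simulation. The model is deliberately crude and is an assumption of this sketch: load is spread evenly, every server pushed past its capacity fails at once, and the survivors are then asked to absorb the load again. The function name is hypothetical.

```python
from typing import List


def simulate_failover(total_load: float, capacities: List[float]) -> List[float]:
    """Return the capacities of the servers that survive redistribution.

    total_load and capacities use the same unit (1.0 = one server at 100%).
    """
    survivors = list(capacities)
    while survivors:
        share = total_load / len(survivors)
        still_ok = [c for c in survivors if share <= c]
        if len(still_ok) == len(survivors):
            return survivors  # stable: nobody is overloaded
        survivors = still_ok  # overloaded servers drop out; redistribute
    return survivors  # empty list: the whole pool has cascaded
```

Three servers at 70% carry a total load of 2.1, so after C fails `simulate_failover(2.1, [1.0, 1.0])` returns an empty list. Three servers at 40% carry 1.2, and `simulate_failover(1.2, [1.0, 1.0])` returns both survivors, each at 60%.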

Circuit Breakers: Preventing Retry Storms

A load balancer removes an unhealthy backend from its own pool — but what about service-to-service calls deeper in the stack? If Service A calls Service B, and Service B is overwhelmed, every retry from A adds more load to B, making recovery impossible. This is where circuit breakers come in.

A circuit breaker is a state machine with three states:

  • Closed (normal operation): requests flow through. The breaker counts failures. If the failure rate exceeds a threshold (e.g., 50% of requests in a 10-second window), the circuit opens.
  • Open (failing fast): all calls to the downstream service immediately return an error without making a network call. This gives the downstream service breathing room to recover. After a configurable timeout (e.g., 30 seconds), the circuit enters the half-open state.
  • Half-open (probing): a small number of trial requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again.
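The three states above map onto a small state machine. The sketch below is a minimal, single-threaded illustration under stated assumptions: the class and parameter names are hypothetical, `min_calls` is an added guard so that a single failure cannot open the circuit, the circuit opens when the failure rate reaches the threshold, and the half-open state admits one trial call rather than several.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window_s=10.0, min_calls=4,
                 open_timeout_s=30.0, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.min_calls = min_calls          # assumption: not in the text
        self.open_timeout_s = open_timeout_s
        self.clock = clock                  # injectable for testing
        self.state = "closed"
        self._opened_at = 0.0
        self._results = deque()             # (timestamp, succeeded) pairs

    def call(self, fn):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.open_timeout_s:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"        # timeout elapsed: allow a trial
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            raise
        self._on_success(now)
        return result

    def _record(self, now, ok):
        self._results.append((now, ok))
        # Drop results that have fallen out of the rolling window.
        while now - self._results[0][0] > self.window_s:
            self._results.popleft()

    def _on_success(self, now):
        if self.state == "half_open":
            self.state = "closed"           # trial succeeded: close again
            self._results.clear()
        self._record(now, True)

    def _on_failure(self, now):
        if self.state == "half_open":
            self._open(now)                 # trial failed: reopen
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self._results if not ok)
        if (len(self._results) >= self.min_calls
                and failures / len(self._results) >= self.failure_rate):
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
        self._results.clear()
```

Wrapping each call from Service A to Service B in `breaker.call(...)` means that once B is failing, A stops sending it traffic for `open_timeout_s` seconds instead of retrying into an overloaded service.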
Common pitfall — health check endpoint doing too much: A health check endpoint that itself calls the database, the cache, and two external APIs will occasionally fail due to transient blips in any one of those dependencies — and will cause your server to be incorrectly removed from the pool. Keep the readiness check focused on the critical path only: can the app accept and process a request? Use a separate monitoring endpoint for full dependency checks that are NOT wired to the load balancer pool decision.

Key Metrics to Monitor

Health checks fire before failures become visible to users, but you also need real-time signals for the operations team:

  • Instance health ratio: what percentage of instances are currently passing health checks? Alert at <80%.
  • Failover event rate: how many instances were removed from the pool in the last hour? A non-zero rate in steady state suggests a recurring bug, a memory leak, or an external dependency that is periodically unavailable.
  • Mean time to detection (MTTD): how long between a failure occurring and the load balancer routing around it? Your target should be <30 seconds for user-facing services.
  • 5xx error rate during failover: some errors during the detection window (before the unhealthy instance is removed) are unavoidable. The goal is to minimise this window, not eliminate errors entirely.
The golden rule of failover: Design for the assumption that at any given moment, up to 30% of your instances can be unhealthy, and your system must still serve 100% of traffic. The surviving 70% then carry the full load, so each survivor handles 1 / 0.7 ≈ 1.43 times its steady-state share. With N = 10 servers, 3 can fail and the remaining 7 must handle everything, each at roughly 143% of its normal share. That only works if no server runs above about 70% utilisation in steady state, which means provisioning about 43% more instances than the load strictly requires (n / 0.7 ≈ 1.43n). This headroom calculation is a core part of capacity planning.
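The headroom arithmetic can be written down directly. This is a minimal sketch with hypothetical function names; it takes the tolerated unhealthy share as a whole-number percentage so that the rounding is done in exact integer arithmetic.

```python
def instances_to_provision(needed_at_full_load: int, max_unhealthy_pct: int) -> int:
    """Instances to run so the healthy remainder still covers the load.

    needed_at_full_load: instances required if every one of them is healthy.
    max_unhealthy_pct: share of the fleet (0-99) allowed to be unhealthy.
    """
    healthy_pct = 100 - max_unhealthy_pct
    # Integer ceiling of needed * 100 / healthy_pct.
    return -(-needed_at_full_load * 100 // healthy_pct)


def survivor_load_multiplier(max_unhealthy_pct: int) -> float:
    """How much each survivor's share grows when the maximum share fails."""
    return 100 / (100 - max_unhealthy_pct)
```

For a load that needs 10 fully utilised instances and a 30% failure budget, `instances_to_provision(10, 30)` returns 15, which is the roughly 43% headroom rounded up to a whole instance.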

Health checks and failover are ultimately about building a system that knows about its own failures faster than your users do, and takes corrective action automatically. Pair them with good alerting, capacity headroom, and circuit breakers, and you have the foundation of a truly self-healing distributed system.