Health Checks & Failover
Adding more servers solves the capacity problem, but it immediately raises a harder question: how does your load balancer know which servers are actually ready to serve traffic? A crashed process, an out-of-memory app, a database connection pool that ran dry, or a deadlocked thread — all of these leave the server "up" at the network layer while it is completely unable to serve requests. Without a systematic way to detect and route around such failures, your load balancer will keep sending traffic into a black hole, and your users will see errors.
Health checks and failover are the mechanisms that turn a collection of independent servers into a self-healing cluster. They are the operational backbone of any high-availability system.
What a Health Check Actually Tests
There are three progressively deeper levels of health check, and choosing the right level is a deliberate design decision:
- L1 — TCP / Ping check: Can we open a TCP connection to port 8080? This tells you the OS is alive and the process is listening. It says nothing about whether the app can actually process requests. False-negative rate: very low. False-positive rate: high.
- L2 — HTTP status check: Does GET /health return HTTP 200? This tells you the web server is responding. It still says nothing about whether downstream dependencies (database, cache, message queue) are reachable. False-positive rate: moderate.
- L3 — Deep / dependency check: Does GET /health/ready return 200 AND verify that the DB connection is live, the cache is reachable, and the queue is draining? This is the most accurate signal of true readiness, but it adds latency to the check itself and can cause a cascade of health-check traffic under stress.
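To make the difference between the shallowest and deepest levels concrete, here is a minimal Python sketch. The names tcp_check and readiness, and the idea of passing dependency probes in as callables, are assumptions made for illustration; they are not taken from any particular load balancer or framework.

```python
import socket
from typing import Callable, Dict, Tuple


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Level 1: pass if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing is listening for us.
        return False


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Level 3: run every dependency probe and return (HTTP status, details).

    Each probe is a hypothetical zero-argument callable such as
    "can I run SELECT 1?". A probe that raises counts as a failure.
    """
    results: Dict[str, bool] = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

A /health/ready handler could return the status from readiness() directly and include the details dictionary in the response body, so an operator can see which dependency failed.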
Health Check Configuration Parameters
Every load balancer exposes at least these four knobs, and tuning them correctly is the difference between a system that recovers in 5 seconds and one that takes 3 minutes:
- Interval — How often the probe fires. Typical: 5–15 seconds. Shorter = faster detection, more check traffic.
- Timeout — How long to wait for a response before counting it as a failure. Typical: 2–5 seconds. Must be less than interval.
- Unhealthy threshold — How many consecutive failures before the instance is removed from rotation. Typical: 2–3. A threshold of 1 causes flapping on transient network blips.
- Healthy threshold — How many consecutive successes before a recovered instance is re-added to rotation. Typical: 2–3. This prevents re-adding an instance that happens to pass one check during a degraded state.
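The two thresholds amount to a small state machine per backend. The sketch below is one possible way to express it; HealthTracker and worst_case_detection_seconds are hypothetical names, and the detection-time formula is only a rough upper bound that assumes the failure happens just after a successful probe.

```python
class HealthTracker:
    """Tracks one backend's state from a stream of probe results."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the backend is in rotation."""
        if probe_ok == self.healthy:
            # Result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= limit:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


def worst_case_detection_seconds(interval: float, timeout: float,
                                 unhealthy_threshold: int) -> float:
    """Rough upper bound: wait for N probes, the last of which must time out."""
    return interval * unhealthy_threshold + timeout
```

With a 10-second interval, a 5-second timeout, and a threshold of 3, the bound is 35 seconds; with 5, 2, and 2 it drops to 12 seconds, which shows how much the knobs move recovery time.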
Passive vs. Active Health Checks
There are two complementary strategies for detecting failures, and production systems use both:
Active checks (proactive): the load balancer initiates a dedicated probe request to each backend on a schedule. The advantage is that you detect failures even when there is no real traffic. The disadvantage is that the health endpoint must be kept lightweight — a heavy dependency check run every 5 seconds on 50 servers generates significant overhead.
Passive checks (reactive): the load balancer monitors the real request responses as they flow through. If a server returns 5xx errors or times out on N consecutive requests, it is marked unhealthy. The advantage is zero extra traffic. The disadvantage is that passive checks only trigger when real users are being served errors — you cannot detect a failure before users hit it.
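A passive check can be sketched as a counter fed by real responses. PassiveMonitor and its method names are assumptions for illustration. Note that once a backend is ejected it receives no traffic, so this sketch relies on something else (typically an active probe) calling readmit.

```python
from collections import defaultdict
from typing import Optional, Set


class PassiveMonitor:
    """Ejects a backend after N consecutive failed real requests."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = defaultdict(int)
        self.ejected: Set[str] = set()

    def observe(self, backend: str, status: Optional[int]) -> None:
        """Record one real response; status=None means the request timed out."""
        failed = status is None or status >= 500
        if not failed:
            self._failures[backend] = 0  # any success breaks the streak
            return
        self._failures[backend] += 1
        if self._failures[backend] >= self.max_consecutive_failures:
            self.ejected.add(backend)

    def readmit(self, backend: str) -> None:
        """Called by an external signal, e.g. an active probe that passed."""
        self.ejected.discard(backend)
        self._failures[backend] = 0
```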
Graceful Shutdown & Draining
Removing a server from the pool during a rolling deploy or a scale-in event is a failure scenario in disguise. If the instance exits the moment it receives the shutdown signal, in-flight requests, which might take anywhere from 0.5 to 30 seconds, are cut off abruptly. Users see errors that have nothing to do with bugs in your code.
The solution is connection draining (AWS calls it "deregistration delay"; NGINX Plus offers a drain parameter for upstream servers). The sequence is:
- The instance signals it is shutting down (SIGTERM, or the health endpoint returns 503).
- The load balancer stops routing new requests to that instance.
- The load balancer waits up to a configurable drain timeout (typically 30–90 seconds) for in-flight requests to complete.
- After the timeout, any remaining connections are forcibly closed and the instance is deregistered.
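The instance side of that sequence can be sketched as follows. DrainingServer is a hypothetical helper, not an API from any web framework; in a real service, drain() would typically be called from a SIGTERM handler, and try_begin_request / end_request would wrap each request handler.

```python
import threading
import time


class DrainingServer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def health_status(self) -> int:
        """What the health endpoint should return: 503 once draining starts."""
        with self._cond:
            return 503 if self._draining else 200

    def try_begin_request(self) -> bool:
        """Admit a request unless we are draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting work and wait for in-flight requests.

        Returns True if everything finished before the drain timeout,
        False if the caller now has to close connections forcibly.
        """
        deadline = time.monotonic() + timeout
        with self._cond:
            self._draining = True
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True
```

The drain timeout passed here should be no longer than the load balancer's own drain timeout, otherwise the balancer may sever connections while the instance is still waiting.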
Failover: Automatic vs. Manual
When a server fails, traffic must be redistributed. This redistribution is failover, and it happens at two levels:
Instance-level failover is handled automatically by the load balancer. When Server C is removed from the pool, its share of the traffic is redistributed to the remaining healthy servers. This happens within seconds and requires no human action. The risk is overload: if Servers A, B, and C were each running at 70% capacity, then A and B absorbing C's traffic pushes them to 105%. Now they start failing too, triggering their own removal, which pushes all traffic to... nothing. This is a cascading failure.
Zone/region-level failover handles the case where an entire availability zone or data centre becomes unreachable. Route 53 and other global load balancers support health-check-based DNS failover: if the primary endpoint fails its check, DNS is automatically updated to point to the secondary region. DNS TTL must be set low (60–120 seconds) for this to be effective, because clients keep using the cached record until it expires: a 300-second TTL means up to 5 minutes of failures before clients see the updated record.
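The overload arithmetic is easy to check with a toy model. The sketch below assumes identical servers, evenly spread load, and (as a deliberate simplification) that an overloaded pool loses one more server per round; simulate_cascade is a hypothetical name.

```python
def simulate_cascade(servers: int, utilisation: float,
                     initial_failures: int = 1) -> int:
    """Return how many servers are still alive once the pool stabilises.

    utilisation is each server's steady-state load as a fraction of its
    capacity (0.7 means 70%).
    """
    total_load = servers * utilisation  # measured in "one server's capacity"
    alive = servers - initial_failures
    # While the survivors are pushed past 100%, assume another one fails.
    while alive > 0 and total_load / alive > 1.0:
        alive -= 1
    return alive


if __name__ == "__main__":
    print(simulate_cascade(3, 0.7))  # three servers at 70%, one fails
    print(simulate_cascade(4, 0.7))  # four servers at 70%, one fails
```

In this model, three servers at 70% collapse to zero after one failure, while four servers at 70% settle at three survivors running at roughly 93%.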
Circuit Breakers: Preventing Retry Storms
A load balancer removes an unhealthy backend from its own pool — but what about service-to-service calls deeper in the stack? If Service A calls Service B, and Service B is overwhelmed, every retry from A adds more load to B, making recovery impossible. This is where circuit breakers come in.
A circuit breaker is a state machine with three states:
- Closed (normal operation): requests flow through. The breaker counts failures. If the failure rate exceeds a threshold (e.g., 50% of requests in a 10-second window), the circuit opens.
- Open (failing fast): all calls to the downstream service immediately return an error without making a network call. This gives the downstream service breathing room to recover. After a configurable timeout (e.g., 30 seconds), the circuit enters the half-open state.
- Half-open (probing): a small number of trial requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again.
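The three states can be sketched in a few dozen lines. This is a simplified, single-threaded illustration, not the implementation of any particular resilience library: the class name, parameter names, the min_calls guard, and the choice to allow exactly one trial request in the half-open state are all assumptions.

```python
import time
from collections import deque
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast without calling downstream."""


class CircuitBreaker:
    def __init__(self, failure_rate: float = 0.5, window: float = 10.0,
                 min_calls: int = 4, open_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_rate = failure_rate
        self.window = window
        self.min_calls = min_calls      # avoid tripping on a single early error
        self.open_timeout = open_timeout
        self._clock = clock
        self.state = "closed"
        self._results = deque()         # (timestamp, succeeded) within the window
        self._opened_at = 0.0

    def call(self, fn: Callable):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.open_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"    # timeout elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            raise
        self._on_success(now)
        return result

    def _on_success(self, now: float) -> None:
        if self.state == "half_open":
            self.state = "closed"       # trial succeeded
            self._results.clear()
        else:
            self._record(now, True)

    def _on_failure(self, now: float) -> None:
        if self.state == "half_open":
            self._trip(now)             # trial failed: open again
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self._results if not ok)
        total = len(self._results)
        if total >= self.min_calls and failures / total > self.failure_rate:
            self._trip(now)

    def _record(self, now: float, ok: bool) -> None:
        self._results.append((now, ok))
        while self._results and now - self._results[0][0] > self.window:
            self._results.popleft()     # drop results older than the window

    def _trip(self, now: float) -> None:
        self.state = "open"
        self._opened_at = now
        self._results.clear()
```

Service A would wrap each call to Service B as breaker.call(lambda: client.get(...)), where client is whatever HTTP client it already uses, and treat CircuitOpenError as a signal to fall back or return a cached response rather than retry.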
Key Metrics to Monitor
Health checks are meant to catch failures before most users notice them, but you also need real-time signals for the operations team:
- Instance health ratio: what percentage of instances are currently passing health checks? Alert at <80%.
- Failover event rate: how many instances were removed from the pool in the last hour? A non-zero rate in steady state suggests a recurring bug, a memory leak, or an external dependency that is periodically unavailable.
- Mean time to detection (MTTD): how long between a failure occurring and the load balancer routing around it? Your target should be <30 seconds for user-facing services.
- 5xx error rate during failover: some errors during the detection window (before the unhealthy instance is removed) are unavoidable. The goal is to minimise this window, not eliminate errors entirely.
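Two of these metrics reduce to simple arithmetic. The sketch below assumes you can obtain a list of per-instance health flags and a list of (failure time, removal time) pairs from your own monitoring data; the function names and input shapes are hypothetical.

```python
from typing import List, Tuple


def health_ratio(instance_healthy: List[bool]) -> float:
    """Fraction of instances currently passing health checks."""
    if not instance_healthy:
        return 0.0  # an empty pool has nothing healthy in it
    return sum(instance_healthy) / len(instance_healthy)


def should_alert(instance_healthy: List[bool], threshold: float = 0.8) -> bool:
    """Alert when the healthy fraction drops below the threshold."""
    return health_ratio(instance_healthy) < threshold


def mean_time_to_detection(events: List[Tuple[float, float]]) -> float:
    """Average seconds between a failure and its removal from the pool.

    Each event is (failure_timestamp, removed_timestamp) in seconds.
    """
    if not events:
        raise ValueError("no failover events recorded")
    return sum(removed - failed for failed, removed in events) / len(events)
```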
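Capacity headroom matters here too. If 30% of your instances can fail, the surviving 70% must carry the full load, so each survivor handles 1/0.7 of its steady-state share. At N=10 servers: 3 can fail and 7 must handle full load, each at roughly 143% of its steady-state share, which is only sustainable if every server normally runs at no more than 70% utilisation (30% excess capacity). This headroom calculation is a core part of capacity planning.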
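The headroom arithmetic can be written down directly. This sketch assumes identical servers and evenly balanced load; both function names are hypothetical.

```python
def survivor_load_factor(servers: int, failed: int) -> float:
    """How much each survivor's load grows, relative to its steady-state share."""
    if not 0 <= failed < servers:
        raise ValueError("failed must be between 0 and servers - 1")
    return servers / (servers - failed)


def max_steady_utilisation(servers: int, failed: int) -> float:
    """Highest steady-state utilisation that still survives `failed` losses."""
    if not 0 <= failed < servers:
        raise ValueError("failed must be between 0 and servers - 1")
    return (servers - failed) / servers


if __name__ == "__main__":
    # 10 servers, tolerate 3 failures: about 1.43x load per survivor,
    # so steady-state utilisation must stay at or below 70%.
    print(round(survivor_load_factor(10, 3), 2))
    print(max_steady_utilisation(10, 3))
```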
Health checks and failover are ultimately about building a system that knows about its own failures faster than your users do, and takes corrective action automatically. Pair them with good alerting, capacity headroom, and circuit breakers, and you have the foundation of a truly self-healing distributed system.