Health Checks & Heartbeats
A system that cannot detect its own failures cannot recover from them. Health checks and heartbeats are the mechanisms that give orchestrators, load balancers, and on-call engineers a continuous, real-time picture of which instances are healthy and which have quietly stopped working. Without them, a crashed pod keeps receiving traffic, a degraded database keeps accepting writes it will never commit, and a runaway process consumes all available memory while serving garbage — all invisibly.
This lesson covers the two complementary detection models (push and pull), the difference between liveness and readiness, the failure modes that each approach misses, and the numbers you need to make these decisions concretely.
Two Detection Models: Pull vs Push
Pull (active probing): a controller (load balancer, Kubernetes kubelet, Consul agent) sends a request to the instance at a regular interval. The instance responds, and the controller interprets the response. This is the most common model and the one used by AWS target groups, GCP HTTP load balancers, NGINX Plus active health checks, and Kubernetes probes.
Push (heartbeat): the instance sends periodic signals to a central aggregator or monitoring service. Silence means failure. This model is used by ZooKeeper session timeouts, Consul agent gossip, Kafka broker liveness, and cluster membership protocols like SWIM.
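The push model can be sketched in a few lines. This is a minimal illustration, not a production monitor: the class name HeartbeatMonitor, its methods, and the 15-second timeout are assumptions made for the example.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per instance; silence past the timeout means failure."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def record(self, instance_id: str, now: Optional[float] = None) -> None:
        # Called whenever an instance pushes a heartbeat.
        self._last_seen[instance_id] = time.monotonic() if now is None else now

    def silent_instances(self, now: Optional[float] = None) -> List[str]:
        # Instances whose last heartbeat is older than the timeout.
        current = time.monotonic() if now is None else now
        return sorted(
            instance_id
            for instance_id, seen in self._last_seen.items()
            if current - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=15.0)  # hypothetical timeout
monitor.record("api-1", now=0.0)
monitor.record("api-2", now=0.0)
monitor.record("api-1", now=10.0)  # api-2 stays silent
print(monitor.silent_instances(now=20.0))  # ['api-2']
```

Note that the monitor only learns about instances that have sent at least one heartbeat; an instance that never started is invisible to it, which is one reason push and pull are often combined.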
Liveness vs Readiness vs Startup Checks
Kubernetes popularised a clean taxonomy that most modern systems have now adopted:
- Startup probe — is the process still initialising? Prevents the liveness probe from killing a slow-starting container (e.g., JVM warm-up). Only runs until it first succeeds.
- Liveness probe — is the process fundamentally alive? A failure triggers a restart. Use it to detect deadlocks, out-of-memory loops, or processes that have entered an unrecoverable state. Keep it cheap and fast — it must not depend on downstream services.
- Readiness probe — is the instance ready to receive traffic? A failure removes the instance from the load balancer pool without restarting it. Use it to signal warmup, dependency unavailability, or intentional quiescing before a rolling deploy.
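The three probe types map to three different reactions. The sketch below is a simplified model of that mapping, not Kubernetes' actual implementation: the names ProbeResults and decide_action and the action strings are hypothetical, and failure thresholds are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class ProbeResults:
    startup_ok: bool
    liveness_ok: bool
    readiness_ok: bool


def decide_action(results: ProbeResults, started: bool) -> str:
    """Map probe outcomes to the reaction described for each probe type."""
    if not started:
        # Until the startup probe first succeeds, liveness and readiness are not evaluated.
        return "mark_started" if results.startup_ok else "keep_waiting"
    if not results.liveness_ok:
        return "restart"  # liveness failure: restart the container
    if not results.readiness_ok:
        return "remove_from_pool"  # readiness failure: stop routing, do not restart
    return "serve_traffic"


# A slow-starting container is left alone even though liveness would fail.
print(decide_action(ProbeResults(False, False, False), started=False))  # keep_waiting
# A live instance with an unavailable dependency is drained, not restarted.
print(decide_action(ProbeResults(True, True, False), started=True))  # remove_from_pool
```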
What a Health Endpoint Should Check
A well-designed GET /health endpoint returns more than a boolean. It returns a structured JSON body with per-component statuses so that operators can diagnose exactly what is wrong.
Return HTTP 200 for healthy and HTTP 503 for unhealthy. Prefer 503 over 500: 503 (Service Unavailable) says the instance is temporarily unable to serve, whereas 500 reads as an unexpected application error. A degraded status can still return 200 if the instance can serve requests in a reduced-capacity mode; return 503 only when the instance genuinely cannot serve traffic.
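One possible shape for such a response is sketched below. The field names ("status", "components"), the component names, and the rule that any unhealthy component makes the whole instance unhealthy are assumptions for illustration; a real service may weigh components differently.

```python
import json
from typing import Dict, Tuple


def build_health_response(components: Dict[str, str]) -> Tuple[int, str]:
    """components maps a sub-component name to "healthy", "degraded", or "unhealthy"."""
    statuses = set(components.values())
    if "unhealthy" in statuses:
        overall = "unhealthy"
    elif "degraded" in statuses:
        overall = "degraded"
    else:
        overall = "healthy"
    # 503 only when the instance cannot serve traffic; degraded still returns 200.
    status_code = 503 if overall == "unhealthy" else 200
    body = json.dumps({"status": overall, "components": components}, sort_keys=True)
    return status_code, body


# Hypothetical component names.
code, body = build_health_response({"database": "healthy", "cache": "degraded"})
print(code, body)
```

The framework-agnostic return value (status code plus JSON string) keeps the sketch independent of any particular web server.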
Probe Timing: The Numbers That Matter
Three parameters control how quickly a failure is detected vs how many false positives you trigger:
- Interval: how often the controller probes. Typical values: 10–30 s. Shorter intervals detect failures faster but add CPU and network overhead on large fleets.
- Timeout: how long the controller waits for a response before counting the probe as a failure. Must be less than the interval. Typical: 3–10 s.
- Threshold (failure / success count): how many consecutive failures before marking the instance unhealthy, and how many consecutive successes before marking it healthy again. Typical: 2–3 failures, 2 successes. This keeps transient blips from triggering unnecessary restarts.
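The threshold logic amounts to a small state machine that counts consecutive contradicting results. The following is a minimal sketch; the class name ProbeTracker and its defaults (3 failures, 2 successes) are assumptions chosen to match the typical values above.

```python
class ProbeTracker:
    """Flips health state only after enough consecutive contradicting probe results."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
        else:
            self._streak += 1
            needed = self.failure_threshold if self.healthy else self.success_threshold
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


tracker = ProbeTracker()
# A two-failure blip is absorbed; three consecutive failures flip the state.
results = [tracker.observe(ok) for ok in [False, False, True, False, False, False, True, True]]
print(results)  # [True, True, True, True, True, False, False, True]
```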
Time-to-detect = interval × failure_threshold + timeout. With a 10 s interval, 5 s timeout, and 3 failure threshold, the worst-case detection latency is 35 seconds. During that window, traffic still flows to the dead instance. Size your retry budgets (covered in Lesson 6) accordingly.
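The formula is easy to turn into a helper for comparing candidate configurations. The function name and the second configuration below are illustrative assumptions; the first call reproduces the 35-second example.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float, failure_threshold: int) -> float:
    """Worst-case time to detect a failure: interval * failure_threshold + timeout."""
    if timeout_s >= interval_s:
        raise ValueError("timeout must be less than the interval")
    return interval_s * failure_threshold + timeout_s


print(worst_case_detection_s(10, 5, 3))  # 35
print(worst_case_detection_s(30, 10, 3))  # 100
```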
Gossip and SWIM: Heartbeats at Cluster Scale
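When you have hundreds or thousands of nodes, polling every node from a central controller creates a single point of failure and a fan-out bottleneck. Consul (through HashiCorp's Serf and memberlist libraries) instead uses the SWIM protocol (Scalable Weakly-consistent Infection-style Process Group Membership). Cassandra and Akka Cluster take a related approach, pairing gossip with a phi accrual failure detector rather than SWIM itself. A SWIM probe round works as follows: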
- Each node periodically picks a random peer and sends it a ping.
- If the peer does not respond within a deadline, the node picks k other random nodes and asks them to ping-req the suspect on its behalf (indirect probe).
- Only if all indirect probes also fail is the suspect marked as suspect, and after a further grace period, as dead.
- Membership updates propagate via gossip, piggybacked on regular messages, so the whole cluster converges in O(log N) rounds.
This makes failure detection bandwidth constant per node regardless of cluster size — a critical property for systems with thousands of nodes.
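A single SWIM-style probe round can be sketched as follows. This is a simplified model, not the memberlist implementation: the function probe_round, the can_reach callback standing in for the network, and the default k=3 are assumptions, and the suspect-to-dead grace period and gossip dissemination are left out.

```python
import random
from typing import Callable, List, Optional


def probe_round(
    prober: str,
    target: str,
    members: List[str],
    can_reach: Callable[[str, str], bool],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "alive" or "suspect" for target after one probe round."""
    rng = rng or random.Random()
    # Direct ping.
    if can_reach(prober, target):
        return "alive"
    # Indirect probe: ask up to k other members to ping the target (ping-req).
    helpers = [m for m in members if m not in (prober, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(prober, helper) and can_reach(helper, target):
            return "alive"
    # Every probe failed: suspect now, dead only after a further grace period.
    return "suspect"


members = ["a", "b", "c", "d"]
# Only the a -> d link is broken, so an indirect probe still reaches d.
print(probe_round("a", "d", members, lambda src, dst: (src, dst) != ("a", "d")))  # alive
# Node d is down for everyone.
print(probe_round("a", "d", members, lambda src, dst: dst != "d"))  # suspect
```

The indirect step is what separates a broken link between two nodes from a genuinely failed node, which a single direct ping cannot do.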
Operational Considerations
Health check endpoints need auth protection — they expose internal topology and dependency status. Use network-layer restrictions (VPC only, private subnet) or a shared secret header. Never expose /health/detail to the public internet.
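A shared secret header check might look like the sketch below. The header name X-Health-Token and the function name are hypothetical, and header keys are assumed to arrive with that exact casing. The comparison uses hmac.compare_digest from the standard library to avoid leaking the token through timing differences.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Check a shared-secret header before serving detailed health output."""
    if not expected_token:
        return False  # refuse everything if no secret is configured
    supplied = headers.get("X-Health-Token", "")  # hypothetical header name
    # Constant-time comparison of the supplied and expected tokens.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())


print(is_authorized({"X-Health-Token": "s3cret"}, "s3cret"))  # True
print(is_authorized({}, "s3cret"))  # False
```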
Distinguish between deep and shallow checks. A shallow check verifies the process is running and can accept connections. A deep check verifies it can actually do useful work (query the DB, read from the queue). Use shallow for liveness, deep for readiness. Deep checks on every probe interval can overload your database with synthetic traffic.
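One common way to limit that synthetic load, offered here as a suggestion rather than something the lesson prescribes, is to cache the deep check result for a short time so that many probes share one real dependency call. The class name CachedCheck and the 30-second TTL are assumptions.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Runs an expensive check at most once per TTL and serves the cached result otherwise."""

    def __init__(self, check: Callable[[], bool], ttl_s: float) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._last_result: Optional[bool] = None
        self._last_run = 0.0

    def result(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        if self._last_result is None or current - self._last_run >= self._ttl_s:
            self._last_result = self._check()  # the real dependency call happens here
            self._last_run = current
        return self._last_result


calls = []


def fake_db_check() -> bool:
    # Stand-in for a real query such as "SELECT 1".
    calls.append(1)
    return True


db_check = CachedCheck(fake_db_check, ttl_s=30.0)
db_check.result(now=0.0)
db_check.result(now=10.0)  # served from cache
db_check.result(now=30.0)  # TTL elapsed, check runs again
print(len(calls))  # 2
```

The trade-off is that a cached result can be up to one TTL stale, so the TTL adds to the detection latency of the deep check.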
Log health check failures at a separate severity level. A single transient failure is noise; two consecutive failures of the same component should page an engineer. Aggregate failures across the fleet — if 40% of instances fail their DB readiness check simultaneously, that is a database problem, not 40 separate instance problems.
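The fleet-level aggregation can be sketched as a simple classification over per-instance results. The function name, the labels, and the use of 40% as the cut-off are illustrative assumptions taken from the example above; a real alerting rule would also consider time windows.

```python
from typing import Dict


def classify_failures(results: Dict[str, bool], fleet_ratio: float = 0.4) -> str:
    """results maps an instance ID to whether its DB readiness check passed."""
    if not results:
        return "no_data"
    failed = sum(1 for ok in results.values() if not ok)
    if failed == 0:
        return "healthy"
    if failed / len(results) >= fleet_ratio:
        # Many instances failing the same check at once points at the shared dependency.
        return "shared_dependency_problem"
    return "instance_problem"


fleet = {f"api-{i}": i >= 4 for i in range(10)}  # api-0 through api-3 fail
print(classify_failures(fleet))  # shared_dependency_problem
print(classify_failures({"api-0": False, "api-1": True, "api-2": True, "api-3": True}))  # instance_problem
```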
Health checks and heartbeats are the nervous system of a resilient distributed architecture. Without them, you are flying blind; with well-tuned probes, your system can detect and route around failures in seconds rather than minutes.