Reliability, Availability & Resilience

Health Checks & Heartbeats

18 min Lesson 7 of 10

A system that cannot detect its own failures cannot recover from them. Health checks and heartbeats are the mechanisms that give orchestrators, load balancers, and on-call engineers a continuous, real-time picture of which instances are healthy and which have quietly stopped working. Without them, a crashed pod keeps receiving traffic, a degraded database keeps accepting writes it will never commit, and a runaway process consumes all available memory while serving garbage — all invisibly.

This lesson covers the two complementary detection models (pull and push), the difference between liveness and readiness, the failure modes that each approach misses, and the timing parameters that determine how quickly a failure is detected.

Two Detection Models: Pull vs Push

Pull (active probing): a controller (load balancer, Kubernetes kubelet, Consul agent) sends a request to the instance at a regular interval. The instance responds; the controller interprets the response. This is the most common model and the default in AWS Target Groups, GCP HTTP load balancers, NGINX upstream checks, and Kubernetes.

Push (heartbeat): the instance sends periodic signals to a central aggregator or monitoring service. Silence means failure. This model underlies ZooKeeper session timeouts and Kafka broker liveness. Gossip-based membership systems, such as Consul's agents, apply a related decentralised idea in which peers monitor one another (see the section on gossip and SWIM later in this lesson).

Key insight: Pull health checks detect failures from the outside; heartbeats detect them from the inside. A process whose request handlers are deadlocked can keep emitting heartbeats from a background thread, yet it will fail a pull check that exercises the serving path. Conversely, a network partition between the controller and the instance will cause a pull check to time out even if the instance is perfectly healthy. Use both for defence in depth.

Pull model: the controller probes each instance. Push model: instances send heartbeats; silence triggers an alert.
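
The push model's "silence means failure" rule can be illustrated with a minimal sketch. The class name HeartbeatMonitor, its methods, and the 15-second timeout in the usage note are hypothetical choices made for this example, not part of any particular product.

```python
import time


class HeartbeatMonitor:
    """Push model: instances report in; silence beyond `timeout_s` means failure."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the logic can be exercised without sleeping
        self.last_seen = {}     # instance id -> time of the most recent heartbeat

    def beat(self, instance_id):
        """Record a heartbeat that has just arrived from an instance."""
        self.last_seen[instance_id] = self.clock()

    def silent_instances(self):
        """Return the ids whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            instance_id
            for instance_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A monitor created with HeartbeatMonitor(timeout_s=15) would call beat() whenever a signal arrives and poll silent_instances() on its own schedule. Note that the monitor only knows about instances that have reported at least once, so a real deployment would also need a registry of expected members.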

Liveness vs Readiness vs Startup Checks

Kubernetes popularised a clean taxonomy that most modern systems have now adopted:

Startup probe — is the process still initialising? Prevents the liveness probe from killing a slow-starting container (e.g., JVM warm-up). Only runs until it first succeeds.
Liveness probe — is the process fundamentally alive? A failure triggers a restart. Use it to detect deadlocks, out-of-memory loops, or processes that have entered an unrecoverable state. Keep it cheap and fast — it must not depend on downstream services.
Readiness probe — is the instance ready to receive traffic? A failure removes the instance from the load balancer pool without restarting it. Use it to signal warmup, dependency unavailability, or intentional quiescing before a rolling deploy.

Common mistake: making the liveness probe call a database or external API. If that dependency goes down, all your instances restart simultaneously — a thundering-herd self-inflicted outage. Liveness should check only the process itself (e.g., a simple in-memory counter increment). Reserve dependency checks for readiness.
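
One way to keep the two probes separate is sketched below. It assumes a hypothetical ProbeHandlers class whose liveness result depends only on an in-process progress timestamp, while readiness consults dependency checks supplied by the caller. The names, the 30-second stall limit, and the status-code return values are illustrative assumptions.

```python
import time


class ProbeHandlers:
    def __init__(self, dependency_checks, stall_limit_s=30.0, clock=time.monotonic):
        self.dependency_checks = dependency_checks  # name -> callable returning bool
        self.stall_limit_s = stall_limit_s
        self.clock = clock
        self.last_progress = clock()
        self.draining = False  # set to True before a rolling deploy

    def record_progress(self):
        """Called by the main work loop each time it completes an iteration."""
        self.last_progress = self.clock()

    def liveness(self):
        """Process-local only: 503 if the work loop has stalled; no dependency calls."""
        stalled = self.clock() - self.last_progress > self.stall_limit_s
        return 503 if stalled else 200

    def readiness(self):
        """503 while draining, or while any dependency check fails or raises."""
        if self.draining:
            return 503
        for check in self.dependency_checks.values():
            try:
                if not check():
                    return 503
            except Exception:
                return 503
        return 200
```

With this split, a database outage flips readiness to 503 on every instance but leaves liveness at 200, so instances are pulled from rotation rather than restarted together.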

What a Health Endpoint Should Check

A well-designed GET /health endpoint is not a boolean. It returns a structured JSON body with sub-component statuses so that operators can diagnose exactly what is wrong:

HTTP 200 OK
{
  "status": "degraded",
  "checks": {
    "db_primary":   { "status": "ok",      "latency_ms": 4   },
    "db_replica":   { "status": "ok",      "latency_ms": 6   },
    "cache":        { "status": "ok",      "latency_ms": 1   },
    "message_queue":{ "status": "timeout", "latency_ms": 5000 },
    "disk_space":   { "status": "ok",      "free_gb": 48     }
  },
  "version": "v3.17.2",
  "uptime_seconds": 86402
}

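Return HTTP 200 for healthy and HTTP 503 for unhealthy. Avoid 500: it signals an unexpected error inside the handler rather than a deliberate "not available" answer, which makes a failing dependency hard to tell apart from a bug in the health endpoint itself. A degraded status can still return 200 if the instance can serve requests in a reduced-capacity mode; return 503 only when the instance genuinely cannot serve traffic.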

Best practice: set a tight internal timeout on every sub-check (e.g., 200 ms). A health check that takes 5 s to respond defeats the purpose — the prober times out and marks the instance unhealthy, or worse, blocks the probe thread pool. The whole endpoint should respond in under 500 ms.
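
The sketch below shows one possible way to bound every sub-check with a shared deadline and map the results to 200 or 503. The function name run_health_checks, the critical parameter, and the "error" status are assumptions made for illustration; only the "ok", "timeout", and "degraded" labels come from the sample response above.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_health_checks(checks, critical, timeout_s=0.2):
    """Run sub-checks concurrently under one deadline.

    checks:   name -> zero-argument callable that raises on failure
    critical: names whose failure means the instance cannot serve traffic
    Returns (http_status, json_body).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    started = time.monotonic()
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    results = {}
    for name, future in futures.items():
        # Every check shares the same deadline, measured from submission.
        remaining = max(0.0, timeout_s - (time.monotonic() - started))
        try:
            future.result(timeout=remaining)
            results[name] = {"status": "ok"}
        except FutureTimeout:
            results[name] = {"status": "timeout"}
        except Exception as exc:
            results[name] = {"status": "error", "detail": str(exc)}
    pool.shutdown(wait=False)  # respond now; a hung check finishes in the background

    failed = [name for name, result in results.items() if result["status"] != "ok"]
    if any(name in critical for name in failed):
        overall, http_status = "unhealthy", 503
    elif failed:
        overall, http_status = "degraded", 200
    else:
        overall, http_status = "ok", 200
    return http_status, json.dumps({"status": overall, "checks": results})
```

Creating a thread pool per request keeps the sketch short but is wasteful; a long-lived pool with a bounded size would be the more likely choice in a real service. A check that never returns still occupies its thread, so the sub-checks themselves should also carry client-side timeouts.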

Probe Timing: The Numbers That Matter

Three parameters control how quickly a failure is detected vs how many false positives you trigger:

Interval: how often the controller probes. Typical values: 10–30 s. Shorter intervals detect failures faster but add CPU and network overhead on large fleets.
Timeout: how long the controller waits for a response before counting it as a failure. It should be shorter than the interval so that probes do not overlap. Typical: 3–10 s.
Threshold (failure / success count): how many consecutive failures before marking unhealthy, and how many consecutive successes before marking healthy again. Typical: 2–3 failures, 2 successes. This keeps transient blips from causing unnecessary restarts.
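
The threshold behaviour described above amounts to a small state machine. The following is a minimal sketch; the ThresholdTracker name and its defaults (3 failures, 2 successes) are illustrative and not taken from any specific load balancer.

```python
class ThresholdTracker:
    def __init__(self, failure_threshold=3, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok):
        """Feed one probe result; return the (possibly updated) health state."""
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with the current state, so reset
            return self.healthy
        self.streak += 1
        needed = self.failure_threshold if self.healthy else self.success_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

A single failed probe between successes resets the streak, so only an unbroken run of failures (or successes) changes the state.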

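Time-to-detect = interval × failure_threshold + timeout. The worst case is an instance that fails just after a successful probe: the first failing probe starts almost a full interval later, and the final probe needed to reach the threshold must still run out its timeout. With a 10 s interval, a 5 s timeout, and a failure threshold of 3, the worst-case detection latency is 10 × 3 + 5 = 35 seconds. During that window, traffic still flows to the dead instance. Size your retry budgets (covered in Lesson 6) accordingly.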
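
For comparing candidate settings, the formula can be wrapped in a small helper. The function name and the second configuration in the loop are hypothetical; the first configuration is the one used in the text.

```python
def worst_case_detection_s(interval_s, timeout_s, failure_threshold):
    """Worst case: the instance fails immediately after a successful probe.

    The failing probes then start at 1, 2, ..., failure_threshold intervals
    after that last success, and the final one must wait out its full timeout.
    """
    return interval_s * failure_threshold + timeout_s


# Compare the lesson's configuration with a tighter, hypothetical one.
for interval_s, timeout_s, threshold in [(10, 5, 3), (5, 2, 2)]:
    latency = worst_case_detection_s(interval_s, timeout_s, threshold)
    print(f"interval={interval_s}s timeout={timeout_s}s threshold={threshold}: {latency}s")
```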

Gossip and SWIM: Heartbeats at Cluster Scale

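When you have hundreds or thousands of nodes, polling every node from a central controller creates a single point of failure and a fan-out bottleneck. Gossip-based systems avoid this by having nodes monitor one another. Consul, for example, builds on the SWIM protocol (Scalable Weakly-consistent Infection-style Process Group Membership), while Cassandra and Akka Cluster pair gossip with heartbeat-based accrual failure detectors. A SWIM probe round works as follows: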

Each node periodically picks a random peer and sends it a ping.
If the peer does not respond within a deadline, the node picks k other random nodes and sends each a ping-req, asking it to ping the unresponsive peer on its behalf (indirect probe).
Only if all indirect probes also fail is the peer marked as suspect and, if nothing refutes the suspicion within a further grace period, as dead.
Membership updates propagate via gossip — piggybacked on regular messages — so the whole cluster converges in O(log N) rounds.

This makes failure detection bandwidth constant per node regardless of cluster size — a critical property for systems with thousands of nodes.

SWIM indirect probe: Node A asks helpers C and D to probe the suspect Node B. Only when all paths fail is B marked as suspect and, after the grace period, declared dead.
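
The direct and indirect probe steps can be sketched as a single function. This is a simplified model, not the full protocol: the can_reach callback stands in for "a ping from src to dst is acknowledged", the function and parameter names are hypothetical, and the suspect-to-dead grace period and gossip dissemination are omitted.

```python
import random


def probe_round(prober, target, members, can_reach, k=3, rng=random):
    """One SWIM-style probe of `target` by `prober`.

    Returns "alive" if the direct ping or any indirect probe succeeds,
    otherwise "suspect".
    """
    if can_reach(prober, target):  # direct ping
        return "alive"
    helpers = [m for m in members if m not in (prober, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # ping-req: the helper must hear the request and must reach the target
        if can_reach(prober, helper) and can_reach(helper, target):
            return "alive"
    return "suspect"
```

If only the link between the prober and the target is broken, any helper that can still reach the target clears it, which is how the indirect step filters out partial network failures.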

Operational Considerations

Health check endpoints need access control because they expose internal topology and dependency status. Use network-layer restrictions (VPC only, private subnet) or a shared secret header. Never expose a detailed health endpoint (for example, /health/detail) to the public internet.

Distinguish between deep and shallow checks. A shallow check verifies the process is running and can accept connections. A deep check verifies it can actually do useful work (query the DB, read from the queue). Use shallow for liveness, deep for readiness. Deep checks on every probe interval can overload your database with synthetic traffic.
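
One common way to limit the synthetic load from deep checks is to cache each result for a short time, so that many probes share one real query. The sketch below assumes a hypothetical CachedCheck wrapper and a 10-second default TTL; both are illustrative choices.

```python
import time


class CachedCheck:
    """Wrap a deep check so it runs at most once per `ttl_s` seconds."""

    def __init__(self, check, ttl_s=10.0, clock=time.monotonic):
        self.check = check      # zero-argument callable returning a truthy value when healthy
        self.ttl_s = ttl_s
        self.clock = clock
        self._expires_at = None
        self._result = None

    def __call__(self):
        now = self.clock()
        if self._expires_at is None or now >= self._expires_at:
            try:
                self._result = bool(self.check())
            except Exception:
                self._result = False  # a raising check counts as a failure
            self._expires_at = now + self.ttl_s
        return self._result
```

The trade-off is that the TTL adds to detection latency, since a failure can go unnoticed until the cached result expires. The sketch is also not thread-safe; concurrent probes would need a lock around the refresh.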

Log health check failures at a separate severity level. A single transient failure is noise; two consecutive failures of the same component should page an engineer. Aggregate failures across the fleet — if 40% of instances fail their DB readiness check simultaneously, that is a database problem, not 40 separate instance problems.
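
Fleet-level aggregation can be as simple as counting how many instances fail the same check. In this sketch the function name, the report format, and the "dependency" and "instance" labels are assumptions; the 40% threshold echoes the example above.

```python
from collections import Counter


def classify_failures(reports, fleet_threshold=0.4):
    """Group failing checks by how widespread they are.

    reports: instance id -> set of failing check names
    Returns check name -> "dependency" if at least `fleet_threshold` of all
    instances fail that check, otherwise "instance".
    """
    total = len(reports)
    counts = Counter(name for failing in reports.values() for name in failing)
    return {
        name: "dependency" if count / total >= fleet_threshold else "instance"
        for name, count in counts.items()
    }
```

A "dependency" result suggests raising one alert against the shared component, while an "instance" result points at the individual hosts that reported it.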

Synthetic canaries: rather than relying solely on endpoint-level health checks, schedule synthetic transactions that exercise full user flows (login, fetch data, write a record) against production or a shadow environment every minute. These catch logical failures, such as returning wrong data, that a simple HTTP 200 response will never reveal.
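
A canary run can be modelled as an ordered list of steps, each compared against an expected value. The run_canary function and the step format below are hypothetical; in practice each step's callable would invoke a real client for the flow being tested.

```python
def run_canary(steps):
    """Run a synthetic transaction and stop at the first failing step.

    steps: ordered list of (name, zero-argument callable, expected value)
    """
    for name, action, expected in steps:
        try:
            actual = action()
        except Exception as exc:
            return {"ok": False, "failed_step": name, "reason": f"raised {exc!r}"}
        if actual != expected:
            # The call succeeded but returned wrong data: a logical failure
            # that a status-code-only health check would not notice.
            reason = f"expected {expected!r}, got {actual!r}"
            return {"ok": False, "failed_step": name, "reason": reason}
    return {"ok": True, "failed_step": None, "reason": None}
```

Reporting the failed step by name lets the alert say which part of the user flow broke instead of only that the canary failed.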

Health checks and heartbeats are the nervous system of a resilient distributed architecture. Without them, you are flying blind; with well-tuned probes, your system can detect and route around failures in seconds rather than minutes.