Kubernetes Workloads & Configuration

Probes: Liveness, Readiness & Startup

Kubernetes cannot read your application's mind. It knows a container is "running" the moment the process starts — but "running" and "healthy" are very different things. A Java service might take 45 seconds to warm its caches. A Go binary might deadlock without crashing. A web server might start accepting traffic before the database connection pool is ready. Probes are the mechanism Kubernetes uses to distinguish a healthy, ready pod from one that is broken, initialising, or temporarily overloaded.

Getting probes wrong is one of the most common causes of production incidents in Kubernetes. Either they are too aggressive (causing healthy pods to be killed in a traffic spike) or too lenient (routing traffic to a pod that cannot serve requests). This lesson covers the three probe types, their failure modes, and what big-tech SRE teams actually configure.

The Three Probe Types

Liveness probe — answers: "Is this container still alive?" If it fails, kubelet kills the container and restarts it according to the pod's restartPolicy.
Readiness probe — answers: "Is this container ready to receive traffic?" If it fails, the pod is removed from the endpoints of every matching Service (traffic stops flowing to it) but it is not restarted.
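Startup probe — answers: "Has this container finished its slow startup?" Until it succeeds, liveness and readiness probes are suspended. It runs every periodSeconds from container start. Once it succeeds it never runs again for that container; if it exhausts its failureThreshold first, kubelet kills the container and applies the pod's restartPolicy.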

Key insight: Readiness controls traffic; liveness controls restarts. They are orthogonal. A pod can be alive-but-not-ready (warming up, draining, temporarily overloaded) without being killed. Conflating the two is the single biggest probe mistake in production.

Probe Mechanisms

All three probe types support four check mechanisms:

httpGet — kubelet sends an HTTP GET. Any 2xx or 3xx is success. Use this for HTTP services; it also tests the HTTP stack itself.
tcpSocket — kubelet opens a TCP connection. Success = port open. Use for non-HTTP protocols (gRPC, Redis, Postgres).
exec — kubelet runs a command inside the container. Exit code 0 = success. Use when you need application-level checks (e.g. a Redis PING via redis-cli). Avoid for high-frequency probes: exec forks a new process every tick.
grpc (k8s ≥ 1.24) — kubelet calls the gRPC health checking protocol. Ideal for gRPC-native services.
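
The four mechanisms above can be sketched side by side in one illustrative Pod. The container names, images, paths and ports here are placeholders chosen for the example, not values from a real system; only the probe field names (httpGet, tcpSocket, exec, grpc) come from the Kubernetes API.

```yaml
# Illustrative only: every name, image, path and port is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: probe-mechanisms-demo
spec:
  containers:
  - name: web
    image: example/web:1.0
    readinessProbe:
      httpGet:              # success on any 2xx or 3xx response
        path: /healthz
        port: 8080
  - name: database
    image: example/database:1.0
    readinessProbe:
      tcpSocket:            # success as soon as the TCP connection opens
        port: 5432
  - name: cache
    image: example/cache:1.0
    livenessProbe:
      exec:                 # success when the command exits with code 0
        command: ["redis-cli", "ping"]
      periodSeconds: 20     # kept infrequent: each check forks a process
  - name: rpc
    image: example/rpc:1.0
    readinessProbe:
      grpc:                 # uses the gRPC health checking protocol
        port: 9090
```

Note that tcpSocket only proves the port is open, so it is the weakest of the four checks; where the application can answer a real request, httpGet or grpc gives a stronger signal.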

Tuning Parameters That Matter

initialDelaySeconds — how long to wait after container start before the first check. Default 0. Without a startup probe, this delay is the only thing holding liveness checks back while a slow application boots.
periodSeconds — how often to check. Default 10s.
timeoutSeconds — how long kubelet waits for a response. Default 1s — dangerously short for anything hitting a database.
failureThreshold — consecutive failures before the probe is considered failed. Default 3.
successThreshold — consecutive successes to flip from failed → success. Must be 1 for liveness and startup.

While the startup probe is active, liveness and readiness probes are suspended. Once startup passes, both activate independently.
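
As a reference point for the tuning parameters above, here is a readiness probe with every field written out at its Kubernetes default. The /ready path and port 8080 are assumptions for the example; omitting the five tuning fields entirely would produce the same behaviour.

```yaml
readinessProbe:
  httpGet:
    path: /ready            # hypothetical endpoint
    port: 8080              # hypothetical port
  initialDelaySeconds: 0    # default: start probing immediately
  periodSeconds: 10         # default: one check every 10 s
  timeoutSeconds: 1         # default: a reply slower than 1 s counts as a failure
  failureThreshold: 3       # default: 3 consecutive failures mark the probe failed
  successThreshold: 1       # default: 1 success marks it passing again
```

With these defaults a pod that stops answering is pulled from traffic after roughly failureThreshold × periodSeconds, about 30 s. Spelling the values out in manifests makes that budget visible in code review instead of leaving it implicit.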

A Production-Grade Manifest

Below is a realistic Deployment manifest for a Spring Boot service with a 40-second warm-up. It uses all three probe types correctly:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
      - name: orders-api
        image: myregistry/orders-api:v2.4.1
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          # Allow up to 60 s for startup (6 * 10s periods)
          failureThreshold: 6
          periodSeconds: 10
          timeoutSeconds: 3
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3   # 45 s of consecutive failure before restart
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 2   # 20 s before pod is pulled from Service endpoints
          successThreshold: 1

Separate liveness and readiness endpoints. Spring Boot Actuator, ASP.NET health checks, and most frameworks expose distinct paths for this reason. The liveness path should check only that the process is not deadlocked (no DB calls). The readiness path should check downstream dependencies (DB, cache, external APIs) — it is acceptable for it to fail under load so the load balancer can shed traffic to healthy peers.
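
For the Spring Boot case, the split between the two endpoints can be sketched in application.yaml. This assumes Spring Boot 2.3 or later with Actuator on the classpath and a JDBC DataSource (which supplies the db health indicator); whether orders-api is configured this way is an assumption, not something the manifest states.

```yaml
# application.yaml (sketch; adjust the readiness group to your real dependencies)
management:
  endpoint:
    health:
      probes:
        enabled: true                   # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        liveness:
          include: livenessState        # internal application state only, no DB call
        readiness:
          include: readinessState,db    # also require the DataSource health check
```

Keeping db out of the liveness group is the important design choice: a slow database can then take the pod out of rotation without triggering a restart.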

The Startup Probe: Why It Exists

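Before startup probes existed (k8s < 1.16), teams used a large initialDelaySeconds on the liveness probe to cover slow startup. The problem: the delay is a fixed blind window sized for the slowest start ever observed. No liveness check runs until it expires, so a container that deadlocks in the meantime, even one that finished starting early, goes undetected for the rest of the delay. Startup probes solve this cleanly: they grant a generous startup window, hand over to the liveness probe as soon as the application reports healthy, and leave the liveness probe free to use tight settings that detect post-startup deadlocks promptly.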

The formula is: max startup time = failureThreshold × periodSeconds. Set this to 150–200% of your measured P99 cold-start time. For the manifest above: 6 × 10s = 60s.
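
A second worked example of the formula, using a hypothetical service whose measured P99 cold start is 90 s. The 150–200% rule gives a target window of 135 s to 180 s; a 5 s period with a threshold of 30 lands at 150 s, and the shorter period lets a fast start hand over to liveness and readiness sooner.

```yaml
# Sketch for a hypothetical service with a measured P99 cold start of 90 s.
startupProbe:
  httpGet:
    path: /actuator/health/liveness   # same liveness path as the manifest above
    port: 8080
  periodSeconds: 5
  failureThreshold: 30                # 30 x 5 s = 150 s startup budget
  timeoutSeconds: 3
```

If the container has not passed the probe by the end of the 150 s budget, kubelet kills it and the restartPolicy applies, so the budget is also the upper bound on how long a container stuck in startup is tolerated.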

Common Production Failure Modes

Liveness probe hitting the database — if the DB is slow, the probe times out, kubelet kills healthy pods, and you get a crash loop that amplifies the DB load. Liveness must check only the process itself.
timeoutSeconds: 1 (the default) on a readiness probe — an endpoint with a 200ms P99 can occasionally take 2s under GC pressure, and each timeout counts as a failure. With failureThreshold: 3 and the default periodSeconds: 10, three slow responses in a row (roughly 20 to 30 seconds of degraded latency) are enough to start shedding the pod. Set timeoutSeconds to at least your P99 × 3.
No startup probe + small initialDelaySeconds on a slow JVM — kubelet fires liveness before the JVM finishes loading, sees a failure, and kills the container. The pod enters a crash loop and never starts. Adding a startup probe with an adequate failureThreshold is the fix.
Readiness endpoint that never fails — a readiness probe that always returns 200 even under overload provides no protection; traffic continues routing to a saturated pod. Implement back-pressure logic (e.g. return 503 when the request queue depth exceeds a threshold).

Never use the same endpoint for liveness and readiness unless you truly understand the consequences. If the endpoint checks the database and the database is slow, your readiness probe (correctly) pulls the pod from traffic. But your liveness probe also fails — and after failureThreshold failures it kills a pod that is perfectly healthy, just waiting for a downstream dependency. The database outage now also causes a pod restart storm.
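
The shared-endpoint trap looks harmless in a manifest, which is why it survives code review. The fragment below is an anti-pattern shown for recognition only; the /health path is a hypothetical endpoint assumed to query the database.

```yaml
# ANTI-PATTERN, do not copy: both probes hit one dependency-checking endpoint.
livenessProbe:
  httpGet:
    path: /health           # hypothetical endpoint that queries the database
    port: 8080
  periodSeconds: 10
  failureThreshold: 3       # about 30 s of slow DB, then the container is restarted
readinessProbe:
  httpGet:
    path: /health           # same endpoint: removal from traffic is the correct reaction
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

The fix is the pattern used in the orders-api Deployment earlier in this lesson: point livenessProbe at a path that checks only the process, and keep dependency checks on the readiness path.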

Inspecting Probe Status

Use kubectl describe pod <name> and look at the Events section and the Containers block. Probe failures appear as Warning events with reason Unhealthy. The kubectl get pod READY column (e.g. 0/1) reflects readiness probe state.

# Recent probe events for a crashing pod (a snapshot, not a live stream)
kubectl describe pod orders-api-7d9f8c-xqr2k | grep -A 20 Events

# Watch pod readiness transitions in real time
kubectl get pod -l app=orders-api -w

# Check probe config as seen by the API server
kubectl get pod orders-api-7d9f8c-xqr2k \
  -o jsonpath='{.spec.containers[0].livenessProbe}'

Exec Probe for Non-HTTP Services

For a Redis sidecar or a database pod, an exec probe is idiomatic:

livenessProbe:
  exec:
    command:
    - redis-cli
    - ping
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5

At big-tech scale, probe configuration is typically encoded as Helm values and enforced via OPA/Kyverno admission policies that reject Deployments with no readiness probe, timeoutSeconds < 2, or liveness probes that query external dependencies. Probes are treated as a reliability contract, not an afterthought.
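
As a rough illustration of such an admission policy, here is a minimal Kyverno sketch covering the first two rules (a readiness probe must exist, and its timeoutSeconds must be at least 2). The policy and rule names are hypothetical, and field names such as validationFailureAction have changed between Kyverno releases, so treat this as a starting point to check against the version you run rather than a drop-in policy.

```yaml
# Sketch only: verify field names against your Kyverno version before use.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-readiness-probe        # hypothetical policy name
spec:
  validationFailureAction: Enforce     # reject the request instead of only auditing it
  background: false
  rules:
  - name: readiness-probe-with-sane-timeout
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "Every container needs a readinessProbe with timeoutSeconds >= 2."
      pattern:
        spec:
          template:
            spec:
              containers:
              # The pattern is applied to every container in the list.
              - readinessProbe:
                  timeoutSeconds: ">=2"
```

The third rule, rejecting liveness probes that query external dependencies, cannot be read off the manifest alone; in practice it tends to be approximated by conventions such as requiring distinct liveness and readiness paths.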