Chaos Engineering & Resilience

Failure Modes of Distributed Systems

18 min Lesson 3 of 27

Failure Modes of Distributed Systems

Before you can break things on purpose, you must know how things actually break. Distributed systems fail in ways that are qualitatively different from single-process programs. A thread crash in a monolith is obvious — the process dies, an alert fires, the root cause is in one stack trace. In a distributed system, nodes fail independently, network links partition silently, clocks diverge, and the resulting failure surfaces only as mysteriously elevated latency or subtly wrong data — often hours after the root cause occurred. Understanding failure modes is the prerequisite to designing chaos experiments that expose real risk instead of theater.

The canonical mental model: the Fallacies of Distributed Computing. In 1994, Peter Deutsch codified eight assumptions that engineers routinely make about networks — assumptions that are always false in production. Every major distributed-system failure class traces back to one or more of these fallacies. Internalizing them is the first intellectual step of chaos engineering.

The Eight Fallacies and What They Cost You

Each fallacy is an assumption that feels reasonable during local development and causes catastrophic surprises in production.

The network is reliable. Packets are dropped, NICs fail, switches reboot, and cloud hypervisors migrate VMs mid-request. TCP retransmits hide many of these events, but retransmits add latency — a 5 ms call becomes a 2 s call under congestion. Every RPC call must tolerate failure and retry with exponential backoff plus jitter.
Latency is zero. A service that calls a database, a cache, and two downstream APIs on every request accumulates latency: 20 ms + 35 ms + 15 ms + 40 ms = 110 ms minimum, before GC pauses and tail latency. Measure and budget every cross-process call; never assume it is negligible.
Bandwidth is infinite. Fetching 50 MB of configuration on every startup, or fanning out a single request to 1,000 shards, is fine in a benchmark. In production during a rolling restart, it saturates uplinks, triggers congestion control, and causes cascading timeouts. Back-pressure, pagination, and streaming are production requirements, not optimizations.
The network is secure. In a microservices mesh, services communicate over the same network segment as external traffic unless mTLS is enforced. An attacker who compromises one pod can move laterally without network policy. Security must be designed in — it cannot be assumed from the underlying substrate.
Topology does not change. IP addresses change. Pods are rescheduled. Instances are replaced. A service that hard-codes a peer IP breaks silently the moment Kubernetes reschedules that peer. Use DNS or a service mesh — never embed addresses in config files.
There is one administrator. In a microservices organization, 30 teams own 30 services with different deployment cadences and incident procedures. Assuming a single authority can coordinate a multi-team incident in real time is a fantasy. Design for autonomous runbooks and automated circuit breakers that do not require human coordination.
Transport cost is zero. JSON serialization, TLS handshakes, and connection setup consume CPU and wall-clock time. A service making 10,000 small RPC calls per second can spend a significant fraction of its CPU on serialization alone. Batching, connection pooling, and binary protocols (gRPC/Protobuf) directly address this.
The network is homogeneous. A request path typically crosses the Kubernetes overlay, the cloud VPC, possibly a VPN to another region, and the client's ISP. MTU differences cause fragmentation; each segment has a different loss profile. A single average latency figure masks this heterogeneity completely.

The Major Failure Classes in Production

Operational experience at companies running thousands of services has crystallized the failure landscape into a handful of recurring classes. Each class has a canonical signature in your observability signals and a specific blast radius profile that chaos experiments must be designed to surface.

1. Crash Failures

A process or node stops responding entirely. This is the most benign failure class — it is detectable (health checks fail, connections are refused) and consistent (the crashed component does nothing). The danger is not the crash itself but callers that hang indefinitely waiting for a response that will never arrive. Aggressive connection timeouts on every outbound call are the only reliable defense. In Kubernetes, a pod OOM-kill is a crash failure — the container is replaced, but in-flight requests are dropped.

2. Omission Failures

A node receives a request but produces no response — it silently drops messages. This is indistinguishable from a crash from the caller's perspective, except the server is still alive and consuming resources. Causes include overwhelmed receive buffers, kernel TCP backlog saturation, and buggy async code that loses tasks when a channel is full. Omission failures are dangerous because the server may be partially processing requests, producing inconsistent state that no caller detects until business logic fails.

3. Byzantine Failures

A node responds with incorrect data — wrong results, corrupted payloads, or responses that appear syntactically valid but are semantically wrong. A Byzantine node might return stale data from a failed cache flush, silently truncate a write, or return HTTP 200 to an operation that never persisted. This is the hardest failure class to detect: health checks pass, error rates are nominal, p99 latency is normal. The signature appears only in business metrics — wrong account balances, missing order line items, duplicate notifications. End-to-end data consistency probes are the chaos experiment category that targets this class specifically.

4. Timing Failures

A node responds correctly but outside the expected time window — too slowly. Slow responses cause upstream timeouts that the caller treats as failures, triggering retries that amplify load on an already-struggling downstream — the classic death spiral. Timing failures are the most common class at scale and are the primary target of latency-injection chaos experiments. The failure is not that the node is wrong; it is that its response arrives after the caller has already given up and retried, now executing the operation twice.

5. Network Partition

The network splits: some nodes can reach each other but cannot reach others. The CAP theorem guarantees that under partition you must choose between consistency (refuse writes that cannot be fully replicated) and availability (accept writes that may diverge). Most production systems — Kafka in async mode, DynamoDB with eventual consistency, most databases with async replicas — choose availability and tolerate temporary divergence. Chaos experiments that simulate partitions reveal whether divergence is handled gracefully or produces split-brain data corruption.

Five failure classes at different nodes: A is healthy; B crashes (no reply); C omits (alive but drops messages); D times out (correct data, too slow); E is Byzantine (responds with wrong data). A partition line can cut access to any subset.

The Cascading Failure Pattern

In production, single-node failures rarely stay contained. The pattern that takes down services at scale is the cascading failure: a slow downstream causes upstream threads to block, exhausting the thread pool, causing a queue backup, triggering memory pressure, triggering GC pauses, making the upstream node itself slow — which causes its upstream to block. The failure propagates up the entire call stack. Netflix analyzed this pattern after their 2011 Christmas Eve outage and built Hystrix to address it; the Resilience4j patterns are the current-generation solution.

# Detect a cascade in progress with correlated Prometheus queries.
# All three should fire simultaneously — db root, then payment, then gateway.

# Step 1: slow DB (the root cause)
histogram_quantile(0.99,
  rate(db_query_duration_seconds_bucket{
    service="payment-db"
  }[5m])
) > 0.5

# Step 2: payment-service p99 rising because of the slow db
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{
    service="payment-service"
  }[5m])
) > 2.0

# Step 3: api-gateway degrading because of the slow payment-service
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{
    service="api-gateway",
    upstream="payment-service"
  }[5m])
) > 3.0

# When all three fire simultaneously and in ascending-latency order,
# you have a cascade: fix the db — the rest recovers automatically
# once thread pools drain.

Resource Exhaustion and the Thundering Herd

Two additional failure modes deserve attention because they are reliably triggered by chaos experiments and by real incidents alike.

Resource exhaustion occurs when a shared resource — thread pool, connection pool, file descriptor limit, heap — is consumed entirely. The first sign is increased latency (queuing for the resource), not errors. By the time errors appear, the exhaustion has been building for minutes. ulimit -n, JVM heap, and database connection pool sizes are the most common exhaustion points in production microservices.

The thundering herd occurs when a large number of clients all retry at the same instant — typically after a cache miss, a deployment restart, or a brief network hiccup that caused synchronized timeouts. Without jitter in retry logic, thousands of clients hammer the recovering service simultaneously, pushing it back into failure. This is why every retry implementation must include randomized jitter, not just exponential backoff alone.

# Simulate thundering herd detection: high concurrency spike after a restart.
# This PromQL alert fires when request rate spikes more than 3x the baseline
# in a 1-minute window — the signature of synchronized retries.

(
  rate(http_requests_total{service="checkout-service"}[1m])
  /
  rate(http_requests_total{service="checkout-service"}[10m] offset 1m)
) > 3.0

Why Failure Mode Taxonomy Matters for Chaos Design

Matching your chaos experiment to the right failure class is what separates a signal-generating experiment from noise. Killing a pod tests crash-failure resilience. Adding 200 ms of latency to a downstream call tests timing-failure propagation. Corrupting a subset of API responses tests Byzantine detection. Dropping packets between two availability zones tests partition handling.

Without this taxonomy, teams default to the easiest experiment — pod kills — and spend months building confidence in crash resilience while their system has undetected Byzantine bugs and no jitter in retries. The fallacies are the theoretical foundation; the failure classes are the targeting system for where your experiments should point.

Pro practice — map failure classes to your service inventory before planning experiments. For each critical service, identify which failure class it is most vulnerable to, not just the easiest to inject. A stateful service with replicated data (database, Kafka, etcd) is most at risk from Byzantine failures and partition split-brain. A high-fan-out service (API gateway calling 20 microservices) is most at risk from cascading timing failures. A session-heavy service is most at risk from thundering herds on restart. Target your highest-risk class first.

Production pitfall — confusing availability with correctness. SLOs track request success rate and latency, but Byzantine failures leave both metrics untouched while silently corrupting data. A system can be fully green in Prometheus and still be serving wrong account balances. This is why production-grade chaos programs include correctness probes — synthetic transactions that write a known value and read it back, asserting the returned value is exactly what was written. If the probe fails while all SLIs look healthy, you have a Byzantine failure in progress.