Chaos Experiments: Infrastructure
Infrastructure chaos experiments probe the failure modes that matter most at scale: the sudden disappearance of compute, the saturation of a resource, and the severing of a network path. These are not theoretical edge cases; they are the kinds of failures behind major outages at large cloud providers such as AWS and Google. This lesson teaches you how to design, execute, and learn from these experiments in production without creating the incidents you are trying to prevent.
Experiment 1: Instance and Zone Kills
The most fundamental infrastructure chaos experiment is terminating a running instance or taking an entire availability zone offline. This validates a core assumption: your system can lose a compute unit without user impact. If it cannot, you have discovered a single point of failure before your cloud provider does it for you.
Teams that practice chaos engineering at scale often run instance termination continuously rather than as a one-off exercise. Netflix's original Chaos Monkey does exactly this: it randomly terminates EC2 instances during business hours. The rationale is that if you only run chaos at night, you get complacent; you want your on-call engineers present and able to respond while the system is under fire.
Designing the experiment:
- Steady state: p99 latency < 200ms, error rate < 0.1%, all health checks green.
- Hypothesis: Terminating one instance in an Auto Scaling Group does not degrade user-facing latency beyond 250ms p99 for more than 60 seconds.
- Blast radius: One instance (1 of N in the ASG). Never kill > 33% of a tier simultaneously on first run.
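- Rollback: A terminated instance cannot be restarted, so rollback means restoring capacity: let the Auto Scaling Group launch a replacement (or raise desired capacity manually), and restore from snapshot if instance-local state was lost.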
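The hypothesis above can be checked mechanically once the experiment has run. The following is a minimal sketch, assuming p99 latency is exported as (timestamp, value) samples; the function names and the sample format are illustrative, not part of any particular monitoring tool.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, p99 latency in ms)


def longest_breach_seconds(samples: List[Sample], threshold_ms: float) -> float:
    """Longest continuous span during which p99 stayed above threshold_ms.

    Samples must be sorted by time. The span is measured from the first to
    the last breaching sample, so it can undercount by up to one sampling
    interval.
    """
    longest = 0.0
    breach_start = None
    for ts, p99 in samples:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = ts
            longest = max(longest, ts - breach_start)
        else:
            breach_start = None
    return longest


def hypothesis_holds(samples: List[Sample],
                     threshold_ms: float = 250.0,
                     max_breach_s: float = 60.0) -> bool:
    """True if p99 never exceeded threshold_ms for longer than max_breach_s."""
    return longest_breach_seconds(samples, threshold_ms) <= max_breach_s


# Illustrative data: a 45-second excursion above 250 ms during failover.
failover = [(0, 180), (15, 240), (30, 310), (45, 290),
            (60, 270), (75, 260), (90, 190)]
print(hypothesis_holds(failover))
```

Encoding the hypothesis as a function makes the pass/fail decision reproducible across runs instead of relying on someone reading a dashboard.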
Using AWS Fault Injection Service (FIS), you can terminate an instance via a structured experiment document rather than a raw API call. This gives you audit trails, rollback actions, and stop conditions:
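As a rough illustration, an experiment template for a single-instance kill might be assembled like this. The tag key, the ARNs, and the function name are placeholders assumed for the example; only the overall template shape (targets, actions, stopConditions, roleArn) follows the FIS template format.

```python
import json


def build_instance_kill_template(role_arn: str, alarm_arn: str,
                                 tag_key: str = "chaos-ready",
                                 tag_value: str = "true") -> dict:
    """Build an FIS experiment template that terminates one tagged instance."""
    return {
        "description": "Terminate one running instance in the target tier",
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                # Hypothetical opt-in tag: only tagged instances are eligible.
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                # Blast radius: exactly one instance.
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "terminateOne": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
        # FIS stops the experiment if this CloudWatch alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }


template = build_instance_kill_template(
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:p99-latency-slo",  # placeholder
)
print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and registered with the AWS CLI, for example via aws fis create-experiment-template --cli-input-json file://template.json, after replacing the placeholder ARNs with real ones.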
The stopConditions block is critical: if your CloudWatch alarm fires (meaning the system is actually degraded), FIS halts the experiment automatically. This is your automated circuit breaker — never run infrastructure chaos without a stop condition tied to a real SLO alarm.
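Zone-level experiments are a step beyond instance kills. Simulating an AZ outage means blocking all traffic between your application tier and the target AZ's subnets. On Kubernetes, one approach is a NetworkPolicy whose egress allow-list excludes the CIDR ranges of the failing AZ (NetworkPolicy rules are allow-lists, so the exclusion is expressed with ipBlock and except rather than an explicit deny). AWS FIS also provides a native aws:network:disrupt-connectivity action. Rather than injecting blackhole routes, it denies traffic to the targeted subnets by temporarily swapping in a restrictive network ACL and restoring the original when the action ends, so you do not need to hand-craft iptables rules.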
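A sketch of what a zone-disruption template could look like follows. It assumes the subnets of the target AZ carry a hypothetical chaos-az tag; the ARNs, tag, and function name are placeholders, and the five-minute duration is an arbitrary starting point.

```python
import json


def build_az_disrupt_template(role_arn: str, alarm_arn: str,
                              az_tag_value: str,
                              duration: str = "PT5M") -> dict:
    """Build an FIS template that cuts connectivity for one AZ's subnets."""
    return {
        "description": f"Disrupt connectivity for subnets tagged {az_tag_value}",
        "targets": {
            "azSubnets": {
                "resourceType": "aws:ec2:subnet",
                # Hypothetical tag applied to every subnet in the target AZ.
                "resourceTags": {"chaos-az": az_tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "disruptAz": {
                "actionId": "aws:network:disrupt-connectivity",
                # duration is ISO 8601; scope "all" covers all traffic
                # entering and leaving the targeted subnets.
                "parameters": {"duration": duration, "scope": "all"},
                "targets": {"Subnets": "azSubnets"},
            }
        },
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }


az_template = build_az_disrupt_template(
    role_arn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:error-rate-slo",  # placeholder
    az_tag_value="us-east-1a",
)
print(json.dumps(az_template, indent=2))
```

Because the action has a fixed duration, connectivity is restored automatically when the window ends, which bounds the experiment even if nobody is watching.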
Without topologySpreadConstraints in your pod spec, the Kubernetes scheduler can place all replicas in one AZ, making a zone kill catastrophic. Spread workloads across zones before you run zone-kill experiments; the experiment then tells you whether the spread is actually happening.
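Before running a zone kill, it can help to verify the spread from the outside. The sketch below assumes you have already collected the zone of each replica (for example from the topology.kubernetes.io/zone label of the node it runs on); the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List


def zone_skew(pod_zones: List[str], all_zones: Iterable[str]) -> int:
    """Difference between the most and least populated zones.

    Zones with no replicas count as zero, which is the case that makes a
    zone kill dangerous.
    """
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


def survives_zone_loss(pod_zones: List[str], min_replicas: int) -> bool:
    """True if losing any single zone still leaves at least min_replicas."""
    if not pod_zones:
        return min_replicas <= 0
    counts = Counter(pod_zones)
    total = len(pod_zones)
    return all(total - lost >= min_replicas for lost in counts.values())


zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
packed = ["us-east-1a", "us-east-1a", "us-east-1a"]
spread = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(zone_skew(packed, zones), survives_zone_loss(packed, 1))
print(zone_skew(spread, zones), survives_zone_loss(spread, 2))
```

A skew of zero or one with survives_zone_loss returning True is a reasonable precondition for the experiment; anything else is a finding in its own right.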
Experiment 2: Resource Exhaustion
Resource exhaustion experiments answer the question: what happens when a host runs out of CPU, memory, or disk? These failures are insidious because they degrade gradually, cause cascading effects across services on the same host, and often trigger failure modes (OOM kills, thrashing, clock skew) that are hard to reproduce any other way.
CPU exhaustion: stress-ng is a widely used tool for generating controlled CPU load. Pair it with a time limit (its --timeout option) so it cannot run indefinitely.
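One way to keep the time limit mandatory is to build the command line in code. This is a small sketch: the helper name is an assumption, and it only constructs the argument list rather than running anything.

```python
import shlex
from typing import List


def stress_ng_cpu_argv(workers: int, load_percent: int, timeout_s: int) -> List[str]:
    """Build a stress-ng invocation that always carries a timeout.

    workers: number of CPU stressors (stress-ng treats 0 as "one per CPU").
    load_percent: target load per stressor, 1 to 100.
    timeout_s: hard stop in seconds; refusing zero keeps runs bounded.
    """
    if workers < 0:
        raise ValueError("workers must be >= 0")
    if not 1 <= load_percent <= 100:
        raise ValueError("load_percent must be between 1 and 100")
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return [
        "stress-ng",
        "--cpu", str(workers),
        "--cpu-load", str(load_percent),
        "--timeout", f"{timeout_s}s",
        "--metrics-brief",
    ]


argv = stress_ng_cpu_argv(workers=4, load_percent=80, timeout_s=60)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on the target host. Starting below 100% load and stepping up across runs makes it easier to see where latency begins to degrade.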
Memory pressure — forcing a process toward the OOM boundary reveals how your application behaves under memory contention. On Linux, the kernel OOM killer will terminate the process with the highest OOM score. Understanding which process gets killed first is critical for systems with multiple containers on a node:
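To predict which process is likely to go first, you can read the kernel's per-process scores before applying pressure. The sketch below reads /proc/<pid>/oom_score and /proc/<pid>/comm on Linux; the function name is illustrative, and inside containers with memory limits the kill decision is made within the affected cgroup, so treat the host-wide ranking as a hint only.

```python
import os
from typing import List, Tuple


def rank_oom_candidates(proc_root: str = "/proc", top: int = 5) -> List[Tuple[int, int, str]]:
    """Return up to `top` (oom_score, pid, name) tuples, highest score first."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited or file unreadable; ignore it
        ranked.append((score, int(entry), name))
    # Highest score first; ties broken by lowest pid for stable output.
    ranked.sort(key=lambda row: (-row[0], row[1]))
    return ranked[:top]


if os.path.isdir("/proc"):
    for score, pid, name in rank_oom_candidates():
        print(f"{score:5d}  pid={pid:<8d} {name}")
```

Running this before and during a memory-pressure experiment shows whether the process you expect to be sacrificed is actually the one the kernel favors.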
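Disk exhaustion: filling a filesystem is the simplest way to reproduce failed writes, cascading log pipeline failures, and the silent data corruption that follows when applications ignore write errors. Never fill the root or data volume; run the experiment against a dedicated, size-capped scratch filesystem such as a loopback-mounted image. A tmpfs mount with an explicit size limit also works, but tmpfs is backed by memory, so filling it creates memory pressure rather than disk pressure.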
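A bounded fill can be sketched as follows. The file name, function names, and limits are assumptions for illustration; the key idea is that the writer stops at either a free-space floor or a hard byte cap, whichever comes first, and that cleanup is a single file removal.

```python
import os
import shutil
from typing import Tuple


def fill_disk(scratch_dir: str, target_free_bytes: int,
              max_write_bytes: int, chunk: int = 1 << 20) -> Tuple[str, int]:
    """Write filler data into scratch_dir until a limit is reached.

    Stops when free space drops to target_free_bytes or when
    max_write_bytes have been written. Returns (file path, bytes written).
    Note: filesystems with transparent compression may store zeros compactly.
    """
    path = os.path.join(scratch_dir, "chaos-fill.bin")
    block = b"\0" * chunk
    written = 0
    with open(path, "wb") as f:
        while written < max_write_bytes:
            free = shutil.disk_usage(scratch_dir).free
            if free <= target_free_bytes:
                break
            n = min(chunk, max_write_bytes - written, free - target_free_bytes)
            f.write(block[:n])
            f.flush()
            os.fsync(f.fileno())  # make sure the space is really consumed
            written += n
    return path, written


def release_disk(path: str) -> None:
    """Rollback: delete the filler file to give the space back."""
    if os.path.exists(path):
        os.remove(path)
```

A typical run would call fill_disk on the scratch mount, observe how the application and log pipeline react, then call release_disk and confirm that writes recover without a restart.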
Experiment 3: Network Partitions
Network partitions are the most dangerous and most revealing infrastructure experiments. They expose split-brain conditions, inconsistent state, retry storms, and timeout configuration errors that no unit or integration test can find. The CAP theorem becomes viscerally real when you partition a database cluster in production and observe whether your application chooses consistency or availability.
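The standard tool for network fault injection on Linux is tc (traffic control) with the netem queueing discipline. netem can inject latency, jitter, packet loss, duplication, corruption, and reordering on a chosen interface, and 100% loss gives you a complete blackhole. Delays are specified in milliseconds or microseconds; actual precision depends on kernel timer resolution.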
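The sketch below builds matching add and delete commands for a netem fault so that the rollback is defined before the fault is applied. The helper names and the eth0 interface are assumptions; the functions only construct argument lists and do not execute tc.

```python
import shlex
from typing import List


def netem_add_argv(dev: str, delay_ms: int = 0, jitter_ms: int = 0,
                   loss_pct: float = 0.0) -> List[str]:
    """Build a `tc qdisc add ... netem` command for delay and/or loss."""
    argv = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms > 0:
        argv += ["delay", f"{delay_ms}ms"]
        if jitter_ms > 0:
            argv.append(f"{jitter_ms}ms")  # random variation around the delay
    if loss_pct > 0:
        if loss_pct > 100:
            raise ValueError("loss_pct cannot exceed 100")
        argv += ["loss", f"{loss_pct:g}%"]
    if len(argv) == 7:
        raise ValueError("specify at least one of delay_ms or loss_pct")
    return argv


def netem_del_argv(dev: str) -> List[str]:
    """Build the command that removes the netem qdisc (the rollback)."""
    return ["tc", "qdisc", "del", "dev", dev, "root", "netem"]


inject = netem_add_argv("eth0", delay_ms=100, jitter_ms=20, loss_pct=5)
rollback = netem_del_argv("eth0")
print(shlex.join(inject))
print(shlex.join(rollback))
```

A root netem qdisc shapes all egress traffic on the interface, including your own SSH session, so high loss values are best paired with a scheduled rollback (for example a background timer that runs the delete command) in case you lock yourself out.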
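At the Kubernetes level, use a NetworkPolicy to partition a set of pods or an entire namespace from its upstream dependency. This is easier to control than tc because it is declarative and reversible by deleting the policy object. It only takes effect if your cluster's network plugin enforces NetworkPolicy, so confirm that before trusting a result that shows no impact.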
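As an illustration, the manifest for a full egress partition can be generated as JSON, which kubectl apply -f accepts alongside YAML. The policy name, namespace, and labels below are hypothetical.

```python
import json
from typing import Dict


def deny_egress_policy(name: str, namespace: str,
                       pod_labels: Dict[str, str]) -> dict:
    """Build a NetworkPolicy that blocks all egress from the selected pods.

    Listing "Egress" in policyTypes with an empty egress rule list means no
    outbound traffic is allowed, including DNS lookups.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


policy = deny_egress_policy(
    name="chaos-partition-checkout",   # hypothetical policy name
    namespace="shop",                  # hypothetical namespace
    pod_labels={"app": "checkout"},    # hypothetical pod label
)
print(json.dumps(policy, indent=2))
```

Deleting the policy object ends the partition. Because this variant also blocks DNS, expect name-resolution errors in addition to connection timeouts; a narrower experiment would add an egress rule that still allows DNS.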
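After removing the partition, follow your application logs with kubectl logs -f and watch your metrics for the "recovery spike": if p99 latency climbs well above steady state during recovery, clients are likely retrying in lockstep, and your retry logic probably needs backoff and jitter.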
Measuring Experiment Outcomes
Every infrastructure chaos experiment must be tied to an observable steady-state signal. Without measurement, you are not running an experiment — you are just breaking things. The minimum set of metrics to instrument for any infrastructure experiment:
- Error rate: percentage of requests returning 5xx. Baseline must be established before the experiment.
- Latency percentiles: p50, p95, p99. Instance kills often show latency spikes during failover that are acceptable; zone kills must show full recovery within your SLO window.
- Saturation: CPU throttled time, memory page fault rate, disk iowait. These reveal whether resource exhaustion translated into application-layer impact.
- Retry and timeout rates: rising retry counts without rising error rates suggest your retry policy is absorbing the faults. Rising retries and rising errors together point to cascading failure.
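The retry and error rule in the last item can be written down as a small classifier. This is a sketch only: the 10% tolerance and the label strings are assumptions chosen for illustration, and the inputs are whatever rate your metrics system reports (for example per-minute counts).

```python
def classify_retry_signal(baseline_retries: float, observed_retries: float,
                          baseline_errors: float, observed_errors: float,
                          tolerance: float = 0.10) -> str:
    """Classify experiment impact from retry and error rates.

    A metric counts as "rising" when it exceeds its baseline by more than
    the tolerance fraction (an assumed threshold, tune it to your noise).
    """
    retries_up = observed_retries > baseline_retries * (1 + tolerance)
    errors_up = observed_errors > baseline_errors * (1 + tolerance)
    if retries_up and errors_up:
        return "cascading-failure-risk"    # retries are amplifying the fault
    if retries_up:
        return "retries-absorbing-faults"  # users are shielded so far
    if errors_up:
        return "errors-without-retries"    # failures are not being retried
    return "steady"


print(classify_retry_signal(100, 400, 5, 5))
print(classify_retry_signal(100, 400, 5, 60))
```

Evaluating this at a fixed interval during the experiment gives a simple signal that can feed the same alarm used as the stop condition.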
Summary
Infrastructure chaos experiments — instance kills, zone failures, resource exhaustion, and network partitions — each expose a distinct class of production failure mode. Instance kills validate your redundancy and auto-recovery configuration. Zone kills verify that your traffic distribution is truly multi-AZ. Resource exhaustion tests reveal whether your applications degrade gracefully or fail silently. Network partitions expose timeout, retry, and circuit-breaker behavior that no synthetic test can replicate. Run them in escalating order of blast radius, always with a stop condition, always with a documented hypothesis, and always with metrics open on your second screen.