Chaos Engineering & Resilience

Chaos Experiments: Infrastructure

Infrastructure chaos experiments probe the failure modes that matter most at scale: the sudden disappearance of compute, the saturation of a resource, and the severing of a network path. These are not theoretical edge cases. They are the kinds of failures behind major outages at AWS, Google, and other operators of large distributed systems. This lesson teaches you how to design, execute, and learn from these experiments in production without creating the incidents you are trying to prevent.

Experiment 1: Instance and Zone Kills

The most fundamental infrastructure chaos experiment is terminating a running instance or taking an entire availability zone offline. This validates a core assumption: your system can lose a compute unit without user impact. If it cannot, you have discovered a single point of failure before your cloud provider does it for you.

Some large engineering organizations run instance termination continuously. Netflix's original Chaos Monkey does exactly this: it randomly terminates EC2 instances during business hours. The rationale is that failures injected only at night breed complacency; running them during the day means engineers are present and able to respond while the system is under fire.

Designing the experiment:

Steady state: p99 latency < 200ms, error rate < 0.1%, all health checks green.
Hypothesis: Terminating one instance in an Auto Scaling Group does not degrade user-facing latency beyond 250ms p99 for more than 60 seconds.
Blast radius: One instance (1 of N in the ASG). Never kill > 33% of a tier simultaneously on first run.
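Rollback: A terminated instance cannot be revived, so rollback means replacing capacity. Let the ASG launch a replacement (or raise the desired capacity manually), and restore state from a snapshot if the instance held any.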
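
The blast-radius rule in this design can be checked mechanically before any instance is touched. The following is a minimal sketch and not part of FIS: the function name within_blast_radius and its arguments are hypothetical, and the 33% default simply mirrors the guideline stated in the design.

```shell
# Hypothetical pre-flight guard: refuse to proceed if the number of instances
# to kill exceeds a percentage cap of the tier (default 33%).

# within_blast_radius KILL_COUNT TIER_SIZE [MAX_PERCENT]
# Exit status 0 if KILL_COUNT / TIER_SIZE <= MAX_PERCENT, non-zero otherwise.
within_blast_radius() {
  kill_count="$1"
  tier_size="$2"
  max_percent="${3:-33}"

  # Reject empty tiers and non-positive kill counts outright.
  [ "$tier_size" -gt 0 ] && [ "$kill_count" -gt 0 ] || return 2

  # Integer comparison without division: kill/tier <= max/100.
  [ $(( kill_count * 100 )) -le $(( tier_size * max_percent )) ]
}

if within_blast_radius 1 6; then
  echo "ok: 1 of 6 instances is within the 33% cap"
fi
```

Under a strict reading, 1 of 3 instances is 33.3% and fails the check. Pass an explicit cap (for example, within_blast_radius 1 3 34) if a three-instance tier should qualify.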

Using AWS Fault Injection Service (FIS), you can terminate an instance via a structured experiment document rather than a raw API call. This gives you audit trails, rollback actions, and stop conditions:

# fis-instance-kill.yaml: AWS FIS experiment template
# (FIS templates are natively JSON; this YAML form can be loaded with
#  aws fis create-experiment-template --cli-input-yaml in AWS CLI v2)
description: "Terminate one instance in the api-service ASG"
stopConditions:
  - source: "aws:cloudwatch:alarm"
    # Alarm ARNs use a 12-digit account ID and the form ...:alarm:<name>
    value: "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ApiErrorRateHigh"
targets:
  ApiInstance:
    resourceType: "aws:ec2:instance"
    resourceTags:
      Service: "api-service"
      Env: "production"
    filters:
      - path: "State.Name"
        values: ["running"]
    selectionMode: "COUNT(1)"
actions:
  TerminateInstance:
    actionId: "aws:ec2:terminate-instances"
    targets:
      Instances: "ApiInstance"
roleArn: "arn:aws:iam::123456789012:role/FISRole"

The stopConditions block is critical: if your CloudWatch alarm fires (meaning the system is actually degraded), FIS halts the experiment automatically. This is your automated circuit breaker — never run infrastructure chaos without a stop condition tied to a real SLO alarm.
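
Once a template exists, the experiment still has to be started and watched. The sketch below assumes AWS CLI v2 with credentials configured; it uses the aws fis start-experiment and aws fis get-experiment commands, while the function name run_fis_experiment and the template ID shown in the usage comment are hypothetical. The stop condition itself is enforced by FIS, so this script only observes the outcome.

```shell
# run_fis_experiment TEMPLATE_ID [POLL_SECONDS]
# Starts an experiment and polls until it reaches a terminal state.
# Returns 0 if the experiment completed, 2 if it was stopped, failed,
# or cancelled (for example, because a stop-condition alarm fired).
run_fis_experiment() {
  template_id="$1"
  poll_seconds="${2:-10}"

  experiment_id="$(aws fis start-experiment \
    --experiment-template-id "$template_id" \
    --query 'experiment.id' --output text)" || return 1
  echo "started $experiment_id"

  while true; do
    status="$(aws fis get-experiment --id "$experiment_id" \
      --query 'experiment.state.status' --output text)" || return 1
    echo "status: $status"
    case "$status" in
      completed) return 0 ;;
      stopped|failed|cancelled) return 2 ;;
    esac
    sleep "$poll_seconds"
  done
}

# Usage (placeholder template ID):
#   run_fis_experiment EXT_PLACEHOLDER_ID 15 || echo "experiment did not complete"
```

A non-zero return is not necessarily bad news: a stopped experiment means the alarm did its job, and that result belongs in the experiment write-up.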

Zone-level experiments are a step beyond instance kills. Simulating a full AZ outage means blocking all traffic between your application tier and the target AZ's subnets. On Kubernetes you can approximate this with a network policy that denies egress to the CIDR ranges of the failing AZ. AWS FIS also provides a native aws:network:disrupt-connectivity action, which denies traffic to the targeted subnets by temporarily swapping in a cloned network ACL, so you do not have to hand-craft iptables rules. A gentler first step is to take every node in one zone out of scheduling with kubectl, which tests rescheduling and spare capacity rather than network isolation:

# kubectl: cordon all nodes in us-east-1b so no new pods are scheduled there
kubectl cordon -l topology.kubernetes.io/zone=us-east-1b

# Cordon alone does not move running pods. Drain evicts them so they are
# rescheduled in the remaining zones (evictions respect PodDisruptionBudgets)
kubectl drain -l topology.kubernetes.io/zone=us-east-1b \
  --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Verify pods are rescheduled to healthy zones
kubectl get pods -n api-service -o wide --watch

# Rollback: uncordon the zone
kubectl uncordon -l topology.kubernetes.io/zone=us-east-1b

Zone kill experiment — AZ-2 is cordoned, the load balancer routes all traffic to AZ-1 and AZ-3, and the scheduler places a new pod in AZ-3.

Topology spread constraints are your defense. Without topologySpreadConstraints in your pod spec, the Kubernetes scheduler can place all replicas in one AZ, making a zone kill catastrophic. Always spread workloads across zones before you run zone-kill experiments — the experiment will teach you whether the spread is actually happening.
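
Whether the spread is actually happening can be measured as the difference between the fullest and emptiest zone, which is the quantity that maxSkew bounds in topologySpreadConstraints. The following is a rough sketch: zone_skew is a hypothetical helper that reads one zone name per pod on standard input, and the zone list is passed explicitly so that zones with zero pods still count.

```shell
# zone_skew "ZONE_A ZONE_B ..." < one-zone-name-per-pod
# Prints (max pods in any zone) - (min pods in any zone).
zone_skew() {
  awk -v zones="$1" '
    BEGIN {
      # Seed every expected zone with zero so empty zones are not ignored.
      n = split(zones, z, " ")
      for (i = 1; i <= n; i++) count[z[i]] = 0
    }
    NF { count[$1]++ }
    END {
      first = 1
      for (k in count) {
        if (first || count[k] > max) max = count[k]
        if (first || count[k] < min) min = count[k]
        first = 0
      }
      print max - min
    }'
}

# Three pods: two in 1a, one in 1b, none in 1c.
printf '%s\n' us-east-1a us-east-1a us-east-1b \
  | zone_skew "us-east-1a us-east-1b us-east-1c"
# prints 2 (per-zone counts are 2, 1, 0)

# Feeding it from a live cluster (namespace and zones are examples):
#   kubectl get pods -n api-service \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
#   | while read -r node; do
#       kubectl get node "$node" \
#         -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
#     done \
#   | zone_skew "us-east-1a us-east-1b us-east-1c"
```

A skew equal to the replica count means every pod sits in one zone, which is the catastrophic layout a zone kill would expose.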

Experiment 2: Resource Exhaustion

Resource exhaustion experiments answer the question: what happens when a host runs out of CPU, memory, or disk? These failures are insidious because they degrade gradually, cause cascading effects across services on the same host, and often trigger failure modes (OOM kills, thrashing, CPU throttling) that are hard to reproduce any other way.

CPU exhaustion: stress-ng is a widely used tool for generating controlled CPU load. Pair it with a time limit so it cannot run indefinitely:

# Saturate 4 CPU cores for 120 seconds, then stop
# --cpu N: spawn N workers that cycle through stress-ng's CPU stress methods
#          (add --cpu-method sqrt to pin a single method)
# --timeout: hard stop after this duration
stress-ng --cpu 4 --timeout 120s --metrics-brief

# In a container: run stress-ng as a sidecar or via kubectl exec
# (kubectl exec only works if the container image ships stress-ng)
kubectl exec -it api-pod-abc123 -n api-service -- \
  stress-ng --cpu 2 --timeout 60s

# Watch what happens to CPU throttling at the cgroup level
kubectl top pod -n api-service --containers
cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1: nr_throttled, throttled_time
cat /sys/fs/cgroup/cpu.stat       # cgroup v2: nr_throttled, throttled_usec
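
Raw throttling counters are easier to compare across runs as a ratio. As a small sketch, the hypothetical helper throttled_pct below reads a cpu.stat file (or standard input) and reports the share of CFS enforcement periods in which the cgroup was throttled, using the nr_periods and nr_throttled fields that both cgroup v1 and v2 expose when a CPU limit is set.

```shell
# throttled_pct [CPU_STAT_FILE]
# Prints the percentage of scheduler periods in which the cgroup was throttled.
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"   # no CPU limit set, or no periods elapsed
    }' "$@"
}

# Example with captured counters: 50 of 200 periods were throttled.
printf 'nr_periods 200\nnr_throttled 50\nthrottled_usec 1234\n' | throttled_pct
# prints 25.0

# On a live cgroup v2 host:
#   throttled_pct /sys/fs/cgroup/cpu.stat
```

Capture the value before and during the experiment; the counters are cumulative, so the difference between two readings is what reflects the injected load.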

Memory pressure: forcing a process toward the OOM boundary reveals how your application behaves under memory contention. On Linux, the kernel OOM killer picks its victim by oom_score, and the process with the highest score is terminated first. Knowing which process is killed first is critical on nodes that run multiple containers:

# Allocate and hold 2 GB of memory for 90 seconds
# (--vm-keep holds the mapping instead of repeatedly unmapping and remapping it)
stress-ng --vm 1 --vm-bytes 2G --vm-keep --timeout 90s

# Monitor OOM events in real time
dmesg -w | grep -i "oom\|killed"

# Check OOM score for your process (lower = less likely killed)
cat /proc/$(pgrep -n python)/oom_score
cat /proc/$(pgrep -n python)/oom_score_adj

# In Kubernetes: the kubelet sets oom_score_adj from the pod's QoS class
# Guaranteed QoS (requests == limits) gets oom_score_adj -997
# BestEffort QoS gets oom_score_adj 1000 and is killed first
# Burstable QoS falls in between, scaled by the memory request
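
To see the likely kill order on a node before applying pressure, the per-process scores can be ranked directly from /proc. This is a minimal sketch that assumes a Linux host; oom_candidates is a hypothetical helper name, and processes that exit while the loop runs are simply skipped.

```shell
# oom_candidates [N]
# Lists the N processes (default 10) with the highest oom_score as
# "score pid command", most likely OOM victim first.
oom_candidates() {
  for d in /proc/[0-9]*; do
    # A process can exit between the glob and the read; skip it if so.
    score="$(cat "$d/oom_score" 2>/dev/null)" || continue
    [ -n "$score" ] || continue
    comm="$(cat "$d/comm" 2>/dev/null)"
    printf '%s %s %s\n' "$score" "${d#/proc/}" "$comm"
  done | sort -rn | head -n "${1:-10}"
}

oom_candidates 5
```

Run it once at steady state and again while stress-ng is holding memory: the ordering shifts as resident memory grows, which is exactly what decides the victim.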

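Disk exhaustion: filling a filesystem is the simplest way to surface failed writes, corruption from partially written files, and cascading log pipeline failures. Always run disk exhaustion experiments on a dedicated test mount, such as a size-limited tmpfs or a scratch volume, never on the root or a data volume. Keep in mind that files on tmpfs consume memory, so size the mount accordingly: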

# Preallocate a 5 GB file to fill a test filesystem
# (fallocate reserves real blocks, so the space is consumed immediately)
# NEVER run this on / or a data volume; use a dedicated test mount
fallocate -l 5G /mnt/chaos-test/fill.dat

# Monitor disk usage during the experiment
watch -n 2 df -h /mnt/chaos-test

# Verify application behavior: does it log the error and degrade gracefully,
# or does it crash silently?
tail -f /var/log/app/app.log | grep -i "disk\|space\|write\|enospc"

# Cleanup
rm /mnt/chaos-test/fill.dat
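
The warning about never filling the root filesystem is easy to enforce in the script itself. The sketch below is one possible guard, using df -P to resolve which mount a directory lives on; the helper names mount_point_of and safe_fill_target are hypothetical, and mount points containing spaces are not handled.

```shell
# mount_point_of DIR: prints the mount point of the filesystem holding DIR.
mount_point_of() {
  df -P "$1" 2>/dev/null | awk 'NR == 2 { print $6 }'
}

# safe_fill_target DIR
# Succeeds (and prints the mount point) only if DIR exists and is NOT on
# the root filesystem; otherwise explains why on stderr and fails.
safe_fill_target() {
  target_dir="$1"
  if [ ! -d "$target_dir" ]; then
    echo "refusing: $target_dir is not a directory" >&2
    return 1
  fi
  mp="$(mount_point_of "$target_dir")"
  if [ -z "$mp" ] || [ "$mp" = "/" ]; then
    echo "refusing: $target_dir is on the root filesystem (or unknown)" >&2
    return 1
  fi
  echo "$mp"
}

# Usage: only fill when the guard passes.
#   if safe_fill_target /mnt/chaos-test >/dev/null; then
#     fallocate -l 5G /mnt/chaos-test/fill.dat
#   fi
```

If /mnt/chaos-test was never mounted, it is just a directory on the root filesystem, and the guard catches exactly that mistake.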

Disk exhaustion in containers is deceptive. Docker and containerd use overlayfs, so a container writing to its own filesystem consumes space on the host filesystem that backs the runtime's storage directory, which is often the node's root partition, not an isolated volume. Always mount a dedicated tmpfs or PVC for disk chaos experiments. Filling a production node's root filesystem can destabilize the kubelet and trigger disk-pressure evictions across the node.

Experiment 3: Network Partitions

Network partitions are the most dangerous and most revealing infrastructure experiments. They expose split-brain conditions, inconsistent state, retry storms, and timeout configuration errors that unit and integration tests rarely catch. The CAP theorem becomes viscerally real when you partition a database cluster in production and observe whether your application chooses consistency or availability.

The standard tool for network fault injection on Linux is tc (traffic control) with the netem queueing discipline. tc netem can inject latency, packet loss, duplication, and corruption, and a tc filter with a drop action can blackhole traffic to a specific destination:

# Each scenario installs its own root qdisc. A device has only one root qdisc,
# so a second "add ... root" fails until the previous one is deleted.

# Scenario A: add 200ms latency + 50ms jitter to all outbound traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal
tc qdisc del dev eth0 root

# Scenario B: simulate 15% packet loss (models lossy link behavior)
tc qdisc add dev eth0 root netem loss 15%
tc qdisc del dev eth0 root

# Scenario C: simulate a complete network blackhole to a specific service IP
# (simulates a database becoming unreachable)
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 parent 1:0 protocol ip u32 \
  match ip dst 10.0.1.45/32 \
  action drop

# Show current qdisc rules
tc qdisc show dev eth0

# ALWAYS clean up when done
tc qdisc del dev eth0 root
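
Manual cleanup is the step most likely to be forgotten when an experiment is interrupted. One way to make it automatic, sketched here under the assumption that the script runs as root on the target host, is to wrap the injection in a subshell whose exit trap removes the qdisc. The function name with_netem is hypothetical; the tc commands are the same ones shown in this section.

```shell
# with_netem DEVICE DURATION_SECONDS NETEM_ARGS...
# Applies a netem qdisc for a fixed duration and always removes it,
# even if the script is interrupted with Ctrl-C or SIGTERM.
# The body runs in a subshell so the traps do not leak into the caller.
with_netem() (
  dev="$1"
  duration="$2"
  shift 2

  cleanup() { tc qdisc del dev "$dev" root; }

  tc qdisc add dev "$dev" root netem "$@" || exit 1
  trap cleanup EXIT            # runs on normal exit and after "exit" below
  trap 'exit 130' INT TERM     # turn signals into an exit so cleanup fires

  sleep "$duration"
  echo "netem on $dev held for ${duration}s; removing"
)

# Usage: 200ms +/- 50ms of delay on eth0 for two minutes.
#   with_netem eth0 120 delay 200ms 50ms distribution normal
```

The fixed duration doubles as a crude stop condition: even if nobody is watching, the fault expires on its own.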

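At the Kubernetes level, use a NetworkPolicy to partition a set of pods or an entire namespace from its upstream dependency. This is safer than tc because it is declarative and fully reversible by deleting the policy object. Two caveats apply: the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy, and because policies are additive allow-lists, any other policy that already allows api-service to reach the cache will defeat the partition: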

# Partition: deny all ingress to the cache layer from api-service
# Simulates "api-service cannot reach Redis"
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-partition-cache
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: redis-cache
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              chaos-exempt: "true"
      # Only chaos-exempt pods (e.g., monitoring) can reach cache
      # api-service pods are now partitioned from Redis

Observe retry storms during partition recovery. When a network partition heals, every client that was timing out retries at once. Without exponential backoff and jitter, this creates a thundering herd that overwhelms the recovering service. Chaos experiments are one of the few reliable ways to validate that your retry logic actually applies jitter under real conditions. Use kubectl logs -f and watch your metrics for the "recovery spike": if p99 latency spikes above steady state during recovery, your retry logic needs fixing.
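
For reference, exponential backoff with jitter can be sketched in a few lines of shell. This is an illustration rather than the retry logic of any particular client library: it uses the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling), the function names and default values are hypothetical, and it assumes /dev/urandom and a sleep that accepts fractional seconds.

```shell
# backoff_delay_ms ATTEMPT [BASE_MS] [CAP_MS]
# Prints a random delay in [0, min(CAP_MS, BASE_MS * 2^ATTEMPT)] milliseconds.
backoff_delay_ms() {
  attempt="$1"
  base_ms="${2:-100}"
  cap_ms="${3:-10000}"

  # Exponential ceiling, doubled once per attempt and capped at cap_ms.
  ceiling="$base_ms"
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$ceiling" -lt "$cap_ms" ]; do
    ceiling=$(( ceiling * 2 ))
    i=$(( i + 1 ))
  done
  [ "$ceiling" -gt "$cap_ms" ] && ceiling="$cap_ms"

  # Full jitter: read 4 random bytes as an unsigned integer, then reduce.
  rand="$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')"
  echo $(( rand % (ceiling + 1) ))
}

# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_with_backoff() {
  max_attempts="$1"
  shift
  n=0
  while ! "$@"; do
    n=$(( n + 1 ))
    [ "$n" -ge "$max_attempts" ] && return 1
    delay_ms="$(backoff_delay_ms "$n")"
    sleep "$(awk -v ms="$delay_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
  done
}

# Usage (the URL is a placeholder):
#   retry_with_backoff 5 curl -fsS http://cache.internal/health
```

Because each client draws its own random delay, retries after a partition heals are spread across the whole backoff window instead of arriving in synchronized waves.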

Measuring Experiment Outcomes

Every infrastructure chaos experiment must be tied to an observable steady-state signal. Without measurement, you are not running an experiment — you are just breaking things. The minimum set of metrics to instrument for any infrastructure experiment:

Error rate: percentage of requests returning 5xx. Baseline must be established before the experiment.
Latency percentiles: p50, p95, p99. Instance kills often show latency spikes during failover that are acceptable; zone kills must show full recovery within your SLO window.
Saturation: CPU throttled time, memory page fault rate, disk iowait. These reveal whether resource exhaustion translated into application-layer impact.
Retry and timeout rates: rising retry counts without rising error rates means your retry policy is working. Rising retries AND rising errors means cascading failure.
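
The retry-versus-error reading in the last item can be turned into a simple check that compares baseline values with values observed during the experiment. The sketch below is illustrative only: classify_retry_signal is a hypothetical helper, the 1.2 tolerance (a 20% rise counts as "rising") is an arbitrary assumption, and the third label covers a case the list does not discuss.

```shell
# classify_retry_signal RETRIES_BEFORE RETRIES_DURING ERRORS_BEFORE ERRORS_DURING [TOLERANCE]
# Values can be rates or counts, as long as before/during use the same unit.
classify_retry_signal() {
  awk -v rb="$1" -v rd="$2" -v eb="$3" -v ed="$4" -v tol="${5:-1.2}" 'BEGIN {
    retries_up = (rd > rb * tol)
    errors_up  = (ed > eb * tol)
    if (retries_up && errors_up) print "cascading-failure"
    else if (retries_up)         print "retries-absorbing-failures"
    else if (errors_up)          print "errors-without-retries"
    else                         print "steady"
  }'
}

# Retries rose from 10/s to 80/s while the error rate stayed at 0.05%.
classify_retry_signal 10 80 0.05 0.05
# prints retries-absorbing-failures

# Retries and errors both rose sharply.
classify_retry_signal 10 80 0.05 2.5
# prints cascading-failure
```

In practice the four inputs would come from your metrics system; the point of the sketch is only that the judgement should be written down as a rule before the experiment, not made by eye afterwards.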

Document every experiment as an RFC before running it. A one-page document stating the hypothesis, blast radius, stop conditions, rollback procedure, and expected duration is not bureaucracy — it is the minimum standard for responsible production chaos. Undocumented experiments that cause outages are incidents. Documented experiments that cause outages are learnings.

Summary

Infrastructure chaos experiments — instance kills, zone failures, resource exhaustion, and network partitions — each expose a distinct class of production failure mode. Instance kills validate your redundancy and auto-recovery configuration. Zone kills verify that your traffic distribution is truly multi-AZ. Resource exhaustion tests reveal whether your applications degrade gracefully or fail silently. Network partitions expose timeout, retry, and circuit-breaker behavior that no synthetic test can replicate. Run them in escalating order of blast radius, always with a stop condition, always with a documented hypothesis, and always with metrics open on your second screen.