Chaos Engineering & Resilience

Why Break Things on Purpose?

18 min Lesson 1 of 27


Every distributed system fails. Networks partition, disks fill, dependencies time out, memory leaks accumulate, and processes crash under load they were never tested against. The question is not whether your system will encounter these conditions — it is whether you will discover the failure mode in a controlled experiment at 2 pm on a Tuesday, or in a blind panic at 3 am when an on-call engineer is paged and real users are affected.

Chaos Engineering is the discipline of deliberately injecting failure into a system in a controlled, scientific way to expose weaknesses before they surface in production on their own. It answers one question: does this system actually behave the way we believe it does under adverse conditions?

The Netflix Origin Story

In 2008, Netflix experienced a major database corruption event that left it unable to ship DVDs to members for three days. The incident was a turning point. The engineering team recognized that as they migrated from their own data centers to Amazon Web Services, the topology of their failure space changed completely. In a data center you had a handful of large, expensive servers with known failure characteristics. On AWS you had thousands of small, disposable instances, and the premise of cloud infrastructure is that individual components will fail and your system must tolerate that.

In 2010, a Netflix engineer named Greg Orzell and his team built a small internal tool called Chaos Monkey. Its only job was to randomly terminate EC2 instances during business hours. The reasoning was direct: if instances could die at any time in production, engineers had to build services that survived instance death, and the only way to verify that survival was to practice it constantly rather than design for it in theory and hope.

Chaos Monkey was the seed of what Netflix described publicly in 2011 as the Simian Army, a collection of chaos tools each targeting a different failure class, including Chaos Gorilla (availability zone failure), Latency Monkey (network delay injection), and Conformity Monkey (configuration drift detection), later joined by Chaos Kong (regional failure). In 2014, Netflix formalized the principles behind all of them into a discipline they called Chaos Engineering.

Key idea: Chaos Engineering did not start as an academic idea. It was invented by practitioners who were tired of being surprised by production failures they could have anticipated. The Netflix team was not trying to be clever — they were trying to stop being woken up at 3 am.

The Confidence Gap

Most engineering teams operate on a set of untested assumptions about their system's resilience. They have runbooks that say "if the primary database goes down, the replica promotes automatically." They have architecture diagrams that show retry logic and circuit breakers. They have SLO commitments to their users. But on many teams, none of these assumptions has been validated under realistic conditions at production scale.

This gap between assumed resilience and actual resilience is what chaos engineering closes. Consider a typical web service that depends on three downstream APIs. Your code has a 3-second timeout and a retry with exponential backoff. You believe that if any one of those APIs degrades, your service will degrade gracefully. But have you ever verified:

Does the circuit breaker actually open after the configured failure threshold, or does the configuration have a bug that went unnoticed?
When the circuit breaker opens, does the service return a cached response, a 503, or does it deadlock waiting for a connection pool that never drains?
Does your timeout of 3 seconds account for the fact that your HTTP client library has separate connection and read timeouts, so that misconfiguring either one can leave requests hanging for 30 seconds?
When all three downstream APIs degrade simultaneously — a scenario that never happened in staging — does the combined thread pool exhaustion cascade into your own service becoming unavailable?

Each of these is a hypothesis. Chaos Engineering is the practice of running controlled experiments to confirm or refute hypotheses about system behavior, just as scientists run experiments to confirm or refute hypotheses about the physical world.
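
As an illustration, the first question in the list (does the circuit breaker actually open at the configured threshold?) can be turned into a small, repeatable experiment. The sketch below is not any particular library's breaker. `CircuitBreaker`, `failing_dependency`, and `run_experiment` are hypothetical names, and the threshold of 3 and the 30-second cool-down are assumed values chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal count-based breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.consecutive_failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.consecutive_failures = 0
        return result


def failing_dependency():
    # Injected fault: the downstream API always times out.
    raise TimeoutError("injected fault")


def run_experiment(threshold=3, attempts=5):
    """Call a permanently failing dependency and record what the caller sees."""
    breaker = CircuitBreaker(failure_threshold=threshold)
    outcomes = []
    for _ in range(attempts):
        try:
            breaker.call(failing_dependency)
            outcomes.append("ok")
        except CircuitOpenError:
            outcomes.append("rejected")
        except TimeoutError:
            outcomes.append("timeout")
    return outcomes


if __name__ == "__main__":
    print(run_experiment())
    # ['timeout', 'timeout', 'timeout', 'rejected', 'rejected']
```

The hypothesis is confirmed only if the first three calls reach the dependency and every later call is rejected immediately. A misconfigured threshold shows up as a different sequence of outcomes, which is precisely the sort of bug that goes unnoticed until a real dependency degrades.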

The Scientific Framework

Netflix published a precise definition of Chaos Engineering with five principles. These are not suggestions — they describe what separates a chaos experiment from random destructive testing:

Define a steady state — identify a measurable output that represents normal behavior (requests per second, error rate, p99 latency, SLO compliance). This is your control baseline.
Hypothesize that the steady state will continue — your hypothesis is that the system will maintain the steady state despite the fault you are about to inject. Write it down explicitly.
Introduce real-world events — inject failure that mirrors production reality: server crashes, network latency, disk full, dependency failure, CPU saturation. Not synthetic edge cases that could never happen.
Run in production — staging does not have production traffic patterns, production data distributions, or production load. Resilience mechanisms that work in staging under 10 RPS often fail in production under 50,000 RPS. The experiment must run where the real behavior lives.
Automate experiments to run continuously — a one-time chaos experiment is a snapshot. Systems change with every deployment. Continuous chaos provides continuous confidence.
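
The first three principles map naturally onto a small piece of code. The following is a minimal sketch, not a real chaos tool: `SteadyState`, `run_experiment`, and `FakeService` are hypothetical names, the thresholds are assumed values, and the "system" is an in-memory fake, so it says nothing about production behavior. Scheduling the experiment to run continuously is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Measurable definition of normal behavior (assumed thresholds)."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.01 for 1%
    max_p99_latency_ms: float

    def holds(self, error_rate, p99_latency_ms):
        return (error_rate <= self.max_error_rate
                and p99_latency_ms <= self.max_p99_latency_ms)


def run_experiment(steady_state, measure, inject_fault, remove_fault):
    """Hypothesis: the steady state continues to hold while the fault is active."""
    baseline = measure()
    if not steady_state.holds(*baseline):
        # Never inject a fault into a system that is already unhealthy.
        return {"verdict": "aborted", "baseline": baseline}
    inject_fault()
    try:
        observed = measure()
    finally:
        remove_fault()  # always roll back, even if measurement fails
    verdict = "hypothesis held" if steady_state.holds(*observed) else "hypothesis refuted"
    return {"verdict": verdict, "baseline": baseline, "observed": observed}


class FakeService:
    """Stand-in for a real system; returns (error_rate, p99_latency_ms)."""

    def __init__(self, tolerates_fault):
        self.tolerates_fault = tolerates_fault
        self.fault_active = False

    def inject(self):
        self.fault_active = True

    def remove(self):
        self.fault_active = False

    def measure(self):
        if self.fault_active and not self.tolerates_fault:
            return (0.08, 950.0)
        return (0.001, 180.0)


if __name__ == "__main__":
    slo = SteadyState(max_error_rate=0.01, max_p99_latency_ms=300.0)
    fragile = FakeService(tolerates_fault=False)
    result = run_experiment(slo, fragile.measure, fragile.inject, fragile.remove)
    print(result["verdict"])  # hypothesis refuted
```

Two design choices are worth noting. The baseline check refuses to start when the system is already outside its steady state, and the `finally` block guarantees the fault is removed regardless of what happens during measurement.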

Production practice: Start with a blast radius of one. Your first experiment should affect a single instance, a single canary slice of traffic, or a single non-critical dependency — not the entire fleet. Build confidence and observability before expanding scope. Netflix spent months running Chaos Monkey against a small percentage of traffic before enabling it fleet-wide.
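
One way to make "a blast radius of one" enforceable rather than aspirational is to put the limit in the code that picks targets. This is a sketch under assumptions: `select_targets` is a hypothetical helper, the instance IDs are made up, and the 25% cap is an arbitrary example value, not a recommendation from any tool.

```python
import random


def select_targets(instances, blast_radius=1, max_fraction=0.25, seed=None):
    """Pick at most `blast_radius` instances, never more than `max_fraction` of the fleet."""
    if blast_radius < 1:
        raise ValueError("blast_radius must be at least 1")
    fleet = list(instances)
    limit = int(len(fleet) * max_fraction)  # hard cap, rounded down
    if limit < 1:
        # Assumed policy: refuse to touch a fleet too small to absorb a loss.
        raise ValueError("fleet too small for the configured max_fraction")
    rng = random.Random(seed)  # seedable so a run can be reproduced
    return rng.sample(fleet, min(blast_radius, limit))


if __name__ == "__main__":
    fleet = [f"i-{n:04d}" for n in range(20)]  # hypothetical instance IDs
    print(select_targets(fleet, seed=7))       # a single instance by default
```

Requesting a larger blast radius than the cap allows returns only as many targets as the cap permits, so widening scope requires changing two settings deliberately instead of one by accident.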

Why Business Hours?

Chaos Monkey ran during business hours by design. This seems counterintuitive — why inject failure when users are active? The answer is that the goal of chaos engineering is to discover weaknesses, and you want senior engineers available to respond when a weakness is exposed. If a chaos experiment reveals that your service does not recover from an instance termination, you want a full engineering team at their desks to investigate and fix the root cause — not a single on-call engineer at midnight with limited context. Business-hours experiments also force teams to treat resilience as a first-class engineering concern, not a 3 am emergency.
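
A business-hours rule is easy to encode as a guard that every experiment checks before injecting anything. The sketch below is an assumption-laden illustration, not Chaos Monkey's actual scheduler: `in_experiment_window` is a hypothetical name, and the 09:00 to 17:00 weekday window is an assumed default.

```python
from datetime import datetime, time, timedelta, timezone


def in_experiment_window(now, local_tz=timezone.utc, start=time(9, 0), end=time(17, 0)):
    """Return True only on weekdays between `start` and `end` in the team's time zone."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(local_tz)
    is_weekday = local.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start <= local.time() < end


if __name__ == "__main__":
    tuesday_2pm = datetime(2024, 1, 2, 14, 0, tzinfo=timezone.utc)
    tuesday_3am = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
    print(in_experiment_window(tuesday_2pm))  # True
    print(in_experiment_window(tuesday_3am))  # False
```

Passing the team's own time zone as `local_tz` matters for distributed teams, because the window should track when the responders are at their desks, not when the servers think it is daytime.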

The Chaos Engineering Experiment Loop

Figure: The chaos engineering experiment loop, from defining steady state to refuting or confirming a hypothesis, then repeating.

Chaos Engineering vs. Load Testing vs. Fault Injection

These terms are often confused. Load testing asks: how much traffic can this system handle? It explores capacity. Fault injection (as used in unit testing) asks: does this function handle a bad input correctly? It explores correctness at the component level. Chaos Engineering asks: does this distributed system maintain acceptable behavior under real-world failure conditions? It explores emergent behavior — how independent, individually correct components interact when the environment degrades. The scope is the entire sociotechnical system: software, infrastructure, and the human response processes around it.

Production pitfall: Running chaos experiments without observability in place is dangerous and useless. If you terminate an instance and you cannot measure the effect on error rate, latency, and user experience within seconds, you cannot distinguish a graceful recovery from a silent data corruption. Observability (metrics, traces, logs — covered in earlier tutorials) is a hard prerequisite for chaos engineering. Chaos without observability is just breaking things.
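
The prerequisite can be enforced mechanically with a guardrail that refuses to continue when it cannot see the system. The following is a rough sketch: `check_guardrail` is a hypothetical function, the sample format is invented for the example, and the 10-second staleness limit and 5% error-rate threshold are assumed values.

```python
def check_guardrail(samples, now_s, max_staleness_s=10.0, max_error_rate=0.05):
    """Decide whether an experiment may continue.

    samples: list of (timestamp_s, request_count, error_count) tuples.
    Returns (ok, reason).
    """
    fresh = [s for s in samples if now_s - s[0] <= max_staleness_s]
    if not fresh:
        return (False, "no fresh metrics: the effect of the fault cannot be observed")
    requests = sum(s[1] for s in fresh)
    errors = sum(s[2] for s in fresh)
    if requests == 0:
        return (False, "no traffic in window: steady state cannot be evaluated")
    if errors / requests > max_error_rate:
        return (False, "error rate above threshold: abort and roll back")
    return (True, "ok")


if __name__ == "__main__":
    samples = [(100.0, 1000, 5), (105.0, 1000, 8)]
    print(check_guardrail(samples, now_s=108.0))  # (True, 'ok')
```

Missing data is treated as a reason to stop, not as a sign of health. An experiment whose metrics pipeline has gone quiet is indistinguishable from one that is silently causing damage.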

The Maturity Signal

Whether a team practices chaos engineering is a reliable signal of engineering maturity. Immature organizations treat production as sacred and untouchable — they are afraid of failure because they have not built the tooling, observability, or culture to contain it. Mature organizations treat production as the testing environment that matters most, and they run controlled experiments there because they have the confidence that comes from deep observability, fast rollback, and on-call processes that can handle a known, time-boxed blast radius.

The goal of this tutorial is to take you from zero to a functioning chaos engineering practice: the tools, the experiments, the game day process, and the resilience patterns that chaos validates. Lesson 1 ends here — with the why. The rest of the tutorial is the how.