Cloud Architecture & Landing Zones

Resilient Architecture Patterns

Resilience is the ability of a system to absorb failures and continue operating at an acceptable level. In the cloud, failures are not exceptions; they are routine events. Hardware dies, availability zones flood, BGP routes flap, and software bugs corrupt state. The difference between a system that survives these events and one that pages your on-call at 3 AM is whether resilience was designed in from day one or bolted on after the first major incident.

This lesson covers three foundational patterns used at scale: multi-AZ deployment (the baseline), multi-region deployment (the premium tier), and cell-based architecture (the approach Netflix, Amazon, and Slack use to cap blast radius at global scale). It also covers static stability, the design principle that keeps these patterns working when the cloud control plane is impaired.

Multi-AZ: The Baseline Standard

An Availability Zone (AZ) is one or more physically separate data centers within a region, with distinct power, cooling, and networking, connected to peer AZs via low-latency private links. AWS regions typically have three or more AZs. A multi-AZ architecture distributes your compute, database, and storage across at least two (ideally three) AZs so that the failure of one AZ does not bring down your service.

The mechanics are simple: your load balancer health-checks targets in every AZ and removes unhealthy ones. When AZ-b loses power, the load balancer stops routing to it once its targets fail the configured number of consecutive health checks (seconds to tens of seconds, depending on the check interval and threshold), and traffic flows entirely through AZ-a and AZ-c. Your application continues serving requests, possibly at reduced throughput, but without an outage. The key requirement is that no component is a single-AZ singleton: not your EC2 instances, not your database, not your message queue, not your NAT Gateway.

Multi-AZ RDS is not a read replica; it is a synchronous standby. Amazon RDS Multi-AZ (the single-standby instance deployment) maintains a standby in a second AZ with synchronous block-level replication. The standby does not serve reads. Failover is automatic (typically 60–120 seconds) and DNS-based: the endpoint name stays the same, so you do not need to change connection strings, but your application must be able to reconnect. Multi-AZ is table stakes for any production database.

A minimal multi-AZ deployment on AWS with Terraform illustrates the pattern:

# Three private subnets, one per AZ — the foundation of multi-AZ
resource "aws_subnet" "private" {
  for_each          = { a = "10.0.1.0/24", b = "10.0.2.0/24", c = "10.0.3.0/24" }
  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value
  availability_zone = "${var.region}${each.key}"
  tags              = { Name = "private-${each.key}" }
}

# ALB spanning all three AZs
resource "aws_lb" "app" {
  name               = "app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = [for s in aws_subnet.public : s.id]   # public subnets
}

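# Auto Scaling Group distributed across all three private subnets
resource "aws_autoscaling_group" "app" {
  min_size            = 3
  max_size            = 12
  desired_capacity    = 3
  vpc_zone_identifier = [for s in aws_subnet.private : s.id]

  # The resource needs a launch template (or equivalent);
  # aws_launch_template.app is assumed to be defined elsewhere
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  # Keep capacity evenly balanced across AZs (needs a recent AWS provider version)
  availability_zone_distribution {
    capacity_distribution_strategy = "balanced-only"
  }
}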

# RDS with Multi-AZ enabled: synchronous standby in a second AZ
# (storage, credentials, and backup settings omitted for brevity)
resource "aws_db_instance" "postgres" {
  identifier           = "prod-postgres"
  engine               = "postgres"
  instance_class       = "db.r6g.xlarge"
  multi_az             = true   # this one flag buys you automatic failover
  db_subnet_group_name = aws_db_subnet_group.main.name
}

Multi-Region: When AZ Failure Is Not Enough

Multi-AZ protects you from a single data center going offline. Multi-region protects you from an entire AWS region becoming unavailable — a rarer but real event (AWS us-east-1 has had multiple widespread outages). More practically, multi-region also delivers latency improvements for globally distributed users and satisfies data residency requirements in regulated industries.

The cost and complexity of multi-region are significant. You are now running two or more full stacks with cross-region data replication. Every architectural decision forces you to confront the CAP theorem: when the network link between regions is severed, do you want the system to stay consistent (both regions agree on the same data) or available (both regions continue accepting writes, resolving conflicts later)?

Active-passive (failover): One region is primary and handles all writes. The standby region replicates asynchronously. On regional failure, Route 53 health checks detect the outage and shift DNS to the standby. Recovery point objective (RPO) is seconds to minutes, bounded by the asynchronous replication lag. Recovery time objective (RTO) is typically 1–5 minutes, covering failure detection, DNS propagation, and warm-up of the standby. This is the simpler, cheaper option, suitable for most enterprise workloads.
Active-active: Both regions accept writes simultaneously. You need either conflict resolution logic (CRDTs, last-write-wins semantics, or an application-level merge strategy) or a database that coordinates writes across regions. DynamoDB global tables resolve conflicts with last-writer-wins; CockroachDB avoids conflicts by replicating through consensus, at the cost of cross-region write latency. Conventional relational databases require careful application design. RTO is near-zero because there is no failover step: the failed region simply stops receiving traffic. Amazon offers this model in managed form through globally replicated services such as DynamoDB global tables.
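
The last-write-wins option mentioned above can be sketched in a few lines. This is a minimal illustration, assuming every record carries a write timestamp and the region that accepted the write; `Versioned` and `lww_merge` are hypothetical names, not part of any database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with its write time and the region that accepted the write."""
    value: str
    timestamp_ms: int
    region: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Pick the winner under last-write-wins.

    Ties on timestamp are broken by region name, so both regions choose
    the same winner no matter which copy they see first.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))
```

The trade-off is visible in the code: the losing write is discarded silently, and the outcome depends on clocks that may be skewed between regions. That is acceptable for data such as user preferences and usually not for balances or inventory counts.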

Route 53 health checks alone are not enough for active-passive failover. DNS TTLs must be set to 60 seconds or less on your failover records, and many client-side resolvers and connection pools keep using an answer well past its TTL. Design your failover to tolerate 2–3 minutes of stale DNS. More importantly, validate your failover path quarterly with actual drills, not just monitoring checks. The worst time to discover your standby database has 12 hours of replication lag is during an incident.
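
To make the time budget concrete, here is a rough back-of-the-envelope sketch. It assumes the failover time is the sum of health-check detection, the period during which clients may still hold the old DNS answer, and any standby warm-up. The function name and the sample numbers are illustrative assumptions, not measured values.

```python
def failover_budget_seconds(check_interval_s: int,
                            failure_threshold: int,
                            stale_dns_s: int,
                            warm_up_s: int = 0) -> int:
    """Estimate worst-case time until clients reach the standby region.

    check_interval_s  -- seconds between health checks
    failure_threshold -- consecutive failures before the endpoint is marked unhealthy
    stale_dns_s       -- how long clients may keep the old answer (TTL plus overrun)
    warm_up_s         -- time the standby needs before it can take full traffic
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + stale_dns_s + warm_up_s
```

With a 30-second check interval, a threshold of 3 failures, and 180 seconds of stale DNS, the estimate is 270 seconds, which sits at the upper end of the 1–5 minute RTO quoted above. A drill is still the only way to confirm the real number.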

Multi-AZ topology (left) handles a single data center failure; multi-region active-passive (right) handles a full regional outage with Route 53 DNS failover.

Static Stability: Designing Against Control Plane Failures

Static stability is a design principle popularized by AWS, through the Amazon Builders' Library and talks by engineers such as Colm MacCárthaigh. It answers a subtle but critical question: what happens to your running service when the AWS control plane itself (EC2 APIs, IAM, Auto Scaling, Route 53 record changes) is degraded or unavailable?

The answer for a statically stable system is: nothing. Your service keeps running because it was designed to function without any control plane calls at runtime. The control plane is only involved during changes (deployments, scaling events, failovers). Once the desired state is established, the data plane operates independently.

Practical implications of static stability:

Pre-warm capacity in all AZs before you need it. Do not rely on Auto Scaling to spin up new instances during an incident — EC2 capacity might be constrained in the AZ that just lost power, precisely when demand spikes. Maintain enough steady-state capacity to handle your peak load with one AZ down.
Avoid runtime IAM credential fetching in hot paths. Instance profiles and EKS IRSA rotate credentials automatically, and your SDK caches them. But if your code refreshes credentials on every request by calling the instance metadata service or STS, a hiccup in either one breaks you. Use the SDK defaults and let caching absorb transient failures.
Pre-resolve DNS; do not resolve on every request. Cache DNS results at the application layer with a reasonable TTL. If Route 53 is degraded during a large event, services that re-resolve on every connection attempt will fail while those with cached resolution continue.
Zonal independence for data. When an AZ is impaired, you want to stop sending traffic there, not try to migrate its state. Data that lives only in the impaired AZ (in-memory caches, local session state) should be considered lost. Design your application to tolerate this: use distributed caches (ElastiCache with cluster mode), session state stored in a shared database or cache rather than on a single instance, or stateless services.
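
The DNS advice in the list above can be sketched as a small application-level cache that keeps serving the last known good answer when the resolver fails. This is a minimal sketch under stated assumptions: `StaleOkDnsCache` and `system_resolve` are hypothetical names, the 30-second TTL is arbitrary, and a production version would also need thread safety and an upper bound on staleness.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve a hostname to IPv4 addresses using the OS resolver."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class StaleOkDnsCache:
    """Cache DNS answers for ttl_s seconds; fall back to stale answers on failure."""

    def __init__(self,
                 resolve: Callable[[str], List[str]] = system_resolve,
                 ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._ttl_s = ttl_s
        self._clock = clock
        # host -> (expiry time, addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now < entry[0]:
            return entry[1]  # fresh answer: no resolver call at all
        try:
            addresses = self._resolve(host)
        except OSError:
            if entry is not None:
                return entry[1]  # resolver is failing: reuse the last good answer
            raise  # nothing cached, so the failure must surface
        self._entries[host] = (now + self._ttl_s, addresses)
        return addresses
```

The resolver and clock are injected so the fallback path can be exercised in a test without touching the network.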

The AWS Well-Architected Reliability Pillar encodes static stability under REL 10 ("How do you use fault isolation to protect your workload?", which includes deploying the workload to multiple locations) and REL 11 ("How do you design your workload to withstand component failures?", which includes using static stability to prevent bimodal behavior). In practice, this means every critical service should have enough pre-provisioned capacity in its remaining healthy AZs to absorb 100% of traffic without triggering a scaling event. Scale out proactively before your high-traffic window, not reactively during it.
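
The capacity rule above reduces to simple arithmetic. The following sketch assumes load is spread evenly across AZs and that each instance handles a known, fixed request rate; `instances_per_az` is a hypothetical helper and the figures are illustrative.

```python
import math


def instances_per_az(peak_rps: int, rps_per_instance: int, az_count: int) -> int:
    """Instances to run in EACH AZ so that peak load still fits with one AZ down."""
    if az_count < 2:
        raise ValueError("static stability needs at least two AZs")
    needed_total = math.ceil(peak_rps / rps_per_instance)  # fleet size at peak
    surviving_azs = az_count - 1                           # assume one AZ is lost
    return math.ceil(needed_total / surviving_azs)
```

For a peak of 9,000 requests per second at 500 per instance across three AZs, the peak fleet is 18 instances, so each AZ needs 9 and the steady-state fleet is 27. That is 50% over the bare peak requirement. With only two AZs the overhead rises to 100%, which is one reason three AZs are preferred.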

Cell-Based Architecture: Limiting Blast Radius at Scale

Multi-AZ and multi-region protect against infrastructure failures. Cell-based architecture addresses a different class of problem: software failures that are correlated across your entire fleet. A bad deployment, a cascading retry storm, a poisoned cache, a database query that brings down your primary — these failures do not respect AZ boundaries. If you deploy the bad code to all your servers simultaneously, the whole service goes down, everywhere, even though every AZ is perfectly healthy.

The cell model partitions your service into independent, fault-isolated units called cells. Each cell:

Is a complete, self-contained deployment of your service stack (compute, cache, database)
Serves a fixed, non-overlapping subset of your user population (typically 1–5% each)
Shares nothing with other cells at runtime — no shared database, no shared cache, no shared load balancer
Fails independently — a catastrophic failure in one cell affects at most 1–5% of users, not 100%

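A cell router sits at the edge and maps incoming requests to the correct cell based on a partition key, typically the customer ID or tenant ID. (This is distinct from shuffle sharding, a related isolation technique that assigns each customer to a small random subset of shared resources.) The mapping is static: either a pure function of the key, such as a hash, or an explicit assignment stored in a simple lookup table. The router itself is kept deliberately simple: it does nothing but look up the cell and proxy the request. It has no business logic and as few dependencies as possible, because it is the one component every cell shares.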
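
A minimal sketch of the routing decision, assuming cells are named `cell-01` to `cell-NN` and the partition key is a customer ID string. `CellRouter` and its `pinned` table are hypothetical names for illustration; a real router would also proxy the request.

```python
import hashlib
from typing import Dict, Optional


class CellRouter:
    """Map a customer ID to a cell: explicit assignments first, stable hash otherwise."""

    def __init__(self, cell_count: int,
                 pinned: Optional[Dict[str, str]] = None) -> None:
        if cell_count < 1:
            raise ValueError("need at least one cell")
        self._cell_count = cell_count
        # Lookup table for customers placed by hand (for example, very large tenants)
        self._pinned = dict(pinned or {})

    def route(self, customer_id: str) -> str:
        if customer_id in self._pinned:
            return self._pinned[customer_id]
        # SHA-256 gives the same result on every host and in every process,
        # unlike Python's built-in hash(), which is randomized per process.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self._cell_count
        return f"cell-{index + 1:02d}"
```

Note that a plain hash-modulo mapping moves many customers to different cells whenever the cell count changes. Storing assignments in a lookup table, as described above, avoids that at the cost of maintaining the table.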

Cell-based architecture: a bad deployment in Cell 2 is contained — the cell router stops routing new requests there while 9 other cells continue serving traffic normally.

Deploying in Cell-Based Systems

Cells are the primitive of safe, incremental deployment at scale. You do not deploy to all cells at once. You deploy to one cell, validate metrics, then progressively expand:

# Canary deployment across cells using AWS CodeDeploy or a custom script
# Assumes cells are tagged and traffic weights are managed via Route 53 weighted records
set -euo pipefail   # stop the rollout as soon as a deployment or health gate fails

# Step 1: Deploy to cell-01 only (1% of traffic)
aws deploy create-deployment \
  --application-name my-service \
  --deployment-group-name cell-01 \
  --revision "revisionType=GitHub,gitHubLocation={repository=org/repo,commitId=$COMMIT_SHA}"

# Wait for deployment health (check P99 latency + error rate for 10 min)
./wait-for-cell-health.sh --cell cell-01 --timeout 600 --error-threshold 0.1

# Step 2: Expand to cells 02 and 03 (10% total)
for cell in cell-02 cell-03; do
  aws deploy create-deployment \
    --application-name my-service \
    --deployment-group-name "$cell" \
    --revision "revisionType=GitHub,gitHubLocation={repository=org/repo,commitId=$COMMIT_SHA}"
done

./wait-for-cell-health.sh --cells cell-02,cell-03 --timeout 900 --error-threshold 0.1

# Step 3: Expand to remaining cells in batches
# If any cell fails health check, halt and rollback that cell only
./deploy-remaining-cells.sh --commit $COMMIT_SHA --batch-size 3

This is how Amazon deploys changes to services like S3 and DynamoDB — not as a single fleet-wide rollout, but as a careful progression across cells with automated canary metrics at each gate. A bug that slips past code review gets caught at 1% traffic, not 100%.
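
The gating logic behind such a rollout can be sketched independently of any deployment tool. In this minimal sketch, `staged_rollout` is a hypothetical function, and the `deploy`, `is_healthy`, and `rollback` callables stand in for whatever your pipeline provides (for example, a CodeDeploy call and a metrics query).

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(batches: Sequence[Sequence[str]],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Deploy batch by batch and halt at the first batch with an unhealthy cell.

    Returns (cells now running the new revision, cells that were rolled back).
    Cells in later batches are never touched once a gate fails.
    """
    completed: List[str] = []
    rolled_back: List[str] = []
    for batch in batches:
        for cell in batch:
            deploy(cell)
        failed = [cell for cell in batch if not is_healthy(cell)]
        for cell in failed:
            rollback(cell)  # roll back only the cells that failed the gate
        rolled_back.extend(failed)
        completed.extend(cell for cell in batch if cell not in failed)
        if failed:
            break  # stop the rollout; remaining cells keep the old revision
    return completed, rolled_back
```

Ordering the batches from smallest to largest, as in the script above, keeps the number of users exposed to an unverified revision as low as possible.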

Cell-based architecture is not only for hyperscalers. Slack partitions their message routing by workspace. Stripe partitions by merchant. Any multi-tenant SaaS where one tenant can cause a noisy-neighbor failure for others benefits from cell isolation. The entry cost is designing a good partition key (customer ID almost always works) and building or adopting a simple cell router. The savings in incident scope and deployment confidence are immediate.

Choosing the Right Tier of Resilience

These patterns are not mutually exclusive — they are complementary layers. A mature production architecture at big-tech scale combines all three: multi-AZ for infrastructure resilience, multi-region for geographic and regional resilience, and cells for software and deployment resilience. The decision of which layers to adopt is driven by your reliability requirements (SLA/SLO), your blast radius tolerance, and your operational budget for complexity.

Start with multi-AZ — it is the minimum viable production architecture. Add multi-region when your RTO/RPO requirements demand it or when global user latency is a product requirement. Adopt cells when your deployment confidence is limited by fear of correlated failures, or when a single customer's traffic spike regularly affects other customers.