Reliability, Availability & Resilience

Redundancy for Availability

Every production system will eventually experience a failure — a server crashes, a disk fills up, a network link goes dark. The question is not whether something will fail, but whether the system continues to serve users when it does. Redundancy is the engineering discipline of eliminating single points of failure by duplicating critical components so that no single failure can take the whole system down.

What Is a Single Point of Failure?

A single point of failure (SPOF) is any component whose failure causes the entire system to become unavailable. Classic examples include:

  • One web server with no standby behind it
  • One database with no replica
  • One network switch connecting all servers in a rack
  • One region hosting all your infrastructure
  • One DNS provider with no secondary

Identifying SPOFs is the first step in reliability engineering. Draw your architecture end-to-end and ask: if this component disappears right now, what breaks? Any component where the answer is "everything" is a SPOF that must be addressed.

The SPOF audit: Walk every layer of your stack — client, DNS, CDN, load balancer, application tier, cache, database, message queue, object storage, third-party APIs. Each layer should have at least two independent instances capable of handling traffic, with no shared fate between them.
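The audit can be partly mechanized. The following is a minimal sketch, assuming you can export an inventory of (layer, instance, failure domain) tuples; the inventory contents and the `find_spofs` helper are hypothetical and only illustrate the "at least two instances, no shared fate" rule.

```python
from collections import defaultdict

# Hypothetical inventory: (layer, instance name, failure domain such as an AZ).
INVENTORY = [
    ("load_balancer", "lb-1", "az-a"),
    ("load_balancer", "lb-2", "az-b"),
    ("app", "app-1", "az-a"),
    ("app", "app-2", "az-a"),
    ("database", "db-1", "az-a"),
]

def find_spofs(inventory):
    """Return {layer: reason} for layers lacking two instances in distinct failure domains."""
    domains = defaultdict(set)
    counts = defaultdict(int)
    for layer, _name, domain in inventory:
        domains[layer].add(domain)
        counts[layer] += 1
    findings = {}
    for layer in counts:
        if counts[layer] < 2:
            findings[layer] = "only one instance"
        elif len(domains[layer]) < 2:
            findings[layer] = "all instances share one failure domain"
    return findings

spofs = find_spofs(INVENTORY)
for layer, reason in spofs.items():
    print(f"SPOF risk in {layer}: {reason}")
```

With this sample inventory the app tier is flagged for shared fate (both instances in az-a) and the database for having a single instance.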

Types of Redundancy

Redundancy is not one-size-fits-all. Engineers apply different forms depending on what is being protected and how fast a failover must occur.

Active-Active Redundancy

All redundant instances handle live traffic simultaneously. A load balancer distributes requests across them. If one instance fails, the others simply absorb its share. There is no failover delay — the remaining nodes are already warm and serving. This is the preferred model for stateless application tiers because capacity is never wasted.

Example: three application servers each handle one-third of traffic. One crashes. The load balancer detects the failure within seconds via health checks and routes all traffic to the remaining two. Users may see a brief spike in latency while the two survivors absorb the extra load (each now carries half of the traffic instead of a third), but no outage.
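To make the mechanism concrete, here is a toy round-robin balancer that skips backends marked unhealthy. It is a sketch only: the `RoundRobinBalancer` class and backend names are hypothetical, and a real load balancer would run health checks itself rather than being told about failures.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy active-active balancer: rotates over backends, skipping unhealthy ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [lb.pick() for _ in range(6)]
lb.mark_down("app-2")  # simulate a failed health check
after = [lb.pick() for _ in range(6)]
print(before)
print(after)
```

After `app-2` is marked down, every request lands on `app-1` or `app-3`; no promotion step is needed because both were already serving.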

Active-Passive (Hot Standby) Redundancy

One instance handles all traffic (the active node) while one or more identical instances stay on standby (the passive nodes), ready to take over instantly. Standby nodes receive the same data as the active node in real time, so their state is current. When the active node fails, a failover mechanism (virtual IP, DNS update, or a cluster manager like Pacemaker) promotes a standby to active.

This model is common for databases, where having two nodes simultaneously accepting writes would risk write conflicts and split-brain. PostgreSQL with Patroni, MySQL Group Replication in its default single-primary mode, and Amazon RDS Multi-AZ all use active-passive at the primary level.

Trade-off: the passive node consumes resources (CPU, RAM, disk, licensing fees) but handles zero requests. You are paying for idle capacity in exchange for fast failover, and for zero data loss only if replication to the standby is synchronous.
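The failover decision itself is often driven by missed heartbeats. Below is a simplified sketch, assuming a monitor that sees one heartbeat result per interval and promotes the standby after three consecutive misses; the `FailoverMonitor` class and the threshold are illustrative and not how any particular cluster manager is implemented.

```python
class FailoverMonitor:
    """Promote the standby after `threshold` consecutive missed heartbeats."""

    def __init__(self, active, standby, threshold=3):
        self.active, self.standby = active, standby
        self.threshold = threshold
        self.missed = 0

    def heartbeat(self, received):
        """Record one heartbeat interval and return the node that is currently active."""
        if received:
            self.missed = 0
            return self.active
        self.missed += 1
        if self.missed >= self.threshold and self.standby is not None:
            # Real cluster managers also fence the old active node here
            # to avoid split-brain; this sketch omits that step.
            self.active, self.standby = self.standby, None
            self.missed = 0
        return self.active

monitor = FailoverMonitor(active="db-primary", standby="db-standby")
timeline = [monitor.heartbeat(ok) for ok in (True, False, False, False, True)]
print(timeline)
# ['db-primary', 'db-primary', 'db-primary', 'db-standby', 'db-standby']
```

Requiring several consecutive misses trades a slightly slower failover for fewer false promotions caused by a single dropped heartbeat.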

Active-Passive (Cold Standby / Warm Standby)

Here the standby is not fully running and ready at all times. A warm standby is partially running: the OS and application are up but not accepting traffic, with data kept in sync. A cold standby is a provisioned but stopped instance that needs to be started and catch up on data before it can serve. Cold standby trades slower recovery for lower ongoing cost; use it for less critical services or large infrastructure (like a full DR region) where the cost of keeping hot standbys everywhere is prohibitive.

Figure: Active-Active vs Active-Passive redundancy comparison.
Active-Active (left): all nodes serve live traffic, failure is instantly absorbed. Active-Passive (right): one node handles all traffic, a standby waits in sync and takes over on failure.

Redundancy at Every Layer

Resilient systems apply redundancy at each tier of the stack independently, because a fault in any tier can be the SPOF that brings down availability. Here is how the layers stack up in practice:

Load Balancers

Your load balancer is itself a potential SPOF. Cloud providers (AWS ALB/NLB, GCP Load Balancing, Cloudflare) run multiple physical nodes behind a single virtual IP or Anycast address. If you run your own load balancers (HAProxy, Nginx), you need at least two in active-passive with a floating virtual IP managed by keepalived (VRRP). Without this, your load balancer is the SPOF even though you have multiple app servers.

Application Servers

The easiest layer to make redundant. Because good application servers are stateless (you covered this earlier), you simply run multiple identical instances across different physical hosts, ideally in different availability zones. The load balancer health-checks each instance and stops routing to unhealthy ones, typically within 10–30 seconds of a failure depending on the check interval and failure threshold. Run a minimum of 2 instances in production; 3 is better, because taking one down for a rolling deploy still leaves two running, so a failure during the deploy does not cause an outage.

Databases

Databases are the hardest tier to make redundant because of state. The standard approach is a primary + one or more read replicas. Reads can be distributed across replicas; writes go to the primary only. If the primary fails, a replica is promoted. Services like Amazon RDS Multi-AZ automate this: a synchronous standby in a different AZ keeps a hot copy with sub-second replication lag, and failover takes about 60–120 seconds automatically.
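The primary-plus-replicas routing rule can be sketched in a few lines. This is an illustration under simplifying assumptions: the `ReplicaSet` class and node names are hypothetical, and classifying a statement as a read by its `SELECT` prefix is a deliberate shortcut that real drivers and proxies do not rely on.

```python
class ReplicaSet:
    """Route writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next_read = 0

    def route(self, statement):
        """Return the node that should execute `statement`."""
        is_read = statement.lstrip().upper().startswith("SELECT")
        if not is_read or not self.replicas:
            return self.primary
        node = self.replicas[self._next_read % len(self.replicas)]
        self._next_read += 1
        return node

    def promote(self):
        """Primary failed: promote the first replica to primary."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)
        return self.primary

db = ReplicaSet("db-primary", ["db-replica-1", "db-replica-2"])
print(db.route("SELECT * FROM orders"))    # db-replica-1
print(db.route("INSERT INTO orders ..."))  # db-primary
db.promote()                               # simulate loss of the primary
print(db.route("UPDATE orders SET ..."))   # db-replica-1
```

After promotion, writes go to the former replica and the remaining replica keeps serving reads, which is the behavior managed services automate.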

Replication lag matters. Asynchronous replication is fast and widely used but allows a narrow window where the replica is behind the primary. If the primary crashes during that window, recently committed writes can be lost. For financial or critical data, use synchronous replication — the write is only acknowledged once it is committed on both nodes. The trade-off is slightly higher write latency (~5–10 ms for a same-region standby).
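The data-loss window can be shown with a toy model. The sketch below assumes a single primary and a single replica held in memory; the `Primary` class is hypothetical and ignores networking, but it captures when a write is acknowledged relative to when the replica has it.

```python
class Primary:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.log = []          # writes committed on the primary
        self.replica_log = []  # writes that have reached the replica
        self.acked = []        # writes acknowledged to the client

    def write(self, value):
        self.log.append(value)
        if self.synchronous:
            # Synchronous: the replica must have the write before we acknowledge.
            self.replica_log.append(value)
        self.acked.append(value)

    def ship_pending(self):
        """Asynchronous shipping: copy whatever the replica has not seen yet."""
        self.replica_log = list(self.log)

def lost_after_crash(primary):
    """Acknowledged writes missing from the replica if the primary dies now."""
    return [w for w in primary.acked if w not in primary.replica_log]

async_db = Primary(synchronous=False)
async_db.write("order-1")
async_db.ship_pending()
async_db.write("order-2")  # primary crashes before this write is shipped

sync_db = Primary(synchronous=True)
sync_db.write("order-1")
sync_db.write("order-2")

print(lost_after_crash(async_db))  # ['order-2']
print(lost_after_crash(sync_db))   # []
```

The asynchronous primary acknowledged `order-2` before the replica had it, so promoting the replica loses that write; the synchronous primary cannot acknowledge a write the replica lacks.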

Caches (Redis, Memcached)

A cache SPOF is usually less catastrophic than a database SPOF: requests degrade to hitting the database rather than failing outright. But a stampede of cache misses after a cache node dies can overload the database and cause a cascade. Run Redis with Redis Sentinel (automatic failover for a primary plus its replicas) or as a Redis Cluster (sharded, with each shard having its own primary and replica). Memcached has no native replication; have clients use consistent hashing so that a node failure only loses that node's fraction of the cache.
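To see why consistent hashing limits the damage, here is a small hash ring with virtual nodes. It is a sketch, not a client library: the `HashRing` class, the MD5-based hash, the 100 virtual nodes, and the cache node names are all assumptions chosen for illustration.

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]
full = HashRing(["cache-1", "cache-2", "cache-3", "cache-4"])
degraded = HashRing(["cache-1", "cache-2", "cache-3"])  # cache-4 has died

moved = sum(1 for k in keys if full.node_for(k) != degraded.node_for(k))
owned_by_dead = sum(1 for k in keys if full.node_for(k) == "cache-4")
print(f"{moved} of {len(keys)} keys moved; {owned_by_dead} lived on cache-4")
```

The two counts are equal: only keys that lived on the failed node are remapped, so roughly one quarter of the cache goes cold instead of all of it.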

DNS and CDN

DNS itself is redundant by design (a zone is normally served by multiple name servers), but your DNS provider can be a SPOF. If Route 53 or Cloudflare has an outage and it is your only provider, your domain can become unreachable. Large companies use two DNS providers, listing both sets of name servers in the domain's delegation. CDNs like Cloudflare act as redundant reverse proxies and can serve cached content even when your origin is down: configure your CDN to serve a stale page or a maintenance page on origin failure rather than returning a 502 to users.

Availability Zones and Regions

An Availability Zone (AZ) is one or more physically separate data centers within a cloud region, with independent power, networking, and cooling. Spreading your resources across 2–3 AZs in a region usually adds little cost beyond cross-AZ data transfer and protects against a whole data-center failure. Multi-region redundancy protects against regional outages (rare but high-impact). It requires either active-active with global load balancing and a globally consistent or eventually consistent database, or active-passive with DNS-based failover and an RPO/RTO you can live with.

Figure: Multi-tier redundancy across two availability zones.
Two app servers and a cache in each AZ, one shared primary database with a synchronous standby replica across AZs. An entire AZ can fail without a user-visible outage.

N+1 Redundancy — The Practical Minimum

The industry rule of thumb is N+1 redundancy: if you need N instances to handle your current traffic, run N+1. That extra instance means you can sustain one failure without degrading capacity. For higher availability targets, you can use N+2 (survives two simultaneous failures) or 2N (a full redundant set: each component is fully duplicated, so the standby set can handle 100% of load alone). 2N is common for life-critical systems, and a primary with a full-size standby in another AZ, as in multi-AZ database deployments, is effectively a 2N arrangement.
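The sizing rule is simple arithmetic. The following sketch assumes you know peak load and per-instance capacity; the `instances_needed` helper and the 4,500 and 1,000 requests-per-second figures are hypothetical.

```python
import math

def instances_needed(peak_rps, rps_per_instance, scheme="N+1"):
    """Return how many instances to run for a given redundancy scheme."""
    n = math.ceil(peak_rps / rps_per_instance)  # N: enough to carry the load
    extra = {"N": 0, "N+1": 1, "N+2": 2, "2N": n}
    return n + extra[scheme]

for scheme in ("N", "N+1", "N+2", "2N"):
    print(scheme, instances_needed(peak_rps=4500, rps_per_instance=1000, scheme=scheme))
# N 5
# N+1 6
# N+2 7
# 2N 10
```

Note how the relative cost of N+1 shrinks as N grows: one extra instance on top of five is a 20% overhead, while 2N always doubles the fleet.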

Over-provisioning is not the same as redundancy. Running one very powerful server with plenty of spare capacity is not redundant — it is just headroom. Redundancy means independent instances, ideally on separate physical hardware, separate power circuits, and separate network paths. A single beefy server can fail just as completely as a small one.

Shared Fate and the Independent Failure Domain

Redundant components must not share a common failure mode — otherwise, what looks like two independent copies can fail together. This concept is called shared fate. Real examples where shared fate kills redundancy:

  • Two database replicas on the same physical host — one hardware failure takes both.
  • Two app servers in the same AZ during an AZ power failure.
  • Primary and replica both using the same EBS volume (rare, but possible misconfiguration in cloud).
  • All your backups in the same AWS account — an account compromise or misconfigured delete policy destroys all backups simultaneously.

The fix is to ensure redundant components belong to independent failure domains: separate hosts, separate racks, separate AZs, separate regions, or separate cloud accounts as appropriate for the risk level.

Cost of Redundancy

Redundancy is not free. Running N+1 app servers costs 1/N more than running N: 50% more when N is 2, but only 10% more when N is 10. A synchronous database standby in another AZ roughly doubles your primary database cost. Cross-AZ data transfer in AWS is priced at about $0.01/GB in each direction, which is negligible for most workloads but material for high-throughput services (e.g., 100 TB/month of DB replication across AZs at $0.01/GB is $1,000/month just in transfer fees).

The business case is straightforward: compare the cost of redundancy against the cost of downtime. One hour of downtime for a $10M/year SaaS is roughly $1,140 in lost revenue — plus support costs, churn risk, and SLA penalties. A few hundred dollars per month in extra infrastructure is almost always cheaper than the first incident it prevents.
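The comparison can be written down directly. This sketch counts lost revenue only (no support costs, churn, or SLA penalties), and the $500/month redundancy cost is a hypothetical figure standing in for "a few hundred dollars per month".

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_cost_per_hour(annual_revenue):
    """Average revenue earned per hour, used as a floor for downtime cost."""
    return annual_revenue / HOURS_PER_YEAR

def breakeven_hours(monthly_redundancy_cost, annual_revenue):
    """Hours of avoided downtime per year that pay for the redundancy."""
    yearly_cost = monthly_redundancy_cost * 12
    return yearly_cost / downtime_cost_per_hour(annual_revenue)

print(round(downtime_cost_per_hour(10_000_000)))       # 1142
print(round(breakeven_hours(500, 10_000_000), 1))      # 5.3
```

Under these assumptions, redundancy costing $500/month pays for itself if it prevents a little over five hours of downtime per year, before counting any indirect costs.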

Availability math: Two independent components each with 99% uptime, if one can fail over to the other, give a combined availability of 1 - (0.01 × 0.01) = 99.99%. Three give 99.9999%. Redundancy is the primary lever for moving from two nines to four nines of availability.
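The same arithmetic in code, as a small sketch that assumes failures are fully independent (which shared fate can violate). The function names are illustrative. The serial case is included because tiers that must all be up multiply their availabilities.

```python
def parallel_availability(availability, copies):
    """Availability of `copies` independent redundant components (any one suffices)."""
    return 1 - (1 - availability) ** copies

def serial_availability(*tiers):
    """Availability of a chain of tiers that must all be up at once."""
    result = 1.0
    for a in tiers:
        result *= a
    return result

pair = parallel_availability(0.99, 2)
triple = parallel_availability(0.99, 3)
print(f"{pair:.4%}")    # 99.9900%
print(f"{triple:.6%}")  # 99.999900%

# Three tiers in series, each built from a redundant pair of 99% components.
stack = serial_availability(pair, pair, pair)
print(f"{stack:.4%}")   # 99.9700%
```

The last line shows why every tier needs redundancy: three four-nines tiers in series already fall slightly below four nines overall.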

Key Takeaways

  • A single point of failure is any component whose failure takes the whole system down. Audit every layer.
  • Active-Active runs all redundant nodes live — zero idle waste, instant failover. Use for stateless tiers.
  • Active-Passive keeps a standby in sync — fast failover but idle cost. Use for stateful tiers like databases.
  • Spread redundant components across independent failure domains (separate hosts, AZs, regions) — shared fate defeats the purpose of redundancy.
  • N+1 is the minimum acceptable redundancy for production. Use 2N for critical tiers where you must survive a full instance set failure.
  • Apply redundancy at every layer (load balancers, app servers, caches, databases, DNS). A redundant pair turns two-nines components into a roughly four-nines tier, and because tiers in series multiply, every tier needs it.