High Availability Architectures
High Availability Architectures
A database that goes down takes your entire product with it. Every other layer of the stack — load balancers, application servers, CDNs — can be made stateless and replaced in seconds. Databases cannot. They own the state, and losing them means losing user data, transaction history, and often the business itself. High availability (HA) architecture is the set of patterns that make a database cluster survive node failures, network partitions, and AZ outages without human intervention. This lesson covers the three pillars of production HA: replication topology, synchronous versus asynchronous replication trade-offs, and automatic failover.
The Primary/Replica Model
The foundational HA pattern for relational databases is primary/replica replication (historically called master/slave). One node — the primary — accepts all writes. One or more replica nodes receive a stream of changes from the primary and replay them to maintain an identical copy of the data.
This separation gives you three things simultaneously:
- Redundancy: if the primary dies, a replica can be promoted to take its place.
- Read scalability: read-heavy workloads can distribute SELECT queries across replicas, relieving primary write pressure.
- Operational safety: long-running analytics queries run on a replica without slowing down OLTP writes on the primary.
DROP TABLE. Replication and backups solve different failure modes and both are required in production.In PostgreSQL, replication is built on WAL (Write-Ahead Log) streaming. Every transaction on the primary is first written to the WAL; replicas connect as streaming replication clients and receive WAL segments in near-real-time. In MySQL/MariaDB, the equivalent is the binary log (binlog) streamed in row-based format. Both mechanisms mean replicas are not reading from the primary's tables directly — they are replaying a log of changes, which makes replication lightweight even at high write rates.
Synchronous vs. Asynchronous Replication
The single most important architectural decision in database HA is whether replication is synchronous or asynchronous. This is not a preference — it is a direct trade-off between durability guarantees and write latency, and the wrong choice causes data loss or unacceptable slowdowns.
Asynchronous replication (default in most engines): The primary writes to its own WAL, commits, acknowledges success to the client, and then ships the WAL to replicas in the background. If the primary crashes before the replica has consumed the latest entries, those entries are lost permanently — this is called replication lag becoming data loss during a failover. The upside is that the commit path on the primary adds zero latency from network round-trips to replicas.
Synchronous replication: The primary writes to the WAL and only acknowledges the client commit after at least one designated synchronous replica has confirmed it received and flushed the WAL entry to durable storage. No data can be lost in a failover to that replica. The cost: every write incurs an additional network round-trip to the replica — typically 0.5–2 ms in the same AZ, 5–30 ms cross-AZ. At high write rates (thousands of transactions per second), this latency compounds.
The industry solution at companies like Google (Spanner, AlloyDB) and AWS (Aurora) is to use a quorum-based write protocol that achieves synchronous durability across multiple nodes while hiding most of the latency penalty. Aurora, for example, writes the log record to 6 storage nodes across 3 AZs and acknowledges the commit after 4 of 6 confirm — this provides synchronous guarantees with much better availability than a single synchronous replica.
For PostgreSQL specifically, you configure synchronous commit at the session or server level:
MySQL/InnoDB Group Replication and semi-synchronous plugins offer analogous controls. In a managed service (RDS Multi-AZ, CloudSQL), the sync/async mode is set at the instance level and the engine handles the WAL shipping transparently.
Automatic Failover
Having a warm replica is not HA. HA requires that when the primary fails, a replica is promoted to primary automatically — without a human paging someone at 3 AM to run pg_ctl promote. Automatic failover is the operational differentiator between a system with replicas and a system with true HA.
The components of a production automatic failover system are:
- Health monitoring: an agent or sidecar that continuously polls the primary. A missed heartbeat alone is not sufficient — good implementations distinguish between "primary unreachable from this node" and "primary unreachable from the majority of the cluster" to avoid split-brain.
- Leader election / quorum: a distributed consensus mechanism (typically based on Raft or Paxos — etcd, ZooKeeper, or built-in engine consensus) that safely elects exactly one new primary. The quorum requirement prevents two nodes simultaneously promoting themselves and accepting conflicting writes.
- Promotion: the winning replica is promoted to primary — WAL replay stops, write connections are accepted.
- Re-routing: the connection layer (virtual IP, DNS failover, or a proxy like PgBouncer/ProxySQL) redirects application traffic to the new primary. Applications should use a single endpoint that always points to the current primary — baking in replica IPs breaks this.
- Old primary fencing (STONITH): the previous primary is isolated before the new primary accepts writes. Without fencing, a primary that paused (GC, network blip) and comes back can momentarily believe it is still primary, causing a split-brain write window.
Patroni: Production Automatic Failover for PostgreSQL
The de-facto open-source HA solution for PostgreSQL is Patroni, used in production at Zalando, GitLab, Aiven, and many others. Patroni runs as a sidecar on each PostgreSQL node and uses etcd (or Consul, or ZooKeeper) as the distributed consensus store to hold the cluster leader lock. The node that holds the lock is primary; all others are replicas. A minimal Patroni configuration looks like:
With maximum_lag_on_failover set, Patroni refuses to promote a replica that has fallen too far behind — preventing the promotion of a stale node that would roll back committed transactions. The combination of synchronous_mode: true and a low lag threshold gives you strong durability guarantees without building custom orchestration.
Failover RTO and RPO in Practice
Two metrics govern what your HA architecture actually promises to the business:
- RTO (Recovery Time Objective): how long the database is unavailable during a failover. With Patroni, a typical automatic failover completes in 15–45 seconds from the moment the primary stops heartbeating (Patroni's
ttlsetting is the dominant factor). Managed services like RDS Multi-AZ typically achieve 20–60 seconds. Plan your application connection retry logic around this window — circuit breakers and retry with backoff are not optional. - RPO (Recovery Point Objective): how much data can be lost in the worst case. With asynchronous replication, RPO equals the replication lag at the time of failure — potentially seconds to minutes of committed transactions. With synchronous replication on at least one replica, RPO is zero for a failover to that replica.
Multi-AZ vs. Multi-Region HA
Synchronous replication across availability zones (AZs within the same AWS region) is the production baseline — the network round-trip is 0.5–2 ms, making synchronous commit practical. Cross-region synchronous replication (latency 50–200 ms) is almost never used for OLTP because the write latency penalty is prohibitive. Cross-region replicas are almost always asynchronous and serve as disaster recovery (DR) targets, not automatic failover candidates. The trade-off: if the primary region is entirely lost, promoting a cross-region async replica is a manual operation and accepts data loss equal to the replication lag.
AWS Aurora Global Database solves part of this by using its custom storage layer to replicate across regions with under 1 second of lag — still asynchronous in the CAP sense, but operationally much faster than traditional streaming replication. For systems that require zero cross-region RPO, Spanner or CockroachDB with synchronous multi-region consensus are the options, at significant cost and complexity.