Disaster Recovery & Multi-Region

Data Replication Across Regions

18 min Lesson 4 of 27

Data Replication Across Regions

Moving compute across regions during a failover is relatively straightforward — you re-point DNS, spin up instances, and let your orchestrator reschedule pods. Moving data is a fundamentally harder problem. Data has weight: it must be consistent, durable, and available at the right place at the right time. The choices you make about replication topology determine your effective RPO and set a hard lower bound on your RTO. Get them wrong and your DR plan exists only on paper.

Synchronous vs Asynchronous Replication

Every cross-region replication design starts with one decision: does the primary wait for the remote replica to confirm a write before acknowledging success to the caller?

Synchronous replication means the write is not acknowledged until at least one replica in the secondary region has durably written it. RPO is effectively zero — no committed transaction can be lost. The cost is latency: if your primary is us-east-1 and your replica is eu-west-1, the round-trip is roughly 80 ms. Every write adds that RTT to its latency budget. At p99 this compounds. For OLTP workloads at scale, synchronous cross-region replication is rarely feasible beyond a few hundred kilometres.

Asynchronous replication means the primary acknowledges the write immediately and ships the change log to the replica in the background. Write latency is unchanged. The trade-off is replication lag: if the primary fails before the replica applies the last few seconds of the log, those transactions are lost. At AWS with RDS MySQL, typical cross-region async lag sits between 50 ms and a few seconds under normal load; it can spike to minutes during a write storm or a large DDL migration.

Key insight: Neither mode is universally correct. Synchronous replication is appropriate for financial ledger writes, payment confirmations, and anything where data loss is unacceptable and write latency is tolerable. Async is appropriate for session state, analytics events, product catalogue, and anything where a few seconds of potential loss is acceptable and write throughput matters.

Replication Lag is a Production Trap

Teams often configure async replication, measure lag at 50 ms during testing, and declare their RPO as "near zero." This is dangerous. Replication lag is not a constant — it is a function of write rate, network conditions, and replica I/O capacity. During a large batch import, a schema migration, or a traffic spike, lag routinely climbs to minutes. If the primary fails at that moment, you lose minutes of data, not milliseconds. Always monitor Seconds_Behind_Source (MySQL) or pg_stat_replication.write_lag (PostgreSQL) as an SLO metric with alerting thresholds.

Production pitfall: Never read your RPO off the steady-state lag. Measure the maximum replication lag observed over 30 days, especially during deploy windows and batch jobs. That maximum is your realistic worst-case RPO. Many teams discover their "5-second RPO" database regularly hits 4 minutes of lag during the weekly analytics export.

Cross-Region Replication: Database-Specific Options

At big-tech scale, teams choose their replication mechanism based on their database engine, their RPO/RTO targets, and their operational complexity tolerance.

Amazon Aurora Global Database

Aurora Global Database uses a dedicated replication layer that bypasses the database's own redo log shipping. Changes are written to Aurora's distributed storage layer (SSD-backed, six copies across three AZs in the primary region) and replicated to up to five secondary regions using a proprietary fast replication protocol. Typical replication lag is under one second — often under 100 ms. During a failover, the secondary region is promoted with an RPO under one second and an RTO of approximately one minute. This is purpose-built DR architecture at managed-service quality.

-- Promote a secondary Aurora Global Database region (AWS CLI)
aws rds failover-global-cluster \
  --global-cluster-identifier my-global-cluster \
  --target-db-cluster-identifier arn:aws:rds:eu-west-1:123456789012:cluster:my-replica \
  --allow-data-loss   # omit this flag for a planned, zero-data-loss switchover

-- Check replication lag on each secondary
SELECT aws_region, status, replica_lag_in_msec
FROM information_schema.replica_host_status
ORDER BY replica_lag_in_msec DESC;

PostgreSQL Logical Replication Across Regions

For self-managed PostgreSQL on EC2 or bare metal, logical replication (available since PG 10) allows you to replicate individual tables or sets of tables to a remote subscriber. Unlike streaming (physical) replication, logical replication survives major version mismatches and allows the subscriber to be writable on non-subscribed tables — a key property for active-active patterns. The trade-off is that logical replication does not replicate DDL; schema changes must be applied manually to subscribers before they are applied on the publisher, or you will break the replication slot.

-- On the primary (publisher) in us-east-1
ALTER SYSTEM SET wal_level = logical;
SELECT pg_reload_conf();

CREATE PUBLICATION orders_pub FOR TABLE orders, order_items, payments;

-- On the replica (subscriber) in eu-west-1 — tables must already exist with matching schema
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.us-east-1.internal port=5432 dbname=app user=replicator password=SECRET sslmode=require'
  PUBLICATION orders_pub;

-- Monitor lag on the publisher
SELECT slot_name,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';

Apache Kafka Cross-Region Mirroring (MirrorMaker 2)

For event-streaming data — the backbone of modern data pipelines — Kafka MirrorMaker 2 (MM2) replicates topics across clusters in different regions with configurable consumer offset translation. Topics in the secondary cluster are prefixed with the source cluster alias (us-east.orders), which means failover consumers need only re-point their bootstrap servers and adjust topic names. MM2 is built on Kafka Connect and inherits its operational model: distributed workers, REST-managed connectors, and connector status visible via the Connect API.

# MirrorMaker 2 connector config — replicate all topics matching the pattern
# Deployed as a Kafka Connect connector on the DR cluster in eu-west-1

{
  "name": "us-east-to-eu-west-mirror",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "us-east",
    "source.cluster.bootstrap.servers": "kafka-us-east.internal:9092",
    "target.cluster.bootstrap.servers": "kafka-eu-west.internal:9092",
    "topics": "orders,payments,user-events",
    "replication.factor": "3",
    "sync.topic.configs.enabled": "true",
    "sync.topic.acls.enabled": "true",
    "emit.heartbeats.enabled": "true",
    "refresh.topics.interval.seconds": "30"
  }
}

# Check consumer group offset replication lag (run on the DR cluster)
kafka-consumer-groups.sh \
  --bootstrap-server kafka-eu-west.internal:9092 \
  --describe \
  --group us-east.orders-service

Conflict Handling in Active-Active Replication

When both regions accept writes — the active-active pattern — the same logical record can be mutated concurrently in two regions. This is a distributed systems fundamentals problem with no perfect solution. The practical options are:

Last-write-wins (LWW) by wall clock: The write with the highest timestamp wins. Simple but dangerous — clock skew between regions means a write from 200 ms ago can silently overwrite a more recent write from the other region. Use only for records where occasional overwrites are acceptable (user preferences, cache entries).
LWW by logical clock (CRDT or version vector): Replace wall clock with a Lamport timestamp or vector clock. CockroachDB and DynamoDB use variants of this approach. More correct, still lossy — one write is silently discarded.
Application-level conflict detection: Tag every write with an origin region and a sequence number. On merge, detect conflicts explicitly and push them to an application-layer resolution queue. This preserves both writes and lets business logic decide (e.g., sum the inventory deltas rather than pick a winner). This is the correct approach for financial data but adds significant operational complexity.
Avoid conflicts by design (entity ownership): Partition entities so each region owns a disjoint subset. Region A owns even-numbered user IDs, Region B owns odd-numbered ones. No entity is ever written by two regions simultaneously. This is the most operationally reliable approach and the one most large-scale systems converge on after fighting LWW bugs in production.

Big-tech practice: Most mature platforms use entity ownership partitioning for their critical data stores and accept active-passive for the 5–10% of entities that cannot be partitioned. True multi-master with conflict resolution is reserved for metadata and configuration — not transactional data. The engineering cost of reasoning about concurrent writes across regions is substantially higher than it appears in architecture diagrams.

Cross-Region Replication Topology Diagram

Async DB replication and Kafka mirroring from a primary region to a DR region. On failover the replica is promoted and the standby application becomes active.

Monitoring Replication Health as an SLO

Replication lag is not a background operational metric — it is a first-class SLO because it is your real-time RPO. Instrument it accordingly. At Amazon scale, teams alert on two thresholds: a warning at 10× the average lag (indicating a developing problem) and a critical alert at the SLO breach point (e.g., 60 seconds for a system with a 1-minute RPO commitment). The critical alert should page the on-call engineer, not just email the team.

For Aurora Global Database, the key CloudWatch metric is AuroraGlobalDBReplicationLag measured per secondary cluster. For self-managed PostgreSQL, expose pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as a Prometheus gauge via postgres_exporter. For Kafka MM2, the metric kafka.connect:type=MirrorSourceConnector,attribute=replication-latency-ms gives per-topic end-to-end lag including connector processing time.

Operational practice: Test your failover promotion path quarterly under real lag conditions. Stop the primary, measure the lag at the time of failure, confirm that the replica promotion produces exactly that many seconds of lost data (no more), and confirm the standby application reconnects cleanly. If you have never exercised the promotion path under realistic lag, your RPO is theoretical.