Disaster Recovery & Multi-Region

DR Fundamentals: RTO & RPO

Every production system will fail. The question is not whether — it is how long the business can survive the failure and how much data it can afford to lose. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the two numbers that anchor every disaster-recovery conversation. Before you spec a cross-region Kubernetes deployment or a multi-master Aurora cluster, you need these numbers signed off by stakeholders — because they determine cost more than any other architectural decision you will make.

What RTO and RPO Actually Mean

RTO (Recovery Time Objective) is the maximum acceptable duration between a disaster event and the restoration of service to users. It answers: "How long can we be down?" An RTO of four hours means the service must be operational and passing health checks within four hours of the failure itself, not within four hours of someone noticing it or declaring an incident. Detection time counts against the budget.

RPO (Recovery Point Objective) is the maximum acceptable window of data loss, expressed as time: how old the newest recoverable data is allowed to be at the moment of failure. It answers: "How much data can we lose?" An RPO of one hour means that after recovery, the system state may be at most one hour behind the moment of failure. Transactions written in the final 59 minutes before the event may simply be gone.

RTO measures downtime; RPO measures data loss. Both are expressed in units of time, but they are independent axes. A financial trading system might demand RPO near zero (no lost trades) but tolerate an RTO of 30 minutes (a brief outage). A nightly batch analytics pipeline might accept a 24-hour RPO, because losing a day of output means re-running jobs (expensive, not dangerous), yet still require an RTO of about an hour.
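
To make the two axes concrete, here is a minimal Python sketch that derives the achieved RTO and RPO of one incident from three timestamps and checks each against its objective separately. The class names, the `evaluate` helper, and the sample incident are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass(frozen=True)
class Incident:
    failed_at: datetime            # moment the disaster occurred
    restored_at: datetime          # moment service passed health checks again
    last_recoverable_at: datetime  # timestamp of the newest data that survived

    @property
    def achieved_rto(self) -> timedelta:
        return self.restored_at - self.failed_at

    @property
    def achieved_rpo(self) -> timedelta:
        return self.failed_at - self.last_recoverable_at


def evaluate(incident: Incident, objectives: RecoveryObjectives) -> dict:
    """Check each axis on its own: one can pass while the other fails."""
    return {
        "rto_met": incident.achieved_rto <= objectives.rto,
        "rpo_met": incident.achieved_rpo <= objectives.rpo,
    }


# Hypothetical incident: 25 minutes of downtime, 2 minutes of lost writes.
incident = Incident(
    failed_at=datetime(2024, 1, 1, 12, 0),
    restored_at=datetime(2024, 1, 1, 12, 25),
    last_recoverable_at=datetime(2024, 1, 1, 11, 58),
)
# Trading-style objectives: loose on downtime, strict on data loss.
trading = RecoveryObjectives(rto=timedelta(minutes=30), rpo=timedelta(seconds=5))
print(evaluate(incident, trading))  # {'rto_met': True, 'rpo_met': False}
```

The same incident would pass both checks against batch-style objectives, which is why the two numbers have to be negotiated separately.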

The Cost Spectrum

Both objectives exist on a spectrum from "cold" (cheap, slow) to "hot" (expensive, instant). The diagram below maps the four canonical DR strategies against their RTO and RPO ranges. This is what you will sketch on a whiteboard with a VP of Engineering when they ask why resilience costs so much.

The four canonical DR strategies plotted against cost/complexity (vertical) and recovery time (horizontal). Moving right reduces cost but increases both RTO and RPO.

Deriving Realistic Objectives

RTO and RPO are business decisions disguised as engineering numbers. The SRE or platform engineer's job is to translate business impact into cost, then let stakeholders choose. The standard derivation process has three steps:

Quantify the cost of downtime. Revenue lost per minute, SLA penalty clauses, support-call surge costs, and reputational damage (harder to model). A payments processor losing $50,000/min will happily fund an active-active architecture. An internal HR portal losing $0/min can live with cold standby.
Quantify the cost of data loss. Can lost transactions be replayed from upstream events? Are there regulatory requirements (PCI-DSS, HIPAA, SOX) that impose a maximum RPO by law regardless of cost?
Map business requirements to DR tiers. Express each objective as a concrete infrastructure capability (e.g., "RPO ≤ 5 min" implies continuous asynchronous replication with lag monitoring), then price it.
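
The three steps above can be approximated with a small cost model. The following Python sketch is illustrative only: the option names, annual costs, RTO values, and the one-disaster-per-year rate are made-up placeholders, and it prices downtime only, not data loss or regulatory constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    annual_cost: float       # yearly infrastructure and operations spend (USD)
    expected_rto_min: float  # minutes of downtime per disaster


def expected_annual_cost(option: DrOption, downtime_cost_per_min: float,
                         disasters_per_year: float) -> float:
    """Infrastructure spend plus the expected yearly loss from downtime."""
    downtime_loss = disasters_per_year * option.expected_rto_min * downtime_cost_per_min
    return option.annual_cost + downtime_loss


def cheapest(options, downtime_cost_per_min: float, disasters_per_year: float) -> DrOption:
    """Pick the option with the lowest total expected yearly cost."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, downtime_cost_per_min, disasters_per_year),
    )


# Placeholder figures: replace with numbers from your own finance and ops teams.
OPTIONS = [
    DrOption("cold-standby", annual_cost=50_000, expected_rto_min=480),
    DrOption("replicated-standby", annual_cost=600_000, expected_rto_min=30),
    DrOption("active-active", annual_cost=1_500_000, expected_rto_min=0.5),
]

# Payments processor losing $50,000 per minute of downtime.
print(cheapest(OPTIONS, downtime_cost_per_min=50_000, disasters_per_year=1).name)
# Internal portal with no direct revenue loss.
print(cheapest(OPTIONS, downtime_cost_per_min=0, disasters_per_year=1).name)
```

With these placeholder inputs the first call selects `active-active` and the second selects `cold-standby`. The point of the model is the shape of the trade-off, not the specific numbers.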

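Always test your numbers against a real failure budget. If your SLO is 99.9% uptime (about 43 min/month of error budget) and your RTO is 4 hours, a single incident that runs to the full RTO consumes roughly five and a half months of budget, so you can afford at most about one such incident every six months. If your SLO is 99.99% (about 4.3 min/month), a 4-hour RTO is mathematically incompatible: one incident would burn more than four years of budget, and keeping a single incident within one month's budget requires an RTO of roughly four minutes or less.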
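
The error-budget arithmetic is easy to check in a few lines. This Python sketch assumes a 30-day month and treats every incident as running to the full RTO; the function names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def monthly_error_budget_min(slo: float) -> float:
    """Minutes of allowed downtime per 30-day month for an availability SLO."""
    return (1.0 - slo) * MINUTES_PER_MONTH


def months_of_budget_consumed(rto_min: float, slo: float) -> float:
    """How many months of error budget one worst-case (full-RTO) outage burns."""
    return rto_min / monthly_error_budget_min(slo)


for slo in (0.999, 0.9999):
    budget = monthly_error_budget_min(slo)
    burned = months_of_budget_consumed(240, slo)  # 240 min = 4-hour RTO
    print(f"SLO {slo:.2%}: budget {budget:.1f} min/month, "
          f"a 4 h outage burns {burned:.1f} months of budget")
```

Running the same functions with your own SLO and RTO is a quick way to catch an objective pair that cannot both hold.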

Measuring Current RTO and RPO

Before designing a new DR strategy, measure where you actually stand. The two standard methods are a DR drill (covered in Lesson 8) and continuous replication lag monitoring. For the latter, here is how you expose Postgres streaming-replication lag as a Prometheus metric and alert when it violates your RPO budget:

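-- Run on the PRIMARY (PostgreSQL 10+). Returns replay lag per standby.
-- Note: pg_last_xact_replay_timestamp() only returns a value on a standby,
-- so on the primary use the replay_lag column of pg_stat_replication instead.
SELECT
    application_name,
    state,
    -- replay_lag is NULL when a standby is fully caught up and idle; report 0.
    COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds,
    sent_lsn,
    replay_lsn,
    -- Bytes of WAL written on the primary but not yet replayed on the standby.
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag
FROM pg_stat_replication;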

Wrap this in a custom Prometheus exporter (or use postgres_exporter with a custom query file) and set an alert rule. If your RPO is 5 minutes, the alert fires well before you breach it:

# prometheus/rules/dr.yml
groups:
  - name: disaster_recovery
    rules:
      - alert: ReplicationLagExceedsRPO
        # 180s threshold + 1m hold: fires at roughly 4 min of lag, ahead of the 300s RPO
        expr: pg_replication_lag_seconds > 180
        for: 1m
        labels:
          severity: critical
          runbook: https://wiki.internal/runbooks/pg-replication-lag
        annotations:
          summary: "Postgres replication lag {{ $value }}s exceeds the 3-minute warning threshold"
          description: |
            Standby {{ $labels.application_name }} is {{ $value }} seconds behind primary.
            RPO budget is 300s. Investigate network partition or I/O saturation on replica.
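
For the custom-exporter route, the core job is turning query rows into the Prometheus text exposition format. The Python sketch below covers only that rendering step, using the standard library. Running the lag query and serving the text over HTTP are left out, and the function names and sample rows are hypothetical.

```python
def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_lag_metrics(rows) -> str:
    """Render (application_name, lag_seconds) rows as exposition-format text."""
    lines = [
        "# HELP pg_replication_lag_seconds Replay lag of each standby in seconds.",
        "# TYPE pg_replication_lag_seconds gauge",
    ]
    for application_name, lag_seconds in rows:
        label = escape_label(application_name)
        lines.append(
            f'pg_replication_lag_seconds{{application_name="{label}"}} {float(lag_seconds)}'
        )
    # The format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


# Hypothetical rows standing in for the result of the lag query.
print(render_lag_metrics([("standby-west", 12.5), ("standby-east", 0)]))
```

Emitting `application_name` as a label is what lets the alert description reference `{{ $labels.application_name }}`.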

Recovery objectives erode silently. A system that starts with a 15-minute RPO can drift to 3-hour effective RPO over 18 months as data volume grows, replication lag increases, and backup windows stretch — without anyone updating the DR plan. Schedule a quarterly DR metrics review the same way you schedule capacity reviews. Treat "measured RTO" and "measured RPO" as SLIs: track them in your dashboards and alert when they trend past 80% of the contractual threshold.
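
The 80% rule can be expressed as a tiny classifier that a quarterly review script might run over measured values. This is a sketch under stated assumptions: the function name, the status strings, and the quarterly measurements are invented for illustration.

```python
def objective_status(measured_s: float, objective_s: float, warn_ratio: float = 0.8) -> str:
    """Classify a measured RTO or RPO (seconds) against its contractual objective."""
    if measured_s > objective_s:
        return "breached"
    if measured_s > warn_ratio * objective_s:
        return "warning"
    return "ok"


# Hypothetical effective-RPO measurements drifting against a 15-minute objective.
RPO_OBJECTIVE_S = 15 * 60
quarterly_rpo_s = {"Q1": 400, "Q2": 610, "Q3": 750, "Q4": 1020}

for quarter, measured in quarterly_rpo_s.items():
    print(quarter, objective_status(measured, RPO_OBJECTIVE_S))
```

In this made-up series the third quarter trips the warning and the fourth is a breach, which is the kind of slow drift a quarterly review is meant to surface.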

RTO vs RPO in Kubernetes and Stateless Workloads

For stateless services running on Kubernetes (which you have been running since Tutorial 27), RTO is typically dominated by pod scheduling time plus readiness-probe convergence — often 30–90 seconds in a well-tuned cluster. The more nuanced conversation is about stateful workloads: databases, message-queue offsets, distributed caches. These are where RPO violations actually hurt.

A common mistake is assuming that "we have a standby region in AWS us-west-2" implies a defined RTO. It does not, unless you have automated failover logic, DNS records whose TTLs were lowered ahead of time, database promotion scripts, and a full sequence that has been timed under load in a fire drill. An untested DR plan has an unknown RTO. Treat it as infinite until proven otherwise.
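
One way to encode "untested means infinite" is to compute the measured RTO as the sum of timed drill steps and return infinity as soon as any step lacks a timing. The step names and durations in this Python sketch are hypothetical, and it assumes the steps run strictly in sequence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrillStep:
    name: str
    seconds: Optional[float]  # None means the step has never been timed


def measured_rto_seconds(steps: List[DrillStep]) -> float:
    """Sum sequential step timings; any untimed step makes the RTO unknown."""
    if not steps or any(step.seconds is None for step in steps):
        return float("inf")
    return float(sum(step.seconds for step in steps))


# Hypothetical timings from a fire drill.
drill = [
    DrillStep("detect failure and trigger failover", 180.0),
    DrillStep("promote standby database", 95.0),
    DrillStep("DNS cutover", 60.0),
    DrillStep("health checks pass in standby region", 45.0),
]
print(measured_rto_seconds(drill))  # 380.0

# Same plan, but the DNS cutover was never exercised.
untested = [s if s.name != "DNS cutover" else DrillStep(s.name, None) for s in drill]
print(measured_rto_seconds(untested))  # inf
```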

The Two-by-Two: Mapping Tiers to Real Systems

At FAANG-scale, services are tiered by criticality, and each tier has a published RTO/RPO contract enforced at the platform level:

Tier 0 (revenue-critical): Active-active, multi-region, RPO = 0 (synchronous replication), RTO < 30 s. Examples: checkout, payments, auth. Cost: 3–5× single-region infrastructure.
Tier 1 (customer-facing, non-blocking): Warm standby, async replication, RPO < 5 min, RTO < 15 min. Examples: search, recommendations, profile reads.
Tier 2 (internal / degraded-mode ok): Pilot light, RPO < 1 hr, RTO < 2 hr. Examples: analytics ingestion, internal dashboards.
Tier 3 (batch / non-critical): Cold standby or backup-restore, RPO < 24 hr, RTO < 8 hr. Examples: overnight reports, data-science notebooks.
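
A tier contract like this can be treated as a lookup table: given a service's required RPO and RTO, pick the least strict tier that still satisfies both. The Python sketch below copies the numeric bounds from the list above and treats them as inclusive upper limits. The `Tier` class and `cheapest_tier` function are hypothetical names, not part of any platform API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rpo: timedelta  # worst-case data-loss window the tier guarantees
    rto: timedelta  # worst-case downtime the tier guarantees


# Ordered strictest (most expensive) first.
TIERS = [
    Tier("Tier 0", rpo=timedelta(0), rto=timedelta(seconds=30)),
    Tier("Tier 1", rpo=timedelta(minutes=5), rto=timedelta(minutes=15)),
    Tier("Tier 2", rpo=timedelta(hours=1), rto=timedelta(hours=2)),
    Tier("Tier 3", rpo=timedelta(hours=24), rto=timedelta(hours=8)),
]


def cheapest_tier(required_rpo: timedelta, required_rto: timedelta) -> Optional[Tier]:
    """Return the least strict tier whose guarantees meet both requirements."""
    for tier in reversed(TIERS):  # try the cheapest tier first
        if tier.rpo <= required_rpo and tier.rto <= required_rto:
            return tier
    return None  # even Tier 0 cannot meet the requirement


# A service that tolerates a day of data loss but only four hours of downtime
# still lands in Tier 2, because the RTO axis is the binding constraint.
print(cheapest_tier(timedelta(hours=24), timedelta(hours=4)).name)  # Tier 2
```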

The remaining lessons in this tutorial build out each layer: backup architecture, cross-region replication, failover mechanics, GitOps-driven recovery, and game-day testing. Every decision traces back to the RTO and RPO you locked in here. Get those two numbers wrong — or leave them undefined — and every downstream trade-off becomes guesswork.