Databases in Production

Project: A Production Database Runbook

18 min Lesson 10 of 30

Project: A Production Database Runbook

A runbook is not documentation written for auditors — it is the document your on-call engineer reads at 2 AM with adrenaline running and Slack pinging. Every section must answer a single question in under thirty seconds: "What do I do right now?" This lesson synthesises every technique from this tutorial — high availability, backups, zero-downtime migrations, connection pooling, and observability — into a single, coherent runbook for a sample production PostgreSQL cluster. You will also see how to validate each procedure so the runbook stays trustworthy over time.

The Sample System

Our reference system is a PostgreSQL 16 cluster running behind PgBouncer on a three-node Patroni setup: one primary, two synchronous replicas, with an HAProxy VIP routing writes to port 5000 and reads to port 5001. Backups are taken nightly via pgBackRest to an S3-compatible object store and retained for 30 days. Schema migrations are managed with Flyway and applied through a CI/CD pipeline.

Production database cluster layout: HAProxy, PgBouncer, Patroni, pgBackRest Application Pods HAProxy VIP :5000 writes · :5001 reads PgBouncer Pool transaction-mode · pool_size=50 Primary (Leader) Patroni + pg 16 Replica 1 (sync) hot standby Replica 2 (sync) hot standby pgBackRest → S3 etcd (Patroni DCS)
Reference cluster: HAProxy routes write/read traffic through PgBouncer to a three-node Patroni cluster; pgBackRest ships WAL archives to S3; etcd is the Distributed Configuration Store.

Part 1 — High Availability Procedures

Planned Failover (Maintenance Switchover)

A planned switchover demotes the current leader cleanly, allows replicas to drain outstanding WAL, then promotes a replica. Patroni makes this a single command, but the runbook must state what to verify before and after.

# 1. Verify cluster health before touching anything patronictl -c /etc/patroni/patroni.yml list # Expected: one Leader, at least one running Replica, replication_lag = 0 # 2. Check that both replicas are fully caught up psql -h localhost -U postgres -c " SELECT application_name, state, sent_lsn - replay_lsn AS replay_lag_bytes, sync_state FROM pg_stat_replication; " # Expected: replay_lag_bytes = 0 for both replicas # 3. Execute the switchover (picks the most advanced replica as new primary) patronictl -c /etc/patroni/patroni.yml switchover --master db-node-1 --candidate db-node-2 --scheduled now --force # 4. Confirm new leader patronictl -c /etc/patroni/patroni.yml list # 5. Confirm HAProxy VIP now points to the new primary (port 5000) psql -h vip.db.internal -p 5000 -U app_user -c "SELECT pg_is_in_recovery();" # Expected: f (false — new primary confirms writes)

Unplanned Failover (Node Crash)

When the primary crashes, Patroni detects the outage via etcd TTL expiry (default 30 s), fences the failed node by revoking its etcd lock, promotes the most advanced replica, and updates the HAProxy backend. The runbook must specify the verification steps after automatic promotion, not how to trigger the promotion — Patroni does that itself.

# After the alert fires and you open your laptop: # 1. What does Patroni see? patronictl -c /etc/patroni/patroni.yml list # Look for a new Leader; if no Leader exists, etcd may be partitioned # 2. Confirm the new primary is accepting writes psql -h vip.db.internal -p 5000 -U app_user -c " SELECT now(), pg_is_in_recovery(), pg_current_wal_lsn(); " # 3. Check for replication slots that were pinned to the dead node psql -h vip.db.internal -p 5000 -U postgres -c " SELECT slot_name, active, pg_size_pretty( pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots WHERE NOT active; " # DROP any inactive slots that are blocking WAL reclamation # psql -c "SELECT pg_drop_replication_slot('stale_slot_name');" # 4. Re-join the recovered node as a replica (Patroni handles this automatically # once the node comes back online — it calls pg_basebackup internally) # Monitor with: patronictl -c /etc/patroni/patroni.yml list # 5. Update PagerDuty/status page and open an incident ticket
Stale replication slots are a silent disk bomb. An inactive slot causes the primary to retain WAL indefinitely for the missing consumer. A 10 TB database can fill its WAL directory in hours if a slot pointed at a dead CDC consumer is left in place. The runbook must include an immediate check for inactive slots after every unplanned failover.

Part 2 — Backup and Restore Procedures

Verifying Nightly Backup Completion

# Check pgBackRest backup info from the standby (preferred — no I/O impact on primary) pgbackrest --stanza=db-primary info # Expected output includes: # status: ok # db (current): ... # full backup: timestamp < 25h ago # wal archive: min = ... max = [within 5 min of now] # If backup is missing or WAL archive is stale, check the log: journalctl -u pgbackrest-backup --since "24 hours ago" | grep -E "ERROR|WARN|P00"

Point-in-Time Restore Procedure

This is the procedure that matters most and is tested least. Run a full restore drill to a staging cluster every quarter. The runbook must contain the exact commands so the drill and the real recovery use identical steps.

# Restore to a specific point in time (e.g., 30 minutes before the bad migration) # Run this on a SEPARATE recovery host — never restore over a live primary # 1. Stop PostgreSQL if running on the recovery host systemctl stop postgresql # 2. Clear the data directory rm -rf /var/lib/postgresql/16/main/* # 3. Restore from the nearest full backup, applying WAL through the target time pgbackrest --stanza=db-primary \ --delta \ --type=time \ "--target=2025-09-15 03:45:00 UTC" \ --target-action=promote \ restore # 4. Start PostgreSQL — it will replay WAL up to the target time, then promote systemctl start postgresql # 5. Verify data integrity at the restored point psql -U postgres -c "SELECT max(created_at) FROM orders;" # Confirm it is consistent with the expected timestamp # 6. Run application smoke tests against the recovery host before cutover # If good, promote this node and re-point the application connection string
Quarterly restore drill: Schedule a calendar event every quarter to spin up a restore from production backups to a temporary EC2 instance. Time it, document the actual RTO, and compare against your SLO. If your RTO SLO is 4 hours but the drill takes 5.5 hours, you have a gap to close before the real incident — not during it.

Part 3 — Zero-Downtime Migration Procedure

Every schema migration goes through four gates before reaching production. The runbook defines what each gate checks.

# Gate 1: Local review — does the migration require a full table lock? # Dangerous: ALTER TABLE users ADD COLUMN age INT NOT NULL DEFAULT 0; # (acquires AccessExclusiveLock on the whole table for the DEFAULT backfill) # Safe: ALTER TABLE users ADD COLUMN age INT; -- then backfill in batches # Gate 2: Staging apply — measure execution time and lock wait time psql -U postgres -c " SET lock_timeout = '2s'; -- fail fast if another query holds the lock SET statement_timeout = '30s'; \i migrations/V45__add_age_column.sql " # Gate 3: CI pipeline dry-run using Flyway's validate + dryRun mode flyway -url=jdbc:postgresql://staging-db:5432/app \ -user=flyway \ -password=\${FLYWAY_PASSWORD} \ -dryRunOutput=/tmp/V45_dryrun.sql \ migrate # Gate 4: Production apply with online schema change for large tables (>50 M rows) # Use pg_repack for bloat-sensitive tables, or a shadow-table approach pg_repack --host=vip.db.internal --port=5000 --username=postgres \ --table=users --no-superuser-check # Monitor lock waits during migration window psql -U postgres -c " SELECT pid, wait_event_type, wait_event, query_start, left(query, 80) FROM pg_stat_activity WHERE wait_event_type = 'Lock' ORDER BY query_start; "

Part 4 — Incident Decision Tree

The runbook must reduce cognitive load to near zero during an incident. A decision tree forces structured thinking when adrenaline degrades it.

Database incident decision tree: triage from alert to resolution DB Alert Fires Primary reachable? patronictl list / psql ping Reachable — check metrics lag · connections · locks YES Patroni promoting? check etcd + patronictl NO Repl lag high → parallel workers Conn saturated → PgBouncer tuning YES — wait auto-promote <30s YES NO — manual patronictl failover NO No replica viable? → Restore from pgBackRest Incident resolved → Write postmortem
Database incident decision tree: triage from the firing alert through failover, metrics-based diagnosis, and restore, ending in a postmortem.

Part 5 — Runbook Maintenance and Testing

A runbook that has not been exercised recently is a liability. At big-tech companies, database runbooks are owned by a named team, versioned in git, reviewed quarterly, and validated through scheduled game days. The following table defines the minimum testing cadence for this runbook:

  • Monthly: Verify that all monitoring alerts still fire correctly by running patronictl pause on the primary in staging and confirming that the "primary unreachable" PagerDuty alert fires within 2 minutes.
  • Quarterly: Full restore drill from a production pgBackRest backup to a temporary host. Record the actual wall-clock RTO and RPO and compare against the documented SLOs.
  • Before every major migration: Apply the migration to a production-sized staging clone. Record lock hold time. If it exceeds 500 ms, escalate to an online schema change strategy.
  • After every incident: Update the runbook with the actual steps taken, add any new failure mode discovered, and tag the git commit with the incident ID.
The single most important sentence in any runbook: "This procedure was last tested on [DATE] by [PERSON] and took [DURATION] to complete." If that sentence cannot be filled in, the procedure is theoretical, not operational. Every section of your runbook must carry this footer.

Congratulations on completing the Databases in Production tutorial. You now hold the full operator toolkit: you can design and validate HA topologies, write and test backup procedures, apply zero-downtime schema changes, manage connection pools under load, instrument every critical metric, and — critically — act from a runbook rather than instinct when the alert fires at 2 AM. That is the difference between surviving production and mastering it.