Databases in Production

Project: A Production Database Runbook

18 min Lesson 10 of 30

Project: A Production Database Runbook

A runbook is not documentation written for auditors — it is the document your on-call engineer reads at 2 AM with adrenaline running and Slack pinging. Every section must answer a single question in under thirty seconds: "What do I do right now?" This lesson synthesises every technique from this tutorial — high availability, backups, zero-downtime migrations, connection pooling, and observability — into a single, coherent runbook for a sample production PostgreSQL cluster. You will also see how to validate each procedure so the runbook stays trustworthy over time.

The Sample System

Our reference system is a PostgreSQL 16 cluster running behind PgBouncer on a three-node Patroni setup: one primary, two synchronous replicas, with an HAProxy VIP routing writes to port 5000 and reads to port 5001. Backups are taken nightly via pgBackRest to an S3-compatible object store and retained for 30 days. Schema migrations are managed with Flyway and applied through a CI/CD pipeline.

Reference cluster: HAProxy routes write/read traffic through PgBouncer to a three-node Patroni cluster; pgBackRest ships WAL archives to S3; etcd is the Distributed Configuration Store.

Part 1 — High Availability Procedures

Planned Failover (Maintenance Switchover)

A planned switchover demotes the current leader cleanly, allows replicas to drain outstanding WAL, then promotes a replica. Patroni makes this a single command, but the runbook must state what to verify before and after.

# 1. Verify cluster health before touching anything
patronictl -c /etc/patroni/patroni.yml list
# Expected: one Leader, at least one running Replica, replication_lag = 0

# 2. Check that both replicas are fully caught up
psql -h localhost -U postgres -c "
SELECT application_name,
       state,
       sent_lsn - replay_lsn AS replay_lag_bytes,
       sync_state
FROM pg_stat_replication;
"
# Expected: replay_lag_bytes = 0 for both replicas

# 3. Execute the switchover (picks the most advanced replica as new primary)
patronictl -c /etc/patroni/patroni.yml switchover --master db-node-1 --candidate db-node-2 --scheduled now --force

# 4. Confirm new leader
patronictl -c /etc/patroni/patroni.yml list

# 5. Confirm HAProxy VIP now points to the new primary (port 5000)
psql -h vip.db.internal -p 5000 -U app_user -c "SELECT pg_is_in_recovery();"
# Expected: f (false — new primary confirms writes)

Unplanned Failover (Node Crash)

When the primary crashes, Patroni detects the outage via etcd TTL expiry (default 30 s), fences the failed node by revoking its etcd lock, promotes the most advanced replica, and updates the HAProxy backend. The runbook must specify the verification steps after automatic promotion, not how to trigger the promotion — Patroni does that itself.

# After the alert fires and you open your laptop:

# 1. What does Patroni see?
patronictl -c /etc/patroni/patroni.yml list
# Look for a new Leader; if no Leader exists, etcd may be partitioned

# 2. Confirm the new primary is accepting writes
psql -h vip.db.internal -p 5000 -U app_user -c "
SELECT now(), pg_is_in_recovery(), pg_current_wal_lsn();
"

# 3. Check for replication slots that were pinned to the dead node
psql -h vip.db.internal -p 5000 -U postgres -c "
SELECT slot_name, active, pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
"
# DROP any inactive slots that are blocking WAL reclamation
# psql -c "SELECT pg_drop_replication_slot('stale_slot_name');"

# 4. Re-join the recovered node as a replica (Patroni handles this automatically
#    once the node comes back online — it calls pg_basebackup internally)
#    Monitor with: patronictl -c /etc/patroni/patroni.yml list

# 5. Update PagerDuty/status page and open an incident ticket

Stale replication slots are a silent disk bomb. An inactive slot causes the primary to retain WAL indefinitely for the missing consumer. A 10 TB database can fill its WAL directory in hours if a slot pointed at a dead CDC consumer is left in place. The runbook must include an immediate check for inactive slots after every unplanned failover.

Part 2 — Backup and Restore Procedures

Verifying Nightly Backup Completion

# Check pgBackRest backup info from the standby (preferred — no I/O impact on primary)
pgbackrest --stanza=db-primary info

# Expected output includes:
#   status: ok
#   db (current): ...
#   full backup: timestamp < 25h ago
#   wal archive: min = ... max = [within 5 min of now]

# If backup is missing or WAL archive is stale, check the log:
journalctl -u pgbackrest-backup --since "24 hours ago" | grep -E "ERROR|WARN|P00"

Point-in-Time Restore Procedure

This is the procedure that matters most and is tested least. Run a full restore drill to a staging cluster every quarter. The runbook must contain the exact commands so the drill and the real recovery use identical steps.

# Restore to a specific point in time (e.g., 30 minutes before the bad migration)
# Run this on a SEPARATE recovery host — never restore over a live primary

# 1. Stop PostgreSQL if running on the recovery host
systemctl stop postgresql

# 2. Clear the data directory
rm -rf /var/lib/postgresql/16/main/*

# 3. Restore from the nearest full backup, applying WAL through the target time
pgbackrest --stanza=db-primary \
           --delta \
           --type=time \
           "--target=2025-09-15 03:45:00 UTC" \
           --target-action=promote \
           restore

# 4. Start PostgreSQL — it will replay WAL up to the target time, then promote
systemctl start postgresql

# 5. Verify data integrity at the restored point
psql -U postgres -c "SELECT max(created_at) FROM orders;"
# Confirm it is consistent with the expected timestamp

# 6. Run application smoke tests against the recovery host before cutover
# If good, promote this node and re-point the application connection string

Quarterly restore drill: Schedule a calendar event every quarter to spin up a restore from production backups to a temporary EC2 instance. Time it, document the actual RTO, and compare against your SLO. If your RTO SLO is 4 hours but the drill takes 5.5 hours, you have a gap to close before the real incident — not during it.

Part 3 — Zero-Downtime Migration Procedure

Every schema migration goes through four gates before reaching production. The runbook defines what each gate checks.

# Gate 1: Local review — does the migration require a full table lock?
# Dangerous: ALTER TABLE users ADD COLUMN age INT NOT NULL DEFAULT 0;
#   (acquires AccessExclusiveLock on the whole table for the DEFAULT backfill)
# Safe: ALTER TABLE users ADD COLUMN age INT;  -- then backfill in batches

# Gate 2: Staging apply — measure execution time and lock wait time
psql -U postgres -c "
SET lock_timeout = '2s';   -- fail fast if another query holds the lock
SET statement_timeout = '30s';
\i migrations/V45__add_age_column.sql
"

# Gate 3: CI pipeline dry-run using Flyway's validate + dryRun mode
flyway -url=jdbc:postgresql://staging-db:5432/app \
       -user=flyway \
       -password=\${FLYWAY_PASSWORD} \
       -dryRunOutput=/tmp/V45_dryrun.sql \
       migrate

# Gate 4: Production apply with online schema change for large tables (>50 M rows)
# Use pg_repack for bloat-sensitive tables, or a shadow-table approach
pg_repack --host=vip.db.internal --port=5000 --username=postgres \
          --table=users --no-superuser-check

# Monitor lock waits during migration window
psql -U postgres -c "
SELECT pid, wait_event_type, wait_event, query_start, left(query, 80)
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
"

Part 4 — Incident Decision Tree

The runbook must reduce cognitive load to near zero during an incident. A decision tree forces structured thinking when adrenaline degrades it.

Database incident decision tree: triage from the firing alert through failover, metrics-based diagnosis, and restore, ending in a postmortem.

Part 5 — Runbook Maintenance and Testing

A runbook that has not been exercised recently is a liability. At big-tech companies, database runbooks are owned by a named team, versioned in git, reviewed quarterly, and validated through scheduled game days. The following table defines the minimum testing cadence for this runbook:

Monthly: Verify that all monitoring alerts still fire correctly by running patronictl pause on the primary in staging and confirming that the "primary unreachable" PagerDuty alert fires within 2 minutes.
Quarterly: Full restore drill from a production pgBackRest backup to a temporary host. Record the actual wall-clock RTO and RPO and compare against the documented SLOs.
Before every major migration: Apply the migration to a production-sized staging clone. Record lock hold time. If it exceeds 500 ms, escalate to an online schema change strategy.
After every incident: Update the runbook with the actual steps taken, add any new failure mode discovered, and tag the git commit with the incident ID.

The single most important sentence in any runbook: "This procedure was last tested on [DATE] by [PERSON] and took [DURATION] to complete." If that sentence cannot be filled in, the procedure is theoretical, not operational. Every section of your runbook must carry this footer.

Congratulations on completing the Databases in Production tutorial. You now hold the full operator toolkit: you can design and validate HA topologies, write and test backup procedures, apply zero-downtime schema changes, manage connection pools under load, instrument every critical metric, and — critically — act from a runbook rather than instinct when the alert fires at 2 AM. That is the difference between surviving production and mastering it.