Databases in Production

The Operators View of Databases

18 min Lesson 1 of 30

The Operators View of Databases

A developer asks: "Does my query return the right rows?" An operator asks: "Is this database still alive at 3 AM, can it survive the loss of a primary node, and will it come back up cleanly if the host kernel panics?" These are fundamentally different questions, and learning to think like an operator is the first shift you must make when you move from writing application code to running database infrastructure in production.

What Changes When You Become the Operator

When you own a database in production you inherit a set of responsibilities that have nothing to do with SQL correctness:

Availability: The database must be reachable. A single unreachable instance blocks every service that depends on it — often cascading into multi-service outages.
Durability: Committed transactions must survive crashes, power loss, and disk failure. Understanding fsync, write-ahead logs, and the difference between acknowledged and durable is non-negotiable.
Recoverability: You will take backups. You must also test restores — a backup you have never restored is a backup you do not have.
Capacity: Storage grows monotonically. You must know your growth rate, set alerts before disks fill, and plan for growth before it becomes an incident.
Observability: Latency percentiles, replication lag, connection pool saturation, lock wait time, and slow-query counts are the vital signs of a database. You must collect them, alert on them, and know how to read them under pressure.
Change management: Schema migrations, version upgrades, and configuration tuning all carry blast radius. An operator plans rollbacks before applying changes, not after.

The operator contract: When a developer pushes a feature, they are done. When an operator deploys a database, they own it for its entire production lifetime — every crash, every restore, every capacity crunch, every 3 AM page.

Managed vs Self-Hosted: The Real Trade-off

Every organization running databases faces this choice. The answer is rarely obvious and always context-dependent. Let us be precise about what each option buys and costs.

Managed Databases (RDS, Cloud SQL, Aurora, Neon, PlanetScale, Atlas)

A managed service takes over a defined set of operational tasks: hardware provisioning, OS patching, minor-version upgrades, automated backups, and basic failover. What you give up in return is control — over the kernel parameters, the storage layout, the binary log format, and often the upgrade schedule.

In practice, managed databases are the right default for:

Teams without a dedicated DBA or database-platform team
Workloads where the blast radius of a misconfigured innodb_buffer_pool_size is higher than the cloud bill
Use cases where automated Multi-AZ failover (typically 30–60 seconds for RDS) is acceptable

What managed services do not solve: slow queries, schema design mistakes, runaway connection counts, or application-level bugs. The cloud patches the engine; it does not patch your SELECT * inside a tight loop.

The managed-service illusion: Engineers often assume that choosing RDS or Cloud SQL means they no longer need to think operationally about the database. This is false. You still own query performance, connection management, backup verification, schema migrations, and capacity planning. The managed service handles the hardware and OS — everything above that layer remains your responsibility.

Self-Hosted Databases (EC2, bare-metal, on-prem, Kubernetes StatefulSets)

Self-hosting gives you complete control: kernel tuning (vm.swappiness=1, transparent_hugepages=never), custom storage topologies, unrestricted access to binary logs for CDC pipelines, and the ability to run any plugin or extension the managed service does not expose. The cost is that you own everything that can go wrong.

Self-hosting is justified when:

Data residency or compliance requirements prevent cloud-managed services
You need engine features the managed service does not expose (e.g., pg_cron, custom WAL plugins, specific storage engines)
Scale makes the managed-service price significantly higher than the engineering cost of self-operation
You need sub-10-second failover with custom quorum logic (Patroni, Orchestrator, Vitess)

Responsibility split: managed databases eliminate the bottom layers but schema, query performance, and application logic always belong to your team.

The Operational Baseline: What Every Production Database Needs

Regardless of managed or self-hosted, these capabilities must exist on day one — not after the first incident:

Automated, tested backups with a documented RTO and RPO.
Replication (at minimum a hot standby) so a single host failure is not a service outage.
Monitoring covering latency (p50/p99), replication lag, connection saturation, disk usage, and slow-query rate.
Runbook covering failover, restore, and emergency connection-kill procedures — written before you need it.
Change process for schema migrations that includes a rollback path.

The following command is one of the first things a new operator should run against any self-hosted PostgreSQL or MySQL instance to understand its current state:

# PostgreSQL: surface key operational parameters at a glance
psql -U postgres -c "
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN (
  'max_connections',
  'shared_buffers',
  'wal_level',
  'synchronous_commit',
  'checkpoint_completion_target',
  'max_wal_size',
  'archive_mode',
  'log_min_duration_statement'
)
ORDER BY name;
"

# Also: check replication slots (unconsumed slots can fill disk silently)
psql -U postgres -c "
SELECT slot_name, plugin, slot_type, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
"

For a managed RDS/Aurora instance you do the same audit through parameter groups and CloudWatch metrics, but the principle is identical: know the state of the system before something goes wrong.

Kernel and OS Settings That Matter (Self-Hosted)

If you run a self-hosted database, three OS-level settings have more impact on stability than almost any database-level tuning:

# /etc/sysctl.d/90-database.conf
# Disable swap — a swapping database is an unhealthy database
vm.swappiness = 1

# Raise the dirty-data flush threshold (reduce stall frequency for large writes)
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5

# Increase the open-file limit for busy instances
fs.file-max = 1000000

# Apply without reboot
sysctl -p /etc/sysctl.d/90-database.conf

# Disable Transparent Huge Pages (major latency source for databases)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Make the THP setting persist across reboots (systemd unit or rc.local)

First day on a new database host: Run cat /proc/sys/vm/swappiness and cat /sys/kernel/mm/transparent_hugepage/enabled. If swappiness is above 1 or THP is enabled, you have found your first production risk before a single query has run.

The Mental Model You Need to Carry Forward

Every subsequent lesson in this tutorial — replication, backups, pooling, observability, Kubernetes StatefulSets — makes more sense when you hold this foundational frame: the database is a shared, stateful, failure-prone component whose durability and availability are entirely your team's problem. The cloud does not magically make it otherwise; it only moves the line of your responsibility upward. Knowing where that line sits, exactly, is the first skill of a database operator.