Databases in Production

Database Observability

18 min Lesson 7 of 30

Database Observability

An on-call engineer who cannot answer "which query is killing us right now?" in under 60 seconds is flying blind. Database observability is not just about dashboards — it is a disciplined set of signals, collection pipelines, and alert rules that let you go from "something is wrong" to "this specific query, holding this lock, on this replica, with this lag" in minutes rather than hours. This lesson covers the four pillars that matter most in production: slow-query telemetry, lock analysis, replication-lag tracking, and query-stats aggregation.

Slow Query Telemetry

Every production database engine has a built-in mechanism to log queries that exceed a configurable threshold. Enabling this correctly is the single highest-return observability investment you can make.

PostgreSQL — log_min_duration_statement

Set log_min_duration_statement = 1000 (milliseconds) in postgresql.conf or via ALTER SYSTEM to log any statement taking longer than one second. In high-traffic environments start at 500 ms and lower as you build confidence. The log line includes duration, query text, bind parameters (if log_min_duration_statement is combined with log_parameters), and the query plan can be captured by pairing with auto_explain.

-- Enable auto_explain for queries over 500 ms (PostgreSQL) -- Add to postgresql.conf or run as superuser: ALTER SYSTEM SET shared_preload_libraries = 'auto_explain'; ALTER SYSTEM SET auto_explain.log_min_duration = 500; ALTER SYSTEM SET auto_explain.log_analyze = on; ALTER SYSTEM SET auto_explain.log_buffers = on; ALTER SYSTEM SET auto_explain.log_format = 'json'; ALTER SYSTEM SET log_min_duration_statement = 500; SELECT pg_reload_conf(); -- View the top 20 slowest normalized queries (requires pg_stat_statements) SELECT calls, round(total_exec_time::numeric, 2) AS total_ms, round((total_exec_time / calls)::numeric, 2) AS avg_ms, round(stddev_exec_time::numeric, 2) AS stddev_ms, rows, shared_blks_hit, shared_blks_read, left(query, 120) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20;

The shared_blks_read column reveals cache-miss pressure — a query with high reads relative to hits is pulling data from disk, a red flag regardless of its wall-clock time. Reset the stats between deploys with SELECT pg_stat_statements_reset(); to get a clean baseline.

MySQL / MariaDB — Slow Query Log

Set slow_query_log = ON, long_query_time = 1, and log_queries_not_using_indexes = ON. Parse aggregated reports with pt-query-digest (Percona Toolkit) — it normalizes query fingerprints, computes percentiles, and emits structured output that plugs directly into your observability stack.

# MySQL slow-query log analysis with pt-query-digest pt-query-digest \ --since "$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')" \ --order-by Query_time:sum \ --limit 20 \ /var/lib/mysql/slow.log # Ship slow query logs to Loki for long-term retention # /etc/promtail/config.yaml excerpt scrape_configs: - job_name: mysql-slow static_configs: - targets: [localhost] labels: job: mysql-slow host: db-primary-01 pipeline_stages: - regex: expression: '# Time: (?P<ts>\S+)\n# User.*\n# Query_time: (?P<query_time>\S+).*\n(?P<query>.*)' - labels: query_time: __path__: /var/lib/mysql/slow.log
The p99, not the average: Average query time hides the tail. Always build dashboards around p99 (and sometimes p999 for financial or safety-critical systems). A query that averages 5 ms but spikes to 4,000 ms at p99 is a production incident waiting to happen. In Prometheus, compute percentiles from pg_stat_statements histograms exposed by postgres_exporter.

Lock Analysis

Locks are the silent killer in multi-writer databases. A transaction holding a row lock while waiting for user input, or an accidental table-level lock from a schema migration, can queue up hundreds of waiting connections in seconds.

Lock wait chain — blocker, waiter, and downstream queue Blocker PID 42 UPDATE orders HOLDS row lock Row Lock orders.id = 9901 Waiter PID 77 UPDATE orders waiting 8s Waiter PID 81 SELECT FOR UPDATE waiting 6s Waiter PID 94 UPDATE orders waiting 2s pg_terminate_ backend(42)
A single blocker transaction holding a row lock queues multiple waiter connections — resolving the root blocker (PID 42) unblocks the entire chain.
-- PostgreSQL: identify blockers and their waiter queues SELECT blocker.pid AS blocker_pid, blocker.usename AS blocker_user, blocker.query AS blocker_query, now() - blocker.xact_start AS blocker_age, waiter.pid AS waiter_pid, waiter.query AS waiter_query, now() - waiter.state_change AS waiting_for FROM pg_stat_activity waiter JOIN pg_stat_activity blocker ON blocker.pid = ANY(pg_blocking_pids(waiter.pid)) WHERE waiter.wait_event_type = 'Lock' ORDER BY blocker_age DESC; -- MySQL: show blocking transactions SELECT r.trx_id AS waiting_trx, r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query, b.trx_id AS blocking_trx, b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
Lock-wait timeout is not your safety net: The default lock_timeout in PostgreSQL is 0 (infinite wait). In MySQL the default innodb_lock_wait_timeout is 50 seconds. Both values are dangerously permissive in high-throughput systems. Set lock_timeout = 5s in PostgreSQL at the application level and instrument your connection pool to alert when lock-wait errors spike — they are a leading indicator of a deeper contention problem.

Replication Lag Metrics

Replication lag is the seconds (or bytes) by which a replica trails the primary. It is a critical SLO signal: reads from a lagged replica return stale data, and a replica that cannot keep up will never catch up under sustained write load. Big-tech teams alert on lag at two thresholds — a warning at 10 seconds and a page at 60 seconds — and treat any lag trend above zero as a performance bug.

PostgreSQL replication lag is measured in both bytes (LSN difference) and wall-clock seconds. The authoritative source is pg_stat_replication on the primary:

-- PostgreSQL: lag in bytes and seconds per replica SELECT application_name, client_addr, state, sync_state, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn)) AS flush_lag, pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) AS replay_lag, write_lag, flush_lag AS flush_lag_time, replay_lag AS replay_lag_time FROM pg_stat_replication ORDER BY replay_lag DESC; -- On the replica itself: seconds of lag SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag_seconds; -- Prometheus alert rule (PrometheusRule) for replication lag # prometheus-rules/database.yaml groups: - name: database.replication rules: - alert: PostgresReplicationLagWarning expr: pg_replication_lag{job="postgres"} > 10 for: 2m labels: severity: warning annotations: summary: "Replica {{ $labels.instance }} is {{ $value | humanizeDuration }} behind primary" - alert: PostgresReplicationLagCritical expr: pg_replication_lag{job="postgres"} > 60 for: 1m labels: severity: critical annotations: summary: "CRITICAL: Replica lag {{ $value }}s — reads may return stale data" - alert: MySQLSlaveIONotRunning expr: mysql_slave_status_slave_io_running{job="mysql"} == 0 for: 30s labels: severity: critical annotations: summary: "MySQL replica IO thread stopped on {{ $labels.instance }}"

For MySQL, the canonical lag metric is Seconds_Behind_Master from SHOW SLAVE STATUS — exposed by mysqld_exporter as mysql_slave_status_seconds_behind_master. Treat this value critically: it resets to 0 when the IO thread stops, giving a false green. Always monitor both the lag value and the IO/SQL thread running status.

Query Stats Aggregation with pg_stat_statements and Performance Schema

Ad-hoc slow-query analysis catches individual incidents. Query stats aggregation catches systemic patterns — the query that is fast individually but accounts for 40% of total DB CPU because it runs 50,000 times per minute. This is the difference between firefighting and capacity planning.

PostgreSQL's pg_stat_statements normalizes query fingerprints and accumulates per-query counters across the lifetime of the server. MySQL's Performance Schema events_statements_summary_by_digest serves the same purpose. Both should be queried on a schedule (every 1–5 minutes) and their deltas shipped to your metrics backend.

-- PromQL: surface queries dominating DB execution time -- (postgres_exporter exposes pg_stat_statements as a Gauge per query fingerprint) # Total execution time rate per query (top-5 by CPU burn) topk(5, rate(pg_stat_statements_total_exec_time_seconds_total[5m]) ) # Cache hit ratio per query — below 0.95 means disk reads 1 - ( rate(pg_stat_statements_blks_read_total[5m]) / (rate(pg_stat_statements_blks_hit_total[5m]) + rate(pg_stat_statements_blks_read_total[5m]) + 0.001) ) # MySQL: top 10 digests by total latency (Performance Schema) SELECT schema_name, LEFT(digest_text, 100) AS query_fingerprint, count_star AS executions, round(sum_timer_wait / 1e12, 2) AS total_latency_s, round(avg_timer_wait / 1e9, 2) AS avg_latency_ms, sum_rows_examined, sum_rows_sent FROM performance_schema.events_statements_summary_by_digest WHERE schema_name IS NOT NULL ORDER BY sum_timer_wait DESC LIMIT 10;
The observability stack for databases at scale: Prometheus + postgres_exporter (or mysqld_exporter) scrapes engine metrics every 15 seconds. Grafana dashboards surface the four golden signals adapted for databases: error rate (connection failures, deadlocks), latency (p99 query time), traffic (queries/sec), and saturation (connection pool fill, replication lag). Loki ingests slow-query log lines for full-text search. PagerDuty routes critical alerts. This four-layer stack — scrape, store, visualize, alert — is the baseline at every major cloud company.

Building a Minimal Observability Baseline in 30 Minutes

If you are starting from zero on a new cluster, this is the order of operations that delivers 80% of production value fastest:

  1. Enable pg_stat_statements (PostgreSQL) or Performance Schema digests (MySQL) — these are safe for production and are typically already on in managed services.
  2. Deploy postgres_exporter or mysqld_exporter as a sidecar or systemd service; point Prometheus at it with a 15s scrape interval.
  3. Import the community Grafana dashboards (ID 9628 for PostgreSQL, ID 7362 for MySQL) as a starting point, then customize thresholds for your traffic profile.
  4. Set log_min_duration_statement = 500 (Postgres) or enable the slow query log (MySQL); ship logs to Loki or CloudWatch Logs Insights.
  5. Write three alert rules: replication lag critical, connection saturation above 80%, and slow-query rate above your baseline.
Automated query regression detection: At Google and Meta scale, slow-query detection is automated — a CI step runs EXPLAIN ANALYZE on every new query in a pull request and fails the build if the plan cost exceeds a threshold. For most teams a simpler version works: ship pg_stat_statements deltas to Datadog or Grafana Cloud, and set anomaly-detection alerts on per-fingerprint execution time. The first time you catch a bad index removal in code review rather than a 3 AM page, you will understand why this investment pays off.

Database observability is not a one-time setup — it is a feedback loop. Every incident should end with a new alert rule or a new dashboard panel so the same failure mode is caught automatically next time. Teams that treat observability as a living system catch problems in minutes; teams that treat it as a checkbox catch them in hours or not at all.