Databases in Production

Database Observability

18 min Lesson 7 of 30

Database Observability

An on-call engineer who cannot answer "which query is killing us right now?" in under 60 seconds is flying blind. Database observability is not just about dashboards — it is a disciplined set of signals, collection pipelines, and alert rules that let you go from "something is wrong" to "this specific query, holding this lock, on this replica, with this lag" in minutes rather than hours. This lesson covers the four pillars that matter most in production: slow-query telemetry, lock analysis, replication-lag tracking, and query-stats aggregation.

Slow Query Telemetry

Every production database engine has a built-in mechanism to log queries that exceed a configurable threshold. Enabling this correctly is the single highest-return observability investment you can make.

PostgreSQL — log_min_duration_statement

Set log_min_duration_statement = 1000 (milliseconds) in postgresql.conf or via ALTER SYSTEM to log any statement taking longer than one second. In high-traffic environments start at 500 ms and lower as you build confidence. The log line includes duration, query text, bind parameters (if log_min_duration_statement is combined with log_parameters), and the query plan can be captured by pairing with auto_explain.

-- Enable auto_explain for queries over 500 ms (PostgreSQL)
-- Add to postgresql.conf or run as superuser:
ALTER SYSTEM SET shared_preload_libraries = 'auto_explain';
ALTER SYSTEM SET auto_explain.log_min_duration = 500;
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
ALTER SYSTEM SET auto_explain.log_format = 'json';
ALTER SYSTEM SET log_min_duration_statement = 500;
SELECT pg_reload_conf();

-- View the top 20 slowest normalized queries (requires pg_stat_statements)
SELECT
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round((total_exec_time / calls)::numeric, 2) AS avg_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  rows,
  shared_blks_hit,
  shared_blks_read,
  left(query, 120) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The shared_blks_read column reveals cache-miss pressure — a query with high reads relative to hits is pulling data from disk, a red flag regardless of its wall-clock time. Reset the stats between deploys with SELECT pg_stat_statements_reset(); to get a clean baseline.

MySQL / MariaDB — Slow Query Log

Set slow_query_log = ON, long_query_time = 1, and log_queries_not_using_indexes = ON. Parse aggregated reports with pt-query-digest (Percona Toolkit) — it normalizes query fingerprints, computes percentiles, and emits structured output that plugs directly into your observability stack.

# MySQL slow-query log analysis with pt-query-digest
pt-query-digest \
  --since "$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')" \
  --order-by Query_time:sum \
  --limit 20 \
  /var/lib/mysql/slow.log

# Ship slow query logs to Loki for long-term retention
# /etc/promtail/config.yaml excerpt
scrape_configs:
  - job_name: mysql-slow
    static_configs:
      - targets: [localhost]
        labels:
          job: mysql-slow
          host: db-primary-01
    pipeline_stages:
      - regex:
          expression: '# Time: (?P<ts>\S+)\n# User.*\n# Query_time: (?P<query_time>\S+).*\n(?P<query>.*)'
      - labels:
          query_time:
    __path__: /var/lib/mysql/slow.log

The p99, not the average: Average query time hides the tail. Always build dashboards around p99 (and sometimes p999 for financial or safety-critical systems). A query that averages 5 ms but spikes to 4,000 ms at p99 is a production incident waiting to happen. In Prometheus, compute percentiles from pg_stat_statements histograms exposed by postgres_exporter.

Lock Analysis

Locks are the silent killer in multi-writer databases. A transaction holding a row lock while waiting for user input, or an accidental table-level lock from a schema migration, can queue up hundreds of waiting connections in seconds.

A single blocker transaction holding a row lock queues multiple waiter connections — resolving the root blocker (PID 42) unblocks the entire chain.

-- PostgreSQL: identify blockers and their waiter queues
SELECT
  blocker.pid                          AS blocker_pid,
  blocker.usename                      AS blocker_user,
  blocker.query                        AS blocker_query,
  now() - blocker.xact_start          AS blocker_age,
  waiter.pid                           AS waiter_pid,
  waiter.query                         AS waiter_query,
  now() - waiter.state_change          AS waiting_for
FROM pg_stat_activity waiter
JOIN pg_stat_activity blocker
  ON blocker.pid = ANY(pg_blocking_pids(waiter.pid))
WHERE waiter.wait_event_type = 'Lock'
ORDER BY blocker_age DESC;

-- MySQL: show blocking transactions
SELECT
  r.trx_id             AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query          AS waiting_query,
  b.trx_id             AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query          AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

Lock-wait timeout is not your safety net: The default lock_timeout in PostgreSQL is 0 (infinite wait). In MySQL the default innodb_lock_wait_timeout is 50 seconds. Both values are dangerously permissive in high-throughput systems. Set lock_timeout = 5s in PostgreSQL at the application level and instrument your connection pool to alert when lock-wait errors spike — they are a leading indicator of a deeper contention problem.

Replication Lag Metrics

Replication lag is the seconds (or bytes) by which a replica trails the primary. It is a critical SLO signal: reads from a lagged replica return stale data, and a replica that cannot keep up will never catch up under sustained write load. Big-tech teams alert on lag at two thresholds — a warning at 10 seconds and a page at 60 seconds — and treat any lag trend above zero as a performance bug.

PostgreSQL replication lag is measured in both bytes (LSN difference) and wall-clock seconds. The authoritative source is pg_stat_replication on the primary:

-- PostgreSQL: lag in bytes and seconds per replica
SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn))              AS flush_lag,
  pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn))            AS replay_lag,
  write_lag, flush_lag AS flush_lag_time, replay_lag AS replay_lag_time
FROM pg_stat_replication
ORDER BY replay_lag DESC;

-- On the replica itself: seconds of lag
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag_seconds;

-- Prometheus alert rule (PrometheusRule) for replication lag
# prometheus-rules/database.yaml
groups:
  - name: database.replication
    rules:
      - alert: PostgresReplicationLagWarning
        expr: pg_replication_lag{job="postgres"} > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} is {{ $value | humanizeDuration }} behind primary"
      - alert: PostgresReplicationLagCritical
        expr: pg_replication_lag{job="postgres"} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: Replica lag {{ $value }}s — reads may return stale data"

      - alert: MySQLSlaveIONotRunning
        expr: mysql_slave_status_slave_io_running{job="mysql"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "MySQL replica IO thread stopped on {{ $labels.instance }}"

For MySQL, the canonical lag metric is Seconds_Behind_Master from SHOW SLAVE STATUS — exposed by mysqld_exporter as mysql_slave_status_seconds_behind_master. Treat this value critically: it resets to 0 when the IO thread stops, giving a false green. Always monitor both the lag value and the IO/SQL thread running status.

Query Stats Aggregation with pg_stat_statements and Performance Schema

Ad-hoc slow-query analysis catches individual incidents. Query stats aggregation catches systemic patterns — the query that is fast individually but accounts for 40% of total DB CPU because it runs 50,000 times per minute. This is the difference between firefighting and capacity planning.

PostgreSQL's pg_stat_statements normalizes query fingerprints and accumulates per-query counters across the lifetime of the server. MySQL's Performance Schema events_statements_summary_by_digest serves the same purpose. Both should be queried on a schedule (every 1–5 minutes) and their deltas shipped to your metrics backend.

-- PromQL: surface queries dominating DB execution time
-- (postgres_exporter exposes pg_stat_statements as a Gauge per query fingerprint)

# Total execution time rate per query (top-5 by CPU burn)
topk(5,
  rate(pg_stat_statements_total_exec_time_seconds_total[5m])
)

# Cache hit ratio per query — below 0.95 means disk reads
1 - (
  rate(pg_stat_statements_blks_read_total[5m])
  /
  (rate(pg_stat_statements_blks_hit_total[5m]) + rate(pg_stat_statements_blks_read_total[5m]) + 0.001)
)

# MySQL: top 10 digests by total latency (Performance Schema)
SELECT
  schema_name,
  LEFT(digest_text, 100) AS query_fingerprint,
  count_star          AS executions,
  round(sum_timer_wait / 1e12, 2) AS total_latency_s,
  round(avg_timer_wait / 1e9, 2)  AS avg_latency_ms,
  sum_rows_examined,
  sum_rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10;

The observability stack for databases at scale: Prometheus + postgres_exporter (or mysqld_exporter) scrapes engine metrics every 15 seconds. Grafana dashboards surface the four golden signals adapted for databases: error rate (connection failures, deadlocks), latency (p99 query time), traffic (queries/sec), and saturation (connection pool fill, replication lag). Loki ingests slow-query log lines for full-text search. PagerDuty routes critical alerts. This four-layer stack — scrape, store, visualize, alert — is the baseline at every major cloud company.

Building a Minimal Observability Baseline in 30 Minutes

If you are starting from zero on a new cluster, this is the order of operations that delivers 80% of production value fastest:

Enable pg_stat_statements (PostgreSQL) or Performance Schema digests (MySQL) — these are safe for production and are typically already on in managed services.
Deploy postgres_exporter or mysqld_exporter as a sidecar or systemd service; point Prometheus at it with a 15s scrape interval.
Import the community Grafana dashboards (ID 9628 for PostgreSQL, ID 7362 for MySQL) as a starting point, then customize thresholds for your traffic profile.
Set log_min_duration_statement = 500 (Postgres) or enable the slow query log (MySQL); ship logs to Loki or CloudWatch Logs Insights.
Write three alert rules: replication lag critical, connection saturation above 80%, and slow-query rate above your baseline.

Automated query regression detection: At Google and Meta scale, slow-query detection is automated — a CI step runs EXPLAIN ANALYZE on every new query in a pull request and fails the build if the plan cost exceeds a threshold. For most teams a simpler version works: ship pg_stat_statements deltas to Datadog or Grafana Cloud, and set anomaly-detection alerts on per-fingerprint execution time. The first time you catch a bad index removal in code review rather than a 3 AM page, you will understand why this investment pays off.

Database observability is not a one-time setup — it is a feedback loop. Every incident should end with a new alert rule or a new dashboard panel so the same failure mode is caught automatically next time. Teams that treat observability as a living system catch problems in minutes; teams that treat it as a checkbox catch them in hours or not at all.