Databases on Kubernetes
Running databases on Kubernetes is one of the most debated topics in platform engineering. For years the conventional wisdom was "stateless workloads only — keep your databases outside the cluster." That advice made sense in 2016 when Kubernetes storage primitives were immature. In 2025, with mature operators, CSI drivers, and production battle-testing at companies like Zalando, Cloudflare, and GitLab, the calculus has shifted. But "you can" is not the same as "you should," and the conditions under which each answer is correct matter enormously.
Why Running Databases in Kubernetes Is Hard
Kubernetes is designed around cattle — stateless pods that can be killed, rescheduled, and scaled horizontally at will. Databases are pets: they have identity, disk state, replication topology, and cluster membership that must survive restarts and node failures without data loss. Several primitives bridge this gap:
- StatefulSet: assigns stable, ordered pod identities (pg-0, pg-1, pg-2) that survive pod restarts. Each pod gets its own PersistentVolumeClaim.
- PersistentVolumeClaim (PVC): the pod's durable storage, backed by a CSI driver (EBS, GCE PD, Ceph, Longhorn, etc.). The PVC outlives the pod.
- Headless Service: a Service with clusterIP: None that exposes each pod's DNS name directly (pg-0.postgres-headless.default.svc.cluster.local), allowing the operator to route writes to the primary and reads to replicas.
- PodDisruptionBudget (PDB): prevents Kubernetes from draining too many database pods simultaneously during node maintenance.
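As a rough sketch of how these four primitives fit together, the manifests below wire a headless Service, a StatefulSet with a per-pod volume claim, and a PDB. The names (postgres-headless, pg, pg-superuser, pg-pdb), the postgres:16 image, and the 20Gi size are illustrative assumptions. Note what is missing: this creates three independent PostgreSQL instances with stable identities and disks, but no replication, leader election, or failover.

```yaml
# Headless Service: clusterIP None gives every pod its own DNS record,
# e.g. pg-0.postgres-headless.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless        # hypothetical name
spec:
  clusterIP: None
  selector:
    app: pg
  ports:
    - name: postgres
      port: 5432
---
# StatefulSet: pods are named pg-0, pg-1, pg-2 and keep that identity
# across restarts. Each pod gets its own PVC from volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg
spec:
  serviceName: postgres-headless
  replicas: 3
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      containers:
        - name: postgres
          image: postgres:16     # assumed image tag
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-superuser   # hypothetical Secret, created separately
                  key: password
            # Keep the data directory in a subdirectory of the mount point
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi        # assumed size; uses the default StorageClass
---
# PDB: voluntary disruptions (such as a node drain) may take down
# at most one database pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pg-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: pg
```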
Even with these primitives, you would not write a production-grade database operator from scratch. The coordination logic — leader election, replica promotion, switchover with zero data loss, configuration reload on spec change — is complex enough that the community has converged on operators as the standard delivery mechanism.
Operators: The Right Abstraction
A Kubernetes operator encodes a human operator's runbook as a control loop. It watches Custom Resources (CRs) you define — e.g. a Cluster object for PostgreSQL — and reconciles the actual cluster state toward the desired state. Good operators handle:
- Bootstrap: initialising the primary, streaming replicas from it, and registering them in the topology
- Failover: detecting primary failure, electing a new leader, and updating the headless service endpoints, without losing committed transactions when synchronous replication is configured
- Switchover: graceful, zero-downtime primary hand-off for maintenance
- Backup: scheduled base backups to object storage (S3, GCS) and continuous WAL archiving
- Configuration management: translating a spec change into a coordinated rolling reload or restart
- Minor and major version upgrades: orchestrated cluster-wide upgrades with rollback support
CloudNativePG — The Benchmark PostgreSQL Operator
CloudNativePG (CNPG) is a CNCF sandbox project and one of the most production-mature PostgreSQL operators available. It runs PostgreSQL directly in pods (no sidecar), uses PostgreSQL streaming replication natively, and integrates WAL archiving with object storage out of the box. A minimal but production-aligned cluster manifest declares the instance count, storage, backup destination, and monitoring in a single Cluster resource.
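The following is a sketch of such a manifest, not a definitive configuration. It assumes the cluster is named pg-prod and uses placeholder values throughout: the db-retain StorageClass, the example-pg-backups bucket, the pg-backup-creds Secret, and the sizes are all hypothetical. Field names follow the postgresql.cnpg.io/v1 API. Newer CNPG releases move object-store backups to a separate Barman Cloud plugin, so check the documentation for the version you run.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-prod
spec:
  instances: 3                       # one primary plus two streaming replicas

  storage:
    storageClass: db-retain          # hypothetical StorageClass name
    size: 100Gi                      # assumed size

  # Equal requests and limits give the pods the Guaranteed QoS class
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi

  # Keep instances on different nodes
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname
    podAntiAffinityType: required

  # Base backups and continuous WAL archiving to object storage
  backup:
    barmanObjectStore:
      destinationPath: s3://example-pg-backups/pg-prod   # placeholder bucket
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds      # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip
    retentionPolicy: "30d"

  # Requires the Prometheus Operator CRDs to be installed
  monitoring:
    enablePodMonitor: true
```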
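Apply the manifest and CNPG bootstraps the cluster, configures streaming replication, starts WAL archiving, and, when the PodMonitor option is enabled in the spec, creates a PodMonitor so Prometheus scrapes the built-in metrics exporter automatically. Check cluster status with, for example, kubectl cnpg status pg-prod (this requires the cnpg kubectl plugin).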
Storage: The Make-or-Break Decision
The operator is only as reliable as the storage layer beneath it. Storage choice has a larger impact on database performance and data safety than almost any other Kubernetes decision.
Key storage principles for production databases on Kubernetes:
- Use block storage, not shared filesystems. NFS and most CephFS configurations are not safe for PostgreSQL or MySQL data directories because they do not provide the fsync guarantees databases depend on. Use block volumes (EBS, GCE PD, Ceph RBD in block mode).
- Enable volume encryption at the StorageClass level, not at the application layer. An encrypted: "true" parameter (as supported by the AWS EBS CSI driver) or a KMS key reference in the StorageClass ensures every PVC is encrypted automatically.
- Isolate database disk I/O. The Guaranteed QoS class covers CPU and memory only, so it does not protect against noisy-neighbour disk I/O, which is a major latency source on nodes with multiple workloads. Use dedicated node groups (via node selectors and taints) to keep other workloads off database nodes, and topologySpreadConstraints to spread database pods across nodes.
- Never use emptyDir for database data. emptyDir is ephemeral: it is destroyed when the pod is evicted. This is the single most common data-loss mistake when first running databases in Kubernetes.
- Set reclaimPolicy: Retain on the StorageClass used for all database PVCs. The default Delete policy will permanently destroy your data volume the moment a PVC is deleted, whether intentionally or by operator error.
Warning: if you run kubectl delete pvc pg-prod-1 on a StorageClass with reclaimPolicy: Delete, the underlying EBS volume is permanently destroyed within seconds. Always use reclaimPolicy: Retain for database volumes. After a deliberate decommission, manually delete the PV and the backing volume only after confirming the data is no longer needed or a verified backup exists.
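A StorageClass that applies these principles might look like the sketch below. It assumes AWS with the EBS CSI driver. The name db-retain and the gp3 volume type are illustrative choices, and other CSI drivers use different provisioner names and parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain                 # hypothetical name
provisioner: ebs.csi.aws.com      # assumes the AWS EBS CSI driver
parameters:
  type: gp3                       # assumed volume type
  encrypted: "true"               # every volume from this class is encrypted
  # kmsKeyId: <ARN of a customer-managed KMS key>   # optional
reclaimPolicy: Retain             # keep the PV and backing volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod is scheduled
allowVolumeExpansion: true        # allow growing PVCs later
```

With Retain, deleting a PVC leaves the PV in the Released state and the backing volume intact, so cleanup becomes a deliberate manual step.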
When NOT to Run Databases on Kubernetes
Even with mature operators, there are situations where running a database inside Kubernetes adds complexity with no proportional benefit:
- You are already on a managed service and have no operational problem. RDS, Cloud SQL, and Aurora solve real problems cheaply. Migrating to a CNPG cluster to gain control you do not need is pure complexity.
- Your team does not have Kubernetes expertise. Operating a CNPG cluster requires understanding pods, PVCs, StorageClasses, network policies, and the operator's own CRDs (CNPG manages pods and PVCs directly rather than through a StatefulSet). A team unfamiliar with Kubernetes should not adopt it as a database platform first.
- Your storage layer is not production-grade. If the cluster uses ephemeral local disks, a poorly tuned Longhorn installation, or NFS, running a production database on it will end badly. Fix the infrastructure before running stateful workloads.
- Compliance or data residency requirements dictate isolation. Some regulatory frameworks require databases to run on dedicated, single-tenant hardware. A shared Kubernetes cluster violates these requirements regardless of operator maturity.
- Extremely high I/O workloads. A cluster running on general-purpose network block storage will never match the throughput of a bare-metal NVMe host. For write-intensive OLTP at extreme scale, a dedicated host or bare-metal database server is still the right answer.
Key Operational Practices
If you do run databases on Kubernetes, these practices separate production-grade deployments from experiments:
- Always configure a PDB. Without a PodDisruptionBudget, a kubectl drain during node maintenance can evict both the primary and a replica simultaneously, causing a brief outage or failover cascade.
- Use anti-affinity rules to spread database pods across physical nodes and availability zones. A primary and both replicas on the same node means a single node failure takes down the entire cluster.
- Test failover regularly using the operator's promote command. Know the RTO from a primary failure before you experience it in production.
- Verify WAL archiving and restore. Running kubectl cnpg backup pg-prod is not enough; regularly restore that backup into a separate namespace or staging cluster to confirm it is valid.
- Separate the database namespace with NetworkPolicies that allow only application namespaces to reach database ports. Databases should never be reachable from the open internet or from unrelated workloads in the same cluster.
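The namespace isolation described in the last item could be sketched as the NetworkPolicy below. Several values are assumptions to verify against your environment: the databases and shop-backend namespaces are hypothetical, the cnpg.io/cluster pod label and the operator's cnpg-system namespace reflect a default CNPG installation, and port 8000 is assumed to be the instance status port the operator polls. The policy only takes effect if the cluster's CNI plugin enforces NetworkPolicies.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pg-prod-ingress
  namespace: databases                 # hypothetical database namespace
spec:
  # Applies to the PostgreSQL instance pods of the pg-prod cluster
  podSelector:
    matchLabels:
      cnpg.io/cluster: pg-prod
  policyTypes:
    - Ingress
  ingress:
    # Application namespace may reach PostgreSQL
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: shop-backend   # hypothetical app namespace
      ports:
        - protocol: TCP
          port: 5432
    # Instances of the same cluster must reach each other for streaming replication
    - from:
        - podSelector:
            matchLabels:
              cnpg.io/cluster: pg-prod
      ports:
        - protocol: TCP
          port: 5432
    # The operator must reach each instance to check its status
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: cnpg-system    # assumed operator namespace
      ports:
        - protocol: TCP
          port: 8000                                      # assumed status port
```

Any ingress not listed is dropped once this policy selects the pods, so Prometheus scraping would need an additional rule for the metrics port.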