CSI & Stateful Storage Operations
The previous two lessons covered how Kubernetes abstracts storage into PersistentVolumes and how StorageClasses drive dynamic provisioning. This lesson goes one layer deeper, into the Container Storage Interface (CSI): the plugin system that makes every provisioner, snapshot provider, and cloud storage backend pluggable without touching the Kubernetes core. You will also learn about snapshots, backup patterns, and the failure modes that trip up teams running databases and message queues in production.
What Is CSI and Why It Replaced In-Tree Plugins
Before CSI, storage drivers (AWS EBS, GCE PD, Ceph RBD, NFS) lived inside the Kubernetes source tree. Every driver release was coupled to a Kubernetes release: a fix for a bug in the Ceph driver had to wait for the next Kubernetes release. The CSI specification, developed jointly by several container orchestrator communities (Kubernetes, Mesos, Docker, and Cloud Foundry), decouples drivers completely: a driver is a set of gRPC services that the Kubelet and sidecars such as the external-provisioner call over a Unix socket. The Kubernetes core knows nothing about EBS volumes; the CSI driver does.
A typical CSI driver ships as a DaemonSet (node plugin) plus a Deployment (controller plugin), each paired with sidecar containers maintained by Kubernetes SIG Storage:
- external-provisioner: watches PVCs and calls CreateVolume on the driver.
- external-attacher: calls ControllerPublishVolume to attach the block device to the target node.
- external-resizer: triggers ControllerExpandVolume when a PVC storage request is increased.
- external-snapshotter: watches VolumeSnapshot objects and calls CreateSnapshot.
- node-driver-registrar: registers the node plugin socket with the Kubelet.
- liveness-probe: exposes a health endpoint for the driver pod.
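As a rough sketch of how these pieces fit together, a controller plugin Deployment might look like the manifest below. The driver image (example.com/my-csi-driver), the object names, the --endpoint flag, and the vX.Y.Z tags are all placeholders chosen for illustration; real drivers ship their own manifests or Helm charts, which also add RBAC, a ServiceAccount, and leader election.

```yaml
# Hypothetical controller plugin: names, images, and tags are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-csi-controller
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-csi-controller
  template:
    metadata:
      labels:
        app: example-csi-controller
    spec:
      containers:
        # The vendor's gRPC server; the endpoint flag name varies by driver.
        - name: csi-driver
          image: example.com/my-csi-driver:vX.Y.Z
          args: ["--endpoint=unix:///csi/csi.sock"]
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        # Sidecars talk to the driver over the shared Unix socket.
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:vX.Y.Z
          args: ["--csi-address=/csi/csi.sock"]
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-attacher
          image: registry.k8s.io/sig-storage/csi-attacher:vX.Y.Z
          args: ["--csi-address=/csi/csi.sock"]
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
      volumes:
        # An emptyDir is enough: the socket only needs to be shared within the Pod.
        - name: socket-dir
          emptyDir: {}
```

The point of the layout is that the driver container never talks to the Kubernetes API itself; the sidecars translate API objects into gRPC calls on the socket.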
Volume Lifecycle: Attach, Stage, Publish
When a Pod that needs a PVC is scheduled, the following sequence runs in strict order:
- CreateVolume: the external-provisioner calls the CSI controller driver to create the raw storage object (an EBS volume ID, a Ceph image, etc.).
- ControllerPublishVolume: the external-attacher calls the controller driver to attach the volume to the target node (this maps to an EC2 AttachVolume API call for EBS).
- NodeStageVolume: the Kubelet asks the node driver to format the block device and mount it to a global staging directory on the node. This happens once per volume per node.
- NodePublishVolume: the Kubelet bind-mounts the staged path into the Pod container filesystem. This happens once per Pod.
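Teardown runs the same steps in reverse: NodeUnpublishVolume, NodeUnstageVolume, then ControllerUnpublishVolume. DeleteVolume follows only if the PVC is deleted and the reclaim policy is Delete. Understanding this pipeline is critical when debugging a Pod stuck in ContainerCreating: check events on the Pod (kubectl describe pod), then the PVC, then the VolumeAttachment object, and finally the CSI node driver logs.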
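For orientation when debugging, this is roughly what a healthy VolumeAttachment object looks like. You do not write this object yourself; the attach/detach controller creates it and the external-attacher acts on it. The object name, node name, and PV name below are made up for illustration, and ebs.csi.aws.com is used only as an example driver.

```yaml
# Illustrative only: this object is created by Kubernetes, not by users.
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123abcd                  # hypothetical; real names are generated
spec:
  attacher: ebs.csi.aws.com           # CSI driver responsible for the attach
  nodeName: worker-node-1             # hypothetical target node
  source:
    persistentVolumeName: pvc-example # hypothetical PV name
status:
  attached: true                      # false while ControllerPublishVolume is pending
  # When an attach fails, status.attachError.message carries the driver's error.
```

If status.attached stays false, the problem usually sits at the ControllerPublishVolume step rather than with the node plugin.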
Volume Snapshots
CSI introduced a first-class Kubernetes API for volume snapshots via three Custom Resource Definitions, installed together with the snapshot controller that works alongside the external-snapshotter sidecar:
- VolumeSnapshotClass — driver-scoped configuration (analogous to a StorageClass), e.g. which snapshot policy to use.
- VolumeSnapshot — a user-facing request: "create a snapshot of this PVC right now."
- VolumeSnapshotContent — the actual snapshot object on the backend, either pre-provisioned by an admin or dynamically created.
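A minimal sketch of the three user-facing steps, assuming the snapshot CRDs are installed and the driver supports snapshots: define a class, request a snapshot of a PVC, and restore it into a new PVC. The names (example-snapclass, pgdata, databases, example-sc) and sizes are hypothetical, and ebs.csi.aws.com stands in for whatever driver you run.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: example-snapclass
driver: ebs.csi.aws.com        # must match the CSI driver that owns the volume
deletionPolicy: Retain         # keep the backend snapshot if the API object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap-1
  namespace: databases
spec:
  volumeSnapshotClassName: example-snapclass
  source:
    persistentVolumeClaimName: pgdata   # PVC in the same namespace
---
# Restore: a new PVC that uses the snapshot as its data source.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata-restored
  namespace: databases
spec:
  storageClassName: example-sc
  dataSource:
    name: pgdata-snap-1
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi            # must be at least the size of the snapshot
```

The VolumeSnapshotContent object is created automatically in the dynamic case; wait until the VolumeSnapshot reports status.readyToUse: true before restoring from it.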
Online Volume Resizing
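If a StorageClass has allowVolumeExpansion: true and the CSI driver implements ControllerExpandVolume and NodeExpandVolume, you can grow a PVC live without restarting the Pod by raising spec.resources.requests.storage on the PVC. Volumes can only grow; shrinking a PVC is not supported.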
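As an illustration, growing a claim comes down to re-applying the PVC with a larger request. The claim name, namespace, StorageClass name, and sizes here are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
  namespace: databases
spec:
  storageClassName: example-sc   # must have allowVolumeExpansion: true
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi             # previously 100Gi; only increases are accepted
```

After the change, the PVC's status.capacity catches up once both the controller-side and node-side expansion have finished; until then the PVC conditions show the resize in progress.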
Backup Strategies for Stateful Workloads
In large production environments, three backup patterns are typically used together rather than in isolation:
- Application-level backup: Use the database engine itself: pg_dump for PostgreSQL, mysqldump for MySQL, or Kafka topic mirroring. These are application-consistent, portable across cloud providers, and testable. Run them as CronJobs, stream output to object storage (S3, GCS), and encrypt at rest.
- CSI snapshots on a schedule: Kubernetes has no built-in snapshot scheduler, so use a tool like Velero or the operator pattern to schedule automated VolumeSnapshots. Snapshots are fast (typically copy-on-write on the backend) and restore in minutes. They complement application-level backups but should not replace them, because a snapshot captures the raw block device state, which is crash-consistent but may not be application-consistent if the database had unflushed writes.
- Velero (cluster backup): Velero backs up Kubernetes object manifests (Deployments, PVCs, Secrets, ConfigMaps) plus optionally PVC data via CSI snapshots or restic/kopia file-level backups. This is the standard approach for entire namespace or cluster migration. Velero stores manifests in object storage and is the most portable cross-cluster DR solution.
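The application-level pattern can be sketched as a CronJob that runs pg_dump nightly. This is a simplified example under several assumptions: the Secret pg-backup-credentials (holding a connection URI under the key url), the scratch PVC pg-backup-scratch, and the schedule are all hypothetical, and the upload to object storage plus encryption is left out because it depends on the tooling you use.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-dump-nightly
  namespace: databases
spec:
  schedule: "0 2 * * *"          # every night at 02:00
  concurrencyPolicy: Forbid      # never run two dumps at once
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -e
                  # Custom-format dump, named by date; DATABASE_URL is a connection URI.
                  pg_dump --format=custom --file="/backup/appdb-`date +%F`.dump" "$DATABASE_URL"
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: pg-backup-credentials   # hypothetical Secret
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: pg-backup-scratch        # hypothetical scratch PVC
```

A dump written this way can be restored with pg_restore into any PostgreSQL instance, which is what makes it portable across clusters and providers.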
Production Failure Modes for Stateful Workloads
The following failure modes are among the most common causes of storage incidents in production Kubernetes clusters:
- Detach stuck on node drain: When you drain a node, Kubernetes calls ControllerUnpublishVolume (detach) once the volume has been unmounted. If the node is unresponsive (kernel panic, network partition), Kubernetes cannot confirm the unmount, so the volume stays attached to the dead node. The attach/detach controller waits about 6 minutes by default before it will force a detach. The mitigation is to use kubectl delete node <name> after the node is confirmed dead, which allows the VolumeAttachment object to be cleaned up and triggers a re-attach on a healthy node.
- Multi-attach error: Block volumes (EBS, Azure Disk) generally support only ReadWriteOnce. If a StatefulSet Pod is rescheduled before the original Pod fully terminates, Kubernetes tries to attach the same volume to two nodes simultaneously and the attach fails with a Multi-Attach error. Fix: set a pod-level terminationGracePeriodSeconds that gives the application time to flush writes and exit cleanly.
- Filesystem corruption on force-delete: Deleting a Pod with --grace-period=0 --force skips the graceful shutdown hooks. If a database was mid-write, the block device may have a dirty journal. Always let the database engine handle shutdown (SIGTERM → checkpoint → exit) and never force-delete stateful Pods unless the node is confirmed dead.
- CSI node plugin failure: The node plugin runs on the same node as the workloads it serves. If it crashes, the Kubelet cannot call NodePublishVolume and new Pods on that node will hang in ContainerCreating. Monitor the CSI DaemonSet and protect it with a PodDisruptionBudget to prevent accidental node-plugin eviction.
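The graceful-shutdown advice above might be expressed in a StatefulSet like the following sketch. The 120-second grace period is an arbitrary illustrative value that you would size to your database's checkpoint time, and the names, Secret, and StorageClass are hypothetical.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      # Time between SIGTERM and SIGKILL; long enough to checkpoint and exit.
      terminationGracePeriodSeconds: 120
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret
                  key: password
            - name: PGDATA                     # subdirectory avoids lost+found on the mount root
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: example-sc
        resources:
          requests:
            storage: 100Gi
```

Because a StatefulSet will not create a replacement Pod until the old one is gone, a sufficient grace period also reduces the window in which a Multi-Attach error can occur.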
(1) Set allowVolumeExpansion: true from day one so that PVCs can be grown without first having to change the StorageClass. (2) Schedule automated VolumeSnapshots with a TTL. (3) Run weekly restore drills from Velero backups into a staging namespace. (4) Set reclaimPolicy: Retain on production StorageClasses so that deleting a PVC does not immediately destroy the underlying disk. (5) Use volumeBindingMode: WaitForFirstConsumer to ensure volumes are provisioned in the same AZ as the Pod that consumes them; an EBS volume can only be attached to nodes in its own AZ, so a volume created in the wrong zone leaves the Pod unschedulable.
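The StorageClass settings named above fit into a single object. This is a sketch using the AWS EBS CSI driver as an example; the class name and the gp3 volume type are illustrative choices.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                                # driver-specific parameter
allowVolumeExpansion: true                 # permits growing PVCs later
reclaimPolicy: Retain                      # PVs survive PVC deletion
volumeBindingMode: WaitForFirstConsumer    # provision in the Pod's zone
```

With Retain, released volumes must be cleaned up manually, so pair this setting with a process for reviewing and deleting orphaned disks.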