CSI & Stateful Storage Operations

The previous two lessons covered how Kubernetes abstracts storage into PersistentVolumes and how StorageClasses drive dynamic provisioning. This lesson goes one layer deeper, into the Container Storage Interface (CSI): the plugin system that makes every provisioner, snapshot provider, and cloud storage backend pluggable without touching the Kubernetes core. You will also learn about snapshots, backup patterns, and the failure modes that trip up teams running databases and message queues in production.

What Is CSI and Why It Replaced In-Tree Plugins

Before CSI, storage drivers (AWS EBS, GCE PD, Ceph RBD, NFS) lived inside the Kubernetes source tree. Every driver release was coupled to a Kubernetes release, so a bug fix in the Ceph driver had to wait for the next Kubernetes release. The CSI specification, developed jointly by several container orchestrator communities (Kubernetes, Mesos, Docker, and Cloud Foundry), decouples drivers completely: a driver is a set of gRPC services that the Kubelet and the CSI sidecar containers call over a Unix domain socket. The Kubernetes core knows nothing about EBS volumes; the CSI driver does.

A typical CSI driver ships as a DaemonSet (node plugin) plus a Deployment (controller plugin). The driver containers run alongside several sidecar containers maintained by Kubernetes SIG Storage:

external-provisioner: watches PVCs and calls CreateVolume (and DeleteVolume) on the driver.
external-attacher: watches VolumeAttachment objects and calls ControllerPublishVolume to attach the block device to the target node.
external-resizer: triggers ControllerExpandVolume when a PVC storage request is increased.
external-snapshotter: watches VolumeSnapshotContent objects, which the cluster-wide snapshot controller creates for each VolumeSnapshot, and calls CreateSnapshot.
node-driver-registrar: registers the node plugin socket with the Kubelet.
livenessprobe: exposes a health endpoint for the driver pod.
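
To see which of these sidecars a particular driver actually ships, you can list the container names in its controller and node Pods. The following is a minimal sketch: the helper names csi_pod_containers and csi_registered_drivers are made up for illustration, and the namespace and label selector in the usage line are assumptions that vary between drivers and install methods.

```shell
# Print "<pod>: <container> <container> ..." for every Pod matching a label selector.
# csi_pod_containers is a hypothetical helper, not part of kubectl.
csi_pod_containers() {
  ns="$1"        # namespace the driver is installed in
  selector="$2"  # label selector that matches the driver Pods
  kubectl -n "$ns" get pods -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'
}

# Cluster-wide view: which drivers are registered, and on which nodes.
csi_registered_drivers() {
  kubectl get csidrivers
  kubectl get csinodes
}
```

Usage, assuming the EBS CSI controller runs in kube-system with the label app=ebs-csi-controller: csi_pod_containers kube-system app=ebs-csi-controller. A driver that does not support snapshots or resizing will simply lack the corresponding sidecar.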

CSI architecture: the controller plugin provisions and attaches volumes; the node plugin mounts them into Pods.

Volume Lifecycle: Attach, Stage, Publish

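For a dynamically provisioned volume, the following CSI calls run in strict order before the Pod's containers start. With volumeBindingMode: Immediate, the first call already runs when the PVC is created; the others run once the Pod is scheduled: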

CreateVolume: the external-provisioner calls the CSI controller driver to create the raw storage object (an EBS volume ID, a Ceph image, etc.).
ControllerPublishVolume: the external-attacher calls the controller driver to attach the volume to the target node (this maps to an EC2 AttachVolume API call for EBS).
NodeStageVolume: the Kubelet asks the node driver to format the block device if needed and mount it to a global staging directory on the node. This happens once per volume per node.
NodePublishVolume: the Kubelet asks the node driver to bind-mount the staged path into the Pod's volume directory, from where it is exposed to the containers. This happens once per Pod.

Teardown runs in reverse: NodeUnpublishVolume, NodeUnstageVolume, ControllerUnpublishVolume, and finally DeleteVolume if the reclaim policy is Delete. Understanding this pipeline is critical when debugging a Pod stuck in ContainerCreating: check events on the Pod (kubectl describe pod), then the PVC, then the VolumeAttachment object, and finally the CSI node driver logs.
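
The debugging order described above can be wrapped in a small helper. This is a sketch, not a standard tool: debug_stuck_volume is a hypothetical function name, and the namespace and label selector of the CSI node plugin (third and fourth arguments) are driver-specific assumptions you have to supply.

```shell
# Walk the debugging path for a Pod stuck in ContainerCreating:
# Pod events -> PVCs -> VolumeAttachments -> CSI node plugin logs.
debug_stuck_volume() {
  ns="$1"; pod="$2"
  node_ns="$3"; node_selector="$4"  # where the CSI node plugin runs (driver-specific)

  echo "== Pod events"
  kubectl -n "$ns" describe pod "$pod"

  echo "== PVCs in namespace"
  kubectl -n "$ns" get pvc

  echo "== VolumeAttachments (cluster-scoped)"
  kubectl get volumeattachment

  echo "== CSI node plugin logs on the Pod's node"
  # Find the node the Pod was scheduled to, then the node plugin Pod on that node.
  node=$(kubectl -n "$ns" get pod "$pod" -o jsonpath='{.spec.nodeName}')
  plugin=$(kubectl -n "$node_ns" get pods -l "$node_selector" \
    --field-selector "spec.nodeName=$node" -o jsonpath='{.items[0].metadata.name}')
  kubectl -n "$node_ns" logs "$plugin" --all-containers --tail=50
}
```

Usage, assuming an EBS CSI node plugin labelled app=ebs-csi-node in kube-system: debug_stuck_volume production postgres-0 kube-system app=ebs-csi-node. If the Pod has not been scheduled yet, the node lookup returns nothing and the problem lies earlier, in provisioning or scheduling.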

Volume Snapshots

Kubernetes exposes CSI snapshots through a first-class API built on three Custom Resource Definitions. The CRDs and the cluster-wide snapshot controller are installed once per cluster; the external-snapshotter sidecar then runs with each driver:

VolumeSnapshotClass: driver-scoped configuration (analogous to a StorageClass), e.g. the deletion policy and driver-specific parameters.
VolumeSnapshot: a user-facing, namespaced request: "create a snapshot of this PVC right now."
VolumeSnapshotContent: the cluster-scoped object that represents the actual snapshot on the backend, either pre-provisioned by an admin or dynamically created.

# 1. Ensure the snapshot CRDs and snapshot controller are installed
#    (usually by the distribution or a cluster admin; some driver installers bundle them)
kubectl get crd volumesnapshots.snapshot.storage.k8s.io

# 2. Create a VolumeSnapshotClass (EBS CSI driver example)
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  tagSpecification_1: "Environment=production"
EOF

# 3. Take a snapshot of an existing PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-20250601
  namespace: production
spec:
  volumeSnapshotClassName: ebs-vsc
  source:
    persistentVolumeClaimName: postgres-data
EOF

# 4. Wait for readiness and inspect
kubectl -n production get volumesnapshot postgres-snap-20250601 -w
# The READYTOUSE column flips to true once the backend reports the snapshot as usable for a restore

# 5. Restore: create a new PVC from the snapshot
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restore
  namespace: production
spec:
  storageClassName: ebs-sc
  dataSource:
    name: postgres-snap-20250601
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF
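
In a script, watching with -w is awkward; blocking until the snapshot is ready is usually more convenient. The sketch below assumes kubectl 1.23 or newer, which added jsonpath conditions to kubectl wait; wait_snapshot_ready is a hypothetical helper name.

```shell
# Block until the snapshot reports status.readyToUse=true, or fail after the timeout.
# wait_snapshot_ready is a hypothetical helper, not part of kubectl.
wait_snapshot_ready() {
  ns="$1"; snap="$2"
  timeout="${3:-300s}"   # optional third argument, defaults to five minutes
  kubectl -n "$ns" wait "volumesnapshot/$snap" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="$timeout"
}
```

Usage: wait_snapshot_ready production postgres-snap-20250601 600s, followed by the restore step only if it returns successfully. On some kubectl versions the command may return an error if the status field has not been populated yet, so a short retry loop around it can be useful.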

Snapshot vs. Backup: A CSI snapshot is a point-in-time copy stored on the same backend (an EBS snapshot is stored in S3, but in the same AWS account and region). It protects against data corruption and accidental deletion, but not against region failure or account compromise. A snapshot is not a backup: you need an off-cluster copy (Velero, S3 cross-region replication, or a dedicated backup tool) for true disaster recovery.

Online Volume Resizing

If a StorageClass has allowVolumeExpansion: true and the CSI driver implements ControllerExpandVolume and NodeExpandVolume, you can grow a PVC live without restarting the Pod:

# Patch the PVC to request more storage
kubectl -n production patch pvc postgres-data -p \
  '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Watch the resize: CAPACITY updates when it completes. While the filesystem still has to grow,
# kubectl describe pvc shows the FileSystemResizePending condition, which then clears.
kubectl -n production get pvc postgres-data -w

# The CSI driver expands the block device first (ControllerExpandVolume),
# then the Kubelet calls NodeExpandVolume to grow the filesystem while it is mounted.
# For ext4/xfs the resize happens online; the Pod does not need to stop.

# Verify inside the running Pod
kubectl -n production exec -it postgres-0 -- df -h /var/lib/postgresql/data
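
Before patching a PVC, it helps to confirm that its StorageClass permits expansion, since the API server rejects the patch otherwise. A minimal sketch of such a pre-check follows; pvc_can_expand is a hypothetical helper name, and it checks only the StorageClass flag, not whether the driver implements the expand calls.

```shell
# Return 0 if the PVC's StorageClass has allowVolumeExpansion: true, 1 otherwise.
# pvc_can_expand is a hypothetical helper, not part of kubectl.
pvc_can_expand() {
  ns="$1"; pvc="$2"
  sc=$(kubectl -n "$ns" get pvc "$pvc" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "PVC $pvc has no storageClassName; cannot check expansion" >&2
    return 1
  fi
  allowed=$(kubectl get storageclass "$sc" -o jsonpath='{.allowVolumeExpansion}')
  if [ "$allowed" = "true" ]; then
    echo "StorageClass $sc allows expansion"
  else
    echo "StorageClass $sc does not allow expansion" >&2
    return 1
  fi
}
```

Usage: pvc_can_expand production postgres-data && kubectl -n production patch pvc postgres-data -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'.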

Volume shrinking is not supported. Kubernetes rejects any attempt to reduce a PVC's requested size. If you need to reclaim space, you must back up the data, create a new, smaller PVC, restore into it, and then delete the old one. Plan storage capacity carefully from the start; over-provisioning slightly is cheaper than a maintenance window.

Backup Strategies for Stateful Workloads

In large production environments, three backup patterns are typically combined rather than used in isolation:

Application-level backup: Use the database engine itself — pg_dump for PostgreSQL, mysqldump, or Kafka topic mirroring. These are application-consistent, portable across cloud providers, and testable. Run them as CronJobs, stream output to object storage (S3, GCS), and encrypt at rest.
CSI snapshot + scheduled VolumeSnapshots: Kubernetes has no built-in API for scheduling VolumeSnapshots, so use Velero or a snapshot-scheduling operator to create them automatically. Snapshots are fast (typically incremental or copy-on-write) and restore in minutes. They complement application-level backups but should not replace them: a snapshot captures the raw block device state, which is crash-consistent but may not be application-consistent if the database had unflushed writes.
Velero (cluster backup): Velero backs up Kubernetes object manifests (Deployments, PVCs, Secrets, ConfigMaps) plus optionally PVC data via CSI snapshots or restic/kopia file-level backups. This is the standard approach for entire namespace or cluster migration. Velero stores manifests in object storage and is the most portable cross-cluster DR solution.
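
The application-level pattern can be sketched as a CronJob that streams pg_dump output straight to S3. Everything named here is a placeholder rather than a recommendation: the image registry.example.com/pg-backup:16 is hypothetical (it is assumed to bundle bash, pg_dump and the AWS CLI), the Secret postgres-backup-env is assumed to supply the libpq PG* connection variables, the bucket my-pg-backups is invented, and AWS credentials are assumed to reach the Pod some other way (for example through its ServiceAccount). The helper render_pgdump_cronjob only prints the manifest.

```shell
# Print a CronJob manifest that runs pg_dump nightly and streams the dump to S3.
# All names below (image, Secret, bucket, schedule) are placeholders.
render_pgdump_cronjob() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-logical-backup
  namespace: production
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  concurrencyPolicy: Forbid      # never run two dumps at once
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              # Hypothetical image bundling bash, pg_dump and the AWS CLI
              image: registry.example.com/pg-backup:16
              envFrom:
                - secretRef:
                    name: postgres-backup-env   # assumed to set PGHOST, PGUSER, PGPASSWORD, PGDATABASE
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -euo pipefail
                  pg_dump --format=custom \
                    | aws s3 cp - "s3://my-pg-backups/postgres-$(date +%Y%m%d-%H%M).dump" --sse AES256
EOF
}
```

Usage: render_pgdump_cronjob | kubectl apply -f -. The pipefail option matters here: without it, a failed pg_dump would still upload a truncated object and the Job would report success.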

# Install Velero with the AWS plugin and CSI snapshot support
# (abbreviated; see the Velero docs for credentials and full IAM config).
# Velero releases before 1.14 also need the separate velero/velero-plugin-for-csi plugin.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --features=EnableCSI

# Create a scheduled backup of the production namespace every 6 hours
velero schedule create prod-6h \
  --schedule="0 */6 * * *" \
  --include-namespaces production \
  --ttl 72h

# Trigger an on-demand backup
velero backup create prod-manual-$(date +%Y%m%d-%H%M) \
  --include-namespaces production \
  --wait

# List backups and check status
velero backup get
velero backup describe prod-manual-20250601-1430 --details

# Restore a namespace into a new cluster or namespace
velero restore create --from-backup prod-manual-20250601-1430 \
  --namespace-mappings production:production-restored

Production Failure Modes for Stateful Workloads

The following failures are among the most common causes of storage incidents in production Kubernetes clusters:

Detach stuck on an unresponsive node: When a Pod leaves a node, Kubernetes detaches its volume through ControllerUnpublishVolume, but only after the Kubelet confirms that the volume has been unmounted. If the node is unresponsive (kernel panic, network partition), that confirmation never arrives, and the attach/detach controller waits 6 minutes by default before it force-detaches. Until then, replacement Pods on other nodes cannot attach the volume. The mitigation, once the node is confirmed dead, is to run kubectl delete node <name> or to apply the node.kubernetes.io/out-of-service taint; either one lets Kubernetes remove the stale Pods and VolumeAttachment objects so the volume can attach to a healthy node.
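Multi-Attach error: Block volumes such as EBS and Azure Disk are normally ReadWriteOnce, so they can be attached to only one node at a time. If a replacement Pod is scheduled onto a different node before the original Pod has fully terminated and its volume has detached (for example during a Deployment rolling update, or after a StatefulSet Pod is force-deleted), the second attach fails with a Multi-Attach error. Fix: let the old Pod finish first. Set a Pod-level terminationGracePeriodSeconds long enough for the application to flush writes and exit cleanly, and use the Recreate strategy for Deployments that mount a ReadWriteOnce volume.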
Filesystem corruption on force-delete: Deleting a Pod with --grace-period=0 --force skips the graceful shutdown hooks. If a database was mid-write, the block device may have a dirty journal. Always let the database engine handle shutdown (SIGTERM → checkpoint → exit) and never force-delete stateful Pods unless the node is confirmed dead.
CSI node plugin unavailable: The node plugin runs on the same node as the workloads it serves. If it crashes or is evicted, the Kubelet cannot call NodeStageVolume or NodePublishVolume, and new Pods on that node hang in ContainerCreating. Alert on unavailable Pods in the CSI DaemonSet, and run it with a high-priority class such as system-node-critical so that it is not evicted under node pressure.
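
For the stuck-detach case, two commands are usually enough to see the problem and unblock it. The sketch below uses hypothetical helper names. The out-of-service taint is part of the non-graceful node shutdown feature in recent Kubernetes releases, so check that your cluster version supports it, and apply it only after the node is verified to be powered off.

```shell
# List VolumeAttachments that target one node: name, node, PV, attached status.
# volume_attachments_on_node is a hypothetical helper, not part of kubectl.
volume_attachments_on_node() {
  node="$1"
  kubectl get volumeattachment \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached' \
    | awk -v node="$node" 'NR == 1 || $2 == node'   # keep the header and rows for this node
}

# Mark a node that is confirmed dead as out of service, so its Pods are removed and
# its volumes detached without waiting for the force-detach timeout.
# Remove the taint manually if the node is later repaired and rejoins the cluster.
mark_node_out_of_service() {
  node="$1"
  kubectl taint nodes "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
}
```

Usage: volume_attachments_on_node ip-10-0-1-23.ec2.internal (an invented node name) shows what is still attached to that node; mark_node_out_of_service with the same name then releases it.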

Operational checklist for stateful workloads in production: (1) Enable allowVolumeExpansion: true from day one. The field can be patched onto an existing StorageClass later, but setting it up front avoids discovering the gap in the middle of a disk-full incident. (2) Schedule automated VolumeSnapshots with a TTL. (3) Run weekly restore drills from Velero backups into a staging namespace. (4) Set reclaimPolicy: Retain on the StorageClass (persistentVolumeReclaimPolicy: Retain on existing PVs) so that deleting a PVC does not immediately destroy the underlying disk. (5) Use volumeBindingMode: WaitForFirstConsumer so that volumes are provisioned in the same AZ as the Pod that consumes them. An EBS volume can be attached only to instances in its own AZ, so a volume created in the wrong zone leaves the Pod unschedulable.
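
Checklist items 1, 4 and 5 all live on the StorageClass, so they can be captured in one manifest. The sketch below reuses the ebs-sc name from the restore example earlier in this lesson; the gp3 volume type and the encryption setting are assumptions, and render_prod_storageclass is a hypothetical helper that only prints the manifest.

```shell
# Print a StorageClass that encodes checklist items 1, 4 and 5 for the EBS CSI driver.
# The parameters block is an assumption; adjust it to your driver and volume type.
render_prod_storageclass() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true               # checklist item 1
reclaimPolicy: Retain                    # checklist item 4
volumeBindingMode: WaitForFirstConsumer  # checklist item 5
parameters:
  type: gp3          # assumed volume type
  encrypted: "true"  # assumed; encrypts volumes at rest
EOF
}
```

Usage: render_prod_storageclass | kubectl apply -f -. With reclaimPolicy: Retain, released PVs and their disks stay behind after a PVC is deleted, so they need a separate cleanup process.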