Kubernetes Networking & Storage

StorageClasses & Dynamic Provisioning

In the previous lesson you learned that PersistentVolumes and PersistentVolumeClaims decouple pod definitions from the physical storage underneath. But statically pre-provisioning a PV for every application team that asks for storage is operationally unsustainable at scale — it requires a human in the loop for every new database or cache deployment. Dynamic provisioning solves this: a developer submits a PersistentVolumeClaim describing the size and characteristics they need, and a provisioner — a controller running in the cluster — creates the backing storage asset automatically, binds it to a freshly minted PV, and returns a ready-to-mount volume. The API object that governs this behavior is the StorageClass.
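As a rough sketch of the developer's side of this flow, the claim below asks for 50Gi from a class named ebs-gp3. The claim name, namespace, and size are illustrative placeholders, and the class is assumed to already exist in the cluster.

```yaml
# Hypothetical PVC: a request for storage that says nothing about disks or drivers.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-data      # placeholder name
  namespace: production       # placeholder namespace
spec:
  accessModes:
    - ReadWriteOnce           # mountable read-write by a single node
  storageClassName: ebs-gp3   # the StorageClass that should provision the volume
  resources:
    requests:
      storage: 50Gi           # requested capacity
```

Nothing in the claim names a provisioner or a disk type; those details live in the StorageClass, which is what lets platform teams change them without touching application manifests.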

Anatomy of a StorageClass

A StorageClass is a cluster-scoped resource (no namespace) that names three things: the provisioner (which driver creates the volume), the parameters (driver-specific configuration like disk type, IOPS tier, or encryption key), and the reclaimPolicy (what happens to the underlying storage asset when the PVC that owns it is deleted).

```yaml
# Real-world StorageClass for AWS EBS gp3 via the EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"  # every PVC without a storageClassName gets this
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # critical for multi-AZ clusters (explained below)
reclaimPolicy: Delete                    # Delete | Retain (Recycle is a deprecated, PV-only policy)
allowVolumeExpansion: true
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"                      # MiB/s
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"
---
# A premium StorageClass for latency-sensitive workloads (io2 Block Express)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-io2-high-perf
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain                    # keep the EBS volume even if the PVC is deleted, for auditing/recovery
allowVolumeExpansion: true
parameters:
  type: io2
  iopsPerGB: "50"
  encrypted: "true"
```
The default StorageClass matters. Any PVC that omits storageClassName receives the cluster default. On EKS clusters created before Kubernetes 1.30, the out-of-the-box default is an older gp2 StorageClass: gp2 predates gp3, costs more per GB, and has lower baseline performance at small volume sizes. (Newer EKS clusters still ship the gp2 class but no longer mark it as the default.) One of the first infrastructure changes at an EKS shop should be to make a gp3 StorageClass the default and annotate the gp2 one as non-default. Otherwise every PVC that omits storageClassName lands on a gp2 volume, and teams quietly pay more for less performance.
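One way to demote an existing default is a small patch file, sketched below. It assumes the old class is named gp2 (the name EKS uses) and that the file name is your choice; only the annotation is changed, because most other StorageClass fields are immutable.

```yaml
# gp2-non-default.yaml (hypothetical file name)
# Apply with: kubectl patch storageclass gp2 --patch-file gp2-non-default.yaml
# Only the annotation changes; provisioner and parameters stay as they are.
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
```

Afterwards, kubectl get storageclass should show the "(default)" marker on exactly one class. If two classes carry the annotation at the same time, the outcome depends on the Kubernetes version, so it is safer to demote the old class first.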

Provisioners: In-Tree vs. CSI

Historically Kubernetes shipped volume drivers compiled into its core binaries (called "in-tree" provisioners, e.g. kubernetes.io/aws-ebs). The in-tree cloud drivers are deprecated, and several have already been removed from recent releases (the AWS EBS driver was removed in Kubernetes 1.27). Where the old provisioner name still appears in a StorageClass, CSI migration redirects it to the matching CSI driver. The modern replacement is the Container Storage Interface (CSI), a vendor-neutral gRPC spec that lets storage vendors ship their drivers as ordinary Kubernetes workloads (Deployments and DaemonSets), fully decoupled from the Kubernetes release cycle. Every production cluster today should use CSI drivers exclusively.

Common CSI provisioners you will encounter:

  • ebs.csi.aws.com — AWS EBS (block)
  • efs.csi.aws.com — AWS EFS (file, ReadWriteMany)
  • disk.csi.azure.com — Azure Managed Disks
  • pd.csi.storage.gke.io — GCP Persistent Disk
  • rbd.csi.ceph.com — Ceph RBD (self-managed)
  • driver.longhorn.io — Longhorn (self-managed, replicated block)

volumeBindingMode: The Multi-AZ Trap

This field is the most common source of storage-related production incidents on cloud clusters. It has two values:

  • Immediate: the provisioner creates and binds the PV as soon as the PVC is created. This happens before any pod is scheduled, so the volume is created in an arbitrary AZ. From then on the scheduler can only place the pod on nodes in that AZ. If the pod's only feasible nodes are in another AZ (because of capacity, taints, node selectors, or anti-affinity), the EBS volume is not reachable from them and the pod stays Pending with a volume node affinity conflict error.
  • WaitForFirstConsumer: PV creation is deferred until a pod that references the PVC is being scheduled. The scheduler picks the node first, then the provisioner creates the volume in the same AZ as that node. This is the mode to use for zone-aware block storage on multi-AZ clusters.
[Figure: WaitForFirstConsumer binding prevents AZ mismatch. Immediate binding (broken): PVC created, provisioner creates the EBS volume in us-east-1a, PV bound in us-east-1a, scheduler picks a node in us-east-1b, pod Pending with a volume affinity conflict. WaitForFirstConsumer (correct): PVC created, PV pending (not yet provisioned), scheduler picks a node in us-east-1c, provisioner creates the EBS volume in us-east-1c, pod Running with the volume mounted.]
Immediate binding creates the EBS volume before scheduling, causing AZ conflicts. WaitForFirstConsumer defers creation until the node is known, ensuring volume and node are co-located.
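To make the "first consumer" concrete, here is a minimal sketch of a pod that references a claim. The pod name, image, and command are placeholders, and the claim my-database-data is assumed to use a WaitForFirstConsumer class. Until a pod like this is created, the PVC is expected to sit in Pending with no PV behind it.

```yaml
# Hypothetical consumer pod: creating it is what triggers provisioning
# under volumeBindingMode: WaitForFirstConsumer.
apiVersion: v1
kind: Pod
metadata:
  name: my-database-0          # placeholder name
  namespace: production
  labels:
    app: my-database
spec:
  containers:
    - name: app
      image: busybox:1.36      # placeholder image, stands in for a real database
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data     # where the volume appears inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-database-data   # the PVC whose binding was deferred
```

Once the scheduler has chosen a node for this pod, the provisioner creates the volume in that node's zone, and kubectl get pvc should show the claim move from Pending to Bound.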

Reclaim Policies: What Happens When a PVC Is Deleted

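The reclaimPolicy on a StorageClass controls the lifecycle of the backing storage asset after the PVC that bound it is deleted. PersistentVolumes define three policies, but Recycle is deprecated and does not apply to dynamically provisioned volumes, so a StorageClass accepts only two: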

  • Delete (the default when a StorageClass does not set reclaimPolicy): the CSI driver deletes the underlying asset (e.g. the EBS volume) as soon as the PV is released. This is efficient but dangerous: deleting the wrong PVC irrecoverably destroys production data. Always pair Delete-policy StorageClasses with safeguards, such as regular backups (e.g. with Velero) or a validating webhook that prevents PVC deletion in critical namespaces.
  • Retain: the PV transitions to the Released phase and the underlying storage asset is preserved. An administrator must manually delete the PV object and optionally clean up the backing resource. This is the correct policy for any tier that holds data you cannot afford to lose: databases, blob stores, audit logs. The retained data can be re-attached by creating a new PV that points at the same backing resource and a new PVC whose spec.volumeName names that PV.
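As one possible shape for such a deletion guard, the sketch below uses a ValidatingAdmissionPolicy (a built-in, CEL-based alternative to running a webhook, assuming Kubernetes 1.30 or newer) rather than the validating webhook mentioned above. The policy names, the example.com/allow-delete annotation, and the choice of the production namespace are all assumptions made for illustration.

```yaml
# Hypothetical guard: reject PVC deletion in "production" unless the PVC
# carries an explicit opt-in annotation.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: protect-pvc-deletion            # placeholder name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["persistentvolumeclaims"]
  validations:
    # On DELETE the object being removed is exposed as oldObject.
    - expression: >-
        has(oldObject.metadata.annotations) &&
        'example.com/allow-delete' in oldObject.metadata.annotations
      message: "Annotate the PVC with example.com/allow-delete before deleting it."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: protect-pvc-deletion-production # placeholder name
spec:
  policyName: protect-pvc-deletion
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: production
```

A guard like this also blocks the deletions issued when the namespace itself is deleted, so treat it as a starting point to adapt rather than a drop-in control.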
Reclaim policy is set at PV creation time, not at deletion time. The StorageClass reclaimPolicy is copied into the PV at provisioning time. Changing the StorageClass afterwards has no effect on existing PVs. If you need to change the reclaim policy on an already-bound PV (e.g. you provisioned with Delete but want Retain before deleting the PVC), patch the PV directly: kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'. Do this before deleting the PVC.
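The re-attachment of a retained volume can be sketched as a statically defined PV plus a claim pinned to it. The volume ID, zone, names, and size below are hypothetical; the class name is assumed to match an existing StorageClass so the PV and PVC agree on storageClassName.

```yaml
# Hypothetical static PV that points at an EBS volume left behind by Retain.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: recovered-db-data               # placeholder name
spec:
  capacity:
    storage: 100Gi                      # should match the real volume size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ebs-io2-high-perf
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0 # hypothetical EBS volume ID
    fsType: ext4
  nodeAffinity:                         # EBS volumes are zonal: pin pods to the volume's AZ
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a            # hypothetical zone of the volume
---
# New claim that binds to exactly that PV instead of triggering provisioning.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-data
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-io2-high-perf
  volumeName: recovered-db-data         # bind to the PV above by name
  resources:
    requests:
      storage: 100Gi
```

The old Released PV object has to be deleted first (the backing volume survives under Retain); otherwise two PV objects would describe the same disk.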

Volume Expansion: Growing a PVC Without Downtime

When a volume fills up, the correct response is not to provision a new one and migrate data — it is to expand the existing PVC in place. For this to work, three things must be true: the StorageClass must have allowVolumeExpansion: true, the CSI driver must implement the ControllerExpandVolume and NodeExpandVolume RPC calls, and (for filesystem volumes) the node-side expansion must complete while the volume is mounted.

Modern CSI drivers for all major cloud providers support online expansion. With the EBS CSI driver, for example, the EBS volume is resized via the AWS API and the filesystem (ext4 or XFS) is grown without unmounting. For older or self-managed drivers that only support offline expansion, you must scale the workload down first, allow the volume to detach, then patch the PVC and bring the pod back up.

```shell
# 1. Verify the StorageClass allows expansion
kubectl get sc ebs-gp3 -o jsonpath='{.allowVolumeExpansion}'
# Expected: true

# 2. Edit the PVC to request more storage
kubectl edit pvc my-database-data -n production
# Under spec.resources.requests.storage, change "50Gi" to "100Gi"
# Or patch it imperatively:
kubectl patch pvc my-database-data -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# 3. Watch the resize happen: CAPACITY updates once the resize completes
kubectl get pvc my-database-data -n production -w
# The PVC conditions (visible in `kubectl describe pvc`) cycle through:
#   Resizing -> FileSystemResizePending -> (removed once done)

# 4. Confirm the new size is visible from inside the pod
kubectl exec -n production deployment/my-database -- df -h /data
# The filesystem size should now be roughly 100G

# 5. Recover from a stuck resize (FileSystemResizePending but pod never sees new size)
# This happens when node expansion stalled. Delete the pod so the volume detaches
# and re-attaches, triggering node-level resize on mount:
kubectl delete pod -n production -l app=my-database
# Pod restarts, mounts the volume, CSI driver calls NodeExpandVolume, FS is resized
```
Set PVC storage requests conservatively and rely on expansion rather than over-provisioning. Starting with 20 Gi and expanding to 50 Gi when needed costs less and keeps storage inventory tidy. Configure alerts on kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80 in Prometheus so you expand before the disk fills, not after the application crashes.
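The alert expression above could be wrapped in a Prometheus rule file along these lines. The group name, alert name, severity label, and the 10-minute hold are assumptions; the metric names and the 0.80 threshold come from the text.

```yaml
# Hypothetical Prometheus rule file: warn when any PVC is more than 80% full.
groups:
  - name: pvc-capacity                  # placeholder group name
    rules:
      - alert: PersistentVolumeFillingUp
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.80
        for: 10m                        # assumed hold time, to avoid flapping on short spikes
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is over 80% full"
```

Both metrics are exported by the kubelet with namespace and persistentvolumeclaim labels, so the division matches series per claim and the alert names the volume that needs expanding.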

Putting It Together: StorageClass Design at Production Scale

A mature cluster typically has three to five StorageClasses that map to distinct cost and performance tiers. A representative design for an AWS EKS cluster might look like this: a default ebs-gp3 StorageClass for general workloads, an ebs-io2-high-perf class for databases that require predictable IOPS, an efs-shared class for read-write-many workloads like ML training data or shared config mounts, and a local-nvme class (backed by the local-path provisioner or the TopoLVM CSI driver) for latency-critical scratch workloads that tolerate data loss on node failure.
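The block-storage tiers follow the EBS pattern shown earlier in this lesson; the shared-file tier looks a little different. Below is a minimal sketch of an efs-shared class and a claim against it, assuming the EFS CSI driver is installed and an EFS file system already exists. The file system ID, claim name, and namespace are hypothetical.

```yaml
# Hypothetical shared-file tier backed by the EFS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap                # one EFS access point per provisioned volume
  fileSystemId: fs-0123456789abcdef0      # hypothetical, must be an existing file system
  directoryPerms: "700"
---
# A claim that many pods, on many nodes, can mount at the same time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data                     # placeholder name
  namespace: ml                           # placeholder namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-shared
  resources:
    requests:
      storage: 100Gi                      # required by the API; EFS itself is elastic
```

EFS is a regional service, so this class does not need WaitForFirstConsumer the way zonal EBS classes do.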

Every StorageClass that holds production data you cannot recreate should set reclaimPolicy: Retain and allowVolumeExpansion: true, and every zonal block-storage class should set volumeBindingMode: WaitForFirstConsumer. Together these settings prevent accidental data loss, the operational pain of re-provisioning volumes for capacity increases, and AZ mismatches.

Snapshot support goes hand-in-hand with StorageClasses. When the CSI driver implements the snapshot RPCs and the cluster has the VolumeSnapshot CRDs and the snapshot controller installed, you can create a VolumeSnapshotClass (analogous to a StorageClass) and take crash-consistent snapshots of PVCs with kubectl apply -f volumesnapshot.yaml. Snapshots are the foundation of backup-and-restore pipelines for stateful workloads and are a prerequisite for safely testing volume migrations or schema changes in production clusters.
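A volumesnapshot.yaml for the EBS CSI driver might look roughly like the sketch below. The class name, snapshot name, and target PVC are placeholders.

```yaml
# Hypothetical snapshot class: tells the snapshot controller which driver to call.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots                   # placeholder name
driver: ebs.csi.aws.com
deletionPolicy: Retain                  # keep the EBS snapshot if the VolumeSnapshot object is deleted
---
# A point-in-time snapshot of one PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-database-data-snap           # placeholder name
  namespace: production
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: my-database-data   # the PVC to snapshot
```

A new PVC can later be restored from the snapshot by referencing it in spec.dataSource. The snapshot is crash-consistent only, so quiesce or flush the database first if you need application-level consistency.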