Deployment Strategies & Progressive Delivery

Rolling Deployments

A rolling deployment replaces instances of the old version with the new version incrementally — one batch at a time — so that the application is never fully offline. It is the default deployment strategy in Kubernetes, Amazon ECS, and most managed compute platforms because it hits a practical sweet spot: zero downtime, minimal blast radius for a bad release, and no requirement for double the infrastructure (unlike blue-green).

Understanding the mechanics thoroughly — not just the happy path — is what separates engineers who configure a rolling deployment from engineers who can operate one safely in production.

The Mechanics: Surge, Unavailable, and Batch Size

Two parameters control the entire roll. In Kubernetes they live on the Deployment's spec.strategy.rollingUpdate stanza:

maxUnavailable is the maximum number of pods (an absolute number, or a percentage of replicas) that may be unavailable, meaning not Ready, at any moment during the rollout. Setting this to 0 means an old pod is never killed until a new one is confirmed healthy. This protects capacity, at the cost of requiring at least one extra "surge" pod to exist.
maxSurge is the maximum number of pods (again an absolute number or a percentage) that may exist above the desired replica count at any moment. Setting this to 1 means Kubernetes may briefly run replicas + 1 pods, which is how it safely creates the new pod before terminating the old one.

The two settings trade capacity risk against speed. At one extreme, maxUnavailable: 25%, maxSurge: 0 kills a quarter of pods first, then fills them — fast but briefly under capacity. At the other, maxUnavailable: 0, maxSurge: 1 always adds before removing — zero capacity loss but uses more nodes momentarily. Most production systems pick a middle ground based on their resource headroom and SLO.

Key idea: maxUnavailable and maxSurge cannot both be zero, because that would make progress impossible. Kubernetes validates this and rejects such a config. The defaults (maxSurge: 25%, maxUnavailable: 25%) are reasonable for stateless services but need tuning for anything with strict availability requirements.
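
How the percentage forms turn into pod counts is easy to get wrong. Here is a minimal Python sketch of the rounding, assuming the behavior described in the Kubernetes documentation: a maxSurge percentage rounds up, a maxUnavailable percentage rounds down, and if both resolve to zero the controller falls back to one unavailable pod. The function name resolve_rolling_update is hypothetical and exists only for illustration.

```python
def resolve_rolling_update(replicas, max_surge, max_unavailable):
    """Return (surge_pods, unavailable_pods) as absolute pod counts.

    Each setting may be an int (absolute count) or a string such as "25%".
    """
    def to_count(value, round_up):
        if isinstance(value, int):
            return value
        percent = int(value.rstrip("%"))
        scaled = replicas * percent
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100

    surge = to_count(max_surge, round_up=True)               # rounds up
    unavailable = to_count(max_unavailable, round_up=False)  # rounds down
    if surge == 0 and unavailable == 0:
        # Assumed fallback: without it the rollout could never make progress.
        unavailable = 1
    return surge, unavailable


# Defaults on a 10-replica Deployment: up to 13 pods, at least 8 available.
print(resolve_rolling_update(10, "25%", "25%"))  # (3, 2)
# The conservative setting from the text: add one pod, never remove first.
print(resolve_rolling_update(4, 1, 0))           # (1, 0)
```

Note that the two defaults are not symmetric on a 10-replica Deployment: the same 25% yields three surge pods but only two unavailable pods.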

Connection Draining: Why It Matters

When Kubernetes marks a pod for termination, two things start in parallel: the pod is removed from the Service's EndpointSlices so that new connections stop being routed to it, and the kubelet begins shutting the container down. Endpoint removal takes time to reach every proxy and load balancer, and requests already in flight on that pod are still being processed. Without draining, those requests are reset mid-flight.

The solution is connection draining, achieved through two cooperating mechanisms:

terminationGracePeriodSeconds on the pod: the total time Kubernetes allows for shutdown, counted from the moment termination begins, before it force-kills the container with SIGKILL. The preStop hook runs inside this window, and SIGTERM is sent once the hook returns. Defaults to 30 seconds. Your application must handle SIGTERM and shut down gracefully: stop accepting new connections, finish active requests, then exit cleanly.
preStop lifecycle hook: a small sleep (typically 5–15 seconds) that runs before SIGTERM reaches the container. This window covers the propagation delay between the endpoint being removed from the Service's EndpointSlices and every upstream component (kube-proxy, Envoy sidecars, cloud LB) updating its routing. Without the sleep, the load-balancer layer keeps sending new requests to the pod in the gap after its endpoint is removed but before every hop has caught up, by which point the pod may already have stopped listening.
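
The SIGTERM half of that contract lives in application code. The following is a minimal Python sketch, not a drop-in implementation: the Drainer class and install_sigterm_handler function are hypothetical names, and a real server framework would call request_started and request_finished from its own request lifecycle.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self.accepting = True

    def request_started(self):
        """Register a request; returns False once shutdown has begun."""
        with self._cond:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def request_finished(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def begin_shutdown(self):
        # A plain attribute write, so it is safe to call from a signal handler.
        self.accepting = False

    def wait_drained(self, timeout):
        """Block until no requests are in flight; False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


def install_sigterm_handler(drainer):
    """Stop accepting new work when Kubernetes sends SIGTERM."""
    signal.signal(signal.SIGTERM, lambda signum, frame: drainer.begin_shutdown())
```

A main loop would call install_sigterm_handler(drainer) at startup and, once drainer.accepting turns False, call drainer.wait_drained(timeout) with a timeout shorter than the remaining grace period before exiting.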

Production pitfall: Skipping the preStop sleep is one of the most common causes of 502/504 errors during rolling deployments. The EndpointSlice update and the load balancer flush are not instantaneous; the propagation window can run from about 1 to 15 seconds depending on cluster size and LB type. Engineers who skip it often see a clean rollout in staging (small clusters, fast propagation) and a burst of errors in production (large clusters, slow propagation). Always add the sleep.

# production-grade Deployment manifest
# Demonstrates: rolling strategy, surge/unavailable, drain lifecycle

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # at most 12 pods running simultaneously (10 + 2)
      maxUnavailable: 0    # never go below 10 healthy pods; zero capacity loss
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      terminationGracePeriodSeconds: 60   # must exceed preStop sleep + longest request
      containers:
        - name: api
          image: registry.example.com/api-server:v2.4.1
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3           # pod must fail 3 consecutive checks to be pulled
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
                # 10-second window for LB flush before SIGTERM reaches the process
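
The comment on terminationGracePeriodSeconds in the manifest is really a small budget calculation, since the preStop sleep is spent out of the same grace period as the application's own shutdown. A quick Python sketch of that arithmetic follows. The 30-second longest request is an assumed figure for illustration, not something taken from the manifest, and the function name is hypothetical.

```python
def shutdown_slack_seconds(grace_period, pre_stop_sleep, longest_request):
    """Seconds left before SIGKILL once the preStop sleep has elapsed and
    the slowest in-flight request has finished. A negative result means
    the container would be force-killed mid-request."""
    return grace_period - (pre_stop_sleep + longest_request)


# Values from the manifest: 60 s grace period, 10 s preStop sleep.
# longest_request=30 is an assumption for this example.
print(shutdown_slack_seconds(60, 10, 30))  # 20
# With the default 30 s grace period the same workload would be cut short.
print(shutdown_slack_seconds(30, 10, 30))  # -10
```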

The Roll: Step-by-Step Diagram

The diagram below traces a 4-replica deployment rolling from v1 to v2 with maxSurge: 1, maxUnavailable: 0. Each column is one Kubernetes reconciliation tick.

[Figure: Rolling wave with maxSurge=1, maxUnavailable=0: capacity never drops below 4 replicas; each old pod drains before termination.]
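
The same wave can be traced in a few lines of Python. This is a deliberately simplified model, not the Deployment controller's actual algorithm: it assumes every new pod becomes ready within the tick in which it is created, and the function name simulate_roll is hypothetical.

```python
def simulate_roll(replicas, max_surge, max_unavailable):
    """Return one (old_pods, new_pods, peak_pods) tuple per tick."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    min_available = replicas - max_unavailable
    ticks = []
    while old > 0 or new < replicas:
        # Scale up: create new pods without exceeding replicas + maxSurge.
        room = replicas + max_surge - (old + new)
        new += min(replicas - new, room)
        peak = old + new
        # Scale down: once the new pods are ready, remove old pods while
        # keeping at least min_available pods serving.
        old -= min(old, peak - min_available)
        ticks.append((old, new, peak))
    return ticks


for old, new, peak in simulate_roll(4, max_surge=1, max_unavailable=0):
    print(f"v1={old} v2={new} peak={peak}")
# v1=3 v2=1 peak=5
# v1=2 v2=2 peak=5
# v1=1 v2=3 peak=5
# v1=0 v2=4 peak=5
```

Each tick ends with exactly four pods serving, and the pod count peaks at five while the surge pod and the pod it replaces briefly coexist.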

Readiness Probes: The Safety Gate

With maxUnavailable: 0, the rolling controller will not remove an old pod until a new one passes its readiness probe; the probe is what makes that setting meaningful. If your readiness probe is wrong (it always passes, checks the wrong path, or hits an endpoint that answers before the application has finished initializing), the controller considers a booting pod "ready" and starts pulling capacity before the service can actually handle traffic.

In many large production environments, readiness probes check a deep health endpoint that validates database connectivity, cache reachability, and critical dependency status, not just "is the HTTP port open." A pod that answers HTTP but cannot reach its database is not ready to serve traffic and must not receive it.

Pro practice: Separate your liveness and readiness probes. The liveness probe answers "is this pod deadlocked and needs to be restarted?" — it should be cheap and permissive. The readiness probe answers "is this pod capable of serving real traffic right now?" — it should be thorough. Conflating the two is a common misconfiguration that causes cascading restarts during dependency outages.
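
As a rough illustration of that split, the Python sketch below keeps liveness trivial and makes readiness depend on a list of dependency checks. The /healthz/live and /healthz/ready paths come from the manifest earlier in this lesson; the check functions, the HealthHandler class, and the CHECKS list are hypothetical placeholders for whatever your service actually depends on.

```python
from http.server import BaseHTTPRequestHandler


def liveness_status():
    """Cheap and permissive: if this code runs at all, the process is alive."""
    return 200


def readiness_status(checks):
    """Thorough: 200 only if every dependency check passes, else 503."""
    for check in checks:
        try:
            if not check():
                return 503
        except Exception:
            # A check that raises counts as a failed dependency.
            return 503
    return 200


# Hypothetical dependency checks; replace with real connectivity tests.
CHECKS = [lambda: True]


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz/live":
            status = liveness_status()
        elif self.path == "/healthz/ready":
            status = readiness_status(CHECKS)
        else:
            status = 404
        self.send_response(status)
        self.end_headers()
```

Because liveness never consults the dependency checks, a database outage takes pods out of rotation without also triggering restarts.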

Monitoring the Rollout and Rollback

Never kick off a rolling deployment and walk away. Use kubectl rollout status to follow the wave live. Set a deadline with spec.progressDeadlineSeconds (default 600): if the rollout makes no progress within that window, Kubernetes marks the Deployment as stalled by setting its Progressing condition to False with reason ProgressDeadlineExceeded. Kubernetes takes no further action on its own, so your CD system should treat that condition as a failure and trigger the rollback.
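
A CD system can detect the stalled state by inspecting the Deployment's status conditions. The Python sketch below assumes the conditions have already been fetched and parsed from JSON (for example from kubectl get deployment -o json); the function name is hypothetical. The Deployment controller reports the stall on the Progressing condition with the reason ProgressDeadlineExceeded.

```python
def progress_deadline_exceeded(conditions):
    """True if the Deployment's status conditions report a stalled rollout."""
    return any(
        c.get("type") == "Progressing"
        and c.get("status") == "False"
        and c.get("reason") == "ProgressDeadlineExceeded"
        for c in conditions
    )


# Shape of .status.conditions for a stalled rollout (abbreviated).
stalled = [
    {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
]
print(progress_deadline_exceeded(stalled))  # True
```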

# Watch a rollout in real time
kubectl rollout status deployment/api-server -n production --timeout=300s

# Pause mid-roll (manual canary gate — stop the wave, observe metrics)
kubectl rollout pause deployment/api-server -n production

# Resume after verifying error rates / latency look clean
kubectl rollout resume deployment/api-server -n production

# Immediate rollback to the previous ReplicaSet (no re-deploy needed)
kubectl rollout undo deployment/api-server -n production

# Roll back to a specific revision
kubectl rollout history deployment/api-server -n production
kubectl rollout undo deployment/api-server -n production --to-revision=3

# See current status conditions (look for reason ProgressDeadlineExceeded)
kubectl get deployment api-server -n production -o jsonpath='{.status.conditions}'
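
In a pipeline, the status and undo commands above are usually chained: wait for the rollout, and undo it if the wait fails. The Python sketch below is one way to do that, assuming kubectl is on the PATH and relying on kubectl rollout status exiting non-zero when the rollout fails or times out. The function name and the injectable run parameter (there so the logic can be exercised without a cluster) are illustrative choices.

```python
import subprocess


def rollout_or_undo(deployment, namespace, timeout="300s", run=subprocess.run):
    """Wait for a rollout; roll back if it fails. Returns True on success."""
    target = f"deployment/{deployment}"
    status = run(
        ["kubectl", "rollout", "status", target, "-n", namespace,
         f"--timeout={timeout}"]
    )
    if status.returncode == 0:
        return True
    # The rollout failed or timed out: return to the previous ReplicaSet.
    run(["kubectl", "rollout", "undo", target, "-n", namespace])
    return False
```

Calling rollout_or_undo("api-server", "production") would then be the last step of a deploy job, with a False result failing the job.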

ECS Rolling Deployments

Amazon ECS uses different terminology for the same concepts. minimumHealthyPercent is the complement of maxUnavailable (100% minus the unavailable share), and maximumPercent sets the surge ceiling. For an ECS service with 10 tasks, setting minimumHealthyPercent: 90, maximumPercent: 110 is equivalent to Kubernetes's maxUnavailable: 1, maxSurge: 1. Load balancer connection draining is configured on the target group (deregistration_delay.timeout_seconds); set it to match or exceed your longest expected request duration.
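
The arithmetic behind that equivalence can be sketched in Python. This assumes the rounding described in the ECS documentation (the minimum healthy count rounds up, the maximum count rounds down); the function name is hypothetical.

```python
def ecs_to_k8s(desired_count, minimum_healthy_percent, maximum_percent):
    """Translate ECS deployment percentages into (maxUnavailable, maxSurge)
    expressed as absolute task counts."""
    # Minimum healthy tasks, rounded up.
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    # Maximum running tasks, rounded down.
    max_running = desired_count * maximum_percent // 100
    return desired_count - min_healthy, max_running - desired_count


print(ecs_to_k8s(10, 90, 110))   # (1, 1)
print(ecs_to_k8s(10, 100, 200))  # (0, 10)
```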

# AWS CLI: update ECS service with rolling config
aws ecs update-service \
  --cluster production \
  --service api-server \
  --task-definition api-server:42 \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "minimumHealthyPercent": 90,
    "maximumPercent": 110
  }' \
  --health-check-grace-period-seconds 30 \
  --region us-east-1

# deploymentCircuitBreaker with rollback: true is the closest ECS counterpart to
# progressDeadlineSeconds, with one difference: if too many tasks fail to start
# healthy, ECS itself rolls the service back to the last completed deployment.

# Watch the deployment stabilize
aws ecs wait services-stable \
  --cluster production \
  --services api-server \
  --region us-east-1

Production Failure Modes and How to Catch Them

Rolling deployments fail in predictable patterns. Knowing them in advance lets you instrument for them before the deploy, not after the incident:

Version skew: During the roll, v1 and v2 run simultaneously. If v2 writes a database column that v1 does not know about, or changes an API response shape that a v1 caller expects, you have a skew bug. Lesson 8 (Expand-Contract) addresses this specifically — never deploy schema changes and application changes in the same roll.
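Slow startup: If your app takes 90 seconds to warm up but your liveness probe's initialDelaySeconds is 10, the liveness probe fires before the app is ready and fails, the kubelet restarts the container, and the pod ends up in a CrashLoopBackOff cycle. (A failing readiness probe alone never restarts a pod; it only keeps the pod out of rotation.) The rollout stalls. Set initialDelaySeconds generously, or better, use a startupProbe for apps with variable boot times.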
Resource starvation during surge: With maxSurge: 2 on a 10-replica deployment you briefly need nodes for 12 pods. If your cluster is already at 95% capacity, the surge pods will be Pending and the rollout will deadlock. Cluster autoscaler helps, but it has its own latency (1–3 minutes to provision a new node). Size your cluster with at least 20–30% headroom for rolling deployments.
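PodDisruptionBudget (PDB) conflict: A PDB does not limit the Deployment's own rolling update; it gates voluntary evictions such as node drains. The conflict appears when the two overlap. A PDB with minAvailable: 100% (or maxUnavailable: 0) refuses every eviction, so a node drain for a cluster upgrade or an autoscaler scale-down blocks indefinitely, and pods made unavailable by an in-progress rollout count against the budget too. Coordinate your PDB with the Deployment's rolling settings so that at least one pod can always be evicted.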
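
For the slow-startup case, the numbers can be worked out ahead of time. A startupProbe gives the container roughly initialDelaySeconds + failureThreshold × periodSeconds to come up before it is restarted, so the threshold can be derived from the warm-up time. The Python sketch below does that; the 90-second warm-up comes from the example above, while the 5-second period and 30-second safety margin are assumptions, and the function name is hypothetical.

```python
def startup_failure_threshold(warmup_seconds, period_seconds,
                              initial_delay_seconds=0, margin_seconds=0):
    """Smallest failureThreshold that lets the app finish warming up."""
    needed = warmup_seconds + margin_seconds - initial_delay_seconds
    if needed <= 0:
        return 1
    # Ceiling division: a partial period still needs a whole extra attempt.
    return -(-needed // period_seconds)


# 90 s warm-up, probing every 5 s, with an assumed 30 s safety margin.
print(startup_failure_threshold(90, 5, margin_seconds=30))  # 24
```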

Production pitfall (version skew is the silent killer): Unlike a bad readiness probe (which fails loudly), version skew fails silently. Requests routed to v1 pods succeed; some or all requests routed to v2 pods fail. Your overall error rate climbs in proportion to the share of traffic hitting v2 pods. If every v2 request fails, a roll that is 25% complete shows roughly 25% of requests failing; if only a few v2 requests touch the incompatible path, the aggregate can stay under a 1% alerting threshold, and either way alert evaluation lag means the roll may be 50% done by the time anything fires. Always verify backward/forward compatibility of API and schema changes before starting a rolling deploy.
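
A back-of-the-envelope model shows why skew can hide from aggregate alerting. In the Python sketch below, the aggregate error rate is the share of traffic on v2 multiplied by the fraction of v2 requests that hit the incompatible path. It assumes traffic is spread evenly across pods, and the 3% figure is made up for illustration.

```python
def aggregate_error_rate(rollout_fraction, v2_failure_fraction):
    """Overall error rate when only v2 pods fail and traffic is spread evenly."""
    return rollout_fraction * v2_failure_fraction


# Every v2 request fails, roll 25% complete: a quarter of traffic errors.
print(f"{aggregate_error_rate(0.25, 1.0):.2%}")   # 25.00%
# Only 3% of v2 requests touch the broken path: under a 1% alert threshold.
print(f"{aggregate_error_rate(0.25, 0.03):.2%}")  # 0.75%
```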

Rolling deployments are the workhorse strategy for routine releases. They are not appropriate for breaking changes — use feature flags (Lesson 5) or blue-green (Lesson 3) when the old and new versions cannot safely coexist. For everything else, a well-tuned rolling deployment with proper draining, tight readiness probes, and a circuit breaker is the right default.