Rollbacks & Roll-Forward
No deployment strategy eliminates failures — they all reduce the blast radius. When a bad deploy slips through your canary analysis, your feature flag gates, or your smoke tests, two recovery paths exist: roll back (restore the previous known-good state) or roll forward (ship a targeted fix as the next release). Choosing the wrong path under pressure is one of the most costly operational mistakes you can make. This lesson codifies the mechanics, the decision framework, and the safety nets that let you recover in minutes rather than hours.
Why Rollback Is Not Free
Many teams treat rollback as a guaranteed escape hatch. It is not. A rollback is another deployment — it carries its own risk, its own migration window, and its own potential for failure. The assumptions that make rollback fast and safe are:
- The previous artifact still exists and is immutable. Container registries, S3 deployment buckets, and Helm chart repositories must retain old versions. If you allow tag mutation (e.g., re-pushing to :latest), you have no artifact to roll back to.
- The database schema is backward-compatible with the previous code. If the current release ran an additive migration (new column, new table), rolling back the code is safe: the old code simply ignores the new column. If the release ran a destructive migration (drop column, rename column, change type), rollback is catastrophically unsafe unless you have already applied the Expand-Contract pattern.
- External state has not advanced beyond the point of no return. If the new code sent emails, billed customers, or published events to Kafka, rolling back the service does not undo those side-effects. Plan compensating transactions or accept the drift.
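The schema assumption above can be checked mechanically before anyone reaches for a rollback. The following is a minimal sketch, assuming migrations are available as raw SQL strings; the pattern list and the function names `is_destructive` and `rollback_is_schema_safe` are hypothetical and deliberately conservative, not a complete SQL parser.

```python
import re

# Hypothetical heuristic: statement shapes that remove or reshape existing
# schema, so code from the previous release may no longer work against it.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+(TABLE|COLUMN|INDEX)\b",
    r"\bRENAME\s+(COLUMN|TO)\b",
    r"\bALTER\s+COLUMN\b.*\bTYPE\b",
    r"\bTRUNCATE\b",
]


def is_destructive(sql):
    """Return True if the migration matches any destructive pattern."""
    return any(
        re.search(pattern, sql, re.IGNORECASE | re.DOTALL)
        for pattern in DESTRUCTIVE_PATTERNS
    )


def rollback_is_schema_safe(migrations):
    """Treat a code rollback as schema-safe only if every migration is additive."""
    return not any(is_destructive(sql) for sql in migrations)
```

A check like this could run in CI against the migrations shipped with a release, so the answer to "can we roll back the code?" is known before the incident rather than during it.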
ALTER TABLE migration and then tries to roll back. If the column has been dropped in production, no amount of Kubernetes rollout undo commands will help. Database changes must be the last line of a rollback plan, not the first instinct. Always apply Expand-Contract migrations before deploying new code.
Fast Rollback Mechanics by Platform
Each deployment target has its own rollback primitive. Knowing all of them — and their latencies — is essential.
Kubernetes: kubectl rollout undo
Kubernetes stores the last ten ReplicaSet revisions by default (controlled by revisionHistoryLimit). A rollback reactivates a prior ReplicaSet rather than re-pulling an image — making it fast, typically completing in under 60 seconds for a small Deployment:
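As an illustration, the rollback and the follow-up health wait can be wrapped in a small helper. This is a sketch, assuming `kubectl` is on the PATH and already pointed at the right cluster; the function names, the example Deployment name, and the 120s timeout are hypothetical. The `run` parameter is injectable so the helper can be exercised without a cluster.

```python
import subprocess


def rollout_undo_argv(deployment, namespace="default", to_revision=None):
    """Build `kubectl rollout undo`; without to_revision it targets the previous revision."""
    argv = ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]
    if to_revision is not None:
        argv.append(f"--to-revision={to_revision}")
    return argv


def rollout_status_argv(deployment, namespace="default", timeout="120s"):
    """Build `kubectl rollout status`, which blocks until the rollout settles or times out."""
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "-n", namespace, f"--timeout={timeout}",
    ]


def rollback(deployment, namespace="default", to_revision=None, run=subprocess.run):
    """Undo the rollout, then wait for it to finish; check=True raises on a non-zero exit."""
    run(rollout_undo_argv(deployment, namespace, to_revision), check=True)
    run(rollout_status_argv(deployment, namespace), check=True)
```

Calling `rollback("checkout", namespace="prod")` would revert to the previous revision; `kubectl rollout history deployment/checkout -n prod` lists the revision numbers that can be passed as `to_revision`.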
Set revisionHistoryLimit: 5 in your Deployment spec explicitly; the default of 10 wastes etcd storage at scale. Five revisions give you enough rollback headroom while staying lean. In a GitOps world (ArgoCD, Flux), you roll back by reverting the Git commit; kubectl rollout undo should be reserved for break-glass emergencies only, because it creates drift between Git and the cluster.
Helm: Rolling Back a Chart Release
Helm maintains a release history in Kubernetes Secrets. Rolling back a Helm release restores the previous values.yaml snapshot and re-applies the old templates:
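One way to choose the rollback target is to read the release history and pick the most recent revision that was healthy. The sketch below assumes the JSON emitted by `helm history <release> -o json` is a list of objects with `revision` and `status` fields, and that previously good revisions carry the status `superseded`; verify both against your Helm version. The function names are hypothetical.

```python
import json


def previous_good_revision(history_json):
    """Return the newest revision before the current one that was not a failure."""
    history = sorted(json.loads(history_json), key=lambda entry: entry["revision"])
    # history[-1] is the current (bad) release, so search only the earlier entries.
    for entry in reversed(history[:-1]):
        if entry["status"] in ("superseded", "deployed"):
            return entry["revision"]
    return None


def helm_rollback_argv(release, revision, namespace="default"):
    """Build `helm rollback`; --wait blocks until the restored resources are ready."""
    return ["helm", "rollback", release, str(revision), "-n", namespace, "--wait"]
```

Skipping revisions marked `failed` avoids rolling back onto an earlier release that was itself broken.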
AWS ECS / App Runner: Task Definition Rollback
The Decision Framework: Roll Back or Roll Forward?
The decision framework distills to three questions, asked in sequence:
- Has the database moved forward destructively? If yes, rollback is off the table — the old code expects a schema that no longer exists. Roll forward with a targeted fix.
- Do you know what is broken, and can you fix it fast? If yes (a missing nil-check, a wrong environment variable, a bad config value), it is often faster and safer to ship the fix than to coordinate a rollback — especially when traffic is already flowing through the new code path and your canary infrastructure is warmed up.
- Is the previous artifact intact and DB-safe? If both answers are yes, roll back immediately. Every minute of degraded service costs real money and real user trust.
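The three questions can be encoded as a small function so the order of evaluation is explicit in a runbook or incident bot. This is a sketch: the `Incident` fields and the `choose_recovery` name are hypothetical, and the final fallback (roll forward when rollback is not viable) is an assumption the questions imply but do not state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLL_BACK = "roll back"
    ROLL_FORWARD = "roll forward"


@dataclass
class Incident:
    destructive_migration_applied: bool
    cause_known_and_fix_is_fast: bool
    previous_artifact_intact: bool
    schema_compatible_with_previous_code: bool


def choose_recovery(incident):
    # Question 1: a destructive schema change takes rollback off the table.
    if incident.destructive_migration_applied:
        return Action.ROLL_FORWARD
    # Question 2: a known, quickly fixable cause favours shipping the fix.
    if incident.cause_known_and_fix_is_fast:
        return Action.ROLL_FORWARD
    # Question 3: rollback needs both an intact artifact and a compatible schema.
    if incident.previous_artifact_intact and incident.schema_compatible_with_previous_code:
        return Action.ROLL_BACK
    # Assumption: if rollback is not viable, a forward fix is the only path left.
    return Action.ROLL_FORWARD
```

For example, `choose_recovery(Incident(False, False, True, True))` returns `Action.ROLL_BACK`: nothing destructive has happened, the cause is unknown, and the previous artifact is safe to restore.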
GitOps Rollback — The Production-Safe Pattern
In a GitOps environment (ArgoCD, Flux), the Git repository is the source of truth. The correct rollback is a Git revert, not a kubectl command, because imperative kubectl changes create drift between the cluster state and the declared state in Git:
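A revert-based rollback is just a short, ordered sequence of Git commands. The sketch below builds that sequence as data so it can be reviewed or logged before it runs; the function name, the `main` branch, and the `origin` remote are assumptions about the repository layout.

```python
def gitops_revert_plan(bad_commit, branch="main", remote="origin"):
    """Return the Git commands that undo bad_commit with a new, auditable commit."""
    return [
        ["git", "checkout", branch],
        ["git", "pull", "--ff-only", remote, branch],
        # --no-edit keeps the default "Revert ..." message so no editor opens.
        # Reverting a merge commit additionally needs `-m 1` to pick the mainline parent.
        ["git", "revert", "--no-edit", bad_commit],
        ["git", "push", remote, branch],
    ]
```

Each entry can be passed to `subprocess.run(argv, check=True)`. Once the push lands, the GitOps controller reconciles the cluster to the reverted state on its next sync.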
The argocd app rollback command redeploys a previously synced revision from the application's deployment history; it is an escape hatch for when the reverted commit has not yet propagated. In steady state, a git revert + push is the canonical path because it produces an auditable change in Git history and does not leave the cluster in a state that ArgoCD considers "OutOfSync."
Deployment Safety Nets — Preventing the Need for Rollback
The best rollback is the one you never need. Three safety nets drastically reduce rollback frequency:
1. Automated Smoke Tests in the Deploy Pipeline
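A smoke test stage can be as small as a handful of HTTP checks that fail the pipeline on any non-200 response. The following is a minimal sketch using only the standard library; the function names and the example paths are hypothetical, and a real suite would usually also assert on response bodies.

```python
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code, or None if the request fails for any reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except (OSError, ValueError):
        # urllib raises HTTPError/URLError (both OSError subclasses) for 4xx/5xx,
        # connection failures, and timeouts; ValueError covers malformed URLs.
        return None


def run_smoke_tests(base_url, paths, fetch=http_status):
    """Return the list of paths that did not answer 200; an empty list means pass."""
    return [path for path in paths if fetch(base_url + path) != 200]
```

A deploy job might call `run_smoke_tests("https://staging.example.com", ["/healthz", "/api/version"])` right after the rollout and exit non-zero if the returned list is not empty.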
2. Progressive Traffic Shifting with Automatic Abort
Argo Rollouts (or Flagger) can be configured to abort and roll back automatically when error rate or latency thresholds are breached during a canary promotion. The rollout spec encodes the safety net directly:
When the number of failed measurements in the AnalysisTemplate exceeds failureLimit (with failureLimit: 2, that is the third failed measurement, and the failures need not be consecutive), Argo Rollouts aborts the rollout: it sets the canary weight to 0, shifts 100% of traffic back to the stable ReplicaSet, and marks the rollout as Degraded. No human action is required; the system rolls itself back.
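The abort rule can be modelled in a few lines to make the threshold behaviour concrete. This is a simplified model, not Argo Rollouts code: it assumes an analysis run fails once the count of failed measurements exceeds the failure limit, which is how the Argo Rollouts documentation describes failureLimit; confirm the exact semantics for the version you run. The function name and return strings are hypothetical.

```python
def analysis_phase(measurements, failure_limit=2):
    """Model an analysis run; measurements is an iterable of booleans (True = passed)."""
    failed = 0
    for passed in measurements:
        if not passed:
            failed += 1
            # Assumed rule: the run fails when failures exceed the limit,
            # regardless of whether the failures were consecutive.
            if failed > failure_limit:
                return "Failed"
    return "Successful"
```

Under this model, a failed run is the signal that triggers the automatic abort, so the limit directly controls how much bad canary traffic is tolerated before the rollback.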
3. Immutable Artifacts and Tag Discipline
prod-pinned lifecycle policy that prevents deletion. This guarantees that even if your CI system purges old images during cleanup, the five most recent production versions are always available for an emergency rollback without a rebuild. ECR lifecycle policies and Harbor retention rules both support this pattern.
Roll-Forward in Practice — The Hotfix Pipeline
Rolling forward requires a hotfix pipeline that is faster than your standard release. At big-tech companies, hotfix pipelines are a first-class concern with dedicated tooling:
- Branch from the deployed SHA, not from HEAD. HEAD may already include unreleased changes. Check out the exact commit that is running in production, apply the single targeted fix, and promote that.
- Run a minimal test suite. Smoke tests + the failing test case for the bug. Not the full 45-minute test suite; that is for standard releases. The hotfix CI job should complete in under 10 minutes.
- Bypass staging for true P0 incidents — with explicit approval. If the production error rate is 40% and the fix is a one-line nil-check, waiting for a full staging promotion cycle costs real users. Structure your pipeline to allow a gated bypass with a required approval from the on-call engineering manager.
- Deploy via the standard progressive strategy. Even a hotfix should use canary or blue-green — just with compressed step durations (1 minute at 5%, 2 minutes at 25%, then 100%).
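The hotfix rules above can be captured as a few small pieces of pipeline logic. This is a sketch under stated assumptions: the `hotfix/` branch prefix, the `"P0"` severity label, and all names are hypothetical, and the step values mirror the compressed schedule from the last bullet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStep:
    weight_percent: int
    hold_minutes: int


# Compressed schedule: 1 minute at 5%, 2 minutes at 25%, then full traffic.
HOTFIX_STEPS = [CanaryStep(5, 1), CanaryStep(25, 2), CanaryStep(100, 0)]


def hotfix_branch_argv(deployed_sha, name):
    """Branch from the commit running in production, not from HEAD."""
    return ["git", "checkout", "-b", f"hotfix/{name}", deployed_sha]


def may_bypass_staging(severity, approved_by):
    """Allow the staging bypass only for P0 incidents with a named approver."""
    return severity == "P0" and bool(approved_by)
```

Keeping the bypass decision in one function makes it easy to audit: the pipeline can log the approver alongside the deployed SHA every time the gate opens.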