Blue-Green Deployments
Blue-green deployment is a foundation of zero-downtime releases at scale. The model is deceptively simple: you keep two identical production environments, Blue (currently live) and Green (staging the new release). Traffic flows entirely to Blue. You deploy v2 onto Green, run every validation you need, and when you are satisfied, you flip a single router (a load balancer rule, a DNS weight, or a service selector) so that 100% of traffic moves to Green. Blue is not destroyed; it stays warm as a one-command rollback target.
Netflix, Amazon, and Google all run variants of this pattern for their most critical services because it decouples deployment (putting code on servers) from release (exposing it to users). That separation is the core insight. Deployment becomes a low-stakes, testable operation. Release becomes a near-instantaneous, reversible switch.
Two Identical Environments
Identical means more than the same application container. It means the same machine type or pod resource request, the same auto-scaling policy, the same environment variables (modulo the version tag), the same network ACLs, and the same observability agents. If Green is under-provisioned compared to Blue, the cutover will generate a latency spike that looks exactly like a bug in v2 — and you will waste time investigating code that is not the problem.
In practice "identical" is enforced by sharing a single Terraform module or Helm chart and parameterising only the version, image tag, and environment label. The environments are distinguished at the routing layer, not the infrastructure layer.
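As a rough illustration of "one chart, two parameters", the sketch below installs both colours from the same Helm chart. The chart path `./chart`, the release prefix `app`, and the value keys `color` and `image.tag` are hypothetical; a real chart would define its own. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical chart location; override with CHART=/path/to/chart.
CHART="${CHART:-./chart}"

# Install or upgrade one colour. Only the colour label and image tag vary;
# every other setting comes from the shared chart, which is what keeps the
# two environments identical.
deploy() {
  local color="$1" tag="$2"
  local cmd=(helm upgrade --install "app-${color}" "$CHART"
    --set "color=${color}" --set "image.tag=${tag}")
  # Print the command by default; set APPLY=1 to run it against a cluster.
  if [ "${APPLY:-0}" = "1" ]; then "${cmd[@]}"; else echo "${cmd[*]}"; fi
}

deploy blue  v1
deploy green v2
```

Because both releases render from the same templates, any drift between Blue and Green has to show up as a difference in these two parameters.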
The Cutover Mechanics
How you flip traffic determines your actual downtime budget. Three common mechanisms, from fastest to slowest:
- Load balancer listener rule: AWS ALB target group swap or GCP Backend Service update. Sub-second. Best for same-region, same-VPC deployments.
- Kubernetes Service selector patch: change spec.selector.version from blue to green. kube-proxy propagates the endpoint update cluster-wide in seconds.
- DNS weighted routing: shift Route 53 or NS1 weights from 100/0 to 0/100. The record TTL determines how long the shift takes to reach clients; keep it at 30–60 s during release windows. Slowest, but works across regions and accounts.

To let in-flight requests on Blue finish, set deregistration_delay (AWS) or terminationGracePeriodSeconds (K8s) to cover your p99 request duration, typically 30–60 seconds. Connections that arrive after the flip land on Green, so there is no gap.
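For the DNS option, the 100/0 to 0/100 shift can be sketched as a single Route 53 change batch. This assumes two weighted CNAME records with set identifiers `blue` and `green`; the record name, hostnames, and zone ID are placeholders. The script prints the batch unless `APPLY=1` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values for illustration only.
RECORD_NAME="${RECORD_NAME:-api.example.com}"
BLUE_HOST="${BLUE_HOST:-blue.example.com}"
GREEN_HOST="${GREEN_HOST:-green.example.com}"
TTL="${TTL:-30}"   # short TTL so resolvers pick up the new weights quickly

# One weighted CNAME record expressed as an UPSERT change.
record_change() {
  local set_id="$1" host="$2" weight="$3"
  printf '{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","SetIdentifier":"%s","Weight":%s,"TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}' \
    "$RECORD_NAME" "$set_id" "$weight" "$TTL" "$host"
}

# Both weights change in one batch, so Route 53 applies them together.
cutover_batch() {
  printf '{"Changes":[%s,%s]}\n' \
    "$(record_change blue "$BLUE_HOST" 0)" \
    "$(record_change green "$GREEN_HOST" 100)"
}

if [ "${APPLY:-0}" = "1" ]; then
  aws route53 change-resource-record-sets \
    --hosted-zone-id "${ZONE_ID:?set ZONE_ID}" \
    --change-batch "$(cutover_batch)"
else
  cutover_batch
fi
```

Clients that cached the old answer keep hitting Blue until their TTL expires, which is why Blue must stay up after the weights change.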
AWS ALB — Target Group Swap
The canonical AWS pattern uses two target groups attached to a single ALB listener. The Terraform below provisions both groups and a listener that starts pointing at Blue:
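A minimal sketch of such a configuration follows. The resource names `blue` and `green` match the references in this section; the input variables, ports, group names, and health check path are assumptions chosen for illustration.

```hcl
# Assumed inputs: an existing VPC and an existing ALB.
variable "vpc_id" { type = string }
variable "alb_arn" { type = string }

resource "aws_lb_target_group" "blue" {
  name                 = "app-blue" # hypothetical name
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 60 # seconds allowed for in-flight requests to drain

  health_check {
    path = "/healthz" # hypothetical health endpoint
  }
}

resource "aws_lb_target_group" "green" {
  name                 = "app-green" # hypothetical name
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 60

  health_check {
    path = "/healthz"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = var.alb_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn # starts on Blue
  }
}
```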
To cut over, update the target_group_arn to aws_lb_target_group.green.arn and apply. No instance restart, no DNS propagation delay. The AWS CLI equivalent for a scripted pipeline:
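A minimal sketch of that CLI call, using `aws elbv2 modify-listener`. The ARN defaults are placeholders and must be replaced with values from your own account; the script prints the command unless `APPLY=1` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder ARNs; in a pipeline these would typically come from
# Terraform outputs or pipeline variables.
LISTENER_ARN="${LISTENER_ARN:-example-listener-arn}"
GREEN_TG_ARN="${GREEN_TG_ARN:-example-green-target-group-arn}"

# Point the listener's default action at the given target group.
switch_listener() {
  local target_group_arn="$1"
  local cmd=(aws elbv2 modify-listener
    --listener-arn "$LISTENER_ARN"
    --default-actions "Type=forward,TargetGroupArn=${target_group_arn}")
  # Print the command by default; set APPLY=1 to call the AWS API.
  if [ "${APPLY:-0}" = "1" ]; then "${cmd[@]}"; else echo "${cmd[*]}"; fi
}

switch_listener "$GREEN_TG_ARN"
```

Calling `switch_listener` with the Blue target group ARN reverses the cutover.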
Kubernetes — Service Selector Patch
In Kubernetes the pattern maps cleanly: Blue and Green are separate Deployments, and a single Service selects them by a version label. The switch is a one-line kubectl patch:
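A sketch of that patch, assuming a Service named `myapp` (hypothetical) whose selector includes a `version` label. The script prints the command unless `APPLY=1` is set; the printed form is for inspection, since the JSON argument is shown without shell quoting.

```shell
#!/usr/bin/env bash
set -euo pipefail

SERVICE="${SERVICE:-myapp}"   # hypothetical Service name

# JSON merge patch that sets only the version label in the selector;
# other selector keys are left untouched.
selector_patch() {
  printf '{"spec":{"selector":{"version":"%s"}}}' "$1"
}

# Point the Service at one colour.
switch_to() {
  local cmd=(kubectl patch service "$SERVICE" --type merge
    -p "$(selector_patch "$1")")
  if [ "${APPLY:-0}" = "1" ]; then "${cmd[@]}"; else echo "${cmd[*]}"; fi
}

switch_to green
```

The same function with `blue` as its argument sends traffic back to the old Deployment.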
The Rollback: One Command, Sub-Second
This is the most underrated property of blue-green. A rolling deployment rolls back by redeploying the old image, which takes minutes and may itself fail. A blue-green rollback is a router flip: the old environment is already running, already warm, already healthy. Mean time to recovery (MTTR) can shrink from 5–10 minutes to under 10 seconds, and with a load balancer rule the flip itself is sub-second.
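One way to make use of this property is to guard the cutover with a health probe and flip back automatically on failure. The sketch below assumes the Kubernetes selector mechanism; the Service name, health URL, and probe count are hypothetical defaults, and the kubectl commands are only printed unless `APPLY=1` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical defaults for illustration.
SERVICE="${SERVICE:-myapp}"
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/healthz}"
PROBES="${PROBES:-3}"

# Point the Service at one colour. Prints the command unless APPLY=1.
route_to() {
  local patch="{\"spec\":{\"selector\":{\"version\":\"$1\"}}}"
  local cmd=(kubectl patch service "$SERVICE" --type merge -p "$patch")
  if [ "${APPLY:-0}" = "1" ]; then "${cmd[@]}"; else echo "${cmd[*]}"; fi
}

# A single probe: succeeds only if the endpoint answers with a status
# below 400 within 2 seconds.
healthy() {
  curl -fsS --max-time 2 -o /dev/null "$HEALTH_URL" 2>/dev/null
}

# Cut over to Green, then flip back to Blue on the first failed probe.
cutover_with_guard() {
  route_to green
  local i
  for ((i = 1; i <= PROBES; i++)); do
    if ! healthy; then
      echo "probe $i failed, rolling back" >&2
      route_to blue
      return 1
    fi
    sleep 1
  done
  echo "green passed $PROBES probes" >&2
}

cutover_with_guard || true
```

The rollback path is the same one-line patch as the cutover, which is why it completes in seconds rather than waiting for a redeploy.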
The Shared Database Problem
Blue and Green share the same database. This is simultaneously the biggest strength (no data sync) and the biggest constraint (schema changes must be backward-compatible). v2 must be able to read data written by v1 and vice versa, for two reasons: during the cutover both versions may briefly serve traffic while in-flight requests drain from Blue, and after a rollback v1 must cope with rows that v2 has already written.
The correct pattern is Expand-Contract (covered in Lesson 8): first deploy a v1.5 that adds the new column but does not require it, then cut over to v2 that uses it, then drop the old column in a later release. Never change a schema and the application in the same deployment when using blue-green.
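The phases can be sketched as separate migrations that ship in separate releases. The schema here is hypothetical: a users table whose full_name column is being replaced by display_name. The backfill step is an assumption added for completeness; the text above names only the expand and contract steps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the SQL for one migration phase on stdout.
# Table and column names are illustrative only.
migration_sql() {
  case "$1" in
    expand)    # ships with v1.5: additive and nullable, so v1 can ignore it
      echo "ALTER TABLE users ADD COLUMN display_name TEXT NULL;" ;;
    backfill)  # assumed step: copy existing data while both columns exist
      echo "UPDATE users SET display_name = full_name WHERE display_name IS NULL;" ;;
    contract)  # later release, once no running version reads full_name
      echo "ALTER TABLE users DROP COLUMN full_name;" ;;
    *)
      echo "unknown phase: $1" >&2
      return 1 ;;
  esac
}

# Print each phase in the order it would be released.
for phase in expand backfill contract; do
  printf -- '-- %s\n' "$phase"
  migration_sql "$phase"
done
```

Each phase could be applied on its own with something like `migration_sql expand | psql "$DATABASE_URL"`, where `DATABASE_URL` is a placeholder for your connection string. Keeping the contract step in a later release means a rollback to Blue never meets a schema it cannot read.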
Cost and When to Use Blue-Green
Blue-green doubles your compute footprint during the deployment window. For serverless (Lambda, Cloud Run) and container-based platforms the cost is negligible because Green consumes resources only during the staging and drain window. For large EC2-backed fleets the cost is real: use blue-green for your revenue-critical, low-tolerance services and rolling deployments for stateless workers and batch jobs where a short restart is acceptable.
Blue-green is the right default for: customer-facing APIs, payment processing, authentication services, and any service with an SLA of 99.9 % or higher. Its instant rollback and zero-downtime cutover make it worth the extra infrastructure complexity at any company operating at production scale.