Deployment Strategies & Progressive Delivery

Blue-Green Deployments

18 min Lesson 3 of 28

Blue-green deployment is the foundation of zero-downtime releasing at scale. The model is deceptively simple: you keep two identical production environments — call them Blue (currently live) and Green (staging the new release). Traffic flows entirely to Blue. You deploy v2 onto Green, run every validation you need, and when you are satisfied, you flip a single router — a load balancer rule, a DNS weight, a service selector — so that 100 % of traffic instantly moves to Green. Blue is not destroyed; it sits warm as a one-command rollback target.

Netflix, Amazon, and Google all run variants of this pattern for their most critical services because it decouples deployment (putting code on servers) from release (exposing it to users). That separation is the core insight. Deployment becomes a low-stakes, testable operation. Release becomes a near-instantaneous, reversible switch.

Two Identical Environments

Identical means more than the same application container. It means the same machine type or pod resource request, the same auto-scaling policy, the same environment variables (modulo the version tag), the same network ACLs, and the same observability agents. If Green is under-provisioned compared to Blue, the cutover will generate a latency spike that looks exactly like a bug in v2 — and you will waste time investigating code that is not the problem.

In practice "identical" is enforced by sharing a single Terraform module or Helm chart and parameterising only the version, image tag, and environment label. The environments are distinguished at the routing layer, not the infrastructure layer.
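One way to check that the two environments differ only in the permitted parameters is to diff their rendered settings. The sketch below assumes each environment's settings have been exported to a flat KEY=VALUE file. The file layout, the `env_drift` function name, and the `ALLOWED_KEYS` list are illustrative assumptions, not part of Terraform or Helm.

```shell
#!/usr/bin/env bash
# Keys that are allowed to differ between Blue and Green (assumed names).
ALLOWED_KEYS='^(VERSION|IMAGE_TAG|ENV_LABEL)='

# env_drift BLUE_FILE GREEN_FILE
# Prints a diff of every setting outside ALLOWED_KEYS that differs and
# returns 1; prints nothing and returns 0 when the environments match.
env_drift() {
  local tmp rc=0
  tmp="$(mktemp -d)"
  # Drop the allowed keys and sort so that line order does not matter.
  grep -Ev "$ALLOWED_KEYS" "$1" | sort > "$tmp/blue"
  grep -Ev "$ALLOWED_KEYS" "$2" | sort > "$tmp/green"
  diff "$tmp/blue" "$tmp/green" || rc=1
  rm -rf "$tmp"
  return "$rc"
}

# Usage (hypothetical file names):
#   env_drift blue.env green.env || echo "Green is not identical to Blue"
```

Running a check like this before the cutover makes an under-provisioned Green visible as configuration drift rather than as a mysterious latency spike in v2.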

[Figure: Blue-green deployment, traffic cutover. Users send 100 % of traffic to the load balancer (route rule / listener). Blue (v1, live): app pods / EC2 ×3, cache / config; warm standby and instant rollback target after cutover. Green (v2, staging): app pods / EC2 ×3, cache / config; deploy, smoke test, flip router, active after flip. Both environments share the same database.]
Blue-Green deployment: Green receives v2, passes smoke tests, then the load balancer flips all traffic to Green. Blue becomes the rollback target.

The Cutover Mechanics

How you flip traffic determines your actual downtime budget. Three common mechanisms, in order of speed:

  1. Load balancer listener rule — AWS ALB target group swap, GCP Backend Service update. Sub-second. Best for same-region, same-VPC deployments.
  2. Kubernetes Service selector patch — change spec.selector.version from blue to green. kube-proxy propagates the endpoint update cluster-wide in seconds.
  3. DNS weighted routing — shift Route 53 or NS1 weights from 100/0 to 0/100. TTL determines latency; keep it at 30–60 s during release windows. Slowest, but works across regions and accounts.

In-flight requests during a flip: requests already being served by Blue are drained rather than dropped. Set deregistration_delay (an AWS target group attribute) or terminationGracePeriodSeconds (in the Kubernetes pod spec) to cover your p99 request duration, typically 30–60 seconds, so that Blue targets are not removed while they still hold open requests. Connections that arrive after the flip land on Green. There is no gap.
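The draining rule can be sketched as a bounded wait: poll Blue's in-flight request count and only proceed once it reaches zero or a timeout expires. This is a minimal sketch; `wait_for_drain` is a hypothetical helper, and the command that reports the in-flight count is something you would supply yourself (for example from your metrics endpoint).

```shell
#!/usr/bin/env bash
# wait_for_drain TIMEOUT_SECONDS INTERVAL_SECONDS COUNT_COMMAND [ARGS...]
# Runs COUNT_COMMAND repeatedly; it must print the number of in-flight
# requests. Returns 0 once it prints 0, or 1 when the timeout is exceeded.
wait_for_drain() {
  local timeout="$1" interval="$2" waited=0 inflight=""
  shift 2
  while [ "$waited" -le "$timeout" ]; do
    inflight="$("$@")"
    if [ "$inflight" -eq 0 ]; then
      echo "drained after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still ${inflight} in-flight after ${timeout}s" >&2
  return 1
}

# Usage (blue_inflight is a hypothetical command you provide):
#   wait_for_drain 60 5 blue_inflight && echo "safe to scale Blue down"
```

Choosing the timeout to match the drain setting (60 seconds here) keeps the script and the load balancer in agreement about when Blue is truly idle.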

AWS ALB — Target Group Swap

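The canonical AWS pattern uses two target groups behind a single ALB listener. The Terraform below provisions both groups and a listener rule that starts out pointing at Blue; the HTTPS listener (aws_lb_listener.https) and var.vpc_id are assumed to be defined elsewhere in the module: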

```hcl
# main.tf: blue-green target groups + ALB listener rule

resource "aws_lb_target_group" "blue" {
  name        = "myapp-blue"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/healthz"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

resource "aws_lb_target_group" "green" {
  name        = "myapp-green"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/healthz"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

resource "aws_lb_listener_rule" "main" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn # change to green on release
  }

  condition {
    host_header {
      values = ["api.example.com"]
    }
  }
}
```

To cut over, update the target_group_arn to aws_lb_target_group.green.arn and apply. No instance restart, no DNS propagation delay. The AWS CLI equivalent for a scripted pipeline:

```shell
#!/usr/bin/env bash
# cutover.sh: point the ALB listener rule's forward action at the green target group
set -euo pipefail

# Placeholder ARNs; substitute the values for your own rule and target group.
RULE_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/app/myapp/abc123"
GREEN_TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp-green/def456"

echo "[1/3] Waiting for Green health checks to pass..."
aws elbv2 wait target-in-service \
  --target-group-arn "$GREEN_TG_ARN"

echo "[2/3] Flipping listener rule to Green..."
aws elbv2 modify-rule \
  --rule-arn "$RULE_ARN" \
  --actions Type=forward,TargetGroupArn="$GREEN_TG_ARN"

echo "[3/3] Cutover complete. Blue remains warm. Rollback: re-run with Blue TG ARN."
```

Kubernetes — Service Selector Patch

In Kubernetes the pattern maps cleanly: Blue and Green are separate Deployments, and a single Service selects them by a version label. The switch is a one-line kubectl patch:

```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:1.4.2
          ports:
            - containerPort: 8080
---
# service.yaml: currently pointing at blue
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # <-- patch this to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
```

```shell
# Cutover command
kubectl patch service myapp \
  -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback command (instant)
kubectl patch service myapp \
  -p '{"spec":{"selector":{"version":"blue"}}}'
```
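In a pipeline you usually do not want to hard-code which colour is next. A minimal sketch, assuming the Service's version label only ever holds blue or green: read the live colour, compute the other one, and build the patch body from it. The helper names `next_color` and `selector_patch` are hypothetical.

```shell
#!/usr/bin/env bash
# next_color CURRENT
# Prints the other colour; fails on anything that is not blue or green.
next_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown colour: '$1'" >&2; return 1 ;;
  esac
}

# selector_patch COLOUR
# Prints the patch body that points the Service selector at COLOUR.
selector_patch() {
  printf '{"spec":{"selector":{"version":"%s"}}}\n' "$1"
}

# Usage against a real cluster (not executed here):
#   live="$(kubectl get service myapp -o jsonpath='{.spec.selector.version}')"
#   kubectl patch service myapp -p "$(selector_patch "$(next_color "$live")")"
```

Because the same two lines perform both cutover and rollback, the rollback path is exercised on every release instead of only during incidents.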

The Rollback: One Command, Sub-Second

This is the most underrated property of blue-green. A rolling deployment rolls back by redeploying the old image, which takes minutes and may itself fail. A blue-green rollback is a router flip: the old environment is already running, already warm, already healthy. The rollback portion of mean time to recovery (MTTR) shrinks from roughly 5–10 minutes to under 10 seconds once the decision to roll back has been made; the time needed to detect the problem is unchanged.

Keep Blue warm for at least 30 minutes after cutover. Most production incidents surface within the first 10–15 minutes of a release. If you shut down Blue immediately after the flip, you lose the fast rollback and are back to a full redeploy on an incident. Set a post-release timer in your runbook.
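The post-release timer can also be automated. The sketch below flips to Green, then checks health at a fixed interval for the whole watch window and flips back to Blue on the first failure. It assumes two hook functions that you define yourself, `flip_to` and `healthy`; both names, and `release_with_watch` itself, are hypothetical.

```shell
#!/usr/bin/env bash
# release_with_watch WINDOW_SECONDS INTERVAL_SECONDS
# Requires two caller-defined functions:
#   flip_to COLOUR   routes all traffic to COLOUR
#   healthy          exits 0 while the live environment looks healthy
release_with_watch() {
  local window="$1" interval="$2" elapsed=0
  flip_to green
  while [ "$elapsed" -lt "$window" ]; do
    if ! healthy; then
      echo "health check failed after ${elapsed}s, rolling back to Blue"
      flip_to blue
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "watch window passed, Blue can be retired"
}

# Usage (30-minute window, check every 15 s; hooks shown as examples only):
#   flip_to() { kubectl patch service myapp -p "{\"spec\":{\"selector\":{\"version\":\"$1\"}}}"; }
#   healthy() { curl -fsS https://api.example.com/healthz >/dev/null; }
#   release_with_watch 1800 15
```

A single failed check triggers the rollback here; in practice you may prefer to require several consecutive failures so that one slow response does not undo a good release.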

The Shared Database Problem

Blue and Green share the same database. This is simultaneously the biggest strength (no data sync) and the biggest constraint (schema changes must be backward-compatible). v2 must be able to read data written by v1 and vice versa, for two reasons: during the cutover both versions may briefly serve traffic while in-flight requests on Blue drain, and after a rollback v1 has to work with whatever v2 wrote while it was live.

The correct pattern is Expand-Contract (covered in Lesson 8): first deploy a v1.5 that adds the new column but does not require it, then cut over to v2 that uses it, then drop the old column in a later release. Never change a schema and the application in the same deployment when using blue-green.
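As a rough illustration of the phases, the helper below prints the SQL that would accompany each step. The table and column names (`users`, `full_name`, `display_name`) are invented for the example, and the separate backfill step is an assumption about how existing rows get populated. Treat it as a sketch of the ordering, not a migration tool.

```shell
#!/usr/bin/env bash
# migration_sql PHASE
# Prints the SQL for one phase of a hypothetical column change
# (users.full_name -> users.display_name).
migration_sql() {
  case "$1" in
    expand)
      # Ships with v1.5: purely additive, so v1 keeps working unchanged.
      printf '%s\n' "ALTER TABLE users ADD COLUMN display_name TEXT NULL;"
      ;;
    backfill)
      # Runs before the cutover to v2, which reads the new column.
      printf '%s\n' "UPDATE users SET display_name = full_name WHERE display_name IS NULL;"
      ;;
    contract)
      # Later release, once v1 (Blue) is no longer a rollback target.
      printf '%s\n' "ALTER TABLE users DROP COLUMN full_name;"
      ;;
    *)
      echo "usage: migration_sql expand|backfill|contract" >&2
      return 1
      ;;
  esac
}

# Usage:
#   migration_sql expand
```

The contract step is the only destructive one, which is why it waits until no running version still reads the old column.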

Session affinity and caches: if your application stores session state in-process (not in Redis or a database), users on Blue will lose their sessions after the flip because Green does not share in-memory state. Always externalise session storage before adopting blue-green. Likewise, a warm cache on Blue does not transfer to Green — Green starts with a cold cache and may produce a temporary throughput dip. Pre-warm Green by sending synthetic traffic before the cutover.
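Pre-warming can be as simple as replaying a list of representative paths against Green before the flip. A minimal sketch, assuming a text file with one request path per line and a Green endpoint reachable at a base URL of your choosing; `prewarm` and the `FETCH` override are hypothetical names.

```shell
#!/usr/bin/env bash
# prewarm BASE_URL PATHS_FILE
# Requests every path in PATHS_FILE against BASE_URL and reports the totals.
# FETCH overrides the request command (defaults to curl), e.g. for a dry run.
prewarm() {
  local base="$1" paths="$2" path ok=0 failed=0
  while IFS= read -r path; do
    if [ -z "$path" ]; then continue; fi   # skip blank lines
    if ${FETCH:-curl -fsS -o /dev/null} "${base}${path}" </dev/null; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
  done < "$paths"
  echo "warmed=${ok} failed=${failed}"
}

# Usage (hypothetical URL and file):
#   prewarm "https://green.internal.example.com" hot-paths.txt
```

A non-zero failed count before the cutover is also a cheap extra smoke test: Green should be able to serve its hottest paths before it sees real users.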

Cost and When to Use Blue-Green

Blue-green doubles your compute footprint for the deployment window, which includes the time Blue stays warm as a rollback target. On pay-per-use platforms (Lambda, Cloud Run) and on container platforms with spare cluster capacity, the extra cost is usually small because the second environment consumes resources only during staging, the rollback window, and the drain. For large EC2-backed fleets the cost is real: use blue-green for your revenue-critical, low-tolerance services and rolling deployments for stateless workers and batch jobs where a short restart is acceptable.
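To put a number on the overlap, multiply the size of the second fleet by its hourly price and by how long both environments run side by side. The figures in the usage line are made up for illustration, and `overlap_cost_cents` is a hypothetical helper that works in whole cents to stay within shell integer arithmetic.

```shell
#!/usr/bin/env bash
# overlap_cost_cents INSTANCES CENTS_PER_INSTANCE_HOUR OVERLAP_MINUTES
# Prints the extra cost of running the second environment, in cents
# (integer division, so fractions of a cent are truncated).
overlap_cost_cents() {
  echo $(( $1 * $2 * $3 / 60 ))
}

# Usage: 40 instances at 20 cents per hour, overlapping for 90 minutes
#   overlap_cost_cents 40 20 90
```

Multiplying the result by your release frequency gives the monthly premium you pay for instant rollback, which is the figure to weigh against the cost of a slower recovery.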

Blue-green is the right default for customer-facing APIs, payment processing, authentication services, and any service with an SLA of 99.9 % or higher. For these services, instant rollback and zero-downtime cutover usually justify the extra infrastructure cost and complexity.