Auto Scaling Groups & ELB
Any production system designed for a fixed number of instances is designed to fail at scale. Auto Scaling Groups (ASGs) and Elastic Load Balancing (ELB) are the two primitives that make AWS workloads self-healing and horizontally scalable. Together with Launch Templates, they form a trio that every DevOps engineer must understand at depth — not just to pass an exam, but because misconfiguring any one of them is responsible for a large fraction of production outages at cloud-native companies.
This lesson covers the full picture: how Launch Templates describe what to launch, how ASGs decide when and how many to launch, and how Target Groups with health checks ensure only healthy instances receive traffic — with a focus on real production failure modes along the way.
Launch Templates: The Immutable Blueprint
A Launch Template (LT) is a versioned, immutable specification of everything needed to boot an EC2 instance: AMI ID, instance type, key pair, security groups, IAM instance profile, user data, EBS volumes, and network settings. It replaces the older Launch Configuration (now deprecated — do not use it for new workloads).
The key advantage of Launch Templates over Launch Configurations is versioning. Every template has a $Default and a $Latest version alias, and an ASG can instead pin to an explicit version number so that a botched AMI update does not roll out to production automatically. This matters enormously at scale: in mature setups the pinned LT version is controlled by the CI/CD pipeline and promoted through dev → staging → production gates.
Setting MetadataOptions.HttpTokens to required enforces IMDSv2 (token-based metadata access). IMDSv1 is exploitable via SSRF: if an attacker can make your app issue an HTTP GET request to 169.254.169.254, they can steal the instance role credentials. IMDSv2 requires a PUT request with a custom header to obtain a session token first, which a typical SSRF flaw cannot issue. Set it in every Launch Template you create, no exceptions.
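One way to make the IMDSv2 rule hard to forget is to build the launch template payload in code. The sketch below assembles a dictionary in the shape that boto3's ec2.create_launch_template accepts as LaunchTemplateData. The helper name, AMI ID, security group ID, and instance profile name are placeholders invented for illustration, and the boto3 call is left commented out because it needs AWS credentials.

```python
def build_launch_template_data(ami_id, instance_type, security_group_ids,
                               instance_profile_name):
    """Return a LaunchTemplateData dict that enforces IMDSv2.

    Hypothetical helper: it only assembles the request payload.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "SecurityGroupIds": list(security_group_ids),
        "IamInstanceProfile": {"Name": instance_profile_name},
        "MetadataOptions": {
            "HttpEndpoint": "enabled",
            # "required" turns off IMDSv1; every metadata read needs a token.
            "HttpTokens": "required",
            # Assumption: a hop limit of 1 suits apps running directly on the
            # instance; containerized workloads may need a higher value.
            "HttpPutResponseHopLimit": 1,
        },
    }


# Placeholder identifiers, not real resources.
lt_data = build_launch_template_data(
    ami_id="ami-0123456789abcdef0",
    instance_type="t3.medium",
    security_group_ids=["sg-0123456789abcdef0"],
    instance_profile_name="web-app-instance-profile",
)

# With credentials configured, the payload could be submitted like this:
# import boto3
# boto3.client("ec2").create_launch_template(
#     LaunchTemplateName="web-app", LaunchTemplateData=lt_data)
```

Keeping the payload in a function makes it easy for a pipeline check to reject any template whose HttpTokens value is not required.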
Auto Scaling Groups: Self-Healing Fleets
An Auto Scaling Group maintains a fleet of EC2 instances between a configured minimum and maximum count. It uses your Launch Template to boot new instances and terminates old ones based on scaling policies and health check results. The ASG is the entity that glues together the LT (what to run), the Target Group (where to register), and CloudWatch metrics (when to scale).
Key ASG configuration parameters:
- MinSize / MaxSize / DesiredCapacity: the floor, ceiling, and current target instance count. In production, never set MinSize to 0 unless the workload is truly batch/off-hours.
- Multi-AZ distribution: always span at least two Availability Zones. The ASG's AvailabilityZones or VPCZoneIdentifier (subnet IDs) controls this. When AZ-a fails, the ASG automatically launches replacements in AZ-b.
- Health check type: either EC2 (instance status checks only) or ELB (EC2 status checks plus the load balancer's health check result). Always use ELB in production: an instance can pass EC2 status checks while your application is totally broken.
- Warmup / cooldown: DefaultInstanceWarmup tells the ASG how long to wait for a new instance to be ready before counting its metrics toward scaling decisions. Without this, the load spike during bootstrap feeds into the group's CloudWatch metrics and the ASG over-provisions.
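The parameters above map onto a single create_auto_scaling_group request. The following is a minimal sketch, assuming boto3's autoscaling client as the eventual consumer; the helper name, subnet IDs, target group ARN, and all numeric values are illustrative placeholders, and the sanity checks are this sketch's own guardrails rather than AWS rules.

```python
def build_asg_request(name, lt_name, lt_version, subnet_ids, target_group_arns,
                      min_size, max_size, desired,
                      grace_seconds, warmup_seconds):
    """Assemble keyword arguments for autoscaling.create_auto_scaling_group."""
    if not (min_size <= desired <= max_size):
        raise ValueError("desired capacity must lie between min and max")
    if len(subnet_ids) < 2:
        # Only counts subnets; confirming they sit in different AZs would
        # need an API lookup that this sketch does not perform.
        raise ValueError("provide subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        # An explicit version number instead of "$Latest" keeps rollouts
        # under the pipeline's control.
        "LaunchTemplate": {"LaunchTemplateName": lt_name,
                           "Version": str(lt_version)},
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # comma-separated string
        "TargetGroupARNs": list(target_group_arns),
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_seconds,
        "DefaultInstanceWarmup": warmup_seconds,
    }


# Placeholder values for illustration only.
asg_request = build_asg_request(
    name="web-app-asg",
    lt_name="web-app",
    lt_version=7,
    subnet_ids=["subnet-0aaa1111bbbb2222c", "subnet-0ddd3333eeee4444f"],
    target_group_arns=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/web-app/0123456789abcdef"
    ],
    min_size=2, max_size=10, desired=3,
    grace_seconds=120, warmup_seconds=90,
)

# import boto3
# boto3.client("autoscaling").create_auto_scaling_group(**asg_request)
```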
For request-driven services, consider scaling on RequestCountPerTarget on the ALB rather than CPU: it reacts faster and correlates directly with user experience.
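Scaling on request count is typically expressed as a target tracking policy. The sketch below builds the arguments for boto3's autoscaling.put_scaling_policy using the predefined ALBRequestCountPerTarget metric. The helper name, policy name, resource label, and target value of 500 requests per target are assumptions chosen for illustration.

```python
def build_request_count_policy(asg_name, resource_label, requests_per_target):
    """Assemble keyword arguments for autoscaling.put_scaling_policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-requests-per-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
                "ResourceLabel": resource_label,
            },
            "TargetValue": float(requests_per_target),
        },
    }


# Placeholder label; a real one is derived from the ALB and TG ARNs.
policy_request = build_request_count_policy(
    asg_name="web-app-asg",
    resource_label=("app/web-app-alb/0123456789abcdef/"
                    "targetgroup/web-app/fedcba9876543210"),
    requests_per_target=500,
)

# import boto3
# boto3.client("autoscaling").put_scaling_policy(**policy_request)
```

The right target value depends on how many requests one instance can serve at acceptable latency, which has to come from load testing rather than from this example.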
Target Groups and Health Checks: The Contract Between ALB and ASG
A Target Group (TG) is the mechanism by which an Application Load Balancer (ALB) or Network Load Balancer (NLB) knows which instances (or IPs, or Lambda functions) to route requests to. When you attach a TG to an ASG, every instance the ASG launches is automatically registered in the TG; every instance the ASG terminates is deregistered. This is seamless — but health checks are what make it safe.
A Target Group health check continuously polls each registered target on a configured path and port. Results:
- Healthy: the target is in the rotation and receives traffic.
- Unhealthy: after a configurable number of consecutive failed checks, the target is marked unhealthy and removed from rotation. If the ASG's health check type is ELB, the ASG also learns of the failure and will terminate and replace the instance.
- Initial: the target was just registered and its first health checks have not yet completed.
- Draining (deregistration delay): when an instance is marked for termination, the TG waits up to deregistration_delay seconds for in-flight requests to complete before closing connections. The default is 300 s, which is too long for most apps; tune it to 30–60 s so rolling deployments finish faster.
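The health check and draining settings described above can be written down as request payloads. This is a sketch under stated assumptions: the dictionaries follow the shape of boto3's elbv2.create_target_group and elbv2.modify_target_group_attributes, while the helper names, port, interval, thresholds, and VPC ID are illustrative values, and the detection-time estimate is a rough approximation, not an AWS guarantee.

```python
def build_target_group_request(name, vpc_id, port=8080, path="/health",
                               interval=10, timeout=5,
                               healthy_threshold=2, unhealthy_threshold=3):
    """Assemble keyword arguments for elbv2.create_target_group."""
    if timeout >= interval:
        # Sanity check of this sketch: a check should finish before the
        # next one starts.
        raise ValueError("timeout should be shorter than the interval")
    return {
        "Name": name,
        "Protocol": "HTTP",
        "Port": port,
        "VpcId": vpc_id,
        "TargetType": "instance",
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPath": path,
        "HealthCheckIntervalSeconds": interval,
        "HealthCheckTimeoutSeconds": timeout,
        "HealthyThresholdCount": healthy_threshold,
        "UnhealthyThresholdCount": unhealthy_threshold,
        "Matcher": {"HttpCode": "200"},
    }


def build_deregistration_attributes(delay_seconds):
    """Attributes list for elbv2.modify_target_group_attributes."""
    return [{"Key": "deregistration_delay.timeout_seconds",
             "Value": str(delay_seconds)}]  # attribute values are strings


def approx_seconds_to_unhealthy(request):
    """Rough time for a failing target to be marked unhealthy."""
    return (request["HealthCheckIntervalSeconds"]
            * request["UnhealthyThresholdCount"])


tg_request = build_target_group_request("web-app", "vpc-0123456789abcdef0")
drain_attributes = build_deregistration_attributes(30)
detection_seconds = approx_seconds_to_unhealthy(tg_request)
```

With the example values, a broken instance keeps receiving traffic for roughly interval times threshold seconds, which is the number to weigh against how aggressive the checks can be without flapping.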
If your /health path returns HTTP 200 without actually checking whether your application can connect to the database, call downstream services, or read its config, the ALB will happily route traffic to a broken instance that returns 200 on health checks and 500 on real requests. Build a deep health check that validates all critical dependencies, and make it fast (under 2 s). Separate the deep check from the liveness check if your app has a slow startup.
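A deep health check can be sketched as a function that runs one probe per critical dependency and reports failure if any probe fails or the whole run exceeds its time budget. Everything here is hypothetical: the function names, the probe callables, and the response shape are illustrative, and the sketch does not enforce per-probe timeouts, so each real probe should carry its own short timeout.

```python
import time


def run_deep_health_check(checks, budget_seconds=2.0, clock=time.monotonic):
    """Run named dependency probes and return (status_code, body).

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unusable.
    """
    started = clock()
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            results[name] = f"failed: {exc}"
    elapsed = clock() - started
    all_ok = all(value == "ok" for value in results.values())
    within_budget = elapsed <= budget_seconds
    status = 200 if (all_ok and within_budget) else 503
    return status, {"checks": results, "elapsed_seconds": elapsed}


def liveness_check():
    """Cheap check: the process is up and can answer at all."""
    return 200, {"status": "alive"}


def failing_database_probe():
    # Stand-in for a real connection attempt.
    raise ConnectionError("db unreachable")


status_ok, body_ok = run_deep_health_check({"database": lambda: None})
status_bad, body_bad = run_deep_health_check(
    {"database": failing_database_probe})
```

A web framework handler would return the status code and body from run_deep_health_check on the path the Target Group polls, and expose liveness_check on a separate path for slow-starting apps.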
The Trio: How Launch Template, ASG, and Target Group Work Together
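Understanding the flow of a request, from the user's browser to your application running on an automatically provisioned instance, requires understanding how all three components interact. The Launch Template defines what each instance looks like, the ASG launches instances from it and registers them in the Target Group, and the load balancer forwards each request only to targets that are currently passing the Target Group's health checks.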
Instance Refresh: Zero-Downtime AMI Rollouts
When you bake a new AMI (e.g., after a security patch), you need to replace all running instances without dropping traffic. The ASG Instance Refresh feature handles this automatically: it replaces instances in batches, keeps at least MinHealthyPercentage of the group in service, and lets each instance drain from the TG before it is terminated.
Setting CheckpointPercentages to [33, 66, 100] with a CheckpointDelay of 600 seconds means the refresh pauses at 33% and 66% completion. This gives your monitoring time to catch regressions before the entire fleet is rolled. Combined with automated canary checks in your deployment pipeline, this pattern gives you near-zero-downtime, blue/green-like safety with far simpler infrastructure than maintaining two full ASGs.
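The checkpoint settings translate directly into the Preferences of a start_instance_refresh request. The sketch below assumes boto3's autoscaling client as the consumer; the helper name and the MinHealthyPercentage of 90 are illustrative choices, and the validation reflects the requirement that the last checkpoint be 100 for the whole group to be replaced.

```python
def build_instance_refresh_request(asg_name, min_healthy_percentage=90,
                                   checkpoints=(33, 66, 100),
                                   checkpoint_delay_seconds=600):
    """Assemble keyword arguments for autoscaling.start_instance_refresh."""
    checkpoints = list(checkpoints)
    if checkpoints != sorted(checkpoints):
        raise ValueError("checkpoints must be in ascending order")
    if checkpoints[-1] != 100:
        # A final value below 100 would leave part of the fleet on the
        # old AMI.
        raise ValueError("last checkpoint must be 100 to replace the fleet")
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": min_healthy_percentage,
            "CheckpointPercentages": checkpoints,
            "CheckpointDelay": checkpoint_delay_seconds,
        },
    }


refresh_request = build_instance_refresh_request("web-app-asg")

# import boto3
# boto3.client("autoscaling").start_instance_refresh(**refresh_request)
```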
Production Failure Modes to Know
These are the most common ASG/ELB failure scenarios encountered in real production environments:
- Health check grace period too short: if HealthCheckGracePeriod is shorter than your application startup time, the ASG will mark newly launched instances as unhealthy and immediately terminate them. The fleet then thrashes in a launch/terminate loop. Set the grace period to at least 1.5x your p99 startup time.
- Deregistration delay too long: the default 300 s delay blocks rolling deployments. Instances sit in draining state for 5 minutes even when they have no in-flight connections, because the ALB conservatively waits. Tune this to 30–60 s for stateless HTTP services.
- Scale-in termination targeting the wrong instance: the default termination policy removes instances in the AZ with the most instances, then the oldest LT version, then the one nearest its next billing hour. This is usually correct, but if you have stateful instances (e.g., a Kafka broker in the group), add a lifecycle hook on termination to gracefully drain the broker before the instance is killed.
- Capacity rebalance for Spot not enabled: if you use Spot Instances in a mixed-instance policy, enable capacityRebalance: true on the ASG. Without it, AWS reclaims Spot Instances with only a 2-minute warning, causing abrupt instance loss. With it, the ASG proactively launches a replacement when EC2 issues a rebalance recommendation, a signal that can arrive ahead of the 2-minute interruption notice.
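Three of the mitigations above can be captured in a few lines. This is a rough sketch, not a definitive implementation: the helper names, hook name, and timeout values are hypothetical, and the dictionaries follow the shape of boto3's autoscaling.put_lifecycle_hook and autoscaling.update_auto_scaling_group requests.

```python
import math


def recommended_grace_period(p99_startup_seconds, safety_factor=1.5):
    """Grace period of at least safety_factor times the p99 startup time."""
    return math.ceil(p99_startup_seconds * safety_factor)


def build_termination_hook(asg_name, hook_name, heartbeat_timeout_seconds):
    """Assemble keyword arguments for autoscaling.put_lifecycle_hook."""
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        # Pauses the instance in a wait state before it is terminated so a
        # drain script can run.
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "HeartbeatTimeout": heartbeat_timeout_seconds,
        # If the drain never signals completion, proceed with termination.
        "DefaultResult": "CONTINUE",
    }


def build_capacity_rebalance_update(asg_name):
    """Assemble keyword arguments for autoscaling.update_auto_scaling_group."""
    return {"AutoScalingGroupName": asg_name, "CapacityRebalance": True}


# Example: an app whose p99 startup time was measured at 80 seconds.
grace_seconds = recommended_grace_period(80)
drain_hook = build_termination_hook("kafka-asg", "drain-broker", 900)
rebalance_update = build_capacity_rebalance_update("web-app-asg")
```

The lifecycle hook only buys time; something still has to react to it, perform the drain, and complete the lifecycle action before the heartbeat timeout expires.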