Auto-Scaling & Elasticity
Traffic is never flat. An e-commerce site that idles at 200 requests per second on a Tuesday morning may hit 20,000 requests per second on Black Friday. A news platform sees a sudden 50× spike the moment a story goes viral. A SaaS application experiences a sharp ramp-up every weekday morning as users start work, then drops off overnight.
Running enough servers to survive the peak at all times is wasteful: you pay for idle capacity for most of the day. Running only enough for the average load means your service collapses the moment demand spikes. Auto-scaling resolves this trade-off by letting your infrastructure expand and contract automatically, matching capacity to demand in near real time.
Elasticity vs Manual Scaling
Manual scaling means an engineer logs in, provisions new servers, adds them to the load balancer, and later tears them down. This works at small scale but cannot react fast enough to sudden spikes, and it introduces human error and delays. Auto-scaling removes the human from that loop: a control plane monitors metrics, compares them against thresholds, and adds or removes instances automatically, often within 60–90 seconds.
Elasticity is the broader property: a system is elastic if it can scale both up and down smoothly. Scaling only upward is not truly elastic — a system that adds servers during a spike but never releases them wastes money and may cause configuration drift over time.
How Auto-Scaling Works: The Control Loop
Every auto-scaling system follows the same fundamental feedback loop:
- Collect metrics — CPU utilisation, memory, request rate, queue depth, latency P99, custom application metrics.
- Evaluate policies — compare current metrics against defined thresholds (e.g. "scale out if average CPU > 70% for 2 minutes").
- Make a scaling decision — add N instances (scale-out) or remove N instances (scale-in).
- Converge the fleet — launch or terminate instances; update the load balancer target group.
- Stabilise — a cooldown period (typically 120–300 seconds) prevents oscillation before the next decision is made.
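The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than any provider's implementation: the `Policy` and `Autoscaler` names, the thresholds, and the single-sample evaluation (a real policy would require the breach to persist, e.g. for 2 minutes) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    # All values are illustrative assumptions, not provider defaults.
    scale_out_above: float = 70.0  # average CPU % that triggers scale-out
    scale_in_below: float = 30.0   # average CPU % that triggers scale-in
    step: int = 1                  # instances added or removed per decision
    cooldown_s: float = 180.0      # quiet period after any scaling action
    min_size: int = 2
    max_size: int = 20


class Autoscaler:
    def __init__(self, policy: Policy, size: int) -> None:
        self.policy = policy
        self.size = size
        self._last_action_at = float("-inf")

    def tick(self, now: float, avg_cpu: float) -> int:
        """Run one loop iteration and return the resulting fleet size."""
        p = self.policy
        # Stabilise: skip evaluation while the cooldown is still running.
        if now - self._last_action_at < p.cooldown_s:
            return self.size
        # Evaluate policies and make a scaling decision.
        if avg_cpu > p.scale_out_above:
            target = min(self.size + p.step, p.max_size)
        elif avg_cpu < p.scale_in_below:
            target = max(self.size - p.step, p.min_size)
        else:
            target = self.size
        # Converge the fleet (a real controller would call the cloud API here).
        if target != self.size:
            self.size = target
            self._last_action_at = now
        return self.size


scaler = Autoscaler(Policy(), size=2)
samples = [(0, 85), (60, 90), (200, 90), (400, 20)]  # (time in s, avg CPU %)
sizes = [scaler.tick(now, cpu) for now, cpu in samples]
print(sizes)  # [3, 3, 4, 3]
```

The second sample is ignored even though CPU is at 90%, because it arrives inside the cooldown that started with the first scale-out.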
Scaling Policies
Cloud providers (AWS Auto Scaling Groups, Google Cloud Managed Instance Groups, Azure VMSS) offer several policy types:
- Target Tracking — the most common. You declare a target (e.g. "keep average CPU at 60%") and the control plane calculates how many instances are needed to reach it. Simple, self-tuning, and recommended as the default.
- Step Scaling — a tiered response based on alarm thresholds: "if CPU > 70%, add 2; if CPU > 90%, add 5." Useful when the relationship between load and required capacity is non-linear.
- Scheduled Scaling — pre-emptive. You know traffic rises every Monday morning at 09:00, so you schedule a capacity boost at 08:45. This avoids the lag inherent in reactive policies.
- Predictive Scaling (AWS, GCP ML-based) — the platform analyses historical traffic patterns and provisions capacity before demand arrives. Most effective when traffic has strong weekly/daily seasonality.
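To make the difference between reactive and scheduled policies concrete, here is a small sketch that combines the step tiers quoted above with a scheduled capacity floor. The function names, the floor sizes, and the noon end of the Monday window are assumptions for illustration; real providers express these as policy configuration rather than code.

```python
from datetime import datetime


def step_adjustment(cpu: float) -> int:
    """Tiered response from the text: above 90% add 5, above 70% add 2."""
    if cpu > 90:
        return 5
    if cpu > 70:
        return 2
    return 0


def scheduled_minimum(now: datetime, base_min: int = 2, boosted_min: int = 10) -> int:
    """Raise the capacity floor on Mondays from 08:45 (window end at 12:00 is assumed)."""
    is_monday = now.weekday() == 0
    minutes = now.hour * 60 + now.minute
    in_window = 8 * 60 + 45 <= minutes < 12 * 60
    return boosted_min if is_monday and in_window else base_min


def desired_capacity(current: int, cpu: float, now: datetime) -> int:
    """Reactive step scaling, never dropping below the scheduled floor."""
    return max(current + step_adjustment(cpu), scheduled_minimum(now))


# 1 January 2024 was a Monday, 2 January a Tuesday.
print(desired_capacity(4, 75.0, datetime(2024, 1, 1, 8, 50)))  # 10
print(desired_capacity(4, 95.0, datetime(2024, 1, 2, 8, 50)))  # 9
```

On Monday morning the scheduled floor dominates, so capacity is in place before the reactive policy would have asked for it.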
The Instance Warm-Up Problem
One of the trickiest aspects of auto-scaling is that new instances are not instantly ready. A fresh server must boot and initialise its operating system; the application must then start, load its config, warm its caches, and establish database connections. In practice this warm-up window ranges from about 30 seconds (a lightweight container on Kubernetes) to 5–10 minutes (a large Java application with heavy startup). During that window the instance should not yet receive production traffic.
Cloud platforms handle this with health checks and warm-up periods. The load balancer only routes requests to an instance that passes its health check. The auto-scaling group has a configurable warm-up time during which the new instance is not counted in aggregate metrics — this prevents the controller from seeing a temporary CPU spike on a booting machine and deciding to launch even more instances.
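The effect of the warm-up period on the aggregate metric can be illustrated with a short sketch. The `Instance` type, the 300-second warm-up, and the sample CPU values are hypothetical; the point is only that a booting machine is left out of both the average and the routing pool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    launched_at: float  # launch time in seconds
    cpu: float          # current CPU utilisation in percent
    healthy: bool       # result of the latest load balancer health check


def aggregate_cpu(instances: List[Instance], now: float, warmup_s: float = 300.0) -> Optional[float]:
    """Average CPU over instances that have finished warming up."""
    ready = [i for i in instances if now - i.launched_at >= warmup_s]
    if not ready:
        return None
    return sum(i.cpu for i in ready) / len(ready)


def routable(instances: List[Instance]) -> List[Instance]:
    """Only instances passing their health check receive traffic."""
    return [i for i in instances if i.healthy]


fleet = [
    Instance(launched_at=0, cpu=60.0, healthy=True),
    Instance(launched_at=0, cpu=70.0, healthy=True),
    Instance(launched_at=950, cpu=100.0, healthy=False),  # still booting
]
print(aggregate_cpu(fleet, now=1000))  # 65.0
print(len(routable(fleet)))            # 2
```

A naive average over all three instances would be about 76.7%, which is above a 70% scale-out threshold and would trigger another, unnecessary launch.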
Scale-In: The Overlooked Half
Engineers tend to focus on scale-out (adding capacity), but scale-in (removing capacity) is equally important for cost control. Scale-in policies typically use a longer cooldown and more conservative thresholds: you scale out aggressively but scale in cautiously. This asymmetry is intentional, because removing a server that is still needed is far worse than keeping a spare one running for a few extra minutes.
Scale-in also requires that your instances be disposable. Even a stateless service will drop connections if an instance is terminated mid-request. To handle this gracefully:
- Enable connection draining (AWS) or graceful termination (Kubernetes): the load balancer stops routing new requests to the instance being terminated while it finishes in-flight requests, typically with a 30–60 second drain window.
- Ensure your application handles SIGTERM by finishing current work before exiting.
- Never store session state or user data on the instance itself; use a shared cache (Redis) or database instead.
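One common way to honour SIGTERM is to have the signal handler set a flag that the work loop checks between units of work. The sketch below assumes a simple job loop; `install_sigterm_handler`, `drain`, and `handle` are hypothetical names, and the demo sets the flag directly to simulate the signal arriving mid-job.

```python
import signal
import threading

stop_requested = threading.Event()


def install_sigterm_handler() -> None:
    """Register a handler that only sets a flag; call this from the main thread."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_requested.set())


def drain(jobs, handle):
    """Process jobs until a stop is requested; the job in progress always finishes."""
    finished = []
    for job in jobs:
        if stop_requested.is_set():
            break  # take no new work once termination has been requested
        finished.append(handle(job))
    return finished


def handle(job):
    if job == 2:
        stop_requested.set()  # simulate SIGTERM arriving while job 2 is running
    return job * 10


result = drain([1, 2, 3, 4], handle)
print(result)  # [10, 20]
```

In a real service, `install_sigterm_handler()` would be called once at startup, and the drain window configured on the load balancer should be at least as long as the slowest request.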
Auto-Scaling Dimensions
Auto-scaling is not limited to virtual machines. Modern systems auto-scale at multiple levels simultaneously:
- VM / Instance level — AWS Auto Scaling Groups, GCP Managed Instance Groups, Azure VMSS. The classic model.
- Container / Pod level — Kubernetes Horizontal Pod Autoscaler (HPA) scales the number of pods; Cluster Autoscaler adds nodes to the cluster when pods are unschedulable.
- Serverless / Function level — AWS Lambda, Google Cloud Functions, and Azure Functions scale from zero to thousands of concurrent invocations automatically. No fleet to manage; you pay per invocation.
- Queue-driven worker scaling — scale the number of background workers based on queue depth. AWS SQS + Lambda, Kubernetes KEDA, or a simple Auto Scaling Group triggered by an SQS CloudWatch metric. Extremely effective for batch-processing workloads.
- Database read replicas: some managed databases can add read replicas automatically as read throughput grows (Amazon Aurora, for example, supports auto scaling of its Aurora Replicas).
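For the queue-driven case, the sizing rule is often a simple ratio of backlog to per-worker capacity. The following sketch assumes a hypothetical target of 100 queued messages per worker and a cap of 50 workers; tools such as KEDA expose similar knobs as configuration.

```python
import math


def desired_workers(queue_depth: int, per_worker_backlog: int,
                    min_workers: int = 0, max_workers: int = 50) -> int:
    """Size the worker pool so each worker owns at most per_worker_backlog messages."""
    wanted = math.ceil(queue_depth / per_worker_backlog)
    return max(min_workers, min(wanted, max_workers))


print(desired_workers(0, 100))      # 0
print(desired_workers(950, 100))    # 10
print(desired_workers(10000, 100))  # 50
```

An empty queue scales the pool to zero, and a very deep queue is capped at the maximum so a backlog cannot provision unbounded capacity.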
Kubernetes Horizontal Pod Autoscaler (HPA)
In a Kubernetes environment the HPA continuously queries the Metrics Server (or a custom metrics adapter for business metrics) and adjusts the replicas field of a Deployment. A simple HPA targeting 50% CPU looks like this:
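A manifest consistent with that description might look like the following sketch. It assumes the `autoscaling/v2` API and a Deployment named `web`; both object names are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # percent of each pod's CPU request
```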
With this config Kubernetes will maintain between 2 and 20 pods, adding or removing pods to keep average CPU utilisation near 50%. KEDA (Kubernetes Event-Driven Autoscaling) extends this to scale on external signals like SQS queue depth, Kafka lag, or Prometheus metrics.
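The replica count the HPA aims for follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with changes skipped when the ratio is within a tolerance of 1 (0.1 by default). A Python sketch of that calculation, using the 2–20 bounds and 50% target from above (the function name is hypothetical, and details such as pod readiness and stabilisation windows are omitted):

```python
import math


def hpa_desired_replicas(current_replicas: int, current_util: float, target_util: float,
                         min_replicas: int = 2, max_replicas: int = 20,
                         tolerance: float = 0.1) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_util / target_util
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(desired, max_replicas))


print(hpa_desired_replicas(4, 80, 50))    # 7
print(hpa_desired_replicas(4, 52, 50))    # 4
print(hpa_desired_replicas(4, 10, 50))    # 2
print(hpa_desired_replicas(15, 100, 50))  # 20
```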
Cost Optimisation & Right-Sizing
Auto-scaling is primarily a reliability tool, but it also has major cost implications. Key strategies:
- Set a meaningful minimum — a minimum of zero sounds tempting for cost, but cold-start latency (a Lambda spinning up a new container) can be 500ms–2s. For latency-sensitive services, keep a standing minimum of 1–2 instances.
- Use Spot / Preemptible instances for the auto-scaled fleet. AWS Spot can be up to 90% cheaper than On-Demand. Because Spot instances can be reclaimed with two minutes' notice, design your application to handle graceful shutdown, which is the same resilience you need for scale-in anyway.
- Monitor scale-in efficiency — if your fleet never scales in, your thresholds or cooldowns are misconfigured. A healthy elastic system regularly releases capacity during off-peak hours.
- Watch for thrashing — rapid add/remove oscillation wastes money and creates instability. Increase cooldown periods or widen the dead-band between scale-out and scale-in thresholds.
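Thrashing is easiest to see in a toy simulation. The sketch below assumes a constant load and a simplified model in which average CPU is just load divided by instance count; the `simulate` function and all numbers are illustrative.

```python
def simulate(load: float, instances: int, scale_out_above: float,
             scale_in_below: float, ticks: int = 6) -> list:
    """Return the fleet size after each tick under a constant load."""
    history = []
    for _ in range(ticks):
        # Assumed model: one instance at 100% CPU handles 100 units of load.
        cpu = load / instances
        if cpu > scale_out_above:
            instances += 1
        elif cpu < scale_in_below:
            instances -= 1
        history.append(instances)
    return history


# Narrow dead-band (65 to 70): every scale-out immediately triggers a scale-in.
print(simulate(430, 6, scale_out_above=70, scale_in_below=65))  # [7, 6, 7, 6, 7, 6]
# Wide dead-band (40 to 70): the fleet settles after one scale-out.
print(simulate(430, 6, scale_out_above=70, scale_in_below=40))  # [7, 7, 7, 7, 7, 7]
```

With six instances CPU sits near 71.7%, and with seven it drops to about 61.4%. Only the wider band leaves room for that post-scale-out value to count as "fine".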
Putting It All Together
A production-grade auto-scaling strategy for a typical web service looks like this: maintain a minimum fleet sized to absorb average off-peak traffic plus one server's worth of headroom; use target tracking (CPU 60%, or a custom RPS metric) as the primary reactive policy; add scheduled scaling to pre-warm before known events; configure connection draining on scale-in; run the elastic portion of the fleet on Spot/Preemptible nodes on top of an On-Demand baseline; and alert on scale-out lag so you know when the fleet reacts more slowly than expected.
When this system works well, it is invisible. Traffic doubles, the fleet quietly expands, latency stays flat, and then at 3 AM the extra servers are quietly terminated and the cloud bill stays reasonable. That invisibility — that effortless elasticity — is the goal.