Cloud Autoscaling Beyond Kubernetes
Kubernetes HPA and Cluster Autoscaler are powerful tools, but a significant portion of production workloads — serverless functions, managed databases, legacy EC2 fleets, App Engine services, and Azure App Service deployments — scale entirely outside the Kubernetes plane. Understanding how cloud-native autoscaling mechanisms work at the infrastructure level is essential for senior engineers designing resilient, cost-efficient systems that span managed services, bare VMs, and container orchestration simultaneously.
This lesson covers the three pillars of cloud autoscaling that operate independently of Kubernetes: dynamic scaling policies (reactive, metric-driven), scheduled scaling (time-based, predictable load), and predictive scaling (ML-driven, anticipatory). We use AWS Auto Scaling Groups (ASG) as the canonical reference — GCP Managed Instance Groups and Azure VM Scale Sets follow the same conceptual model with provider-specific syntax.
Auto Scaling Groups: The Anatomy
An ASG wraps a fleet of EC2 instances with a desired/minimum/maximum capacity envelope and a set of scaling policies. Every scaling action — whether triggered by a CloudWatch alarm, a schedule expression, or a predictive model — ultimately writes to the same DesiredCapacity field. The ASG controller then reconciles actual running instances toward that target, launching or terminating instances across configured Availability Zones.
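The reconciliation idea can be sketched in a few lines of Python. This is an illustrative model only, not AWS code: the `CapacityEnvelope` class and `reconcile` function are hypothetical names, and the sketch assumes requested values are clamped into the min/max envelope the way scaling policies are.

```python
from dataclasses import dataclass


@dataclass
class CapacityEnvelope:
    """Hypothetical model of an ASG's min/max/desired capacity."""
    min_size: int
    max_size: int
    desired: int

    def set_desired(self, requested: int) -> int:
        # Every scaling mechanism ends up writing this one field;
        # the value is clamped into [min_size, max_size].
        self.desired = max(self.min_size, min(self.max_size, requested))
        return self.desired


def reconcile(envelope: CapacityEnvelope, running: int) -> int:
    """Return instances to launch (positive) or terminate (negative)."""
    return envelope.desired - running
```

With an envelope of min 2, max 10, a request for 15 is clamped to 10, and with 4 instances running the controller would need to launch 6.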
At big-tech scale, ASGs typically sit behind a Network Load Balancer (NLB) or Application Load Balancer (ALB) target group. When a new instance comes into service, the ASG registers it with the target group, and the load balancer routes traffic to it once its health checks pass. Instance warm-up time, the window between "instance started" and "instance accepting traffic", is a critical parameter that teams frequently misconfigure the first time.
Dynamic Scaling Policies
AWS offers three reactive policy types: target tracking, step scaling, and the older simple scaling. Production systems typically layer target tracking and step scaling for defense in depth.
Target Tracking Scaling
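The simplest policy and the recommended starting point. You declare a target value for a metric (average CPU at 60%, ALB request count per target at 1,000, or SQS queue depth per instance) and the ASG continuously adjusts capacity to hold the metric near that target. Under the hood, AWS creates and manages the CloudWatch alarms for you and sizes each adjustment from how far the metric sits from the target, which behaves like a feedback control loop.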
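AWS does not publish the exact target-tracking algorithm, so the following is only a proportional approximation for building intuition. The function name `target_tracking_capacity` and the proportional rule itself are assumptions, not the service's implementation.

```python
import math


def target_tracking_capacity(current_capacity: int, metric_value: float,
                             target_value: float, min_size: int,
                             max_size: int) -> int:
    """Approximate the capacity needed to bring a per-instance metric
    back to its target, assuming load spreads evenly across instances."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # If each instance runs at metric_value, total load is
    # current_capacity * metric_value; divide by the per-instance target.
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proposed))
```

For example, 10 instances averaging 90% CPU against a 60% target suggests 15 instances under this rule, and the same fleet at 30% suggests 5.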
Step Scaling
Step scaling lets you define a piecewise function: if CPU breaches 70%, add 2 instances; if it breaches 85%, add 5; if it breaches 95%, add 10. This is the policy to reach for when your load profile is bursty and you know from experience that target-tracking's gradual adjustments are too slow. You trigger it from a CloudWatch alarm, not from the metric directly.
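The piecewise function described above can be written out directly. This sketch uses the thresholds from the text; the `STEPS` table and `step_adjustment` function are illustrative names, and treating "breaches" as greater-than-or-equal is an assumption.

```python
# (lower CPU bound in percent, instances to add), highest bound first.
STEPS = [(95.0, 10), (85.0, 5), (70.0, 2)]


def step_adjustment(cpu_percent: float) -> int:
    """Return how many instances to add for a given CPU reading."""
    for lower_bound, instances_to_add in STEPS:
        if cpu_percent >= lower_bound:
            return instances_to_add
    return 0  # Below the first threshold: the alarm is not in breach.
```

Ordering the table from the highest bound down means the most severe matching step wins, which is the behavior you want from a burst policy.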
Scheduled Scaling
For workloads with predictable load cycles — business-hours API traffic, end-of-day batch jobs, weekly marketing email blasts, daily market-open surges — scheduled scaling is more reliable and cheaper than reactive scaling. You set the desired/min/max capacity at a specific time using cron expressions; no metric, no alarm, no lag.
A critical subtlety: a scheduled action applies its capacity values once, at the scheduled moment, and does nothing afterward. If your reactive policies have already scaled you above the new desired value, the scheduled action will scale you down at that time, and reactive policies are free to change capacity again immediately after. Always reason about the interaction between scheduled actions and live policy state, especially for the scale-down half.
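A small simulation makes the interaction concrete. This is a simplified model, not AWS behavior verbatim: the `Group` class and `apply_scheduled_action` function are hypothetical, and the sketch assumes that fields omitted from a scheduled action are left unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    min_size: int
    max_size: int
    desired: int


def apply_scheduled_action(group: Group, min_size: Optional[int] = None,
                           max_size: Optional[int] = None,
                           desired: Optional[int] = None) -> Group:
    """Apply a one-shot scheduled action; omitted fields stay as they are."""
    new_min = group.min_size if min_size is None else min_size
    new_max = group.max_size if max_size is None else max_size
    new_desired = group.desired if desired is None else desired
    # Desired capacity must always sit inside the new envelope.
    new_desired = max(new_min, min(new_max, new_desired))
    return Group(new_min, new_max, new_desired)


# Reactive policies have already pushed the fleet to 18 instances.
live = Group(min_size=4, max_size=40, desired=18)
forced_down = apply_scheduled_action(live, desired=8)   # drops to 8
floor_only = apply_scheduled_action(live, min_size=8)   # stays at 18
```

Under these assumptions, a schedule that sets only the minimum raises the floor without pulling a busy fleet down, which is often the safer choice for the scale-out half.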
Manage schedules as code (for example, aws_autoscaling_schedule with explicit UTC offsets) and alert on clock drift in your CI/CD pipeline.
Predictive Scaling
AWS Predictive Scaling, available as a native EC2 Auto Scaling policy since 2021, uses machine learning trained on up to 14 days of your ASG's historical CloudWatch metrics to forecast load and adjust capacity before that load arrives. The critical difference from scheduled scaling is that it learns and adapts: you do not need to edit a schedule when your pattern shifts.
Predictive Scaling works in two modes. In forecast-only mode (ForecastOnly), it generates forecasts and displays them in the console without taking action, which is useful for building confidence before enabling automation. In forecast-and-scale mode (ForecastAndScale), it also acts on the forecast, raising capacity ahead of each predicted load increase. By default the change lands at the start of the forecast hour, and a scheduling buffer moves it earlier.
Two settings on the predictive policy deserve attention. A scheduling_buffer_time of 300 seconds tells the system to launch instances 5 minutes ahead of the forecast need, which matters when instance launch plus warm-up takes 3 to 4 minutes. Setting max_capacity_breach_behavior to IncreaseMaxCapacity allows predictive scaling to exceed your configured maximum when the forecast approaches or passes it, up to the limit set by max_capacity_buffer, avoiding a hard ceiling that would block scale-out at exactly the wrong moment.
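As a sketch, the same settings can be assembled in Python as a plain dictionary before being handed to an API client. The helper name `predictive_scaling_config` is hypothetical, the 10% capacity buffer is an arbitrary illustrative value, and the key names follow the shape I understand the EC2 Auto Scaling `PutScalingPolicy` API to use; verify them against the current API reference before relying on this.

```python
def predictive_scaling_config(target_cpu: float, buffer_seconds: int,
                              launch_plus_warmup_seconds: int,
                              mode: str = "ForecastOnly") -> dict:
    """Build a predictive scaling configuration and sanity-check the buffer."""
    if mode not in ("ForecastOnly", "ForecastAndScale"):
        raise ValueError(f"unknown mode: {mode}")
    if buffer_seconds < launch_plus_warmup_seconds:
        # Instances would still be warming up when the forecast load arrives.
        raise ValueError("buffer is shorter than launch plus warm-up time")
    return {
        "MetricSpecifications": [{
            "TargetValue": float(target_cpu),
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization",
            },
        }],
        "Mode": mode,
        "SchedulingBufferTime": buffer_seconds,
        "MaxCapacityBreachBehavior": "IncreaseMaxCapacity",
        # Percent above the configured max that forecasts may use (assumed value).
        "MaxCapacityBuffer": 10,
    }
```

Starting in `ForecastOnly` and switching to `ForecastAndScale` once the forecasts look trustworthy keeps the rollout reversible.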
Combining All Three: The Production Pattern
At companies running significant traffic, all three mechanisms operate simultaneously in a hierarchy. Predictive scaling handles the baseline forecast, keeping your fleet pre-warmed. Scheduled scaling covers known discrete events — product launches, marketing blasts, market-open windows — that are too sharp and specific for the ML model. Dynamic target-tracking absorbs the residual variance that neither scheduled nor predictive anticipated. Step scaling acts as a circuit-breaker for sudden, extreme spikes.
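One way to reason about the hierarchy is that each mechanism proposes a capacity and the largest proposal wins, bounded by the envelope. The sketch below encodes that simplified rule; `effective_desired` is a hypothetical helper, and real policy arbitration has more nuance (cooldowns, scale-in protection, warm-up) than this model captures.

```python
from typing import Iterable


def effective_desired(predictive: int, dynamic_proposals: Iterable[int],
                      min_size: int, max_size: int) -> int:
    """Pick the highest proposed capacity, clamped to the envelope.

    min_size is assumed to already reflect any scheduled action
    that raised the floor for a known event.
    """
    highest = max([predictive, *dynamic_proposals])
    return max(min_size, min(max_size, highest))
```

For instance, with a forecast of 12, target tracking asking for 14, and step scaling asking for 20 during a spike, the fleet lands at 20; on a quiet day with a scheduled floor of 8, it never drops below 8.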
Measure the real time from instance launch to serving traffic under load, and set DefaultInstanceWarmup to that measured value rather than accepting a 300-second default. Get this wrong and your CloudWatch metrics include not-yet-warm instances, making your scaling signals noisy and causing premature scale-in.
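The effect of the warm-up window on the scaling signal can be illustrated with a toy aggregation. This is a conceptual model, not how CloudWatch computes metrics; `warm_average_cpu` and the tuple layout are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def warm_average_cpu(instances: List[Tuple[float, float]], now: float,
                     warmup_seconds: float) -> Optional[float]:
    """Average CPU over instances that have finished warming up.

    instances: (launch_time_epoch_seconds, cpu_percent) pairs.
    Returns None when no instance is warm yet.
    """
    warm = [cpu for launched, cpu in instances
            if now - launched >= warmup_seconds]
    return sum(warm) / len(warm) if warm else None


fleet = [(0.0, 70.0), (0.0, 70.0), (940.0, 5.0)]  # third instance is 60 s old
accurate = warm_average_cpu(fleet, now=1000.0, warmup_seconds=180.0)
too_short = warm_average_cpu(fleet, now=1000.0, warmup_seconds=30.0)
```

With a 180-second warm-up the idle new instance is excluded and the average stays at 70%. With a 30-second warm-up it is counted, dragging the average well below a 60% target even though the serving instances are still busy.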
Equivalent Services on Other Clouds
GCP Managed Instance Groups use Autoscaler resources with autoscalingPolicy blocks, which cover the same ground: a cpuUtilization target, loadBalancingUtilization, and scalingSchedules for time-based capacity. Azure VM Scale Sets use Autoscale settings with rules (metric-based) and fixedDate or recurrence profiles for scheduled scaling. On the predictive side, Azure Monitor offers predictive autoscale for VM Scale Sets, but it forecasts from CPU percentage only, so treat it as narrower than AWS Predictive Scaling and confirm its current status before relying on it. For multi-cloud fleets, Terraform gives you one IaC workflow across providers, though the resources stay provider-specific and the semantics of cooldowns and evaluation windows differ enough to warrant per-cloud tuning.