Spot & Preemptible Compute
Spot Instances (AWS), Preemptible and Spot VMs (GCP), and Spot VMs (Azure) are the single highest-leverage cost lever available to an engineering team running fault-tolerant workloads. They typically deliver a 60–90% discount off on-demand pricing, with AWS Spot routinely sitting at 70–80% off for general-purpose instance families. At $500k/month of on-demand EC2 spend, migrating 40% of capacity ($200k of that spend) to Spot at a 70–80% discount saves $140–160k per month, often without changing a single line of application code. The catch is that the cloud provider can reclaim the capacity on short notice: two minutes on AWS, 30 seconds on GCP. Everything in this lesson is about engineering around that constraint so reliably that the interruption becomes a routine, unexciting event.
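The savings figure above is simple arithmetic, sketched here so you can plug in your own numbers. The function name and parameters are illustrative, and the model assumes the discount applies uniformly to the migrated share of spend.

```python
def monthly_spot_savings(on_demand_spend: float, spot_share: float, discount: float) -> float:
    """Monthly savings from moving `spot_share` of spend to Spot at `discount` off on-demand."""
    return on_demand_spend * spot_share * discount


# The example from the text: $500k/month, 40% migrated, 70-80% discount.
low = monthly_spot_savings(500_000, 0.40, 0.70)
high = monthly_spot_savings(500_000, 0.40, 0.80)
print(f"${low:,.0f} to ${high:,.0f} per month")
```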
How Spot Markets Work
AWS, GCP, and Azure each maintain pools of unused compute capacity they would otherwise leave idle. They surface that capacity at a discount with the proviso that they can reclaim it when on-demand demand rises. The mechanics differ by provider:
- AWS Spot: Price is no longer a bid. Since 2017 AWS sets Spot prices algorithmically per instance type and availability zone. You pay the current Spot price for every second you run, and receive a two-minute interruption notice via the Instance Metadata Service (`http://169.254.169.254/latest/meta-data/spot/termination-time`) and optionally via EventBridge. Interruption rates vary dramatically: `m5.xlarge` in `us-east-1a` might run uninterrupted for weeks; a GPU family instance in the same AZ might be reclaimed within hours. The Spot Instance Advisor publishes historical interruption frequency by instance type and Region; consult it before choosing a fleet composition.
- GCP Preemptible / Spot VMs: Preemptible VMs are capped at 24 hours of runtime regardless of demand, with a 30-second ACPI shutdown signal before preemption. Spot VMs (the newer offering) have no 24-hour cap but can still be preempted at any time with the same 30-second signal. The `--provisioning-model=SPOT` flag enables them in `gcloud`; Terraform exposes the same choice through its `provisioning_model` scheduling setting.
- Azure Spot VMs: Use an eviction policy of `Deallocate` (VM stopped, disk preserved, restartable; preferred) or `Delete` (VM and disks removed). The eviction notice lands in Azure Scheduled Events at the local metadata endpoint and gives only about 30 seconds of warning, not two minutes.
Workload Classification: What Belongs on Spot
Before touching a single launch template, classify your workloads honestly. The binary question: does losing this instance on no more than two minutes' notice cause user-visible failure or data loss?
- Ideal Spot candidates: CI/CD build agents, batch data pipelines (Spark, Flink, EMR), ML training jobs with checkpoint support, video transcoding, log processing, load-testing infrastructure, stateless web-tier workers behind an ALB with health checks, and any queue consumer that can safely retry failed work units.
- Conditional candidates: Kubernetes worker nodes running stateless pods (with PodDisruptionBudgets and `preStop` hooks), auto-scaling groups where desired capacity is well above the on-demand floor, dev and staging environments where interruptions are fully acceptable.
- Bad Spot candidates: Database primaries, ZooKeeper or etcd quorum nodes, anything with local ephemeral state that is not checkpointed, leader-election singletons, production Kubernetes control-plane nodes.
Instance Diversity: The Most Important Spot Principle
The single most effective resilience technique is instance diversification: request capacity across multiple instance families, sizes, and availability zones simultaneously. A Spot ASG locked to one instance type in one AZ can lose its entire fleet at once when that pool is reclaimed. A group spanning six instance families across three AZs will almost never see a whole-fleet interruption, because each pool is reclaimed largely independently and interruptions stay localised.
Use the capacity-optimised allocation strategy (`capacity-optimized`, not `lowest-price`) in AWS Auto Scaling. Capacity-optimised picks from the deepest available pool, which minimises interruption frequency and avoids the herding problem where every fleet in the region piles into the cheapest pool simultaneously, draining it rapidly.
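A diversified, capacity-optimised group can be expressed as a mixed instances policy. The sketch below only builds the policy dictionary in the shape the EC2 Auto Scaling `CreateAutoScalingGroup` API expects; the helper name, the launch template name, and the instance types are illustrative assumptions, not recommendations.

```python
from typing import Any


def mixed_instances_policy(
    launch_template_name: str,
    instance_types: list[str],
    on_demand_base: int = 0,
    on_demand_pct_above_base: int = 0,
) -> dict[str, Any]:
    """Build a MixedInstancesPolicy that spreads Spot requests across several pools."""
    if len(set(instance_types)) < 2:
        raise ValueError("diversify across at least two instance types")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type; each type/AZ pair is a separate Spot pool.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Hypothetical fleet: six families of the same size.
policy = mixed_instances_policy(
    "web-workers",
    ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "c5.xlarge", "c6i.xlarge", "r5.xlarge"],
    on_demand_base=2,
    on_demand_pct_above_base=20,
)
```

With boto3 installed, the result would be passed as `MixedInstancesPolicy=policy` to the Auto Scaling client's `create_auto_scaling_group`, together with a `VPCZoneIdentifier` listing subnets in each availability zone you want to span.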
Interruption Handling: The Two-Minute Window
When AWS sends a Spot interruption notice, you have roughly 120 seconds before the instance is stopped or terminated. What you do in that window determines whether the interruption is invisible to your users or causes an incident. The standard pattern has three layers:
- Detect the notice. Poll the IMDS endpoint every 5 seconds from a lightweight agent process, or subscribe to the `EC2 Spot Instance Interruption Warning` EventBridge event (published approximately 2 minutes before interruption). AWS also offers Spot Instance Rebalance Recommendations, an earlier, softer signal that your instance is at elevated interruption risk, often published before the hard notice. Act on the rebalance recommendation proactively to get a clean replacement before the interruption.
- Drain in-flight work. For Kubernetes, cordon the node so nothing new is scheduled onto it and call `kubectl drain`. For web workers, deregister from the load balancer target group; the ALB deregistration delay (connection draining) lets in-flight HTTP requests finish, so set it well under two minutes, for example 30 seconds, rather than leaving the 300-second default. For batch workers, checkpoint progress to S3 or a database and mark the work unit as resumable before the instance stops.
- Signal the ASG. If using a lifecycle hook, complete the lifecycle action with a `CONTINUE` result as soon as draining finishes, so the ASG can proceed with termination and replacement immediately rather than waiting for the hook timeout to expire.
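The detection layer can be sketched as a small polling agent. This is a minimal sketch, assuming IMDSv2 (a session token is fetched first) and assuming the endpoint returns 404 while no interruption is scheduled; `wait_for_notice` and its `fetch` parameter are hypothetical names, with `fetch` injectable so the loop can be exercised off-instance.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def parse_termination_time(body: str) -> datetime:
    """Parse the UTC timestamp returned by spot/termination-time."""
    return datetime.strptime(body.strip(), "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def fetch_termination_time(timeout: float = 1.0) -> Optional[datetime]:
    """Return the scheduled termination time, or None if no notice is pending."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/termination-time",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_termination_time(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def wait_for_notice(
    on_notice: Callable[[datetime], None],
    poll_seconds: float = 5.0,
    fetch: Callable[[], Optional[datetime]] = fetch_termination_time,
) -> None:
    """Poll until a notice appears, then hand the deadline to the drain callback."""
    while True:
        deadline = fetch()
        if deadline is not None:
            on_notice(deadline)
            return
        time.sleep(poll_seconds)
```

On an instance, `wait_for_notice(start_drain)` would block until the notice appears and then call your own `start_drain(deadline)` function, which is where deregistration and checkpointing would begin.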
Subscribe to `EC2 Instance Rebalance Recommendation` EventBridge events in addition to the interruption notice. This earlier signal gives you a head start: you can proactively replace an at-risk instance before the hard two-minute clock starts, which is especially valuable for long-running ML training jobs or stateful batch pipelines where a checkpoint takes more than two minutes to write.
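One way to handle both signals is a single EventBridge rule feeding one handler. The sketch below shows an event pattern matching both detail types and a function that decides what to do with each event; the action labels (`drain-now`, `replace-proactively`, `ignore`) and the function name are hypothetical, and the handler assumes the event carries the instance ID under `detail["instance-id"]`.

```python
import json

# Matches the hard two-minute notice and the earlier rebalance signal.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": [
        "EC2 Spot Instance Interruption Warning",
        "EC2 Instance Rebalance Recommendation",
    ],
}

# EventBridge rules take the pattern as a JSON string.
EVENT_PATTERN_JSON = json.dumps(EVENT_PATTERN)


def plan_action(event: dict) -> tuple[str, str]:
    """Map an EventBridge event to (instance_id, action)."""
    instance_id = event["detail"]["instance-id"]
    detail_type = event["detail-type"]
    if detail_type == "EC2 Spot Instance Interruption Warning":
        return instance_id, "drain-now"  # the two-minute clock is running
    if detail_type == "EC2 Instance Rebalance Recommendation":
        return instance_id, "replace-proactively"  # elevated risk, no deadline yet
    return instance_id, "ignore"
```

Keeping the decision in a pure function makes it easy to unit-test separately from whatever performs the drain or launches the replacement.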
Kubernetes on Spot: Node Groups and Pod Disruption Budgets
Running Kubernetes worker nodes on Spot is one of the highest-impact Spot configurations available. The Kubernetes control plane and etcd should always run on on-demand capacity; worker nodes for stateless workloads are ideal Spot candidates. The critical configurations are:
- Separate node groups per lifecycle. Have at least one on-demand node group (for system pods and disruption-sensitive workloads) and one or more Spot node groups (for application pods). Use node selectors or taints to route workloads appropriately, for example the EKS node label `eks.amazonaws.com/capacityType=SPOT`.
- PodDisruptionBudgets. Every Deployment with more than one replica should have a PDB. A PDB of `maxUnavailable: 1` makes Kubernetes refuse any drain eviction that would leave more than one of the Deployment's pods unavailable at once. Without this, a simultaneous drain of two nodes could temporarily halve your running pods. A PDB only governs voluntary evictions such as drains; it cannot stop the instance itself from being reclaimed.
- AWS Node Termination Handler. This DaemonSet watches for Spot interruption notices and Rebalance Recommendations at the node level, taints the node, drains it (via the Kubernetes API), and signals the Auto Scaling group before the cloud kills the instance. Install it before you put any production workload on Spot nodes.
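The first two items can be sketched as manifests. The example below builds them as Python dictionaries and prints JSON, which `kubectl apply -f -` accepts alongside YAML; the helper names, the `app` label key, and the `web` names are illustrative assumptions.

```python
import json


def pod_disruption_budget(name: str, app_label: str, max_unavailable: int = 1) -> dict:
    """PodDisruptionBudget limiting how many matching pods a drain may evict at once."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def spot_node_selector() -> dict:
    """Pod spec fragment that schedules a pod onto EKS Spot capacity only."""
    return {"nodeSelector": {"eks.amazonaws.com/capacityType": "SPOT"}}


if __name__ == "__main__":
    print(json.dumps(pod_disruption_budget("web-pdb", "web"), indent=2))
```

The selector must match the labels on the Deployment's pod template, otherwise the budget protects nothing.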
Fault-Tolerant Batch Pipelines on Spot
For batch workloads the contract is simpler: work is divided into checkpointed units, and any unit not completed due to interruption is retried on a replacement instance. The design principles that make this work at scale are:
- Idempotent tasks. Every work unit must be safe to execute more than once. Write output under a temporary S3 prefix and publish it only on completion, for example by copying it to its final key or writing a completion marker, since standard S3 buckets have no atomic rename. Use conditional writes (`PutItem` with a condition expression) in DynamoDB to prevent duplicate inserts.
- Checkpoint frequently. For long-running ML training, checkpoint model weights to S3 every N steps (typically every 5–15 minutes of compute time, not wall time). On restart, load the latest checkpoint and continue. PyTorch Lightning, TensorFlow, and Hugging Face Trainer all support this natively.
- Atomic task claiming. Use SQS with visibility timeout as a distributed work queue. A worker claims a message (makes it invisible to others), processes it, and deletes it on success. If the instance is interrupted before deletion, the visibility timeout expires and another instance picks up the same message. Set the visibility timeout to 1.5x your expected task duration.
- AWS Spot + EMR. EMR Managed Scaling and Spot-aware task node configuration handle interruptions automatically for Spark and Hive jobs: task nodes (stateless compute) run on Spot; core nodes (HDFS replication) run on on-demand.
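The claim, process, delete cycle from the task-claiming item can be sketched as follows. The `sqs` argument is assumed to be a boto3 SQS client (or anything with the same `receive_message` and `delete_message` methods), and `process_one` and `visibility_timeout_for` are hypothetical helper names. The timeout helper caps the result at 43,200 seconds, the SQS maximum of 12 hours.

```python
import math
from typing import Any, Callable

SQS_MAX_VISIBILITY_SECONDS = 43_200  # 12 hours


def visibility_timeout_for(expected_task_seconds: float, factor: float = 1.5) -> int:
    """Visibility timeout as a multiple of expected task duration, capped at the SQS limit."""
    return min(math.ceil(expected_task_seconds * factor), SQS_MAX_VISIBILITY_SECONDS)


def process_one(
    sqs: Any,
    queue_url: str,
    handle: Callable[[str], None],
    visibility_timeout: int,
) -> bool:
    """Claim one message, process it, and delete it only after success.

    Returns False when the queue had nothing to offer.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
        VisibilityTimeout=visibility_timeout,
    )
    messages = resp.get("Messages", [])
    if not messages:
        return False
    msg = messages[0]
    # If handle() raises, or the instance is reclaimed here, the message is never
    # deleted and becomes visible to other workers once the timeout expires.
    handle(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

Because a message can be delivered more than once, `handle` still has to be idempotent; the queue provides retry, not exactly-once execution.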
A common failure mode is a job that keeps its only copy of intermediate state on local disk such as `/tmp`, or a batch job that accumulates results in memory for hours before flushing. A single interruption destroys hours of compute. Fix: checkpoint to object storage at regular intervals before moving any workload to Spot.
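The checkpoint-and-resume loop can be sketched as below. To stay self-contained it uses a local file as a stand-in for object storage, so the temp-file-then-`os.replace` step is an assumption specific to local filesystems; with S3 you would upload the checkpoint object instead. The function names, the `stop_after` parameter (which simulates an interruption), and the trivial "work" are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then swap it into place so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def run(path: Path, steps: int, checkpoint_every: int, stop_after: Optional[int] = None) -> dict:
    """Run up to `steps` units of work, resuming from the latest checkpoint."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption: un-checkpointed progress is lost
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state
    return state
```

An interrupted run loses at most `checkpoint_every - 1` steps, which is the trade-off to tune: more frequent checkpoints cost I/O, less frequent ones cost recomputation.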
Spot + On-Demand Blending Strategy
Production Spot fleets usually blend in a percentage of on-demand capacity as a stable floor. The right split depends on workload tolerance:
- Batch / ML training: 0–10% on-demand. Pure Spot is fine if tasks are checkpointed. Keep a small on-demand base only to ensure the fleet never goes to zero when capacity is tight across all pools.
- Stateless web tier: 20–30% on-demand. The on-demand floor ensures a minimum number of healthy instances survive a worst-case simultaneous multi-pool interruption. The ALB routes across both; users see no difference.
- CI/CD agents: 0–20% on-demand (80–100% Spot). Builds are naturally retryable. A failed build is requeued; no user impact. The small on-demand fraction just keeps the fleet from fully draining during a regional capacity crunch.
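To see what a given split means in practice, the sketch below turns an on-demand fraction into instance counts and an effective fleet-wide discount. The function names are illustrative, and the model assumes every instance costs the same on-demand price and that all Spot capacity gets the same discount.

```python
import math


def split_fleet(total_instances: int, on_demand_fraction: float) -> tuple[int, int]:
    """Return (on_demand, spot) counts, rounding the on-demand floor up."""
    # round() guards against float noise before taking the ceiling
    on_demand = math.ceil(round(total_instances * on_demand_fraction, 9))
    return on_demand, total_instances - on_demand


def blended_discount(on_demand_fraction: float, spot_discount: float) -> float:
    """Effective discount across the whole fleet versus running everything on-demand."""
    return (1 - on_demand_fraction) * spot_discount


# A 40-instance stateless web tier with a 25% on-demand floor and a 70% Spot discount.
on_demand, spot = split_fleet(40, 0.25)
effective = blended_discount(0.25, 0.70)
```

Here 10 instances stay on-demand, 30 run on Spot, and the fleet as a whole costs a little over half of what it would purely on-demand.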
At big-tech scale, companies like Airbnb, Netflix, and Lyft are reported to run 60–80% of their total EC2 compute footprint on Spot or equivalent discounted capacity. The engineering investment to get there (interruption handlers, checkpointing, instance diversification, PDBs) typically pays back in weeks at any scale above $50k/month of EC2 spend.