FinOps & Cloud Cost Optimization

Spot & Preemptible Compute


Spot Instances (AWS), Spot and Preemptible VMs (GCP), and Spot VMs (Azure) are among the highest-leverage cost levers available to an engineering team running fault-tolerant workloads. They typically deliver discounts of 60–90% off on-demand pricing, with AWS Spot often sitting at 70–80% off for general-purpose instance families. At $500k/month of on-demand EC2 spend, moving 40% of capacity ($200k) to Spot at a 70–80% discount saves roughly $140–160k per month, often with little or no change to application code when the workload is already fault-tolerant. The catch is that the cloud provider can reclaim the capacity on short notice: about two minutes on AWS, and as little as 30 seconds on GCP and Azure. Everything in this lesson is about engineering around that constraint so reliably that an interruption becomes a routine, unexciting event.
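The savings estimate above is plain arithmetic, and it is worth being able to rerun it with your own numbers. A minimal sketch follows; the function name and the three inputs are illustrative, and the discount range is the one quoted in this lesson, not a guaranteed price.

```python
def monthly_spot_savings(on_demand_spend: float,
                         fraction_moved: float,
                         discount: float) -> float:
    """Estimated monthly savings from moving part of an on-demand bill to Spot.

    on_demand_spend: current monthly on-demand spend in dollars
    fraction_moved:  share of that spend migrated to Spot (0..1)
    discount:        Spot discount relative to on-demand (0..1)
    """
    if not (0.0 <= fraction_moved <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("fraction_moved and discount must be between 0 and 1")
    return on_demand_spend * fraction_moved * discount


# The lesson's example: $500k/month, 40% moved, 70-80% discount.
low = monthly_spot_savings(500_000, 0.40, 0.70)
high = monthly_spot_savings(500_000, 0.40, 0.80)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $140,000 to $160,000 per month
```

The estimate ignores the cost of the on-demand floor you keep for safety and any rework caused by interruptions, so treat it as an upper bound.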

How Spot Markets Work

AWS, GCP, and Azure each maintain pools of unused compute capacity they would otherwise leave idle. They surface that capacity at a discount with the proviso that they can reclaim it when on-demand demand rises. The mechanics differ by provider:

AWS Spot: Price is no longer a bid. Since late 2017 AWS sets Spot prices itself, adjusting them gradually based on longer-term supply and demand for each instance type in each Availability Zone. You pay the current Spot price for every second you run, and receive a two-minute interruption notice via the Instance Metadata Service (http://169.254.169.254/latest/meta-data/spot/termination-time) and optionally via EventBridge. Interruption rates vary dramatically: m5.xlarge in us-east-1a might run uninterrupted for weeks; a GPU instance in the same AZ might be reclaimed within hours. The Spot Instance Advisor publishes historical interruption frequency by instance type and Region; consult it before choosing a fleet composition.
GCP Preemptible / Spot VMs: Preemptible VMs are capped at 24 hours of runtime regardless of demand and receive a 30-second ACPI shutdown signal before preemption. Spot VMs (the newer offering) have no 24-hour cap but can still be preempted at any time with the same short notice. Enable them with the --provisioning-model=SPOT flag in gcloud, or with provisioning_model = "SPOT" in the scheduling block of a Terraform google_compute_instance.
Azure Spot VMs: Use an eviction policy of Deallocate (VM stopped, disks preserved and still billed, restartable; usually preferred) or Delete (VM and disks removed). The eviction notice arrives through Azure Scheduled Events at the local metadata endpoint and gives at least 30 seconds of warning.
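On Azure, an upcoming eviction appears as an event of type Preempt in the Scheduled Events document served by the metadata endpoint. The sketch below only shows how such a document might be filtered; the api-version in the URL and the sample payload are assumptions for illustration, so check them against the current Azure documentation before relying on them.

```python
import json

# Azure Scheduled Events endpoint; requests must carry the header "Metadata: true".
# The api-version shown here is an assumption, not taken from this lesson.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def preempt_events(document: str) -> list:
    """Return the events in a Scheduled Events document that announce a Spot eviction."""
    payload = json.loads(document)
    return [e for e in payload.get("Events", []) if e.get("EventType") == "Preempt"]


# Hypothetical sample document with one eviction and one unrelated event.
sample = json.dumps({
    "DocumentIncarnation": 2,
    "Events": [
        {"EventId": "A1", "EventType": "Preempt", "ResourceType": "VirtualMachine",
         "Resources": ["worker-7"], "EventStatus": "Scheduled"},
        {"EventId": "B2", "EventType": "Reboot", "ResourceType": "VirtualMachine",
         "Resources": ["worker-9"], "EventStatus": "Scheduled"},
    ],
})

for event in preempt_events(sample):
    print("eviction scheduled for", event["Resources"])
```

An agent on the VM would fetch SCHEDULED_EVENTS_URL on a short interval and start draining as soon as this function returns a non-empty list.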

The core mental model: Spot capacity is not unreliable — it is interruptible. A healthy Spot fleet in a well-chosen instance family and AZ runs for days or weeks uninterrupted. What you must never do is architect so that a single interruption causes user-visible failure or data loss. Design for interruption; be pleasantly surprised when it does not come.

Workload Classification: What Belongs on Spot

Before touching a single launch template, classify your workloads honestly. The binary question: does removing this instance on at most two minutes' notice cause user-visible failure or data loss?

Ideal Spot candidates: CI/CD build agents, batch data pipelines (Spark, Flink, EMR), ML training jobs with checkpoint support, video transcoding, log processing, load-testing infrastructure, stateless web-tier workers behind an ALB with health checks, and any queue consumer that can safely retry failed work units.
Conditional candidates: Kubernetes worker nodes running stateless pods (with PodDisruptionBudgets and preStop hooks), auto-scaling groups where desired capacity is well above the on-demand floor, dev and staging environments where interruptions are fully acceptable.
Bad Spot candidates: Database primaries, ZooKeeper or etcd quorum nodes, anything with local ephemeral state that is not checkpointed, leader-election singletons, production Kubernetes control-plane nodes.
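The three categories above can be turned into a rough first-pass filter for a service inventory. This is a heuristic sketch only: the Workload fields and the decision order are assumptions made for illustration, and a real classification still needs a human review of each service.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    stateless: bool            # no local state that matters if the instance vanishes
    checkpointed: bool         # progress is persisted outside the instance
    retry_safe: bool           # failed work units can be retried without harm
    quorum_or_singleton: bool  # database primary, etcd member, elected leader, ...


def spot_suitability(w: Workload) -> str:
    """Map a workload onto the lesson's categories: 'ideal', 'conditional', or 'bad'."""
    if w.quorum_or_singleton:
        return "bad"
    if not (w.stateless or w.checkpointed):
        return "bad"           # un-checkpointed local state is lost on interruption
    if w.retry_safe:
        return "ideal"
    return "conditional"       # needs PDBs, preStop hooks, or an on-demand floor


inventory = [
    Workload("ci-agent", stateless=True, checkpointed=False, retry_safe=True, quorum_or_singleton=False),
    Workload("etcd", stateless=False, checkpointed=True, retry_safe=False, quorum_or_singleton=True),
    Workload("web-pods", stateless=True, checkpointed=False, retry_safe=False, quorum_or_singleton=False),
]
for w in inventory:
    print(f"{w.name}: {spot_suitability(w)}")
```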

Instance Diversity: The Most Important Spot Principle

The single most effective resilience technique is instance diversification: request capacity across multiple instance families, sizes, and availability zones simultaneously. A Spot ASG locked to one instance type in one AZ will be interrupted simultaneously across the entire fleet when that pool is reclaimed. A group spanning six instance families across three AZs will almost never see a whole-fleet interruption — each pool is independent and interruptions are localised.

Use the capacity-optimised allocation strategy (not lowest-price) in AWS Auto Scaling. Capacity-optimised picks from the deepest available pool, which minimises interruption frequency and avoids the herding problem where every fleet in the region piles into the cheapest pool simultaneously, draining it rapidly.
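The difference between the two strategies is easiest to see in a toy model. AWS does not publish pool depth, so spare_capacity below is a hypothetical stand-in for whatever internal signal the real allocator uses, and the prices are made up; the sketch only illustrates why the cheapest pool and the deepest pool are often not the same pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    instance_type: str
    zone: str
    spot_price: float     # USD per hour, illustrative
    spare_capacity: int   # hypothetical measure of pool depth


def lowest_price(pools):
    """Pick the cheapest pool, regardless of how much capacity is left in it."""
    return min(pools, key=lambda p: p.spot_price)


def capacity_optimized(pools):
    """Pick the deepest pool, which is least likely to be reclaimed soon."""
    return max(pools, key=lambda p: p.spare_capacity)


pools = [
    Pool("c5.4xlarge", "us-east-1a", spot_price=0.25, spare_capacity=12),
    Pool("m5.4xlarge", "us-east-1b", spot_price=0.29, spare_capacity=340),
    Pool("c5a.4xlarge", "us-east-1c", spot_price=0.27, spare_capacity=95),
]

print("lowest-price picks:      ", lowest_price(pools).instance_type)
print("capacity-optimized picks:", capacity_optimized(pools).instance_type)
```

In this example the cheapest pool is also the shallowest one, which is exactly the situation where every lowest-price fleet crowds in and gets reclaimed together.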

# Terraform: ASG with instance diversification and capacity-optimised Spot strategy
# ec2/spot_asg.tf

resource "aws_autoscaling_group" "workers" {
  name                = "batch-workers-spot"
  vpc_zone_identifier = var.private_subnet_ids   # spread across 3 AZs
  desired_capacity    = 20
  min_size            = 2
  max_size            = 60

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2   # keep 2 on-demand as stable floor
      on_demand_percentage_above_base_capacity = 0   # everything else is Spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
        version            = "$Latest"
      }

      # Six families, 16 vCPUs per instance: independent capacity pools
      override { instance_type = "c5.4xlarge"  }
      override { instance_type = "c5a.4xlarge" }
      override { instance_type = "c5n.4xlarge" }
      override { instance_type = "m5.4xlarge"  }
      override { instance_type = "m5a.4xlarge" }
      override { instance_type = "r5.4xlarge"  }
    }
  }

  lifecycle { create_before_destroy = true }

  tag {
    key                 = "spot-fleet"
    value               = "batch-workers"
    propagate_at_launch = true
  }
}

Figure: Diversified Spot fleet across three AZs and six instance families. An interruption in one pool is isolated: the ALB stops routing to the affected instances within seconds and the ASG replaces them from another pool.

Interruption Handling: The Two-Minute Window

When AWS sends a Spot interruption notice, you have about two minutes before the instance is stopped or terminated. What you do in that window determines whether the interruption is invisible to your users or causes an incident. The standard pattern has three layers:

Detect the notice. Poll the IMDS endpoint every 5 seconds from a lightweight agent process, or subscribe to the EC2 Spot Instance Interruption Warning EventBridge event (published approximately two minutes before interruption). AWS also emits EC2 Instance Rebalance Recommendations, a softer signal that your instance is at elevated interruption risk. It often arrives before the hard notice but can arrive at the same time, so treat it as a head start rather than a guarantee. Act on it proactively to get a clean replacement before the interruption.
Drain in-flight work. For Kubernetes, cordon the node so nothing new is scheduled on it, then run kubectl drain. For web workers, deregister from the load balancer target group; the ALB deregistration delay lets in-flight HTTP requests finish, but it defaults to 300 seconds, so lower it to around 30 seconds to fit inside the window. For batch workers, checkpoint progress to S3 or a database and mark the work unit as resumable before the instance stops.
Signal the ASG. If a termination lifecycle hook is configured, complete the lifecycle action with CONTINUE once draining finishes, so the ASG proceeds immediately rather than waiting for the hook's timeout.

#!/usr/bin/env bash
# /usr/local/bin/spot-interruption-handler.sh
# Run as a systemd service or supervised process on each Spot instance.
# Polls IMDS for the interruption notice and triggers graceful shutdown.

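IMDS="http://169.254.169.254"

while true; do
  # Request a fresh IMDSv2 token on every poll. A token fetched once outside
  # the loop expires after its TTL, after which every poll returns 401 and the
  # notice is never seen.
  TOKEN=$(curl -s -X PUT "$IMDS/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

  # The endpoint returns 404 until an interruption is scheduled, then 200.
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    "$IMDS/latest/meta-data/spot/termination-time")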

  if [ "$STATUS" = "200" ]; then
    echo "[$(date -u)] Spot interruption notice received — beginning graceful shutdown"

    # 1. Stop accepting new work
    systemctl stop worker-daemon   # your app-specific unit

    # 2. Checkpoint any in-flight state to S3
    aws s3 cp /var/worker/checkpoint.json \
      "s3://${CHECKPOINT_BUCKET}/checkpoints/$(hostname).json"

    # 3. Signal Auto Scaling lifecycle hook (if configured)
    REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      "$IMDS/latest/meta-data/placement/region")
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      "$IMDS/latest/meta-data/instance-id")
    ASG_NAME=$(aws autoscaling describe-auto-scaling-instances \
      --instance-ids "$INSTANCE_ID" --region "$REGION" \
      --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)

    aws autoscaling complete-lifecycle-action \
      --lifecycle-action-result CONTINUE \
      --instance-id "$INSTANCE_ID" \
      --auto-scaling-group-name "$ASG_NAME" \
      --lifecycle-hook-name spot-interruption-hook \
      --region "$REGION" 2>/dev/null || true

    break
  fi

  sleep 5
done
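The script above only checks whether a notice exists. The sibling IMDS path spot/instance-action returns a small JSON document containing the action and its scheduled time, which lets a handler work out how much of the window is left and whether its shutdown steps fit. A minimal sketch follows; the step names and durations are hypothetical, and the sample body follows the documented shape of that endpoint.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(instance_action_body: str, now: datetime) -> float:
    """Seconds left before the action in a spot/instance-action document takes effect."""
    doc = json.loads(instance_action_body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def fits_in_window(remaining_seconds: float, step_seconds: dict) -> bool:
    """True if the planned shutdown steps, run one after another, finish in time."""
    return sum(step_seconds.values()) <= remaining_seconds


# Sample body in the shape returned by /latest/meta-data/spot/instance-action.
body = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)

remaining = seconds_until_interruption(body, now)
plan = {"stop_worker": 10, "upload_checkpoint": 45, "deregister": 30}  # hypothetical
print(remaining, fits_in_window(remaining, plan))  # 120.0 True
```

If the plan does not fit, the usual fix is to checkpoint more often during normal operation so that the final upload is small.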

AWS Instance Rebalance Recommendation: Subscribe to EC2 Instance Rebalance Recommendation EventBridge events in addition to the interruption notice. This signal usually arrives earlier and gives you a head start: you can proactively replace an at-risk instance before the hard two-minute clock starts, which is especially valuable for long-running ML training jobs or stateful batch pipelines where a checkpoint takes more than two minutes to write. It is not guaranteed to precede the interruption notice, so keep the two-minute handler in place as well.
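Both signals reach EventBridge as events from source aws.ec2 and differ only in their detail-type, so a single rule and a single handler can cover them. The sketch below shows one way to route them; the action names returned by plan_action are hypothetical labels for whatever your tooling does next.

```python
INTERRUPTION = "EC2 Spot Instance Interruption Warning"
REBALANCE = "EC2 Instance Rebalance Recommendation"

# Event pattern for an EventBridge rule that matches both signals.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": [INTERRUPTION, REBALANCE],
}


def plan_action(event: dict) -> tuple:
    """Decide what to do with one EventBridge event: (action, instance_id)."""
    instance_id = event.get("detail", {}).get("instance-id", "")
    kind = event.get("detail-type")
    if kind == INTERRUPTION:
        return ("drain-now", instance_id)            # hard two-minute clock is running
    if kind == REBALANCE:
        return ("replace-proactively", instance_id)  # softer signal, more time to act
    return ("ignore", instance_id)


sample_event = {
    "source": "aws.ec2",
    "detail-type": REBALANCE,
    "detail": {"instance-id": "i-0123456789abcdef0"},
}
print(plan_action(sample_event))
```

Keeping the decision in a pure function like this makes it easy to unit-test the routing without deploying anything.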

Kubernetes on Spot: Node Groups and Pod Disruption Budgets

Running Kubernetes worker nodes on Spot is one of the highest-impact configurations in the ecosystem. The Kubernetes control plane and etcd should always run on on-demand; worker nodes for stateless workloads are ideal Spot candidates. The critical configurations are:

Separate node groups per lifecycle. Have at least one on-demand node group (for system pods, PodDisruptionBudget-sensitive workloads) and one or more Spot node groups (for application pods). Use node selectors or taints to route workloads appropriately: eks.amazonaws.com/capacityType=SPOT.
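PodDisruptionBudgets. Every Deployment with more than one replica should have a PDB. A PDB of maxUnavailable: 1 makes the eviction API refuse any drain step that would take more than one of the Deployment's pods out of service at a time, so nodes are drained pod by pod as replacements become ready. Without this, a simultaneous drain of two nodes could temporarily halve your running pods. A PDB only governs voluntary evictions such as kubectl drain; it cannot delay the reclaim itself, so the drain still has to finish inside the notice window.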
AWS Node Termination Handler. This DaemonSet watches for Spot interruption notices and Rebalance Recommendations at the node level, taints the node, drains it (via the Kubernetes API), and signals the Auto Scaling group before the cloud kills the instance. Install it before you put any production workload on Spot nodes.

# Kubernetes: PodDisruptionBudget + Spot node toleration for a web Deployment
# k8s/web-deployment.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: production
spec:
  minAvailable: "70%"    # at most 30% of pods may be disrupted at once
  selector:
    matchLabels:
      app: web

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Prefer Spot nodes; fall back to on-demand if none available
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: eks.amazonaws.com/capacityType
                    operator: In
                    values: ["SPOT"]
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      terminationGracePeriodSeconds: 90
      containers:
        - name: web
          image: myapp:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]   # allow ALB to deregister cleanly
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3

---
# Install AWS Node Termination Handler via Helm
# helm repo add eks https://aws.github.io/eks-charts
# helm install aws-node-termination-handler eks/aws-node-termination-handler \
#   --namespace kube-system \
#   --set enableSpotInterruptionDraining=true \
#   --set enableRebalanceMonitoring=true \
#   --set enableScheduledEventDraining=true
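To see what the minAvailable: "70%" budget in the manifest above permits, it helps to compute the number of evictions it allows. The sketch below mirrors the documented Kubernetes behaviour of rounding a percentage minAvailable up to the next whole pod; the function names are illustrative and it ignores unhealthy-pod eviction policies.

```python
from typing import Union


def required_available(replicas: int, min_available: Union[int, str]) -> int:
    """Pods that must stay available; a percentage is rounded up to a whole pod."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        pct = int(min_available[:-1])
        return (replicas * pct + 99) // 100   # integer ceiling of replicas * pct / 100
    return int(min_available)


def allowed_disruptions(replicas: int, healthy: int,
                        min_available: Union[int, str]) -> int:
    """Voluntary evictions the budget permits right now."""
    return max(0, healthy - required_available(replicas, min_available))


# The web Deployment above: 10 replicas, minAvailable "70%".
print(allowed_disruptions(replicas=10, healthy=10, min_available="70%"))  # 3
# After three pods are already down, further drains are blocked.
print(allowed_disruptions(replicas=10, healthy=7, min_available="70%"))   # 0
```

With 10 replicas spread over many nodes, a single node drain rarely needs more than a few evictions, so this budget leaves room for one interruption while still blocking a second concurrent drain from cutting capacity further.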

Fault-Tolerant Batch Pipelines on Spot

For batch workloads the contract is simpler: work is divided into checkpointed units, and any unit not completed due to interruption is retried on a replacement instance. The design principles that make this work at scale are:

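Idempotent tasks. Every work unit must be safe to execute more than once. Write output to a deterministic S3 key derived from the work-unit ID so that a retry overwrites the same object, and publish a completion marker only after the output is fully written (S3 has no atomic rename; a rename is a copy followed by a delete). Use conditional writes (PutItem with a condition expression) in DynamoDB to prevent duplicate inserts.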
Checkpoint frequently. For long-running ML training, checkpoint model weights to S3 every N steps (typically every 5–15 minutes of compute time, not wall time). On restart, load the latest checkpoint and continue. PyTorch Lightning, TensorFlow, and Hugging Face Trainer all support this natively.
Atomic task claiming. Use SQS with visibility timeout as a distributed work queue. A worker claims a message (makes it invisible to others), processes it, and deletes it on success. If the instance is interrupted before deletion, the visibility timeout expires and another instance picks up the same message. Set the visibility timeout to 1.5x your expected task duration.
AWS Spot + EMR. EMR Managed Scaling and Spot-aware task node configuration handle interruptions automatically for Spark and Hive jobs: task nodes (stateless compute) run on Spot; core nodes (HDFS replication) run on on-demand.
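The visibility-timeout contract described in the list above can be demonstrated without any cloud dependency. The class below is an in-memory stand-in written for illustration, not the SQS API: it only models the rule that a received message is hidden for a fixed period and reappears if it is not deleted in time.

```python
class VisibilityQueue:
    """Toy work queue with SQS-style visibility timeouts (times are plain seconds)."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._visible_at = {}   # message id -> time at which it can be received again

    def send(self, message_id: str) -> None:
        self._visible_at[message_id] = 0.0

    def receive(self, now: float):
        """Claim the first visible message and hide it; None if nothing is visible."""
        for message_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[message_id] = now + self.visibility_timeout
                return message_id
        return None

    def delete(self, message_id: str) -> None:
        """Acknowledge successful processing; the message never reappears."""
        self._visible_at.pop(message_id, None)


# Expected task duration 60 s, so the timeout is 1.5 x 60 = 90 s.
queue = VisibilityQueue(visibility_timeout=90)
queue.send("task-1")

first = queue.receive(now=0)       # worker A claims the task
hidden = queue.receive(now=30)     # worker B sees nothing while A is working
# Worker A is interrupted at t=40 and never calls delete().
retried = queue.receive(now=90)    # the task becomes visible again for worker B
queue.delete(retried)              # worker B finishes and acknowledges
print(first, hidden, retried, queue.receive(now=500))
```

Because the same task can be delivered twice, this pattern only works together with idempotent task handlers.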

The biggest Spot anti-pattern: Using Spot for stateful, long-running processes that write to local disk and have no checkpointing — ML training that writes model state only to /tmp, or batch jobs that accumulate results in memory for hours before flushing. A single interruption destroys hours of compute. Fix: checkpoint to object storage at regular intervals before moving any workload to Spot.
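The fix is a small amount of code. The sketch below shows the checkpoint-and-resume loop in miniature, with a local JSON file standing in for object storage and a running sum standing in for real work; SpotInterruption, run, and the file layout are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class SpotInterruption(Exception):
    """Raised by the simulation when the instance is reclaimed."""


def save_checkpoint(path: str, state: dict) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)   # a reader never sees a half-written checkpoint


def load_checkpoint(path: str) -> dict:
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path: str, steps: int, checkpoint_every: int, interrupt_at=None) -> int:
    """Process steps [0, steps), resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if interrupt_at is not None and step == interrupt_at:
            raise SpotInterruption(f"reclaimed before step {step}")
        state["total"] += step              # stand-in for real work
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        try:
            run(ckpt, steps=100, checkpoint_every=10, interrupt_at=37)
        except SpotInterruption:
            print("resuming from step", load_checkpoint(ckpt)["next_step"])
        print("final total:", run(ckpt, steps=100, checkpoint_every=10))
```

The interruption at step 37 costs only the seven steps since the last checkpoint, and the rerun produces the same total as an uninterrupted run.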

Spot + On-Demand Blending Strategy

Production Spot fleets usually blend in some on-demand capacity as a stable floor. The right split depends on workload tolerance:

Batch / ML training: 0–10% on-demand. Pure Spot is fine if tasks are checkpointed. Keep a small on-demand base only to ensure the fleet never goes to zero when capacity is tight across all pools.
Stateless web tier: 20–30% on-demand. The on-demand floor ensures a minimum number of healthy instances survive a worst-case simultaneous multi-pool interruption. The ALB routes across both; users see no difference.
CI/CD agents: 0–20% on-demand (80–100% Spot). Builds are naturally retryable: a failed build is requeued with no user impact. Any small on-demand fraction just keeps the fleet from fully draining during a regional capacity crunch.
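These splits map onto the two knobs used in the Terraform example earlier: an on-demand base plus a percentage of on-demand above that base. The sketch below works out the resulting instance counts and a blended hourly cost. Rounding the on-demand share up is an assumption made here to stay conservative, and the price and discount are illustrative.

```python
def split_capacity(desired: int, on_demand_base: int, on_demand_pct_above_base: int):
    """Return (on_demand, spot) instance counts for a mixed fleet."""
    base = min(desired, on_demand_base)
    above = desired - base
    # Assumption: round the on-demand share above the base up to a whole instance.
    extra_on_demand = (above * on_demand_pct_above_base + 99) // 100
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


def blended_hourly_cost(on_demand: int, spot: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Fleet cost per hour given an on-demand price and a Spot discount (0..1)."""
    return on_demand * on_demand_price + spot * on_demand_price * (1 - spot_discount)


# Batch fleet from the Terraform example: 20 instances, base of 2, 0% above base.
batch = split_capacity(20, on_demand_base=2, on_demand_pct_above_base=0)
# Stateless web tier: 20 instances, no base, 25% on-demand.
web = split_capacity(20, on_demand_base=0, on_demand_pct_above_base=25)

print("batch:", batch, "web:", web)
print("batch $/hour:", blended_hourly_cost(*batch, on_demand_price=1.0, spot_discount=0.75))
```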

At big-tech scale, companies such as Airbnb, Netflix, and Lyft are reported to run a large share of their EC2 compute footprint, often quoted as 60–80%, on Spot or equivalent discounted capacity. The engineering investment to get there (interruption handlers, checkpointing, instance diversification, PDBs) typically pays back within weeks once EC2 spend exceeds roughly $50k/month.