FinOps & Cloud Cost Optimization

Right-Sizing & Eliminating Waste

At most companies that have run in the cloud for two or more years, 25–40% of their monthly spend goes to resources that are either idle, wildly oversized, or completely orphaned. Right-sizing and waste elimination is therefore the highest-ROI FinOps activity — and it produces results in days, not quarters. This lesson covers the three dominant waste categories: idle resources, oversized instances, and orphaned storage. We treat each as a distinct operational problem with its own detection pipeline and remediation playbook.

Idle Resources

An idle resource is running and billing but delivering no user value. Common examples: a dev EC2 instance left on over a three-day weekend, an RDS database whose application was shut down six months ago, a load balancer with zero healthy targets, a CloudFront distribution pointing at a deleted origin, or a GKE node pool at 3% average CPU for a decommissioned service. The challenge is that occasional spikes make idle look active unless you look at the right signal over a long enough window.

The canonical detection query for EC2 uses CloudWatch metrics. Pull average CPU over 14 days and flag anything below 5%. But CPU alone is deceptive: a cache layer or a standby instance may show near-zero CPU yet still serve critical traffic in short bursts. Cross-reference with network bytes and connection counts before acting.

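# AWS CLI: 14-day average CPU for a single instance (flag it if the result is below 5%)
# Note: `date -d '14 days ago'` is GNU date syntax (Linux); macOS/BSD date needs `-v-14d` instead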
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123456789 \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 1209600 \
  --statistics Average \
  --query 'Datapoints[0].Average'

# Bulk scan across all running instances
# 1. Export all running instance IDs to a file
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,LaunchTime,Tags]' \
  --output json > running_instances.json

# 2. For each, compute 14-day avg; below 5% goes into the candidate list
# Use aws-ec2-instance-checker or Compute Optimizer's machine-learning recommendations
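# For EC2 instances the Finding filter accepts Overprovisioned, Underprovisioned, or Optimized
# (NotOptimized applies to Auto Scaling group recommendations)
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned \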
  --query 'instanceRecommendations[*].{ID:instanceArn,Finding:finding,CurrentType:currentInstanceType,RecommendedType:recommendationOptions[0].instanceType}' \
  --output table
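
The cross-check between CPU and network traffic described above can be reduced to a small decision rule. The sketch below is a hypothetical helper, `classify_instance`, that takes a 14-day average CPU percentage and a 14-day total of network bytes (for example, NetworkIn plus NetworkOut from CloudWatch) and labels the instance. The 5% CPU cutoff comes from this lesson; the 100 MiB traffic cutoff is an assumption chosen only for illustration, and connection counts are left out for brevity.

```shell
#!/usr/bin/env bash
# classify_instance AVG_CPU_PCT NET_BYTES_14D
# Prints "idle", "review", or "active".
CPU_IDLE_PCT=5             # from the lesson: avg CPU below 5% over 14 days
NET_IDLE_BYTES=104857600   # ASSUMPTION: under 100 MiB of traffic in 14 days

classify_instance() {
  local avg_cpu="$1" net_bytes="$2"
  # awk handles the floating-point comparison that plain shell cannot
  awk -v cpu="$avg_cpu" -v net="$net_bytes" \
      -v cpu_max="$CPU_IDLE_PCT" -v net_max="$NET_IDLE_BYTES" 'BEGIN {
    if (cpu >= cpu_max)      print "active"
    else if (net >= net_max) print "review"   # low CPU but real traffic: cache or standby?
    else                     print "idle"
  }'
}
```

Anything labelled `review` is exactly the cache or standby case that CPU alone would misclassify, so it should go to a human rather than straight into a stop list.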

Use AWS Compute Optimizer or GCP Recommender in preference to raw metric queries. They apply ML over 14 days of metrics (CPU, memory via CloudWatch agent or Ops Agent, network, disk, EBS throughput) and surface a single ranked list with projected monthly savings per instance. Wire their JSON output into a Jira automation or Slack webhook so every week a ticket is auto-created for each team whose resources appear.

Oversized Instances

Oversized is distinct from idle: the resource is actively used, but it was provisioned at a size that was never justified, or that once was justified before the workload shrank. A common production pattern is a microservice that launched on an m5.4xlarge because "we were not sure of the load"; six months later the P99 CPU is 8% and the service is still running on the same box. An m5.4xlarge ($0.768/hr on-demand) costs 8x as much as the right-sized m5.large ($0.096/hr), a difference of $0.672/hr. On a fleet of 50 such services, each running one instance around the clock, that is roughly $294k/year.
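
The fleet-level figure above is simple arithmetic: hourly price difference, times hours per year, times instance count. A minimal sketch, assuming one always-on instance per service (8,760 hours per year) and the on-demand prices quoted above; `annual_savings` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# annual_savings CURRENT_USD_PER_HR TARGET_USD_PER_HR INSTANCE_COUNT
# Prints the yearly on-demand saving in USD, assuming 24x7 usage.
annual_savings() {
  local current_hr="$1" target_hr="$2" count="$3"
  awk -v cur="$current_hr" -v tgt="$target_hr" -v n="$count" \
    'BEGIN { printf "%.2f\n", (cur - tgt) * 8760 * n }'
}

annual_savings 0.768 0.096 50   # m5.4xlarge -> m5.large across 50 instances; prints 294336.00
```

Instances covered by Savings Plans or Reserved Instances would save less than this on-demand estimate, so treat the result as an upper bound.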

Right-sizing is a three-step loop: measure, recommend, change with a rollback plan. Never resize in production without a tested rollback. For stateless services behind a load balancer, the rollback is a one-minute Auto Scaling Group launch template change. For stateful services (databases, queues, coordination services), the rollback is a snapshot restore — test that restore time before you resize.

# Terraform: parameterise instance type so right-sizing is a one-line PR
variable "instance_type" {
  description = "EC2 instance type; reviewed quarterly by FinOps"
  type        = string
  default     = "m5.large"
}

resource "aws_instance" "api_server" {
  ami           = data.aws_ami.app.id
  instance_type = var.instance_type
  # ... rest of config
}

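# Right-sizing RDS: change instance class (applied in the maintenance window; expect a brief outage or Multi-AZ failover)
resource "aws_db_instance" "postgres" {
  identifier        = "prod-postgres"
  instance_class    = var.db_instance_class   # was db.r6g.4xlarge; right-sized to db.r6g.xlarge
  apply_immediately = false                    # apply in next maintenance window
  # The two settings below only control the snapshot taken when the instance is DELETED.
  # Take the pre-resize snapshot separately (a manual DB snapshot) before applying the change.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-postgres-final"   # illustrative static name; timestamp() here would show a diff on every plan
  # ... engine, storage, credentials, and the rest of the config
}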

Memory metrics are a blind spot in AWS Compute Optimizer unless you install the CloudWatch agent. An instance running an in-memory cache may look right-sized by CPU while actually being under-provisioned on memory. Deploy the CloudWatch agent (mem_used_percent metric) on AWS, or the Ops Agent on GCP, on every instance before trusting any right-sizing recommendation. Without memory data, the optimizer can recommend a smaller or compute-optimized instance for a memory-bound workload, and following that recommendation can cause an OOM in production.

Orphaned Storage

Orphaned storage is the most insidious waste category because the resources are not visible in normal operations dashboards. Common orphans: EBS volumes detached from terminated instances (not deleted by default), EBS snapshots of volumes that no longer exist, AMIs registered but never used after the instance was rebuilt, S3 buckets with lifecycle policies that were never set or were accidentally deleted, RDS snapshots beyond retention policy, and Elastic IP addresses allocated but unattached (billed at $0.005/hr each — trivial per unit, massive at scale).

In AWS, the single most overlooked default is that the EC2 console DeleteOnTermination flag for the root EBS volume defaults to true, but additional data volumes default to false. Every auto-scaling group that attaches a secondary data volume will leave orphaned EBS volumes on every scale-in event unless you explicitly set DeleteOnTermination=true at attach time or via the launch template.
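
To see how exposed an account is to this default, one option is to list attached volumes whose DeleteOnTermination flag is false and then flip the flag where appropriate. The sketch below uses the `attachment.delete-on-termination` filter of `aws ec2 describe-volumes` and `aws ec2 modify-instance-attribute`; the function names and the `/dev/sdf` device in the usage note are hypothetical, and the calls assume credentials and a default region are already configured.

```shell
#!/usr/bin/env bash
# Build the --block-device-mappings JSON that enables DeleteOnTermination
# for one attached device.
dot_mapping() {
  local device="$1"
  printf '[{"DeviceName":"%s","Ebs":{"DeleteOnTermination":true}}]' "$device"
}

# List attached volumes that would survive termination of their instance.
# Output columns: instance ID, device name, volume ID.
list_surviving_volumes() {
  aws ec2 describe-volumes \
    --filters Name=attachment.status,Values=attached \
              Name=attachment.delete-on-termination,Values=false \
    --query 'Volumes[].Attachments[].[InstanceId,Device,VolumeId]' \
    --output text
}

# Enable DeleteOnTermination for one instance/device pair.
enable_delete_on_termination() {
  local instance_id="$1" device="$2"
  aws ec2 modify-instance-attribute \
    --instance-id "$instance_id" \
    --block-device-mappings "$(dot_mapping "$device")"
}
```

A typical pass would run `list_surviving_volumes`, review the list (some data volumes are meant to outlive their instance), and then call `enable_delete_on_termination i-0abc123456789 /dev/sdf` for the rest. Fixing the launch template is still needed so that new instances start with the right setting.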

#!/usr/bin/env bash
# Orphan hunter: find and report unattached EBS volumes and unassociated EIPs

echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType,AZ:AvailabilityZone,Created:CreateTime}' \
  --output table

echo ""
echo "=== Unassociated Elastic IPs ==="
aws ec2 describe-addresses \
  --filters Name=domain,Values=vpc \
  --query 'Addresses[?AssociationId==`null`].{AllocationId:AllocationId,PublicIP:PublicIp,Tags:Tags}' \
  --output table

echo ""
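echo "=== Snapshots older than 90 days with no corresponding volume ==="
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%dT%H:%M:%SZ)   # GNU date syntax
# IDs of every volume that still exists in this region, one per line
EXISTING_VOLUMES=$(aws ec2 describe-volumes --query 'Volumes[*].VolumeId' --output text | tr '\t' '\n')
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='${CUTOFF}'].[SnapshotId,VolumeId,VolumeSize,StartTime]" \
  --output text |
while read -r snap_id vol_id size_gib started; do
  # Report only snapshots whose source volume is gone (check AMI usage before deleting)
  if ! grep -qxF "$vol_id" <<< "$EXISTING_VOLUMES"; then
    printf '%s\t%s\t%s GiB\t%s\n' "$snap_id" "$vol_id" "$size_gib" "$started"
  fi
done | head -60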

echo ""
echo "=== S3 buckets with no lifecycle policy ==="
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
  lc=$(aws s3api get-bucket-lifecycle-configuration --bucket "$bucket" 2>&1)
  if echo "$lc" | grep -q "NoSuchLifecycleConfiguration"; then
    size=$(aws s3api list-objects-v2 --bucket "$bucket" \
      --query 'sum(Contents[*].Size)' --output text 2>/dev/null || echo "0")
    echo "NO LIFECYCLE: $bucket  (approx bytes: $size)"
  fi
done

The Right-Sizing Workflow at Scale

Ad-hoc cleanup is a one-time event; production waste accumulates continuously. Senior engineers build a weekly automated pipeline that does the following: pull recommendations from Compute Optimizer and the equivalent GCP/Azure tools; join against the tagging database to identify the owning team and cost centre; compute the projected monthly savings; open a Jira ticket assigned to the team with a two-week SLA; track closure rate per team; escalate unclosed tickets to engineering managers. This transforms right-sizing from a heroic cleanup sprint into a continuous operational discipline.

Weekly right-sizing automation pipeline: from cloud recommendations to team-owned tickets with escalation.
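
The join-and-aggregate step of that pipeline can be sketched with two tab-separated exports. The file layouts, the `UNOWNED` bucket, and the function name `savings_by_team` are assumptions made for illustration; a real pipeline would read from the recommender output and the tagging database, then feed the totals into ticket creation.

```shell
#!/usr/bin/env bash
# savings_by_team RECS_TSV OWNERS_TSV
#   RECS_TSV:   instance_id <TAB> projected_monthly_savings_usd
#   OWNERS_TSV: instance_id <TAB> team
# Prints "team <TAB> total_savings", largest first. Instances with no owner
# are grouped under UNOWNED so they are not silently dropped.
savings_by_team() {
  local recs="$1" owners="$2" tab
  tab=$(printf '\t')
  awk -F'\t' '
    FILENAME == ARGV[1] { owner[$1] = $2; next }   # first file: ownership map
    {
      team = ($1 in owner) ? owner[$1] : "UNOWNED"
      total[team] += $2
    }
    END { for (t in total) printf "%s\t%.2f\n", t, total[t] }
  ' "$owners" "$recs" | LC_ALL=C sort -t "$tab" -k2,2nr
}
```

Surfacing the `UNOWNED` total is deliberate: untagged resources are usually where both the waste and the tagging gaps are concentrated.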

Enable S3 Intelligent-Tiering on buckets whose objects average more than 128 KB and whose access patterns you cannot predict; objects smaller than 128 KB are never auto-tiered. It moves objects between the Frequent Access, Infrequent Access, and Archive Instant Access tiers automatically, offers opt-in Archive Access and Deep Archive Access tiers, and charges no retrieval fees. For buckets with known access patterns (e.g., log archives never accessed after 30 days), set an explicit lifecycle rule instead. It is cheaper because Intelligent-Tiering charges a small per-object monitoring fee that adds up on buckets with millions of objects.
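
To get a feel for how the monitoring fee scales, the sketch below multiplies an object count by a per-1,000-objects rate. The default rate of $0.0025 per 1,000 objects per month is an assumption; check it against the current S3 pricing page for your region before relying on the result. `it_monitoring_fee_usd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# it_monitoring_fee_usd OBJECT_COUNT [FEE_PER_1000_OBJECTS]
# Monthly Intelligent-Tiering monitoring fee for OBJECT_COUNT monitored objects.
it_monitoring_fee_usd() {
  local objects="$1" fee_per_1000="${2:-0.0025}"   # ASSUMED default rate
  awk -v n="$objects" -v fee="$fee_per_1000" \
    'BEGIN { printf "%.2f\n", n / 1000 * fee }'
}

it_monitoring_fee_usd 50000000   # 50 million monitored objects; prints 125.00
```

Comparing this fee with the storage saving you expect from tiering is a quick way to decide between Intelligent-Tiering and an explicit lifecycle rule for a given bucket.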

Production-Grade Guardrails

Every waste-elimination program eventually deletes something it should not have. Build safeguards before you automate deletions: tag anything that must survive regardless of metrics with do-not-delete=true; require human approval for any deletion above a configurable size threshold (e.g., EBS volumes over 500 GB, S3 buckets over 1 TB); run all deletion scripts in dry-run mode for one week before enabling live mode; and keep a 30-day audit log of everything deleted, by whom (or by which automation), with the resource ARN and the metric that triggered the deletion. The audit log has saved multiple on-call engineers when a mysterious outage was traced back to an automated cleanup that removed a "zero-traffic" queue, which turned out to be a dead-letter queue checked once a month by a compliance job.
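
These guardrails compose into a single decision that every deletion script can call before touching a resource. The sketch below is one possible shape, not a prescribed implementation: the function name, argument order, and return strings are hypothetical, and sizes are assumed to be whole gigabytes.

```shell
#!/usr/bin/env bash
# deletion_decision DO_NOT_DELETE_TAG SIZE_GB APPROVAL_THRESHOLD_GB MODE
#   DO_NOT_DELETE_TAG: value of the do-not-delete tag ("true" protects the resource)
#   MODE:              "live" deletes; anything else is treated as a dry run
# Prints one of: skip, needs-approval, would-delete, delete
deletion_decision() {
  local protected="$1" size_gb="$2" threshold_gb="$3" mode="$4"
  if [ "$protected" = "true" ]; then
    echo "skip"               # tagged resources survive regardless of metrics
  elif [ "$size_gb" -gt "$threshold_gb" ]; then
    echo "needs-approval"     # large resources require a human sign-off
  elif [ "$mode" = "live" ]; then
    echo "delete"
  else
    echo "would-delete"       # dry run: report only
  fi
}

# Append one audit record: UTC timestamp, resource ARN, decision, triggering metric.
audit_record() {
  local log_file="$1" arn="$2" decision="$3" metric="$4"
  printf '%s\t%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$arn" "$decision" "$metric" >> "$log_file"
}
```

Defaulting to a dry run for any mode other than `live` means a typo in configuration fails safe.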

Right-sizing and waste elimination are not a one-time cleanup project. They are a continuous engineering discipline. The fastest-growing companies run this pipeline weekly, publish team-level savings dashboards (visible to VPs), and treat a team's waste percentage as a first-class engineering metric alongside reliability and velocity.