Right-Sizing & Eliminating Waste
At most companies that have run in the cloud for two or more years, 25–40% of their monthly spend goes to resources that are either idle, wildly oversized, or completely orphaned. Right-sizing and waste elimination is therefore the highest-ROI FinOps activity — and it produces results in days, not quarters. This lesson covers the three dominant waste categories: idle resources, oversized instances, and orphaned storage. We treat each as a distinct operational problem with its own detection pipeline and remediation playbook.
Idle Resources
An idle resource is running and billing but delivering no user value. Common examples: a dev EC2 instance left on over a three-day weekend, an RDS database whose application was shut down six months ago, a load balancer with zero healthy targets, a CloudFront distribution pointing at a deleted origin, or a GKE node pool at 3% average CPU for a decommissioned service. The challenge is that occasional spikes make idle look active unless you look at the right signal over a long enough window.
The canonical detection query for EC2 uses CloudWatch metrics: pull average CPU over 14 days and flag anything below 5%. But CPU alone is deceptive. A cache layer or a standby instance may show near-zero CPU yet serve critical traffic in short bursts. Cross-reference with network bytes and connection counts before acting.
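The cross-referencing rule above can be sketched as a small classifier. This is a minimal illustration rather than a production detector: it assumes the metric series have already been fetched (for example from CloudWatch) into plain lists, and the `InstanceMetrics` type, the `connections` signal, and the `NETWORK_IDLE_BYTES` threshold are hypothetical names and values chosen for the example. Only the 5% CPU threshold comes from the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class InstanceMetrics:
    """Pre-fetched metric series for one instance (hypothetical shape)."""
    instance_id: str
    cpu_percent: list      # e.g. hourly CPU averages over 14 days
    network_bytes: list    # bytes in + out per period
    connections: list      # connection counts per period (assumed signal)


CPU_IDLE_THRESHOLD = 5.0         # percent, from the lesson text
NETWORK_IDLE_BYTES = 1_000_000   # assumption: under ~1 MB per period is "quiet"


def classify(m: InstanceMetrics) -> str:
    """Return 'idle', 'review', or 'active' for one instance."""
    if not m.cpu_percent:
        return "review"  # missing data is not proof of idleness
    if mean(m.cpu_percent) >= CPU_IDLE_THRESHOLD:
        return "active"
    # Low average CPU: only call it idle if the other signals agree.
    # max() is used so that a single traffic burst blocks the idle verdict.
    quiet_network = max(m.network_bytes, default=0) < NETWORK_IDLE_BYTES
    no_connections = max(m.connections, default=0) == 0
    if quiet_network and no_connections:
        return "idle"
    return "review"  # low CPU but traffic seen: possibly a cache or standby
```

Using the peak rather than the mean for network and connections is a deliberately conservative choice: it sends borderline instances to human review instead of straight to shutdown.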
Oversized Instances
Oversized is distinct from idle: the resource is actively used, but it was provisioned at a size that was never justified, or that once was justified before the workload shrank. A common production pattern is a microservice that launched on an m5.4xlarge because "we were not sure of the load"; six months later the P99 CPU is 8% and the service is still running on the same box. An m5.4xlarge ($0.768/hr on-demand) costs 8x as much as the right-sized m5.large ($0.096/hr). On a fleet of 50 such services, the difference is roughly $294k/year.
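The fleet-level figure follows directly from the two hourly prices quoted above. A quick check, assuming each service runs a single instance around the clock for a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_savings(current_hourly: float, target_hourly: float, count: int) -> float:
    """Yearly on-demand savings from moving `count` instances to a cheaper size."""
    return (current_hourly - target_hourly) * HOURS_PER_YEAR * count


price_ratio = 0.768 / 0.096                   # m5.4xlarge vs m5.large
savings = annual_savings(0.768, 0.096, 50)    # 50 always-on services

print(f"price ratio: {price_ratio:.0f}x")
print(f"annual savings: ${savings:,.0f}")
```

The result is just under $300k per year, consistent with the figure quoted above. Reserved Instance or Savings Plan pricing would shrink the absolute number but not the ratio.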
Right-sizing is a three-step loop: measure, recommend, change with a rollback plan. Never resize in production without a tested rollback. For stateless services behind a load balancer, the rollback is a one-minute Auto Scaling Group launch template change. For stateful services (databases, queues, coordination services), the rollback is a snapshot restore — test that restore time before you resize.
Install the CloudWatch agent (which publishes the mem_used_percent metric) and the GCP Ops Agent on every instance before trusting any right-sizing recommendation. Without memory data, Compute Optimizer can recommend a compute-optimized instance for a memory-bound workload and cause an OOM in production.
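One way to enforce the memory rule is to make the sizing step refuse to run without memory data. The sketch below is illustrative only: the `recommend` function, the 30% `HEADROOM` margin, and the idea of expressing CPU peak in vCPUs are assumptions for this example, and the size table covers just four m5 sizes.

```python
from typing import Optional

# (instance type, vCPUs, memory in GiB) for a few m5 sizes, smallest first.
SIZES = [
    ("m5.large", 2, 8),
    ("m5.xlarge", 4, 16),
    ("m5.2xlarge", 8, 32),
    ("m5.4xlarge", 16, 64),
]

HEADROOM = 1.3  # assumption: keep 30% above the observed peak


def recommend(peak_vcpus_used: float, peak_mem_gib: Optional[float]) -> Optional[str]:
    """Pick the smallest listed size that fits both CPU and memory peaks.

    Raises ValueError when memory data is missing, instead of sizing on CPU alone.
    Returns None when no listed size is large enough.
    """
    if peak_mem_gib is None:
        raise ValueError("no memory metrics: refusing to recommend a size")
    for name, vcpus, mem_gib in SIZES:
        fits_cpu = vcpus >= peak_vcpus_used * HEADROOM
        fits_mem = mem_gib >= peak_mem_gib * HEADROOM
        if fits_cpu and fits_mem:
            return name
    return None
```

For example, a service peaking at 1 vCPU and 20 GiB of memory lands on m5.2xlarge here, not m5.large: memory, not CPU, is the binding constraint.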
Orphaned Storage
Orphaned storage is the most insidious waste category because the resources are not visible in normal operations dashboards. Common orphans: EBS volumes detached from terminated instances (not deleted by default), EBS snapshots of volumes that no longer exist, AMIs registered but never used after the instance was rebuilt, S3 buckets with lifecycle policies that were never set or were accidentally deleted, RDS snapshots beyond retention policy, and Elastic IP addresses allocated but unattached (billed at $0.005/hr each — trivial per unit, massive at scale).
In AWS, the single most overlooked default is DeleteOnTermination: in the EC2 console it defaults to true for the root EBS volume but to false for additional data volumes. An Auto Scaling group that attaches a secondary data volume can therefore leave an orphaned EBS volume behind on every scale-in event unless you explicitly set DeleteOnTermination=true at attach time or in the launch template.
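Detached volumes are straightforward to find once the volume inventory is in hand. The sketch below assumes a list of dicts shaped like the `Volumes` entries returned by the EC2 DescribeVolumes API (`VolumeId`, `State`, `Attachments`, `Size`, `CreateTime`); the function name and the seven-day grace period are hypothetical choices for the example.

```python
from datetime import datetime, timedelta, timezone


def find_orphaned_volumes(volumes, min_age_days=7, now=None):
    """Return volumes that are unattached and older than the grace period.

    `volumes` is assumed to be a list of dicts in DescribeVolumes shape.
    Note that CreateTime is the creation time, not the detach time, so the
    grace period only protects recently created volumes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available"        # not attached to any instance
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]


def total_size_gib(volumes):
    """Sum of provisioned sizes, useful for estimating the monthly cost."""
    return sum(v["Size"] for v in volumes)
```

The output is a candidate list, not a deletion list: a detached volume may be the only copy of data someone still needs, so snapshotting before deletion is a reasonable default.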
The Right-Sizing Workflow at Scale
Ad-hoc cleanup is a one-time event; production waste accumulates continuously. Senior engineers build a weekly automated pipeline that does the following: pull recommendations from Compute Optimizer and the equivalent GCP/Azure tools; join against the tagging database to identify the owning team and cost centre; compute the projected monthly savings; open a Jira ticket assigned to the team with a two-week SLA; track closure rate per team; escalate unclosed tickets to engineering managers. This transforms right-sizing from a heroic cleanup sprint into a continuous operational discipline.
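The join-and-ticket step of that pipeline can be sketched without any vendor SDK. Everything here is hypothetical scaffolding: the field names (`resource_id`, `current_monthly_cost`, `recommended_monthly_cost`), the `owners` lookup standing in for the tagging database, and the ticket dict standing in for whatever the issue tracker expects. Only the two-week SLA comes from the text.

```python
from collections import defaultdict

SLA_DAYS = 14  # two-week SLA from the lesson text


def build_tickets(recommendations, owners, min_monthly_savings=0.0):
    """Group right-sizing recommendations into one ticket payload per team.

    recommendations: list of dicts with resource_id, current_monthly_cost,
                     recommended_monthly_cost (hypothetical field names).
    owners: dict mapping resource_id -> {"team": ..., "cost_centre": ...}.
    """
    by_team = defaultdict(list)
    for rec in recommendations:
        savings = rec["current_monthly_cost"] - rec["recommended_monthly_cost"]
        if savings <= min_monthly_savings:
            continue  # not worth a ticket
        # Untagged resources are surfaced, not dropped.
        owner = owners.get(rec["resource_id"], {"team": "unowned", "cost_centre": "unknown"})
        by_team[owner["team"]].append({**rec, "monthly_savings": savings})

    tickets = []
    for team, items in sorted(by_team.items()):
        tickets.append({
            "team": team,
            "summary": f"Right-size {len(items)} resource(s)",
            "projected_monthly_savings": round(sum(i["monthly_savings"] for i in items), 2),
            "sla_days": SLA_DAYS,
            "resources": [i["resource_id"] for i in items],
        })
    return tickets
```

Routing untagged resources to an explicit "unowned" bucket keeps them visible; in practice that bucket is often where the largest savings sit.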
Production-Grade Guardrails
Every waste-elimination program eventually deletes something it should not have. Build safeguards before you automate deletions: tag every resource with do-not-delete=true for anything that must survive regardless of metrics; require human approval for any deletion above a configurable size threshold (e.g., EBS volumes over 500 GB, S3 buckets with over 1 TB); dry-run all deletion scripts for one week before enabling live mode; and keep a 30-day audit log of everything deleted, by whom (or which automation), with the resource ARN and the metric that triggered it. The audit log has saved multiple on-call engineers when a mysterious outage traced back to an automated cleanup that removed a "zero-traffic" queue that turned out to be a dead-letter queue checked once a month by a compliance job.
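These safeguards compose naturally into a single decision function that every deletion path must call. The sketch below is one possible shape, not a prescribed design: the `decide_deletion` name, the resource dict layout (`arn`, `size_gb`, `tags`), and the action strings are all assumptions for illustration, and it only decides and records; it never deletes anything itself.

```python
import json
from datetime import datetime, timezone

SIZE_THRESHOLD_GB = 500  # example threshold from the text; make it configurable


def decide_deletion(resource, trigger, actor, *, dry_run=True, approved=False):
    """Apply the guardrails in order and return (action, audit_log_line).

    resource: dict with "arn", optional "size_gb" and "tags" (hypothetical shape).
    trigger:  the metric or rule that flagged the resource.
    actor:    the person or automation requesting the deletion.
    """
    tags = resource.get("tags", {})
    if tags.get("do-not-delete") == "true":
        action = "skip_protected"      # the tag always wins over metrics
    elif resource.get("size_gb", 0) > SIZE_THRESHOLD_GB and not approved:
        action = "needs_approval"      # large resources require a human
    elif dry_run:
        action = "would_delete"        # dry-run mode: log only
    else:
        action = "delete"

    audit = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "arn": resource["arn"],
        "action": action,
        "actor": actor,
        "trigger": trigger,
    }
    return action, json.dumps(audit)
```

Defaulting `dry_run` to True means a caller has to opt in to live deletion explicitly, and writing an audit line for skipped and dry-run decisions as well makes the one-week dry-run period reviewable after the fact.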
Right-sizing and waste elimination are not a one-time cleanup project. They are a continuous engineering discipline. The fastest-growing companies run this pipeline weekly, publish team-level savings dashboards (visible to VPs), and treat a team's waste percentage as a first-class engineering metric alongside reliability and velocity.