FinOps & Cloud Cost Optimization

Project: A Cost Optimization Program

18 min Lesson 10 of 26

This lesson is the capstone of the FinOps tutorial. You will work through a realistic scenario (a $480,000/month AWS bill for a mid-size SaaS platform), walk every line of that bill with a structured audit methodology, and produce a concrete savings roadmap with effort tiers, dollar estimates, and a 12-month delivery calendar. This is the exact exercise a FinOps practitioner runs when joining a new organisation or when cloud spend starts outgrowing revenue growth.

The Sample Bill — Anatomy of $480k/Month

The platform is a B2B SaaS product serving 8,000 tenants in us-east-1 and eu-west-1. The engineering org is 120 engineers across 14 product squads. The bill has never been systematically reviewed. Current state:

  • EC2 & Auto Scaling: $198,000 (41%) — 420 production instances, all on-demand. Sizes range from t3.large to c6i.8xlarge. No Savings Plans, no RIs. Average utilisation reported by CloudWatch is 22% CPU across the fleet.
  • RDS & Aurora: $87,000 (18%) — 38 Aurora clusters (MySQL-compatible). 11 clusters are db.r6g.8xlarge running multi-AZ. No reserved instances. 6 clusters have not received a single write query in the past 14 days (dev/test environments not shut down on nights/weekends).
  • Data Transfer: $62,000 (13%) — the single largest line item nobody examined. $41,000 is cross-AZ transfer. $14,000 is inter-region replication to eu-west-1 for tenants that actually sit entirely in us-east-1.
  • S3: $34,000 (7%). 1.2 PB stored. No Intelligent-Tiering, no lifecycle rules. S3 Storage Class Analysis shows 85% of objects have not been accessed in over 90 days (CloudWatch alone does not report per-object access recency).
  • CloudWatch Logs: $29,000 (6%) — default infinite retention on 340 log groups. 60% of ingestion is debug-level logs from a Java service that should have been switched to INFO in production 18 months ago.
  • NAT Gateways: $24,000 (5%) — 12 NAT gateways across 4 VPCs. $19,000 of that is data-processing charges from S3 and DynamoDB traffic routing through NAT instead of VPC endpoints.
  • Other (EBS snapshots, ELBs, ECR, Lambda, SQS, SNS): $46,000 (10%)
[Figure: Sample bill breakdown, $480k/month by service category, drawn as horizontal bars on a $0 to $480k axis. Values match the list above.]
Monthly bill by service category. Data transfer and NAT gateways are disproportionately expensive relative to what they deliver.

Phase 1 — Audit: Ask the Right Questions Before Touching Anything

The worst mistake in a bill audit is immediately clicking "purchase Reserved Instances" or deleting resources without understanding causality. A structured audit follows this sequence:

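  1. Verify tag coverage. Run aws resourcegroupstaggingapi get-resources --tag-filters Key=team and measure what fraction of resources carry the mandatory team, env, and service tags. This API only returns resources that are, or once were, tagged, so resources that never received a tag have to be counted from a separate inventory such as AWS Config. In the sample bill, tag coverage is 38%, so roughly 62% of spend cannot be allocated to a team. Fix tagging before analysing anything else; without it, findings cannot be attributed to an owner.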
  2. Export 90 days of Cost Explorer data. Export at DAILY granularity grouped by SERVICE, USAGE_TYPE, and the team tag. Load into a spreadsheet or a BigQuery/Athena table. Look for monotonic growth lines (cost that grows every day without a corresponding feature launch), step-function spikes (cost that jumped suddenly — usually a new workload or a misconfiguration), and flat lines on large amounts (committed resources sitting idle).
  3. Cross-reference with CloudWatch metrics. For EC2, pull average CPUUtilization, NetworkIn, and NetworkOut over 90 days. Anything averaging below 10% CPU is a right-sizing or termination candidate. For RDS, pull DatabaseConnections — an Aurora cluster with zero connections over 14 days is a dev environment that never got a shutdown schedule.
  4. Map data flows. Use VPC Flow Logs aggregated in Athena to identify the top-10 source/destination pairs by byte count. This is the only way to understand your $62,000 data transfer bill without guessing. The cross-AZ traffic almost always comes from a small number of chatty services that were deployed without AZ affinity.
# Step 1: Measure tag coverage for EC2, RDS, and load balancer resources.
# Note: get-resources only returns resources that are, or once were, tagged.
aws resourcegroupstaggingapi get-resources \
  --resource-type-filters ec2:instance rds:db elasticloadbalancing:loadbalancer \
  --query 'ResourceTagMappingList[*].{ARN:ResourceARN, Tags:Tags}' \
  --output json | jq '
    .[] | {
      arn: .ARN,
      # exact key match; contains() would also match substrings such as "team_old"
      has_team:    ((.Tags // []) | any(.[]; .Key == "team")),
      has_env:     ((.Tags // []) | any(.[]; .Key == "env")),
      has_service: ((.Tags // []) | any(.[]; .Key == "service"))
    }
  ' | jq -s '
    (map(select(.has_team and .has_env and .has_service)) | length) as $tagged
    | {
        total: length,
        fully_tagged: $tagged,
        # guard against division by zero when no resources are returned
        coverage_pct: (if length > 0 then $tagged * 100 / length else 0 end)
      }
  '

# Step 2: Average CPU for one EC2 instance over 90 days (candidates are < 10%).
# Replace INSTANCE_ID and dates; run once per instance and per region.
# The window is exactly 90 days so that it matches the 7,776,000-second period.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=INSTANCE_ID \
  --start-time 2025-03-14T00:00:00Z \
  --end-time 2025-06-12T00:00:00Z \
  --period 7776000 \
  --statistics Average \
  --query 'Datapoints[0].Average'

-- Step 3: Athena query, top cross-AZ traffic pairs from VPC Flow Logs.
-- src_az and dst_az are not default flow log fields; this assumes the table
-- has been enriched, for example by joining addresses to a subnet-to-AZ map.
SELECT
  srcaddr,
  dstaddr,
  SUM(bytes) / 1073741824.0 AS gigabytes_transferred
FROM vpc_flow_logs
WHERE date_partition >= '2025-03-01'
  AND src_az != dst_az  -- filter cross-AZ only
GROUP BY srcaddr, dstaddr
ORDER BY gigabytes_transferred DESC
LIMIT 20;
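Step 2 of the audit names three shapes to look for in the exported daily cost data: monotonic growth, step-function spikes, and flat lines. The sketch below is one rough way to label a series once it has been loaded from the export. The function name, the 1.5x step ratio, and the 2% flatness tolerance are illustrative assumptions, not values from the lesson, and would need tuning against real data.

```python
from statistics import mean


def classify_cost_series(daily_costs, step_ratio=1.5, flat_tolerance=0.02):
    """Label a daily cost series as 'step', 'monotonic_growth', 'flat', or 'mixed'.

    step_ratio and flat_tolerance are hypothetical thresholds for illustration.
    """
    if len(daily_costs) < 3:
        raise ValueError("need at least 3 daily data points")

    pairs = list(zip(daily_costs, daily_costs[1:]))

    # Step function: any single day-over-day jump of step_ratio or more.
    if any(prev > 0 and cur / prev >= step_ratio for prev, cur in pairs):
        return "step"

    # Monotonic growth: cost rises every single day.
    if all(cur > prev for prev, cur in pairs):
        return "monotonic_growth"

    # Flat: every point stays within flat_tolerance of the series mean.
    avg = mean(daily_costs)
    if avg > 0 and all(abs(c - avg) / avg <= flat_tolerance for c in daily_costs):
        return "flat"

    return "mixed"


if __name__ == "__main__":
    print(classify_cost_series([100, 100, 100, 250, 250]))
    print(classify_cost_series([100, 104, 109, 115]))
    print(classify_cost_series([500, 500, 501, 499, 500]))
```

A "flat" label on a large line item is only a prompt to check utilisation; the classifier says nothing about whether the resource is actually idle.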

The Savings Roadmap — Tiered by Effort and Time to Value

A savings roadmap is not a wish list. Every initiative needs a dollar estimate, an effort estimate, an owner, and a completion date. The tiers below reflect real-world implementation difficulty and the typical organisational friction involved.

The 80/20 rule of cloud savings: in virtually every first audit, four categories account for 80% of recoverable savings: idle/orphaned resources, right-sizing compute, commitment discounts, and data transfer architecture. Everything else is noise until those four are done.

Tier 0 — No-Risk Deletions (Week 1–2, zero engineering effort):

  • Terminate the 6 idle Aurora dev clusters: saves $14,400/month. These showed no write activity or connections over the 14-day window. Snapshot them first (aws rds create-db-cluster-snapshot), then delete. For the dev clusters that remain, add an EventBridge Scheduler schedule (optionally fronted by a Lambda) to stop them at 18:00 on weekdays and restart them at 08:00. This pattern saves $8–12k/month on its own by eliminating roughly 70% of dev environment runtime (50 running hours out of 168 per week).
  • Set CloudWatch Logs retention to 30 days on all log groups (90 days for audit-sensitive groups) and switch the Java service log level to INFO: saves $17,400/month combined, which is the 60% of the $29,000 line attributable to debug output. Retention is a single aws logs put-retention-policy call per log group, scriptable in 20 minutes.
  • Delete unattached EBS volumes, unused Elastic IPs, and idle load balancers found in the "Other" category: estimated $8,000–12,000/month.
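The arithmetic behind the Tier 0 numbers can be checked with a few lines. The sketch below computes what share of the week a weekday 08:00 to 18:00 schedule removes, and sums the three Tier 0 estimates quoted above into a low and high total. The function name is hypothetical; the dollar figures are the ones from the list.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def runtime_fraction_eliminated(start_hour: int, stop_hour: int,
                                days_per_week: int = 5) -> float:
    """Share of the week a cluster is stopped under a start/stop schedule."""
    if not 0 <= start_hour < stop_hour <= 24:
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    running_hours = (stop_hour - start_hour) * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Weekday 08:00-18:00 schedule: 50 running hours out of 168.
eliminated = runtime_fraction_eliminated(8, 18)

# Tier 0 estimates from the list above (USD per month).
IDLE_AURORA = 14_400
LOGS = 17_400
ORPHANS_LOW, ORPHANS_HIGH = 8_000, 12_000

TIER0_LOW = IDLE_AURORA + LOGS + ORPHANS_LOW
TIER0_HIGH = IDLE_AURORA + LOGS + ORPHANS_HIGH

if __name__ == "__main__":
    print(f"runtime eliminated: {eliminated:.0%}")
    print(f"Tier 0 saving: ${TIER0_LOW:,} to ${TIER0_HIGH:,} per month")
```

The low end of that range is what the roadmap rounds to roughly $40,000/month.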

Tier 1 — Quick Architectural Fixes (Weeks 2–6, 1–2 engineers each):

  • VPC Endpoints for S3 and DynamoDB: removing the $19,000/month of NAT Gateway data-processing charges is the single highest-ROI fix in the bill. Gateway-type VPC Endpoints carry no hourly or data-processing charge, and traffic to S3 and DynamoDB no longer routes through NAT. Implementation: one Terraform module, one PR, one apply. Saves ~$19,000/month.
  • Enable S3 Intelligent-Tiering: 1.2 PB with 85% cold objects. Objects not accessed for 90 consecutive days move automatically to the Archive Instant Access tier, which reduces storage cost from ~$0.023/GB to ~$0.004/GB per month. Net saving after the per-object monitoring fee: ~$18,000/month. Existing objects can be moved with an S3 Lifecycle transition rule or a single S3 Batch Operations job.
  • Fix cross-region replication scope: $14,000/month of inter-region transfer for US-only tenants. Scope replication to EU-domiciled tenants only. Saves ~$11,000/month (some legitimate EU tenants remain).
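The Intelligent-Tiering estimate above can be reproduced with a short calculation. This is a minimal sketch: the two storage prices and the 85% cold share come from the text, while the object count (500 million) and the monitoring fee of $0.0025 per 1,000 objects are assumptions that should be checked against the real bucket inventory and current pricing.

```python
GB_PER_PB = 1_000_000  # decimal approximation, good enough for an estimate


def intelligent_tiering_net_saving(stored_pb: float,
                                   cold_fraction: float,
                                   object_count: int,
                                   standard_price: float = 0.023,
                                   cold_tier_price: float = 0.004,
                                   monitoring_fee_per_1000: float = 0.0025) -> float:
    """Estimated monthly saving (USD) after the monitoring fee.

    object_count and monitoring_fee_per_1000 are assumed inputs, not figures
    from the sample bill.
    """
    if not 0 <= cold_fraction <= 1:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_gb = stored_pb * GB_PER_PB * cold_fraction
    gross_saving = cold_gb * (standard_price - cold_tier_price)
    monitoring_cost = object_count / 1000 * monitoring_fee_per_1000
    return gross_saving - monitoring_cost


if __name__ == "__main__":
    net = intelligent_tiering_net_saving(1.2, 0.85, object_count=500_000_000)
    print(f"estimated net saving: ${net:,.0f} per month")
```

With those assumed inputs the result lands close to the ~$18,000/month quoted above. A bucket dominated by very small objects would behave differently, since the monitoring fee scales with object count rather than bytes.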

Tier 2 — Right-Sizing (Weeks 4–10, 0.5 FTE for 6 weeks):

  • 22% average CPU across 420 instances means significant over-provisioning. AWS Compute Optimizer generates right-sizing recommendations from ML analysis of utilisation history and attaches a performance-risk rating to each option. Conservative approach: act only on low-performance-risk (high-confidence) recommendations, moving instances down one size class. Expected reduction: 20–30% of instance costs. On $198,000, a 25% reduction is $49,500/month. An instance scheduler for non-production environments can add another $15–20k.
  • Right-sizing must happen before you buy Savings Plans — committing to over-provisioned instance types locks in the waste at a discount.

Tier 3 — Commitment Discounts (Month 2–3, FinOps lead + finance sign-off):

  • After right-sizing, the EC2 fleet will cost approximately $148,500/month on-demand ($198,000 minus $49,500). Pull the minimum hourly spend over the trailing 30 days (post-right-sizing): this is the safe baseline for a 3-year Compute Savings Plan. The commitment itself is expressed in discounted dollars per hour, so multiply the baseline by one minus the discount rate before purchasing. The 11 Aurora r6g.8xlarge multi-AZ clusters are stable, so purchase 3-year Partial Upfront RDS Reserved Instances for them. Combined estimated saving at 55–65% off on-demand: $60,000–75,000/month.
# Export AWS Compute Optimizer right-sizing recommendations to S3 (CSV).
# Enhanced infrastructure metrics, if wanted, are enabled separately with
# "aws compute-optimizer put-recommendation-preferences", not on the export call.
aws compute-optimizer export-ec2-instance-recommendations \
  --s3-destination-config bucket=my-finops-exports,keyPrefix=optimizer/ \
  --include-member-accounts

-- After export, query with Athena. Column names and finding values depend on
-- how the compute_optimizer_ec2 table is defined over the exported CSV.
SELECT
  instanceid,
  instancetype AS current_type,
  recommendationoptions_1_instancetype AS recommended_type,
  finding,  -- OVER_PROVISIONED / OPTIMIZED / UNDER_PROVISIONED
  utilizationmetrics_cpu_maximum AS max_cpu_pct,
  recommendationoptions_1_estimatedmonthlysavings_value AS monthly_saving_usd
FROM compute_optimizer_ec2
WHERE finding = 'OVER_PROVISIONED'
  AND recommendationoptions_1_performancerisk <= 2  -- low performance risk only
ORDER BY monthly_saving_usd DESC
LIMIT 50;

# Auto-stop a dev Aurora cluster at 18:00 UTC on weekdays (EventBridge Scheduler).
# A matching schedule targeting rds:startDBCluster handles the 08:00 restart.
aws scheduler create-schedule \
  --name "dev-aurora-stop-weekday" \
  --schedule-expression "cron(0 18 ? * MON-FRI *)" \
  --flexible-time-window Mode=OFF \
  --target '{
    "Arn": "arn:aws:scheduler:::aws-sdk:rds:stopDBCluster",
    "RoleArn": "arn:aws:iam::123456789012:role/SchedulerRole",
    "Input": "{\"DbClusterIdentifier\": \"dev-aurora-cluster-1\"}"
  }'
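Sizing the Savings Plan from the minimum hourly spend is easy to get wrong because the commitment is bought in discounted dollars per hour, not on-demand dollars. The sketch below shows that conversion under stated assumptions: the hourly series, the 60% discount rate, and the 730-hour month are illustrative inputs, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def savings_plan_commitment(hourly_on_demand_spend, discount_rate):
    """Hourly commitment (USD at Savings Plan rates) covering the minimum usage.

    hourly_on_demand_spend: on-demand-equivalent USD/hour for eligible compute
    over the trailing window, measured after right-sizing.
    """
    if not hourly_on_demand_spend:
        raise ValueError("need at least one hourly data point")
    if not 0 <= discount_rate < 1:
        raise ValueError("discount_rate must be in [0, 1)")
    baseline = min(hourly_on_demand_spend)
    return baseline * (1 - discount_rate)


def monthly_saving(hourly_on_demand_spend, discount_rate):
    """Estimated monthly saving on the covered baseline."""
    baseline = min(hourly_on_demand_spend)
    return baseline * discount_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical trailing window: spend dips to $150/hour overnight.
    hourly = [210.0, 195.0, 150.0, 180.0, 205.0]
    print(savings_plan_commitment(hourly, 0.60))
    print(monthly_saving(hourly, 0.60))
```

Using the window minimum is deliberately conservative: usage above the baseline stays on-demand, which costs more but leaves no unused commitment if the fleet shrinks.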

The 12-Month Savings Calendar

Sequencing matters. Doing commitment discounts before right-sizing wastes money. Fixing data transfer before understanding traffic patterns can break replication. The calendar below is the recommended execution order:

12-month FinOps savings roadmap, sequenced by tier (chart periods: M1-2, M3-4, M5-6, M7-9, M10-12):

  • Tier 0: Deletions (idle DBs, logs, EBS): ~$40,000/month saving; cumulative run-rate $440k/month (was $480k)
  • Tier 1: Architectural fixes (NAT endpoints, S3 Intelligent-Tiering, cross-region replication): ~$48,000/month saving; run-rate $392k/month
  • Tier 2: Right-sizing (EC2, Compute Optimizer): ~$49,500/month saving; run-rate $342k/month
  • Tier 3: Savings Plans + RDS RIs: ~$67,500/month saving; run-rate $275k/month
  • Operate: Governance (tags, budgets, CI): ongoing, prevents regression

Total 12-month run-rate saving: ~$205,000/month (43% reduction), $480k → $275k/month, $2.46M annualised.
12-month FinOps roadmap for the sample bill. Tiers must be executed in order — commitments always come last.
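The cumulative run-rate figures in the roadmap follow directly from the per-tier savings. A minimal sketch of that roll-up, using the tier estimates from this lesson (the function and variable names are illustrative):

```python
BASELINE = 480_000  # USD per month

# Per-tier monthly savings from the roadmap, in execution order.
TIERS = [
    ("Tier 0: Deletions", 40_000),
    ("Tier 1: Architectural fixes", 48_000),
    ("Tier 2: Right-sizing", 49_500),
    ("Tier 3: Commitment discounts", 67_500),
]


def run_rate_schedule(baseline, tiers):
    """Return (name, saving, run_rate_after) for each tier in order."""
    rows = []
    run_rate = baseline
    for name, saving in tiers:
        run_rate -= saving
        rows.append((name, saving, run_rate))
    return rows


schedule = run_rate_schedule(BASELINE, TIERS)
total_saving = sum(saving for _, saving in TIERS)
reduction_pct = total_saving / BASELINE * 100
annualised = total_saving * 12

if __name__ == "__main__":
    for name, saving, run_rate in schedule:
        print(f"{name:<30} -${saving:>7,}  run-rate ${run_rate:,}")
    print(f"total ${total_saving:,}/month ({reduction_pct:.1f}%), "
          f"${annualised:,} annualised")
```

Keeping the tiers in an ordered list makes the sequencing explicit: reordering them does not change the total, but it does change when each run-rate milestone is reached.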

Unit Economics: Closing the Loop

A savings roadmap that stops at "we reduced the bill" misses half the value. Mature FinOps connects cloud cost to business metrics. For a B2B SaaS, the key ratio is cost per tenant per month. At $480k/month serving 8,000 tenants, cost-per-tenant is $60. After the 12-month programme, the same 8,000 tenants cost $34/month — a 43% improvement that, if revenue is growing, means significantly improved gross margin.

Instrument this in your observability stack. Emit a daily metric cloud.cost_per_tenant to your Grafana/Datadog dashboard, plotted alongside revenue_per_tenant and gross_margin_pct. When cost-per-tenant starts rising without a corresponding feature investment, something went wrong — new workload without right-sizing, a data pipeline whose volume grew unexpectedly, a service that lost its Spot coverage after a spot-interruption failure. Catching these signals at the unit-economics level is faster than waiting for the monthly bill review.
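As a rough illustration of the unit-economics signal described above, the sketch below computes cost per tenant and flags a week-over-week rise. The 7-day window and 10% threshold are assumptions chosen for the example, and how the value reaches Grafana or Datadog is left out entirely.

```python
from statistics import mean


def cost_per_tenant(monthly_cost: float, tenants: int) -> float:
    """Monthly cloud cost divided by active tenants."""
    if tenants <= 0:
        raise ValueError("tenants must be positive")
    return monthly_cost / tenants


def is_regressing(daily_values, window: int = 7, threshold: float = 0.10) -> bool:
    """True when the latest window's mean exceeds the previous window's mean
    by more than threshold. window and threshold are illustrative defaults."""
    if len(daily_values) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(daily_values[-2 * window:-window])
    recent = mean(daily_values[-window:])
    return previous > 0 and (recent - previous) / previous > threshold


if __name__ == "__main__":
    print(cost_per_tenant(480_000, 8_000))   # before the programme
    print(cost_per_tenant(275_000, 8_000))   # after the programme
    print(is_regressing([34.0] * 7 + [40.0] * 7))
```

Comparing window means rather than single days keeps one noisy billing day from raising an alert.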

Present savings as gross margin improvement, not just dollar savings. A CTO and CFO engage with a statement like "our cloud gross margin improved from 58% to 67%" far more than "we saved $2.46M/year." Translate the savings roadmap into the business metric your leadership tracks before presenting it.

Governance: Preventing Regression

The most common failure mode of a FinOps programme is a 6-month sprint that achieves great results, followed by a 12-month slow drift back to the original spend as the organisation grows and nobody enforces the new patterns. Prevention requires three structural controls:

  1. Infracost in every Terraform PR. Cost diff is a required CI check, not optional. A PR that adds $5,000/month or more of new spend without a Jira ticket linking to a business justification is blocked until an engineer explicitly overrides it. This is exactly the same pattern as a security scanner blocking a PR with a critical CVE.
  2. Monthly FinOps reviews with team-level showback. Each squad sees their cost trend on the same slide deck as their SLO performance. Cost spikes get the same attention as error rate spikes.
  3. Tagging enforcement via AWS Config Rules / SCPs. Any resource created without the mandatory tags triggers an automatic remediation that tags it team=untagged and alerts the FinOps lead. Resources still tagged team=untagged after 7 days are eligible for auto-deletion in non-production accounts.
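The PR cost gate in the first control comes down to a small decision rule. The sketch below is one hypothetical way to express it; it assumes the monthly cost diff has already been extracted from the cost-estimation step in CI and does not model any particular tool's output format. The ticket-key regex is deliberately crude.

```python
import re

# Matches keys shaped like FIN-123. Crude: it would also match strings such
# as "UTF-8", so a real gate should validate the ticket against the tracker.
TICKET_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def cost_gate(monthly_diff_usd: float, pr_description: str,
              threshold_usd: float = 5000.0, override: bool = False):
    """Return (passed, reason) for a PR's estimated monthly cost increase."""
    if monthly_diff_usd < threshold_usd:
        return True, "below threshold"
    if TICKET_KEY.search(pr_description):
        return True, "justification ticket linked"
    if override:
        return True, "explicit engineer override"
    return False, "blocked: new spend at or above threshold without a ticket"


if __name__ == "__main__":
    print(cost_gate(1200.0, "Bump instance count by one"))
    print(cost_gate(7500.0, "Add cache cluster, see FIN-123"))
    print(cost_gate(7500.0, "Add cache cluster"))
```

Returning a reason string alongside the boolean lets the CI job post a comment explaining why a PR was blocked, which tends to matter more for adoption than the gate itself.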
The commitment discount trap after headcount reductions: if your organisation has purchased 3-year Savings Plans based on a 500-engineer headcount and then reduces to 300 engineers with a correspondingly smaller fleet, part of the committed spend becomes waste. Savings Plans cannot be cancelled early and, unlike Standard EC2 Reserved Instances, cannot be resold on the Reserved Instance Marketplace. Whenever a company signals a reduction in force or a major product sunset, immediately model the impact on commitment utilisation and stop new long-term purchases until the new steady-state is clear.
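Modelling the impact on commitment utilisation can start as a very small calculation. The sketch below assumes an hourly commitment and an estimate of eligible usage after the fleet shrinks, both expressed at Savings Plan rates; the $60/hour commitment and the 40% shrink are made-up inputs that mirror the 500-to-300 headcount example.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def commitment_utilisation(committed_per_hour: float,
                           eligible_usage_per_hour: float):
    """Return (utilisation, wasted_usd_per_month) for an hourly commitment.

    eligible_usage_per_hour is measured at Savings Plan rates, so it is
    directly comparable with the commitment.
    """
    if committed_per_hour <= 0:
        raise ValueError("committed_per_hour must be positive")
    used = min(committed_per_hour, max(eligible_usage_per_hour, 0.0))
    utilisation = used / committed_per_hour
    wasted = (committed_per_hour - used) * HOURS_PER_MONTH
    return utilisation, wasted


if __name__ == "__main__":
    # Hypothetical: $60/hour committed, fleet shrinks so usage falls to $36/hour.
    utilisation, wasted = commitment_utilisation(60.0, 36.0)
    print(f"utilisation {utilisation:.0%}, wasted ${wasted:,.0f} per month")
```

Running this for a few shrink scenarios before any new purchase gives finance a concrete downside number to weigh against the discount.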

Your Deliverables

As the engineer who owns the cost optimization programme, your output by end of month one should be: a one-page executive summary with four numbers (current spend, annualised saving opportunity, 12-month target run-rate, and cost-per-tenant before/after); a Terraform module implementing VPC endpoints and S3 Intelligent-Tiering; a Jira epic with one ticket per Tier 0 and Tier 1 initiative, each with dollar estimates in the acceptance criteria; and a Grafana dashboard with cloud.cost_per_tenant, Savings Plan utilisation percentage, and top-5 services by spend. That artefact set is how you demonstrate FinOps maturity at the senior engineer and staff engineer level.