FinOps & Cloud Cost Optimization

Why FinOps?

Cloud computing promised economics that matched spend to usage. The reality, for most organisations that scaled past a few dozen engineers, has been the opposite: bills that grew faster than headcount, surprise charges that showed up weeks late, and engineering teams that had no idea what their services actually cost. FinOps, a portmanteau of Finance and DevOps, is the discipline that closes that gap. Before you learn to optimise anything, you need to understand why the problem exists in the first place and what the practice framework looks like at big-tech scale.

The Cloud Spend Problem

Three structural forces combine to make cloud cost ungovernable without deliberate effort.

Decentralised provisioning with centralised billing. Any developer with the right IAM role can terraform apply a GPU cluster or a NAT gateway. The charge shows up on one consolidated invoice 30 days later, attributed only to an account or a project — not to the service or the team that created it. By the time finance flags the anomaly, the resource may already be gone or the engineer who created it may have moved on.
Pricing complexity. AWS alone publishes over 2 million SKU price points. EC2 pricing varies by region, OS, tenancy, purchase option, and network placement. Data transfer fees are particularly opaque: traffic between two AZs is billed differently from the same traffic crossing a VPC peering link, and both differ from traffic leaving the region. Nobody memorises this; most engineers have a vague intuition that is usually wrong by a factor of two to ten.
The lag between action and signal. Cloud bills are not real-time. Cost Explorer data is typically 24 hours stale. Committed-use discount analysis requires months of historical usage data. A team that ships a new feature with an architectural mistake (say, a fanout that reads millions of S3 objects per request) often does not notice the cost impact until two or three billing cycles later unless someone is watching the daily data, by which point the feature is deeply embedded in production.

The result at scale is predictable: a SaaS company growing 3x year-over-year often finds that cloud spend grows 5–8x in the same period because of accumulated waste — oversized instances nobody right-sized, dev environments running 24/7, cross-region data copies that nobody decommissioned, and on-demand pricing on workloads that have been stable for 18 months. Gartner and McKinsey both estimate that 30–35% of enterprise cloud spend is waste. At $10M/month of cloud spend that is $3–3.5M walking out the door every month.

Why this matters for a DevOps engineer: Cost is a system property, not a finance problem. The engineer who designs the inter-service communication pattern, chooses the storage tier, or decides whether a job runs on a Spot instance is making a cost decision. FinOps simply makes that decision visible and explicit.

The FinOps Foundation Framework

The FinOps Foundation (finops.org), now a Linux Foundation project, standardised the practice around three phases that form a continuous loop: Inform, Optimize, and Operate. This is not a one-time project; it is an operating model that runs permanently in parallel with your engineering delivery cycles.

The FinOps lifecycle: a continuous loop of Inform → Optimize → Operate, never a one-time project.

Phase 1 — Inform

You cannot optimise what you cannot see. The Inform phase is about achieving cost visibility at the granularity of a team, a service, and eventually a single unit of business value (a user, a transaction, a request). The key deliverables are:

Tagging taxonomy. Every resource tagged with env, team, service, and cost-centre. Without this, cost allocation is guesswork. AWS tag policies and service control policies, Azure Policy, and GCP organisation policy constraints can block or flag untagged resources at creation time, while AWS Config rules detect non-compliant resources after the fact. Enforce before you scale: retrofitting tags onto 50,000 resources is a multi-quarter project.
Showback and chargeback. At minimum, weekly reports emailed to team leads showing what their services cost. Chargeback — actually debiting the team's budget — follows once showback data is trusted. The psychological effect of seeing your service cost on your OKR dashboard is enormous.
Anomaly detection. AWS Cost Anomaly Detection, GCP Budget alerts, and Azure Cost Alerts provide automated signals when spend deviates from a rolling baseline. A 20% day-over-day spike in one service is worth investigating before the month closes.
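
As a minimal sketch of the tagging-taxonomy check, assume resources have already been exported as dictionaries with a `tags` mapping. The `REQUIRED_TAGS` tuple mirrors the four keys named above; the inventory records and both helper functions are hypothetical and do not correspond to any provider's API.

```python
from typing import Dict, Iterable, List

# Required keys taken from the tagging taxonomy in this lesson.
REQUIRED_TAGS = ("env", "team", "service", "cost-centre")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return required tag keys that are absent or blank, in taxonomy order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def untagged_report(resources: Iterable[dict]) -> Dict[str, List[str]]:
    """Map resource id -> missing tag keys, for non-compliant resources only."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource.get("tags", {}))
        if gaps:
            report[resource["id"]] = gaps
    return report


# Hypothetical inventory export; ids and tag values are made up.
inventory = [
    {"id": "vol-001", "tags": {"env": "production", "team": "payments",
                               "service": "ledger", "cost-centre": "cc-12"}},
    {"id": "i-002", "tags": {"env": "dev", "team": ""}},
    {"id": "nat-003"},
]

for resource_id, gaps in untagged_report(inventory).items():
    print(f"{resource_id}: missing {', '.join(gaps)}")
```

A report like this is useful for the retrofit case; for new resources, a policy that rejects the create call is the stronger control.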

# AWS CLI: last 7 days of cost grouped by SERVICE and team tag (production only)
# End is exclusive, so this range covers 2025-06-04 through 2025-06-10.
# Tag group keys come back in the form "team$<value>".
aws ce get-cost-and-usage \
  --time-period Start=2025-06-04,End=2025-06-11 \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
            Type=TAG,Key=team \
  --filter '{"Tags":{"Key":"env","Values":["production"]}}' \
  --query 'ResultsByTime[].Groups[].{Service:Keys[0],Team:Keys[1],Cost:Metrics.UnblendedCost.Amount}' \
  --output table

# GCP: per-label cost breakdown via BigQuery billing export
# (assumes billing export is already configured to dataset 'billing_export')
bq query --use_legacy_sql=false '
SELECT
  label.value AS team,
  service.description AS service,
  SUM(cost) AS total_cost
FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
CROSS JOIN UNNEST(labels) AS label
WHERE label.key = "team"
  AND DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY 1, 2
ORDER BY total_cost DESC
LIMIT 50;
'
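
The rolling-baseline idea behind the anomaly detection deliverable can be sketched on daily totals like those the queries above return. This is an illustration only and not how any provider's detector works internally; the 7-day window, the 20% threshold, and the sample figures are all assumptions.

```python
from statistics import mean
from typing import List, Tuple


def find_spikes(daily_costs: List[float], window: int = 7,
                threshold: float = 0.20) -> List[Tuple[int, float]]:
    """Return (day_index, relative_deviation) for days whose cost exceeds
    the mean of the preceding `window` days by more than `threshold`."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline <= 0:
            continue  # no meaningful baseline to compare against
        deviation = (daily_costs[i] - baseline) / baseline
        if deviation > threshold:
            spikes.append((i, deviation))
    return spikes


# Made-up daily costs in dollars for one service.
costs = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 105.0, 150.0]
for day, deviation in find_spikes(costs):
    print(f"day {day}: {deviation:.0%} above the trailing baseline")
```

Comparing against a trailing mean rather than the previous day alone makes the check less sensitive to ordinary weekday and weekend swings.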

Phase 2 — Optimize

Once you can see your costs, you can reduce them. Optimization is a portfolio of interventions, each with a different time horizon and complexity:

Immediate wins (days): delete idle resources such as unattached EBS volumes, unused Elastic IPs, EC2 instances that have been stopped for 30 days (their attached EBS volumes still bill), and orphaned load balancers. AWS Trusted Advisor helps with discovery; the open-source cloud-nuke tool can list and delete resources in bulk, but it is destructive by design, so restrict it to sandbox accounts. Most organisations find 5–15% of spend here on the first sweep.
Medium-term wins (weeks): right-sizing instances, moving workloads to Graviton/ARM, enabling S3 Intelligent-Tiering, setting lifecycle policies on CloudWatch Logs (default retention is forever; change it to 30 days unless compliance requires longer).
Structural wins (months): commitment discounts — Reserved Instances, Savings Plans, GCP CUDs — typically deliver 40–70% off on-demand for stable baseline workloads. Spot/Preemptible for fault-tolerant batch. Architectural changes like replacing a polling pattern with an event-driven fanout, or replacing a NAT gateway with a VPC endpoint for S3 and DynamoDB traffic.

Tier your optimisation effort by ROI: a 3-engineer week spent getting Savings Plan coverage from 60% to 75% on $500k/month of compute moves roughly $75k/month of on-demand spend onto committed rates; at the 40–70% discounts quoted above, that saves roughly $30k–$52k/month for the life of the commitment. The same 3-engineer week spent shaving 5ms off a Lambda cold start saves almost nothing in cost. Always calculate the dollar value of the saving before committing engineering time to it.
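
To put a dollar figure on a coverage change, a rough model is the spend shifted from on-demand to commitment multiplied by the discount rate. The sketch below assumes a single flat discount across the shifted usage, which is a simplification: real commitment rates vary by instance family, region, and term. The function name is illustrative.

```python
def monthly_saving(on_demand_equiv_spend: float, coverage_before: float,
                   coverage_after: float, discount_rate: float) -> float:
    """Estimate the monthly saving from raising commitment coverage.

    on_demand_equiv_spend: compute spend valued at on-demand rates ($/month)
    coverage_before/after: fraction of that spend covered by commitments
    discount_rate: fraction off on-demand that the commitment delivers
    """
    shifted = on_demand_equiv_spend * (coverage_after - coverage_before)
    return shifted * discount_rate


# The scenario from this lesson: $500k/month of compute, coverage 60% -> 75%,
# evaluated at both ends of an assumed 40-70% discount range.
for rate in (0.40, 0.70):
    saving = monthly_saving(500_000, 0.60, 0.75, rate)
    print(f"discount {rate:.0%}: about ${saving:,.0f}/month")
```

The 15-point coverage change shifts $75k/month of spend; the saving is that amount times the discount rate, so the discount assumption matters as much as the coverage target.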

Phase 3 — Operate

Operate is where FinOps becomes culture rather than a project. The goal is to embed cost awareness into every engineering workflow so that optimisation happens continuously rather than in quarterly fire drills. The mechanisms are:

Cost gates in CI/CD. Tools like Infracost integrate into Terraform PRs and post a cost diff comment before merge. An engineer adding a new RDS multi-AZ instance sees the $400/month impact before the code ships. This is the FinOps equivalent of shifting security left.
Per-team budget alerts at 80% and 100%. Alerts go to the team Slack channel, not just to finance. The team owns the response.
Unit economics tracking. Cost per transaction, cost per active user, cost per API call — tracked in the same dashboards as latency and error rate. When cost-per-user starts climbing while user count is flat, something architectural has changed and the team sees it immediately.
Regular FinOps reviews. Monthly reviews at the team level, quarterly reviews at the VP/CTO level. Review what was committed, what was optimised, and what the next quarter's target is.
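
The unit-economics signal described above (cost per user climbing while user count is flat) can be sketched as a period-over-period comparison. The thresholds, the `Period` type, and the sample history are assumptions chosen for illustration, not recommended values.

```python
from typing import List, NamedTuple


class Period(NamedTuple):
    label: str
    cost: float          # total service cost for the period, in dollars
    active_users: int    # active users in the same period


def cost_per_user(period: Period) -> float:
    if period.active_users <= 0:
        raise ValueError(f"no active users in {period.label}")
    return period.cost / period.active_users


def unit_cost_regressions(periods: List[Period],
                          max_user_growth: float = 0.05,
                          min_unit_cost_growth: float = 0.10) -> List[str]:
    """Labels of periods where cost per user rose by at least
    `min_unit_cost_growth` while users grew by at most `max_user_growth`,
    each compared with the previous period."""
    flagged = []
    for prev, cur in zip(periods, periods[1:]):
        unit_growth = cost_per_user(cur) / cost_per_user(prev) - 1
        user_growth = cur.active_users / prev.active_users - 1
        if user_growth <= max_user_growth and unit_growth >= min_unit_cost_growth:
            flagged.append(cur.label)
    return flagged


# Made-up figures for illustration.
history = [
    Period("2025-03", 40_000.0, 100_000),
    Period("2025-04", 44_000.0, 110_000),   # cost grows in step with users
    Period("2025-05", 55_000.0, 111_000),   # cost jumps while users stay flat
]
print(unit_cost_regressions(history))
```

In practice the same ratio would be plotted next to latency and error rate; the function only shows what "something architectural has changed" looks like in numbers.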

# Infracost: estimate cost impact of a Terraform change before merging
# Install: brew install infracost  (or download from infracost.io/docs/installation)
infracost auth login

# Run against a Terraform plan
cd infra/
terraform plan -out tfplan.binary
terraform show -json tfplan.binary > tfplan.json

infracost diff --path tfplan.json \
  --format table \
  --show-skipped

# Example output (illustrative figures and layout, truncated):
# Name                                Monthly Qty  Unit       Monthly Cost
# aws_db_instance.primary
#  ├── Database instance (db.r6g.xlarge)      730  hours       $219.00
#  ├── Storage (gp3, 100 GB)                  100  GB            $11.50
#  └── Multi-AZ                               730  hours       $219.00
# OVERALL TOTAL                                                 $449.50
# +$449.50 vs. baseline ($0)
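
A cost gate can be built on the machine-readable form of the same report. The sketch below assumes the JSON output of `infracost diff` carries a top-level `diffTotalMonthlyCost` field holding the monthly delta as a string; verify that against the output of your Infracost version. The `gate` helper and the $500 threshold are hypothetical.

```python
import json
from decimal import Decimal


def monthly_delta(report: dict) -> Decimal:
    """Read the monthly cost delta from a parsed Infracost JSON report.

    Assumes a top-level "diffTotalMonthlyCost" string field; a missing
    or null value is treated as zero.
    """
    return Decimal(report.get("diffTotalMonthlyCost") or "0")


def gate(report: dict, max_increase: Decimal) -> bool:
    """True when the change stays within the allowed monthly increase."""
    return monthly_delta(report) <= max_increase


# Stand-in for a parsed report file, reusing the figure from the example above.
sample = json.loads('{"diffTotalMonthlyCost": "449.50"}')
limit = Decimal("500")  # hypothetical per-PR threshold in dollars
print(f"delta ${monthly_delta(sample)}, within limit: {gate(sample, limit)}")
```

In a pipeline, the report would typically be written with `infracost diff --path tfplan.json --format json --out-file infracost.json`, loaded with `json.load`, and the job would exit non-zero when `gate` returns False. Whether that failure blocks the merge or only requests a review is a team decision.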

Personas: Who Does FinOps?

The FinOps Foundation framework defines a set of personas; three groups in particular must collaborate for the framework to work:

Engineering — makes the architectural and provisioning decisions that determine cost. Responsible for tagging, right-sizing, and implementing Spot/Savings Plans.
Finance — owns the budget, does chargeback, and tracks cloud spend against forecasts. Needs granular, trusted allocation data from Engineering.
Product/Business — owns unit economics. Determines which cost targets are acceptable given the business model (a low-margin SaaS has a very different cost tolerance than a high-margin enterprise product).

At a company like Netflix or Spotify, FinOps is a dedicated team of 10–20 engineers and analysts. At a mid-size SaaS, it is typically a shared responsibility across a platform team and a finance business partner, meeting weekly. At a startup, it is one person looking at the Cost Explorer dashboard once a month. The tooling and cadence scale; the principles do not change.

The "we will optimise later" trap: the most expensive FinOps debt is architectural debt. A choice such as a synchronous, per-request database call where an async batch job would do can be on the order of 100x cheaper to reverse at design time than after 50 microservices have adopted the same pattern. Build cost review into your architecture review process from day one, not as a gate, but as a column in the decision matrix alongside latency, reliability, and security.

Getting Started: The First 30 Days

If you are the engineer tasked with starting a FinOps programme from scratch, the first 30 days should produce three things: a tagging standard that is enforced by policy, a Slack channel or dashboard that shows each team's weekly cost, and a list of the top 10 idle resources with owners identified. Nothing else. Resist the temptation to buy a FinOps platform (Apptio Cloudability, CloudHealth, Spot.io) on day one — you do not yet understand your own data well enough to configure it correctly. Start with native tools (Cost Explorer, BigQuery Billing Export, Azure Cost Management), build intuition, then evaluate third-party platforms from a position of knowledge.
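
The weekly per-team cost view can start as a few lines of aggregation over whatever the native tools export. The sketch assumes rows of (team, service, cost) such as a billing query might return; the row values are made up, and posting the result to a chat channel is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def weekly_cost_by_team(rows: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Sum cost per team from (team, service, cost) rows.

    Rows with an empty team are grouped under "untagged" so that
    unallocated spend stays visible instead of disappearing.
    """
    totals = defaultdict(float)
    for team, _service, cost in rows:
        totals[team or "untagged"] += cost
    return dict(totals)


def format_report(totals: Dict[str, float]) -> List[str]:
    """One line per team, most expensive first."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [f"{team}: ${cost:,.2f}" for team, cost in ranked]


# Made-up rows for one week.
rows = [
    ("payments", "Amazon RDS", 1200.0),
    ("search", "Amazon EC2", 3400.5),
    ("payments", "Amazon S3", 310.25),
    ("", "AWS Data Transfer", 95.0),
]
print("\n".join(format_report(weekly_cost_by_team(rows))))
```

Keeping an explicit "untagged" line in the report gives the tagging standard a visible scoreboard during the first 30 days.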