FinOps & Cloud Cost Optimization

Understanding Cloud Bills

18 min Lesson 3 of 26

Cloud bills are not random. Every dollar on an AWS Cost Explorer, GCP Billing, or Azure Cost Management invoice maps to a discrete metered event: a CPU-second consumed, a gigabyte stored, a byte transferred out of a region. The problem is that cloud providers surface those events through hundreds of SKUs, dozens of pricing dimensions, and free-tier traps that make the total look opaque. This lesson teaches you to read a bill the way a FinOps engineer at Netflix or Lyft would — identifying the three fundamental cost drivers (compute, storage, egress) and the predictable culprits that inflate each one.

The 80/20 rule of cloud spend: In a typical production workload, compute accounts for 50–70% of spend, storage for 10–20%, and data transfer (egress) for 5–20%. The remaining 5–15% covers managed database services, DNS, monitoring, and miscellaneous charges. Knowing this distribution tells you where to focus: a 30% reduction in compute spend saves several times more than a 30% reduction in egress, unless your egress bill is anomalously large (common in media streaming or export-heavy SaaS).
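To make that prioritization concrete, the sketch below compares the dollar impact of the same percentage cut applied to two categories. The $100,000 monthly bill and the 60% and 10% shares are illustrative assumptions chosen from within the ranges above, and `monthly_savings` is a hypothetical helper, not part of any cloud CLI.

```shell
# monthly_savings BILL SHARE_PCT CUT_PCT
# Prints the dollars saved per month when a category that makes up SHARE_PCT
# of BILL is reduced by CUT_PCT.
monthly_savings() {
  awk -v bill="$1" -v share="$2" -v cut="$3" \
    'BEGIN { printf "%.0f\n", bill * share / 100 * cut / 100 }'
}

# Illustrative figures: $100,000/month bill, compute at 60%, egress at 10%.
echo "30% cut in compute: \$$(monthly_savings 100000 60 30)/month"
echo "30% cut in egress:  \$$(monthly_savings 100000 10 30)/month"
```

With these assumed shares the compute cut is worth six times the egress cut, which is why the review usually starts with compute.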

Compute: The Dominant Cost Driver

Compute cost has three sub-dimensions that you should read separately on every line item: instance hours (vCPU + RAM time), OS licensing (the Windows/RHEL surcharge on top of hardware), and accelerator capacity (GPU or FPGA, billed as a separate SKU on some providers and folded into the instance type price on others). Reading an EC2 line item without understanding all three gives you an incomplete picture.

The usual suspects that inflate compute bills:

Zombie instances: Instances left running after a load test, a demo, or a dev environment that was never torn down. At $0.096/hr for an m5.large, a forgotten instance costs $840/year. Multiply by the 15–30 instances engineering teams typically leave behind and you have a $13–25k annual leak with zero business value.
Over-provisioned instance families: Picking m5.2xlarge because it was the default in Terraform and nobody questioned it. Even at large tech companies, average CPU utilization across on-demand compute is often only 12–25%. Right-sizing (Lesson 4) recovers much of this, but you cannot right-size what you cannot see.
Missing Savings Plans or Reserved Instances: On-demand pricing is the list price: the ceiling, not the floor. AWS Compute Savings Plans give 40–60% off on-demand for a 1- or 3-year commitment. Running $200k/month of steady-state EC2 on-demand when you could be on a Savings Plan is an $80–120k/month leak, roughly $1–1.4M/year. (Lesson 5 covers commitments in depth.)
Auto-Scaling misconfiguration: Scale-out policies that add capacity faster than scale-in policies remove it, or minimum fleet sizes set too conservatively for off-hours traffic. A minimum of 10 instances that should be 3 during nights and weekends pays for 7 surplus instances through every off-peak hour, which is most of the 168 hours in a week.
Idle NAT Gateways and Load Balancers: Each AWS NAT Gateway costs $0.045/hr (~$390/yr) whether or not it carries traffic, plus $0.045/GB processed. An ALB costs about $0.0225/hr plus $0.008 per LCU-hour. Environments that were deployed for a sprint and never deleted leave these resources running indefinitely.
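The leak estimates in this list are simple rate-times-hours arithmetic, and it is worth being able to reproduce them with your own rates. The sketch below is one way to do that; `annual_cost` and `commitment_gap` are hypothetical helper names, and the inputs are the hourly rates, instance counts, and discount range quoted above.

```shell
# annual_cost HOURLY_RATE [COUNT]
# Dollars per year for COUNT resources billed at HOURLY_RATE around the clock
# (8,760 hours per year).
annual_cost() {
  awk -v rate="$1" -v n="${2:-1}" 'BEGIN { printf "%.0f\n", rate * 8760 * n }'
}

# commitment_gap MONTHLY_ON_DEMAND DISCOUNT_PCT
# Dollars per month overpaid by staying on demand instead of taking a
# commitment discount of DISCOUNT_PCT.
commitment_gap() {
  awk -v spend="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", spend * pct / 100 }'
}

echo "One forgotten instance at \$0.096/hr: \$$(annual_cost 0.096)/year"
echo "Fifteen to thirty of them: \$$(annual_cost 0.096 15) to \$$(annual_cost 0.096 30)/year"
echo "One idle NAT Gateway at \$0.045/hr: \$$(annual_cost 0.045)/year"
echo "\$200k/month on demand at a 40% to 60% discount: \$$(commitment_gap 200000 40) to \$$(commitment_gap 200000 60)/month"
```

Note that the commitment gap is a monthly figure because the input spend is monthly; multiply by 12 for the annual leak.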

# Query your AWS bill for compute line items using the Cost Explorer CLI.
# This retrieves May 2025 (the End date is exclusive), broken down by instance type and purchase option.

aws ce get-cost-and-usage \
  --time-period Start=2025-05-01,End=2025-06-01 \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --group-by '[
    {"Type":"DIMENSION","Key":"INSTANCE_TYPE"},
    {"Type":"DIMENSION","Key":"PURCHASE_TYPE"}
  ]' \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Groups[*].[Keys[0],Keys[1],Metrics.UnblendedCost.Amount]' \
  --output table

# Equivalent GCP: list Compute Engine spend by SKU description
gcloud billing accounts list   # get ACCOUNT_ID
bq query --use_legacy_sql=false \
  'SELECT sku.description, SUM(cost) AS total_cost
   FROM `billing_export_dataset.gcp_billing_export_v1_XXXXXX`
   WHERE service.description = "Compute Engine"
     AND DATE(_PARTITIONTIME) >= "2025-05-01" AND DATE(_PARTITIONTIME) < "2025-06-01"
   GROUP BY 1 ORDER BY 2 DESC LIMIT 20'

Storage: Silent but Accumulating

Storage costs are deceptive because they grow incrementally and rarely trigger an alert on their own. A team that deletes nothing adds gigabytes every sprint: old container images, superseded database snapshots, orphaned EBS volumes left behind by terminated instances, and S3 buckets that became a dumping ground for logs.

Key storage cost categories:

EBS volumes left behind by terminated instances: When an EC2 instance is terminated and DeleteOnTermination is not set to true on a volume (it defaults to true only for the root volume), the EBS volume persists and continues billing at $0.08–$0.10/GB-month. A 500 GB gp3 volume costs $40/month doing nothing. Multiply that by a team that terminates dozens of instances per month and never checks for orphaned volumes.
S3 storage class mismatches: S3 Standard at $0.023/GB-month vs S3 Glacier Instant Retrieval at $0.004/GB-month for data that is accessed once a quarter. Logs, backups, and ML training datasets sitting in Standard because nobody configured a lifecycle policy is a 5–6x overpayment.
Snapshot sprawl: RDS automated backups create daily snapshots retained for 7 days by default, which is acceptable. But manual snapshots created during a maintenance window and never deleted accumulate indefinitely. A 2 TB RDS instance with 18 months of weekly manual snapshots (about 78 of them) can hold ~70 TB of snapshot storage; at $0.095/GB-month that is $6,650/month in snapshot costs alone, almost certainly more than the instance itself costs.
Container registry bloat: Every docker push to ECR or GCR stores a new image manifest plus any layers the registry has not seen before. An active team pushing 20 builds/day with a 1 GB image and no lifecycle policy accumulates about 7,300 images per year, which approaches 7 TB if few layers are shared between builds. ECR charges $0.10/GB-month after the free tier.
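The storage figures above all reduce to gigabytes multiplied by a per-GB-month rate. A minimal sketch of that calculation follows; `storage_monthly_cost` is a hypothetical helper, the prices are the ones quoted in this section, and the 10 TB log archive is an assumed example size.

```shell
# storage_monthly_cost GB PRICE_PER_GB_MONTH
# Prints the monthly charge in dollars for GB of storage at the given rate.
storage_monthly_cost() {
  awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

echo "Orphaned 500 GB gp3 volume: \$$(storage_monthly_cost 500 0.08)/month"
echo "70 TB of RDS snapshots: \$$(storage_monthly_cost 70000 0.095)/month"
# Assumed example: a 10 TB log archive priced in two S3 storage classes.
echo "10 TB in S3 Standard: \$$(storage_monthly_cost 10000 0.023)/month"
echo "10 TB in Glacier Instant Retrieval: \$$(storage_monthly_cost 10000 0.004)/month"
```

The sketch ignores retrieval and request fees, which matter for colder storage classes, so treat the class comparison as an upper bound on the savings.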

Figure: The three primary cloud bill drivers. Compute dominates at 50–70%, storage accumulates silently, and egress spikes with cross-AZ or cross-region data flows. Red boxes indicate the most common waste patterns.

Egress: The Hidden Tax That Surprises Teams

Egress pricing is deliberately asymmetric: ingress (data coming into the cloud) is free; egress (data leaving) is expensive. AWS charges approximately $0.09/GB for the first 10 TB/month of internet egress, dropping to $0.085/GB at scale. GCP charges $0.08/GB for the first 10 TB. Azure charges $0.087/GB. These numbers seem small until you operate a data platform moving terabytes daily.

The egress cost categories that catch teams off guard:

Cross-AZ traffic: AWS charges $0.01/GB in each direction for data crossing Availability Zones. This is invisible in development (single-AZ) but becomes material in production multi-AZ deployments. A microservices architecture making 100,000 synchronous cross-service calls per minute, each returning 5 KB, crosses AZ boundaries on roughly 50% of those calls in a 3-AZ setup. That is 15 GB/hr, or about 11 TB and $110/month at $0.01/GB, from a single service pair. Multiply by 50 microservices.
Cross-region replication: Replicating an S3 bucket or RDS read replica across regions incurs both storage and data transfer charges. Replicating 5 TB of data from us-east-1 to eu-west-1 at $0.02/GB costs $20/TB, or $100 one-time, plus ongoing replication traffic.
NAT Gateway data processing: Every byte from a private subnet traversing a NAT Gateway is charged at $0.045/GB on top of the gateway hourly cost. Lambda functions in a VPC hitting an S3 endpoint through NAT (instead of via a VPC endpoint) is a canonical example of paying $0.045/GB for data that could cost nothing.
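Direct API egress vs. CloudFront: Serving large assets (images, videos, downloadable files) directly from an EC2 origin or S3 costs $0.09/GB. Fronting them with CloudFront moves that egress to the CDN, where the North America and Europe list price starts at $0.085/GB and transfer from an AWS origin to CloudFront is not charged. The larger savings come from CloudFront's monthly free tier, volume tiers, and committed-use discounts, and every cache hit also removes load from the origin.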
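Cross-AZ cost scales linearly with call rate, payload size, and the share of calls that leave the AZ, so it is worth recomputing with your own numbers. The sketch below assumes decimal units (1 GB = 1,000,000 KB), a 730-hour month, and a one-direction price; `cross_az_estimate` is a hypothetical helper. It evaluates the 5 KB payload and 50% cross-AZ share from the list above at two rates: 1,667 calls per second (about 100,000 per minute) and 100,000 calls per second.

```shell
# cross_az_estimate CALLS_PER_SEC PAYLOAD_KB CROSS_AZ_PCT PRICE_PER_GB
# Prints "<GB per hour> <dollars per month>" for traffic crossing AZs.
cross_az_estimate() {
  awk -v cps="$1" -v kb="$2" -v pct="$3" -v price="$4" 'BEGIN {
    # KB per second -> GB per hour (decimal units), then keep the cross-AZ share.
    gb_per_hour = cps * kb * 3600 / 1e6 * pct / 100
    # Assumes 730 hours in a month.
    printf "%.1f %.2f\n", gb_per_hour, gb_per_hour * 730 * price
  }'
}

echo "1,667 calls/s:   $(cross_az_estimate 1667 5 50 0.01)"
echo "100,000 calls/s: $(cross_az_estimate 100000 5 50 0.01)"
```

Where both sides of a transfer are billed, double the price argument. The gap between the two rates shows how quickly a chatty service pair moves from a rounding error to a line item worth redesigning around.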

# Identify your top egress charges with AWS Cost Explorer and VPC Flow Logs.
# Step 1 — find the top data transfer line items for the last month
aws ce get-cost-and-usage \
  --time-period Start=2025-05-01,End=2025-06-01 \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["AWS Data Transfer"]
    }
  }' \
  --group-by '[{"Type":"DIMENSION","Key":"USAGE_TYPE"}]' \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Groups[*].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -k2,2 -rn | head -20

# Step 2: find the heaviest internal flows with VPC Flow Logs, then map each address
# pair to its subnets to see which flows actually cross AZs.
# Query with Athena (assumes a flow logs table already exists and the log format includes az-id):
# SELECT srcaddr, dstaddr, SUM(bytes)/1073741824.0 AS gb_transferred
# FROM vpc_flow_logs
# WHERE action = 'ACCEPT'
#   AND az_id IS NOT NULL
#   AND srcaddr LIKE '10.%'          -- 10.0.0.0/8 only; adjust to your VPC CIDR
# GROUP BY 1, 2
# ORDER BY 3 DESC
# LIMIT 30;

# Step 3 — check for Lambda or EC2 hitting S3 through NAT instead of VPC endpoint
aws ec2 describe-vpc-endpoints \
  --query 'VpcEndpoints[?ServiceName==`com.amazonaws.us-east-1.s3`].[VpcId,State]' \
  --output table
# If empty: you have no S3 VPC endpoint — all S3 traffic routes through NAT Gateway

The VPC endpoint trap: Creating a gateway VPC endpoint for S3 is free (interface endpoints, by contrast, are billed hourly and per GB). Routing the same traffic through a NAT Gateway costs $0.045/GB. A Lambda-heavy workload reading from S3 without a VPC endpoint can generate a NAT Gateway bill that is larger than the Lambda execution cost. This is one of the highest-ROI five-minute fixes in cloud infrastructure: create the gateway endpoint, associate it with your private route tables, and same-region S3 traffic is rerouted with no data processing charge.
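To size the fix before making it, the sketch below estimates what a NAT Gateway costs per month for a given volume of S3 traffic. It uses the $0.045/hr and $0.045/GB rates quoted above and assumes a 730-hour month; `nat_monthly_cost` is a hypothetical helper and the 20 TB volume is an illustrative assumption.

```shell
# nat_monthly_cost GB_PER_MONTH [GATEWAYS]
# Prints the monthly NAT Gateway charge in dollars: data processing plus the
# hourly fee for each gateway (730 hours per month assumed).
nat_monthly_cost() {
  awk -v gb="$1" -v gw="${2:-1}" \
    'BEGIN { printf "%.2f\n", gb * 0.045 + gw * 0.045 * 730 }'
}

# Assumed workload: 20 TB/month of S3 reads from a private subnet.
echo "Through one NAT Gateway: \$$(nat_monthly_cost 20000)/month"
echo "Idle gateway, no traffic: \$$(nat_monthly_cost 0)/month"
```

Moving the S3 traffic to a gateway endpoint removes the per-GB term; the hourly term remains for as long as the NAT Gateway exists for other destinations.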

Reading the Bill Systematically: A FinOps Workflow

Every FinOps review should follow the same top-down decomposition. Start at the total, break it into service categories, then drill into the biggest movers and anomalies. AWS Cost Explorer, GCP Billing Reports, and the aws ce CLI all support this pattern.

#!/usr/bin/env bash
# A structured cost triage script; run it monthly from a CI job or cron.
# Prints a plain-text cost summary to stdout, which can be piped to Slack or saved as a report.
set -euo pipefail

MONTH_START=$(date -d "$(date +%Y-%m-01) -1 month" +%Y-%m-%d 2>/dev/null \
  || date -v1d -v-1m +%Y-%m-%d)   # GNU date first, BSD/macOS date as fallback
MONTH_END=$(date +%Y-%m-01)

echo "=== Cloud Cost Report: ${MONTH_START} to ${MONTH_END} ==="

# 1. Total cost
echo -e "\n--- TOTAL ---"
aws ce get-cost-and-usage \
  --time-period Start="${MONTH_START}",End="${MONTH_END}" \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

# 2. Top 10 services by cost
echo -e "\n--- TOP SERVICES ---"
aws ce get-cost-and-usage \
  --time-period Start="${MONTH_START}",End="${MONTH_END}" \
  --granularity MONTHLY \
  --group-by '[{"Type":"DIMENSION","Key":"SERVICE"}]' \
  --metrics "UnblendedCost" \
  --query 'reverse(sort_by(ResultsByTime[0].Groups, &to_number(Metrics.UnblendedCost.Amount)))[:10].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output table

# 3. Untagged resources (cost allocation gap)
echo -e "\n--- UNTAGGED SPEND ---"
aws ce get-cost-and-usage \
  --time-period Start="${MONTH_START}",End="${MONTH_END}" \
  --granularity MONTHLY \
  --filter '{"Tags":{"Key":"team","MatchOptions":["ABSENT"]}}' \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

Tag early, tag everything. A common pattern at companies like Netflix is to tag every cloud resource with team, service, environment, and cost-center at creation time, enforced by an SCP (Service Control Policy) that rejects untagged resource creation. Without tags, you cannot allocate cost to a team, cannot set budgets per service, and cannot tell which microservice owns the S3 bucket that is costing $4,000/month. The first question in any FinOps conversation is always "which team owns this?", and tagging is the infrastructure that makes that question answerable.

Translating SKUs to Behaviors

Cloud provider line items use terse SKU descriptions that obscure the underlying behavior. Learning the translation table is a core FinOps skill:

DataTransfer-Out-Bytes - internet egress from EC2 or ELB; $0.09/GB
APN1-DataTransfer-Regional-Bytes - cross-AZ transfer within a region; $0.01/GB (the "Regional" label is the hint; APN1 is the billing prefix for ap-northeast-1)
USE1-NatGateway-Bytes - NAT Gateway data processing in us-east-1; $0.045/GB
EBS:VolumeUsage.gp3 - provisioned EBS storage regardless of attachment state; $0.08/GB-month
RDS:StorageIOUsage - I/O request charges on legacy magnetic RDS storage; gp2, gp3, and Provisioned IOPS storage carry no per-request I/O charge
BoxUsage:m5.xlarge - on-demand EC2 instance hours; the most common compute line item
HeavyUsage:m5.xlarge - legacy Reserved Instance usage; appears in older accounts still on RI rather than Savings Plans

When an unfamiliar SKU appears, the fastest resolution is the AWS Pricing API or the GCP SKU catalog — both are queryable and return human-readable descriptions. Never guess what a SKU means based on its name; the pricing dimension hidden inside (per-hour vs per-GB vs per-request) determines the optimization lever.
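For the usage types you have already verified, a small lookup can annotate report output with the behavior and pricing dimension behind each one. The sketch below covers only the entries listed above; `describe_usage_type` is a hypothetical helper, and the region-prefix pattern (uppercase letters, digits, then a dash) is an assumption that fits prefixes such as USE1- and APN1- but may not cover every region code. Anything unrecognized falls through to a reminder to look it up rather than guess.

```shell
# describe_usage_type USAGE_TYPE
# Prints the behavior and pricing dimension for a known AWS usage type.
describe_usage_type() {
  # Strip an optional region prefix such as USE1- or APN1- (assumed pattern).
  stripped=$(printf '%s\n' "$1" | sed -E 's/^[A-Z]+[0-9]+-//')
  case "$stripped" in
    DataTransfer-Out-Bytes)      echo "internet egress (per GB)" ;;
    DataTransfer-Regional-Bytes) echo "cross-AZ transfer (per GB)" ;;
    NatGateway-Bytes)            echo "NAT Gateway data processing (per GB)" ;;
    EBS:VolumeUsage.*)           echo "provisioned EBS storage (per GB-month)" ;;
    BoxUsage:*)                  echo "on-demand EC2 instance hours (per hour)" ;;
    HeavyUsage:*)                echo "Reserved Instance usage (per hour)" ;;
    *)                           echo "unknown: check the provider's pricing catalog" ;;
  esac
}

describe_usage_type "APN1-DataTransfer-Regional-Bytes"
describe_usage_type "BoxUsage:m5.xlarge"
```

Keeping the pricing dimension in the description is deliberate: per-hour, per-GB, and per-GB-month items call for different optimization levers.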

Anomaly detection is not optional at scale. AWS Cost Anomaly Detection and GCP Budget Alerts should be configured before you need them. A runaway Lambda that processes a malformed event in an infinite retry loop, a misconfigured S3 replication rule sending 10 TB/day cross-region, or an autoscaling group that scales to 500 instances during a DDoS all look the same on the bill: a sudden spike. Set a daily anomaly alert at 20% above the 7-day average as a baseline. At $100k/month spend, a 20% daily spike means $667 in unexpected cost — detectable in hours rather than discovered at month-end.
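The baseline rule above (alert when today's cost exceeds the trailing average by a set percentage) can be sketched in a few lines for use in a custom report. This is an illustration of the threshold logic only, not how AWS Cost Anomaly Detection or GCP budget alerts work internally; `anomaly_check` is a hypothetical helper and the daily figures are made up.

```shell
# anomaly_check THRESHOLD_PCT TODAY DAY1 [DAY2 ...]
# Compares TODAY against the average of the trailing daily costs and prints
# "ANOMALY" or "OK" followed by the values used.
anomaly_check() {
  threshold_pct="$1"; today="$2"; shift 2
  printf '%s\n' "$@" | awk -v pct="$threshold_pct" -v today="$today" '
    { sum += $1; n++ }
    END {
      limit = (sum / n) * (1 + pct / 100)
      status = (today + 0 > limit) ? "ANOMALY" : "OK"
      printf "%s today=%.2f limit=%.2f\n", status, today, limit
    }'
}

# Made-up data: seven days at $3,000/day, then two candidate values for today.
anomaly_check 20 3700 3000 3000 3000 3000 3000 3000 3000
anomaly_check 20 3500 3000 3000 3000 3000 3000 3000 3000
```

A fixed percentage over a short trailing window is noisy for workloads with strong weekly cycles, so treat 20% over 7 days as a starting point and tune it against your own history.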