Understanding Cloud Bills
Cloud bills are not random. Every dollar on an AWS Cost Explorer, GCP Billing, or Azure Cost Management invoice maps to a discrete metered event: a CPU-second consumed, a gigabyte stored, a byte transferred out of a region. The problem is that cloud providers surface those events through hundreds of SKUs, dozens of pricing dimensions, and free-tier traps that make the total look opaque. This lesson teaches you to read a bill the way a FinOps engineer at Netflix or Lyft would — identifying the three fundamental cost drivers (compute, storage, egress) and the predictable culprits that inflate each one.
Compute: The Dominant Cost Driver
Compute cost has three sub-dimensions that cloud providers always price separately: instance hours (vCPU + RAM time), OS licensing (Windows/RHEL surcharge on top of hardware), and accelerator add-ons (GPU, FPGA, or Elastic Network Adapter throughput). Reading an EC2 line item without understanding all three gives you an incomplete picture.
The usual suspects that inflate compute bills:
- Zombie instances: Instances left running after a load test, a demo, or a dev environment that was never torn down. At $0.096/hr for an m5.large, a forgotten instance costs about $840/year. Multiply by the 15–30 instances engineering teams typically leave behind and you have a roughly $13–25k annual leak with zero business value.
- Over-provisioned instance families: Picking m5.2xlarge because it was the default in Terraform and nobody questioned it. At big tech, the average CPU utilization across on-demand compute is 12–25%. Right-sizing (Lesson 4) recovers much of this, but you cannot right-size what you cannot see.
- Missing Savings Plans or Reserved Instances: On-demand pricing is the list price: the ceiling, not the floor. AWS Compute Savings Plans give 40–60% off on-demand for a 1- or 3-year commitment. Running $200k/month of steady-state EC2 on-demand when you could be on a Savings Plan is an $80–120k/month leak, roughly $1.0–1.4M per year. (Lesson 5 covers commitments in depth.)
- Auto-Scaling misconfiguration: Scale-out policies that add capacity faster than scale-in policies remove it, or minimum fleet sizes set too high for off-hours traffic. A minimum of 10 instances that should be 3 during nights and weekends pays for 7 surplus instances through every off-peak hour of the 168-hour week.
- Idle NAT Gateways and Load Balancers: Each AWS NAT Gateway costs $0.045/hr (~$390/yr) whether or not it carries traffic, plus $0.045/GB processed. An ALB costs about $0.0225/hr plus $0.008 per LCU-hour. Environments that were deployed for a sprint and never deleted leave these resources running indefinitely.
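The arithmetic behind these compute leaks is simple enough to script. The following is a minimal sketch, assuming flat hourly rates and a 365-day year; the function names are illustrative and not part of any AWS tooling.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def annual_idle_cost(hourly_rate: float, count: int = 1) -> float:
    """Yearly cost in USD of `count` always-on resources at `hourly_rate` USD/hr."""
    return hourly_rate * HOURS_PER_YEAR * count


def monthly_commitment_savings(on_demand_monthly: float, discount: float) -> float:
    """Monthly USD saved by moving steady-state spend to a commitment.

    `discount` is a fraction, e.g. 0.40 for 40% off on-demand.
    """
    return on_demand_monthly * discount


# One forgotten instance billed at $0.096/hr, then fleets of 15 and 30.
for fleet in (1, 15, 30):
    print(f"{fleet:>2} idle instance(s): ${annual_idle_cost(0.096, fleet):,.0f}/yr")

# An idle NAT Gateway at $0.045/hr, before any per-GB processing charge.
print(f"idle NAT Gateway: ${annual_idle_cost(0.045):,.0f}/yr")

# $200k/month of steady-state on-demand spend at 40% and 60% discounts.
for discount in (0.40, 0.60):
    saved = monthly_commitment_savings(200_000, discount)
    print(f"{discount:.0%} discount: ${saved:,.0f}/month, ${saved * 12:,.0f}/yr")
```

Running the same functions against your own rates and fleet counts gives a quick first estimate before opening Cost Explorer.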
Storage: Silent but Accumulating
Storage costs are deceptive because they grow incrementally and never generate an alert. A team that deletes nothing adds gigabytes every sprint: old container images, superseded database snapshots, unused EBS volumes left behind by terminated instances, and S3 buckets that became a dumping ground for logs.
Key storage cost categories:
- EBS volumes detached from terminated instances: When an EC2 instance is terminated without DeleteOnTermination: true on its root volume, the EBS volume persists and continues billing at $0.08–$0.10/GB-month. A 500 GB gp3 volume costs $40/month doing nothing. Scale this across a team that terminates dozens of instances per month and never checks for orphaned volumes.
- S3 storage class mismatches: S3 Standard at $0.023/GB-month vs S3 Glacier Instant Retrieval at $0.004/GB-month for data that is accessed once a quarter. Logs, backups, and ML training datasets sitting in Standard because nobody configured a lifecycle policy is a 5–6x overpayment.
- Snapshot sprawl: RDS automated backups create daily snapshots retained for 7 days by default, which is acceptable. But manual snapshots created during a maintenance window and never deleted accumulate indefinitely. A 2 TB RDS instance with 18 months of weekly manual snapshots can hold ~70 TB of snapshot storage (snapshots are incremental, so the exact figure depends on data churn); at $0.095/GB-month that is $6,650/month in snapshot costs alone, almost certainly exceeding the instance cost.
- Container registry bloat: Every docker push to ECR or GCR stores a new manifest. An active team pushing 20 builds/day with a 1 GB image and no lifecycle policy accumulates about 7,300 images per year. ECR charges $0.10/GB-month after the free tier.
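The storage categories above reduce to gigabytes multiplied by a per-GB-month rate. Here is a minimal sketch of that calculation, assuming a hypothetical in-memory volume inventory (the `Volume` record and its IDs are invented for illustration; a real audit would pull this from the provider's API) and ignoring retrieval and request fees.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    volume_id: str  # hypothetical identifier, for illustration only
    size_gb: int
    attached: bool


def orphaned_volume_cost(volumes: list[Volume], rate_per_gb_month: float = 0.08) -> float:
    """Monthly USD billed for volumes not attached to any instance."""
    orphaned_gb = sum(v.size_gb for v in volumes if not v.attached)
    return orphaned_gb * rate_per_gb_month


def storage_class_ratio(current_rate: float, target_rate: float) -> float:
    """How many times more the current storage class costs than the target."""
    return current_rate / target_rate


inventory = [
    Volume("vol-example-1", 500, attached=False),
    Volume("vol-example-2", 100, attached=True),
    Volume("vol-example-3", 250, attached=False),
]

print(f"orphaned EBS: ${orphaned_volume_cost(inventory):,.2f}/month")
print(f"Standard vs Glacier IR: {storage_class_ratio(0.023, 0.004):.2f}x")
print(f"70 TB of snapshots: ${70_000 * 0.095:,.0f}/month")
```

The ratio only compares storage rates; colder classes add retrieval charges, so the real saving depends on how often the data is read.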
Egress: The Hidden Tax That Surprises Teams
Egress pricing is deliberately asymmetric: ingress (data coming into the cloud) is free; egress (data leaving) is expensive. AWS charges approximately $0.09/GB for the first 10 TB/month of internet egress, dropping to $0.085/GB at scale. GCP charges $0.08/GB for the first 10 TB. Azure charges $0.087/GB. These numbers seem small until you operate a data platform moving terabytes daily.
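Tiered rates mean the marginal gigabyte is cheaper than the first one. The sketch below applies a two-tier simplification of the AWS figures quoted above ($0.09/GB for the first 10 TB, $0.085/GB beyond). The tier table is an assumption for illustration: real schedules have more tiers and a small free allowance, and 1 TB is taken as 1,024 GB here.

```python
# (tier size in GB, USD per GB); None marks the open-ended final tier.
# Simplified, assumed schedule: not a complete provider price list.
EGRESS_TIERS = [(10 * 1024, 0.09), (None, 0.085)]


def tiered_egress_cost(total_gb: float, tiers=EGRESS_TIERS) -> float:
    """Monthly internet egress cost in USD under a tiered rate table."""
    remaining = total_gb
    cost = 0.0
    for size_gb, rate in tiers:
        # Bill as much as fits in this tier, then carry the rest forward.
        billed = remaining if size_gb is None else min(remaining, size_gb)
        cost += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return cost


for terabytes in (1, 10, 20, 50):
    gb = terabytes * 1024
    print(f"{terabytes:>2} TB/month: ${tiered_egress_cost(gb):,.2f}")
```

Swapping in another provider's schedule only requires replacing `EGRESS_TIERS`.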
The egress cost categories that catch teams off guard:
- Cross-AZ traffic: AWS charges $0.01/GB for data crossing Availability Zones, billed in each direction. This is invisible in development (single-AZ) but becomes material in production multi-AZ deployments. A microservices architecture making 100,000 synchronous cross-service calls per second, each returning 5 KB, moves 500 MB/s; if roughly 50% of those calls cross AZ boundaries in a 3-AZ setup, that is about 900 GB/hr, or roughly $6,600/month at $0.01/GB from a single service pair. Multiply by 50 microservices.
- Cross-region replication: Replicating an S3 bucket or RDS read replica across regions costs both the storage and the data transfer. Replicating 5 TB of data from us-east-1 to eu-west-1 costs $0.02/GB, or about $100 one-time for the initial 5 TB, plus ongoing replication traffic.
- NAT Gateway data processing: Every byte from a private subnet traversing a NAT Gateway is charged at $0.045/GB on top of the gateway hourly cost. Lambda functions in a VPC hitting an S3 endpoint through NAT (instead of via a VPC endpoint) is a canonical example of paying $0.045/GB for data that could cost nothing.
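- Direct API egress vs. CloudFront: Serving large assets (images, videos, downloadable files) directly from an EC2 origin or S3 costs $0.09/GB. Fronting with CloudFront lists at about $0.085/GB from US and European edges, and transfer from AWS origins to CloudFront is not charged; the larger savings come from cache hits that never reach the origin, volume tiers, and committed-use discounts rather than from the per-GB list rate alone.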
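Internal transfer charges are easiest to sanity-check by converting a call rate into gigabytes per month. This is a rough sketch using the cross-AZ inputs from the list above (100,000 calls/s, 5 KB responses, 50% of calls crossing an AZ boundary). The 730-hour month and decimal units (1 KB = 1,000 bytes) are assumptions, and the result covers one billing direction only.

```python
HOURS_PER_MONTH = 730  # average month length, an assumption


def monthly_cross_az_gb(calls_per_second: float, payload_kb: float,
                        crossing_fraction: float) -> float:
    """GB per month that cross an AZ boundary for a given call pattern."""
    bytes_per_second = calls_per_second * payload_kb * 1_000 * crossing_fraction
    gb_per_hour = bytes_per_second * 3_600 / 1e9
    return gb_per_hour * HOURS_PER_MONTH


def transfer_cost(gigabytes: float, rate_per_gb: float) -> float:
    """USD cost of moving `gigabytes` at a flat per-GB rate."""
    return gigabytes * rate_per_gb


cross_az_gb = monthly_cross_az_gb(100_000, 5, 0.5)
print(f"cross-AZ volume: {cross_az_gb:,.0f} GB/month")
print(f"cross-AZ cost at $0.01/GB: ${transfer_cost(cross_az_gb, 0.01):,.0f}/month")

# The same helper prices NAT Gateway processing, e.g. 1 TB at $0.045/GB.
print(f"1 TB through NAT: ${transfer_cost(1_000, 0.045):,.2f}")
```

Because the cost scales linearly with every input, halving payload size or keeping calls zone-local cuts the bill by the same proportion.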
Reading the Bill Systematically: A FinOps Workflow
Every FinOps review should follow the same top-down decomposition. Start at the total, break it into service categories, then drill into the biggest movers and anomalies. AWS Cost Explorer, GCP Billing Reports, and the AWS CLI's ce (Cost Explorer) commands all support this pattern.
Tag every resource with team, service, environment, and cost-center at creation time, enforced by an SCP (Service Control Policy) that rejects untagged resource creation. Without tags, you cannot allocate cost to a team, cannot set budgets per service, and cannot tell which microservice owns the S3 bucket that is costing $4,000/month. The recurring question in any FinOps conversation is "which team owns this?", and tagging is the infrastructure that makes that question answerable.
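The top-down decomposition and the tagging check can be sketched in a few lines once the bill is in tabular form. The example below assumes a hypothetical list of line-item dictionaries with `service`, `team`, and `cost` keys; a real export has many more columns and different field names, and the figures here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical line items; a missing team tag is represented as None.
LINE_ITEMS = [
    {"service": "S3", "team": None, "cost": 4000.0},
    {"service": "EC2", "team": "payments", "cost": 2500.0},
    {"service": "EC2", "team": "search", "cost": 1000.0},
    {"service": "NAT Gateway", "team": "payments", "cost": 500.0},
]


def cost_by(line_items, key):
    """Total cost grouped by `key`, largest first; missing values count as untagged."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item.get(key) or "(untagged)"] += item["cost"]
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


def untagged_share(line_items, key="team"):
    """Fraction of total spend that cannot be attributed via `key`."""
    total = sum(item["cost"] for item in line_items)
    untagged = sum(item["cost"] for item in line_items if not item.get(key))
    return untagged / total if total else 0.0


print("total:", sum(item["cost"] for item in LINE_ITEMS))
print("by service:", cost_by(LINE_ITEMS, "service"))
print("by team:", cost_by(LINE_ITEMS, "team"))
print(f"untagged share: {untagged_share(LINE_ITEMS):.0%}")
```

A high untagged share is itself a finding: it measures how much of the bill nobody can currently be asked about.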
Translating SKUs to Behaviors
Cloud provider line items use terse SKU descriptions that obscure the underlying behavior. Learning the translation table is a core FinOps skill:
- DataTransfer-Out-Bytes: internet egress from EC2 or ELB; $0.09/GB
- APN1-DataTransfer-Regional-Bytes: cross-AZ within a region; $0.01/GB (the "Regional" label is the hint)
- USE1-NatGateway-Bytes: NAT Gateway data processing; $0.045/GB
- EBS:VolumeUsage.gp3: provisioned EBS storage regardless of attachment state; $0.08/GB-month
- RDS:StorageIOUsage: I/O charges on older magnetic-storage RDS instances; not charged on gp2, gp3, or io1 volumes
- BoxUsage:m5.xlarge: on-demand EC2 instance hours; the most common compute line item
- HeavyUsage:m5.xlarge: legacy Reserved Instance usage; appears in older accounts still on RI rather than Savings Plans
When an unfamiliar SKU appears, the fastest resolution is the AWS Pricing API or the GCP SKU catalog — both are queryable and return human-readable descriptions. Never guess what a SKU means based on its name; the pricing dimension hidden inside (per-hour vs per-GB vs per-request) determines the optimization lever.
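A translation table like the one above is easy to keep as a small lookup that also records the pricing dimension. The sketch below is an illustration, not an official mapping: the dictionary covers only SKUs from this lesson, and the region-prefix pattern is an assumption that fits prefixes such as USE1- and APN1- but is not a complete grammar of AWS usage types.

```python
import re

# SKU stem -> (behavior, pricing dimension). Entries taken from this lesson only.
SKU_MEANINGS = {
    "DataTransfer-Out-Bytes": ("internet egress", "per-GB"),
    "DataTransfer-Regional-Bytes": ("cross-AZ transfer within a region", "per-GB"),
    "NatGateway-Bytes": ("NAT Gateway data processing", "per-GB"),
    "EBS:VolumeUsage.gp3": ("provisioned gp3 EBS storage", "per-GB-month"),
    "BoxUsage": ("on-demand EC2 instance hours", "per-hour"),
    "HeavyUsage": ("Reserved Instance usage", "per-hour"),
}

# Assumed shape of a region prefix: 2-4 capitals, one digit, a hyphen.
REGION_PREFIX = re.compile(r"^[A-Z]{2,4}\d-")


def translate(usage_type: str):
    """Return (behavior, pricing dimension) for a usage type, or None if unknown."""
    stem = REGION_PREFIX.sub("", usage_type)
    if stem in SKU_MEANINGS:
        return SKU_MEANINGS[stem]
    # Fall back to the family before ':' so BoxUsage:m5.xlarge maps to BoxUsage.
    family = stem.split(":", 1)[0]
    return SKU_MEANINGS.get(family)


for sku in ("APN1-DataTransfer-Regional-Bytes", "USE1-NatGateway-Bytes",
            "BoxUsage:m5.xlarge", "Mystery-SKU"):
    print(sku, "->", translate(sku))
```

Returning None for unknown SKUs, instead of guessing from the name, keeps the lookup consistent with the rule of checking the pricing catalog first.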