Measuring Utilization & Saturation
Capacity planning without measurement is guesswork. The metrics you collect — and how you interpret them — are the difference between a confident scale-out decision and a fire drill at 2 AM. This lesson covers the precise signals that matter for capacity decisions at production scale: what they measure, how to collect them reliably, and the senior-level interpretation that separates good operators from great ones.
The USE Method: A Framework for Resource Analysis
Brendan Gregg's USE method (Utilization, Saturation, Errors) gives you a systematic checklist for every physical and virtual resource — CPUs, memory, disks, network interfaces, Kubernetes resource quotas, and cloud service limits. Apply it per-resource before concluding that a system is "healthy".
- Utilization: What fraction of the resource's capacity is being consumed over a time window? High utilization is not inherently bad, but it shrinks your headroom.
- Saturation: Is work queuing because the resource cannot service it immediately? Saturation is the leading indicator of latency degradation — it appears before p99 latency spikes do.
- Errors: Are requests or operations failing at the resource level (TCP retransmits, disk I/O errors, OOM kills, throttled API calls)?
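The checklist above can be sketched as a small per-resource check. This is only an illustration: the `UseReading` type, the `use_findings` helper, and the 0.8 utilization limit are hypothetical names and values, not part of the USE method itself.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    """One USE observation for a single resource (illustrative fields)."""
    resource: str
    utilization: float   # fraction of capacity in use, 0.0-1.0
    saturation: float    # queued work; 0.0 means nothing is waiting
    errors_per_s: float  # resource-level error rate


def use_findings(r: UseReading, util_limit: float = 0.8) -> list[str]:
    """Return findings in the order errors, saturation, utilization."""
    findings = []
    if r.errors_per_s > 0:
        findings.append(f"{r.resource}: errors present")
    if r.saturation > 0:
        findings.append(f"{r.resource}: work is queuing")
    if r.utilization >= util_limit:  # assumed headroom threshold
        findings.append(f"{r.resource}: low headroom")
    return findings
```

Running the same check over every resource a service depends on gives a uniform first pass before any deeper investigation.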
CPU: The Signals That Matter
Raw CPU percentage is overused and underspecified. At senior level you track these signals together:
- Utilization by mode: start from node_cpu_seconds_total{mode="idle"} and derive utilization as 1 - rate(node_cpu_seconds_total{mode="idle"}[5m]) per CPU, then aggregate. Splitting by mode (user, system, iowait, steal) reveals why the CPU is busy.
- CPU steal (mode="steal"): on cloud VMs this exposes noisy-neighbour hypervisor contention. Steal sustained above 5% means the hypervisor is withholding CPU time from your VM, independent of your workload. HPA will not help; you need a different or larger instance class.
- Run-queue length: node_load1, node_load5, node_load15. Divide by CPU count. A ratio consistently above 1.0 suggests the CPUs are saturated, although Linux load averages also count tasks in uninterruptible I/O wait, so confirm against iowait. A common rule of thumb is to alert at load-per-CPU above 0.7 to leave headroom for traffic bursts.
- CFS throttling: in Kubernetes, container_cpu_cfs_throttled_seconds_total accumulates the time a container spent throttled because it hit its cgroup CPU quota. A high throttle ratio (container_cpu_cfs_throttled_periods_total divided by container_cpu_cfs_periods_total) while overall node CPU is moderate means the container's limits are under-provisioned, not that the node lacks capacity.
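As a rough sketch of the arithmetic behind these CPU signals, the helpers below take values you would already have queried (per-CPU idle rates, a load average, CFS period counts). The function names are hypothetical, and the throttle ratio assumes you use the cAdvisor period counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total.

```python
def cpu_utilization(idle_rate_per_cpu: list[float]) -> float:
    """Mean utilization from per-CPU rate() of idle seconds (each 0.0-1.0)."""
    busy = [1.0 - idle for idle in idle_rate_per_cpu]
    return sum(busy) / len(busy)


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Load average normalised by CPU count; above 1.0 suggests queuing."""
    return load_avg / cpu_count


def throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods == 0:
        return 0.0  # no periods elapsed, so nothing could be throttled
    return throttled_periods / total_periods
```

A container with a throttle ratio of 0.25 on a node whose `cpu_utilization` is 0.5 points at the container's limit rather than at node capacity.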
Memory: Utilization Is a Lie Without Context
Memory metrics are uniquely deceptive. The Linux kernel aggressively uses free RAM for page cache, making "used memory" a misleading number. The signals that actually indicate capacity pressure are:
- Available memory (node_memory_MemAvailable_bytes): the kernel's own estimate of how much memory can be given to new work without swapping. Alert on this, not on MemFree.
- Major page faults / swap activity: node_vmstat_pgmajfault, plus swap in use computed as node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes. Any sustained swap usage in a containerised workload is a saturation signal; latency is already degraded.
- OOM kill events: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}. Each OOM kill is a memory saturation event that production traffic experienced.
- Working set vs. RSS: Kubernetes uses container_memory_working_set_bytes (not RSS) for eviction and OOM decisions. Alert on working set approaching the container memory limit.
For garbage-collected runtimes, also track GC pause time (jvm_gc_pause_seconds_sum) or equivalent runtime metrics.
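The two ratios most often alerted on in this section can be sketched as follows. The function names are hypothetical, and treating a limit of 0 as "no limit set" is an assumption about how your exporter reports unlimited containers.

```python
from typing import Optional


def available_fraction(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Share of RAM the kernel estimates it can still hand out without swapping."""
    return mem_available_bytes / mem_total_bytes


def working_set_ratio(working_set_bytes: float, limit_bytes: float) -> Optional[float]:
    """Working set as a fraction of the container limit, or None if unlimited."""
    if limit_bytes <= 0:  # assumption: 0 means the container has no memory limit
        return None
    return working_set_bytes / limit_bytes
```

Alerting on a low `available_fraction` at the node level and a high `working_set_ratio` at the container level covers both layers described above.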
Disk I/O: Throughput, IOPS, and Latency Together
A disk saturated on IOPS can have low throughput (many small random reads), and vice versa. Capacity decisions require all three dimensions:
- node_disk_io_time_seconds_total: the fraction of time the device was busy (rate over 1m gives utilization 0–1). Above 0.8 is a saturation risk.
- node_disk_read_bytes_total / node_disk_written_bytes_total: throughput; compare against published device limits (gp3 baseline: 125 MB/s, provisioned up to 1,000 MB/s).
- rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]): average read latency over the window. Both are counters, so divide their rates rather than their raw values. Above 1 ms for NVMe or 10 ms for network EBS suggests saturation or queue depth issues.
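To make the three dimensions concrete, here is a minimal sketch that derives them from two snapshots of the underlying counters. `DiskSample` and `disk_signals` are hypothetical names; the fields mirror the node_exporter counters named above, and counter resets are not handled.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Snapshot of cumulative disk counters at time t (seconds)."""
    t: float
    io_time_s: float        # node_disk_io_time_seconds_total
    read_bytes: float       # node_disk_read_bytes_total
    read_time_s: float      # node_disk_read_time_seconds_total
    reads_completed: float  # node_disk_reads_completed_total


def disk_signals(a: DiskSample, b: DiskSample) -> dict:
    """Utilization, read throughput and mean read latency between a and b."""
    dt = b.t - a.t
    reads = b.reads_completed - a.reads_completed
    read_time_ms = (b.read_time_s - a.read_time_s) * 1000
    return {
        "utilization": (b.io_time_s - a.io_time_s) / dt,
        "read_mb_per_s": (b.read_bytes - a.read_bytes) / dt / 1e6,
        "avg_read_latency_ms": read_time_ms / reads if reads else 0.0,
    }
```

Reading the three values together is the point: moderate utilization with high latency tells a different story from high utilization with low throughput.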
Network: Bandwidth, Packet Loss, and Connection Saturation
Network saturation manifests in three ways: bandwidth exhaustion, connection-table exhaustion, and retransmit storms. Each requires a different remediation.
- Bandwidth utilization: rate(node_network_transmit_bytes_total[5m]) * 8 gives bits/s. On AWS, c5.xlarge is rated at up to 10 Gbps, which is a burst figure; its sustained baseline is much lower, and traffic held above the baseline is eventually throttled without any error. Plan against the baseline and treat sustained use above roughly 60% of it as a warning level.
- Retransmit rate: rate(node_netstat_Tcp_RetransSegs[5m]) relative to rate(node_netstat_Tcp_OutSegs[5m]). Above 0.1% of total segments is worth investigating: it indicates either packet loss or receive-buffer saturation on the remote end.
- Conntrack exhaustion: node_nf_conntrack_entries / node_nf_conntrack_entries_limit. Alert when this ratio exceeds 0.8; once the table is full, the kernel drops packets for new connections and logs "nf_conntrack: table full, dropping packet". In Kubernetes this silently drops traffic while CPU and memory look fine, a classic false-healthy state.
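The three network checks reduce to simple ratios once the per-second rates are in hand. The sketch below uses hypothetical function names and assumes the inputs are already rate() results or current gauge values.

```python
def bandwidth_bits_per_s(tx_bytes_per_s: float) -> float:
    """Convert a transmit byte rate into bits per second."""
    return tx_bytes_per_s * 8


def retransmit_ratio(retrans_segs_per_s: float, out_segs_per_s: float) -> float:
    """Retransmitted segments as a fraction of all segments sent."""
    if out_segs_per_s == 0:
        return 0.0  # nothing sent, so nothing to retransmit
    return retrans_segs_per_s / out_segs_per_s


def conntrack_fill(entries: float, limit: float) -> float:
    """How full the connection-tracking table is, from 0.0 to 1.0."""
    return entries / limit
```

For example, 5 retransmits per second against 10,000 segments per second is a ratio of 0.0005, below the 0.1% investigation level mentioned above.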
Kubernetes-Specific Capacity Signals
Kubernetes adds a layer of virtual resources on top of OS-level signals. Both layers must be healthy; a node can appear under-utilized at the OS level while being over-committed at the scheduler level.
- Node allocatable pressure: compare kube_node_status_allocatable{resource="cpu"} with sum(kube_pod_container_resource_requests{resource="cpu"}) by (node). When summed requests approach allocatable, new pods cannot schedule; this is capacity exhaustion even if actual CPU usage is 40%.
- Pod disruption budget violations: A saturated cluster that cannot drain nodes for upgrades is operationally saturated regardless of CPU metrics.
- Namespace quota utilisation: kube_resourcequota metrics track hard/used per namespace. A team hitting their quota ceiling is experiencing capacity saturation from their perspective.
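The requests-versus-allocatable comparison can be sketched as a per-node commitment ratio. `commitment_by_node` is a hypothetical helper, and its inputs stand in for values you would read from kube-state-metrics.

```python
def commitment_by_node(
    allocatable: dict[str, float],
    pod_requests: list[tuple[str, float]],
) -> dict[str, float]:
    """Summed CPU requests divided by allocatable CPU, per node.

    allocatable maps node name to allocatable cores; pod_requests holds
    (node name, requested cores) pairs, one per container.
    """
    requested = {node: 0.0 for node in allocatable}
    for node, cores in pod_requests:
        if node in requested:  # ignore requests for nodes we have no data on
            requested[node] += cores
    return {node: requested[node] / allocatable[node] for node in allocatable}
```

A node at 0.875 committed has little room for new pods even if its measured CPU usage is far lower.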
The Four Golden Signals in a Capacity Context
Google's four golden signals (latency, traffic, errors, saturation) map directly to capacity decisions. The key insight is that saturation predicts future degradation while the other three report current state. Build your capacity dashboards so saturation signals are placed above latency signals — act before customers feel it.
- Traffic — requests-per-second or messages-per-second is your demand signal. Capacity = supply; traffic = demand. Track p50, p90, p99 traffic volumes, not just averages.
- Saturation — use the per-resource saturation signals described above. The aggregate saturation score for a service is the highest saturation across all resources it depends on.
- Latency — p99 and p999 latencies are lagging indicators of saturation. If you are already seeing p99 degradation, you are already over capacity. Use saturation to trigger scale-out 5–10 minutes earlier.
- Errors — 5xx rates and timeout rates confirm that saturation has tipped into user-visible impact.
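The aggregate saturation score described above, the highest saturation across a service's dependencies, can be sketched in a few lines. `service_saturation` is a hypothetical name, and the inputs are assumed to be normalised to a common 0.0 to 1.0 scale.

```python
def service_saturation(per_resource: dict[str, float]) -> tuple[str, float]:
    """Return the most saturated resource and its score."""
    worst = max(per_resource, key=per_resource.get)
    return worst, per_resource[worst]
```

Returning the resource name alongside the score keeps the dashboard actionable: it shows which resource to scale, not just that something is close to its limit.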
Collecting Metrics Reliably: Pitfalls at Scale
Measurement infrastructure itself has failure modes that corrupt capacity data:
- Scrape interval vs. alert window: Prometheus's built-in default scrape_interval is 1m, although 15s is a common setting; the HPA controller syncs every 15s by default. If your alert window is [1m] and your scrape interval is 60s, the window holds at most one sample, and rate() needs at least two, so the expression returns no data. Make the window cover at least 4 scrape intervals.
- Counter resets on pod restart: rate() handles resets correctly; delta() does not, because it is meant for gauges. Use rate() or increase() for counters that span restarts.
- Metric cardinality explosion: Adding a high-cardinality label (user_id, request_id) to a counter creates millions of series and eventually kills your Prometheus. Keep labels bounded: namespace, pod, container, node, and service are safe. User-level labels belong in logs, not metrics.
- Clock skew between nodes: Prometheus normally timestamps samples with its own clock at scrape time, so node drift rarely distorts rate() directly. It does skew any expression that compares an exported timestamp with time(), and it misaligns metrics against logs and traces. Check node_timex_sync_status (1 means synchronised).
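The counter-reset pitfall is easiest to see side by side. The sketch below is a simplified model of reset handling, not Prometheus's actual implementation (which also extrapolates to the window boundaries); all function names are hypothetical.

```python
def increase_with_resets(samples: list[float]) -> float:
    """Total counter increase, treating any drop as a restart from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarted at 0, so cur is the whole increase.
        total += cur if cur < prev else cur - prev
    return total


def naive_delta(samples: list[float]) -> float:
    """Last minus first: correct for gauges, wrong across counter resets."""
    return samples[-1] - samples[0]


def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scrapes that fall inside an alert window."""
    return window_s // scrape_interval_s
```

For the series 100, 150, 200, 20, 70 (a restart after the third sample), the reset-aware total is 170 while the naive delta is -30. Likewise, a 60s window at a 60s scrape interval yields a single sample, which is why the window should span several intervals.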
With these signals instrumented and understood, you have the measurement foundation that every subsequent lesson in this tutorial depends on — HPA target metrics, VPA recommendations, cluster autoscaler triggers, and capacity review baselines all trace back to the utilization and saturation numbers you collect here.