Capacity Planning & Autoscaling

Measuring Utilization & Saturation

Capacity planning without measurement is guesswork. The metrics you collect — and how you interpret them — are the difference between a confident scale-out decision and a fire drill at 2 AM. This lesson covers the precise signals that matter for capacity decisions at production scale: what they measure, how to collect them reliably, and the senior-level interpretation that separates good operators from great ones.

The USE Method: A Framework for Resource Analysis

Brendan Gregg's USE method (Utilization, Saturation, Errors) gives you a systematic checklist for every physical and virtual resource — CPUs, memory, disks, network interfaces, Kubernetes resource quotas, and cloud service limits. Apply it per-resource before concluding that a system is "healthy".

  • Utilization: What fraction of the resource's capacity is being consumed over a time window? High utilization is not inherently bad, but it shrinks your headroom.
  • Saturation: Is work queuing because the resource cannot service it immediately? Saturation is the leading indicator of latency degradation — it appears before p99 latency spikes do.
  • Errors: Are requests or operations failing at the resource level (TCP retransmits, disk I/O errors, OOM kills, throttled API calls)?
Why saturation is more important than utilization: A CPU at 70% utilization with a run-queue length of 0 is fine. A CPU at 55% utilization with a sustained run-queue depth of 4 is already degrading latency for every process competing for cycles. Always pair utilization with its saturation signal.
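
The pairing rule above can be sketched as a small classifier. This is a minimal illustration, not a production alert rule: the `CpuSample` type, the `assess_cpu` function, and both thresholds are hypothetical names and values chosen for the example, and the two-CPU node is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    utilization: float  # fraction of CPU time busy, 0.0 to 1.0
    run_queue: float    # sustained count of runnable tasks
    cpu_count: int      # logical CPUs on the node


def assess_cpu(sample, util_limit=0.8, queue_limit=1.0):
    """Classify a sample, checking saturation before utilization."""
    # Saturation: more runnable tasks than CPUs means work is queuing.
    if sample.run_queue / sample.cpu_count > queue_limit:
        return "saturated"
    # Utilization alone only tells us how much headroom is left.
    if sample.utilization > util_limit:
        return "low-headroom"
    return "ok"


# The two cases from the text, assuming a 2-CPU node.
print(assess_cpu(CpuSample(utilization=0.70, run_queue=0, cpu_count=2)))  # ok
print(assess_cpu(CpuSample(utilization=0.55, run_queue=4, cpu_count=2)))  # saturated
```

Checking the queue first is the design choice that matters here: the lower-utilization sample is the unhealthy one.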

CPU: The Signals That Matter

Raw CPU percentage is overused and underspecified. At senior level you track these signals together:

  • node_cpu_seconds_total{mode="idle"} — derive utilization as 1 - rate(idle[5m]) per CPU, then aggregate. Splitting by mode (user, system, iowait, steal) reveals why CPU is busy.
  • CPU steal (mode="steal"): on cloud VMs this exposes noisy-neighbour contention at the hypervisor. Sustained steal above roughly 5% means the host is withholding CPU time from your VM, independent of your workload. Adding pods through HPA will not help; you need a different or larger instance class.
  • Run-queue length: node_load1, node_load5, node_load15. Divide by CPU count. A ratio consistently above 1.0 means the CPUs are saturated, although on Linux the load average also counts tasks in uninterruptible I/O wait, so confirm with vmstat. A common rule of thumb is to alert at load-per-CPU above 0.7 to leave headroom for traffic bursts.
  • CFS throttling: in Kubernetes, container_cpu_cfs_throttled_periods_total counts the scheduler periods in which a container was throttled because it hit its cgroup CPU quota, and container_cpu_cfs_throttled_seconds_total records the total time it spent throttled. A high throttle ratio while overall node CPU is moderate means the container's CPU limit is set too low, not that the node is out of capacity.
# PromQL: CPU utilization per node (1 = 100%); node_exporter labels each host as "instance"
1 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)

# PromQL: CFS throttle ratio per container (throttled periods / total periods); flags under-provisioned limits
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total[5m])
)

# CLI: live run queue on a node (the "r" column is runnable tasks; compare it with the CPU count)
vmstat -w 2 10

# CLI: per-CPU utilization breakdown including steal
mpstat -P ALL 5 3
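
To make the "1 minus idle rate" derivation concrete, here is a minimal sketch that does the same arithmetic by hand on two snapshots of per-mode CPU counters. The `cpu_mode_fractions` helper and the snapshot values are invented for illustration; in practice the counters come from node_cpu_seconds_total.

```python
def cpu_mode_fractions(before, after):
    """Return each mode's share of CPU time between two counter snapshots."""
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must span a positive amount of CPU time")
    return {mode: delta / total for mode, delta in deltas.items()}


# Hypothetical cumulative CPU seconds per mode, sampled some minutes apart.
before = {"user": 1000.0, "system": 300.0, "iowait": 50.0, "steal": 10.0, "idle": 8640.0}
after = {"user": 1150.0, "system": 340.0, "iowait": 55.0, "steal": 25.0, "idle": 8730.0}

fractions = cpu_mode_fractions(before, after)
utilization = 1 - fractions["idle"]

print(f"utilization: {utilization:.2f}")     # utilization: 0.70
print(f"steal: {fractions['steal']:.2f}")    # steal: 0.05
```

The per-mode split is what turns a bare 70% into something actionable: here 5 of those points are steal, which no change to the workload will recover.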

Memory: Utilization Is a Lie Without Context

Memory metrics are uniquely deceptive. The Linux kernel aggressively uses free RAM for page cache, making "used memory" a misleading number. The signals that actually indicate capacity pressure are:

  • Available memory (node_memory_MemAvailable_bytes) — this is the kernel's own estimate of how much memory can be freed under pressure without swapping. It is what you should alert on, not MemFree.
  • Major page faults / swap activity: node_vmstat_pgmajfault, plus swap in use computed as node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes. Any sustained swap usage in a containerised workload is a saturation signal; latency is already degraded.
  • OOM kill events: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}. Each OOM kill is a memory saturation event that production traffic experienced.
  • Working set vs. RSS: the kubelet bases memory eviction decisions on container_memory_working_set_bytes (not RSS), and working set is also the closest exported proxy for how near a container is to being OOM-killed at its limit. Alert on working set approaching the container memory limit.
The "memory leak vs. cache" trap: A container whose working set grows steadily over hours may have a memory leak — or it may simply be warming a query cache. Before raising a P2 incident, check whether the growth plateaus and whether eviction/GC events correlate. Plot working set alongside GC pause duration (JVM: jvm_gc_pause_seconds_sum) or equivalent runtime metrics.
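
One rough way to check whether growth plateaus is to compare how much the working set grew in the first half of a window against the second half. The sketch below is a heuristic under stated assumptions, not a leak detector: `growth_is_plateauing`, its 10% tolerance, and the sample series are all hypothetical.

```python
def growth_is_plateauing(samples, tolerance=0.1):
    """True if second-half growth is at most `tolerance` times first-half growth."""
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to compare two halves")
    mid = len(samples) // 2
    early_growth = samples[mid] - samples[0]
    late_growth = samples[-1] - samples[mid]
    if early_growth <= 0:
        return True  # flat or shrinking early on: nothing is growing
    return late_growth <= tolerance * early_growth


# Hypothetical working-set readings in MiB, taken at equal intervals.
warming_cache = [100, 300, 450, 500, 510, 512, 512]
steady_leak = [100, 200, 300, 400, 500, 600, 700]

print(growth_is_plateauing(warming_cache))  # True
print(growth_is_plateauing(steady_leak))    # False
```

A result of False is only a prompt to look closer, for example at GC activity over the same window; it does not prove a leak.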

Disk I/O: Throughput, IOPS, and Latency Together

A disk saturated on IOPS can have low throughput (many small random reads), and vice versa. Capacity decisions require all three dimensions:

  • node_disk_io_time_seconds_total — the fraction of time the device was busy (rate over 1m gives utilization 0–1). Above 0.8 is a saturation risk.
  • node_disk_read_bytes_total / node_disk_written_bytes_total — throughput; compare against published device limits (gp3 baseline: 125 MB/s, provisioned up to 1,000 MB/s).
  • rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]): average read latency in seconds over the window. Above 1 ms for NVMe or 10 ms for network EBS suggests saturation or queue depth issues.
# PromQL: disk utilization (0-1) per device per node
rate(node_disk_io_time_seconds_total{device!~"loop.*"}[5m])

# PromQL: average write latency in milliseconds
(
  rate(node_disk_write_time_seconds_total[5m])
  /
  rate(node_disk_writes_completed_total[5m])
) * 1000

# CLI: real-time disk utilization, IOPS, and latency breakdown
iostat -xz 2 5

# CLI: identify the exact process driving disk I/O
iotop -o -P
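
The three disk dimensions fall out of the same counter deltas. The sketch below shows the arithmetic for one device over one interval; `disk_stats` and the input numbers are made up for illustration and mirror what the rate() expressions above compute.

```python
def disk_stats(io_time_delta_s, read_time_delta_s, reads_delta, interval_s):
    """Derive utilization, read IOPS, and mean read latency from counter deltas."""
    utilization = io_time_delta_s / interval_s   # fraction of the interval the device was busy
    read_iops = reads_delta / interval_s         # completed reads per second
    # Mean latency = total time spent on reads / number of reads completed.
    avg_read_latency_ms = (read_time_delta_s / reads_delta) * 1000 if reads_delta else 0.0
    return utilization, read_iops, avg_read_latency_ms


# Hypothetical deltas over a 60-second window.
utilization, read_iops, latency_ms = disk_stats(
    io_time_delta_s=51.0, read_time_delta_s=90.0, reads_delta=30000, interval_s=60.0
)

print(f"utilization={utilization:.2f} iops={read_iops:.0f} latency={latency_ms:.1f}ms")
# utilization=0.85 iops=500 latency=3.0ms
```

In this example the device is busy 85% of the time at only 500 IOPS, which is the "saturated with low throughput" pattern the section describes.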

Network: Bandwidth, Packet Loss, and Connection Saturation

Network saturation manifests in three ways: bandwidth exhaustion, connection-table exhaustion, and retransmit storms. Each requires a different remediation.

  • Bandwidth utilization: rate(node_network_transmit_bytes_total[5m]) * 8 for bits/s. On AWS, the c5.xlarge rating of "up to 10 Gbps" is a burst figure; sustained traffic above the instance's lower baseline bandwidth is throttled back to that baseline once burst credits run out, with no explicit error.
  • Retransmit rate: rate(node_netstat_Tcp_RetransSegs[5m]) divided by rate(node_netstat_Tcp_OutSegs[5m]). Above 0.1% of total segments is a signal worth investigating; it indicates packet loss, congestion, or an overloaded receiver.
  • Conntrack exhaustion: node_nf_conntrack_entries / node_nf_conntrack_entries_limit. Alert when this ratio exceeds 0.8; once the table is full, the kernel drops packets for new connections and logs "nf_conntrack: table full, dropping packet". In Kubernetes, this silently drops traffic while CPU and memory look fine, a classic false-healthy state.

Figure: USE Method Signal Map
  • CPU. Utilization: 1 - idle rate, split by mode (user / system / steal). Saturation: run-queue per CPU (node_load1 / nproc > 0.7 = alert). Errors: CFS throttle ratio (throttled / total periods > 5%).
  • Memory. Utilization: working set bytes vs container limit (not RSS). Saturation: low MemAvailable, major faults, swap activity. Errors: OOMKilled pods (kube_pod_container_status_*).
  • Network / Disk. Utilization: bytes/s vs link speed; disk io_time rate vs 1.0. Saturation: conntrack ratio; disk average latency (ms). Errors: retransmit rate; disk I/O errors, SMART events.
Apply USE to every resource: Disk, Network, Pod CPU/Memory limits, Node allocatable, Cloud API quotas.
The USE method applied across CPU, Memory, and Network/Disk resources: each category has distinct utilization, saturation, and error signals.
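
The conntrack and retransmit checks from the network list can be sketched as one function. The 0.8 and 0.1% thresholds come from the text; the `network_pressure` name and the sample counts are hypothetical.

```python
def network_pressure(conntrack_entries, conntrack_limit, retrans_segs, out_segs):
    """Return the names of network saturation signals that exceed their thresholds."""
    findings = []
    # Conntrack table fill ratio: alert well before the table is full.
    if conntrack_entries / conntrack_limit > 0.8:
        findings.append("conntrack")
    # Retransmitted segments as a percentage of all segments sent.
    retrans_pct = 100 * retrans_segs / out_segs if out_segs else 0.0
    if retrans_pct > 0.1:
        findings.append("retransmits")
    return findings


# Hypothetical readings: table nearly full, retransmits at 0.05%.
print(network_pressure(230000, 262144, 50, 100000))   # ['conntrack']
# Hypothetical readings: table almost empty, retransmits at 0.5%.
print(network_pressure(1000, 262144, 500, 100000))    # ['retransmits']
```

Returning the signal names rather than a single boolean keeps the remediation path visible, since each signal calls for a different fix.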

Kubernetes-Specific Capacity Signals

Kubernetes adds a layer of virtual resources on top of OS-level signals. Both layers must be healthy; a node can appear under-utilized at the OS level while being over-committed at the scheduler level.

  • Node allocatable pressure: kube_node_status_allocatable{resource="cpu"} vs. sum(kube_pod_container_resource_requests{resource="cpu"}) by (node). When summed requests reach allocatable, new pods cannot schedule on that node; this is capacity exhaustion even if actual CPU usage is 40%.
  • Pod disruption budget violations: A saturated cluster that cannot drain nodes for upgrades is operationally saturated regardless of CPU metrics.
  • Namespace quota utilisation: kube_resourcequota metrics track hard/used per namespace. A team hitting their quota ceiling is experiencing capacity saturation from their perspective.
# kubectl: check node-level allocatable vs. requested (CPU and memory at a glance)
kubectl describe nodes | grep -A 6 "Allocated resources"

# kubectl: find pods where any container sets no resource requests (invisible to the scheduler)
# any(...) emits each pod once, even when several containers lack requests
kubectl get pods -A -o json | \
  jq '.items[] | select(any(.spec.containers[]; .resources.requests == null)) | {name: .metadata.name, ns: .metadata.namespace}'

# PromQL: node CPU commitment ratio (requests vs allocatable)
sum by (node) (
  kube_pod_container_resource_requests{resource="cpu", unit="core"}
)
/
sum by (node) (
  kube_node_status_allocatable{resource="cpu"}
)
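
The commitment ratio is simple arithmetic once CPU quantities are normalised to cores. The sketch below assumes quantities arrive as Kubernetes-style strings such as "500m" or "2"; `parse_cpu`, `commitment_ratio`, and the sample node are illustrative only.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores to cores
    return float(quantity)


def commitment_ratio(pod_requests, allocatable):
    """Summed CPU requests on a node divided by the node's allocatable CPU."""
    return sum(parse_cpu(q) for q in pod_requests) / parse_cpu(allocatable)


# Hypothetical node: 3920m allocatable, four containers with CPU requests.
ratio = commitment_ratio(["500m", "1", "250m", "2"], "3920m")
print(f"{ratio:.2f}")  # 0.96
```

A ratio near 1.0 means the scheduler sees the node as full regardless of how idle its CPUs are.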

The Four Golden Signals in a Capacity Context

Google's four golden signals (latency, traffic, errors, saturation) map directly to capacity decisions. The key insight is that saturation predicts future degradation while the other three report current state. Build your capacity dashboards so saturation signals are placed above latency signals — act before customers feel it.

  • Traffic — requests-per-second or messages-per-second is your demand signal. Capacity = supply; traffic = demand. Track p50, p90, p99 traffic volumes, not just averages.
  • Saturation — use the per-resource saturation signals described above. The aggregate saturation score for a service is the highest saturation across all resources it depends on.
  • Latency — p99 and p999 latencies are lagging indicators of saturation. If you are already seeing p99 degradation, you are already over capacity. Use saturation to trigger scale-out 5–10 minutes earlier.
  • Errors — 5xx rates and timeout rates confirm that saturation has tipped into user-visible impact.
Production practice (headroom targets by tier): Tier-1 services (revenue-critical) are typically kept below 60% utilization on every dimension at peak. Tier-2 services use 70–75%. Tier-3 (internal tooling) may run to 85%. These margins exist to absorb sudden traffic spikes (flash sales, viral moments) and give autoscalers time to react. Define these headroom targets explicitly in your capacity review documents rather than leaving them implicit.
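
As a rough sketch of how these ideas turn into numbers, the code below takes the worst per-resource saturation as the service's score and sizes a replica count from a tier's headroom target. The function names, the 75% choice for Tier-2, and the traffic figures are assumptions made for the example.

```python
import math

# Peak utilization targets per tier, in percent (Tier-2 assumed at the top of its 70-75% range).
HEADROOM_TARGET_PCT = {"tier1": 60, "tier2": 75, "tier3": 85}


def service_saturation(per_resource):
    """Return (resource, value) for the most saturated resource a service depends on."""
    return max(per_resource.items(), key=lambda item: item[1])


def replicas_needed(peak_rps, rps_per_replica_at_full_load, tier):
    """Replicas required so that peak traffic stays within the tier's utilization target."""
    usable_rps_x100 = rps_per_replica_at_full_load * HEADROOM_TARGET_PCT[tier]
    return math.ceil(peak_rps * 100 / usable_rps_x100)


print(service_saturation({"cpu": 0.40, "conntrack": 0.85, "disk": 0.60}))  # ('conntrack', 0.85)
print(replicas_needed(9000, 500, "tier1"))  # 30
print(replicas_needed(9000, 500, "tier3"))  # 22
```

The same peak costs eight more replicas at Tier-1 than at Tier-3, which is why the tier assignment belongs in the capacity review rather than in someone's head.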

Collecting Metrics Reliably: Pitfalls at Scale

Measurement infrastructure itself has failure modes that corrupt capacity data:

  • Scrape interval vs. alert window: Prometheus's built-in default scrape interval is 1m, though many deployments configure 15s; the HPA controller syncs every 15s by default. If your alert window is [1m] and your scrape interval is 60s, the window holds at most one sample, and rate() needs at least two, so the expression returns no data. Use at least 4 scrape points per alert window.
  • Counter resets on pod restart: rate() handles resets correctly; delta() does not. Use rate() or increase() for counters that span restarts.
  • Metric cardinality explosion: Adding a high-cardinality label (user_id, request_id) to a counter creates millions of series and eventually kills your Prometheus. Keep labels bounded — namespace, pod, container, node, service are safe. User-level labels belong in logs, not metrics.
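  • Clock skew between nodes: Prometheus stamps scraped samples with its own clock, so node drift mainly harms metrics that carry their own timestamps (pushed or federated data) and any correlation with logs and traces; such samples can arrive out of order and be rejected. Check node_timex_sync_status (1 means the clock is synchronized).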
# Check Prometheus scrape health: list active targets that are not "up"
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, instance: .labels.instance, err: .lastError}'

# Count distinct metric names
curl -s http://localhost:9090/api/v1/label/__name__/values | \
  jq '.data | length'

# Prometheus TSDB stats endpoint: head summary plus the highest-cardinality metric names
curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq '.data.headStats, .data.seriesCountByMetricName'
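
To see why reset handling matters, the sketch below computes a counter's increase by hand, treating any drop in value as a restart from zero. It is a simplified model of the idea, not Prometheus's implementation (which also extrapolates to the window edges); `increase` and `per_second_rate` here are local helper names, and the samples are invented.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarted at zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def per_second_rate(samples):
    """Average per-second rate, or None when there are too few samples to compute one."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical 15s scrapes of a request counter; the pod restarts before the third scrape.
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]

print(samples[-1][1] - samples[0][1])  # -60  (naive last-minus-first difference)
print(increase(samples))               # 70.0 (reset-aware)
print(per_second_rate([(0, 5)]))       # None (one sample cannot yield a rate)
```

The last line is the scrape-interval pitfall in miniature: a window that holds a single sample has no rate at all.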

With these signals instrumented and understood, you have the measurement foundation that every subsequent lesson in this tutorial depends on — HPA target metrics, VPA recommendations, cluster autoscaler triggers, and capacity review baselines all trace back to the utilization and saturation numbers you collect here.