Performance & Load Testing

System-Level Bottlenecks

18 min Lesson 7 of 28

Load tests surface symptoms — high latency, dropped requests, queue saturation. Diagnosing the root cause means moving one layer deeper: the operating system and hardware that everything runs on. The USE method (Utilization, Saturation, Errors), coined by Brendan Gregg, gives a disciplined checklist for every physical resource: CPU, memory, disk I/O, and network. Apply it in that order whenever a load test reveals unexplained degradation.

The USE Method in Practice

For each resource, three metrics matter:

  • Utilization — what fraction of available capacity is busy (0–100 %).
  • Saturation — work queued because the resource is already at capacity (e.g., run-queue depth, disk queue depth).
  • Errors — hardware or driver-level faults: ECC memory corrections, TCP retransmits, NIC drops.

High utilization alone is not alarming; saturation always is. A CPU at 95 % with zero run-queue lag is merely busy. A CPU at 70 % with a persistent run-queue of 8 on a 4-core machine is saturated and will produce erratic p99 latency spikes.
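The busy-versus-saturated distinction above can be sketched as a small shell function. The name cpu_state and the 90 % "busy" threshold are illustrative assumptions, not part of the USE method itself; the only rule taken from the text is that a run-queue deeper than the core count means saturation.

```shell
#!/usr/bin/env bash
# Classify CPU state from utilization (%), run-queue depth, and core count.
# Saturation is decided by queue depth, not by utilization.
# cpu_state and the 90 % threshold are hypothetical, for illustration only.
cpu_state() {
  local util="$1" runq="$2" cores="$3"
  if [ "$runq" -gt "$cores" ]; then
    echo "saturated"
  elif [ "$util" -ge 90 ]; then
    echo "busy"
  else
    echo "ok"
  fi
}

cpu_state 95 0 4   # prints: busy
cpu_state 70 8 4   # prints: saturated
```

On a live host, the run-queue depth can be taken from the r column of `vmstat 1 2 | tail -1` and the core count from `nproc`.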

Why USE before profiling? Application profiling (flame graphs, tracing) is expensive and narrow. USE is a 30-second checklist that rules out an entire class of resource before you instrument anything. Run USE first; profile only the resource that is saturated or throwing errors.

CPU Bottlenecks

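CPU saturation shows up as a load average that stays above the core count and a run-queue (the r column in vmstat) deeper than the number of cores, usually alongside high %us + %sy in top. A high %wa (I/O wait) points elsewhere: the CPU is idle while tasks block on I/O, so treat it as a disk signal rather than a CPU one. The Linux load average also counts tasks in uninterruptible sleep, so confirm with the run-queue before blaming the CPU. The canonical quick-look commands: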

    # 1-second snapshots of all CPU states (5 samples)
    mpstat -P ALL 1 5

    # Run-queue depth in the r column (USE: saturation); compare with the load average from uptime
    vmstat 1 10

    # Who is consuming the CPU (top 10 threads; -L lists threads rather than processes)
    ps -eLo pid,tid,pcpu,psr,comm --sort=-pcpu | head -11

    # Scheduler latency per task, requires BCC tools.
    # Histogram of how long tasks wait in the run-queue before getting a CPU (one 10-second sample)
    /usr/share/bcc/tools/runqlat 10 1

At large scale, NUMA placement matters. A JVM whose threads run on one NUMA node while its heap lives on another pays a remote-access penalty, up to roughly 100 ns, on every memory access that misses cache. Verify with numastat -p <pid> and launch latency-sensitive services under numactl --cpunodebind=0 --membind=0. For containerised workloads, set the cpuset cgroup (cpuset.cpus and cpuset.mems) so that CPUs and memory come from a single NUMA node.

Steal time in VMs: %st in top is CPU cycles taken by the hypervisor. Sustained steal > 5 % indicates noisy neighbours on the same physical host — escalate to the cloud provider or migrate to a dedicated host. Steal does not appear in your application metrics; it silently inflates p99.
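Steal can also be measured directly from /proc/stat rather than read off top. The sketch below assumes the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields); the function name steal_pct and the sample counter values are made up for illustration.

```shell
#!/usr/bin/env bash
# Percentage of CPU time stolen between two samples of the aggregate
# "cpu ..." line from /proc/stat. steal_pct is a hypothetical helper.
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user..steal; guest time is already inside user
      steal = $9
    }
    NR == 1 { t0 = total; s0 = steal }
    NR == 2 {
      dt = total - t0
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (steal - s0) / dt
    }'
}

# Illustrative counters (jiffies), not real measurements
before="cpu 1000 0 500 8000 100 0 0 400 0 0"
after="cpu 1600 0 700 8500 100 0 0 500 0 0"
steal_pct "$before" "$after"   # prints: 7.1
```

On a real VM the two samples would come from `grep '^cpu ' /proc/stat`, taken a few seconds apart during the load test.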

Memory Bottlenecks

Memory pressure in Linux manifests through two mechanisms: page reclaim and swapping (soft saturation) and the OOM killer (hard limit). Reclaim and swap activity degrade p99 latency long before swap space fills or the OOM killer fires.

    # Overall memory picture
    free -h

    # Per-process RSS, virtual size, and share of RAM
    ps -eo pid,comm,rss,vsz,pmem --sort=-rss | head -15

    # Swap activity: non-zero si/so under load means pages are being swapped
    vmstat 1 | awk '{print $3, $4, $7, $8}'   # columns: swpd, free, si (swap-in), so (swap-out)

    # Major page faults (pages read back from disk); a climbing counter signals reclaim pressure
    grep pgmajfault /proc/vmstat

    # THP (Transparent Huge Pages) allocations and collapse failures
    grep -E 'thp_fault_alloc|thp_collapse_alloc_failed' /proc/vmstat

    # USE: errors. ECC memory corrections (requires ipmitool or vendor agent)
    ipmitool sel list | grep -i "correctable\|uncorrectable"

For production services: disable swap on latency-sensitive nodes (swapoff -a, then remove the swap entries from /etc/fstab). A process hitting swap on a modern SSD still adds roughly 10–100 µs of latency per major page fault, against about 100 ns for a DRAM access. Where swap must stay enabled, Kubernetes nodes that run latency-critical pods should set vm.swappiness to 0 or 1 in the node's sysctl profile. Note that memory.swappiness is the per-cgroup equivalent under cgroup v1, not a sysctl.

Memory fragmentation is a subtler source of latency spikes. When free memory is fragmented, the kernel has to compact pages to assemble a huge page (2 MiB on x86-64), and THP allocation or collapse can stall memory-hungry services (Redis, Java with large heaps) for multiple milliseconds. Monitor thp_collapse_alloc_failed in /proc/vmstat and set /sys/kernel/mm/transparent_hugepage/enabled to madvise so only explicitly opted-in allocations use THP.

Figure: USE Method, Four Resource Domains

  • CPU
      Utilization: %us + %sy (top/mpstat)
      Saturation: run-queue depth (vmstat r)
      Errors: machine-check exceptions
      Tools: mpstat, perf, runqlat, numastat, flame graphs
  • Memory
      Utilization: RSS / total RAM (free -h)
      Saturation: swap-in/out, page faults
      Errors: ECC corrections (ipmitool)
      Tools: vmstat, ps, smem, /proc/vmstat, cachestat (BCC)
  • Disk I/O
      Utilization: %util per device (iostat)
      Saturation: avgqu-sz (queue depth)
      Errors: dmesg I/O errors, SMART
      Tools: iostat, iotop, blktrace, biolatency (BCC), smartctl
  • Network
      Utilization: RX/TX bytes vs link speed
      Saturation: socket TX queue, drops
      Errors: retransmits, overruns (ss)
      Tools: sar, ss, nstat, tcpdump, nethogs, tcpretrans (BCC)

The USE method checklist — Utilization, Saturation, and Errors for each of the four system resource domains.

Disk I/O Bottlenecks

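Disk I/O saturation is lethal to databases and write-heavy microservices. The key metric is the average queue depth from iostat (avgqu-sz, renamed aqu-sz in newer sysstat releases). On a single HDD, a persistent queue depth above 1 means requests are waiting. NVMe devices service many requests in parallel, so there the signal is a queue depth that keeps growing while await climbs. At big-tech scale, even short I/O spikes are noticeable because they add pressure on the kernel page cache, which evicts hot data and compounds the problem.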

    # Real-time per-device stats (1-second interval, 10 samples)
    iostat -xz 1 10
    # Key columns to watch:
    #   %util     device busy time; near 100% means a saturated HDD, but NVMe/SSD
    #             devices serve requests in parallel and may still have headroom at 100%
    #   await     average I/O latency in ms (SSD healthy: <1 ms; HDD: <10 ms; alert if >20 ms under load)
    #   avgqu-sz  average queue depth (aqu-sz in newer sysstat); >1 = requests are queuing
    #   r/s, w/s  IOPS; compare against the device spec sheet

    # BCC biolatency: latency histogram for block I/O, per disk
    /usr/share/bcc/tools/biolatency -D 10

    # Identify which process owns the I/O
    iotop -o -b -n 3

    # SMART health check (early failure warning)
    smartctl -a /dev/nvme0n1

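Tuning levers in production: set the I/O scheduler to none (pass-through) for NVMe devices; the in-kernel mq-deadline and bfq schedulers add overhead for work that NVMe hardware queues already handle. Verify with cat /sys/block/nvme0n1/queue/scheduler, where the active scheduler is shown in square brackets. For databases, direct I/O (O_DIRECT) bypasses the page cache and eliminates double-buffering; MySQL/InnoDB enables it with innodb_flush_method=O_DIRECT. PostgreSQL, by contrast, uses buffered I/O by default and relies on the OS page cache: effective_io_concurrency tunes its prefetch hints and wal_buffers sizes the in-memory WAL buffer, and neither setting turns on direct I/O.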
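The scheduler check can be scripted across every NVMe device on a node. This is a minimal sketch: the helper names active_scheduler and check_nvme_schedulers are hypothetical, and the only assumption about the kernel interface is that the scheduler file marks the active entry with square brackets.

```shell
#!/usr/bin/env bash
# Extract the active I/O scheduler from the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "[none] mq-deadline kyber bfq".
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report NVMe devices whose active scheduler is not "none".
# Prints nothing when every device is already configured (or none exist).
check_nvme_schedulers() {
  local f dev sched
  for f in /sys/block/nvme*/queue/scheduler; do
    [ -r "$f" ] || continue
    sched="$(active_scheduler "$(cat "$f")")"
    dev="${f#/sys/block/}"; dev="${dev%%/*}"
    [ "$sched" = "none" ] || echo "$dev: scheduler is $sched, expected none"
  done
}

active_scheduler "none [mq-deadline] kyber bfq"   # prints: mq-deadline
```

Calling check_nvme_schedulers before a load test makes a misconfigured device visible up front instead of surfacing as unexplained await during the run.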

The noisy-neighbour disk problem in Kubernetes: Container storage is often a shared network volume (EBS, GCE PD, Ceph). All pods on the same node compete for the same underlying IOPS budget. A misconfigured batch job writing 500 MB/s can saturate the shared storage and cause latency spikes in completely unrelated services. Kubernetes resources.limits cover CPU, memory, and ephemeral-storage capacity but not disk IOPS or throughput, so use a storage class with provisioned IOPS or dedicated node pools for I/O-intensive workloads.

Network Bottlenecks

Network saturation shows up as TCP retransmits, socket send-queue overflow, and NIC hardware drops — all invisible in application-level metrics unless you instrument them explicitly. On a 10 Gbit/s NIC, a single service sending 9+ Gbit/s of data crowds out every other pod on the same host.
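Link utilization is simple arithmetic on the interface byte counters. The sketch below assumes counters in bytes and link speed in Mbit/s, which is how Linux exposes them under /sys/class/net/<iface>/; the function name nic_util_pct and the sample numbers are illustrative only.

```shell
#!/usr/bin/env bash
# Link utilization (%) from two byte-counter readings.
# Args: bytes_before bytes_after interval_seconds link_speed_mbit
# nic_util_pct is a hypothetical helper, not a standard tool.
nic_util_pct() {
  awk -v b0="$1" -v b1="$2" -v secs="$3" -v mbps="$4" \
    'BEGIN { printf "%.1f\n", (b1 - b0) * 8 / secs / (mbps * 1000000) * 100 }'
}

# 11.25 GB sent in 10 s on a 10 Gbit/s link is 9 Gbit/s
nic_util_pct 0 11250000000 10 10000   # prints: 90.0
```

On a live host the counters would come from /sys/class/net/eth0/statistics/tx_bytes (or rx_bytes) and the speed from /sys/class/net/eth0/speed; virtual interfaces may not report a meaningful speed, so the link capacity may need to be supplied by hand.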

    # NIC throughput vs link capacity (USE: utilization)
    sar -n DEV 1 5
    # Look at rxkB/s and txkB/s; compare to the interface speed

    # TCP retransmit and error counters (USE: errors, network-level)
    nstat -az | grep -E 'TcpRetrans|TcpInErrs|TcpOutRsts'
    # Socket summary by state:
    ss -s

    # Per-socket queue depth, largest Send-Q first (USE: saturation)
    ss -tnp | awk 'NR > 1 {print $2, $3, $5}' | sort -k2,2nr | head -20
    # Output columns: Recv-Q, Send-Q, peer address; a Send-Q that stays large = kernel backpressure

    # BCC tcpretrans: trace retransmit events with source/dest
    /usr/share/bcc/tools/tcpretrans

    # Kernel drop counters (NIC ring buffer overflows)
    ethtool -S eth0 | grep -i drop

Critical kernel tunables for high-throughput services (applied via sysctl or /etc/sysctl.d/):

    # /etc/sysctl.d/99-perf.conf: production network tuning
    net.core.rmem_max = 134217728
    net.core.wmem_max = 134217728
    net.ipv4.tcp_rmem = 4096 87380 134217728
    net.ipv4.tcp_wmem = 4096 65536 134217728
    net.ipv4.tcp_congestion_control = bbr
    net.core.default_qdisc = fq
    net.ipv4.tcp_slow_start_after_idle = 0
    net.core.somaxconn = 65535
    net.ipv4.tcp_max_syn_backlog = 65535
    net.core.netdev_max_backlog = 250000

    # Apply immediately (idempotent, safe to re-run)
    sysctl --system

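TCP BBR (Bottleneck Bandwidth and RTT) is the congestion-control algorithm developed and deployed at scale by Google; it can achieve significantly better throughput than CUBIC on lossy or high-BDP paths. Enabling it requires kernel 4.9+ (any modern production Linux) with the tcp_bbr module available. Pair it with the fq (fair-queue) qdisc: BBR relies on packet pacing, which only fq provided on kernels 4.9 to 4.12. Since 4.13 TCP can pace internally, but fq remains the recommended pairing.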
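Before relying on the settings above, it is worth confirming that BBR is actually available and active on the node. The following sketch classifies the three relevant sysctl values; the function name bbr_status and its message strings are hypothetical.

```shell
#!/usr/bin/env bash
# Summarise BBR readiness from three sysctl values.
# Args: available congestion controls, current congestion control, default qdisc
# bbr_status and its output wording are illustrative, not a standard tool.
bbr_status() {
  local available="$1" current="$2" qdisc="$3"
  case " $available " in
    *" bbr "*) ;;
    *) echo "bbr unavailable"; return 0 ;;
  esac
  if [ "$current" != "bbr" ]; then
    echo "bbr available but not active (current: $current)"
  elif [ "$qdisc" = "fq" ]; then
    echo "bbr active with fq"
  else
    echo "bbr active, default qdisc is $qdisc"
  fi
}

bbr_status "reno cubic bbr" "bbr" "fq"       # prints: bbr active with fq
bbr_status "reno cubic" "cubic" "fq_codel"   # prints: bbr unavailable
```

On a live host the three arguments would come from `sysctl -n net.ipv4.tcp_available_congestion_control`, `sysctl -n net.ipv4.tcp_congestion_control`, and `sysctl -n net.core.default_qdisc`. If BBR is reported unavailable, loading the tcp_bbr module is usually the missing step.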

Connecting System Metrics to Load-Test Results

When a k6 run shows p99 latency climbing past 500 ms, the diagnostic path is:

  1. Check vmstat 1 — is the run-queue saturated (CPU) or is there swap I/O (memory)?
  2. Check iostat -xz 1 — is await elevated or avgqu-sz > 1?
  3. Check nstat -az | grep Retrans — are TCP retransmits rising?
  4. If all three are clean, the bottleneck is application-layer (thread pool exhaustion, GC, lock contention); escalate to flame graphs and application tracing.

Correlate with Prometheus: Export node_exporter metrics and build a USE dashboard in Grafana with panels for node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_vmstat_pswpin and node_vmstat_pswpout, node_disk_io_time_seconds_total, and node_network_transmit_drop_total. Prefer the swap-in/out counters over node_memory_SwapTotal_bytes, which reports only the configured swap size and does not change under pressure. Overlay your load-test timeline as a Grafana annotation so you can visually correlate traffic ramps with resource exhaustion events; this is standard operating procedure in SRE postmortems.
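The four-step diagnostic path in this section can be sketched as a single decision function. The name triage is hypothetical and the thresholds are assumptions: the 20 ms await limit is borrowed from the iostat alert guidance earlier in the lesson, and any non-zero swap or retransmit rate is treated as a signal. Real thresholds should come from a baseline of the system under test.

```shell
#!/usr/bin/env bash
# Pick the first resource that looks saturated, in the order CPU, memory,
# disk, network; fall through to "application" when all are clean.
# Args: run-queue depth, core count, swap pages/s (si+so), await in ms, TCP retransmits/s
# triage and its thresholds are illustrative assumptions.
triage() {
  awk -v runq="$1" -v cores="$2" -v swap="$3" -v await_ms="$4" -v retrans="$5" 'BEGIN {
    if (runq > cores) {
      print "cpu"
    } else if (swap > 0) {
      print "memory"
    } else if (await_ms > 20) {
      print "disk"
    } else if (retrans > 0) {
      print "network"
    } else {
      print "application"
    }
  }'
}

triage 8 4 0 0.4 0    # prints: cpu
triage 2 4 0 35.0 0   # prints: disk
triage 2 4 0 0.4 0    # prints: application
```

The function reports only the first match, which mirrors the checklist order; during a real incident several resources can be saturated at once, so each signal is still worth checking individually.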