Performance & Load Testing

System-Level Bottlenecks

18 min Lesson 7 of 28

Load tests surface symptoms — high latency, dropped requests, queue saturation. Diagnosing the root cause means moving one layer deeper: the operating system and hardware that everything runs on. The USE method (Utilization, Saturation, Errors), coined by Brendan Gregg, gives a disciplined checklist for every physical resource: CPU, memory, disk I/O, and network. Apply it in that order whenever a load test reveals unexplained degradation.

The USE Method in Practice

For each resource, three metrics matter:

Utilization — what fraction of available capacity is busy (0–100 %).
Saturation — work queued because the resource is already at capacity (e.g., run-queue depth, disk queue depth).
Errors — hardware or driver-level faults: ECC memory corrections, TCP retransmits, NIC drops.

High utilization alone is not alarming; saturation always is. A CPU at 95 % with zero run-queue lag is merely busy. A CPU at 70 % with a persistent run-queue of 8 on a 4-core machine is saturated and will produce erratic p99 latency spikes.
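The busy-versus-saturated distinction can be sketched as a small shell helper. The function name `cpu_verdict`, its argument order, and the 90 % "busy" cut-off are illustrative assumptions, not part of any tool; the only rule taken from the text is that a run queue deeper than the core count means saturation.

```shell
# Classify a CPU sample: saturation is judged by run-queue depth versus
# core count, not by utilization alone.
# Usage: cpu_verdict <utilization_pct> <run_queue_depth> <core_count>
cpu_verdict() {
    awk -v util="$1" -v runq="$2" -v cores="$3" 'BEGIN {
        if (runq + 0 > cores + 0)   print "saturated"
        else if (util + 0 >= 90)    print "busy"       # assumed cut-off
        else                        print "ok"
    }'
}

cpu_verdict 95 0 4   # high utilization, nothing queued
cpu_verdict 70 8 4   # moderate utilization, 8 runnable tasks on 4 cores
```

In live use the run-queue depth would come from the r column of vmstat and the core count from nproc.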

Why USE before profiling? Application profiling (flame graphs, tracing) is expensive and narrow. USE is a 30-second checklist that rules out an entire class of resource before you instrument anything. Run USE first; profile only the resource that is saturated or throwing errors.

CPU Bottlenecks

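CPU saturation shows up as a load average that stays above the core count and a run queue (the r column in vmstat) deeper than the number of cores. High us + sy in top is utilization, not saturation, and a non-zero wa (I/O wait) means the CPU is idle waiting on storage, which points at disk rather than CPU. The canonical quick-look commands: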

# 1-second snapshots of all CPU states
mpstat -P ALL 1 5

# Run-queue depth (r column) and CPU state breakdown (USE: saturation); load average comes from uptime
vmstat 1 10

# Who is consuming the CPU (top 10 threads; -L lists threads, not just processes)
ps -eLo pid,tid,pcpu,psr,comm --sort=-pcpu | head -11

# Scheduler (run-queue) latency, system-wide; requires BCC tools
/usr/share/bcc/tools/runqlat 10 1
# Prints one histogram after 10 s: how long tasks wait in the run-queue before getting a CPU

At large scale, scheduler affinity matters. A JVM whose threads run on one NUMA node while its heap lives on another pays a cross-node penalty, on the order of 100 ns extra, on every memory access that misses cache. Verify with numastat -c <pid> and set numactl --cpunodebind=0 --membind=0 for latency-sensitive services. For containerised workloads, confine the cpuset cgroup (cpuset.cpus and cpuset.mems) to a single NUMA node.

Steal time in VMs: %st in top is CPU cycles taken by the hypervisor. Sustained steal > 5 % indicates noisy neighbours on the same physical host — escalate to the cloud provider or migrate to a dedicated host. Steal does not appear in your application metrics; it silently inflates p99.
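Steal can also be measured without top by sampling the aggregate cpu line of /proc/stat twice. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name `steal_pct` is hypothetical.

```shell
# Percentage of CPU time stolen by the hypervisor between two samples of
# the aggregate "cpu" line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
# Usage: steal_pct "<first cpu line>" "<second cpu line>"
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (steal - s0) / dt
            else        print "0.0"
        }'
}

# Live use: sample twice, five seconds apart.
# a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
# steal_pct "$a" "$b"
```

Guest time is deliberately left out of the total because the kernel already accounts for it inside user time.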

Memory Bottlenecks

Memory pressure in Linux manifests through two separate mechanisms: the OOM killer (hard limit) and swapping / page reclaim (soft saturation). Reclaim and swap activity degrade p99 latency long before the OOM killer fires or swap space fills.

# Overall memory picture
free -h

# Per-process RSS, virtual size, and share of RAM (per-process swap: VmSwap in /proc/<pid>/status)
ps -eo pid,comm,rss,vsz,pmem --sort=-rss | head -15

# Swap activity: non-zero si/so means pages are moving to or from swap
vmstat 1 | awk '{print $3, $4, $7, $8}'
# columns: swpd, free, si(swap-in), so(swap-out)

# THP (Transparent Huge Pages) stalls
grep -E 'thp_fault_alloc|thp_collapse_alloc_failed' /proc/vmstat

# USE: errors — ECC memory corrections (requires ipmitool or vendor agent)
ipmitool sel list | grep -i "correctable\|uncorrectable"

For production services: disable swap on latency-sensitive nodes (swapoff -a + remove swap entries from /etc/fstab). A page fault served from swap on a modern SSD still costs tens to hundreds of microseconds, against roughly 100 ns for a DRAM access. Where swap must stay enabled, Kubernetes nodes that run latency-critical pods should set vm.swappiness to 0 or 1 in the node's sysctl profile; memory.swappiness is the per-cgroup (cgroup v1) counterpart, not a sysctl.
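On nodes where swap is still enabled, a quick way to confirm whether it is actually being used during a test is to count the vmstat samples with swap traffic. This is a minimal sketch; the function name `swap_active_samples` is hypothetical, and it assumes the default vmstat column layout in which si and so are columns 7 and 8.

```shell
# Count vmstat samples that show swap traffic (si or so non-zero).
# Reads `vmstat <interval> <count>` output on stdin; skips the two header
# lines and the first data row, which reports averages since boot.
swap_active_samples() {
    awk 'NR > 3 && ($7 + $8) > 0 { n++ } END { print n + 0 }'
}

# Live use: eleven samples, of which the last ten are counted.
# vmstat 1 11 | swap_active_samples
```

Any result above zero during a load test suggests that some of the measured latency is page-fault time rather than application time.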

Memory fragmentation is a subtler source of latency spikes. When no contiguous huge-page-sized region is free, a THP allocation can force direct compaction, which can stall memory-hungry services (Redis, Java with large heaps) for milliseconds. Monitor thp_collapse_alloc_failed and compact_stall in /proc/vmstat and set /sys/kernel/mm/transparent_hugepage/enabled to madvise so only explicitly opted-in allocations use THP.
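The THP sysfs file lists every option and marks the active one in square brackets, for example "always [madvise] never". The helper below is a small sketch for reading that format in a script; the name `active_choice` is hypothetical.

```shell
# Extract the active option from a sysfs selector string such as
# "always [madvise] never"; the kernel marks the active value in brackets.
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    mode=$(active_choice "$(cat "$thp_file")")
    [ "$mode" = "madvise" ] || echo "THP mode is '$mode'; consider madvise"
fi
```

The same bracket convention is used by /sys/block/<device>/queue/scheduler, so the helper can be reused when checking the I/O scheduler.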

Figure: the USE method checklist, covering Utilization, Saturation, and Errors for each of the four system resource domains.

Disk I/O Bottlenecks

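Disk I/O saturation is lethal to databases and write-heavy microservices. The key metric is the average queue depth from iostat (avgqu-sz, renamed aqu-sz in newer sysstat releases). On a single spinning disk, a persistent value above 1 means requests are waiting. NVMe devices service many requests in parallel, so for them a rising await is the more reliable sign of a backlog. At big-tech scale, even short I/O spikes are noticeable because heavy buffered I/O pressures the kernel page cache, which evicts hot data and compounds the problem.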

# Real-time per-device stats (1-second interval, 10 samples)
iostat -xz 1 10

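# Key columns to watch:
# %util    : share of time the device had I/O in flight (near 100% = saturated HDD; SSD/NVMe serve requests in parallel, so 100% can still leave headroom)
# await    : average I/O latency in ms, split into r_await/w_await in newer sysstat (SSD healthy: <1 ms; HDD: <10 ms; alert if >20 ms under load)
# avgqu-sz : queue depth, shown as aqu-sz in newer sysstat; persistently >1 on a single disk = saturation beginning
# r/s, w/s : IOPS; compare against device spec sheet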

# BCC biolatency — latency histogram for block I/O
/usr/share/bcc/tools/biolatency -D 10

# Identify which process owns the I/O
iotop -o -b -n 3

# SMART health check (early failure warning)
smartctl -a /dev/nvme0n1
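Because the queue-depth column has a different name and position across sysstat versions, a script that watches it is safer locating the column by header name. The following is a sketch under that assumption; `queued_devices` and its default threshold of 1 are hypothetical choices.

```shell
# Flag devices whose queue depth exceeds a threshold in `iostat -x` output.
# The queue-depth column is located by header name because its position and
# name differ between sysstat versions (avgqu-sz vs aqu-sz).
# Usage: LC_ALL=C iostat -xz 1 2 | queued_devices [threshold]
queued_devices() {
    awk -v limit="${1:-1}" '
        $1 == "avg-cpu:" { col = 0; next }       # new report: forget old header
        $1 == "Device" || $1 == "Device:" {
            col = 0
            for (i = 1; i <= NF; i++)
                if ($i == "aqu-sz" || $i == "avgqu-sz") col = i
            next
        }
        col && NF >= col && $col + 0 > limit + 0 { print $1, $col }
    '
}
```

The first report iostat prints covers the time since boot, so in practice only devices that also appear in the second and later reports reflect the load test itself.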

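Tuning levers in production: set the I/O scheduler to none (pass-through) for NVMe devices, because the in-kernel mq-deadline and bfq schedulers add reordering overhead that NVMe hardware queues make unnecessary. Verify with cat /sys/block/nvme0n1/queue/scheduler; the active scheduler is shown in brackets. For databases, direct I/O (O_DIRECT) bypasses the page cache and eliminates double-buffering: MySQL/InnoDB enables it with innodb_flush_method=O_DIRECT, whereas PostgreSQL uses buffered I/O by default, and its effective_io_concurrency and wal_buffers settings tune prefetching and WAL buffering rather than enabling direct I/O.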

The noisy-neighbour disk problem in Kubernetes: Container storage is often a shared network volume (EBS, GCE PD, Ceph). All pods on the same node compete for the same underlying IOPS budget. A misconfigured batch job writing 500 MB/s can saturate the shared storage and cause latency spikes in completely unrelated services. Kubernetes resources.limits covers CPU, memory, and ephemeral-storage capacity but not disk IOPS or throughput, so rely on a storage class with provisioned IOPS or on dedicated node pools for I/O-intensive workloads.

Network Bottlenecks

Network saturation shows up as TCP retransmits, socket send-queue overflow, and NIC hardware drops — all invisible in application-level metrics unless you instrument them explicitly. On a 10 Gbit/s NIC, a single service sending 9+ Gbit/s of data crowds out every other pod on the same host.

# NIC throughput vs link capacity (USE: utilization)
sar -n DEV 1 5
# Look at rxkB/s and txkB/s; compare to interface speed

# TCP retransmit counters (USE: errors, network-level); values are cumulative, so compare two readings
nstat -az | grep -E 'TcpRetrans|TcpInErrs|TcpOutRsts'
# Or with ss:
ss -s

# Per-socket queue depth, largest send queues first (saturation: send queue backlog)
ss -tn | awk 'NR > 1 {print $2, $3, $5}' | sort -k2,2nr | head -20
# Output: Recv-Q, Send-Q, peer address; a persistently non-zero Send-Q = data the peer has not yet acknowledged (backpressure)

# BCC tcpretrans — trace retransmit events with source/dest
/usr/share/bcc/tools/tcpretrans

# Kernel drop counters (NIC ring buffer overflows)
ethtool -S eth0 | grep -i drop
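Raw retransmit counters are easier to judge as a share of all segments sent. The sketch below derives a rough percentage from the TcpOutSegs and TcpRetransSegs counters in nstat output; the function name `retrans_pct` is hypothetical, and no particular alert threshold is implied.

```shell
# Retransmitted segments as a percentage of all segments sent, computed
# from `nstat -az` output on stdin (counter name in column 1, value in 2).
retrans_pct() {
    awk '
        $1 == "TcpOutSegs"     { out = $2 }
        $1 == "TcpRetransSegs" { re  = $2 }
        END {
            if (out > 0) printf "%.2f\n", 100 * re / out
            else         print "n/a"
        }'
}

# Live use:
# nstat -az | retrans_pct
```

With -a the counters are absolute since boot, so the result is a lifetime average; taking a reading before and after the load test and comparing them isolates the test window.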

Critical kernel tunables for high-throughput services (applied via sysctl or /etc/sysctl.d/):

# /etc/sysctl.d/99-perf.conf — production network tuning
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
net.ipv4.tcp_slow_start_after_idle = 0
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.core.netdev_max_backlog = 250000

# Apply immediately (idempotent — safe to re-run)
sysctl --system

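TCP BBR (Bottleneck Bandwidth and Round-trip propagation time) is the congestion-control algorithm deployed by Google at scale; it often achieves markedly better throughput than CUBIC on lossy or high-BDP paths. Enabling it requires kernel 4.9+ (any modern production Linux) with the tcp_bbr module available. Pair it with the fq (fair-queue) qdisc: before kernel 4.13 BBR depended on pacing that only fq provided, and fq remains the recommended pairing on later kernels, which can fall back to TCP-internal pacing.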
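The term high-BDP refers to the bandwidth-delay product, the amount of data that must be in flight to keep a path full. A short worked calculation shows how it relates to socket buffer ceilings; the function name `bdp_bytes` and the example link speeds and RTTs are illustrative assumptions.

```shell
# Bandwidth-delay product in bytes: BDP = bandwidth (bit/s) * RTT (s) / 8.
# With bandwidth in Gbit/s and RTT in ms this reduces to
# gbit * rtt_ms * 125000, because 1e9 / 1000 / 8 = 125000.
# Usage: bdp_bytes <link_gbit_per_s> <rtt_ms>
bdp_bytes() {
    awk -v gbit="$1" -v rtt_ms="$2" 'BEGIN {
        printf "%d\n", gbit * rtt_ms * 125000
    }'
}

bdp_bytes 10 100   # 10 Gbit/s path with 100 ms RTT
bdp_bytes 10 1     # same link speed inside a data centre, 1 ms RTT
```

The first case comes to 125,000,000 bytes, which fits just under the 134217728-byte (128 MiB) maximum in the sysctl fragment above. Whether that is how the value was chosen is an assumption; the point is that a buffer ceiling smaller than the BDP caps single-connection throughput.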

Connecting System Metrics to Load-Test Results

When a k6 run shows p99 latency climbing past 500 ms, the diagnostic path is:

Check vmstat 1: is the run-queue saturated (CPU) or is there swap I/O (memory)?
Check iostat -xz 1: is await elevated or avgqu-sz (aqu-sz in newer sysstat) > 1?
Check nstat -az | grep Retrans: are TCP retransmits rising between readings?
If all three are clean, the bottleneck is most likely application-layer (thread pool exhaustion, GC, lock contention), so escalate to flame graphs and application tracing.
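The diagnostic path above can be sketched as a single decision function. Everything here is illustrative: the name `triage`, the argument order, and the retransmit threshold are assumptions, while the 20 ms await limit reuses the alert level given in the iostat notes. The inputs are numbers an operator has already read off vmstat, iostat, and nstat.

```shell
# Thresholds can be overridden from the environment.
AWAIT_LIMIT_MS=${AWAIT_LIMIT_MS:-20}   # from the iostat guidance in this lesson
RETRANS_LIMIT=${RETRANS_LIMIT:-10}     # hypothetical: retransmits per second

# Map the three checks to the layer to investigate next.
# Usage: triage <run_queue> <cores> <swap_pages_per_s> <await_ms> <retrans_per_s>
triage() {
    awk -v rq="$1" -v cores="$2" -v swap="$3" -v aw="$4" -v re="$5" \
        -v aw_max="$AWAIT_LIMIT_MS" -v re_max="$RETRANS_LIMIT" 'BEGIN {
        if (rq + 0 > cores + 0)        print "cpu"
        else if (swap + 0 > 0)         print "memory"
        else if (aw + 0 > aw_max + 0)  print "disk"
        else if (re + 0 > re_max + 0)  print "network"
        else                           print "application"
    }'
}

triage 8 4 0 1 0      # run queue deeper than core count
triage 2 4 0 35 0     # await above the limit
triage 1 4 0 0.5 0    # all system checks clean
```

The function reports only the first failing check, in the same order as the list, so a result of "cpu" does not rule out a simultaneous disk or network problem.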

Correlate with Prometheus: Export node_exporter metrics and build a USE dashboard in Grafana with panels for node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_vmstat_pswpin and node_vmstat_pswpout, node_disk_io_time_seconds_total, and node_network_transmit_drop_total. (node_memory_SwapTotal_bytes is the configured swap size and does not change under load, so it is not useful as a pressure signal.) Overlay your load-test timeline as a Grafana annotation so you can visually correlate traffic ramps with resource exhaustion events; this is common practice in SRE postmortems.