Networking Essentials for DevOps

Network Debugging Toolbox

18 min Lesson 9 of 30

Every production network incident — a microservice that stops responding, a deployment that breaks latency, a VPN that drops packets silently — demands a disciplined diagnostic methodology. Senior engineers do not guess; they use a repeatable ladder of tools that narrows the problem from "something is wrong" to "packet loss at hop 4 between us-east-1a and the NAT gateway" in under five minutes. This lesson arms you with that ladder: ping, traceroute / mtr, ss, and tcpdump.

Mental model first. Before running any command, form a hypothesis: Is it a routing problem? A firewall drop? A port that is not open? A TLS failure? Each tool answers a different question. Running tcpdump when the problem is a missing DNS record wastes precious minutes during an incident.

Layer 1 of the Ladder — ping: Is the Host Reachable?

ping sends ICMP Echo Requests and measures round-trip time (RTT). It answers: Can I reach this IP at all, and is the path stable? In production environments ICMP is often rate-limited or blocked by security groups, so a failed ping does not conclusively prove a host is down — but a working ping with rising RTT or intermittent packet loss is a strong signal of a network-layer problem.

# Basic reachability — send 10 packets, report summary
ping -c 10 10.0.1.45

# Flood ping (requires root) — expose burst packet loss
sudo ping -f -c 1000 10.0.1.45

# Specify packet size with Don't Fragment set (Linux: -M do) to detect MTU-related drops
ping -M do -s 1400 -c 5 10.0.1.45

# Interpret the output
# rtt min/avg/max/mdev = 0.412/0.501/1.243/0.198 ms
# mdev (RTT deviation) large relative to avg  →  jitter problem
# packet loss > 0%  →  congestion, a faulty link, ICMP rate limiting, or a filtering device
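
The interpretation rules above can be applied mechanically in a health check. The following is a minimal sketch, assuming the iputils summary format shown above (macOS and BSD ping print "round-trip" and "stddev" instead); the helper name ping_summary is hypothetical.

```shell
# ping_summary: read iputils-style ping output on stdin and print
# "loss=<percent> avg=<ms> mdev=<ms>". Hypothetical helper for illustration.
ping_summary() {
  awk '
    /packet loss/ {
      # e.g. "10 packets transmitted, 10 received, 0% packet loss, time 9012ms"
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
    }
    /^rtt / {
      # e.g. "rtt min/avg/max/mdev = 0.412/0.501/1.243/0.198 ms"
      split($4, v, "/"); avg = v[2]; mdev = v[4]
    }
    END { printf "loss=%s avg=%s mdev=%s\n", loss, avg, mdev }
  '
}
```

Typical use would be ping -c 10 10.0.1.45 | ping_summary. With 100% loss there is no rtt line, so avg and mdev are printed empty.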

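Packet size matters. The default ping carries a 56-byte payload (64 bytes of ICMP), far too small to hit an MTU limit, so it misses MTU black holes. A 1400-byte payload sent with the Don't Fragment bit set (ping -M do on Linux) approaches the standard 1500-byte Ethernet MTU and exposes drops that only appear under real workloads, when packets are full-sized. If large pings fail while small ones succeed, suspect an MTU mismatch, which is common at VPN tunnel boundaries and between cloud regions.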
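
When large pings fail and small ones pass, the largest payload that still gets through can be found by binary search. This is a sketch under stated assumptions: Linux iputils ping (the -M do flag), IPv4, and a host that answers ICMP. The function names probe_ok and max_payload are hypothetical.

```shell
# probe_ok SIZE HOST: succeeds if one ping with SIZE payload bytes and the
# Don't Fragment bit set gets a reply. -M do is Linux iputils syntax.
probe_ok() {
  ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# max_payload HOST [LOW] [HIGH]: binary-search the largest payload that passes.
# Prints the payload size; returns 1 if even LOW fails (host unreachable or ICMP blocked).
max_payload() {
  host=$1; lo=${2:-56}; hi=${3:-1472}
  probe_ok "$lo" "$host" || { echo "no reply at $lo bytes" >&2; return 1; }
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if probe_ok "$mid" "$host"; then lo=$mid; else hi=$(( mid - 1 )); fi
  done
  echo "$lo"
}
```

For IPv4 the path MTU is the printed payload plus 28 bytes (20-byte IP header and 8-byte ICMP header), so a result of 1472 corresponds to a clean 1500-byte path.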

Layer 2 — traceroute / mtr: Where Does the Path Break?

traceroute (Linux: traceroute or tracepath; macOS: traceroute) probes the route hop by hop by sending packets with incrementing TTL values. Each router decrements the TTL, and the router at which it reaches zero drops the packet and returns an ICMP Time Exceeded message, revealing its IP and the RTT to it. Use it when ping shows packet loss or high RTT but you do not know where in the path the problem lives.

mtr (Matt's Traceroute) combines ping and traceroute into a continuous real-time view. It is the superior production tool because it shows per-hop packet loss over time, which exposes transient faults invisible to a one-shot traceroute.

# Standard traceroute (UDP probes, port 33434+)
traceroute 10.0.1.45

# Use ICMP probes (bypasses firewalls that block UDP)
traceroute -I 10.0.1.45

# Use TCP SYN probes on port 443 (passes most firewalls; needs root on Linux)
sudo traceroute -T -p 443 10.0.1.45

# mtr report mode (non-interactive): run 100 cycles, then print a summary
mtr --report --report-cycles 100 10.0.1.45

# mtr with TCP on port 443 (best for cloud environments)
mtr --tcp --port 443 --report --report-cycles 50 api.internal.example.com

# Reading mtr output:
# Loss%:  probes to this hop that got no reply (forward loss, return loss, or rate limiting)
# Avg:    average RTT to this hop
# StDev:  RTT standard deviation (jitter); more than about 5 ms on a backbone path is concerning
# Loss at intermediate hops but NOT at final hop = ICMP rate-limiting (benign)
# Loss at intermediate hop AND all hops beyond = real loss at that hop

Misleading intermediate loss. Many backbone routers deprioritize ICMP and will show 20-100% loss on an intermediate hop even when the path beyond is clean. The rule: if the final destination shows 0% loss, ignore loss at intermediate hops. Only worry when you see loss starting at a hop that persists through all subsequent hops.
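
The rule above is easy to misapply under pressure, so it can help to encode it. The sketch below assumes the default mtr --report layout, in which each hop line starts with a field such as 1.|-- and Loss% is the third field; the helper name mtr_verdict is hypothetical.

```shell
# mtr_verdict: read `mtr --report` output on stdin and apply the rule above.
# Exit status: 0 = destination clean, 1 = persistent loss, 2 = nothing parsed.
mtr_verdict() {
  awk '
    $1 ~ /^[0-9]+\.\|--$/ {
      n++
      hop[n] = $2
      loss[n] = $3; sub(/%/, "", loss[n])
    }
    END {
      if (n == 0) { print "no hops parsed"; exit 2 }
      if (loss[n] + 0 == 0) {
        print "path clean: intermediate loss is likely ICMP rate limiting"
        exit 0
      }
      # Walk back from the destination while loss persists.
      first = n
      while (first > 1 && loss[first - 1] + 0 > 0) first--
      printf "real loss likely begins at hop %d (%s)\n", first, hop[first]
      exit 1
    }
  '
}
```

A possible invocation is mtr --report --report-cycles 100 10.0.1.45 | mtr_verdict. The verdict is a starting hypothesis, not proof: asymmetric return paths can still make a healthy hop look lossy.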

Layer 3 — ss: Is the Process Listening on the Right Port?

ss (socket statistics) is the modern replacement for netstat. It queries the kernel over netlink instead of parsing /proc/net, which makes it noticeably faster on hosts with many sockets. Use ss to answer: Is the process actually bound? Is it listening on the right interface? Are connections in TIME_WAIT piling up?

# Show all listening TCP sockets with PID (run as root to see processes owned by other users)
sudo ss -tlnp

# Show all established connections to port 5432 (Postgres)
ss -tnp state established '( dport = :5432 or sport = :5432 )'

# Count sockets by state — detect connection exhaustion
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Show UNIX domain sockets (inter-process on same host)
ss -xlp

# Watch socket counts in real time (refresh every 2s)
watch -n 2 'ss -s'

# Key states to know (ss prints them as LISTEN, ESTAB, TIME-WAIT, CLOSE-WAIT, SYN-RECV):
# LISTEN:       process is bound and accepting connections
# ESTABLISHED:  active connection
# TIME_WAIT:    closed connection, kernel holding the socket for 2×MSL (about 60 s on Linux)
# CLOSE_WAIT:   remote closed; local process has NOT called close(), usually an app bug
# SYN_RECV:     SYN received and SYN-ACK sent, waiting for the final ACK; thousands suggest a SYN flood

CLOSE_WAIT is almost always an application bug. It means the remote side closed the TCP connection but the local process never called close() on its socket. You will typically see this with misbehaving connection pools or services that leak file descriptors. ss -tnp state close-wait pinpoints the PID immediately.
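
When several services share a host, ranking processes by their number of CLOSE_WAIT sockets points at the leaking one. This is a minimal sketch that parses the users:(("name",pid=N,fd=M)) column that ss -p prints; the helper name close_wait_by_pid is hypothetical.

```shell
# close_wait_by_pid: read `ss -tnp state close-wait` output on stdin and
# print "<count> <pid>" lines, busiest process first.
# A socket shared by several processes is counted once per process.
close_wait_by_pid() {
  grep -o 'pid=[0-9]*' | cut -d= -f2 | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
```

A possible invocation is sudo ss -tnp state close-wait | close_wait_by_pid. Without root, ss omits the process column for sockets owned by other users, and those sockets will not be counted.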

Layer 4 — tcpdump: What Are the Bytes on the Wire?

tcpdump captures raw packets at the network interface. It is the tool of last resort — not because it is hard to use, but because it produces the highest-fidelity answer at the cost of noise. Use it when you need to confirm: Is traffic actually leaving this host? What are the exact headers? Is the TLS handshake completing? Is the client sending what we expect?

# Capture up to 10,000 packets of TCP traffic on eth0 to/from port 443, save to file
sudo tcpdump -i eth0 -c 10000 -w /tmp/capture.pcap tcp port 443

# Inspect the capture offline (on your laptop with Wireshark or tcpdump)
tcpdump -r /tmp/capture.pcap -nn -A | head -100

# Watch HTTP requests in real time (plaintext only, not HTTPS)
sudo tcpdump -i any -nn -A 'tcp port 80 and (tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420)'

# Capture SYN packets — detect connection attempts
sudo tcpdump -i eth0 'tcp[tcpflags] & (tcp-syn) != 0 and port 8080'

# Capture RST packets — detect rejected connections
sudo tcpdump -i eth0 'tcp[tcpflags] & (tcp-rst) != 0'

# Minimal output: no host or port name resolution (-nn), line-buffered (-l)
sudo tcpdump -i eth0 -nn -l port 5432 | head -50

# On a running Kubernetes pod (assumes a Debian-based image, a root user, and package-repo access)
kubectl exec -it my-pod -- sh -c "apt-get update && apt-get install -y tcpdump && \
  tcpdump -i eth0 -nn -c 100 port 8080"
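
Once a capture exists, a quick summary often answers the question faster than scrolling through packets. The sketch below counts reset packets per source address, which shows who is rejecting connections. It assumes IPv4 and the default text layout of tcpdump -nn reading from a file (timestamp, IP, source, >, destination, Flags); captures taken with -i any may add interface fields on newer versions. The helper name rst_by_source is hypothetical.

```shell
# rst_by_source: read `tcpdump -nn` text output on stdin and print
# "<count> <source-ip>" for packets whose TCP flags include R (reset).
rst_by_source() {
  awk '
    $2 == "IP" && /Flags \[[^]]*R/ {
      src = $3
      sub(/\.[0-9]+$/, "", src)   # drop the trailing .port
      n[src]++
    }
    END { for (s in n) print n[s], s }
  ' | sort -rn
}
```

A possible invocation is tcpdump -r /tmp/capture.pcap -nn | rst_by_source. A server address at the top of the list usually means nothing is listening on the target port or the application is closing connections abruptly.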

Putting It Together: The Diagnosis Decision Flow

The diagram below captures the structured methodology that experienced SREs follow. Start at the top, answer each question with a specific tool, and exit as soon as the answer points to a root cause.

Network debugging decision flow: ping → mtr → ss → tcpdump, narrowing the failure layer at each step.

Production Practices at Scale

At large-scale companies, individual tool invocations are wrapped into runbooks and automated health checks. A few patterns worth internalizing:

Capture first, analyze later. During an active incident, start a bounded capture immediately, for example timeout 300 tcpdump -w /tmp/incident-$(date +%s).pcap, and investigate the file once the incident is mitigated. Packet captures are forensic gold, but a file written inside a pod disappears when the pod restarts, so copy it somewhere durable promptly.
Baseline your RTT. Run mtr --report against your critical dependencies weekly during low-traffic periods. When an incident occurs, you will know whether 5 ms to your database is normal or alarming.
Use ss before assuming a firewall. Half of "firewall" tickets are actually services that crashed and stopped listening. ss -tlnp | grep 8080 takes two seconds and eliminates that hypothesis immediately.
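Prefer -T -p 443 for cloud traceroutes. Cloud security groups typically deny inbound traffic that is not explicitly allowed, which in practice often includes ICMP and the high UDP ports that traceroute uses by default. A TCP traceroute on a port your service actually uses is therefore usually the most reliable path probe in AWS, GCP, and Azure.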
Never run tcpdump in production without a -c count or a time limit. On a busy interface, an unbounded capture fills the disk in minutes. Use -c 10000 or wrap the command with timeout 30 tcpdump ....
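
The capture-first and always-bounded practices above can be combined in one small wrapper. This is a sketch, not a vetted runbook step: the function name bounded_capture and the DRY_RUN variable are hypothetical, and it assumes sudo, GNU timeout, and tcpdump are installed on the host.

```shell
# bounded_capture IFACE SECONDS COUNT [FILTER...]: capture to a timestamped file
# and stop after SECONDS or COUNT packets, whichever comes first.
# Set DRY_RUN=1 to print the command instead of running it.
bounded_capture() {
  iface=$1; secs=$2; count=$3; shift 3
  out=/tmp/incident-$(date +%s).pcap
  ${DRY_RUN:+echo} sudo timeout "$secs" tcpdump -i "$iface" -nn -c "$count" -w "$out" "$@"
}
```

For example, bounded_capture eth0 30 10000 port 5432 captures Postgres traffic for at most 30 seconds or 10,000 packets. Running it first with DRY_RUN=1 shows the exact command, which is a cheap sanity check during an incident.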

The 5-layer mental stack for debugging. (1) Is the host reachable? — ping. (2) Where in the path? — mtr. (3) Is the port open on this host? — ss. (4) Are bytes reaching the process? — tcpdump. (5) Did the application-layer protocol succeed? — application logs / curl -v. Work down this stack in order. Skipping levels wastes time.