Networking Essentials for DevOps

Load Balancing

A single server can handle only so many concurrent connections before latency climbs and requests time out. Load balancing solves this by distributing incoming traffic across a pool of backend servers (called upstreams or origin servers). At big-tech scale, load balancers are not just performance tools — they are the primary mechanism for zero-downtime deployments, fault isolation, and horizontal scaling.

Load balancing vs. service discovery: These are complementary. Service discovery (Consul, Kubernetes endpoints) answers "which instances exist right now?" Load balancing answers "which one should get this request?" In Kubernetes, a Service object combines both: kube-proxy maintains iptables/IPVS rules that load-balance across healthy pod IPs automatically.

L4 vs L7: Where in the Stack Does the Decision Happen?

The most important design choice is whether your load balancer operates at Layer 4 (Transport) or Layer 7 (Application).

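Layer 4 load balancing works at the TCP/UDP level. The balancer forwards byte streams without inspecting the payload: it sees only IP addresses and ports, picks a backend, and relays traffic to it. A proxy-style L4 balancer (HAProxy in TCP mode, for example) does this by holding two TCP connections, one from the client to the LB (the frontend connection) and one from the LB to the backend (the backend connection), and stitching them together. Packet-forwarding balancers such as IPVS instead rewrite or encapsulate packets and never terminate the connection themselves. Either way, L4 balancing is extremely fast and works for any TCP/UDP protocol.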

Layer 7 load balancing terminates the application protocol (HTTP, gRPC, WebSocket). The balancer parses the request, at minimum the request line and headers, before forwarding it, which allows decisions based on URL path, Host header, cookies, query parameters, or (if the proxy buffers it) the request body. This extra power comes with extra latency (roughly 1–5 ms per hop at high load, though this varies with the proxy, TLS settings, and buffering) and extra complexity.

L4 vs L7 load balancers — L4 forwards TCP streams blindly; L7 reads the full HTTP request and can route by URL, host, or cookie.
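The difference in what each layer can see is easiest to show as two routing functions. The following is a minimal sketch, not how any particular proxy is implemented; the ROUTES table, the pool names, and the Request type are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """The parts of an HTTP request an L7 balancer has available after parsing."""
    host: str
    path: str


# Hypothetical routing table: (path prefix, pool name). Longest prefix wins.
ROUTES = [
    ("/payments", "payments_pool"),
    ("/", "api_pool"),
]


def choose_pool_l7(req: Request) -> str:
    """L7: inspect the parsed request and route by path prefix."""
    matches = [(prefix, pool) for prefix, pool in ROUTES if req.path.startswith(prefix)]
    return max(matches, key=lambda m: len(m[0]))[1]


def choose_pool_l4(dst_port: int) -> str:
    """L4: only addresses and ports are visible, so the port is the routing key."""
    return {443: "tls_passthrough_pool"}.get(dst_port, "default_pool")
```

An L4 balancer cannot tell /payments from /users because both arrive on port 443; only the L7 function can separate them.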

In practice, production stacks often chain both: an L4 load balancer (AWS NLB, GCP TCP Proxy) at the network edge for raw throughput, stable IPs, and TCP passthrough, feeding into L7 reverse proxies (nginx, Envoy, HAProxy) that do TLS termination and content-based routing.

Load Balancing Algorithms

The algorithm determines which backend receives the next request. The wrong choice causes hot-spots even in a "balanced" cluster.

Round Robin — requests rotate through the pool in order. Simple and fair when all backends are identical and requests have uniform cost. Default in most proxies.
Weighted Round Robin — backends get a fraction of traffic proportional to an assigned weight. Use during canary deployments (10% to new version, 90% to stable) or when instances have different hardware capacity.
Least Connections — the next request goes to the backend with the fewest active connections. Better than round robin for long-lived connections (WebSocket, gRPC streaming, database connections) where one slow backend can accumulate a large queue.
IP Hash / Sticky Sessions — hashes the client IP (or a cookie) to always map a client to the same backend. Required for stateful applications that store session data in memory. A significant anti-pattern in stateless microservices because it prevents even load distribution when a single heavy client appears.
Random with Two Choices (Power of Two) — pick two backends at random and forward to the one with fewer connections. Achieves near-optimal load distribution with O(1) overhead. Used internally by Envoy and many large-scale systems.
Least Response Time — routes to the backend with the lowest average latency. Useful when backends have heterogeneous performance (mixed instance types).
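Several of the algorithms above fit in a few lines each. The sketch below is illustrative only: the Backend class and its fields are hypothetical, and the weighted variant uses the "smooth" weighted round robin technique (interleaving picks rather than sending bursts), which is one common way to implement weights, not the only one.

```python
import itertools
import random


class Backend:
    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.active = 0    # in-flight connections, maintained by the caller
        self.current = 0   # running counter used by smooth weighted round robin


def round_robin(backends):
    """Return an endless iterator that rotates through the pool in order."""
    return itertools.cycle(backends)


def smooth_weighted_round_robin(backends):
    """Pick one backend; over sum(weights) picks each is chosen weight times."""
    total = sum(b.weight for b in backends)
    for b in backends:
        b.current += b.weight
    chosen = max(backends, key=lambda b: b.current)
    chosen.current -= total
    return chosen


def least_connections(backends):
    """Pick the backend with the fewest in-flight connections."""
    return min(backends, key=lambda b: b.active)


def power_of_two_choices(backends, rng=random):
    """Sample two distinct backends at random, keep the less loaded one."""
    a, b = rng.sample(backends, 2)
    return a if a.active <= b.active else b
```

For a 90/10 canary split, calling smooth_weighted_round_robin ten times on a pool with weights 9 and 1 picks the stable backend nine times and the canary once.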

Google SRE rule of thumb: For stateless HTTP microservices, prefer least connections or round robin with fine-grained health checks. Avoid IP hash unless the application truly requires server affinity — it silently defeats horizontal scaling and complicates rolling deploys.
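To see why IP hash works against even distribution, consider what happens when most traffic arrives from a few source addresses. This is a generic sketch using SHA-256 over the whole address; real proxies use their own hash functions and key choices, and the addresses and backend names below are made up.

```python
import hashlib
from collections import Counter


def ip_hash(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend deterministically."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def traffic_by_backend(client_ips: list[str], backends: list[str]) -> Counter:
    """Count how many requests each backend receives for a list of request sources."""
    return Counter(ip_hash(ip, backends) for ip in client_ips)
```

With four backends and 1,000 requests arriving from only two source IPs, at most two backends ever see traffic, however the hash falls.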

Health Checks: The Heart of a Working LB

A load balancer is only as good as its health checks. Without them, it will happily send traffic to a backend that is listening on a port but returning 500s. This is the worst failure mode because it is silent from the load balancer's perspective: the port is open, so nothing looks wrong, while clients keep receiving errors.

There are three types of health checks, in increasing sophistication:

TCP check — the LB opens a TCP connection to the backend. If it connects, the backend is "up." This proves the process is listening but not that it can actually serve requests.
HTTP check — the LB sends an HTTP request (usually GET /healthz) and expects a 2xx response within a timeout. This is the standard for HTTP services.
Application-level check — the LB sends a request to a /readyz endpoint where your application actively validates its dependencies: whether the database connection pool is healthy, the cache is reachable, and background workers are running. This is the kind of endpoint a Kubernetes readiness probe is usually pointed at.

Liveness vs. readiness: do not conflate them. A liveness probe tells Kubernetes to restart the container if the process is stuck. A readiness probe tells Kubernetes to take the pod out of the Service's endpoints temporarily (e.g., during startup or under DB connection saturation), so load balancers stop sending it traffic. Failing readiness when your DB is down means your pod stops receiving traffic, which is correct. Failing liveness for the same reason means Kubernetes kills and restarts the container, which does not fix the database and may make a cascading failure worse.
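The split between the two probes can be sketched as two handler functions that return an HTTP status code. This is an illustration under assumptions: the function names and the shape of the checks dictionary are hypothetical, and wiring them to /healthz and /readyz routes is left to whatever web framework the service uses.

```python
from typing import Callable


def liveness() -> int:
    """Liveness: the process can answer at all. No dependency checks here,
    so a database outage never triggers a restart."""
    return 200


def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, list[str]]:
    """Readiness: run every dependency check and report the ones that failed.
    Returns 503 if any check fails or raises, otherwise 200."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that throws counts as a failed dependency
        if not ok:
            failed.append(name)
    return (200 if not failed else 503), failed
```

Keeping dependency checks out of the liveness handler is the design choice that prevents a shared dependency outage from turning into a fleet-wide restart loop.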

Real nginx L7 Config: Upstream Pool with Health Checks

# /etc/nginx/conf.d/api.conf
upstream api_pool {
    # least_conn is better than round-robin for APIs with variable response times
    least_conn;

    server 10.0.1.10:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 weight=1 max_fails=3 fail_timeout=30s;

    # keepalive: reuse upstream connections (reduces TCP handshake overhead)
    keepalive 64;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    location / {
        proxy_pass         http://api_pool;
        proxy_http_version 1.1;
        # Required for keepalive to work
        proxy_set_header   Connection "";
        proxy_set_header   Host              $host;
        proxy_set_header   X-Real-IP         $remote_addr;
        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;

        # Timeouts: fail fast, let the client retry
        proxy_connect_timeout  3s;
        proxy_send_timeout     30s;
        proxy_read_timeout     30s;

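        # Retry the next upstream on connection errors, timeouts, and these 5xx codes.
        # nginx does not retry non-idempotent requests (POST, LOCK, PATCH) unless
        # "non_idempotent" is added, so a POST is not double-submitted.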
        proxy_next_upstream    error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
    }

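    # Passive checks only: open-source nginx marks a server failed via max_fails /
    # fail_timeout above. Active probing needs the NGINX Plus health_check directive.
    # This location just forwards /healthz to a backend (e.g. for an upstream
    # balancer's probe) and keeps those requests out of the access log.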
    location /healthz {
        proxy_pass http://api_pool;
        access_log off;
    }
}

HAProxy: A Complete L7 Example

# /etc/haproxy/haproxy.cfg

global
    log /dev/log local0
    maxconn 50000
    nbthread 4          # Match to CPU cores
    tune.ssl.default-dh-param 2048

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  forwardfor          # adds X-Forwarded-For header
    option  http-keep-alive     # reuse connections on both the client and server side
    timeout connect 3s
    timeout client  30s
    timeout server  30s
    retries 3

frontend http_in
    bind *:443 ssl crt /etc/haproxy/certs/api.example.com.pem alpn h2,http/1.1
    default_backend api_servers

    # L7 routing: /payments goes to a dedicated pool
    acl is_payments path_beg /payments
    use_backend payments_servers if is_payments

backend api_servers
    balance leastconn
    option httpchk GET /healthz HTTP/1.1\r\nHost:\ api.example.com
    http-check expect status 200

    server api1 10.0.1.10:8080 check inter 2s fall 3 rise 2
    server api2 10.0.1.11:8080 check inter 2s fall 3 rise 2
    server api3 10.0.1.12:8080 check inter 2s fall 3 rise 2

backend payments_servers
    balance roundrobin
    option httpchk GET /healthz HTTP/1.1\r\nHost:\ api.example.com
    http-check expect status 200

    server pay1 10.0.2.10:8080 check inter 2s fall 3 rise 2
    server pay2 10.0.2.11:8080 check inter 2s fall 3 rise 2

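Key parameters to understand: inter 2s probes every 2 seconds; fall 3 marks a backend down after 3 consecutive failures; rise 2 marks it up again after 2 consecutive successes. With these values a failing backend is ejected within about 6 seconds. The asymmetry (3 down / 2 up) is intentional: fall 3 keeps a single dropped probe from ejecting a healthy backend, while rise 2 requires more than one lucky success before production traffic returns. If you want more confidence before restoring traffic, raise rise rather than lowering fall.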
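The rise/fall behaviour is a small state machine that counts consecutive probe results contradicting the current state. The class below is a sketch of that idea, not HAProxy's implementation; the HealthState name and its fields are hypothetical.

```python
class HealthState:
    """Track one backend's up/down state from a stream of probe results."""

    def __init__(self, rise: int = 2, fall: int = 3, up: bool = True):
        self.rise = rise
        self.fall = fall
        self.up = up
        self.streak = 0  # consecutive probes that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.up:
            self.streak = 0  # result agrees with the current state: reset
            return self.up
        self.streak += 1
        threshold = self.fall if self.up else self.rise
        if self.streak >= threshold:
            self.up = not self.up
            self.streak = 0
        return self.up
```

Feeding it two failures, a success, then three failures keeps the backend up through the first blip and marks it down only on the third consecutive failure; two consecutive successes afterwards bring it back.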

Balanced Tier Architecture

Two-tier load balancing: an L4 NLB at the edge for stable IPs and TCP passthrough, followed by L7 nginx proxies in each AZ for TLS termination and application-aware routing to the app pod pool.

Common Production Failure Modes

Thundering herd after backend restart — when a backend comes back online, all LBs simultaneously send it pent-up traffic. Mitigate with a slow start ramp: HAProxy slowstart 30s, nginx Plus slow start, or Envoy slow_start_window.
Health check flapping — a backend oscillates between healthy and unhealthy, causing traffic to shift back and forth unpredictably. Tune rise (require at least 2–3 consecutive successes before restoring) and add jitter to probe intervals so all LBs do not check simultaneously.
Sticky sessions masking capacity — with IP hash, a single heavy client locks onto one backend. Under high traffic from a CDN (all traffic appears to come from a handful of CDN IPs), this can overload one server to 100% while others idle at 10%. Solution: use cookie-based stickiness or, better, eliminate server-side session state.
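Connection pool exhaustion — the LB opens 1,000 connections to a backend that only accepts 100. Note that nginx's keepalive directive only limits the idle connections cached per worker process; it does not cap the total. Cap concurrent connections per backend with max_conns on the nginx server line (or maxconn on an HAProxy server line), set below the backend's own connection limit.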
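The slow start mitigation for the thundering herd can be pictured as a weight that ramps up over a window after the backend is marked healthy. This is a sketch assuming a linear ramp with a small floor; the function name, the 30-second default, and the 5% floor are illustrative choices, and real proxies each use their own ramp curve.

```python
def slow_start_weight(full_weight: float, seconds_since_up: float,
                      window: float = 30.0, min_fraction: float = 0.05) -> float:
    """Effective weight of a backend that became healthy seconds_since_up ago.

    Ramps linearly from min_fraction * full_weight to full_weight over window
    seconds, so a freshly restarted backend is not flooded while caches and
    connection pools are still cold.
    """
    if window <= 0 or seconds_since_up >= window:
        return full_weight
    fraction = max(seconds_since_up, 0.0) / window
    return full_weight * max(fraction, min_fraction)
```

The floor keeps the backend from starving entirely at the start of the window, which would otherwise delay the warm-up it needs.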

Observability must-haves for any LB: instrument active connections per backend, request rate per backend, error rate (4xx/5xx) per backend, and health check state transitions. At Google/Meta scale, the rule is: if the LB cannot tell you which backend is saturated in under 30 seconds, the load balancer config is not production-ready.

Kubernetes: Load Balancing Without Thinking About It

In Kubernetes, every ClusterIP Service is a virtual IP backed by iptables or IPVS rules maintained by kube-proxy. When you create a Deployment with 5 replicas behind a Service, kube-proxy automatically programs rules that spread connections across the 5 ready pod IPs: iptables mode picks a pod at random with equal probability (a chain of random-probability rules), while IPVS mode defaults to round robin and can be switched to other schedulers such as least connections. When a pod's readiness probe fails, the endpoints controller marks it not ready and kube-proxy stops routing new connections to it, without the pod being terminated.
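The iptables random probability chain looks uneven at first glance: rules are evaluated in order, and for n endpoints the rules match with probability 1/n, 1/(n-1), and so on down to 1. The short calculation below, written as a sketch with hypothetical function names, shows that each endpoint still ends up with an equal share.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int) -> list[Fraction]:
    """Overall fraction of connections each endpoint receives.

    A rule is only evaluated if every earlier rule did not match, so its
    share is (probability of reaching it) * (its own match probability).
    """
    shares = []
    reach = Fraction(1)  # probability that evaluation reaches the current rule
    for p in rule_probabilities(n):
        shares.append(reach * p)
        reach *= 1 - p
    return shares
```

For 5 endpoints the rule probabilities are 1/5, 1/4, 1/3, 1/2, 1, and every endpoint's effective share works out to exactly 1/5.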

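# Inspect which pod IPs currently back a Service (legacy Endpoints view)
kubectl get endpoints my-api-service -n production

# Same information from the EndpointSlice API
kubectl get endpointslices -l kubernetes.io/service-name=my-api-service -n production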

# Force IPVS mode in kube-proxy (better for large clusters — O(1) vs O(n))
# In kube-proxy ConfigMap:
# mode: "ipvs"
# ipvs:
#   scheduler: "lc"   # least connections

# Check current IPVS rules on a node
ssh node1 sudo ipvsadm -Ln

# View HPA scaling events (horizontal pod autoscaler adjusts replica count)
kubectl describe hpa my-api-hpa -n production

Key Takeaways

L4 balancers forward TCP streams without parsing payload — fast, protocol-agnostic, but no URL routing.
L7 balancers parse HTTP — enable URL/host routing, TLS termination, header manipulation, and smarter retries.
Algorithm choice matters: least connections for long-lived or variable-cost requests; round robin only when all requests and backends are uniform.
Health checks must validate actual application readiness, not just port availability. Distinguish liveness from readiness.
Design for failure: slow start, asymmetric rise/fall thresholds, and connection pool caps prevent cascades.
In Kubernetes, the Service abstraction + kube-proxy handles L4 LB automatically; add an Ingress controller (nginx, Envoy/Contour) for L7.