Load Balancing
A single server can handle only so many concurrent connections before latency climbs and requests time out. Load balancing solves this by distributing incoming traffic across a pool of backend servers (called upstreams or origin servers). At big-tech scale, load balancers are not just performance tools — they are the primary mechanism for zero-downtime deployments, fault isolation, and horizontal scaling.
In Kubernetes, a Service object does this automatically: kube-proxy maintains iptables/IPVS rules that load-balance across healthy pod IPs.
L4 vs L7: Where in the Stack Does the Decision Happen?
The most important design choice is whether your load balancer operates at Layer 4 (Transport) or Layer 7 (Application).
Layer 4 load balancing works at the TCP/UDP level. The balancer forwards byte streams without inspecting the payload. It sees the destination IP and port and picks a backend. In proxy mode it then holds two TCP connections, one from the client to the LB (the "frontend") and one from the LB to the backend (the "backend connection"), and stitches them together; NAT-based and direct-server-return designs instead rewrite or forward packets without terminating the connection. Either way, it is extremely fast and works for any TCP/UDP protocol.
Layer 7 load balancing terminates the application protocol (HTTP, gRPC, WebSocket). The balancer fully parses the HTTP request before forwarding it, which allows decisions based on URL path, host header, cookies, query parameters, or request body. This extra power comes with extra latency (~1–5 ms per hop at high load) and extra complexity.
In practice, production stacks often chain both: an L4 load balancer (AWS NLB, GCP TCP Proxy) at the network edge for raw throughput and sticky TLS, feeding into L7 reverse proxies (nginx, Envoy, HAProxy) that do content-based routing and TLS termination.
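The difference in what each layer can "see" can be sketched in a few lines of Python. This is an illustrative model, not how any particular proxy is implemented; the names `L4Flow`, `HttpRequest`, `pick_l4`, and `pick_l7` are hypothetical, and the prefix match is deliberately naive.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class L4Flow:
    """Everything an L4 balancer can base a decision on: addresses and ports."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


@dataclass
class HttpRequest:
    """What an L7 balancer has available after parsing the request."""
    host: str
    path: str
    headers: dict = field(default_factory=dict)


def pick_l4(flow: L4Flow, backends: list) -> str:
    # Hash the connection tuple; the payload is never inspected.
    key = f"{flow.src_ip}:{flow.src_port}->{flow.dst_ip}:{flow.dst_port}"
    return backends[zlib.crc32(key.encode()) % len(backends)]


def pick_l7(req: HttpRequest, routes: dict, default: list) -> list:
    # Content-based routing: the longest matching path prefix selects the pool.
    matches = (prefix for prefix in routes if req.path.startswith(prefix))
    best = max(matches, key=len, default=None)
    return routes[best] if best is not None else default
```

For example, with routes for `/api` and `/api/v2`, a request for `/api/v2/users` lands in the `/api/v2` pool, a decision the L4 function cannot make because it never sees the path.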
Load Balancing Algorithms
The algorithm determines which backend receives the next request. The wrong choice causes hot-spots even in a "balanced" cluster.
- Round Robin — requests rotate through the pool in order. Simple and fair when all backends are identical and requests have uniform cost. Default in most proxies.
- Weighted Round Robin — backends get a fraction of traffic proportional to an assigned weight. Use during canary deployments (10% to new version, 90% to stable) or when instances have different hardware capacity.
- Least Connections — the next request goes to the backend with the fewest active connections. Better than round robin for long-lived connections (WebSocket, gRPC streaming, database connections) where one slow backend can accumulate a large queue.
- IP Hash / Sticky Sessions — hashes the client IP (or a cookie) to always map a client to the same backend. Required for stateful applications that store session data in memory. A significant anti-pattern in stateless microservices because it prevents even load distribution when a single heavy client appears.
- Random with Two Choices (Power of Two) — pick two backends at random and forward to the one with fewer connections. Achieves near-optimal load distribution with O(1) overhead. Used internally by Envoy and many large-scale systems.
- Least Response Time — routes to the backend with the lowest average latency. Useful when backends have heterogeneous performance (mixed instance types).
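Several of the algorithms above fit in a few lines each. The following is a minimal Python sketch for comparison only; the `Backend` class and function names are hypothetical, and the weighted rotation is the naive "repeat by weight" form (production proxies typically interleave weighted backends more smoothly).

```python
import itertools
import random
import zlib


class Backend:
    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.active = 0  # in-flight connections; the caller must keep this updated


def round_robin(backends):
    """Return an iterator that rotates through the pool in order."""
    return itertools.cycle(backends)


def weighted_round_robin(backends):
    """Naive weighting: each backend appears `weight` times per cycle."""
    expanded = [b for b in backends for _ in range(b.weight)]
    return itertools.cycle(expanded)


def least_connections(backends):
    """Pick the backend with the fewest active connections (scans the pool)."""
    return min(backends, key=lambda b: b.active)


def ip_hash(backends, client_ip):
    """Map a client IP to the same backend as long as the pool is unchanged."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]


def power_of_two(backends, rng=random):
    """Sample two distinct backends and keep the less loaded one."""
    a, b = rng.sample(backends, 2)
    return a if a.active <= b.active else b
```

Note the cost difference: `least_connections` scans the whole pool on every request, while `power_of_two` only ever compares two candidates, which is why it scales well to large pools.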
Health Checks: The Heart of a Working LB
A load balancer is only as good as its health checks. Without them, it will happily send traffic to a backend that is listening on a port but returning 500s. This is the worst failure mode because the balancer sees nothing wrong: the port is open, so the backend stays in rotation while clients receive errors.
There are three types of health checks, in increasing sophistication:
- TCP check — the LB opens a TCP connection to the backend. If it connects, the backend is "up." This proves the process is listening but not that it can actually serve requests.
- HTTP check — the LB sends an HTTP request (usually GET /healthz) and expects a 2xx response within a timeout. This is the standard for HTTP services.
- Application-level check — the LB sends a request to a /readyz endpoint that your application actively validates: checks if the database connection pool is healthy, if the cache is reachable, if background workers are running. This is what Kubernetes readiness probes implement.
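The application-level check is the one you write yourself. A minimal Python sketch of the logic behind a /readyz handler might look like the following; the `readiness` function and the check names are hypothetical, and wiring it into a web framework is left out.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check report)."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises counts as a failure rather than crashing the probe.
            report[name] = f"error: {exc}"
    status = 200 if all(result == "ok" for result in report.values()) else 503
    return status, report
```

Returning 503 with a per-dependency report takes the instance out of rotation while still telling an operator which dependency failed. Keep each check cheap and bounded by a timeout, since the LB calls this endpoint every few seconds.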
Real nginx L7 Config: Upstream Pool with Health Checks
HAProxy: A Complete L4 + L7 Example
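Key parameters to understand: inter 2s probes every 2 seconds; fall 3 marks the backend down after 3 consecutive failures; rise 2 marks it up again after 2 consecutive successes. With these values a dead backend is ejected after roughly 6 seconds. The asymmetry (3 down / 2 up) is intentional: requiring three failures keeps a single dropped probe from ejecting a healthy backend, while requiring two successes means one lucky response is not enough to put a recovering backend back into production traffic.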
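The rise/fall counters amount to a small state machine. The sketch below is a simplified Python model of that behavior, not HAProxy's actual implementation; the `HealthState` class is hypothetical.

```python
class HealthState:
    """Tracks one backend's up/down state from a stream of probe results."""

    def __init__(self, rise=2, fall=3, healthy=True):
        self.rise = rise
        self.fall = fall
        self.healthy = healthy
        self.streak = 0  # consecutive probes that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            # The probe agrees with the current state, so any streak is broken.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self.streak >= threshold:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy
```

Feeding it the probe sequence fail, fail, ok, fail, fail, fail, ok, ok keeps the backend up through the first two failures (the streak is reset by the success), marks it down on the third consecutive failure, and restores it only after two consecutive successes.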
Balanced Tier Architecture
Common Production Failure Modes
- Thundering herd after backend restart — when a backend comes back online, all LBs simultaneously send it pent-up traffic. Mitigate with a slow start ramp: HAProxy slowstart 30s, nginx Plus slow start, or Envoy slow_start_window.
- Health check flapping — a backend oscillates between healthy and unhealthy, causing traffic to splash unpredictably. Tune rise (min 2–3 successes before restoring) and add jitter to probe intervals so all LBs do not check simultaneously.
- Sticky sessions masking capacity — with IP hash, a single heavy client locks onto one backend. Under high traffic from a CDN (all traffic appears to come from a handful of CDN IPs), this can overload one server to 100% while others idle at 10%. Solution: use cookie-based stickiness or, better, eliminate server-side session state.
- Connection pool exhaustion — the LB holds 1,000 keepalive connections to a backend that only accepts 100. Set keepalive on the upstream block to a value below the backend's max_connections limit.
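Two of these mitigations, the slow start ramp and probe jitter, are simple enough to sketch in Python. The functions below illustrate the idea only; the linear ramp, the 10% floor, and the ±20% jitter are assumptions chosen for the example, and real proxies use their own curves and defaults.

```python
import random


def slow_start_weight(full_weight: float, seconds_up: float,
                      window: float = 30.0, floor: float = 0.1) -> float:
    """Ramp a backend's effective weight linearly over `window` seconds.

    The weight starts at `floor * full_weight` so the backend receives some
    traffic immediately, then grows until it reaches `full_weight`.
    """
    if window <= 0 or seconds_up >= window:
        return full_weight
    progress = max(seconds_up, 0.0) / window
    return full_weight * max(floor, progress)


def jittered_interval(base: float, spread: float = 0.2, rng=random) -> float:
    """Randomize a probe interval by +/- `spread` so LBs do not probe in lockstep."""
    return base * (1.0 + rng.uniform(-spread, spread))
```

With a 30-second window, a backend that has been up for 15 seconds receives half of its configured weight, which gives caches, connection pools, and JIT compilers time to warm up before full load arrives.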
Kubernetes: Load Balancing Without Thinking About It
In Kubernetes, every ClusterIP Service is a virtual IP backed by iptables or IPVS rules maintained by kube-proxy. When you create a Deployment with 5 replicas behind a Service, kube-proxy automatically programs rules that spread new connections across the 5 ready pod IPs: in iptables mode through a chain of random-probability rules, in IPVS mode through a configurable scheduler that defaults to round robin. When a pod's readiness probe fails, the endpoint controller marks it not ready in the EndpointSlice and kube-proxy stops routing new connections to it, before the pod is terminated.
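The iptables random-probability chain is worth a closer look, because the rules are evaluated in order and each one only sees the traffic that earlier rules did not claim. A uniform split across n endpoints therefore needs probabilities 1/n, 1/(n-1), ..., 1/2, 1. The Python sketch below simulates that chain; the function names are hypothetical and the code models the rule logic only, not kube-proxy itself.

```python
import random
from collections import Counter


def probability_chain(n: int) -> list:
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1."""
    return [1.0 / (n - i) for i in range(n)]


def pick(endpoints: list, rng: random.Random) -> str:
    """Walk the rules in order; the first rule that matches wins."""
    for endpoint, p in zip(endpoints, probability_chain(len(endpoints))):
        if rng.random() < p:
            return endpoint
    return endpoints[-1]  # not reached: the last rule always matches


if __name__ == "__main__":
    pods = [f"10.244.0.{i}" for i in range(1, 6)]
    rng = random.Random(0)
    print(Counter(pick(pods, rng) for _ in range(50_000)))
```

Each pod ends up with about one fifth of the connections even though the individual rule probabilities differ. Note that the selection is per connection, not per request, so long-lived keepalive or gRPC connections can still produce uneven load.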
Key Takeaways
- L4 balancers forward TCP streams without parsing payload — fast, protocol-agnostic, but no URL routing.
- L7 balancers parse HTTP — enable URL/host routing, TLS termination, header manipulation, and smarter retries.
- Algorithm choice matters: least connections for long-lived or variable-cost requests; round robin only when all requests and backends are uniform.
- Health checks must validate actual application readiness, not just port availability. Distinguish liveness from readiness.
- Design for failure: slow start, asymmetric rise/fall thresholds, and connection pool caps prevent cascades.
- In Kubernetes, the Service abstraction + kube-proxy handles L4 LB automatically; add an Ingress controller (nginx, Envoy/Contour) for L7.