Scaling & Load Balancing

Load Balancers: What & Why

Imagine your application is suddenly hit by 50,000 simultaneous users after a viral social-media post. If all that traffic hits a single server, it can exhaust CPU, memory, or network bandwidth within seconds: users see timeouts, the server may crash, and revenue is lost. A load balancer is the traffic cop that stands in front of your server fleet, distributes incoming requests across it, and keeps any single machine from becoming the bottleneck.

What Is a Load Balancer?

A load balancer is a networking component — hardware, software, or cloud-managed — that accepts incoming connections and forwards them to one of many backend servers (called a server pool or upstream group). From the outside, clients see a single endpoint (one IP or hostname). Behind that endpoint, a fleet of servers does the real work in parallel.
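
The core idea of picking one backend per incoming request can be sketched in a few lines. This is a minimal illustration that assumes a simple round-robin policy; the class name `RoundRobinBalancer` and the pool addresses are hypothetical, and a real balancer would also forward the connection rather than just return an address.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends from a fixed pool in rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("server pool must not be empty")
        self._backends = list(backends)
        self._rotation = cycle(self._backends)

    def pick(self):
        """Return the backend that should receive the next request."""
        return next(self._rotation)


# Hypothetical pool; the addresses are placeholders.
pool = RoundRobinBalancer(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
choices = [pool.pick() for _ in range(6)]
print(choices)
```

Each backend is chosen in turn, so six requests land twice on each of the three servers. Clients never see these addresses; they only talk to the balancer's single endpoint.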

Beyond simple traffic distribution, a modern load balancer also provides:

  • Health checking — it automatically removes unhealthy servers from rotation.
  • SSL/TLS termination — it decrypts HTTPS traffic once, so backend servers handle plain HTTP, saving CPU cycles on each server.
  • Connection reuse (keep-alive pooling) — it maintains persistent connections to backends, reducing TCP handshake overhead.
  • Observability — it records latencies, error rates, and bytes transferred per upstream, giving you a single vantage point for metrics.

Key insight: The load balancer itself must not become a single point of failure. Production deployments typically run at least two load balancers as an active-passive or active-active pair. Self-managed pairs usually share a virtual IP (VIP) that moves to the surviving node through a protocol such as VRRP; a cloud provider's managed load balancer comes with this redundancy built in.

[Figure: Client A (browser), Client B (mobile app), and Client C (API consumer) connect to the load balancer's single public endpoint (e.g. api.example.com), which fans traffic out to App Server 1 (10.0.0.11), App Server 2 (10.0.0.12), and App Server 3 (10.0.0.13).]
A load balancer presents a single public endpoint while distributing requests across a pool of backend servers.
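
The health-checking behaviour listed above can be sketched as a small bookkeeping structure. This is an illustrative model only: the class `HealthCheckedPool`, the `fail_threshold` of three consecutive failures, and the backend names are assumptions, and the probe itself (for example an HTTP request to a health endpoint) is left out.

```python
class HealthCheckedPool:
    """Tracks consecutive probe failures and hides unhealthy backends."""

    def __init__(self, backends, fail_threshold=3):
        # fail_threshold is an assumed value; real balancers make it configurable.
        self.fail_threshold = fail_threshold
        self._failures = {b: 0 for b in backends}

    def record_probe(self, backend, ok):
        """Record one probe result; a success resets the failure count."""
        if ok:
            self._failures[backend] = 0
        else:
            self._failures[backend] += 1

    def healthy(self):
        """Backends still eligible to receive traffic."""
        return [b for b, n in self._failures.items() if n < self.fail_threshold]


pool = HealthCheckedPool(["app1", "app2", "app3"])
for _ in range(3):
    pool.record_probe("app2", ok=False)
after_failures = pool.healthy()   # app2 has been taken out of rotation

pool.record_probe("app2", ok=True)
after_recovery = pool.healthy()   # one successful probe restores it here
```

Requiring several consecutive failures before removal is one way to avoid ejecting a server over a single dropped probe. Many balancers also require several successes before restoring a backend, which this sketch does not model.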

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different layers of the OSI model. Understanding which layer your balancer works at determines what information it can act on and what features it can offer.

Layer 4 — Transport Layer

An L4 load balancer makes routing decisions based purely on TCP/UDP headers: source IP, destination IP, and port. It does not inspect the payload — it simply forwards byte streams. Because it never parses HTTP, it is extremely fast and adds very little latency (often under 100 µs). This makes L4 ideal for:

  • Non-HTTP protocols: SMTP, DNS, raw TCP game servers, database proxies.
  • Very high throughput scenarios where even a few microseconds matter.
  • Simple IP-hash stickiness (a client's IP always routes to the same backend).

The trade-off is that L4 balancers are blind to application content. You cannot route /api/* to one server group and /static/* to another, because at the transport layer the balancer sees connections and byte streams, not individual HTTP requests.
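
IP-hash stickiness, mentioned in the list above, needs nothing beyond the client address, which is why it works at L4. The following is a minimal sketch assuming a plain hash-modulo scheme; the function name `pick_backend_by_ip` and all addresses are hypothetical.

```python
import hashlib


def pick_backend_by_ip(client_ip, backends):
    """Map a client IP to a backend using a stable hash of the address."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then wrap it onto the pool size.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
first = pick_backend_by_ip("203.0.113.7", backends)
second = pick_backend_by_ip("203.0.113.7", backends)
print(first == second)  # same client IP, same backend
```

A known weakness of hash-modulo is that changing the pool size remaps most clients to a different backend. Consistent hashing is the usual way to reduce that churn, but it is outside the scope of this sketch.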

Layer 7 — Application Layer

An L7 load balancer terminates the TCP connection, fully parses the HTTP (or gRPC, WebSocket, etc.) request, and then opens a new connection to a chosen backend. Because it reads headers, URLs, cookies, and even request bodies, it can make content-aware routing decisions:

  • Path-based routing: /checkout → payment service cluster; /images/* → media servers.
  • Host-based routing: api.example.com → API pool; www.example.com → web pool.
  • Header-based routing: X-Beta-User: true → canary deployment servers.
  • SSL termination: decrypt once at the LB, use plain HTTP internally.
  • Sticky sessions via cookie: inject a routing cookie so a user's session always lands on the same backend.
  • Rate limiting and WAF: inspect and reject malicious requests before they reach any server.

The cost is slightly higher latency (typically 0.5–2 ms extra to parse HTTP) and more CPU than L4. In practice, for most web services this overhead is negligible compared to application processing time.
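
The content-aware rules listed above can be sketched as an ordered rule table. This is a simplified model rather than any proxy's actual configuration: the `Request` type, the `route` function, the pool names, and the rule order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    host: str
    path: str
    headers: dict = field(default_factory=dict)


def route(request):
    """Return the name of the pool that should serve this request."""
    # Header rule first, so beta users reach the canary pool on any path.
    # Real proxies treat header names case-insensitively; this sketch does not.
    if request.headers.get("X-Beta-User") == "true":
        return "canary"
    # Host-based routing.
    if request.host == "api.example.com":
        return "api"
    # Path-based routing.
    if request.path == "/checkout" or request.path.startswith("/checkout/"):
        return "payments"
    if request.path.startswith("/images/"):
        return "media"
    # Default pool when no rule matches.
    return "web"


print(route(Request("www.example.com", "/images/logo.png")))
print(route(Request("api.example.com", "/v1/users")))
```

None of these decisions is possible at L4, because the host, path, and headers only exist once the HTTP request has been parsed.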

Rule of thumb: Default to L7 for web APIs and microservices — the routing power and TLS offload pay for themselves immediately. Drop to L4 only for raw TCP protocols or when you need sub-millisecond balancing overhead.

[Figure: two-panel comparison. L4 (Transport Load Balancer) sees only source and destination IP, TCP/UDP port number, and TCP flags (SYN, ACK, FIN…); it routes by round-robin, IP hash, or least connections; it adds under 100 µs of overhead and works with any TCP/UDP protocol, but cannot inspect HTTP content or route on path, host, or header. L7 (Application Load Balancer) sees the HTTP method, URL path, query string, request headers, Host, Cookie, TLS SNI, and gRPC service name; it routes by path (/api/* → API pool), host (cdn.example.com → media pool), or header (X-Canary → beta servers); it supports TLS termination, WAF, and rate limiting, at the cost of roughly 0.5–2 ms extra latency versus L4 and more CPU to parse HTTP.]
Side-by-side: what L4 and L7 load balancers can see and act upon.

Real-World Architecture: Multi-Tier Balancing

Large systems often use both tiers together. A high-performance L4 balancer (like AWS Network Load Balancer or Google's Maglev) sits at the internet edge, absorbing raw TCP connections and spreading them across a cluster of L7 proxies (like NGINX, HAProxy, or AWS Application Load Balancer). The L7 layer then performs content routing to dozens of microservice clusters behind it.

This design combines the raw speed of L4 with the intelligence of L7, and it means no single L7 proxy is the sole entry point for internet traffic.

Practical Examples by Stack

  • AWS: Network Load Balancer (L4) → Application Load Balancer (L7) → ECS/EC2 tasks
  • Self-hosted: NGINX (L7, plus L4 through its stream module) or HAProxy (L4 + L7) in front of app servers. Both are free, battle-tested, and able to sustain very high request rates on commodity hardware
  • Kubernetes: Ingress controller (e.g., NGINX Ingress or Traefik) is an L7 load balancer; Service objects of type LoadBalancer provision an L4 entry point from the cloud provider
  • Cloudflare/Fastly: Global L7 balancing at the edge, with built-in DDoS protection and caching — often the first layer before any of your own infrastructure

Common pitfall (sticky sessions at L7): When you inject a routing cookie for session stickiness, you partially undo load distribution. If the heaviest users happen to be pinned to the same server, that server can become overloaded while others sit idle. Prefer stateless application design (store sessions in Redis or a database) so any backend can serve any request and load spreads far more evenly.
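
The skew that stickiness can cause is easy to see with a toy calculation. The workload below is invented for illustration, and both functions are hypothetical models: `sticky_load` pins each user to one backend in arrival order, while `stateless_load` spreads individual requests round-robin.

```python
# Hypothetical workload: requests issued per user during one time window.
requests_per_user = {
    "alice": 900, "bob": 50, "carol": 50,
    "dave": 40, "erin": 40, "frank": 40,
}
backends = ["app1", "app2", "app3"]


def sticky_load(requests_per_user, backends):
    """Pin each user to one backend (assigned in arrival order) and total the requests."""
    load = {b: 0 for b in backends}
    for i, count in enumerate(requests_per_user.values()):
        load[backends[i % len(backends)]] += count
    return load


def stateless_load(requests_per_user, backends):
    """Spread every individual request round-robin, regardless of sender."""
    total = sum(requests_per_user.values())
    base, extra = divmod(total, len(backends))
    return {b: base + (1 if i < extra else 0) for i, b in enumerate(backends)}


sticky = sticky_load(requests_per_user, backends)
stateless = stateless_load(requests_per_user, backends)
print(sticky)     # {'app1': 940, 'app2': 90, 'app3': 90}
print(stateless)  # {'app1': 374, 'app2': 373, 'app3': 373}
```

The users are split evenly across backends in the sticky case, two per server, yet one server still carries most of the requests because a single heavy user is pinned to it.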

Key Metrics to Watch

Once a load balancer is in place, monitor these metrics per upstream server:

  • Requests per second (RPS): is distribution actually uniform?
  • Active connections: a backend whose count keeps growing is slow and accumulating work.
  • Response time (p50 / p95 / p99): p99 latency spikes can reveal an unhealthy node before its health checks fail.
  • Error rate (5xx): this is where automatic health-check removal should kick in; also track false positives, where a healthy server is removed.
  • Backend queue depth: if the LB queues requests waiting for a free connection slot, you need more capacity.
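
To make the percentile metrics concrete, here is a small sketch that computes p50, p95, and p99 per upstream from raw latency samples. It assumes the nearest-rank definition of a percentile; the helper names and the sample values are hypothetical, and production systems usually rely on histograms from the balancer or a metrics library instead.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three percentiles commonly tracked per upstream."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical latency samples in milliseconds for two upstream servers.
latencies = {
    "app1": [12, 15, 11, 14, 13, 12, 16, 15, 13, 14],
    "app2": [12, 14, 13, 250, 15, 13, 12, 300, 14, 13],
}
report = {server: summarize(samples) for server, samples in latencies.items()}
print(report["app1"])  # {'p50': 13, 'p95': 16, 'p99': 16}
print(report["app2"])  # {'p50': 13, 'p95': 300, 'p99': 300}
```

Both servers have the same median, so an average or p50 dashboard would show nothing wrong. Only the tail percentiles expose the slow requests on the second server.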

Terminology check: The terms "reverse proxy" and "load balancer" are often used interchangeably in practice. Technically, a reverse proxy is any component that accepts client requests on behalf of backend servers (and can cache, compress, or terminate TLS). A load balancer is a reverse proxy whose primary job is to distribute requests across several backends. NGINX configured with an upstream block that lists several servers acts as both.