Scaling & Load Balancing

Load Balancers: What & Why

Imagine your application is suddenly hit by 50,000 simultaneous users after a viral social-media post. If all that traffic slams into a single server, it can exhaust CPU, memory, or network bandwidth within seconds: users see timeouts, the server crashes, and revenue evaporates. A load balancer is the traffic cop that stands in front of your server fleet, distributes incoming requests across it, and keeps any single machine from becoming the bottleneck.

What Is a Load Balancer?

A load balancer is a networking component — hardware, software, or cloud-managed — that accepts incoming connections and forwards them to one of many backend servers (called a server pool or upstream group). From the outside, clients see a single endpoint (one IP or hostname). Behind that endpoint, a fleet of servers does the real work in parallel.
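
As a rough illustration of the idea, the sketch below models a load balancer as an object that owns a server pool and picks one backend per incoming request. Round-robin is assumed as the selection policy (the text above does not prescribe one), and the class name and backend addresses are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends from a fixed pool in rotating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("server pool must not be empty")
        self.backends = list(backends)
        self._rotation = cycle(self.backends)

    def pick(self):
        # Each call returns the next backend, wrapping around at the end.
        return next(self._rotation)


# Hypothetical backend addresses, for illustration only.
pool = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
assignments = [pool.pick() for _ in range(6)]
```

Six consecutive requests visit each of the three backends twice, in order. A real balancer would also forward the request and relay the response; this sketch covers only the selection step.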

Beyond simple traffic distribution, a modern load balancer also provides:

Health checking — it automatically removes unhealthy servers from rotation.
SSL/TLS termination — it decrypts HTTPS traffic once, so backend servers handle plain HTTP, saving CPU cycles on each server.
Connection reuse (keep-alive pooling) — it maintains persistent connections to backends, reducing TCP handshake overhead.
Observability — it records latencies, error rates, and bytes transferred per upstream, giving you a single vantage point for metrics.
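
Health checking, the first item above, can be sketched as a pool that counts consecutive failed probes and skips any backend that crosses a threshold. This is a simplified model, not how any particular product implements it: the threshold of three failures, the class name, and the backend names are all assumptions, and the probe itself (for example an HTTP request to a health endpoint) is left out.

```python
class HealthAwarePool:
    """Tracks consecutive probe failures and skips unhealthy backends."""

    def __init__(self, backends, fail_threshold=3):
        self.backends = list(backends)
        self.fail_threshold = fail_threshold  # assumed value; tune per service
        self.failures = {b: 0 for b in self.backends}
        self._next = 0

    def record_probe(self, backend, ok):
        # A successful probe resets the counter; a failed one increments it.
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy(self):
        return [b for b in self.backends if self.failures[b] < self.fail_threshold]

    def pick(self):
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy backends available")
        backend = candidates[self._next % len(candidates)]
        self._next += 1
        return backend


pool = HealthAwarePool(["app-1", "app-2", "app-3"])
for _ in range(3):
    pool.record_probe("app-2", ok=False)  # three failed probes in a row
picked = [pool.pick() for _ in range(4)]  # app-2 is skipped
```

A single successful probe puts `app-2` back into rotation here. Real balancers often require several consecutive successes before readmitting a server, to avoid flapping.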

Key insight: The load balancer itself must not become a single point of failure. Production deployments typically run at least two load balancers in an active-passive or active-active pair that share a virtual IP (VIP) and fail over using a protocol such as VRRP, or they rely on a cloud provider's managed load balancer with built-in redundancy.

Figure: A load balancer presents a single public endpoint while distributing requests across a pool of backend servers.

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different layers of the OSI model. Understanding which layer your balancer works at determines what information it can act on and what features it can offer.

Layer 4 — Transport Layer

An L4 load balancer makes routing decisions based purely on IP and TCP/UDP header fields: source and destination IP addresses and ports. It does not inspect the payload; it simply forwards packets or byte streams. Because it never parses HTTP, it is extremely fast and adds very little latency (often under 100 µs). This makes L4 ideal for:

Non-HTTP protocols: SMTP, DNS, raw TCP game servers, database proxies.
Very high throughput scenarios where even a few microseconds matter.
Simple IP-hash stickiness (a client's IP routes to the same backend for as long as the pool is unchanged).

The trade-off is that L4 balancers are blind to application content. You cannot route /api/* to one server group and /static/* to another, because at the transport layer the balancer sees only connections and packets, not individual HTTP requests or their URLs.
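
To make the L4 view concrete, the sketch below picks a backend from nothing but the addresses and ports of a connection. It is a minimal illustration, not a production algorithm: the function name, the `sticky_by_ip` flag, and the use of SHA-256 as the hash are all assumptions.

```python
import hashlib


def pick_backend_l4(src_ip, src_port, dst_ip, dst_port, backends, sticky_by_ip=True):
    """Choose a backend using only transport-level fields, never the payload."""
    if not backends:
        raise ValueError("server pool must not be empty")
    if sticky_by_ip:
        # IP-hash stickiness: only the client address feeds the hash.
        key = src_ip
    else:
        # Otherwise hash the connection's addresses and ports together.
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}"
    digest = hashlib.sha256(key.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical addresses, for illustration only.
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
first = pick_backend_l4("203.0.113.7", 51000, "198.51.100.1", 443, backends)
second = pick_backend_l4("203.0.113.7", 51001, "198.51.100.1", 443, backends)
# first == second: same client IP, different source port, same backend.
```

Nothing in the function's inputs reveals a URL or header, which is why path-based routing is out of reach at this layer. Note also that a plain modulo remaps most clients whenever the pool size changes; consistent hashing is the usual way to limit that.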

Layer 7 — Application Layer

An L7 load balancer terminates the TCP connection, fully parses the HTTP (or gRPC, WebSocket, etc.) request, and then opens a new connection to a chosen backend. Because it reads headers, URLs, cookies, and even request bodies, it can make content-aware routing decisions:

Path-based routing: /checkout → payment service cluster; /images/* → media servers.
Host-based routing: api.example.com → API pool; www.example.com → web pool.
Header-based routing: X-Beta-User: true → canary deployment servers.
TLS termination: decrypt once at the LB, use plain HTTP internally.
Sticky sessions via cookie: inject a routing cookie so a user's session always lands on the same backend.
Rate limiting and WAF: inspect and reject malicious requests before they reach any server.
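
The routing rules in the list above can be sketched as a function over an already-parsed request. This is a simplified model under stated assumptions: the `Request` type, the pool names, and the rule precedence (header first, then host, then path) are illustrative choices, not a prescribed order.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    host: str
    path: str
    # Simplification: real HTTP header names are case-insensitive.
    headers: dict = field(default_factory=dict)


def route_l7(req):
    """Return the name of the backend pool for a parsed HTTP request."""
    # Header-based rule first, so beta users reach the canary on any path.
    if req.headers.get("X-Beta-User", "").lower() == "true":
        return "canary"
    # Host-based rule.
    if req.host == "api.example.com":
        return "api-pool"
    # Path-based rules.
    if req.path == "/checkout" or req.path.startswith("/checkout/"):
        return "payment"
    if req.path.startswith("/images/"):
        return "media"
    return "web-pool"


pool_name = route_l7(Request("www.example.com", "/images/logo.png"))  # "media"
```

Every input to `route_l7` comes from the parsed request, which is exactly the information an L4 balancer never has.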

The cost is slightly higher latency (typically 0.5–2 ms extra for terminating the connection and parsing HTTP) and more CPU than L4. In practice, for most web services this overhead is negligible compared to application processing time.

Rule of thumb: Default to L7 for web APIs and microservices — the routing power and TLS offload pay for themselves immediately. Drop to L4 only for raw TCP protocols or when you need sub-millisecond balancing overhead.

Figure: Side-by-side comparison of what L4 and L7 load balancers can see and act upon.

Real-World Architecture: Multi-Tier Balancing

Large systems often use both tiers together. A high-performance L4 balancer (like AWS Network Load Balancer or Google's Maglev) sits at the internet edge, absorbing raw TCP connections and spreading them across a cluster of L7 proxies (like NGINX, HAProxy, or AWS Application Load Balancer). The L7 layer then performs content routing to dozens of microservice clusters behind it.

This design combines the raw speed of L4 with the intelligence of L7, and it means no single L7 proxy is the sole entry point for internet traffic.

Practical Examples by Stack

AWS: Network Load Balancer (L4) → Application Load Balancer (L7) → ECS tasks or EC2 instances
Self-hosted: NGINX (L7, with L4 available through its stream module) or HAProxy (L4 + L7) in front of app servers; both are open source, battle-tested, and able to sustain very high request rates on modest hardware
Kubernetes: An Ingress controller (e.g., NGINX Ingress or Traefik) is an L7 load balancer; Service objects of type LoadBalancer provision an L4 entry point from the cloud provider
Cloudflare/Fastly: Global L7 balancing at the edge, with built-in DDoS protection and caching — often the first layer before any of your own infrastructure

Common pitfall — sticky sessions at L7: When you inject a routing cookie for session stickiness, you partially undo load distribution. If one "sticky" server handles heavy users, it can still become overloaded. Prefer stateless application design (store sessions in Redis/DB) so any backend can serve any request, and you get true uniform distribution.
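
The stateless alternative can be sketched as follows. The `SessionStore` class is an in-memory stand-in for Redis or a database, and every name here is hypothetical. The point is only that no backend keeps session state of its own.

```python
class SessionStore:
    """In-memory stand-in for a shared store such as Redis or a database."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return self._data.get(session_id, {})

    def save(self, session_id, session):
        self._data[session_id] = dict(session)


class Backend:
    """A stateless app server: all session state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.load(session_id)
        cart = list(session.get("cart", []))
        cart.append(item)
        self.store.save(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
backends = [Backend("app-1", store), Backend("app-2", store)]

# The same user lands on a different backend for each request.
backends[0].add_to_cart("session-abc", "book")
cart = backends[1].add_to_cart("session-abc", "pen")  # ["book", "pen"]
```

Because the second backend sees the item added through the first, the load balancer is free to send each request wherever capacity is available, with no routing cookie required.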

Key Metrics to Watch

Once a load balancer is in place, monitor these metrics per upstream server:

Requests per second (RPS): is distribution actually uniform?
Active connections: a count that keeps growing on one server reveals a slow backend accumulating work.
Response time (p50 / p95 / p99): p99 latency spikes can reveal unhealthy nodes before health checks trip.
Error rate (5xx): sustained errors are what trigger automatic removal from rotation; also track false positives, where a healthy server is removed by mistake.
Backend queue depth: if the LB queues requests waiting for a free connection slot, you need more capacity.
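
Several of these metrics can be derived from the load balancer's access log. The sketch below is one possible aggregation, assuming each log record has already been reduced to an `(upstream, latency_ms, status)` tuple. That record shape and the function name are assumptions, not any particular product's log format.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(records, window_seconds):
    """Aggregate (upstream, latency_ms, status) records into per-upstream metrics."""
    by_upstream = defaultdict(list)
    for upstream, latency_ms, status in records:
        by_upstream[upstream].append((latency_ms, status))

    summary = {}
    for upstream, rows in by_upstream.items():
        latencies = [latency for latency, _ in rows]
        errors = sum(1 for _, status in rows if 500 <= status <= 599)
        if len(latencies) >= 2:
            # quantiles(n=100) returns the 99 cut points p1..p99.
            cuts = quantiles(latencies, n=100, method="inclusive")
            p50, p95, p99 = cuts[49], cuts[94], cuts[98]
        else:
            p50 = p95 = p99 = latencies[0]
        summary[upstream] = {
            "rps": len(rows) / window_seconds,
            "error_rate": errors / len(rows),
            "p50_ms": p50, "p95_ms": p95, "p99_ms": p99,
        }
    return summary


# Hypothetical records covering a two-second window.
records = [
    ("app-1", 10, 200), ("app-1", 20, 200), ("app-1", 30, 200), ("app-1", 40, 500),
    ("app-2", 15, 200), ("app-2", 25, 200),
]
report = summarize(records, window_seconds=2)
```

In this sample, `app-1` receives twice the request rate of `app-2` and a quarter of its responses are 5xx, which is the kind of skew the per-upstream breakdown is meant to surface. Percentiles from a handful of samples are not meaningful; production systems compute them over far larger windows.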

Terminology check: The terms "reverse proxy" and "load balancer" are often used interchangeably in practice. Technically, a reverse proxy is any component that forwards requests on behalf of the server side (and can cache, compress, or terminate TLS). A load balancer is a reverse proxy whose primary job is to distribute requests. NGINX proxying to an upstream group of several servers is both at once.