Scaling & Load Balancing

Project: Scale a System from 1 to 1M Users

18 min Lesson 10 of 10

Project: Scale a System from 1 to 1M Users

This capstone lesson ties together every concept from the tutorial. Rather than studying techniques in isolation, we will walk through the complete lifecycle of a real product — a social bookmarking service where users save, tag, and share URLs — and evolve its architecture stage by stage as the user base grows from a single developer running it on a laptop to one million active users generating millions of requests per day.

At each stage we will identify the bottleneck that limits further growth, apply the right technique from our toolkit, and acknowledge the trade-offs introduced. This is exactly how production systems are built in the real world: not designed perfectly upfront, but grown deliberately in response to measured pressure.

The golden rule of scaling: Do not add complexity until you have evidence you need it. Every layer of infrastructure you add is a layer you must operate, monitor, and debug at 3 AM. Scale in response to real load, not hypothetical load.

Stage 1 — Single Server (0 to ~1,000 Users)

You launch. The entire stack lives on one cloud VM: an Nginx web server, a PHP/Node/Python application process, and a Postgres database — all on a single t3.medium (2 vCPU, 4 GB RAM). This handles roughly 50–200 requests per second with ease.

What works: Dead-simple deployment, cheap (~$30/month), zero operational overhead. You can iterate fast because there is nothing to coordinate.

What breaks first: The database and the web server are competing for the same CPU and memory. A slow query starves the web process. A traffic spike OOMs the machine and takes everything offline. There is a single point of failure — one hardware fault and the service is gone.

Stage 1: the entire stack on one VM — simple, cheap, and fragile.

Stage 2 — Separate the Database (1,000 to ~10,000 Users)

Bottleneck identified: The application and database are resource-contending on the same host. A surge in write operations slows query execution which slows page renders.

Fix: Move Postgres to a dedicated database VM (e.g. r6i.2xlarge — memory-optimized). The application VM now focuses solely on processing HTTP requests. Connection from app to DB goes over the private VPC network — low latency, no internet exposure.

Gain: Each tier can be sized independently. You can give the database more RAM (more buffer pool = more hot data in memory = fewer disk reads) without wasting money on the application tier.

Trade-off: You now have two machines to operate and monitor. The database is still a single point of failure. A hardware fault on the DB VM takes the whole service down.

Right-size early: When separating the DB, immediately enable automated backups and point-in-time recovery on your managed database service (e.g. AWS RDS). The cost is negligible and the protection is enormous. Losing the database means losing the entire product.

Stage 3 — Add a Cache Layer (10,000 to ~100,000 Users)

Bottleneck identified: Read traffic is hammering the database. A popular user's bookmark list is being loaded from Postgres thousands of times per minute — the same rows, over and over.

Fix: Deploy a Redis cluster between the application servers and the database. Cache the results of expensive or frequently-repeated queries. Implement cache-aside: on a cache miss, the app reads from DB, writes the result to Redis with a TTL (e.g. 60 seconds), and subsequent reads hit Redis for that window.

Gain: A well-tuned cache absorbs 80–95% of read traffic. Database query load drops dramatically, latency falls from ~20ms (DB round trip) to ~1ms (Redis), and throughput scales with the cache fleet rather than the DB.

Trade-off: Cache invalidation is one of the hardest problems in computer science. Stale data is now possible — a user updates their profile but another user sees the old name for 60 seconds. Choose TTLs and invalidation strategies carefully based on the freshness requirements of each data type.

Stage 4 — Horizontal Application Servers + Load Balancer (100,000 to ~500,000 Users)

Bottleneck identified: The single application server is saturating its CPU. During peak hours (evenings and weekends) response times climb and the process queue backs up.

Fix: Ensure the application is fully stateless (sessions in Redis, no local file writes), then launch two or three identical application servers behind a load balancer. Use round-robin or least-connections algorithm. Enable health checks so the load balancer routes around a failed instance automatically.

Gain: Horizontal scaling of the application tier is nearly linear — 3 servers handle roughly 3× the throughput. Deployments become zero-downtime: take one server out of rotation, upgrade it, bring it back, repeat.

Trade-off: The load balancer is now a critical component — it must itself be redundant (active-passive pair, or a managed service like AWS ALB that is HA by design). Session state must be externalized before this step, or users will experience random session loss as requests land on different servers.

Stage 4: load-balanced stateless app servers, a Redis cache layer, and a dedicated Postgres instance.

Stage 5 — Read Replicas for Database Reads (500,000 to ~1M Users)

Bottleneck identified: Even with Redis absorbing the majority of reads, the cache miss path still hits Postgres. Complex search queries (full-text search across bookmarks, tag aggregations) bypass the cache entirely and strain the primary. Replication lag on the single DB is growing as writes increase.

Fix: Add two read replicas of Postgres. Route all read queries (SELECT) to the replicas via a lightweight read router or a library convention. Keep all writes (INSERT, UPDATE, DELETE) on the primary. This is primary-replica replication — the primary streams its write-ahead log (WAL) to replicas, which apply it asynchronously.

Gain: Read throughput scales with the number of replicas. Search and reporting queries that used to burden the primary now run on dedicated read-only machines. The primary focuses exclusively on writes, reducing contention and improving write latency.

Trade-off: Replication is asynchronous — there is a small lag (typically 10–100ms, but can grow under write pressure). A user who writes a bookmark and immediately reads it back might see stale data if their read is served by a replica that has not yet applied the write. Mitigate by reading from the primary immediately after a write for that user session ("read-your-writes" consistency), or using synchronous replication for critical paths at the cost of write latency.

Stage 6 — CDN for Static Assets & Global Edge Caching

Bottleneck identified: ~60% of your HTTP responses are static assets — JavaScript bundles, CSS, images, favicons — that never change between requests. These are served by your application servers, consuming bandwidth and CPU cycles that should be reserved for dynamic requests.

Fix: Place a CDN (CloudFront, Fastly, Cloudflare) in front of the entire application. Static assets (JS, CSS, images) are cached at CDN edge nodes worldwide. Dynamic responses can use CDN with short TTLs or bypass caching entirely. Configure Cache-Control headers carefully: max-age=31536000, immutable for versioned assets, no-store for authenticated API responses.

Gain: Static assets are delivered from an edge node a few milliseconds from the user, regardless of where your origin servers are. Origin server load drops 60–80%. Users in distant regions (Southeast Asia, Africa) experience the same fast load times as users near your data center.

CDN cache poisoning: Never serve authenticated or user-specific content through a CDN cache without scoping cache keys by Authorization header or cookie. A misconfigured CDN can serve one user's private data to another user — a serious security breach. Always add Vary: Cookie or use a distinct URL namespace for authenticated API routes.

Stage 7 — Message Queue for Async Work

Bottleneck identified: Certain operations — sending email confirmations, generating link previews, recalculating tag popularity rankings — are synchronous and slow. They block the HTTP response thread for seconds, degrading perceived performance and wasting expensive application server resources on non-interactive work.

Fix: Introduce a message queue (RabbitMQ or SQS). When a user saves a bookmark, the HTTP handler enqueues a job message and returns a 202 Accepted immediately. A separate pool of worker processes consumes the queue, performs the slow work (fetch page metadata, send confirmation email), and retries on failure. The web tier and the worker tier scale independently.

Gain: HTTP response time drops from 2–3 seconds to 50ms. Transient failures in downstream services (email provider down) no longer cause HTTP errors for users — jobs are queued and retried. Workers can be scaled independently of web servers based on queue depth.

Stage 8 — Auto-Scaling Groups & Elasticity

Bottleneck identified: Traffic is not uniform. The service has sharp peaks on weekday mornings (users catching up on saved links) and valleys at 4 AM. Running enough servers to handle peak load 24/7 is wasteful and expensive.

Fix: Convert your application server fleet to an Auto Scaling Group (ASG). Define minimum (2), desired (4), and maximum (12) instance counts. Configure scale-out triggers (add 2 instances when average CPU > 70% for 2 minutes) and scale-in triggers (remove 1 instance when average CPU < 30% for 10 minutes). Use predictive scaling to pre-warm capacity before known peak periods.

Gain: Infrastructure cost drops 40–60% because you are only paying for capacity you actually need. The system automatically absorbs viral traffic spikes without human intervention.

Trade-off: New instances take 2–4 minutes to bootstrap (install packages, warm JVM, build code caches). If a spike is sudden, you may experience degraded performance for the warm-up period. Mitigate with pre-baked AMIs (machine images with the app already installed) to reduce boot time to 30–60 seconds.

The Full 1M-User Architecture at a Glance

Full 1M-user architecture: CDN, HA load balancer, auto-scaling app tier, Redis cache, message queue, Postgres primary with two read replicas.

Scaling Decision Cheat Sheet

Every architecture evolution followed the same pattern: measure, identify the bottleneck, apply the minimum change that resolves it, and measure again. Here is a condensed reference:

App server CPU saturated? Add more app servers behind the load balancer (horizontal scale). Ensure statelessness first.
Database read traffic high? Add Redis cache. If cache hit rate is already high, add read replicas.
Database writes are the bottleneck? Vertical scale the primary. If that ceiling is hit, consider sharding by user ID or tenant.
Static assets consuming origin bandwidth? Add a CDN. Configure aggressive cache headers on immutable assets.
Slow synchronous jobs degrading HTTP latency? Push them to a message queue and process asynchronously.
Traffic is spiky / unpredictable? Configure auto-scaling with pre-baked machine images for fast boot times.
Global users experiencing high latency? Multi-region deployment with data residency considerations, global load balancing, or geo-DNS routing.

Observability is not optional: Every scaling decision in this lesson was triggered by a measured bottleneck. You cannot make those measurements without proper observability: metrics (CPU, memory, request rate, error rate, latency percentiles), distributed tracing, and centralized logs. Before you scale, instrument. You are flying blind without it.

What Comes After 1M Users?

Reaching one million users with the architecture above is a significant milestone, but the journey does not end there. At tens or hundreds of millions of users, the next set of challenges emerge: database sharding to spread write load across partitions, multi-region deployments to survive data-center outages, service decomposition (microservices) to allow teams to deploy independently, and event-driven architectures to decouple components at scale. These are advanced topics — but by applying the incremental, evidence-driven approach you have practised here, each of them can be tackled systematically, one bottleneck at a time.

Course complete: This lesson concludes the Scaling & Load Balancing tutorial. You have covered the full spectrum — from the theoretical difference between vertical and horizontal scaling, through load balancing algorithms and health checks, all the way to a concrete 8-stage architecture evolution. The most important skill you have built is not any individual technique, but the disciplined habit of identifying bottlenecks before adding complexity.

Previous Lesson Auto-Scaling & Elasticity

Back to Course System Design

Back to Course

Project: Scale a System from 1 to 1M Users

Project: Scale a System from 1 to 1M Users

Stage 1 — Single Server (0 to ~1,000 Users)

Stage 2 — Separate the Database (1,000 to ~10,000 Users)

Stage 3 — Add a Cache Layer (10,000 to ~100,000 Users)

Stage 4 — Horizontal Application Servers + Load Balancer (100,000 to ~500,000 Users)

Stage 5 — Read Replicas for Database Reads (500,000 to ~1M Users)

Stage 6 — CDN for Static Assets & Global Edge Caching

Stage 7 — Message Queue for Async Work

Stage 8 — Auto-Scaling Groups & Elasticity

The Full 1M-User Architecture at a Glance

Scaling Decision Cheat Sheet

What Comes After 1M Users?

Tutorial Complete!