Caching & CDNs

Why Cache?

Every large-scale system eventually hits the same two walls: latency and load. A database query that takes 80 ms is fine for ten users. For ten million users it can bring your entire infrastructure to its knees. Caching is the single most impactful technique for breaking through both walls simultaneously.

This lesson explains the fundamental problem that caching solves, gives you concrete numbers to reason about, and shows — with diagrams — how a well-placed cache transforms a system's performance profile.

The Latency Problem

Different layers of a computer system have wildly different access speeds. These numbers are approximate but well-established industry benchmarks:

  • CPU L1 cache read: ~0.5 ns
  • RAM read: ~100 ns
  • SSD random read: ~100 µs (100,000 ns)
  • Database query (network + disk): 1 – 100 ms
  • Cross-region network round-trip: 150 – 300 ms

A database round-trip is at least a million times slower than reading from CPU cache: even a fast 1 ms query is 2,000,000 times slower than a 0.5 ns L1 read. When your application executes that query on every single HTTP request, the slowest operation dominates the user-visible response time. Caching moves frequently-read data closer to where it is consumed, ideally to RAM, so the application skips the expensive trip entirely.
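The ratios above follow directly from the list. The short Python sketch below makes the arithmetic explicit. The dictionary keys are illustrative names, and the values are the approximate figures quoted above rather than measurements. It also shows the comparison that matters most for caching: even a fast 1 ms database query is about 10,000 times slower than a RAM read.

```python
# Approximate access latencies from the list above, in nanoseconds.
# Order-of-magnitude figures for illustration, not measurements.
LATENCY_NS = {
    "l1_cache": 0.5,
    "ram": 100,
    "ssd_random_read": 100_000,
    "db_query_fast": 1_000_000,      # 1 ms
    "db_query_slow": 100_000_000,    # 100 ms
}


def slowdown(slow: str, fast: str) -> float:
    """How many times slower the `slow` operation is than the `fast` one."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


if __name__ == "__main__":
    print(f"fast DB query vs L1 cache: {slowdown('db_query_fast', 'l1_cache'):,.0f}x")
    print(f"fast DB query vs RAM:      {slowdown('db_query_fast', 'ram'):,.0f}x")
    print(f"slow DB query vs RAM:      {slowdown('db_query_slow', 'ram'):,.0f}x")
```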

Key Insight: Caching is fundamentally about trading storage space for time. You keep a copy of computed or fetched data in a fast medium so you do not have to recompute or re-fetch it on the next request.

The Load Problem

Latency is only half the story. The other half is throughput: how many requests your database can handle per second. A typical PostgreSQL instance on a cloud VM can sustain around 5,000–15,000 simple read queries per second before CPU or I/O becomes the bottleneck. A popular API endpoint that is hit 100,000 times per minute (about 1,667 req/s) sounds manageable — until a traffic spike triples that number, or a slow query (a join across three large tables) drops the sustainable throughput to 500 req/s.

A cache absorbs the majority of read traffic before it ever reaches the database. If 90% of requests hit the cache (cache hit ratio = 90%), the database only sees 10% of the original load — a 10× reduction. At a 99% hit ratio, the reduction is 100×. This is why companies like Twitter, Facebook, and Netflix describe their caching tiers as essential infrastructure, not an optimisation afterthought.
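The hit-ratio arithmetic fits in two one-line functions. This is a sketch under a simplifying assumption: every request that misses the cache costs exactly one database query. The function names are hypothetical, and the example traffic reuses the ~1,667 req/s endpoint described earlier, plus a 3× spike.

```python
import math


def db_load(requests_per_sec: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return requests_per_sec * (1.0 - hit_ratio)


def load_reduction(hit_ratio: float) -> float:
    """Factor by which the cache divides database read load: 1 / miss ratio."""
    miss_ratio = 1.0 - hit_ratio
    return math.inf if miss_ratio == 0 else 1.0 / miss_ratio


# Example: the ~1,667 req/s endpoint, then a 3x traffic spike.
for rps in (1_667, 3 * 1_667):
    for hit_ratio in (0.0, 0.9, 0.99):
        print(f"{rps:>5} req/s, hit ratio {hit_ratio:.0%}: "
              f"{db_load(rps, hit_ratio):7.1f} DB queries/s "
              f"({load_reduction(hit_ratio):.0f}x reduction)")
```

Even during the spike, a 90% hit ratio keeps the database at roughly 500 queries per second.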

Before Caching: Every Request Hits the Database

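Consider a product page on an e-commerce site. Each page load fires several queries: fetch product details, fetch reviews, fetch related items, fetch inventory count. To keep the arithmetic simple, the figures that follow assume three queries per page at roughly 40 ms each. Without a cache: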

[Diagram: request flow without caching. Clients A, B and C send requests to an app server with no cache; each request fires 3 queries at the primary DB (~40 ms per query). Response time: ~120–200 ms. The DB handles all 3 × N req/s.]
Without caching: every client request triggers multiple DB queries, accumulating latency and saturating the database under load.

With three clients each firing three queries, the database handles nine queries. If 1,000 users each load the page once per second, it handles 3,000 queries per second, for a single page. Add a traffic spike and the database CPU pins at 100%, query times balloon, connection pools are exhausted, and pages start returning errors.

After Caching: Reads Served from Memory

Now add a cache layer, typically an in-memory store like Redis, between the application and the database. The first request for a product page still hits the database, but the result is then stored in the cache. Every subsequent request for the same data, until that entry expires or is invalidated, is served directly from memory in under 1 ms without touching the database at all.
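The read path just described is commonly called cache-aside (or lazy loading): check the cache, fall back to the database on a miss, then store the result for later requests. The sketch below uses a plain Python dict with a time-to-live instead of Redis so it runs without a server. `CacheAside` and `fetch_product_from_db` are hypothetical names, and the 60-second TTL is an arbitrary illustrative choice.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Cache-aside (lazy-loading) reads with a per-entry time-to-live.

    A plain dict stands in for Redis so the sketch runs without a server.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                          # hit: served from memory
        value = loader()                             # miss or expired: query the source
        self._store[key] = (now + self.ttl, value)   # keep it for later requests
        return value


# Hypothetical stand-in for the ~40 ms product query; counts its own calls.
db_stats = {"calls": 0}


def fetch_product_from_db(product_id: int) -> dict:
    db_stats["calls"] += 1
    return {"id": product_id, "name": f"Product {product_id}"}


cache = CacheAside(ttl_seconds=60)
for _ in range(10):
    product = cache.get("product:42", lambda: fetch_product_from_db(42))
print(db_stats["calls"])  # 1: only the first of ten requests reached the "database"
```

With Redis, the dict lookup becomes a `GET`, and storing a result on a miss becomes a `SET` with an expiry (for example `SET product:42 <value> EX 60`). The TTL bounds how long a stale copy can survive, which is one simple answer to the invalidation problem discussed later in this lesson.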

[Diagram: request flow with caching. Clients A, B and C send requests to the app server, which checks the cache first (< 1 ms reads). Hits (90%) are answered from the cache; misses (10%) go to the primary DB (~40 ms per query) and the result is stored in the cache. Response time: ~1–5 ms. DB load reduced 10×.]
With caching: 90% of reads are served from in-memory cache in under 1 ms; only cache misses reach the database.

The transformation is dramatic. The same page that took 120–200 ms now returns in 1–5 ms for the vast majority of users, namely those whose requests hit the cache. At a 90% hit ratio, the database, which was handling 3,000 queries per second, now handles 300. It has headroom for complex analytical queries, writes, and traffic spikes.
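The 1–5 ms figure describes requests that hit the cache. The mean across all requests also includes the misses, which still pay the full database cost. Here is a sketch of that weighted average, assuming (for illustration only) a 1 ms cache read and a miss that costs the cache check plus three sequential ~40 ms queries. The function and constant names are hypothetical.

```python
def expected_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Mean response time when a fraction `hit_ratio` of requests hit the cache."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


HIT_MS = 1.0                  # assumed: in-memory cache read
MISS_MS = HIT_MS + 3 * 40.0   # assumed: cache check, then three ~40 ms DB queries

for hit_ratio in (0.0, 0.9, 0.99):
    mean = expected_latency_ms(hit_ratio, HIT_MS, MISS_MS)
    print(f"hit ratio {hit_ratio:.0%}: mean ~{mean:.1f} ms")
```

Under these assumptions the mean is about 13 ms at a 90% hit ratio and about 2 ms at 99%, while each individual miss still takes around 121 ms.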

A Concrete Real-World Example

Twitter's timeline is one of the most-read data structures on the internet. Early Twitter fetched timelines by querying the database in real time, joining a user's follows with their recent tweets. At scale this was unsustainable. The solution was to pre-compute and cache each user's timeline in Redis. Reading a cached timeline takes under 1 ms. Without the cache, the same read required a multi-table join that took 200–500 ms under load, so caching made timeline reads at least 200–500× faster.
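Precomputing timelines is usually described as fan-out on write: the cost of assembling a timeline moves from read time to write time. The sketch below illustrates the idea with in-memory dicts and bounded deques standing in for Redis lists. It is a simplified, hypothetical model, not Twitter's actual implementation; all names and the 500-entry cap are assumptions.

```python
from collections import defaultdict, deque
from itertools import islice
from typing import DefaultDict, Deque, List, Set

TIMELINE_CAP = 500  # hypothetical: newest entries kept per cached timeline

followers: DefaultDict[str, Set[str]] = defaultdict(set)
timelines: DefaultDict[str, Deque[str]] = defaultdict(
    lambda: deque(maxlen=TIMELINE_CAP))


def follow(follower: str, followee: str) -> None:
    followers[followee].add(follower)


def post_tweet(author: str, tweet_id: str) -> None:
    # Fan-out on write: push the new tweet ID into every follower's cached
    # timeline (and the author's own), newest first. When a deque is full,
    # appendleft drops the oldest ID from the right-hand end.
    for user in followers[author] | {author}:
        timelines[user].appendleft(tweet_id)


def read_timeline(user: str, limit: int = 20) -> List[str]:
    # No join at read time: a read is just the first `limit` cached IDs.
    return list(islice(timelines[user], limit))


follow("bob", "alice")
follow("carol", "alice")
post_tweet("alice", "t1")
post_tweet("alice", "t2")
print(read_timeline("bob"))  # ['t2', 't1']
```

The trade-off shows up in `post_tweet`: one write touches every follower's timeline, which gets expensive for accounts with very large followings, while each read is a cheap slice of a list.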

Facebook's TAO, a graph-aware caching tier in front of MySQL that took over much of the social-graph traffic previously served through Memcached, handles over one billion reads per second, absorbing a workload that no relational database fleet could handle directly.

Rule of Thumb: If the same data is read more than it is written, and the data does not need to be perfectly real-time, it is a candidate for caching. Start with your most-read, most-expensive-to-compute data first.

What Caching Does NOT Solve

Caching is powerful but not a silver bullet. It is important to understand its limits from the start:

  • Write-heavy workloads: Caching primarily helps reads. If your bottleneck is write throughput, you need different techniques (write-ahead logs, sharding, async writes).
  • Unique, non-repeating queries: A cache only saves time if the same data is requested more than once. One-time reports or personalised queries with no re-use have a 0% hit rate.
  • Stale data risk: A cache holds a copy of data. If the source changes and the cache is not updated, clients read outdated information. Managing this — cache invalidation — is notoriously hard and a core topic of this tutorial.
  • Memory cost: Caches live in RAM, which is expensive. You must make deliberate decisions about what to cache and for how long.

Classic Pitfall: Caching a query that is never repeated gives you zero benefit but adds system complexity. Always measure your hit rate after introducing a cache. A hit rate below 50% is often a sign you are caching the wrong things.
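Measuring hit rate only requires counting hits and misses. For an in-process cache in Python, `functools.lru_cache` already tracks both through `cache_info()`. The sketch below uses it with a hypothetical `product_details` function and a made-up request sequence. For a Redis cache, the server-wide equivalents are the `keyspace_hits` and `keyspace_misses` counters reported by `INFO stats`.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def product_details(product_id: int) -> dict:
    # Hypothetical expensive lookup; imagine a ~40 ms database query here.
    return {"id": product_id}


def hit_rate(info) -> float:
    """Fraction of calls answered from the cache (0.0 if nothing was called)."""
    total = info.hits + info.misses
    return info.hits / total if total else 0.0


# Simulated traffic: mostly repeated reads of a few popular products.
for product_id in [1, 2, 1, 1, 3, 1, 2, 1, 1, 1]:
    product_details(product_id)

info = product_details.cache_info()
print(f"hits={info.hits} misses={info.misses} hit rate={hit_rate(info):.0%}")
# hits=7 misses=3 hit rate=70%
```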

Summary

Caching solves two fundamental problems in large-scale systems: latency (slow data access) and load (database saturation). By keeping a copy of frequently-read data in a fast in-memory store, you can reduce response times from hundreds of milliseconds to under a millisecond, and cut database load by 90–99%. The rest of this tutorial builds on this foundation — exploring where to cache, which strategies to use, how to handle invalidation, and how to avoid the many pitfalls along the way.