Caching & CDNs

Cache Invalidation

Phil Karlton's famous quip, "There are only two hard things in Computer Science: cache invalidation and naming things," is funny precisely because it is true. Serving stale data can be just as damaging as serving no data at all. A product page showing yesterday's price for an item that has since sold out, a permissions check returning a cached "allowed" after the user was banned, a DNS record still pointing at a decommissioned server: all are cache invalidation failures in the wild.

Cache invalidation is the discipline of deciding when and how to remove or replace cached entries so that clients never observe data that is more out-of-date than the system can tolerate. It is genuinely hard because it requires coordinating two or more data stores (the cache and the source of truth) in the face of concurrent writes, partial failures, and network partitions.

The Root Cause: Dual-State Problem

Once you have a cache, the same logical record exists in two places: the origin store (database, object storage, microservice) and the cache. Every write to the origin creates a window of inconsistency. How long that window lasts — and whether it matters — is the central question of invalidation strategy.

Key idea: Cache invalidation is not just a technical problem — it is a product decision. The acceptable staleness for a bank balance (near-zero) differs from the acceptable staleness for a blog post view count (minutes). Always establish the staleness budget before choosing a strategy.

Figure: After a write to the origin DB, the cache still holds the old value; readers see stale data until the cache is invalidated.

Strategy 1: TTL-Based Expiry (Passive Invalidation)

The simplest approach is to let every cache entry carry a Time-To-Live. When the TTL expires, the next read triggers a cache miss and the fresh value is fetched. No explicit coordination is needed between the writer and the cache.

Pros: Zero coupling between writer and cache; simple to implement; built into Redis, Memcached, HTTP (Cache-Control: max-age), and virtually every CDN.
Cons: Staleness is bounded by the TTL rather than by when writes actually happen. With a TTL of 60 seconds, a cached value can be up to 60 seconds out of date. Choosing the right TTL requires knowing the write frequency and the acceptable staleness, which is often domain knowledge that is hard to quantify.
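
To make the passive behaviour concrete, here is a minimal sketch of a read-through cache with a fixed TTL. The class name TTLCache, the load callback, and the injectable clock are illustrative assumptions, not part of any particular library; a real deployment would normally rely on the TTL support built into Redis, Memcached, or the CDN.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache whose entries expire after a fixed TTL (illustrative)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # hit: the value may be up to ttl_seconds out of date
        value = load(key)  # miss or expired entry: fetch from the source of truth
        self._entries[key] = (value, now + self._ttl)
        return value


# The writer updates the origin and never touches the cache.
origin = {"price:42": 100}
now = [0.0]  # fake clock, advanced by hand
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])

first = cache.get("price:42", origin.__getitem__)  # miss: loads from origin
origin["price:42"] = 120                           # write happens at t = 0
now[0] = 59.0
stale = cache.get("price:42", origin.__getitem__)  # still inside the TTL window
now[0] = 60.0
fresh = cache.get("price:42", origin.__getitem__)  # TTL elapsed: reloads from origin
```

The reader at t = 59 still sees the old price even though the write happened almost a minute earlier, which is exactly the staleness window the TTL defines.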

A practical rule of thumb: set the TTL to at most half the expected change interval, and never above the staleness budget. A product catalog updated twice a day (roughly every 12 hours) can tolerate a TTL of 3–6 hours. A stock ticker should use a TTL of 1–2 seconds at most, or skip TTL-only caching entirely.
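
The arithmetic behind this heuristic can be written down in a couple of lines. The function name suggest_ttl_seconds is hypothetical, and capping the result at the staleness budget is an assumption layered on top of the "half the change interval" rule.

```python
def suggest_ttl_seconds(change_interval_s: float, staleness_budget_s: float) -> float:
    """Half the expected change interval, capped by the staleness budget (a heuristic)."""
    if change_interval_s <= 0 or staleness_budget_s <= 0:
        raise ValueError("intervals must be positive")
    return min(change_interval_s / 2, staleness_budget_s)


HOUR = 3600
# Catalog updated twice a day: changes arrive roughly every 12 hours.
catalog_ttl = suggest_ttl_seconds(12 * HOUR, staleness_budget_s=24 * HOUR)
# Same catalog, but the product team tolerates at most 4 hours of staleness.
strict_catalog_ttl = suggest_ttl_seconds(12 * HOUR, staleness_budget_s=4 * HOUR)
```

Here catalog_ttl works out to 6 hours (the upper end of the 3–6 hour range), while the stricter budget pulls the TTL down to 4 hours.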

Tip (TTL jitter): When thousands of cache entries are created at the same time with the same TTL (e.g., during a bulk import), they all expire together, and the resulting burst of misses hits the origin at once (a thundering herd). Add random jitter to the TTL (base_ttl + rand(0, jitter)) so expirations spread out over time.
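
A minimal sketch of the jitter formula, assuming TTLs are expressed in seconds; the helper name jittered_ttl and the optional seeded generator are illustrative choices.

```python
import random
from typing import Optional


def jittered_ttl(base_ttl: float, jitter: float, rng: Optional[random.Random] = None) -> float:
    """Return base_ttl plus a uniform random offset in [0, jitter]."""
    if jitter < 0:
        raise ValueError("jitter must be non-negative")
    draw = rng.uniform if rng is not None else random.uniform
    return base_ttl + draw(0.0, jitter)


# Bulk import: 1000 entries written at the same moment.
rng = random.Random(7)  # seeded only so the sketch is repeatable
ttls = [jittered_ttl(3600, 300, rng) for _ in range(1000)]
```

Instead of 1000 entries expiring in the same second, expirations are spread across a five-minute window after the one-hour mark.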

Strategy 2: Write-Through Invalidation (Active / Eager)

The writer is responsible for keeping the cache consistent. Whenever it writes to the origin, it immediately either deletes the cache entry (invalidation) or updates it in place (write-through). The next read will either be a cache miss (and refresh from origin) or will find the freshly written value.

Delete-on-write (invalidate): Simpler and safer. The writer calls cache.del(key) after persisting to DB. The next reader pays a cache miss but always gets fresh data. This is the most common pattern in application-layer caches.
Update-on-write (write-through): The writer both persists to DB and sets the cache to the new value. Eliminates the miss, but adds complexity: what if the DB write succeeds and the cache write fails? You now have inconsistency in the opposite direction.
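
The two variants differ by a single line in the write path. The sketch below uses plain dictionaries as stand-ins for the database and the cache; the function names are hypothetical, and a real cache client would add error handling around each call.

```python
from typing import Any, Dict, Optional

db: Dict[str, Any] = {}     # stand-in for the origin store
cache: Dict[str, Any] = {}  # stand-in for Redis or Memcached


def read(key: str) -> Optional[Any]:
    """Cache-aside read: try the cache, fall back to the origin and populate."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def write_with_delete(key: str, value: Any) -> None:
    """Delete-on-write: persist first, then drop the cached copy."""
    db[key] = value
    cache.pop(key, None)  # the next reader takes a miss and reloads


def write_through(key: str, value: Any) -> None:
    """Update-on-write: persist first, then overwrite the cached copy."""
    db[key] = value
    cache[key] = value  # if this step failed in a real system, the cache would keep the old value


db["user:1"] = "alice"
read("user:1")                                   # warms the cache
write_with_delete("user:1", "alicia")
cached_after_delete = "user:1" in cache          # the entry is gone
value_after_delete = read("user:1")              # miss, reloaded from db
write_through("user:1", "ali")
cached_after_write_through = cache.get("user:1") # already holds the new value
```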

Race condition with delete-on-write: Thread A misses the cache and reads the old value from the DB, but has not yet written it to the cache. Thread B then commits the new value to the DB and deletes the cache key. Thread A resumes and populates the cache with the old value, which stays there until the next invalidation or TTL expiry. Two common mitigations are cache-aside with version checking (the cache rejects a value older than one it has already seen) and the double-delete pattern: delete on write, then delete again after a short delay (~500 ms) to catch in-flight reads.
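
One way to implement version checking is sketched below. It assumes every row carries a monotonically increasing version number, and that invalidation leaves a versioned tombstone instead of removing the key outright; both are assumptions of this sketch, and the class VersionedCache is hypothetical. The interleaving is replayed step by step in a single thread so the outcome is deterministic.

```python
from typing import Any, Dict, Optional, Tuple

_TOMBSTONE = object()  # marks "invalidated at this version"


class VersionedCache:
    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None or entry[1] is _TOMBSTONE:
            return None  # treat tombstones as misses
        return entry[1]

    def set_if_newer(self, key: str, version: int, value: Any) -> bool:
        """Store the value unless the cache has already seen a newer version."""
        entry = self._data.get(key)
        if entry is not None and entry[0] > version:
            return False
        self._data[key] = (version, value)
        return True

    def invalidate(self, key: str, version: int) -> None:
        """Record that everything older than `version` is obsolete."""
        entry = self._data.get(key)
        if entry is None or entry[0] < version:
            self._data[key] = (version, _TOMBSTONE)


db = {"k": (1, "old")}
cache = VersionedCache()

a_version, a_value = db["k"]     # Thread A: cache miss, reads version 1 from the DB
db["k"] = (2, "new")             # Thread B: commits version 2 ...
cache.invalidate("k", 2)         # ... and invalidates the key
a_accepted = cache.set_if_newer("k", a_version, a_value)  # Thread A resumes with stale data

c_version, c_value = db["k"]     # A later reader misses and reloads version 2
c_accepted = cache.set_if_newer("k", c_version, c_value)
final_value = cache.get("k")
```

Thread A's late write is rejected because the tombstone carries a higher version, while the later reader's write of version 2 is accepted. In practice tombstones would be given a short TTL so they do not accumulate.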

Strategy 3: Event-Driven Invalidation (CDC / Pub-Sub)

Rather than coupling every writer to cache-clearing logic, you can use a change-data-capture (CDC) pipeline or a message bus. When the database commits a row change, an event is emitted (via Kafka, Redis Pub/Sub, or a CDC tool like Debezium) and all cache nodes subscribed to that key pattern clear or refresh their entries.

This is how large platforms keep caches consistent across many microservices without every service knowing about every cache. Netflix, Airbnb, and Shopify have each publicly described CDC pipelines of this kind, and cache invalidation is a commonly cited use case for them.

Figure: A CDC agent reads the database binlog and publishes change events to a message bus; all cache nodes subscribe and invalidate the affected keys automatically.
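
The flow can be sketched with an in-memory bus standing in for Kafka or Redis Pub/Sub. Everything here is illustrative: the ChangeEvent shape, the topic name "db.changes", and the "table:primary_key" cache-key convention are assumptions, not the format emitted by any particular CDC tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, DefaultDict, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    primary_key: int
    op: str  # "insert", "update", or "delete"


class InMemoryBus:
    """Synchronous stand-in for a message bus."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[ChangeEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: ChangeEvent) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


class CacheNode:
    def __init__(self, name: str, bus: InMemoryBus) -> None:
        self.name = name
        self.entries: Dict[str, Any] = {}
        bus.subscribe("db.changes", self.on_change)

    def on_change(self, event: ChangeEvent) -> None:
        # Map the changed row to its cache key and drop it; writers never call this directly.
        self.entries.pop(f"{event.table}:{event.primary_key}", None)


bus = InMemoryBus()
node_a, node_b = CacheNode("a", bus), CacheNode("b", bus)
for node in (node_a, node_b):
    node.entries["products:42"] = {"price": 100}
    node.entries["products:7"] = {"price": 5}

# The CDC agent observes a committed update to products row 42.
bus.publish("db.changes", ChangeEvent(table="products", primary_key=42, op="update"))
```

After the event is published, both nodes have dropped products:42 and kept products:7, and the service that issued the database write needed no knowledge of either cache.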

Strategy 4: Cache Tags (Dependency-Based Invalidation)

Several frameworks and HTTP caches support cache tags (Laravel and Symfony at the application layer, Varnish at the HTTP layer): you attach logical group labels to entries when you write them, and you can invalidate all entries sharing a tag with a single call.

// Tag a cache entry with multiple logical groups
cache()->tags(['product:42', 'category:electronics'])->put('product_42_detail', $data, 3600);

// When product 42 changes — clear everything tagged to it
cache()->tags(['product:42'])->flush();

// When the whole electronics category is restructured
cache()->tags(['category:electronics'])->flush();

This is powerful for object graphs: a cached rendered page might depend on a product, its category, its author profile, and the current promotional banner. Tagging the page entry with all four means any change to any of those objects automatically invalidates the page — no manual bookkeeping.

Not all cache stores support tags. Neither Redis nor Memcached has a native tag concept; frameworks layer tags on top of them in application code, while simpler backends (in Laravel, for example, the file and database drivers) and basic HTTP caches offer no tag support. A common implementation stores a tag-to-key map alongside the cache, so flushing a tag removes every key listed under it.
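
The tag-to-key map can be sketched in a few lines. This is an in-memory illustration only; the class TaggedCache and its method names are hypothetical and do not mirror how any specific framework stores its tag sets.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, Iterable, Optional, Set


class TaggedCache:
    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._keys_by_tag: DefaultDict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, value: Any, tags: Iterable[str]) -> None:
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)  # remember which keys depend on this tag

    def get(self, key: str) -> Optional[Any]:
        return self._values.get(key)

    def flush_tag(self, tag: str) -> int:
        """Remove every key listed under the tag; returns how many keys were listed."""
        keys = self._keys_by_tag.pop(tag, set())
        for key in keys:
            # A key may linger in other tag sets; flushing those later is a harmless no-op.
            self._values.pop(key, None)
        return len(keys)


cache = TaggedCache()
cache.put("product_42_detail", {"name": "Phone"}, tags=["product:42", "category:electronics"])
cache.put("product_43_detail", {"name": "Tablet"}, tags=["product:43", "category:electronics"])

flushed_product = cache.flush_tag("product:42")             # product 42 changed
still_cached = cache.get("product_43_detail")               # unrelated entry survives
flushed_category = cache.flush_tag("category:electronics")  # category restructured
```

Flushing product:42 removes only that product's entry, while flushing category:electronics clears whatever in the category is still cached.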

The Hardest Case: Distributed Cache Invalidation

When you run multiple application servers, each may have its own local in-process cache (L1). A write on server 1 invalidates server 1's local cache but leaves servers 2, 3, and 4 with stale data. Solutions include:

Skip L1, use only a shared L2 (Redis): All servers share one cache. Simpler, but adds network RTT to every cache hit.
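Invalidation broadcast: On write, publish an invalidation message to a Pub/Sub channel all servers subscribe to. Each server evicts its local copy. The "Scaling Memcache at Facebook" paper describes a related design at much larger scale, in which a daemon tails the database commit log and fans out deletes to the memcached servers.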
Short TTL on L1: Accept that local caches can be stale for up to N seconds. Simple, and often sufficient for read-heavy, eventual-consistency-tolerant data.
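
The broadcast option can be sketched with an in-process channel standing in for a real Pub/Sub system. The classes InvalidationChannel and AppServer are hypothetical, delivery here is synchronous and lossless, and the shared dictionary plays the role of the L2 cache or origin; a production setup would also keep a short L1 TTL as a backstop for dropped messages.

```python
from typing import Any, Callable, Dict, List


class InvalidationChannel:
    """Stand-in for a Pub/Sub channel that carries invalidated keys."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def publish(self, key: str) -> None:
        for listener in self._listeners:
            listener(key)


class AppServer:
    def __init__(self, shared_store: Dict[str, Any], channel: InvalidationChannel) -> None:
        self._shared = shared_store
        self._channel = channel
        self.local: Dict[str, Any] = {}  # this server's in-process L1 cache
        channel.subscribe(self._evict)

    def _evict(self, key: str) -> None:
        self.local.pop(key, None)

    def read(self, key: str) -> Any:
        if key not in self.local:
            self.local[key] = self._shared[key]  # L1 miss: fall through to the shared store
        return self.local[key]

    def write(self, key: str, value: Any) -> None:
        self._shared[key] = value
        self._channel.publish(key)  # every server, including this one, evicts its copy


shared = {"feature_flag": "off"}
channel = InvalidationChannel()
servers = [AppServer(shared, channel) for _ in range(4)]

before = [s.read("feature_flag") for s in servers]  # all four L1 caches are now warm
servers[0].write("feature_flag", "on")              # write lands on server 1
after = [s.read("feature_flag") for s in servers]   # every server reloads the new value
```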

Choosing a Strategy

In practice most systems layer multiple strategies: TTL as a safety net (data is never stale forever), combined with explicit invalidation on write (data is usually fresh immediately). Event-driven invalidation and cache tags add precision for complex object graphs. The right choice depends on your consistency requirements, write frequency, and operational complexity budget.

Summary of trade-offs:

TTL only: Simple, but entries can be stale for up to the full TTL no matter when writes occur.
Delete-on-write: Fresh after writes, but couples writers to the cache and is exposed to race conditions under high concurrency.
Write-through: No miss after a write, but carries dual-write failure risk.
CDC / event-driven: Decoupled and scalable, but operationally complex.
Cache tags: Powerful for object graphs, but requires tag discipline and a supporting store.