Cache Invalidation
Phil Karlton's famous quip, "There are only two hard things in Computer Science: cache invalidation and naming things," is funny precisely because it is true. Serving stale data can be just as catastrophic as serving no data at all. A product page showing yesterday's price for an item that has since sold out, a permissions check returning a cached "allowed" after the user was banned, a DNS record still pointing at a decommissioned server: all are cache invalidation failures in the wild.
Cache invalidation is the discipline of deciding when and how to remove or replace cached entries so that clients never observe data that is more out-of-date than the system can tolerate. It is genuinely hard because it requires coordinating two or more data stores (the cache and the source of truth) in the face of concurrent writes, partial failures, and network partitions.
The Root Cause: Dual-State Problem
Once you have a cache, the same logical record exists in two places: the origin store (database, object storage, microservice) and the cache. Every write to the origin creates a window of inconsistency. How long that window lasts — and whether it matters — is the central question of invalidation strategy.
Strategy 1: TTL-Based Expiry (Passive Invalidation)
The simplest approach is to let every cache entry carry a Time-To-Live. When the TTL expires, the next read triggers a cache miss and the fresh value is fetched. No explicit coordination is needed between the writer and the cache.
- Pros: Zero coupling between writer and cache; trivially simple to implement; built into Redis, Memcached, HTTP headers (Cache-Control: max-age), and every CDN.
- Cons: Staleness is bounded by the TTL, not by the actual write. A TTL of 60 seconds means every cached value can be up to 60 seconds wrong. Choosing the right TTL requires knowing the write frequency and the acceptable staleness, which is often domain knowledge that is hard to quantify.
A practical rule: set the TTL to roughly half the expected change interval. A product catalog updated twice a day can tolerate a TTL of 3–6 hours. A stock ticker should use a TTL of 1–2 seconds at most, or skip TTL-only caching entirely.
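When many entries are written at the same moment, a fixed TTL makes them all expire together. Add random jitter to each TTL (base_ttl + rand(0, jitter)) so expirations spread out over time.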
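The TTL and jitter ideas above can be sketched in a few lines. This is a minimal in-memory illustration, not a production cache: the names TTLCache and get_or_load are hypothetical, the clock is injectable only so expiry can be exercised without sleeping, and a cached value of None is treated as a miss.

```python
import random
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after base_ttl plus random jitter."""

    def __init__(self, base_ttl: float, jitter: float = 0.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.base_ttl = base_ttl
        self.jitter = jitter
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)

    def set(self, key: str, value: Any) -> None:
        # Each entry lives base_ttl + rand(0, jitter) seconds, so entries
        # written together do not all expire together.
        ttl = self.base_ttl + random.uniform(0, self.jitter)
        self._entries[key] = (value, self.clock() + ttl)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Passive invalidation: the expired entry is dropped on read.
            del self._entries[key]
            return None
        return value


def get_or_load(cache: TTLCache, key: str, load: Callable[[str], Any]) -> Any:
    """Read through the cache; on a miss, fetch from the origin and cache the result."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value)
    return value
```

With base_ttl=60 and jitter=10, an entry is served from the cache for at least 60 seconds and is guaranteed to be reloaded after 70, regardless of when the origin actually changed.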
Strategy 2: Write-Through Invalidation (Active / Eager)
The writer is responsible for keeping the cache consistent. Whenever it writes to the origin, it immediately either deletes the cache entry (invalidation) or updates it in place (write-through). The next read will either be a cache miss (and refresh from origin) or will find the freshly written value.
- Delete-on-write (invalidate): Simpler and safer. The writer calls cache.del(key) after persisting to the DB. The next reader pays a cache miss and reloads the current value from the origin. This is the most common pattern in application-layer caches.
- Update-on-write (write-through): The writer both persists to the DB and sets the cache to the new value. This eliminates the miss but adds complexity: if the DB write succeeds and the cache write fails, the cache keeps serving the old value until it is evicted or expires.
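The two write paths can be compared side by side in a small sketch. Everything here is illustrative: the Repository class is hypothetical, and plain dicts stand in for the database and the cache, so the failure modes of a networked cache client are not modelled.

```python
from typing import Dict, Optional


class Repository:
    """Hypothetical data-access layer; dicts stand in for the DB and the cache."""

    def __init__(self) -> None:
        self.db: Dict[str, str] = {}
        self.cache: Dict[str, str] = {}

    def read(self, key: str) -> Optional[str]:
        # Cache-aside read: serve from the cache, fall back to the origin on a miss.
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write_with_delete(self, key: str, value: str) -> None:
        # Delete-on-write: persist first, then evict. The next read repopulates.
        self.db[key] = value
        self.cache.pop(key, None)

    def write_through(self, key: str, value: str) -> None:
        # Update-on-write: persist, then overwrite the cached value.
        # With a real cache client, a failure between these two steps
        # would leave the old value in the cache.
        self.db[key] = value
        self.cache[key] = value
```

In both methods the origin is written before the cache is touched, so a failed database write never leaves a new value in the cache that the origin does not have.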
Strategy 3: Event-Driven Invalidation (CDC / Pub-Sub)
Rather than coupling every writer to cache-clearing logic, you can use a change-data-capture (CDC) pipeline or a message bus. When the database commits a row change, an event is emitted (via Kafka, Redis Pub/Sub, or a CDC tool like Debezium) and all cache nodes subscribed to that key pattern clear or refresh their entries.
This is how large platforms keep caches consistent across many microservices without every service knowing about every cache. Companies such as Netflix, Airbnb, and Shopify have publicly described running CDC pipelines at scale.
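The shape of event-driven invalidation can be shown with an in-process stand-in for the message bus. This is a sketch under stated assumptions: ChangeEvent, EventBus, and CacheNode are hypothetical names, delivery is synchronous, and the "table:primary_key" cache key format is an illustrative convention rather than anything Kafka or Debezium prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record, e.g. one committed row update."""
    table: str
    primary_key: str


class EventBus:
    """In-process stand-in for a message bus; delivers events synchronously."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[ChangeEvent], None]]] = defaultdict(list)

    def subscribe(self, table: str, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers[table].append(handler)

    def publish(self, event: ChangeEvent) -> None:
        # Only handlers subscribed to the event's table are notified.
        for handler in self._subscribers.get(event.table, []):
            handler(event)


class CacheNode:
    """A cache that evicts entries when change events arrive, not when writers call it."""

    def __init__(self, bus: EventBus, table: str) -> None:
        self.entries: Dict[str, str] = {}
        bus.subscribe(table, self.on_change)

    def on_change(self, event: ChangeEvent) -> None:
        # Evict rather than refresh; the next read reloads from the origin.
        self.entries.pop(f"{event.table}:{event.primary_key}", None)
```

The writer never references a cache here. It only commits, and whatever emits the ChangeEvent is responsible for reaching every subscribed node.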
Strategy 4: Cache Tags (Dependency-Based Invalidation)
Several application frameworks and caching layers (Laravel, Symfony, Varnish) support cache tags: you attach logical group labels to entries when you write them, and you can invalidate all entries sharing a tag with a single call.
This is powerful for object graphs: a cached rendered page might depend on a product, its category, its author profile, and the current promotional banner. Tagging the page entry with all four means any change to any of those objects automatically invalidates the page — no manual bookkeeping.
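A tag index is essentially a reverse mapping from tag to keys. The sketch below is a minimal in-memory version with hypothetical names (TaggedCache, invalidate_tag); it does not reproduce the API of Laravel, Symfony, or Varnish.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, Iterable, Optional, Set


class TaggedCache:
    """In-memory cache that can evict every entry carrying a given tag."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._keys_by_tag: DefaultDict[str, Set[str]] = defaultdict(set)
        self._tags_by_key: Dict[str, Set[str]] = {}

    def set(self, key: str, value: Any, tags: Iterable[str] = ()) -> None:
        self._remove(key)  # drop old tag links if the key is being overwritten
        self._values[key] = value
        self._tags_by_key[key] = set(tags)
        for tag in self._tags_by_key[key]:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._values.get(key)

    def invalidate_tag(self, tag: str) -> int:
        """Evict all entries with this tag; return how many were removed."""
        keys = list(self._keys_by_tag.get(tag, ()))  # copy: _remove mutates the set
        for key in keys:
            self._remove(key)
        return len(keys)

    def _remove(self, key: str) -> None:
        self._values.pop(key, None)
        for tag in self._tags_by_key.pop(key, set()):
            keys = self._keys_by_tag.get(tag)
            if keys is not None:
                keys.discard(key)
                if not keys:
                    del self._keys_by_tag[tag]
```

A rendered page stored with tags such as "product:42" and "category:9" is evicted by invalidate_tag("product:42") without the caller knowing the page's cache key.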
The Hardest Case: Distributed Cache Invalidation
When you run multiple application servers, each may have its own local in-process cache (L1). A write on server 1 invalidates server 1's local cache but leaves servers 2, 3, and 4 with stale data. Solutions include:
- Skip L1, use only a shared L2 (Redis): All servers share one cache. Simpler, but adds network RTT to every cache hit.
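- Invalidation broadcast: On write, publish an invalidation message to a Pub/Sub channel all servers subscribe to. Each server evicts its local copy. Facebook's "Scaling Memcache at Facebook" paper describes a related mechanism, in which invalidations derived from the database commit log are fanned out to cache servers.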
- Short TTL on L1: Accept that local caches can be stale for up to N seconds. Simple, and often sufficient for read-heavy, eventual-consistency-tolerant data.
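The trade-off between a short L1 TTL and an invalidation broadcast can be seen in a small simulation. All names here (AppServer, write) are hypothetical, a shared dict stands in for the L2 cache, and the broadcast is a direct loop over servers rather than a real Pub/Sub channel.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class AppServer:
    """One server with a short-lived local cache (L1) in front of a shared cache (L2)."""

    def __init__(self, shared: Dict[str, str], l1_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.shared = shared
        self.l1_ttl = l1_ttl
        self.clock = clock  # injectable so staleness can be tested without sleeping
        self.local: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        hit = self.local.get(key)
        if hit is not None and self.clock() < hit[1]:
            return hit[0]  # L1 hit: no network round trip, possibly stale
        value = self.shared.get(key)
        if value is not None:
            self.local[key] = (value, self.clock() + self.l1_ttl)
        else:
            self.local.pop(key, None)
        return value

    def evict_local(self, key: str) -> None:
        self.local.pop(key, None)


def write(shared: Dict[str, str], servers: List[AppServer],
          key: str, value: str, broadcast: bool) -> None:
    """Update the shared cache and optionally tell every server to drop its L1 copy."""
    shared[key] = value
    if broadcast:
        for server in servers:
            server.evict_local(key)
```

Without the broadcast, a server can keep returning the old value for up to l1_ttl seconds after the write. With it, the next read on every server goes back to the shared cache.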
Choosing a Strategy
In practice most systems layer multiple strategies: TTL as a safety net (data is never stale forever), combined with explicit invalidation on write (data is usually fresh immediately). Event-driven invalidation and cache tags add precision for complex object graphs. The right choice depends on your consistency requirements, write frequency, and operational complexity budget.
- TTL only — Simple, but bounded staleness regardless of actual writes.
- Delete-on-write — Fresh after writes, but adds coupling; race conditions under high concurrency.
- Write-through — No miss after write, but dual-write failure risk.
- CDC / event-driven — Decoupled and scalable, but operationally complex.
- Cache tags — Powerful for object graphs, but requires tag discipline and a supporting store.