Operating Redis: Fundamentals


Redis is a single-threaded, in-memory data structure server. That sentence contains three load-bearing phrases that shape every operational decision you will make. Single-threaded means commands execute on one thread (newer versions can offload network I/O to extra threads, but command execution stays serial), so one saturated CPU core equals a fully blocked server; you cannot throw more cores at a Redis bottleneck the way you can with a stateless service. In-memory means your working set must fit in RAM; the moment it does not, Redis either evicts data or rejects writes depending on your policy, and with no maxmemory limit at all it risks being killed by the operating system's OOM killer. Data structure server means it is not merely a key-value cache: it natively supports strings, hashes, lists, sets, sorted sets, streams, bitmaps, and HyperLogLog, each with different memory footprints and time complexities that matter at scale.

Before you configure anything, you must decide: is this instance a cache or a store? The answer determines every subsequent choice — persistence, memory policy, replication topology, and alerting thresholds.

Cache vs. Store: The Operational Divide

A cache is a read-acceleration layer in front of a source of truth (PostgreSQL, object storage, an upstream API). Data loss is acceptable — a cache miss hits the origin. This means:

  • Persistence can be disabled entirely, saving CPU and I/O.
  • An aggressive eviction policy (allkeys-lru) is correct — Redis must free memory without blocking the application.
  • RTO approaches zero: on restart the cache is empty and warms automatically as requests miss and repopulate it.
  • Replication is a performance concern, not a durability one.

A store (sometimes called a primary data store or a persistent Redis) holds authoritative state: session tokens, rate-limit counters, distributed locks, queued jobs, leaderboards, real-time analytics. Data loss is a bug, not a cache miss. This means:

  • Persistence is mandatory — at minimum AOF with appendfsync everysec, ideally AOF + RDB for both fast recovery and point-in-time snapshots.
  • Eviction policy is noeviction — the application should receive OOM command not allowed and handle it explicitly rather than silently lose data.
  • Replication is a durability concern; always run at least one replica with min-replicas-to-write 1.

Production pitfall: the mixed-use instance. Teams frequently start with a single Redis for both caching and session storage to reduce cost. This is a reliability trap. When memory pressure triggers eviction, Redis cannot distinguish between a throwaway CDN cache entry and a user session; under allkeys-lru both are candidates for eviction. Run separate instances with different policies. Separate logical databases (SELECT n) do not help here, because maxmemory and maxmemory-policy apply to the whole instance. At big-tech scale, separate instances also isolate noisy-neighbor effects: a bulk cache invalidation storm will not cause session read latency spikes.
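
The cache-versus-store split above can be captured as two small configuration profiles. The following is a minimal sketch, not an official tool: the function names `recommended_settings` and `render_conf` are hypothetical, and the directive values simply restate the recommendations in this section.

```python
def recommended_settings(role: str) -> dict[str, str]:
    """Return illustrative redis.conf directives for a 'cache' or a 'store'.

    The values mirror the lesson's recommendations; they are a starting
    point to review, not a complete production configuration.
    """
    if role == "cache":
        return {
            "save": "",                 # an empty save directive disables RDB snapshots
            "appendonly": "no",         # no AOF: a cache can be rebuilt from the origin
            "maxmemory-policy": "allkeys-lru",
        }
    if role == "store":
        return {
            "appendonly": "yes",
            "appendfsync": "everysec",
            "maxmemory-policy": "noeviction",
            "min-replicas-to-write": "1",
        }
    raise ValueError(f"unknown role: {role!r}")


def render_conf(settings: dict[str, str]) -> str:
    """Render the settings as redis.conf lines (empty values are quoted)."""
    lines = []
    for name, value in settings.items():
        lines.append(f'{name} ""' if value == "" else f"{name} {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    for role in ("cache", "store"):
        print(f"# profile: {role}")
        print(render_conf(recommended_settings(role)))
        print()
```

Keeping the two profiles in separate files, one per instance, makes it harder for a cache policy to leak into a store by accident.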

Persistence: RDB Snapshots

RDB (Redis Database) persistence writes a compact binary snapshot of the entire dataset to disk at configured intervals. The snapshot is created by a fork() — the parent keeps serving requests while the child writes the .rdb file using copy-on-write (CoW). On a 50 GB instance with a write-heavy workload, CoW can double RSS momentarily; ensure your OS overcommit setting (vm.overcommit_memory=1) is configured and you have headroom.

# redis.conf: RDB snapshot schedule
# Syntax: save <seconds> <changes>
# Comments must sit on their own lines; redis.conf does not allow trailing comments.

# Snapshot if at least 1 key changed in 1 hour
save 3600 1
# Snapshot if at least 100 keys changed in 5 minutes
save 300 100
# Snapshot if at least 10,000 keys changed in 1 minute
save 60 10000

dir /var/lib/redis
dbfilename dump.rdb

# Refuse writes after a failed background save, so the failure is not silent
stop-writes-on-bgsave-error yes

# LZF compression (CPU trade-off; worthwhile on large datasets)
rdbcompression yes

# CRC64 integrity check at load time
rdbchecksum yes

RDB is optimal for cache-only instances and disaster-recovery archives — BGSAVE produces a single portable file you can copy offsite with rsync or push to object storage. The trade-off is granularity: if Redis crashes between snapshots you lose everything written in that window. On a pure cache that is acceptable; on a store it is not.
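
To make the `save <seconds> <changes>` semantics concrete, here is a small sketch that evaluates a set of save rules against the time and change count since the last snapshot. `SaveRule` and `should_snapshot` are hypothetical names used only for illustration; Redis performs this check internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaveRule:
    seconds: int   # minimum time since the last successful snapshot
    changes: int   # minimum number of changed keys in that time


def should_snapshot(rules: list[SaveRule], seconds_since_save: int, changes_since_save: int) -> bool:
    """A snapshot is due as soon as ANY rule has both of its thresholds met."""
    return any(
        seconds_since_save >= rule.seconds and changes_since_save >= rule.changes
        for rule in rules
    )


# The three rules from the example configuration.
RULES = [SaveRule(3600, 1), SaveRule(300, 100), SaveRule(60, 10000)]

if __name__ == "__main__":
    # 500 changes after 2 minutes: too few for the 60 s rule, too early for the 300 s rule.
    print(should_snapshot(RULES, seconds_since_save=120, changes_since_save=500))
    # The same 500 changes once 5 minutes have passed satisfy the 300 s rule.
    print(should_snapshot(RULES, seconds_since_save=300, changes_since_save=500))
```

The first case shows the granularity trade-off: those 500 writes sit unsnapshotted until a rule fires, and a crash in that gap loses them.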

Persistence: AOF (Append-Only File)

AOF writes every write command to an append-only log. On restart, Redis replays the log to reconstruct state. The critical knob is appendfsync, which governs the durability-vs-latency trade-off at the OS level:

  • always: fsync() after every write. Durable to single-command granularity, but every write waits on the disk, typically adding ~1–2 ms per write. Rarely justified; use a relational database if you need this.
  • everysec: fsync() once per second in a background thread. Production default for stores. Maximum data loss: ~1 second of writes.
  • no: the OS decides when to flush (on Linux, typically within about 30 seconds). Highest throughput, worst durability; the loss window is closer to that of frequent RDB snapshots than to the other AOF modes.

# redis.conf: AOF configuration
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec

# Skip fsync while a BGSAVE or BGREWRITEAOF child is running, which avoids
# latency spikes from disk contention. Trade-off: a crash in that window can
# lose up to about 30 seconds of writes with default Linux settings.
# Set to "no" on stores that cannot tolerate this.
no-appendfsync-on-rewrite yes

# Auto-rewrite when the AOF grows 100% above its size after the last rewrite
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# RDB-format preamble + incremental AOF = dramatically faster restarts
aof-use-rdb-preamble yes

RDB + AOF together (the recommended store configuration): AOF provides high-frequency durability; RDB provides a compact recovery baseline and enables faster restarts via aof-use-rdb-preamble yes. On a 100 GB instance, replaying a pure AOF from epoch could take 20+ minutes; with the preamble, Redis loads the bulk of the dataset in compact binary form and replays only the commands appended since the last rewrite, which shortens restarts substantially.
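
The append, replay, and rewrite cycle described above can be illustrated with a toy model. This is a sketch under loud assumptions: it supports only three commands, stores the log as Python tuples rather than Redis's on-disk format, and the names `apply_command`, `replay`, and `rewrite` are hypothetical.

```python
Command = tuple[str, ...]


def apply_command(state: dict[str, str], command: Command) -> None:
    """Apply one logged write command to the in-memory state."""
    op, *args = command
    if op == "SET":
        key, value = args
        state[key] = value
    elif op == "DEL":
        state.pop(args[0], None)
    elif op == "INCR":
        key = args[0]
        state[key] = str(int(state.get(key, "0")) + 1)
    else:
        raise ValueError(f"unsupported command: {op}")


def replay(log: list[Command]) -> dict[str, str]:
    """Rebuild state from scratch by replaying every command, as on restart."""
    state: dict[str, str] = {}
    for command in log:
        apply_command(state, command)
    return state


def rewrite(state: dict[str, str]) -> list[Command]:
    """Compact the log: one SET per live key reproduces the same state."""
    return [("SET", key, value) for key, value in state.items()]


if __name__ == "__main__":
    log = [("SET", "a", "1"), ("INCR", "a"), ("INCR", "a"), ("SET", "b", "x"), ("DEL", "b")]
    state = replay(log)
    compact = rewrite(state)
    print(state, len(log), "->", len(compact), "commands")
```

The compacted log is shorter but replays to the same state, which is the reason a rewrite speeds up recovery.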

Memory Policies

When Redis reaches maxmemory, it must decide what to do. This is configured via maxmemory-policy. Choosing the wrong policy is one of the most common causes of silent data loss in production.

Figure: Redis maxmemory-policy decision tree. Choosing the right maxmemory-policy based on whether Redis is used as a cache or an authoritative store.

  • maxmemory reached: is this a cache or a store?
  • Cache: allkeys-lru (evict any key by LRU order).
  • Store: noeviction (return an OOM error to the client).
  • Other cache variants: allkeys-lfu (evict by frequency), volatile-lru (evict only keys with a TTL), allkeys-random (uniform random eviction).
  • Alert on used_memory: alert at 75% of maxmemory; page at 90%. Scale before OOM, not after.

The full policy matrix:

  • noeviction: no keys are removed; write commands return an error. Correct for stores.
  • allkeys-lru: evict the least recently used key across the whole keyspace. Correct for pure caches.
  • allkeys-lfu: evict the least frequently used key (Redis 4+). Superior to LRU for skewed access patterns (hot-key workloads). Under LRU, a key touched once a moment ago outranks a hot key that has been idle slightly longer, so a one-off scan can flush the hot set; LFU ranks by access frequency and keeps the hot keys.
  • volatile-lru / volatile-lfu / volatile-random / volatile-ttl: evict only keys that have an expiry set (volatile-ttl picks the shortest remaining TTL first). Use when Redis holds both persistent and ephemeral keys and you want to protect the persistent ones. Fragile: if all keys have TTLs, behavior degrades to allkeys-*; if none do, Redis behaves like noeviction and rejects writes.
  • allkeys-random: evict a random key. Rarely correct; only for uniform access patterns where LRU offers no benefit.

# redis.conf: memory configuration
maxmemory 12gb
# Change to noeviction for stores.
maxmemory-policy allkeys-lru

# LRU/LFU sampling accuracy (higher = more CPU, more accurate).
# Default is 5; 10 is a good production value.
maxmemory-samples 10

# LFU decay period in minutes. Controls how quickly frequency counts age.
lfu-decay-time 1
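
The difference between LRU and LFU under a one-off scan is easiest to see in a small simulation. The sketch below uses exact LRU and exact LFU over a tiny keyspace, which is a simplification: Redis approximates both by sampling maxmemory-samples keys and uses a decaying probabilistic counter for LFU. The function name `simulate` is hypothetical.

```python
def simulate(policy: str, capacity: int, accesses: list[str]) -> set[str]:
    """Replay a sequence of key accesses and return the keys still cached."""
    last_used: dict[str, int] = {}   # key -> tick of most recent access
    hits: dict[str, int] = {}        # key -> number of accesses while cached
    for tick, key in enumerate(accesses):
        if key not in last_used and len(last_used) >= capacity:
            if policy == "lru":
                victim = min(last_used, key=lambda k: last_used[k])
            elif policy == "lfu":
                # Lowest frequency first; break ties by least recent use.
                victim = min(last_used, key=lambda k: (hits[k], last_used[k]))
            else:
                raise ValueError(f"unknown policy: {policy!r}")
            del last_used[victim]
            del hits[victim]
        last_used[key] = tick
        hits[key] = hits.get(key, 0) + 1
    return set(last_used)


if __name__ == "__main__":
    # One hot key read five times, then a scan touching three cold keys once each.
    workload = ["hot"] * 5 + ["scan:1", "scan:2", "scan:3"]
    print("lru keeps:", sorted(simulate("lru", 3, workload)))
    print("lfu keeps:", sorted(simulate("lfu", 3, workload)))
```

With a capacity of three, LRU evicts the hot key to make room for the last scanned key, while LFU evicts one of the scanned keys instead.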

Key Operational Commands

Every Redis operator must know these commands cold:

  • INFO memory: compare used_memory_rss with used_memory; their ratio (reported as mem_fragmentation_ratio) spikes after mass deletes.
  • INFO stats: evicted_keys counter; any non-zero value on a store is a P0 alert.
  • INFO persistence: rdb_last_bgsave_status and aof_last_write_status; failures here are often silent.
  • MEMORY USAGE <key>: per-key memory in bytes including metadata overhead.
  • OBJECT FREQ <key>: LFU access counter (requires an LFU policy, allkeys-lfu or volatile-lfu).
  • DEBUG SLEEP 0: verify the event loop is not blocked (returns immediately if healthy). Redis 7 and later disable DEBUG by default (enable-debug-command); PING or redis-cli --latency gives the same signal.
  • SLOWLOG GET 20: last 20 commands exceeding slowlog-log-slower-than (default 10 ms, configured as 10000 microseconds); KEYS * appearing here means someone is running an O(N) scan on production.
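
The INFO checks above lend themselves to automation. The following sketch parses raw INFO text (lines of `field:value` grouped under `# Section` headers) and flags the conditions named in the list. The function names are hypothetical, the sample values are invented for illustration, and the 1.5 fragmentation threshold is an assumption rather than a Redis default.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the text returned by INFO into a flat field -> value mapping."""
    fields: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):   # skip blanks and section headers
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def health_findings(fields: dict[str, str], is_store: bool, max_fragmentation: float = 1.5) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    used = int(fields.get("used_memory", 0))
    rss = int(fields.get("used_memory_rss", 0))
    if used > 0 and rss / used > max_fragmentation:
        findings.append(f"fragmentation ratio {rss / used:.2f} exceeds {max_fragmentation}")
    if is_store and int(fields.get("evicted_keys", 0)) > 0:
        findings.append("evicted_keys is non-zero on a store")
    for status_field in ("rdb_last_bgsave_status", "aof_last_write_status"):
        if fields.get(status_field, "ok") != "ok":
            findings.append(f"{status_field} is {fields[status_field]}")
    return findings


# Invented sample output, trimmed to the fields used above.
SAMPLE = (
    "# Memory\r\nused_memory:1000\r\nused_memory_rss:2000\r\n"
    "# Stats\r\nevicted_keys:3\r\n"
    "# Persistence\r\nrdb_last_bgsave_status:ok\r\naof_last_write_status:err\r\n"
)

if __name__ == "__main__":
    for finding in health_findings(parse_info(SAMPLE), is_store=True):
        print(finding)
```
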
Never run KEYS * on a production instance. It is O(N) and blocks the event loop for the entire duration of the scan. On a 10-million-key instance this can cause multi-second pauses. Always use SCAN with a cursor and a reasonable COUNT hint (e.g., SCAN 0 MATCH session:* COUNT 200). Similarly, SMEMBERS on a large set and HGETALL on a large hash block the loop — use SSCAN and HSCAN instead.

The maxmemory Sizing Rule

A common mistake is setting maxmemory equal to the instance's total RAM. Redis itself uses memory beyond your data: the output buffers for each connected client can reach tens of MB under load, replica output buffers grow during replica lag on top of the fixed replication backlog (repl-backlog-size, default 1 MB), and AOF rewrite buffers spike during BGREWRITEAOF. A practical rule of thumb used at large companies: set maxmemory to 75% of available RAM and alert at 75% of that. This leaves headroom for fork CoW, replication buffers, and the OS page cache needed for AOF writes.
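
As a worked example of the sizing rule, the sketch below derives maxmemory and the alerting thresholds from total RAM. The function name `memory_budget` is hypothetical; the 75% figures come from the rule above and the 90% paging threshold from the decision-tree figure earlier in this lesson.

```python
GIB = 1024 ** 3


def memory_budget(total_ram_bytes: int, maxmemory_pct: int = 75,
                  alert_pct: int = 75, page_pct: int = 90) -> dict[str, int]:
    """Derive maxmemory and alert thresholds (in bytes) from total RAM.

    Integer percentages keep the arithmetic exact for large byte counts.
    """
    maxmemory = total_ram_bytes * maxmemory_pct // 100
    return {
        "maxmemory": maxmemory,
        "alert_at": maxmemory * alert_pct // 100,   # warn: start planning capacity
        "page_at": maxmemory * page_pct // 100,     # page: act before writes fail
    }


if __name__ == "__main__":
    budget = memory_budget(16 * GIB)
    for name, value in budget.items():
        print(f"{name}: {value / GIB:.2f} GiB")
```

For a 16 GiB host this yields a 12 GiB maxmemory, matching the `maxmemory 12gb` value in the memory configuration example, with the first alert at 9 GiB of used_memory.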

You now have the mental model and the configuration knobs to run Redis correctly in production. The next lesson covers High Availability — Sentinel and Cluster — which builds directly on these persistence and eviction fundamentals.