Caching & Messaging Infrastructure

Redis in Production

Running Redis in development is trivial. Running it at scale — where it serves millions of requests per second, backs session state for every authenticated user, and sits in the critical path of your checkout flow — is a different discipline entirely. This lesson covers the four pillars that separate a stable production Redis from one that pages you at 3 AM: eviction policy tuning, hot key and big key detection, latency spike diagnosis, and a monitoring stack that gives you actionable signal before users notice.

Eviction Policy Tuning

When Redis approaches its maxmemory limit, it must decide what to do. The answer is controlled by maxmemory-policy, and choosing the wrong one for your workload is one of the most common causes of subtle production bugs.

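Redis offers eight policies. Besides noeviction, the other seven combine two choices: how to rank eviction candidates (LRU, least recently used; LFU, least frequently used; random; or TTL, closest to expiry) and which key space to sample (all keys, or only keys that already have a TTL set). The TTL ranking exists only in the volatile form. The policies you are most likely to choose between: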

noeviction — reject writes once memory is full. Use for primary data you cannot afford to lose (e.g. a queue backed by Redis lists). Your application will get OOM command not allowed errors and must handle them gracefully.
allkeys-lru — evict the least-recently-used key across all keys. The right default for a cache where all keys are equally eligible for eviction.
volatile-lru — evict only from keys that have a TTL. Safe for mixed workloads where some keys are persistent (counters, rate limiters) and some are ephemeral (session tokens). Only the ephemeral keys get evicted.
allkeys-lfu — evict least-frequently-used keys. Superior to LRU for workloads with a high scan-once pattern (large datasets that are iterated periodically rather than hot-accessed). Redis 4.0+.
volatile-ttl — evict the key closest to its TTL expiry. Useful when your TTL already encodes business priority — keys expiring soonest are least valuable.
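
The difference between the allkeys and volatile scopes, and the role of sampling, can be sketched in a few lines of Python. This is a simplified model, not Redis's actual implementation (which keeps an internal pool of candidates); the function names and the dictionary standing in for the keyspace are assumptions made for illustration.

```python
import random

def pick_victim(last_access, samples, rng, ttl_keys=None):
    """Sample up to `samples` keys and return the least recently used one.

    ttl_keys=None models the allkeys-* scope; passing a set models volatile-*,
    where only keys that carry a TTL are eligible.
    """
    if ttl_keys is None:
        pool = sorted(last_access)
    else:
        pool = sorted(k for k in last_access if k in ttl_keys)
    if not pool:
        return None  # nothing eligible: a real server would reject the write
    candidates = rng.sample(pool, min(samples, len(pool)))
    return min(candidates, key=last_access.__getitem__)

def evict_down_to(last_access, max_keys, samples=5, ttl_keys=None, seed=0):
    """Evict keys (mutating last_access) until at most max_keys remain."""
    rng = random.Random(seed)
    evicted = []
    while len(last_access) > max_keys:
        victim = pick_victim(last_access, samples, rng, ttl_keys)
        if victim is None:
            break
        del last_access[victim]
        evicted.append(victim)
    return evicted
```

With a sample size as large as the keyspace the model degenerates to exact LRU; with a small sample it only approximates it, which is the trade-off maxmemory-samples controls.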

Pitfall: noeviction with unbounded growth. Operators sometimes set noeviction thinking it means "never delete data." What it actually means is "stop accepting writes when full." If your application does not catch the OOM error and retry with backoff, it will silently drop events, fail requests, or crash. If you truly need durability, use persistence (RDB + AOF) and size your instance so it never hits maxmemory, or use a queue-backed architecture.
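
A minimal sketch of "catch the OOM error and retry with backoff" might look like the following. RedisOOMError is a hypothetical stand-in for whatever exception your client library raises on an OOM reply, and write is any callable that performs the Redis write; both are assumptions for illustration.

```python
import time

class RedisOOMError(Exception):
    """Hypothetical stand-in for the client library's OOM reply error."""

def write_with_backoff(write, attempts=4, base_delay=0.05, sleep=time.sleep):
    """Call write(); on an OOM error, wait base_delay * 2**attempt and retry.

    The last failure is re-raised so the caller can shed load or alert
    instead of silently dropping the write.
    """
    for attempt in range(attempts):
        try:
            return write()
        except RedisOOMError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retrying only helps if memory is actually freed in the meantime (keys expiring, consumers draining a queue), so the final re-raise matters as much as the retries.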

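The LFU eviction policies rely on two configuration knobs: lfu-decay-time controls how quickly a key's frequency counter decays while the key sits idle, and lfu-log-factor controls how many accesses it takes to saturate the counter (increments are probabilistic, so the counter grows roughly logarithmically with the number of hits):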

# /etc/redis/redis.conf — eviction tuning
maxmemory 12gb
maxmemory-policy allkeys-lfu

# LFU decay: decrement a key's frequency counter once for every 10 minutes
# that elapse without an access (the stock redis.conf default is 1 minute)
# Lower value = faster forgetting = LFU behaves more like LRU
lfu-decay-time 10

# LFU log factor: how aggressively to increment the counter per access
# Higher = harder to reach max (255) = more differentiation between keys
# Default 10 is good for most workloads
lfu-log-factor 10

# How many keys Redis samples when looking for eviction candidates
# Higher = better approximation of true LRU/LFU, more CPU per eviction
# 10 is a good balance; 5 is the default
maxmemory-samples 10
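
To build intuition for lfu-log-factor, here is a small Python model of the probabilistic increment as described in the comments of the stock redis.conf: new keys start at a counter of 5, and each access increments the counter with probability 1 / ((counter - 5) * lfu-log-factor + 1), saturating at 255. The function names and the seeded random generator are assumptions for illustration, and decay is deliberately left out.

```python
import random

LFU_INIT_VAL = 5   # starting counter for a new key
LFU_MAX = 255      # the counter is stored in 8 bits

def lfu_log_incr(counter, log_factor, rng):
    """Return the counter after one access, using a probabilistic increment."""
    if counter >= LFU_MAX:
        return LFU_MAX
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < p else counter

def counter_after(hits, log_factor=10, seed=1):
    """Simulate `hits` accesses to a fresh key, with no decay in between."""
    rng = random.Random(seed)
    counter = LFU_INIT_VAL
    for _ in range(hits):
        counter = lfu_log_incr(counter, log_factor, rng)
    return counter

if __name__ == "__main__":
    for factor in (0, 1, 10, 100):
        print(factor, [counter_after(h, factor) for h in (100, 10_000, 1_000_000)])
```

Running the script prints one row per factor; higher factors need far more hits to approach 255, which is what keeps very hot and merely warm keys distinguishable.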

Sizing maxmemory: Set it to 75–80% of the instance's physical RAM. Leave headroom for Redis's internal metadata (each key has ~50–90 bytes of overhead regardless of value size), replication buffers, and OS page cache. On a 16 GB instance, set maxmemory 12gb. Monitor used_memory_rss in your metrics stack: if RSS approaches physical RAM, the kernel will start swapping Redis pages out (or the OOM killer will step in), and swapping destroys latency.
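
The sizing rule is simple arithmetic; a tiny helper makes it explicit. The function names and the 0.75 default are assumptions for illustration, and the "gb" suffix is interpreted as 1024^3 bytes, which is how Redis reads it.

```python
GIB = 1024 ** 3

def suggest_maxmemory(physical_ram_bytes, fraction=0.75):
    """Return a maxmemory value that leaves (1 - fraction) of RAM as headroom."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(physical_ram_bytes * fraction)

def as_config_line(maxmemory_bytes):
    """Format the value as a redis.conf directive, rounded down to whole GiB."""
    return f"maxmemory {maxmemory_bytes // GIB}gb"

if __name__ == "__main__":
    print(as_config_line(suggest_maxmemory(16 * GIB)))
```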

Hot Keys

A hot key is a single key receiving a disproportionate share of traffic — often thousands of times more requests per second than the average key. Because Redis is single-threaded (for command processing), a hot key can saturate CPU on a single shard, creating a queue of waiting commands and causing latency spikes across the entire instance even for keys that are not hot.

Classic hot key scenarios: a viral tweet's like counter, a sale item's inventory count, a feature flag that every API request checks, or a session key for a shared service account.

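redis-cli has shipped a --hotkeys mode since the Redis 4.0 series. It walks the keyspace with SCAN and reads each key's LFU counter, so it only works when an LFU maxmemory-policy is active. For instances running a non-LFU policy, or when you need a real-time view, sample the command stream with MONITOR instead: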

# Identify hot keys using the built-in LFU scanner (requires LFU policy active)
redis-cli --hotkeys -h 10.0.1.50 -p 6379 -a "$REDIS_PASSWORD"

# --bigkeys and --hotkeys both iterate with SCAN, so they are safe to run in prod
# Note: --bigkeys reports the largest key of each type; it does not measure traffic
redis-cli --bigkeys -h 10.0.1.50 -p 6379

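# For real-time hot-key detection without LFU: use the MONITOR command
# WARNING: MONITOR can cut throughput by ~50%. Prefer a replica, but note that a
# replica only shows replicated writes plus reads sent to that replica
# Use it briefly (seconds), never leave running
# MONITOR line format: <timestamp> [<db> <client>] "<command>" "<arg1>" ...
# so field 5 is the first argument (usually the key); field 4 is the command name
redis-cli --no-auth-warning -h 10.0.1.50 -p 6379 -a "$REDIS_PASSWORD" MONITOR | \
  head -n 50000 | \
  awk 'NF >= 5 {print $5}' | \
  sort | uniq -c | sort -rn | head -20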

# Check per-key access count directly (LFU must be active)
redis-cli OBJECT FREQ session:user:12345
# Returns: (integer) 148  — the logarithmic frequency counter
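
If you capture a MONITOR sample to a file, the counting step can also be done in Python, which handles quoted arguments more robustly than splitting on whitespace. This is a sketch that assumes the usual MONITOR line layout (timestamp, bracketed db and client address, then the quoted command and arguments); the function name and sample lines are made up for illustration.

```python
import shlex
from collections import Counter

def hot_keys(monitor_lines, top=20):
    """Count the first argument of each sampled command and return the top N.

    Caveat: the first argument is usually, but not always, a key
    (for example CONFIG GET has none), so treat the result as a hint.
    """
    counts = Counter()
    for line in monitor_lines:
        try:
            fields = shlex.split(line)
        except ValueError:
            continue  # unbalanced quotes in a truncated line
        # fields: timestamp, "[db", "client]", command, first argument, ...
        if len(fields) >= 5:
            counts[fields[4]] += 1
    return counts.most_common(top)

if __name__ == "__main__":
    sample = [
        'OK',
        '1700000000.000001 [0 10.0.0.7:51234] "GET" "flag:checkout"',
        '1700000000.000002 [0 10.0.0.8:51235] "GET" "flag:checkout"',
        '1700000000.000003 [0 10.0.0.7:51234] "INCR" "counter:likes"',
    ]
    print(hot_keys(sample, top=2))
```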

Mitigating hot keys at production scale requires a strategy, not just detection:

Client-side caching (Redis 6+): Use the client-tracking protocol so your application servers maintain a local in-memory copy. The Redis server sends invalidation messages when the key changes. This eliminates the network round-trip entirely for read-heavy hot keys.
Key sharding: For a counter-like hot key, split it into N shards (counter:foo:0 through counter:foo:N-1), increment a random shard on write, and sum all shards on read. Writes spread across N keys (and, in a cluster, across several nodes) at the cost of an N-key read.
Read replicas with client-side routing: Route hot-key reads to replicas. Redis Cluster does not do this automatically — you must implement it at the client or proxy layer (Envoy, Twemproxy, or a custom client pool).
Local in-process cache (L1): Cache the hot key in application memory with a short TTL (1–5 seconds). Stale for a second is acceptable for most read-mostly hot keys and eliminates Redis traffic entirely.
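
The key-sharding idea from the list above can be sketched as follows. To keep the example self-contained it uses a plain dictionary as a stand-in for Redis; in a real deployment incr would map to INCRBY on the chosen shard key and value to an MGET over all shard keys followed by a sum. The class name and parameters are assumptions for illustration.

```python
import random

class ShardedCounter:
    """Spread writes for one logical counter across several shard keys."""

    def __init__(self, store, name, shards=8, rng=None):
        self.store = store  # dict standing in for a Redis connection
        self.keys = [f"counter:{name}:{i}" for i in range(shards)]
        self.rng = rng or random.Random()

    def incr(self, amount=1):
        # Pick a random shard so no single key absorbs every write.
        key = self.rng.choice(self.keys)
        self.store[key] = self.store.get(key, 0) + amount

    def value(self):
        # Reads touch every shard; missing shards count as zero.
        return sum(self.store.get(key, 0) for key in self.keys)
```

The read is more expensive than a single GET, so this fits counters that are written far more often than they are read, or whose reads can themselves be cached for a second or two.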

Big Keys

A big key is a key whose value is large enough to cause operational problems: slow commands that block the event loop, large replication payload spikes, and slow RDB serialization. Redis's single-threaded model means a single DEL or LRANGE on a 10 MB value blocks every other client for tens of milliseconds.

The thresholds that matter at production scale: strings > 1 MB, lists/sets/hashes/sorted sets > 5,000 elements (or > 1 MB serialized). Anything beyond these is a candidate for redesign.

# Scan for big keys without blocking (uses SCAN cursor, safe in prod)
redis-cli --bigkeys -h 10.0.1.50 -p 6379

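# Get the approximate in-memory footprint of a key (key + value + overhead) in bytes
# For collections this is an estimate based on sampled elements (see the SAMPLES option)
redis-cli MEMORY USAGE mykey:12345
# Returns: (integer) 2097152  (2 MB — that is too big)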

# Check element count for collection types
redis-cli LLEN mylist
redis-cli HLEN myhash
redis-cli SCARD myset
redis-cli ZCARD myzset

# Safely delete a big key without blocking: use UNLINK (async, O(1))
# UNLINK reclaims memory in a background thread; DEL blocks
redis-cli UNLINK bigkey:session:all_events

# For big hashes/sets: delete incrementally via HSCAN/SSCAN + HDEL/SREM
redis-cli HSCAN bigkey:user:prefs 0 COUNT 100
# Process 100 elements at a time, then HDEL them in a pipeline
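
The incremental-delete loop can be sketched in Python. FakeHashStore below is a hypothetical in-memory stand-in whose hscan and hdel methods are shaped like the ones in common Redis clients (a cursor plus a batch of fields, with cursor 0 meaning the scan is complete); it exists only so the loop can run without a server.

```python
class FakeHashStore:
    """In-memory stand-in with HSCAN/HDEL-shaped methods (illustration only)."""

    def __init__(self, hashes):
        self.hashes = {name: dict(fields) for name, fields in hashes.items()}
        # Fixed scan order, so deleting fields never shifts cursor positions.
        self._order = {name: list(fields) for name, fields in hashes.items()}

    def hscan(self, name, cursor=0, count=100):
        order = self._order.get(name, [])
        live = self.hashes.get(name, {})
        batch = {f: live[f] for f in order[cursor:cursor + count] if f in live}
        next_cursor = cursor + count
        return (next_cursor if next_cursor < len(order) else 0), batch

    def hdel(self, name, *fields):
        live = self.hashes.get(name, {})
        removed = 0
        for field in fields:
            if field in live:
                del live[field]
                removed += 1
        return removed

def delete_hash_incrementally(client, name, batch_size=100):
    """Delete a big hash in small batches; returns (fields_deleted, round_trips)."""
    cursor, deleted, rounds = 0, 0, 0
    while True:
        cursor, batch = client.hscan(name, cursor=cursor, count=batch_size)
        if batch:
            deleted += client.hdel(name, *batch)  # iterating a dict yields its fields
        rounds += 1
        if cursor == 0:  # cursor 0 signals the scan has wrapped around
            return deleted, rounds
```

Each round blocks the server only for the cost of one small HSCAN and one small HDEL, which is the point of the technique. Against a real server, COUNT is a hint, so batches may be larger or smaller than requested.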

Always use UNLINK instead of DEL for any key larger than ~100 KB. DEL is synchronous and blocks the main thread while freeing memory. UNLINK (Redis 4.0+) does the same logical operation but defers the actual memory reclamation to a background thread. For very large keys (tens of MB), the difference can be 50–200 ms of blocking. In a production environment where your p99 SLO is 10 ms, a single DEL call can blow through your entire error budget.

Figure: The four primary operational concerns that affect the Redis single-threaded event loop, with the key metrics to monitor for each.

Latency Spike Diagnosis

Redis latency spikes come from a finite set of causes. Knowing the taxonomy lets you go from "Redis is slow" to root cause in minutes, not hours.

1. Slow commands. O(N) commands like KEYS *, SMEMBERS on a large set, SORT, and LRANGE with large ranges block the event loop. The slowlog captures them automatically:

# Redis slowlog — commands that exceeded the threshold (default: 10ms)
# Tune threshold lower in production: 1ms catches the long tail
redis-cli CONFIG SET slowlog-log-slower-than 1000   # microseconds (1ms)
redis-cli CONFIG SET slowlog-max-len 256

# Read the slowlog
redis-cli SLOWLOG GET 20
# Output: ID, timestamp, microseconds, [command, args...]

# Use the latency monitoring framework (Redis 2.8.13+)
redis-cli CONFIG SET latency-monitor-threshold 10   # ms
redis-cli LATENCY LATEST         # latest spike per event class
redis-cli LATENCY HISTORY command  # time-series for a specific event

# Reset latency samples (useful after fixing the root cause)
redis-cli LATENCY RESET
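
Once the slowlog has a few hundred entries, grouping them by command name usually points at the culprit faster than reading them one by one. The sketch below assumes each entry has already been converted to a dictionary with a duration in microseconds and the command as a decoded string; that shape, and the function name, are assumptions for illustration.

```python
from collections import defaultdict

def summarize_slowlog(entries):
    """Group slowlog entries by command name, sorted by total time spent."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        name = entry["command"].split()[0].upper()
        bucket = stats[name]
        bucket["count"] += 1
        bucket["total_us"] += entry["duration"]
        bucket["max_us"] = max(bucket["max_us"], entry["duration"])
    return sorted(stats.items(), key=lambda item: item[1]["total_us"], reverse=True)

if __name__ == "__main__":
    sample = [
        {"duration": 1500, "command": "SMEMBERS big:set"},
        {"duration": 12000, "command": "KEYS *"},
        {"duration": 2500, "command": "smembers other:set"},
    ]
    for name, info in summarize_slowlog(sample):
        print(name, info)
```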

2. Fork latency (RDB/AOF rewrite). When Redis forks for persistence, the kernel must copy the process page table, and the main thread is blocked while it does. With 4 KB pages, a 20 GB dataset needs roughly 40 MB of page table (20 GB / 4 KB × 8 bytes per entry), and on virtualized hosts the fork itself can take 200–500 ms. Symptoms: a regular latency spike at every save interval or AOF rewrite cycle; latest_fork_usec in INFO stats records how long the most recent fork took. Mitigation: use smaller Redis instances (keep datasets under 10 GB per shard), disable Transparent Huge Pages (the Redis documentation recommends echo never > /sys/kernel/mm/transparent_hugepage/enabled, because THP inflates copy-on-write cost after the fork), and prefer replicas for RDB saves (bgsave on a replica, not the primary).
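
Page table size can be estimated from the dataset size, which gives a feel for how fork cost scales with instance size. The calculation below assumes 4 KB pages and 8-byte page table entries, the usual values on x86-64 Linux; the function name is made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def page_table_bytes(dataset_bytes, page_size=4096, entry_bytes=8):
    """Estimate the page table size the kernel must copy on fork()."""
    pages = -(-dataset_bytes // page_size)  # ceiling division
    return pages * entry_bytes

if __name__ == "__main__":
    for size_gib in (10, 20, 24):
        table_mib = page_table_bytes(size_gib * GIB) // MIB
        print(f"{size_gib} GiB dataset -> {table_mib} MiB of page table")
```

The relationship is linear, which is why halving the per-shard dataset roughly halves the fork pause.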

3. Active memory defragmentation. Long-running Redis instances accumulate memory fragmentation. When mem_fragmentation_ratio exceeds 1.5, you are wasting significant RAM. Redis 4.0+ has online defragmentation, but it consumes CPU and can cause brief latency spikes during compaction windows. Monitor the ratio and tune aggressively only when fragmentation is confirmed:

# Check fragmentation ratio (used_memory_rss / used_memory)
redis-cli INFO memory | grep -E "used_memory:|mem_fragmentation_ratio"
# used_memory: 8589934592        (8 GB logical)
# mem_fragmentation_ratio: 1.62  (>1.5 means ~5 GB wasted on RSS)

# Enable active defragmentation (requires a jemalloc build; tune CPU budget carefully)
redis-cli CONFIG SET activedefrag yes
redis-cli CONFIG SET active-defrag-ignore-bytes 100mb  # start only when >100 MB fragmented
redis-cli CONFIG SET active-defrag-threshold-lower 10  # % fragmentation to start
redis-cli CONFIG SET active-defrag-threshold-upper 30  # % to go full speed
redis-cli CONFIG SET active-defrag-cycle-min 1    # minimum % of CPU spent on defrag
redis-cli CONFIG SET active-defrag-cycle-max 25   # maximum % of CPU spent on defrag
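
The fragmentation check is easy to automate from the text that INFO memory returns. This sketch computes the ratio directly from used_memory_rss and used_memory and reports how many bytes RSS exceeds the logical size by; the function names and the 1.5 threshold default are assumptions for illustration.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields

def fragmentation_report(info_text, threshold=1.5):
    """Return the RSS/used ratio, the excess bytes, and a defrag hint."""
    info = parse_info(info_text)
    used = int(info["used_memory"])
    rss = int(info["used_memory_rss"])
    ratio = rss / used
    return {
        "ratio": round(ratio, 2),
        "excess_bytes": rss - used,
        "defrag_candidate": ratio > threshold,
    }
```

A ratio well below 1.0 is a different problem: it usually means part of the dataset has been swapped out.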

4. Network and OS-level causes. When Redis's own instrumentation cannot explain the latency, compare redis-cli --latency (which measures round-trip time) against LATENCY HISTORY (which measures command execution). If round-trip latency is high but command latency is low, the cause is the network (NIC queues, TCP Nagle, high interrupt rate) or OS scheduling (VM steal time, NUMA imbalance). Run redis-cli --intrinsic-latency 30 on the server itself to establish an OS baseline.

Monitoring Redis in Production

A production Redis monitoring stack needs three layers: real-time metrics (Prometheus), structured alerting (Alertmanager rules on the metrics), and a capacity dashboard (Grafana). The redis_exporter by Oliver006 is the production standard — it exposes every INFO section as Prometheus metrics with no configuration beyond a Redis connection string.

# Deploy redis_exporter as a sidecar or DaemonSet annotation in Kubernetes
# Minimal Prometheus scrape config
- job_name: redis
  static_configs:
    - targets:
        - redis-primary:9121
        - redis-replica-0:9121
        - redis-replica-1:9121
  relabel_configs:
    - source_labels: [__address__]
      regex: "(.*):9121"
      target_label: instance

# Critical alerting rules (PrometheusRule CRD or rules file)
groups:
  - name: redis.production
    rules:
      # Memory pressure: >90% of maxmemory (ignored when maxmemory is 0, i.e. unlimited)
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.90 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis memory >90% on {{ $labels.instance }}"

      # Eviction rate spike (evicted_keys/s > 0 is a yellow flag; >100/s is red)
      - alert: RedisHighEvictionRate
        expr: rate(redis_evicted_keys_total[2m]) > 100
        for: 2m
        labels:
          severity: warning

      # Rejected connections (client queue full)
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        labels:
          severity: critical

      # Replication lag >10s, as reported by the primary for each connected replica
      - alert: RedisReplicationLag
        expr: redis_connected_slave_lag_seconds > 10
        for: 1m
        labels:
          severity: warning

Key metrics to track and their healthy ranges at production scale:

instantaneous_ops_per_sec — baseline varies; track the rate-of-change, not the absolute value. A sudden 5x spike is more informative than any threshold.
used_memory_rss — must stay below physical RAM. RSS growth without logical memory growth indicates fragmentation.
evicted_keys rate — zero is the goal for a cache with a correctly sized maxmemory. Sustained eviction under a stable load means the instance is undersized.
blocked_clients — clients waiting on BLPOP/BRPOP/BZPOPMIN. A non-zero value is expected for queue consumers; a growing value indicates producer/consumer imbalance.
rejected_connections — any increase is a severity-1 incident: Redis refused a client connection because the maxclients limit was hit. The counter is cumulative since startup, so alert on the delta rather than the absolute value.
rdb_last_bgsave_status / aof_last_rewrite_status — persistence failures are silent by default; without alerting on these, you can lose your persistence safety net unnoticed.
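
"Track the rate of change" for a metric like instantaneous_ops_per_sec can be as simple as comparing each sample with a rolling baseline. The sketch below flags any sample that is at least five times the median of the preceding window; the function name, window size, and factor are assumptions for illustration, and in practice the same logic would normally live in a PromQL expression.

```python
from statistics import median

def detect_spikes(samples, window=5, factor=5.0):
    """Return indexes of samples that are >= factor x the rolling median."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] >= factor * baseline:
            spikes.append(i)
    return spikes

if __name__ == "__main__":
    ops_per_sec = [100, 110, 90, 105, 95, 100, 600, 100]
    print(detect_spikes(ops_per_sec))
```

Using the median rather than the mean keeps a single earlier spike from inflating the baseline and masking the next one.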

Use the Grafana Dashboard ID 11835 (Redis Exporter Dashboard by Perfilov) as your starting point. It covers all the critical panels out of the box. At companies running Redis at hyperscale (Twitter, Uber, Airbnb), the monitoring stack also includes per-command latency percentiles via the commandstats section of INFO, hot-key dashboards built from LFU data, and automated runbooks triggered by Alertmanager webhooks that page on-call with a pre-populated Jupyter notebook showing the relevant metrics window.

The operational practices covered in this lesson (eviction policy selection, hot-key mitigation, UNLINK for big keys, slowlog analysis, and a Prometheus/Alertmanager monitoring stack) form the baseline competency expected of any team running Redis under a production SLA. Master these before reaching for more complex solutions like Redis Cluster or external proxy layers; most Redis incidents at scale are preventable with these primitives applied correctly.