Project: A Messaging Platform Runbook
A runbook is not documentation for its own sake — it is the executable knowledge that lets any on-call engineer, at 3 a.m., restore a platform to health without escalating. This lesson synthesizes everything covered in this tutorial into a single, production-grade runbook for a platform where Redis Cluster serves the session and rate-limit tier and Apache Kafka serves the event streaming tier. The patterns apply whether you are running on bare metal, on AWS MSK + ElastiCache, or on self-managed Kubernetes operators.
Platform Architecture Snapshot
Before you can operate a platform, every team member must share a mental model of its topology. The runbook begins with an authoritative architecture diagram, pinned in your internal wiki and reviewed on every significant change.
Section 1: High-Availability Design Decisions
Redis HA contract: For Sentinel-managed deployments, run a minimum of 3 Sentinel nodes (never 2: a failover must be authorized by a majority of Sentinels, and with 2 Sentinels the loss of either one leaves no majority). For Redis Cluster, use 3 shards with 1 replica each, spread across 3 AZs, with min-replicas-to-write 1 and min-replicas-max-lag 10 on every primary. Set cluster-require-full-coverage no so a lost shard degrades the cluster rather than bringing the whole thing down. Failover time under Sentinel is 30-60 s with default settings (down-after-milliseconds defaults to 30000); set sentinel down-after-milliseconds mymaster 5000 and sentinel failover-timeout mymaster 10000 for sub-30-second recovery in latency-sensitive environments, but accept that aggressive values increase false-positive failovers under network blips.
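The reason two Sentinels are not enough comes down to majority arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it models just the majority needed to authorize a failover, not the separately configured quorum used to mark a primary as objectively down.

```python
def majority(n: int) -> int:
    """Smallest number of votes that is more than half of n."""
    return n // 2 + 1


def can_authorize_failover(total_sentinels: int, reachable: int) -> bool:
    """True if the reachable Sentinels form a majority of all known Sentinels."""
    return reachable >= majority(total_sentinels)


if __name__ == "__main__":
    for total in (2, 3, 5):
        tolerated = total - majority(total)
        print(f"{total} Sentinels: majority={majority(total)}, "
              f"tolerates {tolerated} lost")
```

With 2 Sentinels the majority is 2, so losing either one blocks failover; with 3 the deployment survives the loss of one.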
Kafka HA contract: Use a replication factor of 3 (RF=3) for all topics, with min.insync.replicas=2 and producers set to acks=all. This tolerates the loss of one broker with no loss of acknowledged data and no producer errors. For the KRaft controller quorum, 3 dedicated controller nodes are the minimum for production: electing a new active controller requires a majority (2 of 3), so a simultaneous loss of 2 controllers stalls metadata operations. Keep controllers on separate hosts from brokers.
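The relationship between replication factor, min.insync.replicas, and tolerated broker loss can be sketched as simple arithmetic. This is a minimal illustration that assumes producers use acks=all; the function names are hypothetical and not part of any Kafka API.

```python
def tolerated_broker_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas a partition can lose while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


def can_produce(isr_size: int, min_insync_replicas: int) -> bool:
    """acks=all writes are accepted only while the ISR meets min.insync.replicas."""
    return isr_size >= min_insync_replicas


if __name__ == "__main__":
    print(tolerated_broker_losses(3, 2))  # RF=3, min.insync.replicas=2
    print(can_produce(isr_size=1, min_insync_replicas=2))
```

Setting min.insync.replicas equal to the replication factor would leave no headroom: any single broker loss would block acks=all producers.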
Section 2: Monitoring Stack and Golden Signals
The monitoring layer for this platform is built on Prometheus and Grafana, with redis_exporter (oliver006) and JMX exporter (or the native Kafka Prometheus reporter) feeding metrics. The following alert rules are non-negotiable for production:
Section 3: Incident Response Playbook
The following runbook steps are ordered for speed. Every section maps to a PagerDuty alert rule. The on-call engineer should be able to copy-paste these commands without looking anything up.
Redis Primary Failure (Sentinel-managed)
- Confirm the alert: run redis-cli -h sentinel1 -p 26379 SENTINEL masters and check the flags field; s_down means this Sentinel alone considers the primary down, and o_down means a quorum of Sentinels agrees.
- Check quorum: run redis-cli -h sentinel1 -p 26379 SENTINEL ckquorum mymaster; the reply must begin with OK.
- If Sentinel is healthy and failover has not yet started, trigger it manually: redis-cli -h sentinel1 -p 26379 SENTINEL failover mymaster
- Confirm the new primary: redis-cli -h sentinel1 -p 26379 SENTINEL get-master-addr-by-name mymaster
- Verify application reconnection: most Redis client libraries (ioredis, Jedis, Lettuce) handle Sentinel failover transparently within 1-3 s after Sentinel propagates the new primary address.
- File a post-incident ticket: root cause (OOM, kernel OOM-killer, hardware), replica reattachment status, and data loss window (compare master_repl_offset from INFO replication before and after).
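The data loss window in the last step can be estimated by comparing two INFO replication snapshots. The following is a rough sketch, assuming you captured the old primary's last known output (for example from monitoring) and the promoted replica's output immediately after failover, before it accepts many new writes. The helper names and sample values are hypothetical.

```python
def parse_info(text: str) -> dict:
    """Parse INFO output (key:value lines, '#' section headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replication_gap_bytes(old_primary_info: str, new_primary_info: str) -> int:
    """Bytes of replication stream the old primary wrote that the new primary lacks."""
    before = int(parse_info(old_primary_info)["master_repl_offset"])
    after = int(parse_info(new_primary_info)["master_repl_offset"])
    return max(0, before - after)


# Hypothetical snapshots, shaped like `redis-cli INFO replication` output.
OLD_PRIMARY = "# Replication\nrole:master\nconnected_slaves:1\nmaster_repl_offset:1048576\n"
NEW_PRIMARY = "# Replication\nrole:master\nconnected_slaves:0\nmaster_repl_offset:1040000\n"

if __name__ == "__main__":
    print(replication_gap_bytes(OLD_PRIMARY, NEW_PRIMARY))  # 8576
```

The result is a byte count of the replication stream, not a count of keys or commands, so treat it as an upper-bound indicator for the ticket rather than an exact loss figure.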
Kafka Broker Loss and Partition Re-election
Consumer Group Lag Spike (not caused by broker loss)
- Isolate the root cause: slow consumer processing, a producer throughput spike, or a poison-pill message causing repeated deserialization errors.
- Check consumer logs for DeserializationException or repeated rebalances. Repeated rebalances indicate consumers exceeding max.poll.interval.ms (default 5 min); increase it or reduce the batch size (max.poll.records).
- For a poison pill: identify the partition and offset from the consumer's error log, or from the stuck CURRENT-OFFSET that kafka-consumer-groups.sh --describe reports for the group (kafka-get-offsets.sh reports only a partition's earliest and latest offsets). Then use a one-off consumer with auto.offset.reset=none and a seek call to skip past it. Never discard the message; route it to a dead-letter topic.
- Scale consumers horizontally up to the number of partitions. Beyond that count, additional consumer instances sit idle and waste resources.
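The poison-pill handling described above can be sketched without a Kafka client. This is a minimal illustration, assuming JSON-encoded messages: records are plain (offset, bytes) pairs, and the dead-letter sink is an in-memory list standing in for a produce call to a dead-letter topic. All names are hypothetical.

```python
import json


def consume_with_dead_letter(records, handle, dead_letters):
    """Process (offset, raw_bytes) pairs; park undecodable records and keep going.

    Returns the next offset to commit, so the group advances past a poison pill
    instead of retrying it forever. Returns None if no records were given.
    """
    next_offset = None
    for offset, raw in records:
        try:
            event = json.loads(raw)
        except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
            # Stand-in for producing the raw bytes to a dead-letter topic.
            dead_letters.append({"offset": offset, "raw": raw, "error": str(exc)})
        else:
            handle(event)
        next_offset = offset + 1
    return next_offset


if __name__ == "__main__":
    batch = [(41, b'{"id": 1}'), (42, b"{not json"), (43, b'{"id": 2}')]
    handled, dlq = [], []
    print(consume_with_dead_letter(batch, handled.append, dlq))  # 44
    print([d["offset"] for d in dlq])  # [42]
```

Keeping the raw bytes and the error string alongside the offset preserves what is needed to replay or debug the message later.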
Section 4: Capacity and Scaling Decision Tree
The runbook includes a scaling decision tree that on-call engineers follow before spinning up new infrastructure. This prevents both under-provisioning (platform degradation) and over-provisioning (wasted spend).
Redis scaling triggers:
- Memory above 75% for 30 min: First audit key TTLs and the eviction policy. If growth is legitimate, add a shard: redis-cli --cluster add-node new_host:6379 existing_host:6379, followed by redis-cli --cluster rebalance existing_host:6379 --cluster-use-empty-masters (without that flag, rebalance skips the new, empty primary).
- CPU above 70% on the primary: Redis executes commands on a single thread, so scale horizontally (more shards), not vertically. Identify hot keys first with redis-cli --hotkeys or redis-cli OBJECT FREQ key (both require an LFU maxmemory-policy such as allkeys-lfu).
- Read latency above 5 ms p99: Route reads to replicas. In Redis Cluster, ensure the client issues READONLY on replica connections.
Kafka scaling triggers:
- Broker disk above 70%: Reduce log.retention.hours or increase the broker count. Reassign partitions with kafka-reassign-partitions.sh to spread data across the fleet.
- Network throughput above 80% of NIC capacity: Add brokers and rebalance partition leaders. For AWS MSK, resize the broker instance type.
- Producer latency above 50 ms: Check the producer's acks setting (request.required.acks in older clients). If acks=all, check the ISR: the leader waits for every in-sync replica, so one slow follower that is still in the ISR delays all acks. If the ISR has shrunk below min.insync.replicas, produce requests fail outright. Identify affected partitions with kafka-topics.sh --describe --under-replicated-partitions or --under-min-isr-partitions.
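The Redis and Kafka triggers in this section can be captured as a small data-driven lookup. This is a sketch under stated assumptions: the metric keys are hypothetical names rather than real exporter metrics, and the caller is expected to pass values that have already been sustained for the relevant window (for example, the 30-minute memory condition).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str       # hypothetical metric key supplied by the caller
    threshold: float  # trigger fires when the value is strictly above this
    action: str       # first step from the scaling decision tree


TRIGGERS = [
    Trigger("redis_memory_pct", 75, "audit TTLs and eviction policy, then add a shard"),
    Trigger("redis_primary_cpu_pct", 70, "find hot keys, then add shards"),
    Trigger("redis_read_p99_ms", 5, "route reads to replicas"),
    Trigger("kafka_disk_pct", 70, "reduce retention or add brokers and reassign partitions"),
    Trigger("kafka_nic_pct", 80, "add brokers and rebalance partition leaders"),
    Trigger("kafka_producer_latency_ms", 50, "check acks and ISR size"),
]


def scaling_actions(metrics: dict) -> list:
    """Return the action for every trigger whose metric exceeds its threshold."""
    return [t.action for t in TRIGGERS if metrics.get(t.metric, 0) > t.threshold]


if __name__ == "__main__":
    for action in scaling_actions({"redis_memory_pct": 82, "kafka_nic_pct": 85}):
        print(action)
```

Keeping the thresholds in one table makes it easier to review them alongside the alert rules when the runbook is updated.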
Section 5: Runbook Maintenance and Drill Schedule
A runbook that is never tested is a liability, not an asset. Integrate the following into your team engineering calendar:
- Monthly chaos drill: Kill the Redis primary in staging, verify that Sentinel promotes a replica correctly, and verify the application RTO. Kill one Kafka broker and verify that lag converges within SLO. Use the Chaos Engineering principles from the earlier tutorial in this course: document the steady-state hypothesis before each drill.
- Quarterly failover test: Perform a full Redis primary failover in production during a low-traffic window. This surfaces configuration drift (firewall rules, client library versions, DNS TTL settings) that only appears under real conditions.
- Runbook review on every incident: After each post-mortem, update the affected runbook section within 48 hours. The on-call engineer who ran the incident owns this task — they are the subject-matter expert until the next rotation.
- Alert fatigue audit: Quarterly, review firing frequency of every alert. An alert that fires more than twice per week without a P1 is either miscalibrated or pointing at a systemic problem that should be fixed, not silenced.
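The alert fatigue audit can be partly automated from an export of alert firings. The sketch below is one possible interpretation of the rule above: it flags alerts that averaged more than two firings per week over the audit window and never coincided with a P1. The input shape and names are hypothetical, not a PagerDuty or Prometheus API.

```python
from collections import Counter


def noisy_alerts(firings, weeks, max_per_week=2):
    """Flag alerts that fire often but never lead to a P1.

    firings: iterable of (alert_name, was_p1) tuples covering the audit window.
    weeks:   length of the audit window in weeks.
    """
    firings = list(firings)
    counts = Counter(name for name, _ in firings)
    had_p1 = {name for name, was_p1 in firings if was_p1}
    return sorted(
        name
        for name, count in counts.items()
        if count / weeks > max_per_week and name not in had_p1
    )


if __name__ == "__main__":
    quarter = (
        [("RedisMemoryHigh", False)] * 30
        + [("KafkaUnderReplicated", True)]
        + [("KafkaUnderReplicated", False)] * 29
        + [("DiskWarn", False)] * 5
    )
    print(noisy_alerts(quarter, weeks=13))  # ['RedisMemoryHigh']
```

Flagged alerts still need a human decision: recalibrate the threshold, or fix the systemic problem the alert keeps pointing at.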
The measure of a mature messaging platform operation is not that incidents never happen — it is that every incident is shorter, better-contained, and better-documented than the last. This runbook is the operational contract between your platform and every engineer who depends on it.