Caching & Messaging Infrastructure

Kafka Producers & Consumers in Production

18 min Lesson 5 of 30

Kafka Producers & Consumers in Production

Running Kafka at scale demands precise control over how producers write and how consumers read. The four pillars of this lesson — acknowledgment semantics, idempotent production, consumer groups, and lag monitoring — govern every reliability and throughput decision on the critical path. This is where a cluster that "works in staging" either earns or loses production trust.

Producer Acknowledgments (acks)

The acks setting controls how many broker replicas must confirm a write before the client considers it successful. Three values exist in practice:

acks=0 — fire-and-forget. The producer sends and moves on with zero durability guarantee. Only appropriate for high-volume, loss-tolerant telemetry (click-streams, ping metrics).
acks=1 — the partition leader acknowledges. Faster than all, but if the leader fails before followers replicate, the message is silently lost. Common default in legacy code; avoid it for anything financial or audit-relevant.
acks=all (or acks=-1) — the leader waits until all in-sync replicas (ISR) acknowledge. Combined with min.insync.replicas=2 on the broker, this is the floor for production-grade durability. It adds roughly 5–15 ms of tail latency versus acks=1, a cost nearly always worth paying.

min.insync.replicas trap: Setting acks=all does nothing protective if min.insync.replicas=1 (broker default). Always set min.insync.replicas=2 (or 3 for RF=3 topics carrying SLO-bound data). A mismatch is one of the most common silent-data-loss configurations found in production Kafka clusters.

# Producer config — durability-first (Java client properties or librdkafka equivalents)
acks=all
min.insync.replicas=2           # set on the topic or broker, not the client
retries=2147483647              # retry indefinitely; rely on delivery.timeout.ms
delivery.timeout.ms=120000      # 2-minute hard cap on total retry window
enable.idempotence=true         # pairs with acks=all automatically
linger.ms=5                     # small batching window for throughput
batch.size=65536                # 64 KB batch; tune with producer metrics
compression.type=snappy         # CPU-cheap, ~3x compression for JSON payloads

Idempotent Producers

Network retries without idempotence create duplicates. When enable.idempotence=true, the broker assigns each producer a PID (Producer ID) and tracks a monotonically increasing sequence number per partition. If the broker receives a retried batch it already committed, it silently deduplicates it — exactly-once semantics within a single producer session, without any transaction overhead.

Idempotence requires acks=all, max.in.flight.requests.per.connection <= 5, and retries > 0. The Kafka client enforces these automatically when you set enable.idempotence=true; do not override them individually or the broker will reject the producer registration.

Idempotence vs. Transactions: Idempotent producers deduplicate writes to a single topic-partition from a single producer instance. If you need atomic writes across multiple partitions or topics (e.g., read-process-write pipelines), use Kafka Transactions (transactional.id + initTransactions()). Transactions add measurable latency — reserve them for payment, inventory, and ledger flows.

Consumer Groups and Partition Assignment

A consumer group is the unit of parallel consumption in Kafka. Every consumer in the group subscribes to the same topic(s); the group coordinator (a broker) assigns each partition to exactly one consumer. This gives horizontal scale-out: a 60-partition topic can be drained by up to 60 consumers in the same group simultaneously. Consumers beyond the partition count sit idle — a common over-provisioning mistake.

Six partitions distributed evenly across three consumers in a group; each consumer commits offsets independently.

Rebalancing: Causes, Cost, and Mitigation

A rebalance is triggered whenever group membership changes: a consumer joins, crashes, or fails to heartbeat within session.timeout.ms. During an eager rebalance (the pre-2.4 default), all consumers drop all partitions and reassign from scratch — a stop-the-world pause that halts consumption cluster-wide for seconds to tens of seconds on large groups.

Production mitigations, in order of impact:

Cooperative Sticky Rebalancing — set partition.assignment.strategy=CooperativeStickyAssignor. Only partitions that need to move are revoked; the rest continue processing. Available since Kafka 2.4 and the only acceptable default for new services.
Static Group Membership — assign a stable group.instance.id (e.g. the pod name). The broker holds the consumer's partition assignment for session.timeout.ms after a disconnect before triggering a rebalance. Pod restarts in rolling deployments no longer cause a full group reshuffle.
Tune heartbeat/session timeouts — default session.timeout.ms=45000 is too long for fast failure detection. A common production tuning is session.timeout.ms=15000, heartbeat.interval.ms=4000, max.poll.interval.ms=300000. Raise max.poll.interval.ms if processing a single batch legitimately takes minutes (ML inference, slow DB writes).

# Consumer config — production baseline
group.id=order-processors
group.instance.id=${HOSTNAME}            # static membership; set per pod
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
session.timeout.ms=15000
heartbeat.interval.ms=4000
max.poll.interval.ms=300000
max.poll.records=500                     # tune with processing latency
auto.offset.reset=earliest               # replay from start on new group; use 'latest' only for throw-away consumers
enable.auto.commit=false                 # explicit commits only in production
isolation.level=read_committed           # required when upstream uses transactions

Offset Commits: Manual vs Auto

enable.auto.commit=true commits offsets every auto.commit.interval.ms (default 5000 ms) regardless of whether your application has successfully processed those messages. A crash between the auto-commit and the completion of processing means those messages are silently skipped on restart. This is the primary cause of data loss in Kafka consumer applications.

With manual commits, use commit-after-process: call commitSync() or commitAsync() only after your downstream write (DB upsert, downstream API call, etc.) has succeeded. For exactly-once semantics end-to-end, the only reliable option is Kafka Transactions or idempotent downstream writes combined with manual commits.

At-least-once with idempotent sinks: Most production systems use manual commitAsync with a synchronous fallback on failure, paired with idempotent downstream writes (upsert by event ID). This is simpler than Kafka Transactions and handles the 99.99% case of single-message duplicates gracefully.

Consumer Lag: The Most Important Operational Metric

Consumer lag is the difference between the latest offset in a partition (log-end offset) and the last committed offset for a consumer group. It is measured in number of messages, not time — a lag of 50,000 at 10,000 messages/second means 5 seconds of backlog; the same lag at 100 messages/second means 8 minutes.

Key lag signals and their thresholds (tune per SLO):

Growing lag — consumption rate < production rate. Scale consumers (up to partition count) or increase max.poll.records.
Flat non-zero lag — consumer is keeping up with the live stream but never draining the backlog. Usually caused by a burst that never fully recovered, or a consumer restart that fell behind.
Sudden lag spike — rebalance in progress, consumer crash, or a slow downstream dependency. Alert immediately if above 60-second equivalent.

# Monitor lag from the CLI (Kafka 3.x, KRaft or ZK mode)
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group order-processors \
  --describe

# Output columns: TOPIC | PARTITION | CURRENT-OFFSET | LOG-END-OFFSET | LAG | CONSUMER-ID

# Reset offsets to earliest (drain backlog from the beginning — use with caution)
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group order-processors \
  --topic orders \
  --reset-offsets --to-earliest --execute

# Prometheus JMX metric to alert on
# kafka.consumer:type=consumer-fetch-manager-metrics,
#   attribute=records-lag-max
# Alert: records-lag-max > 50000 for > 2 minutes

At large scale, per-partition lag is exposed via the JMX exporter and visualized in Grafana (see the observability tutorial). Set two alert thresholds: a warning at your 60-second equivalent lag, and a critical at your SLO breach point. Wire the critical alert directly to your on-call rotation — a consumer group that stops committing offsets in a payment pipeline is a P1 incident.

Lag in time vs. messages: Kafka does not natively expose lag in seconds. Tools like Burrow (LinkedIn) or the Confluent Control Center compute lag velocity (are consumers catching up or falling further behind?) which is more operationally actionable than a raw message count. At companies running >1 TB/day, automated lag-based autoscaling (KEDA Kafka scaler on Kubernetes) is standard — the consumer Deployment scales out when lag exceeds a threshold and back in when drained.

Putting It Together: Production Checklist

Set acks=all + min.insync.replicas=2 on all SLO-bound topics.
Enable enable.idempotence=true on all producers by default; add transactions only where cross-partition atomicity is required.
Use CooperativeStickyAssignor and static group membership for all long-lived consumer services.
Disable auto-commit; commit offsets only after successful downstream processing.
Alert on consumer lag velocity (not just absolute count) and wire critical alerts to on-call.
Test rebalance behavior in staging: kill a consumer pod mid-batch and verify no messages are lost or permanently skipped.