Kafka Producers & Consumers in Production
Kafka Producers & Consumers in Production
Running Kafka at scale demands precise control over how producers write and how consumers read. The four pillars of this lesson — acknowledgment semantics, idempotent production, consumer groups, and lag monitoring — govern every reliability and throughput decision on the critical path. This is where a cluster that "works in staging" either earns or loses production trust.
Producer Acknowledgments (acks)
The acks setting controls how many broker replicas must confirm a write before the client considers it successful. Three values exist in practice:
acks=0— fire-and-forget. The producer sends and moves on with zero durability guarantee. Only appropriate for high-volume, loss-tolerant telemetry (click-streams, ping metrics).acks=1— the partition leader acknowledges. Faster thanall, but if the leader fails before followers replicate, the message is silently lost. Common default in legacy code; avoid it for anything financial or audit-relevant.acks=all(oracks=-1) — the leader waits until all in-sync replicas (ISR) acknowledge. Combined withmin.insync.replicas=2on the broker, this is the floor for production-grade durability. It adds roughly 5–15 ms of tail latency versusacks=1, a cost nearly always worth paying.
acks=all does nothing protective if min.insync.replicas=1 (broker default). Always set min.insync.replicas=2 (or 3 for RF=3 topics carrying SLO-bound data). A mismatch is one of the most common silent-data-loss configurations found in production Kafka clusters.
Idempotent Producers
Network retries without idempotence create duplicates. When enable.idempotence=true, the broker assigns each producer a PID (Producer ID) and tracks a monotonically increasing sequence number per partition. If the broker receives a retried batch it already committed, it silently deduplicates it — exactly-once semantics within a single producer session, without any transaction overhead.
Idempotence requires acks=all, max.in.flight.requests.per.connection <= 5, and retries > 0. The Kafka client enforces these automatically when you set enable.idempotence=true; do not override them individually or the broker will reject the producer registration.
transactional.id + initTransactions()). Transactions add measurable latency — reserve them for payment, inventory, and ledger flows.
Consumer Groups and Partition Assignment
A consumer group is the unit of parallel consumption in Kafka. Every consumer in the group subscribes to the same topic(s); the group coordinator (a broker) assigns each partition to exactly one consumer. This gives horizontal scale-out: a 60-partition topic can be drained by up to 60 consumers in the same group simultaneously. Consumers beyond the partition count sit idle — a common over-provisioning mistake.
Rebalancing: Causes, Cost, and Mitigation
A rebalance is triggered whenever group membership changes: a consumer joins, crashes, or fails to heartbeat within session.timeout.ms. During an eager rebalance (the pre-2.4 default), all consumers drop all partitions and reassign from scratch — a stop-the-world pause that halts consumption cluster-wide for seconds to tens of seconds on large groups.
Production mitigations, in order of impact:
- Cooperative Sticky Rebalancing — set
partition.assignment.strategy=CooperativeStickyAssignor. Only partitions that need to move are revoked; the rest continue processing. Available since Kafka 2.4 and the only acceptable default for new services. - Static Group Membership — assign a stable
group.instance.id(e.g. the pod name). The broker holds the consumer's partition assignment forsession.timeout.msafter a disconnect before triggering a rebalance. Pod restarts in rolling deployments no longer cause a full group reshuffle. - Tune heartbeat/session timeouts — default
session.timeout.ms=45000is too long for fast failure detection. A common production tuning issession.timeout.ms=15000,heartbeat.interval.ms=4000,max.poll.interval.ms=300000. Raisemax.poll.interval.msif processing a single batch legitimately takes minutes (ML inference, slow DB writes).
Offset Commits: Manual vs Auto
enable.auto.commit=true commits offsets every auto.commit.interval.ms (default 5000 ms) regardless of whether your application has successfully processed those messages. A crash between the auto-commit and the completion of processing means those messages are silently skipped on restart. This is the primary cause of data loss in Kafka consumer applications.
With manual commits, use commit-after-process: call commitSync() or commitAsync() only after your downstream write (DB upsert, downstream API call, etc.) has succeeded. For exactly-once semantics end-to-end, the only reliable option is Kafka Transactions or idempotent downstream writes combined with manual commits.
commitAsync with a synchronous fallback on failure, paired with idempotent downstream writes (upsert by event ID). This is simpler than Kafka Transactions and handles the 99.99% case of single-message duplicates gracefully.
Consumer Lag: The Most Important Operational Metric
Consumer lag is the difference between the latest offset in a partition (log-end offset) and the last committed offset for a consumer group. It is measured in number of messages, not time — a lag of 50,000 at 10,000 messages/second means 5 seconds of backlog; the same lag at 100 messages/second means 8 minutes.
Key lag signals and their thresholds (tune per SLO):
- Growing lag — consumption rate < production rate. Scale consumers (up to partition count) or increase
max.poll.records. - Flat non-zero lag — consumer is keeping up with the live stream but never draining the backlog. Usually caused by a burst that never fully recovered, or a consumer restart that fell behind.
- Sudden lag spike — rebalance in progress, consumer crash, or a slow downstream dependency. Alert immediately if above 60-second equivalent.
At large scale, per-partition lag is exposed via the JMX exporter and visualized in Grafana (see the observability tutorial). Set two alert thresholds: a warning at your 60-second equivalent lag, and a critical at your SLO breach point. Wire the critical alert directly to your on-call rotation — a consumer group that stops committing offsets in a payment pipeline is a P1 incident.
Putting It Together: Production Checklist
- Set
acks=all+min.insync.replicas=2on all SLO-bound topics. - Enable
enable.idempotence=trueon all producers by default; add transactions only where cross-partition atomicity is required. - Use
CooperativeStickyAssignorand static group membership for all long-lived consumer services. - Disable auto-commit; commit offsets only after successful downstream processing.
- Alert on consumer lag velocity (not just absolute count) and wire critical alerts to on-call.
- Test rebalance behavior in staging: kill a consumer pod mid-batch and verify no messages are lost or permanently skipped.