Asynchronous Processing & Messaging

Backpressure & Dead-Letter Queues


Even the best-designed async pipeline eventually faces two hard operational realities: consumers that cannot keep up with the rate of incoming messages, and messages that can never be processed successfully no matter how many times they are retried. Backpressure is the set of techniques that handle the first problem; dead-letter queues (DLQs) are the safety net for the second. Together they turn a fragile pipeline into a resilient, observable one.

What Is Backpressure?

Backpressure is a signal that travels upstream — from a slow consumer toward the producer — to slow or stop the flow of new messages until the consumer can catch up. The name comes from fluid dynamics: water flowing through a pipe generates back-pressure when it meets resistance downstream.

In messaging systems the problem manifests differently depending on whether you use a push model (the broker delivers messages to the consumer) or a pull model (the consumer fetches its own batch):

Push model (e.g., RabbitMQ with automatic delivery): the broker keeps pushing regardless of how busy the consumer is. Without backpressure, the consumer's in-memory buffer fills up, memory spikes, and the process crashes or starts dropping messages silently.
Pull model (e.g., Kafka, SQS): the consumer asks for the next batch only when it is ready. Backpressure is implicit — the consumer simply does not issue the next poll() until it has finished processing the current batch. This is one reason Kafka is inherently more backpressure-friendly than push brokers.
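The core mechanism can be illustrated without any broker. The following is a minimal sketch using only the Python standard library: a bounded `queue.Queue` stands in for the consumer's buffer, and a full buffer blocks the producer until the consumer catches up. The names `run_pipeline` and `buffer_size` are illustrative assumptions, not part of any broker API.

```python
import queue
import threading


def run_pipeline(n_messages: int, buffer_size: int) -> tuple[list[int], int]:
    """Push n_messages through a bounded buffer.

    Returns (processed messages, largest buffer depth the producer observed).
    """
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)  # bounded buffer = backpressure
    processed: list[int] = []
    sentinel = None  # tells the consumer to stop
    max_depth = 0

    def consumer() -> None:
        while True:
            item = buf.get()
            if item is sentinel:
                break
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    for i in range(n_messages):
        buf.put(i)  # blocks while the buffer is full, which slows the producer
        max_depth = max(max_depth, buf.qsize())

    buf.put(sentinel)
    worker.join()
    return processed, max_depth
```

However fast the producer runs, the buffer never holds more than `buffer_size` messages, which is the same guarantee a prefetch limit or a bounded poll batch gives a real consumer.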

Symptoms of Missing Backpressure

When a system lacks proper backpressure mechanisms, failure is progressive and often catastrophic:

Consumer lag grows. A Kafka consumer group that processes 1,000 messages/second but receives 2,000/second accumulates a lag of ~86 million messages per day.
Memory pressure builds. Each unprocessed message occupies memory in the consumer process or in the broker's in-flight buffer.
The consumer crashes. A restart does not help — the backlog is still there, and the restarted consumer faces the same overload immediately.
Retry storms worsen the overload. If messages have a short visibility timeout (SQS) or are negatively acknowledged (RabbitMQ), they re-enter the queue and compete with new arrivals.
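The lag figure above is simple arithmetic, and the same arithmetic shows how long a backlog takes to drain once capacity is added. A small sketch, with function names chosen here purely for illustration:

```python
def lag_after(arrival_rate: int, processing_rate: int, seconds: int, initial_lag: int = 0) -> int:
    """Backlog size after `seconds`, given steady rates in messages/second."""
    growth_per_second = arrival_rate - processing_rate
    return max(0, initial_lag + growth_per_second * seconds)


def seconds_to_drain(lag: int, arrival_rate: int, processing_rate: int) -> float:
    """Time needed to clear `lag`; infinite if the consumer is not faster than arrivals."""
    surplus = processing_rate - arrival_rate
    if surplus <= 0:
        return float("inf")
    return lag / surplus


SECONDS_PER_DAY = 86_400
backlog = lag_after(arrival_rate=2_000, processing_rate=1_000, seconds=SECONDS_PER_DAY)
```

With the numbers from the example, `backlog` is 86,400,000 messages. Even after tripling consumer throughput to 3,000 messages/second, clearing that backlog while traffic continues takes another full day.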

The thundering herd: A common anti-pattern is deploying many consumer replicas hoping to "just scale out" under load. Without a coordinated backpressure strategy, every replica races to claim messages and each one ends up under the same memory pressure. You have replaced one crashed consumer with 20 crashed consumers.

Backpressure Strategies

There is no single backpressure implementation — the right approach depends on your broker and SLAs:

Prefetch / QoS limits: In RabbitMQ, set basic.qos(prefetch_count=N) so the broker delivers at most N unacknowledged messages to a single consumer channel at once. The broker blocks further delivery until acknowledgements arrive. A prefetch of 10–50 is a common starting point for CPU-bound tasks; 100–500 for I/O-bound ones.
Batch size control (pull model): Kafka consumers cap the number of records returned by each poll() call with the max.poll.records consumer setting (default 500). Lowering it reduces per-poll work and gives the consumer time to breathe. Pair it with max.poll.interval.ms so the group coordinator does not consider the consumer dead and trigger a rebalance during long-running processing.
Token-bucket / rate limiter in the producer: If you control the producer, instrument it to check the queue depth (via broker metrics or a counter in Redis) and throttle itself when depth exceeds a threshold. Useful for batch ingestion jobs.
Reactive Streams / async back-channel: In code, frameworks like Project Reactor, RxJava, or Node.js streams propagate demand signals. A downstream operator requests only as many items as it can handle (for example, 100 at a time); the upstream source emits no more than that until further demand is signalled.
Horizontal auto-scaling with lag alerts: When Kafka consumer lag exceeds a threshold (e.g., > 100,000 messages), trigger an auto-scaling event to add more consumer pods. Keep in mind that a Kafka consumer group cannot usefully scale beyond the number of partitions in the topic. AWS SQS + Lambda scales automatically; Kubernetes KEDA provides similar behaviour for many queue types through its scalers.
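As one illustration of the producer-side strategy, here is a minimal token-bucket sketch. It is not taken from any particular library: the class name, the injectable `clock` parameter (added so the behaviour can be tested deterministically), and the method name `try_acquire` are all assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` publishes per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take `tokens` if available; return False when the caller should back off."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A producer would call `try_acquire()` before each publish and sleep or buffer locally when it returns `False`. To tie the limiter to queue depth, the producer could lower `rate` whenever the depth reported by broker metrics crosses its threshold.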

Backpressure: the slow consumer signals the broker/producer to throttle, preventing queue overflow and consumer crash.

What Is a Dead-Letter Queue?

A dead-letter queue (DLQ) is a dedicated queue that receives messages that could not be delivered or processed successfully after a configured number of attempts. Instead of losing the message or blocking the main queue forever, the broker (or consumer logic) routes it to the DLQ so it can be inspected, fixed, and replayed later.

Messages end up in a DLQ for several common reasons:

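Max delivery exceeded: SQS moves a message to the DLQ after maxReceiveCount (e.g., 5) failed processing attempts. RabbitMQ quorum queues do the same once a message exceeds the queue's delivery-limit; with classic queues, the consumer usually inspects the count in the x-death header and rejects the message without requeueing once the limit is reached.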
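Message TTL expired: A message sat unprocessed in the queue longer than its time-to-live. RabbitMQ dead-letters expired messages when a dead-letter exchange is configured; SQS, by contrast, simply deletes messages that outlive the retention period. Either way, an expiry often signals a consumer outage longer than anticipated.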
Deserialization / schema error: The consumer cannot parse the message body because the schema changed in a breaking way. A classic poison-pill scenario.
Business logic exception: The consumer code throws an unhandled exception (e.g., a referenced database record no longer exists).

DLQ vs. retry queue: A retry queue is a staging area with a delay (using SQS visibility timeout, RabbitMQ TTL + dead-letter exchange, or Kafka consumer offset rewinding) before the message is re-attempted. A DLQ is the final destination after all retries are exhausted. You typically chain them: main queue → retry queue (with exponential back-off) → DLQ.
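The exponential back-off used between retry attempts can be sketched as follows. The base delay, the cap, and the optional "full jitter" randomisation are assumptions chosen for illustration; tune them to your own SLAs.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    Without `rng` the delay doubles each attempt: base, 2*base, 4*base, ...
    up to `cap`. With `rng`, a random delay between 0 and that ceiling is
    returned so that many failing consumers do not retry in lockstep.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    ceiling = min(cap, base * 2 ** (attempt - 1))
    if rng is None:
        return ceiling
    return rng.uniform(0, ceiling)


delays = [backoff_delay(n) for n in range(1, 6)]
```

With the defaults, the first five delays are 1, 2, 4, 8, and 16 seconds. A message that still fails after the last configured attempt is the one that moves on to the DLQ.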

Designing a DLQ Pipeline

A production DLQ setup involves more than just "a separate queue". You need:

Alerting: Any non-zero DLQ depth should page your on-call engineer. A message in the DLQ means real data was not processed — treat it like an error in your service, not a background curiosity.
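Message metadata: Store the failure reason, the original queue, the timestamp of each attempt, and the consumer that failed. SQS records only part of this automatically (for example, the ApproximateReceiveCount system attribute); the failure reason and the consumer identity have to be attached by your own code or logged alongside the message ID. For custom brokers, wrap the original payload in an envelope with these fields before routing to the DLQ.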
Requeue tooling: Write or adopt tooling to move messages from the DLQ back to the main queue after you have fixed the root cause. Without this, your DLQ is a black hole.
DLQ retention: Set a retention period long enough for your incident response cycle (7–14 days is common). Beyond that, archive to S3 / object storage before deletion so you have an audit trail.
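For the envelope and requeue tooling described above, a minimal sketch might look like this. The field names (`original_queue`, `failure_reason`, and so on) and both function names are assumptions for illustration, not a standard format.

```python
import json
import time
from typing import Any


def wrap_for_dlq(payload: dict[str, Any], original_queue: str, error: Exception,
                 attempt_timestamps: list[float], consumer_id: str) -> str:
    """Wrap a failed message in an envelope that carries its failure metadata."""
    envelope = {
        "original_queue": original_queue,
        "failure_reason": f"{type(error).__name__}: {error}",
        "attempt_timestamps": attempt_timestamps,
        "failed_consumer": consumer_id,
        "dead_lettered_at": time.time(),
        "payload": payload,  # the original message body, untouched
    }
    return json.dumps(envelope)


def unwrap_for_replay(raw: str) -> tuple[str, dict[str, Any]]:
    """Return (destination queue, original payload) for requeue tooling."""
    envelope = json.loads(raw)
    return envelope["original_queue"], envelope["payload"]
```

Replay tooling then only needs to read each DLQ entry, call `unwrap_for_replay`, and publish the payload back to the named queue once the root cause is fixed.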

Full retry-to-DLQ pipeline: failed messages get exponential back-off retries before landing in the DLQ for human inspection and replay.

Poison Messages and How to Handle Them

A poison message is one that consistently causes consumer failures regardless of retry count — the message itself is broken. Without a DLQ, a poison message will continuously cycle through the queue, consuming retries, blocking healthy messages (especially in FIFO queues), and crashing or hanging consumers.

Concrete examples of poison messages:

A JSON payload where a required field is null and your consumer code calls .toString() on it without null-checking.
A binary Protobuf message encoded with a newer schema version the consumer does not understand.
An event referencing an entity (e.g., order ID 9999) that was hard-deleted from the database between publish and consume.

The DLQ quarantines these problematic messages so the main queue can continue processing healthy messages while you investigate. Once the bug is fixed — either in the consumer code or by correcting the data — you replay the DLQ messages through the fixed consumer.
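The quarantine behaviour can be demonstrated with an in-memory stand-in for the queue. This is a simplified sketch, not broker code: `drain`, `handle_order`, and the `max_attempts` default are hypothetical names and values, and retries happen immediately rather than after a delay.

```python
from collections import deque
from typing import Any, Callable


def drain(messages: list[Any], handler: Callable[[Any], None],
          max_attempts: int = 3) -> tuple[list[Any], list[dict]]:
    """Process every message; return (processed, dead_lettered)."""
    main = deque((msg, 0) for msg in messages)  # (message, failed attempts so far)
    processed: list[Any] = []
    dlq: list[dict] = []
    while main:
        msg, attempts = main.popleft()
        try:
            handler(msg)
            processed.append(msg)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Retries exhausted: quarantine instead of cycling forever.
                dlq.append({"message": msg, "attempts": attempts, "error": repr(exc)})
            else:
                main.append((msg, attempts))  # requeue behind the healthy messages
    return processed, dlq


def handle_order(order: dict) -> None:
    """Example handler that rejects the null-field poison message described above."""
    if order["total"] is None:
        raise ValueError("total is required")
```

Given three orders where the second has a null `total`, the first and third are processed normally while the second lands in `dlq` after three failed attempts, so the healthy messages are never blocked.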

Log before you DLQ: Before routing a message to the DLQ, log the full payload, the exception stack trace, and the consumer version. Without this, debugging the root cause days later is nearly impossible. Store structured logs in a tool like Datadog, Elasticsearch, or CloudWatch Logs and correlate them by a message ID field embedded in every event payload.

SQS Redrive Policy — A Concrete Example

AWS SQS makes DLQ configuration explicit via a redrive policy:

// Terraform — SQS queue with DLQ and redrive policy
resource "aws_sqs_queue" "orders_queue" {
  name                       = "orders-process"
  visibility_timeout_seconds = 30
  message_retention_seconds  = 86400  // 1 day

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5  // move to DLQ after 5 failed attempts
  })
}

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-process-dlq"
  message_retention_seconds = 1209600  // 14 days
}

When a consumer receives a message and does not delete it within the visibility timeout, SQS makes it visible again so that any consumer can receive it. After maxReceiveCount receives with no successful delete, SQS automatically moves it to the DLQ. On the DLQ, the ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible CloudWatch metrics are the natural triggers for the alerting described under Designing a DLQ Pipeline.

Kafka and Dead-Letter Topics

Kafka does not have a built-in DLQ concept because the consumer controls its own offset. Failed messages require an explicit pattern from the consumer application:

1. Consumer catches a processing exception.
2. Consumer produces the failed record to a dedicated orders.process.DLT topic (dead-letter topic), enriched with failure metadata headers.
3. Consumer advances its offset past the failed record so the main consumer group is not blocked.
4. A separate DLQ consumer monitors the DLT topic and either alerts, archives, or replays records.

Spring Kafka's DefaultErrorHandler, configured with a DeadLetterPublishingRecoverer, automates the first three steps with a few lines of configuration.
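The consumer-side pattern can be sketched without a Kafka client by using plain lists in place of the topic and the dead-letter topic. Everything here is a stand-in: `Record`, `consume_partition`, `parse_order`, and the `dlt-*` header names are assumptions for illustration, not names defined by Kafka or Spring.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int
    value: bytes
    headers: dict = field(default_factory=dict)


def parse_order(value: bytes) -> dict:
    """Example handler: raises a ValueError subclass on malformed JSON."""
    return json.loads(value)


def consume_partition(records: list[Record], handler, dlt: list[Record],
                      consumer_id: str = "orders-consumer-1") -> int:
    """Process records in offset order and return the next offset to commit."""
    next_offset = records[0].offset if records else 0
    for rec in records:
        try:
            handler(rec.value)
        except Exception as exc:  # catch the processing failure
            # Publish to the dead-letter topic with failure metadata headers.
            dlt.append(Record(
                offset=len(dlt),
                value=rec.value,
                headers={
                    "dlt-original-offset": rec.offset,
                    "dlt-exception": f"{type(exc).__name__}: {exc}",
                    "dlt-consumer": consumer_id,
                },
            ))
        # Advance past the record whether it succeeded or was dead-lettered.
        next_offset = rec.offset + 1
    return next_offset
```

In a real consumer, the offset should only be committed after the produce to the dead-letter topic has been acknowledged; otherwise a crash between the two steps loses the failed record.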

Ordering trade-off: In an SQS FIFO queue, a poison message blocks all later messages in the same message group; in a Kafka partition, it blocks every later record in that partition until the consumer moves past it. Moving the message to a DLQ unblocks the stream, but later messages are then processed before the failed one is replayed, so you must design your consumer to be idempotent and order-tolerant if you enable DLQ-based unblocking.
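One possible shape for an idempotent, order-tolerant consumer is sketched below. It assumes every event carries a unique `message_id`, an `entity_id`, and a monotonically increasing `version` per entity; these field names and the class itself are hypothetical, and a production version would keep its state in a durable store rather than in memory.

```python
class IdempotentApplier:
    """Applies each event at most once and never lets an older version overwrite a newer one."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.state: dict[str, dict] = {}

    def apply(self, event: dict) -> bool:
        """Return True if the event changed state, False if it was skipped."""
        if event["message_id"] in self.seen_ids:
            return False  # duplicate delivery or repeated replay
        self.seen_ids.add(event["message_id"])

        current = self.state.get(event["entity_id"])
        if current is not None and current["version"] >= event["version"]:
            return False  # stale: a newer update was already applied
        self.state[event["entity_id"]] = {
            "version": event["version"],
            "data": event["data"],
        }
        return True
```

If version 1 of an order sits in the DLQ while version 2 is processed, replaying version 1 later is recognised as stale and skipped, so the replay cannot roll the order back.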

Backpressure + DLQ Together

In a well-designed pipeline, backpressure and DLQs complement each other:

Backpressure handles temporary overload — it slows the system down gracefully and buys time for the consumer to catch up.
DLQs handle permanent failures — they quarantine broken messages so they do not consume system resources indefinitely.
Without backpressure, your consumer is flooded with messages and crashes. Each crash leaves in-flight messages unacknowledged, so their delivery counts climb until healthy messages end up in the DLQ.
Without DLQs, poison messages cycle forever, consuming retries, occupying prefetch slots, and effectively applying unwanted backpressure on legitimate messages.

A practical rule of thumb: set your DLQ alert threshold to zero, and set your prefetch count or batch size to a value your consumer can comfortably process within one visibility timeout (SQS) or acknowledgement timeout (RabbitMQ). Those two constraints alone eliminate the most common production queue disasters.
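The sizing half of that rule can be turned into a quick calculation. This sketch assumes the consumer processes its batch sequentially and uses a `safety_factor` of 0.5 (use only half the timeout) as an arbitrary illustrative margin; the function name is also an assumption.

```python
import math


def max_safe_prefetch(visibility_timeout_s: float, avg_processing_s: float,
                      safety_factor: float = 0.5) -> int:
    """Largest batch a sequential consumer can finish within a fraction of the timeout."""
    if visibility_timeout_s <= 0 or avg_processing_s <= 0:
        raise ValueError("timeout and processing time must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    budget_s = visibility_timeout_s * safety_factor
    # Always allow at least one message, even if it cannot fit the budget.
    return max(1, math.floor(budget_s / avg_processing_s))
```

With the 30-second visibility timeout from the Terraform example and an average of 0.5 seconds per message, this gives a batch of 30. If a single message takes longer than the budget, the result is 1 and the timeout itself needs to be raised.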