Serverless & Event-Driven Operations

Event-Driven Architecture Operations

18 min Lesson 5 of 28

Event-Driven Architecture Operations

Event-driven architecture (EDA) shifts the reliability contract from synchronous request-response to asynchronous message delivery. In a serverless context this means your Lambda function is no longer the server — the event broker is. That distinction has profound operational consequences. A synchronous HTTP call fails immediately and visibly; a failed event can sit silently in a queue, retry invisibly, duplicate, or arrive out of order. Understanding how to operate these systems with production discipline — dead-letter queues, retry policies, idempotency contracts, and ordering guarantees — is the difference between an EDA that scales elegantly and one that corrupts data under load.

Dead-Letter Queues: Your Operational Safety Net

A dead-letter queue (DLQ) is the destination for events that exhausted their retry budget without successful processing. It is not an error log — it is a recoverable queue of unprocessed work. At Amazon scale, DLQs are non-negotiable: every async Lambda event source (SQS, SNS, EventBridge, Kinesis, DynamoDB Streams) must have a configured DLQ, and the DLQ itself must be monitored. An alarm-free DLQ that silently fills means orders are not fulfilling, payments are not posting, and inventory counts are drifting.

Configure a DLQ on an SQS event source mapping and wire an alarm in one Terraform block:

# event_source_mapping.tf — SQS trigger with DLQ and alarm

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-processor-dlq"
  message_retention_seconds = 1209600   # 14 days — enough time for on-call to triage
  kms_master_key_id         = aws_kms_key.sqs.arn
}

resource "aws_sqs_queue" "orders" {
  name                       = "orders-processor"
  visibility_timeout_seconds = 300       # must be >= Lambda timeout
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 3              # move to DLQ after 3 failed receives
  })
}

resource "aws_lambda_event_source_mapping" "orders" {
  event_source_arn                   = aws_sqs_queue.orders.arn
  function_name                      = aws_lambda_function.order_processor.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]
}

# Alert when anything lands in the DLQ
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "orders-dlq-has-messages"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = aws_sqs_queue.orders_dlq.name }
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  evaluation_periods  = 1
  period              = 60
  statistic           = "Sum"
  alarm_actions       = [aws_sns_topic.pagerduty.arn]
}

DLQs are write-once; replaying them requires tooling. The AWS console "redrive" button on SQS (available since 2021) re-sends DLQ messages back to the source queue with controllable concurrency. For Kinesis and DynamoDB Streams, there is no native redrive — you must write a replay Lambda that reads the DLQ and re-publishes events. Build and test this tooling before you need it at 2 AM.

The ReportBatchItemFailures response type is critical at production scale. Without it, if a batch of 10 SQS messages is processed and message 7 fails, Lambda reports the entire batch as failed and all 10 messages become visible again. With partial batch failure reporting, only message 7 is returned to the queue:

# Python Lambda handler with partial batch failure reporting

import json

def handler(event, context):
    failures = []

    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            process_order(body)
        except Exception as exc:
            # Report only the failed item; the rest are deleted
            failures.append({"itemIdentifier": record["messageId"]})
            print(f"Failed to process {record['messageId']}: {exc}")

    return {"batchItemFailures": failures}

Misconfigured visibility timeouts are the most common DLQ-filling bug. If your Lambda timeout is 60 seconds but the queue visibility timeout is 30 seconds, the message becomes visible again while Lambda is still processing it — a second consumer picks it up, both fail, and the receive count doubles per invocation. The SQS visibility timeout must be at least 6x your Lambda timeout to account for retries, cold starts, and Lambda extension overhead.

Retry Policies: Designing for Inevitable Failure

Every async event invocation in AWS has a configurable retry policy. The defaults are designed for correctness, not cost — understanding them is essential before they fire in production.

Async Lambda invocations (SNS, S3, EventBridge): AWS retries twice with exponential backoff before sending to the Lambda destination or DLQ. Total retry window is up to 6 hours. Use aws lambda put-function-event-invoke-config to tune this.
SQS: controlled by the queue's maxReceiveCount (redrive policy). Each failed receive increments the count. A message stuck in processing for longer than the visibility timeout increments the count without the Lambda reporting failure — a silent retry burn.
Kinesis / DynamoDB Streams: retries until success or data expiry (default 24 hours for Kinesis, 24 hours for DynamoDB Streams). A poison-pill record that always fails will block the shard for the full retention period. Configure bisectBatchOnFunctionError and destinationConfig.onFailure to isolate and route failures.
EventBridge Pipes: configurable retry attempts (0–185) and maximum age (60 seconds–24 hours). Enrichment failures do not retry; filter mismatches are dropped silently.

Tune the async Lambda event invoke config for any function triggered by SNS or S3:

aws lambda put-function-event-invoke-config \
  --function-name order-processor \
  --maximum-retry-attempts 2 \
  --maximum-event-age-in-seconds 3600 \
  --destination-config \
    '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123456789:order-processor-dlq"},
      "OnSuccess":{"Destination":"arn:aws:sqs:us-east-1:123456789:order-audit"}}'

Exponential backoff with jitter is the standard for retry intervals. AWS applies it automatically for async invocations, but if you are building a custom retry loop (e.g. re-publishing from a DLQ), use random.uniform(0, min(cap, base * 2 ** attempt)). Without jitter, a thundering herd of retries after a downstream outage re-saturates the recovering service simultaneously. This is a well-documented failure mode at Netflix, Amazon, and Google.

Idempotency: The Contract That Makes Retries Safe

Retries are only safe if your handlers are idempotent: processing the same event twice produces the same side effects as processing it once. At production scale this is not optional — AWS itself documents "at-least-once delivery" for every async event source. The question is not if your handler receives a duplicate; it is when.

There are three standard idempotency patterns used in production EDA systems:

Idempotency key in the datastore: write the event ID to a DynamoDB table with a conditional put. If the item already exists, skip processing. This is the most reliable pattern and survives Lambda restarts.
AWS Lambda Powertools idempotency: a DynamoDB-backed decorator that handles the conditional write, in-progress locking, and TTL expiry with three lines of code.
Idempotent operations at the target: upserts (INSERT ... ON CONFLICT DO UPDATE in PostgreSQL; UpdateItem with a conditional expression in DynamoDB) are naturally idempotent for the record update itself, though side effects (sending an email, publishing a downstream event) still need the key pattern.

# Lambda Powertools idempotency — DynamoDB-backed (Python)
# pip install aws-lambda-powertools

from aws_lambda_powertools.utilities.idempotency import (
    idempotent_function,
    IdempotencyConfig,
    DynamoDBPersistenceLayer,
)

persistence_store = DynamoDBPersistenceLayer(table_name="idempotency-store")
config = IdempotencyConfig(
    event_key_jmespath="body",        # use the SQS message body as the key
    expires_after_seconds=3600,       # deduplicate within 1 hour
    raise_on_no_idempotency_key=True, # fail fast if key field is missing
)

@idempotent_function(data_keyword_argument="event",
                     config=config,
                     persistence_store=persistence_store)
def handler(event, context):
    order = json.loads(event["Records"][0]["body"])
    charge_card(order["payment_token"], order["amount_cents"])
    fulfill_inventory(order["sku"], order["quantity"])
    return {"status": "fulfilled", "order_id": order["id"]}

The DynamoDB table for the idempotency store needs a TTL attribute and a hash key of id (the SHA-256 of the event key). Provision it with on-demand capacity — bursts of retries cause bursty writes, and provisioned capacity here will throttle and defeat the purpose.

Ordering: When Sequence Matters

EDA systems have a spectrum of ordering guarantees. Choosing the wrong event source for a use case that requires ordering is a design defect that appears only under load.

Ordering and delivery guarantees vary widely across AWS event sources. Choose the source based on whether your workload requires strict ordering, and design your handler for the delivery semantics (at-least-once always requires idempotency).

For SQS FIFO, the message group ID is the ordering unit — all messages with the same group ID are processed in strict FIFO order by a single Lambda concurrency slot. This means a slow or failing message in one group does not block other groups, but concurrency within a group is always 1. For an order management system this is exactly right: order ID as the group ID ensures all state transitions for a given order (placed → paid → fulfilled → shipped) are processed sequentially without blocking unrelated orders.

# Publishing to SQS FIFO with message group ID (Python / boto3)
import boto3, uuid, json

sqs = boto3.client("sqs")

def publish_order_event(order_id: str, event_type: str, payload: dict):
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789/orders.fifo",
        MessageGroupId=order_id,           # strict ordering per order
        MessageDeduplicationId=str(uuid.uuid4()),  # idempotency key (content-based dedup is alternative)
        MessageBody=json.dumps({
            "event_type": event_type,
            "order_id": order_id,
            "payload": payload,
            "schema_version": "1.2",
        }),
    )

Kinesis partition key design determines throughput and ordering simultaneously. A partition key with low cardinality (e.g. a country code) routes too many events to too few shards, creating hot shards at 1 MB/s and 1,000 records/s. A partition key with high cardinality (e.g. a UUID per event) distributes load evenly but destroys ordering for the same entity. The correct pattern for CDC and event-sourcing workloads: use the entity ID (user ID, order ID, account ID) as the partition key — it distributes reasonably and preserves per-entity ordering within a shard.

Production Failure Modes in EDA Operations

Three failure patterns appear at scale that do not surface in staging environments:

The poison-pill event on Kinesis: a malformed record that always throws an exception will block a shard indefinitely until the retention period expires (24 hours–365 days). Configure bisectBatchOnFunctionError: true and a destinationConfig.onFailure (SQS DLQ) on the Kinesis event source mapping. The bisect option cuts the failing batch in half recursively until the single poison record is isolated and routed to the DLQ — shard processing resumes for the rest of the stream.
Clock-skew ordering failures: events published from multiple producers with wall-clock timestamps are not reliably ordered by timestamp in SQS Standard or EventBridge because clock skew between producers can be 100 ms or more and SQS does not re-sort. Use a monotonic sequence number from the source database (e.g. Postgres xmin, DynamoDB stream sequence number) not a client-generated timestamp for events where order matters.
Idempotency table becoming a hot partition: if all events for a high-traffic entity map to the same DynamoDB partition key (e.g. a global idempotency table with a flat hash key), you will hit the 1,000 write/second per-partition limit. Shard the idempotency table by prefixing the key with a random 0–9 digit, and use a global secondary index if you need point-in-time lookup.

Operating event-driven serverless systems at scale demands a different mental model than operating synchronous APIs. Failures are deferred and silent; retries are automatic and hidden; duplicates are expected, not exceptional. The engineers who run EDA systems well treat the DLQ not as an alarm but as a first-class operational surface — they have runbooks for every DLQ, metrics for retry rates and idempotency cache hit rates, and they practice DLQ redrive in game days before they need it in an incident.