Project: Design an Async Pipeline
Project: Design an Async Pipeline
Everything in this tutorial has been building toward a single skill: sitting down in front of a blank whiteboard and designing a production-grade async pipeline from scratch. This lesson walks through that process end-to-end using a realistic workload — a video-upload and transcoding platform — roughly similar to what YouTube, Vimeo, or Mux operate at scale. You will make every architectural decision explicitly, justify the trade-offs, and produce a design you could defend in a system-design interview or present to an engineering team.
The Sample Workload: Video Upload & Transcoding
The requirements are deliberately concrete:
- Users upload raw video files (up to 10 GB each).
- Each upload must be transcoded into five quality levels: 2160p, 1080p, 720p, 480p, 360p.
- Transcoding one minute of 4K footage takes roughly 3–8 minutes of CPU time per output quality.
- A 30-minute raw video produces about 2.5 hours of combined CPU work.
- Peak load: 500 simultaneous uploads during a live event or product launch.
- The uploader must receive a confirmation within 2 seconds. Transcoding may take up to 30 minutes.
- Users and downstream systems (CDN, search indexer, analytics) must be notified when each quality tier is ready.
- A failed transcode must not be silently lost — it must be retried and, after exhausting retries, routed for human review.
Step 1 — Identify the Producers and Consumers
Start every pipeline design by mapping who produces events and who consumes them. This prevents you from prematurely over-engineering the messaging layer.
- Producers: The upload API service (one event per completed upload), and later the Transcode Workers themselves (one event per completed quality tier).
- Consumers: Transcode Workers (primary heavy consumers), Notification Service, Search Indexer, Analytics Ingestor, CDN Purge Service.
Notice that the Transcode Workers are both consumers (they consume upload events) and producers (they emit transcode-complete events). This producer/consumer duality is common in pipeline architectures and is a strong signal that an event streaming platform like Kafka is the right backbone — not a simple point-to-point queue.
Step 2 — Choose the Messaging Technology
With five consumers of the transcode-complete event, a single queue such as RabbitMQ in a competing-consumer pattern would mean only one consumer receives each message. That is exactly wrong here — every consumer (notification, search, analytics, CDN, monitoring) needs its own copy. The choice narrows to:
- Kafka topics with multiple consumer groups — each consumer group gets every message independently, messages are retained on disk for hours or days, replay is trivial. This is the correct choice.
- SNS + SQS fan-out (AWS) — SNS topic fans out to multiple SQS queues, one per consumer type. Valid if you are fully AWS-native and want managed infrastructure.
For this design, use Kafka because the retention and replay features are critical: if the Search Indexer is down for 20 minutes during a deploy, it must be able to catch up from the last committed offset rather than lose those events forever.
Step 3 — Define the Topics and Partitioning Strategy
Three Kafka topics suffice for this pipeline:
video.uploads.raw— one message per completed upload, keyed byvideo_id. Partition count: 50 (supports 500 concurrent uploads; each partition is processed by one Transcode Worker).video.transcodes.progress— one message per completed quality tier (so 5 messages per video). Keyed byvideo_idso all five events for the same video land in the same partition (guaranteed ordering per video).video.transcodes.dlq— the Dead-Letter Topic. Failed transcode jobs land here after 3 retry attempts for manual triage.
Step 4 — Design the Full Pipeline Flow
With the technology and topics decided, the full end-to-end flow becomes clear. The diagram below shows all components and the data path from raw upload to CDN delivery.
Step 5 — Delivery Guarantees and Idempotency
The Transcode Workers must use at-least-once delivery — we absolutely cannot lose a transcode job. This means a worker commits its Kafka offset only after successfully uploading the transcoded file to object storage and publishing the progress event. If the worker crashes mid-job, Kafka will re-deliver the message to another worker. This is safe only because transcoding is idempotent: encoding the same source file to the same quality level always produces the same output. The worker should check whether the output file already exists in storage before starting work to avoid redundant CPU cost.
Step 6 — Backpressure and Scaling
With 50 Kafka partitions, the worker pool can scale horizontally up to 50 instances without any code changes — Kafka's consumer group protocol automatically rebalances partitions. During a spike of 500 concurrent uploads, messages accumulate in the topic (Kafka's consumer lag grows). This is intentional backpressure: the queue absorbs burst load, and workers drain it at a sustainable rate. Configure consumer lag alerts at 200 messages to trigger auto-scaling of the worker pool.
Transcode jobs are CPU-heavy and long-running (up to 30 minutes). Set the max.poll.interval.ms Kafka consumer config to at least 35 minutes to prevent the broker from kicking out a worker that is legitimately busy processing a large 4K file.
Step 7 — Dead-Letter Queue and Operations
After 3 retry attempts (with exponential back-off: 1 min, 5 min, 25 min), a failed transcode is published to the video.transcodes.dlq topic. An Ops Dashboard consumes this topic and creates a Jira/PagerDuty ticket with the video_id, error message, and stack trace. An operator can inspect the source file in object storage, fix the issue, and manually re-publish the message to video.uploads.raw to retry.
Capacity Estimation
A quick back-of-the-envelope check validates the design:
- 500 uploads/peak × 2.5 CPU-hours each = 1,250 CPU-hours of transcode work to drain.
- One transcode worker uses ~2 vCPUs. A single
c5.2xlarge(8 vCPUs) runs 4 parallel transcode jobs. - To drain 1,250 CPU-hours in 30 minutes you need roughly 42 such instances (1,250 / 0.5 hrs / 4 jobs × safety margin ≈ 40–50 nodes). This matches the 50-partition design neatly.
- At $0.34/hr per
c5.2xlarge, a 30-minute peak burst costs about $7.14 — use spot instances to reduce this by 70%.
Design Summary
The final design applies every concept from this tutorial:
- Async decoupling — the Upload API returns in under 2 seconds regardless of transcode time.
- Kafka topics with intentional partitioning strategy for even load distribution.
- Publish/subscribe fan-out — five consumer groups each receive every progress event independently.
- At-least-once delivery with idempotent workers (check output existence before processing).
- Backpressure via consumer lag and auto-scaling triggers.
- Dead-letter queue with human-review workflow and manual re-injection.
- Event-driven downstream services — search indexing, analytics, and CDN invalidation are all triggered by events, not cron jobs or polling.
This is what a production async pipeline looks like. The specific technology stack will vary — RabbitMQ instead of Kafka for simpler fan-out needs, SQS+SNS on AWS, or Pub/Sub on GCP — but the structural decisions remain the same: identify producers and consumers, choose a messaging technology that matches your fan-out and retention requirements, partition for parallelism, commit offsets only after durable completion, and always build a DLQ path.