Capstone: Design a System End-to-End
Capstone: Design a System End-to-End
This capstone lesson ties every concept from the course into a single, realistic design exercise. We will design a social video platform — think YouTube at reduced but still serious scale — by running the full five-phase framework from scratch: clarify, estimate, API, high-level architecture, and deep-dive trade-offs. Every decision is explained with why, not just what.
By the end you will have one complete reference design you can adapt to any system design interview or greenfield project.
Phase 1 — Clarify Requirements
Functional requirements (the must-haves):
- Users can upload videos (up to 4 GB, any common format).
- Videos are transcoded to multiple resolutions (360p, 720p, 1080p) and served adaptively.
- Users can search videos by title and tag.
- A personalised home feed shows recommended videos.
- Users can like, comment, and subscribe to channels.
- View counts are eventually consistent — real-time accuracy is not required.
Non-functional requirements (these drive the architecture):
- Scale: 50 million daily active users (DAU); 500,000 video uploads per day; 5 billion video view-seconds per day.
- Latency: Video playback must start within 2 seconds globally (p95).
- Availability: 99.95 % uptime for the playback path; upload can tolerate brief degradation.
- Durability: Uploaded source videos must never be lost (no data loss SLA).
- Consistency: Like counts and view counts are eventually consistent. Subscriptions are strongly consistent (you must never miss a sub notification).
Phase 2 — Estimate Scale
- Upload QPS: 500,000 uploads/day ÷ 86,400 s ≈ 6 uploads/second average; 30/s peak (5x factor).
- Playback QPS: 5 B view-seconds/day ÷ average video length 300 s = 16.7 M concurrent streams; divided by 86,400 s ≈ 193 k view starts/second peak.
- Storage — raw uploads: 500,000 × avg 500 MB = 250 TB/day of raw source video. At 3 resolutions each ≈ 0.4× the original, transcode output adds ~300 TB/day. Over 1 year: ~200 PB. Object storage (S3 / GCS) is the only practical answer.
- Bandwidth — egress: 193 k streams × 2 Mbps avg bitrate = ~386 Gbps. A CDN is not optional — it is a hard architectural requirement at this scale.
- Metadata DB: 500,000 new video rows/day × 2 KB per row = 1 GB/day of metadata. After 5 years ≈ 1.8 TB — easily fits in a sharded relational DB or a distributed document store.
Phase 3 — Core API Surface
Handing the client a pre-signed URL for direct upload to object storage is a critical design decision: it removes the app servers entirely from the upload data path. The app server only handles metadata; the raw bytes go straight from the client to S3.
Phase 4 — High-Level Architecture
The system decomposes naturally into two independent planes:
- Write plane (upload): Client → App Server (metadata) + direct S3 upload → S3 event → Transcode Queue → Transcode Workers → Transcoded videos back to S3 → CDN invalidation.
- Read plane (playback): Client → CDN → Origin (for cache misses only) → App Server → Metadata DB / Cache → HLS manifest → CDN segments.
Phase 5 — Deep-Dive: The Five Hardest Problems
Every large system has a small number of genuinely hard sub-problems. For our video platform they are: transcode pipeline reliability, CDN cache hit rate, view-count accuracy vs performance, feed generation at scale, and search freshness. We tackle each briefly.
1. Transcode Pipeline Reliability
An S3 event triggers a message on a Kafka topic. A pool of stateless transcode workers consumes jobs. Each worker downloads the raw source, runs FFmpeg to produce HLS segments at three resolutions, and uploads them back to S3. The worker then publishes a transcode.completed event that the metadata service consumes to flip the video status from processing to published.
Key trade-off: at-least-once delivery (Kafka default) means a worker can crash mid-job and the message is redelivered. Workers must be idempotent — re-transcoding and overwriting S3 output is safe because the final write is atomic from S3's perspective.
2. CDN Cache Hit Rate
Video segments are immutable once transcoded (a 2-second HLS chunk never changes). Set a Cache-Control: max-age=31536000, immutable header. The CDN will cache them for a year at every edge node. Hit rates above 90 % are achievable for popular content. For the long tail (videos with few views), cache misses are acceptable — the origin (S3) can handle the load since it is object storage, not a database.
/v/{video_id}/{quality}/manifest.m3u8?gen=3) so that re-transcodes do not serve stale manifests. The manifest itself should have a short TTL (60 s) while the segments have a year-long TTL.
3. View Count — Eventual Consistency Done Right
Incrementing a SQL counter on every view at 193 k/s would kill a relational database. The classic solution is a two-stage pipeline:
- Each app server buffers view events locally and flushes them in batches every 10 seconds to a stream (Kafka).
- A Flink or Spark Streaming job aggregates counts from the stream and writes periodic snapshots back to the database (every 30 seconds).
The user sees a count that is at most 40 seconds stale. For a social video platform, that is perfectly acceptable. The database never sees more than a few hundred writes per second regardless of view volume.
4. Feed Generation — Fan-out on Write vs Fan-out on Read
5. Search Freshness
When a video is published, its metadata must appear in search results within a few seconds. The metadata service publishes a video.published event; a separate indexing worker consumes it and calls the Elasticsearch bulk API to index title, tags, description, and a pre-computed quality score (views, likes, age). Elasticsearch handles inverted-index updates within ~1 second, meeting the freshness requirement without touching the metadata DB from the search path.
Putting It All Together — Key Design Decisions Table
| Decision | Choice | Why |
|---|---|---|
| Video storage | Object store (S3) | Unlimited scale, 11-nine durability, cheap at petabyte scale |
| Upload path | Pre-signed URL (client → S3 direct) | Removes app servers from the data path; scales upload bandwidth independently |
| Transcode coordination | Kafka queue + stateless workers | Decouples upload from transcode; workers are horizontally scalable; at-least-once delivery with idempotent jobs |
| Video delivery | CDN + HLS (Adaptive Bitrate) | Global low-latency; client adapts quality to network; immutable segments cache forever |
| Metadata storage | Sharded MySQL + Redis cache | ACID for writes; cache absorbs read QPS (video metadata is read-heavy, write-light) |
| View counts | Buffered batch writes via Kafka | Avoids hot-row contention; eventual consistency is acceptable for counters |
| Feed generation | Hybrid push/pull | Push for normal users (fast reads); pull for celebrities (avoids fan-out storms) |
| Search | Elasticsearch (event-driven indexing) | Full-text and faceted search; decoupled from DB; ~1 s indexing latency is acceptable |
What You Have Now
You have walked a complete system design — from a vague prompt to a production-credible architecture — using the same five-phase methodology you can apply to any problem. The patterns that appeared here (pre-signed uploads, CDN-first delivery, async transcode pipelines, hybrid fan-out, event-driven search indexing, eventual-consistent counters) are the same patterns that power YouTube, Netflix, TikTok, and every other large-scale video platform in production today.
Take this reference design, adapt the constraints, swap the components for the ones your problem demands, and you will have a rigorous, trade-off-driven answer every time.