Design a Video Streaming Service
YouTube receives more than 500 hours of uploaded video every minute and delivers more than 1 billion hours of watch time per day across 100+ countries. Designing a system at this scale forces you to solve three distinct, hard problems simultaneously: ingestion (how do raw uploads become streamable files?), storage (where do petabytes of video live?), and delivery (how does every viewer on every device get smooth playback?). Each problem has its own architecture story, and together they make up one of the most instructive case studies in distributed systems.
Requirements
Functional requirements:
- Users can upload videos (up to 10 GB each).
- Uploaded videos are transcoded into multiple resolutions (360p, 480p, 720p, 1080p, 4K) and formats (MP4/H.264, WebM/VP9, HLS segments).
- Users can stream videos with adaptive bitrate — quality adjusts automatically to available bandwidth.
- Users can search, like, comment, and subscribe (out of scope for this lesson; we focus on upload and streaming).
Non-functional requirements:
- Availability: 99.99 % — even a 0.01 % outage affects millions of concurrent viewers.
- Latency: Video playback must start within 2 seconds; upload acknowledgment within 500 ms.
- Throughput: 500 hours/minute upload; peak egress bandwidth in the tens of terabits per second globally.
- Durability: Uploaded video must never be lost (3+ geo-redundant copies).
- Scalability: Must scale horizontally for both ingestion and delivery without re-architecting.
Scale Estimation
- Storage: 500 hours × 60 min/hour × ~1 GB/min raw ≈ 30 TB of raw video per minute of wall-clock time (about 1.8 PB per hour). Transcoding into 5 renditions brings the stored total to roughly 2–3× the raw size, so final storage grows by ~60–90 TB per minute, or roughly 30–50 EB per year under these assumptions. Growth at this rate is why a service like YouTube ends up owning its data centres or negotiating aggressive storage pricing.
- Bandwidth (egress): 1 billion watch-hours/day ÷ 24 hours/day ≈ 41.7 million concurrent streams on average. At a 2 Mbps average bitrate that is ~83 Tbps of total egress, and peak traffic is higher still. No single CDN PoP can serve this; it requires a global, multi-tier CDN hierarchy.
- Transcoding workers: Processing 1 hour of 4K video takes ~20–40 minutes of CPU time per rendition. Taking 30 minutes as the midpoint: 500 upload-hours/minute × 5 renditions × 30 CPU-minutes per video-hour = 75,000 CPU-minutes of transcoding per minute of real time, i.e. about 75,000 cores kept busy continuously. This is a massively parallel compute problem.
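The estimates above are easy to get wrong by a unit, so it can help to recompute them from the stated inputs. The following is a minimal sketch, assuming only the figures quoted in this section (500 upload-hours per minute, ~1 GB per raw video-minute, a 2–3× storage multiplier, 1 billion watch-hours per day, 2 Mbps average bitrate, 30 CPU-minutes per video-hour per rendition). The function and constant names are illustrative.

```python
# Back-of-envelope estimates. Every input is an assumption taken from the text.
UPLOAD_HOURS_PER_MIN = 500      # hours of video uploaded per wall-clock minute
RAW_GB_PER_VIDEO_MIN = 1.0      # assumed raw size of one minute of video
WATCH_HOURS_PER_DAY = 1e9
AVG_BITRATE_MBPS = 2
RENDITIONS = 5
CPU_MIN_PER_VIDEO_HOUR = 30     # per rendition, midpoint of the 20-40 range


def raw_tb_per_minute() -> float:
    """Raw upload volume per wall-clock minute, in TB."""
    video_minutes = UPLOAD_HOURS_PER_MIN * 60
    return video_minutes * RAW_GB_PER_VIDEO_MIN / 1000  # GB -> TB


def stored_eb_per_year(multiplier: float) -> float:
    """Stored volume per year in EB, given the rendition storage multiplier."""
    minutes_per_year = 60 * 24 * 365
    return raw_tb_per_minute() * multiplier * minutes_per_year / 1e6  # TB -> EB


def avg_concurrent_streams() -> float:
    """Watch-hours per day spread over 24 hours gives average concurrency."""
    return WATCH_HOURS_PER_DAY / 24


def avg_egress_tbps() -> float:
    """Average egress in Tbps at the assumed average bitrate."""
    return avg_concurrent_streams() * AVG_BITRATE_MBPS / 1e6  # Mbps -> Tbps


def transcode_cpu_minutes_per_minute() -> int:
    """CPU-minutes of transcoding generated per wall-clock minute."""
    return UPLOAD_HOURS_PER_MIN * RENDITIONS * CPU_MIN_PER_VIDEO_HOUR
```

With these inputs the functions give 30 TB of raw video per minute, about 31.5 to 47.3 EB stored per year for multipliers of 2 and 3, about 41.7 million average concurrent streams, about 83.3 Tbps of average egress, and 75,000 CPU-minutes of transcoding per minute.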
High-Level Architecture
The system has two distinct planes: the upload and processing pipeline (write-heavy, async, latency-tolerant) and the playback pipeline (read-heavy, synchronous, latency-critical). Never mix them — processing load must not starve viewers.
Deep Dive 1 — The Upload Pipeline
A raw video upload can be gigabytes. Sending it as a single HTTP POST is fragile: a 30-second network hiccup fails the whole upload. Instead, the client splits the file into chunks (typically 5–20 MB each) and uploads each chunk independently; the server reassembles them. This enables resumable uploads: if the connection drops while chunk 47 is in flight, the client re-sends chunk 47 and continues from there instead of starting over at chunk 1.
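Google Cloud Storage exposes this as its Resumable Upload protocol, the YouTube Data API supports resumable uploads as well, and AWS offers S3 Multipart Upload. The details differ (S3 numbers the parts explicitly, while GCS tracks byte ranges within an upload session), but the pattern is the same: (1) initiate the upload and receive an upload ID or session URI, (2) upload the parts, (3) complete the upload so that storage holds the stitched object.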
Deep Dive 2 — Transcoding at Scale
One uploaded video must become many files — different resolutions (360p, 480p, 720p, 1080p, 2160p) and different codecs (H.264 for broadest compatibility; VP9/AV1 for ~30–50% better compression). Multiplied across 500 upload-hours per minute, this is a massively parallel compute problem.
YouTube built Transcoder, a proprietary system that breaks each video into segments of a few seconds, distributes segments across thousands of machines in parallel, and re-assembles the results. The key architectural idea: you can parallelize transcoding along the time axis — segment 1 can transcode on machine A while segment 2 transcodes on machine B. This shrinks a 2-hour video's transcode time from ~hours to ~minutes.
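Parallelizing along the time axis can be sketched in a few lines. This is a toy illustration, not YouTube's system: `encode_segment` is a placeholder that returns a label instead of encoding anything, and a thread pool stands in for a fleet of machines. Real pipelines typically cut segments at keyframe boundaries so that each one can be decoded on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(duration_s: float, segment_s: float) -> list[tuple[float, float]]:
    """Return (start, end) time ranges that cover the whole video."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def encode_segment(segment: tuple[float, float], rendition: str) -> str:
    # Placeholder for real work, e.g. running an encoder on one segment.
    start, end = segment
    return f"{rendition}:{start:.0f}-{end:.0f}"


def transcode(duration_s: float, segment_s: float, rendition: str,
              workers: int = 8) -> list[str]:
    """Encode all segments in parallel and return them in playback order."""
    segments = split_into_segments(duration_s, segment_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so merging is a simple concat.
        return list(pool.map(lambda seg: encode_segment(seg, rendition), segments))
```

Because each segment is independent, wall-clock time is bounded by the slowest segment plus the split and merge overhead, rather than by the length of the whole video.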
More generally, the pipeline is modelled as a directed acyclic graph (DAG) of tasks, where each node is a transformation (split → encode → thumbnail → merge → publish). A workflow engine (Apache Airflow, AWS Step Functions, or a custom DAG runner) orchestrates the DAG.
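A custom DAG runner can be very small. The sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+). The task names follow the example in the text; the dependency edges are an assumption made for illustration (here thumbnail generation depends only on the split step), and `run_pipeline` is a hypothetical helper, not part of any workflow engine.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# The edges are assumed for illustration.
PIPELINE = {
    "split": set(),
    "encode": {"split"},
    "thumbnail": {"split"},
    "merge": {"encode"},
    "publish": {"merge", "thumbnail"},
}


def run_pipeline(dag: dict, handlers: dict) -> list[str]:
    """Run every task after its dependencies; return the execution order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        ready = sorter.get_ready()  # tasks whose dependencies are all done
        # A real engine would dispatch the ready tasks to workers in parallel.
        for task in sorted(ready):
            handlers[task]()
            order.append(task)
            sorter.done(task)
    return order
```

With no-op handlers for every task, `run_pipeline(PIPELINE, handlers)` runs `split` first, then `encode` and `thumbnail` (which become ready together), then `merge`, then `publish`.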
Deep Dive 3 — Storage Architecture
Video bytes live in an object store (Amazon S3, Google Cloud Storage, or Google's homegrown Colossus file system, which YouTube uses). Object stores are ideal because:
- Files are immutable once written — no update conflicts, no locking.
- They scale to exabytes with no schema migrations.
- They have a flat key-value interface: `bucket/videoId/720p/segment_042.ts`.
- They support lifecycle policies: move infrequently accessed videos to cheaper cold storage (Glacier, Coldline) automatically.
Metadata (title, description, owner, view count, status, rendition URLs) lives in a relational database (MySQL with read replicas at YouTube). View counts, likes, and comments use a separate counter service that accepts high-write traffic and periodically flushes to the main DB — you cannot afford row-level locking on the videos table for every play event.
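The counter service idea can be sketched as an in-memory buffer that absorbs individual view events and writes to the main database in batches. This is a minimal illustration under assumptions: `BufferedViewCounter` and `flush_fn` are hypothetical names, and flushing is triggered by an event threshold here, although a timer could call `flush()` on a fixed period just as well.

```python
import threading
from collections import Counter


class BufferedViewCounter:
    """Accumulates view events in memory and writes them out in batches."""

    def __init__(self, flush_fn, flush_threshold: int = 1000) -> None:
        self._pending = Counter()
        self._events = 0
        self._lock = threading.Lock()
        self._flush_fn = flush_fn  # e.g. one batched UPDATE per video
        self._flush_threshold = flush_threshold

    def record_view(self, video_id: str) -> None:
        with self._lock:
            self._pending[video_id] += 1
            self._events += 1
            should_flush = self._events >= self._flush_threshold
        if should_flush:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer under the lock, then write outside it so that
        # record_view() is never blocked on the database.
        with self._lock:
            batch, self._pending = self._pending, Counter()
            self._events = 0
        if batch:
            self._flush_fn(dict(batch))
```

The trade-off is durability: counts that are still in the buffer are lost if the process crashes before the next flush, which is usually acceptable for view counts but not for billing-grade data.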
Deep Dive 4 — CDN Delivery and Adaptive Bitrate
Serving tens of terabits per second from a single origin is not feasible. The solution is a Content Delivery Network: a global mesh of hundreds of Points of Presence (PoPs) that cache video segments close to viewers. YouTube is delivered over Google's own edge network, which includes Google Global Cache (GGC) nodes embedded inside ISP networks.
The protocol that enables smooth streaming across variable-bandwidth connections is HTTP Live Streaming (HLS). How it works:
- The HLS packager produces a master manifest (`.m3u8`) listing all available quality levels.
- Each quality level has its own media manifest listing short segments (`.ts` files, typically 6–10 seconds each).
- The player downloads the master manifest, picks an initial quality based on current bandwidth, and starts fetching segments.
- After each segment, the player measures download throughput. If bandwidth drops, it switches to a lower-quality manifest for the next segment — seamlessly, mid-video.
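The steps above can be sketched as a master manifest generator plus a simple throughput-based selection rule. This is a minimal illustration, not a production player: the bitrate ladder, the `name/playlist.m3u8` URI layout, and the 0.8 safety factor are assumptions, and real players also weigh buffer level and switch history.

```python
# (name, bandwidth in bits per second, resolution); values are illustrative.
RENDITIONS = [
    ("360p", 800_000, "640x360"),
    ("720p", 2_800_000, "1280x720"),
    ("1080p", 5_000_000, "1920x1080"),
]


def master_playlist(renditions) -> str:
    """Build an HLS master manifest with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for name, bandwidth, resolution in renditions:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{name}/playlist.m3u8")  # assumed URI layout
    return "\n".join(lines) + "\n"


def throughput_bps(segment_bytes: int, download_seconds: float) -> float:
    """Throughput observed while downloading the last segment."""
    return segment_bytes * 8 / download_seconds


def pick_rendition(measured_bps: float, renditions, safety: float = 0.8) -> str:
    """Highest rendition whose bitrate fits within a fraction of throughput."""
    budget = measured_bps * safety
    affordable = [r for r in renditions if r[1] <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate rather than stalling.
        return min(renditions, key=lambda r: r[1])[0]
    return max(affordable, key=lambda r: r[1])[0]
```

For example, a 1.5 MB segment downloaded in 2 seconds measures 6 Mbps; with the 0.8 safety factor the budget is 4.8 Mbps, so the player would request the next segment from the 720p manifest.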
The CDN caches both the static video segments (long TTL, essentially forever) and the manifests (shorter TTL so quality-level changes propagate). A signed URL or token protects premium content: the API server issues a time-limited, HMAC-signed URL; the CDN edge validates the signature before serving the segment.
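The signed-URL check can be sketched with Python's standard `hmac` module. This is an illustration under assumptions: the `expires` and `sig` query parameter names, the message format, and the hard-coded `SECRET` are all hypothetical, and commercial CDNs each define their own signing scheme.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; in practice it would come from a secret store.
SECRET = b"demo-secret-shared-by-api-and-edge"


def _signature(path: str, expires: int) -> str:
    message = f"{path}?expires={expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(path: str, now: int, ttl_seconds: int = 300) -> str:
    """API server side: issue a URL that is valid for ttl_seconds."""
    expires = now + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: int) -> bool:
    """CDN edge side: accept only unexpired URLs with a valid signature."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _signature(parsed.path, expires)
    # Constant-time comparison avoids leaking the signature byte by byte.
    return hmac.compare_digest(sig.encode(), expected.encode())
```

Because the path is part of the signed message, a token issued for one segment cannot be reused for a different rendition or video, and the edge can validate it without calling back to the API server.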
Trade-offs Summary
| Decision | Chosen approach | Key trade-off accepted |
|---|---|---|
| Upload reliability | Chunked / resumable upload | More client complexity; no failed multi-GB uploads |
| Transcoding latency | Async queue + parallel workers | Video is not instantly available; creator sees processing delay |
| Video storage | Immutable object store | No in-place edits; re-transcode on quality change |
| Streaming protocol | HLS (adaptive bitrate) | Manifests add complexity; better than fixed-bitrate buffering |
| Metadata storage | Relational DB + read replicas | Eventually consistent read replicas; view counts on separate service |
| Delivery | Multi-tier CDN | Cache invalidation complexity; massive bandwidth savings |