System Design Fundamentals

Back-of-the-Envelope Estimation

18 min Lesson 4 of 10

Before you draw a single architecture box, you need to know the scale of the problem you are solving. Back-of-the-envelope estimation is the skill of quickly calculating order-of-magnitude numbers — queries per second, storage needs, bandwidth, and memory — using simple arithmetic and a handful of memorised constants. In a system design interview, and in real engineering planning, these calculations set the constraints that drive every decision that follows.

Why it matters: A system handling 100 QPS and one handling 100,000 QPS look completely different. One fits on a single server; the other needs a load balancer, a cache cluster, and probably a message queue. Getting the scale wrong upfront means designing the wrong system.

The Reference Numbers You Must Memorise

Good estimators do not look numbers up mid-conversation. They keep a small table in their head:

Latency landmarks: L1 cache ~0.5 ns | RAM read ~100 ns | SSD read ~100 µs | HDD seek ~10 ms | intercontinental RTT ~150 ms
Throughput landmarks: SSD sequential ~500 MB/s | network (1 Gbps NIC) ~125 MB/s | typical DB row read ~1 µs of CPU
Data sizes: ASCII char = 1 B | UUID = 36 B as text (16 B binary) | max-length tweet ≈ 280 B | thumbnail ≈ 50 KB | HD photo ≈ 3 MB | 4-min MP3 ≈ 4 MB | 720p video minute ≈ 50 MB
Powers of 2 / 10 equivalences: 2¹⁰ ≈ 10³ (1 KB ≈ 1,000 B), 2²⁰ ≈ 10⁶ (1 MB), 2³⁰ ≈ 10⁹ (1 GB), 2⁴⁰ ≈ 10¹² (1 TB)
Time conversions: 1 day ≈ 86,400 s ≈ 10⁵ s | 1 month ≈ 2.5 × 10⁶ s | 1 year ≈ 3 × 10⁷ s

Rounding is correct behaviour. You are estimating, not invoicing. Round aggressively — 86,400 becomes 100,000, 3.14 becomes 3 — then add a safety factor of 2–3× at the end. Precision at this stage wastes time and creates false confidence.
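To see why aggressive rounding is safe, the short sketch below compares an average rate computed with the exact 86,400 seconds per day against the rounded 100,000. The daily volume of 30 million events and the helper name `average_qps` are illustrative assumptions, not part of any library.

```python
SECONDS_PER_DAY_EXACT = 86_400
SECONDS_PER_DAY_ROUNDED = 100_000  # the 10^5 shortcut from the time conversions


def average_qps(events_per_day: float, seconds_per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / seconds_per_day


events_per_day = 30_000_000  # assumed daily volume, for illustration only

exact = average_qps(events_per_day, SECONDS_PER_DAY_EXACT)
rounded = average_qps(events_per_day, SECONDS_PER_DAY_ROUNDED)
relative_error = abs(exact - rounded) / exact

print(f"exact: {exact:.0f}/s, rounded: {rounded:.0f}/s, error: {relative_error:.0%}")
# prints "exact: 347/s, rounded: 300/s, error: 14%"
```

The rounding error of roughly 14% is small next to the 2–3× safety factor applied at the end, which is why the shortcut rarely changes the architectural conclusion.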

The Four Pillars of Estimation

Every back-of-the-envelope exercise produces four numbers. Let us walk through each one using a concrete example: a Twitter-like microblogging service with 300 million monthly active users (MAU), of whom 10% post once per day and roughly 50% read their timeline on a given day.

1. Queries Per Second (QPS)

QPS is the heartbeat of your system. Everything — thread pools, connection limits, rate limiters, load balancer capacity — is sized against it.

Write QPS (new tweets):

Daily active writers = 300M × 10% = 30 M users
Writes per day       = 30 M × 1 tweet = 30 M tweets/day
Write QPS            = 30,000,000 / 86,400 ≈ 350 writes/s
Peak write QPS       = 350 × 3 (peak factor) ≈ 1,000 writes/s

Read QPS: Assume each active reader loads a 20-tweet timeline 5 times per day, and each load is served by one batched DB or cache read.

Daily active readers = 300M × 50% (rough DAU) = 150 M
Timeline loads/day   = 150 M × 5 loads/reader = 750 M reads/day
Read QPS             = 750,000,000 / 86,400 ≈ 8,700 reads/s
Peak read QPS        = 8,700 × 3 ≈ 26,000 reads/s

A single well-tuned relational database can handle ~10,000 simple reads per second. At 26,000 peak reads per second, we already know we need a read replica or a cache layer — and we have not even talked about the schema yet. That is the power of this calculation.
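The QPS arithmetic above can be captured in a few lines. This is a minimal sketch using the lesson's inputs; the function name `average_and_peak_qps` and the constant `DB_READ_CAPACITY` are hypothetical names chosen for illustration, and the 10,000 reads/s figure is the rule of thumb quoted above rather than a measured limit.

```python
SECONDS_PER_DAY = 86_400


def average_and_peak_qps(events_per_day: float, peak_factor: float = 3.0) -> tuple[float, float]:
    """Return (average QPS, peak QPS) for a daily event count."""
    average = events_per_day / SECONDS_PER_DAY
    return average, average * peak_factor


MAU = 300_000_000
tweets_per_day = MAU * 0.10 * 1          # 10% of users post once per day
timeline_loads_per_day = MAU * 0.50 * 5  # 50% of users load a timeline 5 times

write_avg, write_peak = average_and_peak_qps(tweets_per_day)
read_avg, read_peak = average_and_peak_qps(timeline_loads_per_day)

# Rule-of-thumb capacity of one relational database (assumption from the text).
DB_READ_CAPACITY = 10_000
needs_read_scaling = read_peak > DB_READ_CAPACITY

print(f"writes: {write_avg:,.0f}/s average, {write_peak:,.0f}/s peak")
print(f"reads:  {read_avg:,.0f}/s average, {read_peak:,.0f}/s peak")
print(f"needs replicas or cache: {needs_read_scaling}")
```

Keeping the peak factor as a parameter makes it easy to rerun the estimate with a more pessimistic multiplier.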

2. Storage

Storage estimation tells you what kind of data infrastructure you need and how fast you will grow.

Tweet text (avg 140 chars)     =  140 B
Metadata (user_id, timestamp)  =   30 B
Total per tweet                ≈  170 B  → round to 200 B

Daily storage (new tweets)     = 30 M × 200 B = 6 GB/day
Monthly storage                = 6 GB × 30    = 180 GB/month
5-year storage (text only)     = 180 GB × 60  ≈ 10.8 TB

That 10.8 TB fits on a single large NVMe array, but now add media: suppose 10% of tweets carry a 1 MB photo.

Photo writes/day = 30 M × 10% = 3 M photos
Storage/day      = 3 M × 1 MB = 3 TB/day
5-year photos    = 3 TB × 365 × 5 ≈ 5.5 PB

5.5 petabytes of photos in five years instantly tells us: we need an object store (S3, GCS, or similar), not a relational database BLOB column. Media storage is always separated from transactional storage at scale.
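The storage figures can be reproduced with one multiplication per data type. The sketch below assumes decimal units (1 TB = 10^12 B) and mirrors the day counts used above: 60 months of 30 days for text, 365 × 5 days for photos. The helper name `cumulative_bytes` is hypothetical.

```python
TB = 10**12  # decimal units, matching the 2^40 ≈ 10^12 shortcut
PB = 10**15


def cumulative_bytes(items_per_day: int, bytes_per_item: int, days: int) -> int:
    """Total bytes accumulated when nothing is ever deleted."""
    return items_per_day * bytes_per_item * days


tweets_per_day = 30_000_000

# Text: 200 B per tweet over 60 months of 30 days each.
text_total = cumulative_bytes(tweets_per_day, 200, 30 * 60)

# Media: 10% of tweets carry a 1 MB photo, over 5 years of 365 days.
photos_per_day = tweets_per_day // 10
photo_total = cumulative_bytes(photos_per_day, 1_000_000, 365 * 5)

print(f"text:   {text_total / TB:.1f} TB")
print(f"photos: {photo_total / PB:.1f} PB")
print(f"photos are about {photo_total / text_total:.0f}x the text volume")
```

The ratio in the last line is the useful part: media dominates by more than two orders of magnitude, which is what pushes it into a separate storage tier.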

3. Bandwidth

Bandwidth governs your network infrastructure costs, CDN requirements, and whether a single region is enough.

--- Inbound (upload) ---
Text writes/day   = 6 GB/day    →   6 GB / 86,400 s ≈  70 KB/s inbound (text)
Photo uploads/day = 3 TB/day    →   3 TB / 86,400 s ≈  35 MB/s inbound (media)
Total inbound     ≈ 35 MB/s

--- Outbound (serving reads) ---
Timeline: each load shows 20 tweets at 200 B text each = 4 KB text/load
With thumbnail (50 KB each, 5 per load): 5 × 50 KB = 250 KB/load
Total per load ≈ 254 KB

Loads/s = 8,700 (avg QPS from above)
Outbound = 8,700 × 254 KB ≈ 2.2 GB/s

2.2 GB/s average outbound means you need a CDN for media and likely multiple PoPs (Points of Presence) around the world. A single datacenter network card at 10 Gbps (~1.25 GB/s) would already be saturated — another design driver you discovered purely from arithmetic.
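As a rough check of the bandwidth numbers, the sketch below converts daily volumes into bytes per second and compares average outbound traffic with a 10 Gbps link. The names `bytes_per_second` and `NIC_GBPS` are hypothetical; the inputs are the figures worked out above.

```python
import math

SECONDS_PER_DAY = 86_400


def bytes_per_second(bytes_per_day: float) -> float:
    """Average transfer rate for a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


# Inbound: 6 GB/day of text and 3 TB/day of photos.
inbound_text = bytes_per_second(6e9)
inbound_media = bytes_per_second(3e12)

# Outbound: 20 tweets of 200 B plus 5 thumbnails of 50 KB per timeline load.
bytes_per_load = 20 * 200 + 5 * 50_000
outbound = 8_700 * bytes_per_load  # bytes/s at average read QPS

# Network links are rated in bits per second, so convert before comparing.
outbound_gbps = outbound * 8 / 1e9
NIC_GBPS = 10
links_needed = math.ceil(outbound_gbps / NIC_GBPS)

print(f"inbound text:  {inbound_text / 1e3:.0f} KB/s")
print(f"inbound media: {inbound_media / 1e6:.0f} MB/s")
print(f"outbound:      {outbound / 1e9:.1f} GB/s = {outbound_gbps:.1f} Gbps")
print(f"10 Gbps links needed at average load: {links_needed}")
```

This covers average traffic only; applying the 3× peak factor from the QPS section would raise the link count accordingly.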

4. Memory (Cache Sizing)

Caches are effective only when the hot dataset fits in RAM. The question is: how much RAM do you actually need?

The classic 80/20 rule: 20% of the content generates 80% of the reads. Cache that 20%.

Tweets stored per day     = 30 M
"Hot" tweets (20%)        = 6 M tweets
Size per tweet in cache   ≈ 300 B (text + serialisation overhead)
Memory for hot tweets     = 6 M × 300 B = 1.8 GB/day

Keep 3 days of hot data   = 1.8 × 3 = 5.4 GB → round to 10 GB (safety margin)
Redis instance size       ≈ 10–20 GB RAM (fits on one r5.xlarge with 32 GiB; an r5.large has only 16 GiB)

Under the 80/20 assumption, 10–20 GB of Redis RAM can absorb roughly 80% of the 26,000 peak read QPS, leaving about 5,000 reads/s of cache misses plus the write load for the database. That is the architectural insight the memory estimate delivers.
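The cache sizing can be sketched the same way. The function `cache_bytes` is a hypothetical helper, and the 80% hit rate is an assumption taken directly from the 80/20 rule rather than a measured value; real hit rates depend on the access distribution.

```python
def cache_bytes(items_per_day: int, hot_fraction: float,
                bytes_per_item: int, days_retained: int) -> float:
    """RAM needed to hold the hot subset of recent items."""
    return items_per_day * hot_fraction * bytes_per_item * days_retained


hot_set = cache_bytes(
    items_per_day=30_000_000,
    hot_fraction=0.20,   # 80/20 rule: cache the hottest 20%
    bytes_per_item=300,  # text plus serialisation overhead
    days_retained=3,
)

# If 80% of reads hit the cache (assumption), the rest fall through to the DB.
peak_read_qps = 26_000
assumed_hit_rate = 0.80
db_reads_at_peak = peak_read_qps * (1 - assumed_hit_rate)

print(f"hot set: {hot_set / 1e9:.1f} GB before safety margin")
print(f"DB reads at peak with {assumed_hit_rate:.0%} hit rate: {db_reads_at_peak:,.0f}/s")
```

Varying `assumed_hit_rate` shows how sensitive the database load is to cache effectiveness: at a 50% hit rate the database would see 13,000 reads/s at peak, above the single-server rule of thumb.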

Putting It All Together — The Estimation Summary Diagram

Figure: The estimation pipeline. Starting from user scale, each calculation step reveals a distinct architectural constraint.

A Visual Guide to Data Size Intuition

Figure: Data size ladder. Memorising these reference points lets you instantly classify storage and bandwidth requirements.

Common Mistakes and How to Avoid Them

Forgetting the peak factor. Average QPS is rarely what breaks systems. Always multiply by 2–5× to model peak traffic (events, viral moments, midnight cron bursts).
Ignoring replication overhead. If you write to a primary database that replicates to two replicas, your write amplification is 3×. Factor this into storage and I/O estimates.
Mixing compressed and uncompressed numbers. Be explicit: "3 TB/day uncompressed; with 3:1 compression ratio, we store ~1 TB/day." Never mix the two silently.
Treating DAU as equal to MAU. Typically DAU is 20–50% of MAU for consumer apps. Using MAU for daily calculations inflates every number.
Not stating assumptions. In interviews and engineering docs, state every assumption explicitly — "I am assuming 10% of users post once per day" — so reviewers can challenge the inputs, not your arithmetic.
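Several of these mistakes come down to forgetting a multiplier. The sketch below makes the multipliers explicit parameters so they cannot be dropped silently; the function names `physical_bytes_per_day` and `dau_range` are hypothetical, and the values are the examples from the list above.

```python
def physical_bytes_per_day(logical_bytes: float,
                           compression_ratio: float = 1.0,
                           copies: int = 1) -> float:
    """Bytes actually written to disk after compression and replication."""
    return logical_bytes / compression_ratio * copies


def dau_range(mau: int, low: float = 0.20, high: float = 0.50) -> tuple[float, float]:
    """Plausible DAU bounds for a consumer app, given MAU."""
    return mau * low, mau * high


logical = 3e12  # 3 TB/day uncompressed

compressed = physical_bytes_per_day(logical, compression_ratio=3.0)
# Primary plus two replicas means three copies of every byte.
replicated = physical_bytes_per_day(logical, compression_ratio=3.0, copies=3)

dau_low, dau_high = dau_range(300_000_000)

print(f"compressed only:        {compressed / 1e12:.1f} TB/day")
print(f"compressed, 3 copies:   {replicated / 1e12:.1f} TB/day")
print(f"DAU between {dau_low / 1e6:.0f} M and {dau_high / 1e6:.0f} M")
```

In this example compression and replication happen to cancel out, which is exactly the kind of coincidence that hides an omitted factor until one of the two changes.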

Estimates are hypotheses, not facts. Real traffic patterns, user behaviour, and data distributions almost always differ from estimates. Treat your numbers as inputs to architecture decisions and as triggers for capacity alerts — not as precise targets. Build monitoring and autoscaling so the system self-adjusts when reality diverges from the estimate.

The Estimation in an Interview

In a 45-minute system design interview, spend 5–8 minutes on estimation. The goal is not to get a mathematically perfect answer — it is to demonstrate structured thinking. A good pattern:

State the scale input: "With 300 M MAU and 10% posting daily…"
Derive write QPS, then read QPS (with assumed read/write ratio).
Size storage for 5 years. Note whether media separates into an object store.
Sanity-check bandwidth. Note CDN requirement if outbound exceeds ~1 GB/s.
Size the cache. State the hot-data percentage assumption.
Summarise: "So we are looking at roughly 1K write QPS, 26K read QPS, ~10 TB text storage, ~5 PB media over 5 years, 2+ GB/s outbound, and a 10–20 GB cache."

That six-number summary already tells an experienced engineer what the architecture must include before you sketch a single box.
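One way to keep assumptions explicit and open to challenge is to hold them in a single structure and derive the headline numbers from it. The sketch below is an illustration only: the `Assumptions` dataclass, its field names, and `summarise` are all hypothetical, with defaults taken from this lesson's example. It uses 365-day years throughout, so the text storage figure comes out slightly above the 10.8 TB computed earlier with 30-day months.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Assumptions:
    mau: int = 300_000_000
    posting_percent: int = 10   # % of MAU posting once per day
    reader_percent: int = 50    # % of MAU loading timelines each day
    loads_per_reader: int = 5
    peak_factor: int = 3
    text_bytes: int = 200       # per tweet, rounded up
    photo_percent: int = 10     # % of tweets carrying a photo
    photo_bytes: int = 1_000_000
    load_bytes: int = 254_000   # text plus thumbnails per timeline load
    years: int = 5


def summarise(a: Assumptions) -> dict[str, float]:
    """Derive the headline estimates from one set of stated assumptions."""
    tweets_per_day = a.mau * a.posting_percent // 100
    loads_per_day = a.mau * a.reader_percent // 100 * a.loads_per_reader
    photos_per_day = tweets_per_day * a.photo_percent // 100
    days = 365 * a.years
    read_avg = loads_per_day / SECONDS_PER_DAY
    return {
        "peak_write_qps": tweets_per_day / SECONDS_PER_DAY * a.peak_factor,
        "peak_read_qps": read_avg * a.peak_factor,
        "text_tb": tweets_per_day * a.text_bytes * days / 1e12,
        "media_pb": photos_per_day * a.photo_bytes * days / 1e15,
        "outbound_gb_per_s": read_avg * a.load_bytes / 1e9,
    }


baseline = summarise(Assumptions())
# A reviewer who doubts the posting rate can change one input and rerun.
fewer_posters = summarise(Assumptions(posting_percent=5))

for name, value in baseline.items():
    print(f"{name}: {value:,.1f}")
```

Changing a single field and rerunning shows which conclusions depend on which inputs: halving `posting_percent` halves write QPS and storage but leaves read QPS and outbound bandwidth untouched.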