Real-World System Design Case Studies

How to Approach a Case Study

A system design case study — whether in a technical interview or a real engineering project — is not a test of memorisation. It is a structured conversation about trade-offs. The engineer who "wins" is not the one who recites a correct architecture, but the one who methodically clarifies constraints, makes explicit decisions, and defends those decisions with reasoning.

This lesson gives you a repeatable, five-phase framework you can apply to every case study in this tutorial and to every design problem you face on the job.

The Five-Phase Framework

Think of every design session as moving through these five phases in order:

Clarify requirements — functional and non-functional
Estimate scale — traffic, storage, bandwidth
Define the API — what the system promises its callers
Design the high-level architecture — components and data flow
Deep-dive and justify trade-offs — the decisions that make or break the design

Figure: The five-phase framework. Move left to right, but revisit earlier phases when you discover new constraints.

Phase 1 — Clarify Requirements

Every case study starts with an intentionally vague prompt: "Design Twitter" or "Design a URL shortener". Resist the urge to jump straight to architecture. Spend the first few minutes asking clarifying questions. Requirements fall into two buckets:

Functional requirements (FR): what the system does. Example for a URL shortener: shorten a URL, redirect to the original, let users view click statistics.
Non-functional requirements (NFR): how well the system performs. Typical NFRs: availability (99.9% uptime ≈ 8.8 h downtime/year), latency (redirect in < 10 ms p99), consistency (eventual vs strong), durability (no data loss), security (authentication).
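The downtime budget behind an availability target is simple arithmetic, and it helps to be able to reproduce it on the spot. A minimal sketch, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative, not part of any library:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


# Compare a few common targets ("two nines", "three nines", "four nines").
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} h/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% tends to change the architecture rather than just the configuration.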

Key idea: Non-functional requirements usually drive architecture more than functional ones. A URL shortener that must redirect in under 10 ms p99 globally demands CDN edge caches and a different storage strategy than one that accepts 200 ms.

Write your agreed requirements on a whiteboard or doc before drawing any boxes. In an interview, this step also signals to the interviewer that you communicate before you code — a critical senior-engineer trait.

Phase 2 — Estimate Scale

Back-of-envelope numbers constrain your design choices. You do not need precise figures; an order-of-magnitude estimate is enough to determine whether you need a single database, horizontal sharding, or a distributed cache.

A practical template (using a URL shortener as example):

Users: 100 M monthly active users (MAU) → ~30 M daily active users (30 % DAU/MAU ratio)
Write QPS: 100 new URLs/second at peak
Read QPS: 10 000 redirects/second (100:1 read:write ratio)
Storage: 100 bytes/row × 100 URLs/s × 86 400 s/day × 365 days × 5 years ≈ 1.6 TB (an upper bound, since it assumes the peak write rate is sustained)
Bandwidth (reads): 10 000 req/s × 500 B avg response ≈ 5 MB/s egress

These numbers immediately tell you two things. You need read replicas or a cache layer, because 10 k read QPS from one DB is risky. Storage is the smaller concern: roughly 1.6 TB over five years still fits on a single large database node, although indexes, replicas, and growth headroom are reasons to plan for table sharding or a managed NoSQL store.
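The arithmetic in the template can be checked with a few lines of Python. This is a minimal sketch: the function name `estimate_scale` and its parameters are illustrative, decimal units are assumed (1 TB = 10^12 bytes), and the write rate is treated as sustained, which overstates storage if 100 URLs/s is only the peak.

```python
SECONDS_PER_DAY = 86_400


def estimate_scale(write_qps, read_write_ratio, row_bytes, response_bytes, years):
    """Back-of-envelope estimate of read QPS, storage, and read egress."""
    read_qps = write_qps * read_write_ratio
    # Every write adds one row; assume the write rate holds for the whole period.
    storage_bytes = row_bytes * write_qps * SECONDS_PER_DAY * 365 * years
    egress_bytes_per_s = read_qps * response_bytes
    return {
        "read_qps": read_qps,
        "storage_tb": storage_bytes / 1e12,            # decimal terabytes
        "egress_mb_per_s": egress_bytes_per_s / 1e6,   # decimal megabytes
    }


# Inputs taken from the URL-shortener template.
estimate = estimate_scale(
    write_qps=100, read_write_ratio=100, row_bytes=100, response_bytes=500, years=5
)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

With the template's inputs this works out to roughly 1.6 TB of storage over five years and about 5 MB/s of read egress. Changing one input at a time (for example a 1 KB row) shows quickly which assumption the design is most sensitive to.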

Interview tip: State your assumptions aloud ("I am assuming a 100:1 read-to-write ratio"). The interviewer will correct you if they have something different in mind, and it shows rigorous thinking.

Phase 3 — Define the API

Before drawing components, define what the system promises its clients. This prevents scope creep and makes Phase 4 concrete. Keep it simple — just the verb, the path, the key inputs, and the outputs:

POST /shorten
  body: { original_url, custom_alias?, ttl_days? }
  response: { short_code, short_url, expires_at }

GET /{short_code}
  response: 301/302 redirect to original_url

GET /stats/{short_code}
  response: { total_clicks, last_clicked_at, top_countries[] }

Defining the API surface also reveals hidden requirements: the custom_alias parameter forces your storage key scheme to handle collisions differently than auto-generated codes.
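To make the collision point concrete, here is a small in-memory sketch of the difference. Everything in it is an assumption for illustration: the class `InMemoryShortener`, the exception `AliasTakenError`, the 7-character base62 code, and the dictionary standing in for the real datastore.

```python
import secrets
import string
from typing import Optional

ALPHABET = string.ascii_letters + string.digits  # 62 characters (base62)


class AliasTakenError(Exception):
    """Raised when a caller-chosen alias is already in use."""


class InMemoryShortener:
    def __init__(self, code_length: int = 7):
        self._urls = {}  # short_code -> original_url (stand-in for the database)
        self._code_length = code_length

    def shorten(self, original_url: str, custom_alias: Optional[str] = None) -> str:
        if custom_alias is not None:
            # A custom-alias collision must be reported: the caller asked for
            # this exact key, so we cannot silently pick another one.
            if custom_alias in self._urls:
                raise AliasTakenError(custom_alias)
            self._urls[custom_alias] = original_url
            return custom_alias
        # An auto-generated collision is invisible to the caller: just retry.
        while True:
            code = "".join(
                secrets.choice(ALPHABET) for _ in range(self._code_length)
            )
            if code not in self._urls:
                self._urls[code] = original_url
                return code

    def resolve(self, short_code: str) -> Optional[str]:
        """Return the original URL, or None if the code is unknown."""
        return self._urls.get(short_code)
```

In a real store the check-then-insert would need to be atomic (for example a unique-key constraint or a conditional write), because two concurrent requests could otherwise claim the same code. That is exactly the kind of follow-up question the API definition tends to surface.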

Phase 4 — High-Level Architecture

Now draw the components. A good first pass has: clients, a load balancer, application servers, a cache, a primary database, and any async workers. Connect them with arrows labelled with the protocol (HTTPS, gRPC, AMQP). Every arrow is a potential failure point and a conversation topic.

Figure: A generic starting-point architecture with clients, a CDN, a load balancer, stateless app servers, a cache, a primary DB with a read replica, and an async queue for side effects.

Start generic, then specialise. Ask: Where is the bottleneck? For the URL shortener it is the redirect path (10 k QPS). The cache hit rate for that path is what you need to defend.
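One way to defend a hit rate is to show what it means for the database. The sketch below assumes the simplest possible model, in which every cache miss becomes exactly one database read; the function name `db_qps_after_cache` is illustrative.

```python
def db_qps_after_cache(read_qps: float, hit_rate: float) -> float:
    """Reads per second that still reach the database at a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return read_qps * (1.0 - hit_rate)


# The redirect path from the URL-shortener example: 10 000 reads/second.
for hit_rate in (0.80, 0.95, 0.99):
    remaining = db_qps_after_cache(10_000, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {remaining:,.0f} QPS reach the DB")
```

Under this model, moving from an 80% to a 99% hit rate cuts database load from about 2 000 to about 100 reads per second, so the hit rate assumption matters more than most other numbers in the design.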

Phase 5 — Deep-Dive and Justify Trade-offs

Pick two or three components that are architecturally interesting and go deep. Show you understand the trade-offs. Classic deep-dive topics per problem type:

Storage: SQL vs NoSQL — ACID guarantees vs horizontal write scale. Which does this problem actually need?
Caching: Cache-aside vs write-through; TTL choice; what happens on a cache miss under high load (thundering herd)?
Consistency: Strong (sync replication, higher latency) vs eventual (async, faster, but stale reads). Which operations can tolerate staleness?
Failure modes: What happens when the cache goes down? When the primary DB is unavailable? Does the system degrade gracefully?
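The caching questions above can be explored with a small cache-aside sketch that also limits the thundering herd by letting only one thread per key load from the database. This is an illustration under stated assumptions, not a production cache: `CacheAsideStore` and `load_from_db` are hypothetical names, the cache is a plain in-process dictionary, and there is no TTL or eviction.

```python
import threading
from typing import Callable, Optional


class CacheAsideStore:
    """Cache-aside reads with a per-key lock so one miss triggers one DB load."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]):
        self._load_from_db = load_from_db   # hypothetical database accessor
        self._cache = {}                    # key -> cached value
        self._key_locks = {}                # key -> lock (never pruned in this sketch)
        self._registry_lock = threading.Lock()

    def _lock_for(self, key: str):
        # Guard the lock registry so every thread gets the same lock per key.
        with self._registry_lock:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> Optional[str]:
        value = self._cache.get(key)
        if value is not None:
            return value  # cache hit: no lock, no database call
        with self._lock_for(key):
            # Re-check: another thread may have filled the cache while we waited.
            value = self._cache.get(key)
            if value is not None:
                return value
            value = self._load_from_db(key)
            if value is not None:
                self._cache[key] = value  # populate on miss (cache-aside)
            return value
```

Note what the sketch leaves open: unknown keys are not cached, so repeated lookups for a missing code still hit the database every time, and a cache shared across app servers would need a distributed equivalent of the per-key lock. Both are reasonable deep-dive topics.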

Common pitfall: Jumping straight to microservices. Start with a monolith or a small number of services. Prematurely decomposing adds operational complexity before you understand the real bottlenecks. You can always split later; merging microservices back is painful.

Putting It All Together — A One-Page Cheat Sheet

Before every case study in this tutorial, re-read these five questions. They keep you on track:

What are the top 3 functional requirements? (The must-haves, not the nice-to-haves)
What is the read:write ratio and the peak QPS? (Drives caching and sharding decisions)
What is the latency SLA for the critical path? (Drives where you put the cache and how you route traffic)
What is the consistency requirement? (Drives whether you use sync replication, two-phase commit, or eventual consistency)
What is the single biggest failure risk? (The component whose loss causes the most user pain — make it redundant first)

Key idea: In a 45-minute interview, you will not finish a production-grade design. That is fine — the interviewer is watching your process, not your final diagram. A clear framework executed methodically scores higher than a brilliant but unexplained architecture.