How to Approach a Case Study
A system design case study — whether in a technical interview or a real engineering project — is not a test of memorisation. It is a structured conversation about trade-offs. The engineer who "wins" is not the one who recites a correct architecture, but the one who methodically clarifies constraints, makes explicit decisions, and defends those decisions with reasoning.
This lesson gives you a repeatable, five-phase framework you can apply to every case study in this tutorial and to every design problem you face on the job.
The Five-Phase Framework
Think of every design session as moving through these five phases in order:
- Clarify requirements — functional and non-functional
- Estimate scale — traffic, storage, bandwidth
- Define the API — what the system promises its callers
- Design the high-level architecture — components and data flow
- Deep-dive and justify trade-offs — the decisions that make or break the design
[Figure: visual overview of the five-phase flow]
Phase 1 — Clarify Requirements
Every case study starts with an intentionally vague prompt: "Design Twitter" or "Design a URL shortener". Resist the urge to jump straight to architecture. Spend the first few minutes asking clarifying questions. Requirements fall into two buckets:
- Functional requirements (FR): what the system does. Example for a URL shortener: shorten a URL, redirect to the original, let users view click statistics.
- Non-functional requirements (NFR): how the system performs. Typical NFRs: availability (99.9% uptime ≈ 8.76 h downtime/year), latency (redirect in < 10 ms p99), consistency (eventual vs strong), durability (no data loss), security (authentication).
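To make an availability target concrete, it helps to convert it into a downtime budget. The following is a minimal sketch of that conversion; the function name and the list of targets are illustrative assumptions, and it ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8 760 h, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only; agree on the real one with your stakeholders.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_hours_per_year(target):.2f} h/year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the availability target is worth pinning down before any architecture is drawn.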
Write your agreed requirements on a whiteboard or doc before drawing any boxes. In an interview, this step also signals to the interviewer that you communicate before you code — a critical senior-engineer trait.
Phase 2 — Estimate Scale
Back-of-envelope numbers constrain your design choices. You do not need precise figures; an order-of-magnitude estimate is enough to determine whether you need a single database, horizontal sharding, or a distributed cache.
A practical template (using a URL shortener as example):
- Users: 100 M monthly active users (MAU) → ~30 M daily active users (30 % DAU ratio)
- Write QPS: 100 new URLs/second at peak
- Read QPS: 10 000 redirects/second (100:1 read:write ratio)
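- Storage: 100 bytes/row × 100 URLs/s × 86 400 s/day × 365 days × 5 years ≈ 1.6 TB (an upper bound, since it treats the peak write rate as sustained)
- Bandwidth (reads): 10 000 req/s × 500 B avg response ≈ 5 MB/s egress
These numbers immediately tell you two things. First, you need read replicas or a cache layer, because serving 10 k read QPS from a single database is risky. Second, roughly 1.6 TB over five years fits on a single well-provisioned database node, so table sharding or a managed NoSQL store is a growth option rather than a day-one requirement.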
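Back-of-envelope arithmetic is easy to get wrong by a factor of ten or a thousand, so it can be worth recomputing the derived figures from the stated inputs. The sketch below does that for the template above; the `Workload` class and its field names are hypothetical, and it assumes the peak write rate is sustained for the whole retention period.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365


@dataclass(frozen=True)
class Workload:
    """Inputs for a rough capacity estimate (all names are illustrative)."""

    monthly_active_users: int
    dau_ratio: float          # fraction of MAU active on a given day
    write_qps: int            # new rows per second at peak
    read_write_ratio: int     # reads per write
    row_bytes: int            # stored bytes per row
    response_bytes: int       # average response size for a read
    retention_years: int

    @property
    def daily_active_users(self) -> int:
        return round(self.monthly_active_users * self.dau_ratio)

    @property
    def read_qps(self) -> int:
        return self.write_qps * self.read_write_ratio

    @property
    def storage_bytes(self) -> int:
        # Upper bound: treats the peak write rate as if it were sustained.
        seconds = SECONDS_PER_DAY * DAYS_PER_YEAR * self.retention_years
        return self.row_bytes * self.write_qps * seconds

    @property
    def egress_bytes_per_second(self) -> int:
        return self.read_qps * self.response_bytes


# The URL shortener inputs from the template above.
URL_SHORTENER = Workload(
    monthly_active_users=100_000_000,
    dau_ratio=0.30,
    write_qps=100,
    read_write_ratio=100,
    row_bytes=100,
    response_bytes=500,
    retention_years=5,
)

if __name__ == "__main__":
    w = URL_SHORTENER
    print(f"DAU:     {w.daily_active_users / 1e6:.0f} M")
    print(f"Reads:   {w.read_qps} QPS")
    print(f"Storage: {w.storage_bytes / 1e12:.2f} TB")
    print(f"Egress:  {w.egress_bytes_per_second / 1e6:.1f} MB/s")
```

Keeping the inputs in one place also makes it cheap to ask "what if" questions, such as a tenfold increase in row size or write rate.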
Phase 3 — Define the API
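Before drawing components, define what the system promises its clients. This prevents scope creep and makes Phase 4 concrete. Keep it simple: just the verb, the path, the key inputs, and the outputs. For the URL shortener, that means one call to create a short URL (with an optional custom_alias), one to redirect a short code to the original URL, and one to read click statistics.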
Defining the API surface also reveals hidden requirements: the custom_alias parameter forces your storage key scheme to handle collisions differently than auto-generated codes.
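One way to pin down an API surface like this is to write it as typed function signatures with a trivial in-memory implementation behind them. The sketch below is an assumption-laden illustration, not a prescribed design: the HTTP paths in the comments, the class and method names, and the `u<N>` code format are all hypothetical. It does show why `custom_alias` matters: a caller-chosen alias can collide with an existing key and must be rejected, while the generator has to skip codes an alias has already claimed.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


class AliasTakenError(Exception):
    """Raised when a requested custom_alias is already in use."""


@dataclass
class ShortUrl:
    code: str
    long_url: str
    clicks: int = 0


class UrlShortener:
    """In-memory stand-in for the three calls (paths are hypothetical)."""

    def __init__(self) -> None:
        self._by_code: Dict[str, ShortUrl] = {}
        self._counter = itertools.count(1)

    # POST /urls  {long_url, custom_alias?} -> {code}
    def shorten(self, long_url: str, custom_alias: Optional[str] = None) -> str:
        if custom_alias is not None:
            if custom_alias in self._by_code:
                raise AliasTakenError(custom_alias)
            code = custom_alias
        else:
            code = self._next_free_code()
        self._by_code[code] = ShortUrl(code, long_url)
        return code

    # GET /{code} -> redirect to long_url
    def resolve(self, code: str) -> str:
        entry = self._by_code[code]  # KeyError stands in for "not found"
        entry.clicks += 1
        return entry.long_url

    # GET /urls/{code}/stats -> {clicks}
    def stats(self, code: str) -> int:
        return self._by_code[code].clicks

    def _next_free_code(self) -> str:
        # Skip generated codes that a custom alias has already claimed.
        while True:
            code = f"u{next(self._counter)}"
            if code not in self._by_code:
                return code
```

A real service would replace the dictionary with durable storage and the counter with a collision-resistant code generator; the point here is only that the signatures force those questions to the surface.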
Phase 4 — High-Level Architecture
Now draw the components. A good first pass has: clients, a load balancer, application servers, a cache, a primary database, and any async workers. Connect them with arrows labelled with the protocol (HTTPS, gRPC, AMQP). Every arrow is a potential failure point and a conversation topic.
Start generic, then specialise. Ask: Where is the bottleneck? For the URL shortener it is the redirect path (10 k QPS). The cache hit rate for that path is what you need to defend.
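To defend a cache hit rate, it helps to show what each candidate rate leaves for the database to absorb. This is a minimal sketch of that arithmetic; the function name and the hit rates tried are assumptions for illustration, not measured values.

```python
def database_read_qps(total_read_qps: float, cache_hit_rate: float) -> float:
    """Return the read QPS that misses the cache and reaches the database."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return total_read_qps * (1.0 - cache_hit_rate)


if __name__ == "__main__":
    redirect_qps = 10_000  # the redirect path from the scale estimate
    for hit_rate in (0.0, 0.90, 0.99):  # hypothetical hit rates
        load = database_read_qps(redirect_qps, hit_rate)
        print(f"hit rate {hit_rate:.0%} -> {load:,.0f} QPS on the database")
```

The relationship is linear in the miss rate, so moving from a 90% to a 99% hit rate reduces database load tenfold. It also shows the failure case: if the cache is lost, the hit rate drops to zero and the database sees the full redirect traffic.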
Phase 5 — Deep-Dive and Justify Trade-offs
Pick two or three components that are architecturally interesting and go deep. Show you understand the trade-offs. Classic deep-dive topics:
- Storage: SQL vs NoSQL — ACID guarantees vs horizontal write scale. Which does this problem actually need?
- Caching: Cache-aside vs write-through; TTL choice; what happens on a cache miss under high load (thundering herd)?
- Consistency: Strong (sync replication, higher latency) vs eventual (async, faster, but stale reads). Which operations can tolerate staleness?
- Failure modes: What happens when the cache goes down? When the primary DB is unavailable? Does the system degrade gracefully?
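As an example of going deep on one of these topics, the sketch below shows cache-aside with a TTL plus one common mitigation for the thundering herd: a per-key lock so that concurrent misses for the same key trigger a single load. It is a single-process illustration under stated assumptions; the `CacheAside` class and the `loader` callback are hypothetical names, and a distributed cache would need a different coordination mechanism.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Cache-aside with TTL; concurrent misses on one key share a single load."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float) -> None:
        self._loader = loader            # stands in for the database lookup
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()   # protects _key_locks

    def _fresh(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        return None

    def get(self, key: str) -> str:
        value = self._fresh(key)
        if value is not None:
            return value                 # cache hit: no locking needed
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the entry while we waited.
            value = self._fresh(key)
            if value is None:
                value = self._loader(key)
                self._entries[key] = (value, time.monotonic() + self._ttl)
            return value


if __name__ == "__main__":
    load_count = []

    def slow_lookup(code: str) -> str:
        load_count.append(code)
        time.sleep(0.05)                 # simulate database latency
        return "https://example.com/" + code

    cache = CacheAside(slow_lookup, ttl_seconds=60)
    workers = [threading.Thread(target=cache.get, args=("abc",)) for _ in range(20)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(f"20 concurrent requests, {len(load_count)} database load(s)")
```

The trade-off to discuss is latency versus load: waiting threads block for the duration of one load instead of each hitting the database, which protects the database at the cost of holding requests open during a miss.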
Putting It All Together — A One-Page Cheat Sheet
Before every case study in this tutorial, re-read these five questions. They keep you on track:
- What are the top 3 functional requirements? (The must-haves, not the nice-to-haves)
- What is the read:write ratio and the peak QPS? (Drives caching and sharding decisions)
- What is the latency SLA for the critical path? (Drives where you put the cache and how you route traffic)
- What is the consistency requirement? (Drives whether you use sync replication, two-phase commit, or eventual consistency)
- What is the single biggest failure risk? (The component whose loss causes the most user pain — make it redundant first)