System Design Fundamentals

Key Metrics: Latency, Throughput & Availability

Before you can design a system, you need a shared vocabulary for measuring how well it performs. Three numbers dominate every system-design conversation: latency, throughput, and availability. Understanding what they mean — and the hard trade-offs between them — is the foundation of every architectural decision you will ever make.

Latency: How Fast Is a Single Request?

Latency is the time elapsed from the moment a client sends a request to the moment it receives the full response. It is measured in milliseconds (ms) or microseconds (µs). Lower is better.

A few reference points that every engineer should have memorised:

L1 cache read: ~1 ns
L2 cache read: ~4 ns
RAM read: ~100 ns
SSD random read: ~100 µs (0.1 ms)
HDD seek: ~10 ms
Round-trip within same data-centre: ~0.5 ms
Cross-region round-trip (e.g. US → EU): ~150 ms
Mobile 4G round-trip: ~50–100 ms

Key idea: Disk and network access are orders of magnitude slower than memory. When a system is slow, the bottleneck is very often one of these two: excessive disk I/O or unnecessary network round-trips.
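To make "orders of magnitude" concrete, the sketch below expresses some of the reference points above as multiples of one RAM read. The dictionary and the function name slowdown_vs_ram are illustrative choices, and the figures are the same rough approximations listed above, not measurements.

```python
# Rough reference latencies from the list above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache read": 1,
    "L2 cache read": 4,
    "RAM read": 100,
    "SSD random read": 100_000,              # ~100 µs
    "Same data-centre round-trip": 500_000,  # ~0.5 ms
    "HDD seek": 10_000_000,                  # ~10 ms
    "Cross-region round-trip": 150_000_000,  # ~150 ms
}


def slowdown_vs_ram(name: str) -> float:
    """Return how many RAM reads fit into one operation of the given kind."""
    return LATENCY_NS[name] / LATENCY_NS["RAM read"]


for name in LATENCY_NS:
    print(f"{name:<30} {slowdown_vs_ram(name):>14,.2f}x a RAM read")
```

On these figures, one SSD random read costs as much as about a thousand RAM reads, and one cross-region round-trip as much as about 1.5 million.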

Percentiles: Why the Average Lies

Reporting average (mean) latency is one of the most dangerous habits in engineering. Imagine 99 requests that complete in 10 ms and one that takes 10 000 ms. The average is ~110 ms, yet 99 % of users are fine and one user is furious. The average describes neither group: it is about eleven times slower than what the 99 users saw and roughly ninety times faster than what the unlucky one saw.
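The arithmetic in this example can be checked with a few lines of standard-library Python. The variable names are illustrative only.

```python
import statistics

# 99 fast requests at 10 ms and one slow request at 10 000 ms.
latencies_ms = [10] * 99 + [10_000]

mean_ms = statistics.mean(latencies_ms)      # (99 * 10 + 10 000) / 100
median_ms = statistics.median(latencies_ms)  # the typical request
faster_than_mean = sum(1 for x in latencies_ms if x < mean_ms)

print(f"mean   = {mean_ms:.1f} ms")
print(f"median = {median_ms:.1f} ms")
print(f"{faster_than_mean} of {len(latencies_ms)} requests were faster than the mean")
```

The mean lands at 109.9 ms while the median stays at 10 ms, so the mean matches the experience of none of the 100 requests.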

The industry uses percentiles instead:

p50 — the median; 50 % of requests are faster than this value.
p95 — 95 % of requests are faster. 5 % are slower.
p99 — 99 % of requests are faster. Only 1 in 100 is slower.
p999 — 99.9 % of requests are faster. The worst 1-in-1000.
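A minimal sketch of how such percentiles can be computed from raw samples, using the nearest-rank definition. This is one of several common definitions; monitoring tools often interpolate or estimate from histogram buckets instead, so their numbers can differ slightly. The function name percentile and the sample data are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # round() guards against floating-point noise such as 999.0000000000001.
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


# Illustrative data: 1 000 requests taking 1 ms, 2 ms, ..., 1 000 ms.
latencies_ms = list(range(1, 1001))

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

On this evenly spread sample, p50 is 500 ms, p95 is 950 ms, p99 is 990 ms, and p99.9 is 999 ms.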

In production you will hear teams say things like "our p99 is 200 ms." That means 1 % of requests take longer than 200 ms. At 10 000 requests per second, that is 100 slow requests every second.

Best practice: Always track p99 (and sometimes p999) in your dashboards. Optimise for p99 first, not the mean. Tools like Prometheus, Datadog, and New Relic all support percentile histograms out of the box.

A cumulative distribution of request latency. The p99 and p999 "tails" reveal the slow outliers that the average hides.

Throughput: How Much Work Can the System Do?

Throughput is the number of requests (or operations, or bytes) a system can process per unit of time. Common units are requests per second (RPS), queries per second (QPS), or bytes per second (B/s). Higher is better.

Throughput tells you about capacity. A system that handles 500 RPS at p99 = 50 ms is a very different beast from one that handles 50 000 RPS at the same latency. You need both numbers.
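One way to collect both numbers from the same run is sketched below: call a handler repeatedly, time each call, and derive throughput and latency percentiles from those timings. The names measure and fake_handler are hypothetical, the calls are sequential (no concurrency), and a real load test would use a dedicated tool, so treat this as an illustration of the idea only.

```python
import math
import time


def measure(handler, requests):
    """Run handler sequentially over requests.

    Returns (throughput_rps, p50_ms, p99_ms) taken from the same run.
    """
    requests = list(requests)
    if not requests:
        raise ValueError("need at least one request")

    durations_ms = []
    run_start = time.perf_counter()
    for request in requests:
        call_start = time.perf_counter()
        handler(request)
        durations_ms.append((time.perf_counter() - call_start) * 1000)
    elapsed_s = time.perf_counter() - run_start

    durations_ms.sort()

    def nearest_rank(p):
        # Nearest-rank percentile over the sorted per-call durations.
        rank = max(1, math.ceil(round(p * len(durations_ms) / 100, 9)))
        return durations_ms[rank - 1]

    return len(durations_ms) / elapsed_s, nearest_rank(50), nearest_rank(99)


def fake_handler(request):
    # Hypothetical stand-in for real request handling.
    return sum(range(1_000))


rps, p50_ms, p99_ms = measure(fake_handler, range(500))
print(f"throughput = {rps:,.0f} RPS, p50 = {p50_ms:.3f} ms, p99 = {p99_ms:.3f} ms")
```

The actual figures depend entirely on the machine running the code, which is why no expected output is shown.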

The Latency vs. Throughput Trade-off

Latency and throughput are related, but improving one often costs you the other. Batching is the classic illustration: if you send database writes individually, each write is fast (low latency) but total throughput is limited by round-trip overhead. If you buffer and batch 1 000 writes into a single transaction, throughput skyrockets, but each individual write may now wait for up to the whole batch window, so latency increases.
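The batching trade-off can be illustrated with a toy cost model. Everything here is an assumption made for the sake of the example: each transaction pays a fixed 1 ms round-trip plus 0.01 ms per write, writes arrive evenly at 10 000 per second, and a batch is sent as soon as it is full. Real databases behave less neatly.

```python
def batch_model(batch_size, round_trip_ms=1.0, per_write_ms=0.01,
                arrivals_per_s=10_000):
    """Toy model of batched writes (all parameters are assumed values).

    Returns (max_throughput_writes_per_s, worst_case_latency_ms).
    """
    # Cost of one transaction carrying batch_size writes.
    transaction_ms = round_trip_ms + per_write_ms * batch_size
    # Upper bound on writes per second if transactions run back to back.
    throughput = batch_size / transaction_ms * 1000
    # The first write in a batch waits for the other batch_size - 1 to arrive.
    fill_wait_ms = (batch_size - 1) / arrivals_per_s * 1000
    return throughput, fill_wait_ms + transaction_ms


for size in (1, 10, 100, 1000):
    throughput, latency_ms = batch_model(size)
    print(f"batch={size:>4}  throughput={throughput:>9,.0f} writes/s  "
          f"worst-case latency={latency_ms:>7.2f} ms")
```

Under these assumptions, unbatched writes top out near 990 writes per second at about 1 ms each, while batches of 1 000 reach roughly 90 000 writes per second at the price of a worst-case latency of about 111 ms.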

Common pitfall: Never optimise only one dimension. A system with 1 ms latency that can only handle 10 RPS is useless in production. A system that saturates at 100 000 RPS but has a p99 of 30 seconds will kill your user experience. Design for both.

Availability: The Nines

Availability is the fraction of time a system is correctly serving requests. It is expressed as a percentage — usually written as the famous "nines":

The "nines" of availability and their real downtime budgets. Most SLAs target 99.9% to 99.99%.

Notice how each additional nine cuts the downtime budget by a factor of ten, and typically costs far more engineering effort than the previous one. Going from 99 % to 99.9 % means shrinking allowable downtime from about 3.65 days to about 8.8 hours per year. Going from 99.99 % to 99.999 % leaves only about 5 minutes of downtime per year, which typically calls for active-active multi-region setups, automated failover in seconds, and relentless chaos engineering.
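The downtime budget behind each level of "nines" is simple arithmetic, sketched below for a 365-day year. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525 600, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed at the given availability."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


for availability in (99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability:>7} %  ->  {minutes:>9.2f} min/year "
          f"({minutes / 60:.2f} hours)")
```

This gives roughly 3.65 days at 99 %, 8.76 hours at 99.9 %, 52.6 minutes at 99.99 %, and 5.26 minutes at 99.999 %.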

SLI, SLO, and SLA

Three terms you will encounter constantly:

SLI (Service Level Indicator): The actual measured metric — e.g. "our p99 latency was 180 ms last hour."
SLO (Service Level Objective): The internal target — e.g. "p99 latency must be below 200 ms, measured over a 30-day rolling window."
SLA (Service Level Agreement): A legal contract with a customer — e.g. "we guarantee 99.9 % availability; if we miss it, you receive service credits."

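Key idea: SLOs must be tighter than SLAs. If your SLA promises 99.9 % availability, your internal SLO should target something stricter, such as 99.95 %, so that you miss the internal target before you breach the contract. The unreliability the SLO permits (here 0.05 %, about 21.6 minutes per 30 days) is your error budget. Spend the error budget on planned maintenance and bold deployments; protect it jealously when you are close to exhaustion.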
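A short sketch of the arithmetic, assuming the common convention that the error budget is the downtime an SLO permits over its window (100 % minus the SLO). The function name and the 12 minutes of consumed downtime are hypothetical.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime in minutes that an availability target permits per window."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


slo_budget = error_budget_minutes(99.95)  # internal objective
sla_limit = error_budget_minutes(99.9)    # contractual promise

downtime_so_far = 12.0                    # hypothetical minutes already used
remaining = slo_budget - downtime_so_far

print(f"SLO 99.95 % allows {slo_budget:.1f} min per 30 days")
print(f"SLA 99.9 %  allows {sla_limit:.1f} min per 30 days")
print(f"Safety margin before the SLA is breached: {sla_limit - slo_budget:.1f} min")
print(f"Error budget remaining: {remaining:.1f} min")
```

With these targets the SLO allows 21.6 minutes per 30 days and the SLA 43.2 minutes, so exhausting the internal budget still leaves 21.6 minutes before the contractual limit.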

How These Metrics Interact in Practice

Consider a database cluster: adding read replicas increases throughput (more queries served in parallel), but reads that hit a replica lagging behind the primary may return stale data, or take longer if they must wait for the replica to catch up. Adding an in-memory cache (like Redis) reduces read latency dramatically, but introduces the risk of stale data and of a cache-miss spike that temporarily degrades both latency and throughput. Achieving high availability requires replication and failover, which adds distributed coordination overhead and therefore costs some latency and throughput in turn.
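The cache part of this interaction can be sketched as an expected-value calculation. The model assumes a cache-aside pattern in which a miss pays for the cache lookup and then the database read; the 1 ms and 20 ms figures are made-up illustrative values, not benchmarks of Redis or any database.

```python
def effective_read_latency_ms(hit_ratio, cache_ms=1.0, db_ms=20.0):
    """Average read latency with a cache in front of a database.

    Assumes a miss costs the cache lookup plus the database read.
    """
    if not 0 <= hit_ratio <= 1:
        raise ValueError("hit_ratio must be between 0 and 1")
    hit_cost = cache_ms
    miss_cost = cache_ms + db_ms
    return hit_ratio * hit_cost + (1 - hit_ratio) * miss_cost


for ratio in (0.95, 0.50, 0.0):
    print(f"hit ratio {ratio:.0%}: {effective_read_latency_ms(ratio):.1f} ms average")
```

Under these assumptions a 95 % hit ratio averages about 2 ms per read, but if the hit ratio collapses (for example after a cache restart) the average climbs to 21 ms, slightly worse than having no cache at all.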

System design is the art of tuning these three numbers for your specific requirements. There is no globally optimal point — only the right trade-off for your workload, budget, and user expectations.

Interview tip: In a system-design interview, as soon as you hear a requirement like "users must not notice slowness," translate it immediately into concrete numbers: "so we need p99 latency under 100 ms and 99.9 % availability." Quantifying vague requirements shows senior-level thinking.