Key Metrics: Latency, Throughput & Availability
Before you can design a system, you need a shared vocabulary for measuring how well it performs. Three numbers dominate every system-design conversation: latency, throughput, and availability. Understanding what they mean — and the hard trade-offs between them — is the foundation of every architectural decision you will ever make.
Latency: How Fast Is a Single Request?
Latency is the time elapsed from the moment a client sends a request to the moment it receives the full response. It is measured in milliseconds (ms) or microseconds (µs). Lower is better.
A few reference points that every engineer should have memorised:
- L1 cache read: ~1 ns
- L2 cache read: ~4 ns
- RAM read: ~100 ns
- SSD random read: ~100 µs (0.1 ms)
- HDD seek: ~10 ms
- Round-trip within same data-centre: ~0.5 ms
- Cross-region round-trip (e.g. US → EU): ~150 ms
- Mobile 4G round-trip: ~50–100 ms
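To get a feel for the ratios rather than the raw values, the sketch below restates the figures above in nanoseconds and expresses each one as a multiple of an L1 cache read. The dictionary and the helper name `relative_to_l1` are illustrative only, and the 4G figure is left out because it is a range rather than a single value.

```python
# Approximate latencies from the list above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache read": 1,
    "L2 cache read": 4,
    "RAM read": 100,
    "SSD random read": 100_000,
    "Round-trip within same data-centre": 500_000,
    "HDD seek": 10_000_000,
    "Cross-region round-trip": 150_000_000,
}


def relative_to_l1(name: str) -> float:
    """Return how many L1 cache reads fit into one operation of the given kind."""
    return LATENCY_NS[name] / LATENCY_NS["L1 cache read"]


for name, ns in LATENCY_NS.items():
    print(f"{name:<36} {ns:>12,} ns = {relative_to_l1(name):>13,.0f} x L1")
```

Read this way, a single cross-region round trip costs about as much as 150 million L1 cache reads, which is why network hops tend to dominate latency budgets.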
Percentiles: Why the Average Lies
Reporting average (mean) latency is one of the most dangerous habits in engineering. Imagine 99 requests that complete in 10 ms and one that takes 10 000 ms. The average is ~110 ms, yet 99 % of users are fine and one user is furious. The average describes neither experience: it overstates the typical request elevenfold and says nothing about how bad the worst one was.
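The arithmetic in this example can be checked directly. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# 99 fast requests and one very slow one, all in milliseconds.
latencies_ms = [10] * 99 + [10_000]

mean_ms = statistics.mean(latencies_ms)      # pulled far upwards by the single outlier
median_ms = statistics.median(latencies_ms)  # what the typical request experienced

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
```

The mean lands far above what almost every request saw, while the median stays at the typical value of 10 ms.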
The industry uses percentiles instead:
- p50: the median; 50 % of requests are faster than this value.
- p95: 95 % of requests are faster; 5 % are slower.
- p99: 99 % of requests are faster; only 1 in 100 is slower.
- p999: 99.9 % of requests are faster; this is the worst 1 in 1 000.
In production you will hear teams say things like "our p99 is 200 ms." That means 1 % of requests take longer than 200 ms. At 10 000 requests per second, that is 100 requests every second getting a slow response.
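As a rough illustration of how such a figure is obtained, the sketch below computes percentiles with the nearest-rank method, which is only one of several common definitions; monitoring systems may interpolate or use approximate histograms instead. The function names and the sample data are hypothetical.

```python
import math
from fractions import Fraction


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least p % of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids float rounding pushing the rank up by one.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def slow_requests_per_second(rps: float, p: float) -> float:
    """Expected number of requests per second that are slower than the p-th percentile."""
    return rps * (100 - p) / 100


# Hypothetical sample: 1 000 requests with latencies of 1, 2, ..., 1000 ms.
samples = [float(ms) for ms in range(1, 1001)]
for p in (50, 95, 99, 99.9):
    print(f"p{p} = {percentile(samples, p)} ms")

print(slow_requests_per_second(10_000, 99), "requests/s are slower than p99")
```

Nearest-rank always returns a value that was actually observed, which makes it easy to reason about, at the cost of being coarse when the sample is small.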
Throughput: How Much Work Can the System Do?
Throughput is the number of requests (or operations, or bytes) a system can process per unit of time. Common units are requests per second (RPS), queries per second (QPS), or bytes per second (B/s). Higher is better.
Throughput tells you about capacity. A system that handles 500 RPS at p99 = 50 ms is a very different beast from one that handles 50 000 RPS at the same latency. You need both numbers.
The Latency vs. Throughput Trade-off
Latency and throughput are related, but optimising for one often costs you the other. Batching is the classic illustration: if you send database writes individually, each write is fast (low latency) but total throughput is limited by round-trip overhead. If you buffer and batch 1 000 writes into a single transaction, throughput skyrockets, but each individual write now waits up to the full batch window before it is committed, so latency increases.
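The batching trade-off can be made concrete with a deliberately simple cost model. This is a sketch under stated assumptions, not a description of any real database: each transaction pays a fixed round-trip cost plus a small per-write cost, a single connection issues transactions back to back, and writes arrive at a steady rate. The class name `BatchModel` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchModel:
    round_trip_ms: float  # fixed cost paid once per transaction (assumed)
    per_write_ms: float   # incremental cost of each write inside a transaction (assumed)

    def throughput_per_s(self, batch_size: int) -> float:
        """Writes per second over one connection issuing transactions back to back."""
        transaction_ms = self.round_trip_ms + batch_size * self.per_write_ms
        return batch_size / transaction_ms * 1000

    def worst_case_latency_ms(self, batch_size: int, arrival_rate_per_s: float) -> float:
        """Latency of the first write in a batch: wait for the batch to fill, then commit."""
        fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000
        return fill_wait_ms + self.round_trip_ms + batch_size * self.per_write_ms


model = BatchModel(round_trip_ms=1.0, per_write_ms=0.01)
for size in (1, 10, 100, 1000):
    print(
        f"batch={size:>4}  "
        f"throughput={model.throughput_per_s(size):>9.0f} writes/s  "
        f"worst-case latency={model.worst_case_latency_ms(size, 50_000):6.2f} ms"
    )
```

With these assumed numbers, going from single writes to batches of 1 000 raises throughput roughly ninetyfold, while the worst-case latency of an individual write grows from about 1 ms to about 31 ms.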
Availability: The Nines
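Availability is the fraction of time a system is correctly serving requests. It is expressed as a percentage, usually written as the famous "nines" (downtime figures assume a 365-day year):
- 99 % ("two nines"): ~3.65 days of downtime per year
- 99.9 % ("three nines"): ~8.76 hours of downtime per year
- 99.99 % ("four nines"): ~52.6 minutes of downtime per year
- 99.999 % ("five nines"): ~5.26 minutes of downtime per year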
Each additional nine cuts the permitted downtime by a factor of ten, and the engineering effort needed to get there rises steeply. Going from 99 % to 99.9 % means shrinking roughly 3.65 days of downtime per year to under 9 hours. Going from 99.99 % to 99.999 % means tolerating only about 5 minutes of downtime annually, which typically takes active-active multi-region setups, automated failover in seconds, and relentless chaos engineering.
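The downtime that each availability level permits follows from simple arithmetic. A minimal sketch, assuming a 365-day year and using an illustrative function name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525 600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Downtime per year, in minutes, permitted by the given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


for availability in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(availability)
    print(f"{availability:>7} %  ->  {minutes:9.2f} min/year  ({minutes / 60:6.2f} h)")
```

The same function works for shorter windows if `MINUTES_PER_YEAR` is swapped for the number of minutes in, say, a 30-day period.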
SLI, SLO, and SLA
Three terms you will encounter constantly:
- SLI (Service Level Indicator): The actual measured metric — e.g. "our p99 latency was 180 ms last hour."
- SLO (Service Level Objective): The internal target — e.g. "p99 latency must be below 200 ms, measured over a 30-day rolling window."
- SLA (Service Level Agreement): A legal contract with a customer — e.g. "we guarantee 99.9 % availability; if we miss it, you receive service credits."
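The relationship between the first two terms can be sketched in a few lines: the SLI is a number you measure, and the SLO is the threshold you compare it against. The class below is a hypothetical illustration for an availability objective based on request counts; real SLO tooling usually works over rolling windows and richer definitions of a "good" request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    target_percent: float  # the objective, e.g. 99.9

    def sli_percent(self, good_requests: int, total_requests: int) -> float:
        """The measured indicator: percentage of requests that were served correctly."""
        if total_requests <= 0:
            raise ValueError("total_requests must be positive")
        return good_requests / total_requests * 100

    def is_met(self, good_requests: int, total_requests: int) -> bool:
        """True when the measured SLI is at or above the objective."""
        return self.sli_percent(good_requests, total_requests) >= self.target_percent


slo = AvailabilitySLO(target_percent=99.9)
print(slo.sli_percent(999_500, 1_000_000), slo.is_met(999_500, 1_000_000))
print(slo.sli_percent(998_000, 1_000_000), slo.is_met(998_000, 1_000_000))
```

An SLA would sit on top of this as a contractual consequence of missing the target, which is a business matter rather than something the code decides.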
How These Metrics Interact in Practice
Consider a database cluster: adding read replicas increases throughput (more queries served in parallel), but reads that hit a replica lagging behind the primary may return stale data or be slightly slower. Adding an in-memory cache (like Redis) reduces read latency dramatically, but introduces the risk of stale data and of a cache-miss spike that temporarily degrades both latency and throughput. Achieving high availability requires replication and failover, which adds distributed coordination overhead and so costs some latency and throughput in turn.
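The cache part of this trade-off can be approximated with a weighted average. The sketch below assumes a simple read-through pattern in which a hit costs one cache lookup and a miss costs a cache lookup plus a database read; the function name and the latency figures are hypothetical.

```python
def expected_read_latency_ms(hit_ratio: float, cache_ms: float, db_ms: float) -> float:
    """Average read latency for a read-through cache with the given hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ms = cache_ms + db_ms  # a miss pays for the failed lookup and the database read
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * miss_ms


# Assumed figures: 0.5 ms per cache lookup, 10 ms per database read.
for hit_ratio in (0.99, 0.9, 0.5, 0.0):
    latency = expected_read_latency_ms(hit_ratio, cache_ms=0.5, db_ms=10.0)
    print(f"hit ratio {hit_ratio:4.2f} -> {latency:5.2f} ms average read latency")
```

Under this model, a drop in hit ratio (for example after a cache restart) pushes average latency back towards the database figure, which is the cache-miss spike described above.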
System design is the art of tuning these three numbers for your specific requirements. There is no globally optimal point — only the right trade-off for your workload, budget, and user expectations.