System Design Fundamentals

Scalability, Reliability & Maintainability

Every large-scale system is ultimately judged by three qualities: can it grow without falling over, does it keep working when things go wrong, and can engineers change it without fear? These three pillars — scalability, reliability, and maintainability — are the lens through which experienced engineers evaluate every architectural decision.

1. Scalability — Handling Growth

Scalability is the ability of a system to handle an increasing amount of work by adding resources. It answers the question: "If the load doubles, what happens?"

Vertical vs. Horizontal Scaling

Vertical scaling (scale up): Give one machine more CPU, RAM, or faster disks. It is simple, since it usually needs no code changes, but it hits hard limits. Even the largest cloud instances top out at a few hundred vCPUs and tens of terabytes of RAM, and a single machine remains a single point of failure. Beyond that ceiling, you cannot scale up.
Horizontal scaling (scale out): Add more machines and distribute the load. This is how web giants operate. Netflix, for example, runs across tens of thousands of cloud instances. The trade-off is added complexity: you must handle data distribution, network calls, and consistency.

Vertical scaling upgrades one machine; horizontal scaling distributes load across many machines.

Measuring Scalability: Load Parameters

Before you can optimize, you must define what load means for your system. Common load parameters include:

Requests per second to a web server
Ratio of reads to writes in a database (a social feed such as Twitter's is heavily read-dominated, on the order of 100:1)
Number of simultaneously active users in a chat system
Cache hit rate

Pick the numbers that matter most, then ask: if that number grows 10×, how does the system respond?
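To make the 10× question concrete, here is a rough capacity sketch. It assumes throughput scales linearly with node count, which real systems rarely achieve, and every number in it (per-node capacity, headroom, request rates) is an illustrative assumption rather than a measurement.

```python
import math


def nodes_needed(load_rps: float, per_node_rps: float, headroom: float = 0.25) -> int:
    """Estimate how many nodes are needed to serve load_rps.

    headroom is the fraction of each node's capacity kept in reserve
    for spikes and failover. Assumes perfectly linear scaling.
    """
    if per_node_rps <= 0 or not 0 <= headroom < 1:
        raise ValueError("per_node_rps must be positive and headroom in [0, 1)")
    usable_rps = per_node_rps * (1 - headroom)
    return math.ceil(load_rps / usable_rps)


# Hypothetical figures: each node sustains 1,000 requests/sec.
current = nodes_needed(5_000, per_node_rps=1_000)
after_growth = nodes_needed(50_000, per_node_rps=1_000)
print(f"today: {current} nodes, after 10x growth: {after_growth} nodes")
```

If the real system needs noticeably more nodes than this estimate, some shared resource (a database, a lock, a network link) is probably not scaling with the rest.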

Design tip: Statefulness is the enemy of horizontal scalability. If every server stores session data locally, you cannot freely route a user to any server. Push state into a shared layer (Redis, a database) so that any server can handle any request — this is called stateless architecture.
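A minimal sketch of the idea, assuming a shared session store: the SessionStore class below is a hypothetical in-process stand-in for something like Redis, and AppServer is an illustrative name, not a real framework class. The point is only that no server keeps session data of its own.

```python
import uuid


class SessionStore:
    """Stand-in for a shared store such as Redis; here it is just a dict."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """A stateless server: all session state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = list(session.get("cart", []))
        cart.append(item)
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

sid = str(uuid.uuid4())
server_a.add_to_cart(sid, "book")        # first request lands on server a
cart = server_b.add_to_cart(sid, "pen")  # next request lands on server b
print(cart)
```

Because both servers read and write the same store, a load balancer can send each request to whichever server is free.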

2. Reliability — Continuing to Work Correctly

Reliability is the ability of a system to continue performing its intended function correctly, even when things go wrong. "Things going wrong" can mean hardware faults, software bugs, or human configuration errors, which are frequently cited as a leading cause of production outages.

The Three Fault Categories

Hardware faults: Disk crashes, RAM errors, network card failures. At large scale these are routine: hard disks are commonly reported to have a mean time to failure of roughly 10 to 50 years, so a storage cluster with 10,000 disks should expect about one disk failure per day on average. The solution is redundancy: RAID arrays, dual power supplies, hot-standby servers.
Software faults: Runaway processes, bugs triggered by specific inputs, cascading failures where service A drags down B which drags down C. These are harder to prevent because they are often systemic — the same bad assumption lives in every instance.
Human errors: Misconfigured feature flags, wrong database migrations, accidental deletions. Studies of large internet services have found operator configuration errors to be a leading cause of outages, often ahead of hardware faults.
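A back-of-envelope check shows why disk failures become routine at scale. The sketch below assumes failures are independent and occur at a constant rate, and the mean-time-to-failure values are illustrative assumptions, not vendor figures.

```python
def expected_disk_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Average failures per day for a fleet, assuming a constant failure rate."""
    return disk_count / (mttf_years * 365)


# Assumed MTTF range; actual drives vary by model, age, and workload.
for mttf in (10, 30, 50):
    rate = expected_disk_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: {rate:.2f} failures/day across 10,000 disks")
```

Under these assumptions a 10,000-disk fleet loses somewhere between one disk every two days and nearly three disks a day, which is why replacement has to be a routine operation rather than an emergency.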

Fault Tolerance vs. Fault Prevention

You cannot prevent all faults, so reliable systems are designed to tolerate them. Key techniques:

Redundancy: Duplicate critical components (primary + replica databases, multi-AZ deployments). If one fails, another takes over.
Graceful degradation: When the recommendation service is down, show a blank recommendations panel — do not crash the entire page.
Bulkheads: Isolate subsystems so a failure in one cannot propagate. Named after the watertight compartments on ships.
Chaos engineering: Deliberately inject failures in production (Netflix's Chaos Monkey) to verify that the system handles them before a real incident puts it to the test.
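Graceful degradation from the list above can be sketched in a few lines. The function names and the page structure here are hypothetical; the failing fetch_recommendations simply simulates an outage of the recommendation service.

```python
def fetch_recommendations(user_id):
    """Hypothetical call to a recommendation service; simulates an outage."""
    raise TimeoutError("recommendation service unavailable")


def render_page(user_id, fetch=fetch_recommendations):
    """Build the page, falling back to an empty panel if recommendations fail."""
    page = {"user": user_id, "content": "main feed", "degraded": False}
    try:
        page["recommendations"] = fetch(user_id)
    except (TimeoutError, ConnectionError):
        # The core page still renders; only the optional panel is empty.
        page["recommendations"] = []
        page["degraded"] = True
    return page


degraded_page = render_page("user-1")
healthy_page = render_page("user-1", fetch=lambda user_id: ["item-a", "item-b"])
print(degraded_page)
print(healthy_page)
```

Recording the degraded flag matters as much as the fallback itself: without it, a silently empty panel can hide an outage from the people operating the system.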

Common pitfall: A system with five nines of hardware uptime can still be unreliable if deployments are manual and error-prone. Reliability requires process discipline as much as technical redundancy.

3. Maintainability — Living with the System

Most of the cost of software is not in the initial build — it is in the years of operation, bug fixes, and feature additions that follow. Maintainability is about making that ongoing work manageable. It has three sub-properties:

Operability

Make it easy for operations teams to keep the system running smoothly. Good systems expose metrics and logs, provide dashboards, support graceful restarts, and make it easy to understand the current health state. A system that is a "black box" is expensive to operate.

Simplicity

Manage complexity by hiding it behind good abstractions. When a developer must hold the entire system in their head to make a change safely, velocity collapses and bugs multiply. Abstractions that expose a clean interface — databases, message queues, service APIs — let teams work on parts of a system without needing to understand everything else.

Evolvability (Extensibility)

Requirements will change. Design for it. Use patterns like modular architecture, well-defined interfaces, feature flags, and backward-compatible API versioning so that adding a new feature does not require rewriting the existing ones.
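As one illustration of these patterns, a feature flag lets a new code path ship alongside the old one and be switched on without a rewrite. This is a minimal sketch: the flag name, the FLAGS dictionary, and the discount rule are all hypothetical, and a production system would typically load flags from a configuration service.

```python
# Hypothetical flag table; defaults keep existing behavior unchanged.
FLAGS = {"new_checkout": False}


def is_enabled(flag, flags=FLAGS):
    """Unknown flags are treated as off, so missing config is safe."""
    return flags.get(flag, False)


def checkout_total(prices_in_cents, flags=FLAGS):
    """Return the order total in cents; the flag selects the pricing path."""
    total = sum(prices_in_cents)
    if is_enabled("new_checkout", flags):
        # Illustrative new behavior: a 10% discount, in integer cents.
        return total * 90 // 100
    return total


old_total = checkout_total([1000, 500])
new_total = checkout_total([1000, 500], flags={"new_checkout": True})
print(old_total, new_total)
```

The existing path stays untouched until the flag is flipped, so the new behavior can be rolled out gradually and rolled back by changing configuration rather than code.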

The three pillars every production system must balance: grow, stay correct, and stay changeable.

How the Three Pillars Interact

These qualities are not independent — they pull against each other in interesting ways:

Scalability vs. Reliability: Distributing data across many nodes scales well, but it means you must now handle partial failures and eventual consistency. A single master database is simpler to reason about, but it hits scaling limits.
Scalability vs. Maintainability: A microservices architecture scales individual services independently, but a system with 200 services is vastly harder to operate and reason about than a well-structured monolith.
Reliability vs. Maintainability: Adding redundancy — multiple data centers, complex failover logic — increases operational complexity. Systems with four nines of uptime are expensive to maintain.

The key insight: There is no universally correct balance. A startup processing 1,000 requests/day should optimize for maintainability (move fast). A payment processor handling billions of transactions should optimize for reliability. A social network growing 50% month-over-month should optimize for scalability. Context determines priority.

Practical Targets

Real-world service level objectives give these pillars concrete numbers:

Scalability: "The system must handle 50,000 requests/sec at p99 latency under 200 ms, and must scale to 200,000 req/sec with no code changes — only by adding nodes."
Reliability: "The service must achieve 99.9% monthly availability (about 43 minutes of downtime allowed in a 30-day month). Any single node failure must not cause user-visible errors."
Maintainability: "Mean time to deploy a feature is under 1 hour. Any on-call engineer can diagnose and recover from a typical incident using dashboards alone, without reading source code."
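The numbers in these targets can be checked with a little arithmetic. The sketch below assumes a 30-day month for the availability budget and uses the nearest-rank method for p99, which is one of several percentile definitions; the latency samples are made up for illustration.

```python
import math


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted over the period at a given availability."""
    return (1 - availability) * days * 24 * 60


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


print(f"99.9%  -> {downtime_budget_minutes(0.999):.1f} min/month")
print(f"99.99% -> {downtime_budget_minutes(0.9999):.2f} min/month")

# Made-up latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)
print(f"p99 = {p99} ms, meets 200 ms target: {p99 < 200}")
```

Each extra nine cuts the downtime budget by a factor of ten, which is a useful way to see why higher availability targets cost so much more to meet.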

Keeping these targets explicit — written in a design doc, tracked in dashboards — is what separates a system that lasts years from one that becomes unmaintainable the moment the original authors leave.