Scalability, Reliability & Maintainability
Every large-scale system is ultimately judged by three qualities: can it grow without falling over, does it keep working when things go wrong, and can engineers change it without fear? These three pillars — scalability, reliability, and maintainability — are the lens through which experienced engineers evaluate every architectural decision.
1. Scalability — Handling Growth
Scalability is the ability of a system to handle an increasing amount of work by adding resources. It answers the question: "If the load doubles, what happens?"
Vertical vs. Horizontal Scaling
- Vertical scaling (scale up): Give one machine more CPU, RAM, or faster disks. It is simple, since it usually needs no code changes, but it hits hard limits quickly. Even the largest AWS EC2 instances top out at several hundred vCPUs and tens of terabytes of RAM. Beyond that ceiling, you cannot scale up.
- Horizontal scaling (scale out): Add more machines and distribute the load. This is how web giants operate. Netflix, for example, runs across tens of thousands of cloud instances. The trade-off is added complexity: you must handle data distribution, network calls, and consistency.
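As a minimal sketch of what "distribute the load" can mean in practice, the snippet below routes each key to one of several nodes by hashing it. The node names and the `pick_node` helper are hypothetical, and modulo sharding is only one of several possible schemes, not a description of how any particular company does it.

```python
import hashlib


def pick_node(key: str, nodes: list[str]) -> str:
    """Map a key to one node using a stable hash (simple modulo sharding)."""
    if not nodes:
        raise ValueError("at least one node is required")
    # Use a stable hash: Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical node names, for illustration only.
nodes = ["node-a", "node-b", "node-c"]
for user_id in ["user-1", "user-2", "user-3", "user-4"]:
    print(user_id, "->", pick_node(user_id, nodes))
```

The same key always lands on the same node, which is what lets each machine own a slice of the data. The cost is visible too: with modulo sharding, changing the number of nodes remaps most keys, which is one reason schemes such as consistent hashing exist.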
Measuring Scalability: Load Parameters
Before you can optimize, you must define what load means for your system. Common load parameters include:
- Requests per second to a web server
- Ratio of reads to writes in a database (a social feed such as Twitter's is heavily read-dominated, on the order of 100 reads per write)
- Number of simultaneously active users in a chat system
- Cache hit rate
Pick the numbers that matter most, then ask: if that number grows 10×, how does the system respond?
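To make the idea concrete, here is a small sketch that derives two of these load parameters from a request log and projects the 10× case. The `Request` record and the synthetic sample are assumptions made for illustration; a real system would read these numbers from its metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since an arbitrary start
    is_write: bool


def load_parameters(requests: list[Request]) -> dict[str, float]:
    """Compute requests per second and the read/write ratio for a log window."""
    if not requests:
        return {"requests_per_second": 0.0, "read_write_ratio": 0.0}
    times = [r.timestamp for r in requests]
    # Guard against a zero-length window when all requests share a timestamp.
    window = max(max(times) - min(times), 1.0)
    writes = sum(1 for r in requests if r.is_write)
    reads = len(requests) - writes
    return {
        "requests_per_second": len(requests) / window,
        "read_write_ratio": reads / writes if writes else float("inf"),
    }


# Synthetic log: 100 requests over about 10 seconds, every tenth one a write.
sample = [Request(t * 0.1, is_write=(t % 10 == 0)) for t in range(100)]
params = load_parameters(sample)
projected_rps = params["requests_per_second"] * 10  # the "what if 10x?" question
print(params, projected_rps)
```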
2. Reliability — Continuing to Work Correctly
Reliability is the ability of a system to continue performing its intended function correctly, even when things go wrong. "Things going wrong" can mean hardware faults, software bugs, or human errors such as misconfiguration, which operational studies often identify as a leading cause of production outages.
The Three Fault Categories
- Hardware faults: Disk crashes, RAM errors, network card failures. At large scale these are routine: with disk mean time to failure (MTTF) commonly cited at 10 to 50 years, a cluster with 10,000 disks should expect roughly one disk failure per day. The solution is redundancy: RAID arrays, dual power supplies, hot-standby servers.
- Software faults: Runaway processes, bugs triggered by specific inputs, cascading failures where service A drags down B which drags down C. These are harder to prevent because they are often systemic — the same bad assumption lives in every instance.
- Human errors: Misconfigured feature flags, wrong database migrations, accidental deletions. Studies of large internet services have found operator configuration errors to be a leading cause of outages, ahead of hardware faults.
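The hardware-fault arithmetic can be sketched as follows. The MTTF range of 10 to 50 years is an assumed input (a commonly cited range for hard disks), and the calculation treats failures as independent with a constant rate, which real fleets only approximate.

```python
def expected_failures_per_day(disk_count: int, mttf_years: float) -> float:
    """Expected daily failures, assuming independent disks and a constant failure rate."""
    return disk_count / (mttf_years * 365)


# Assumed MTTF values; actual figures depend on hardware and workload.
for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: {rate:.2f} failures/day")
# MTTF 10 years: 2.74 failures/day
# MTTF 50 years: 0.55 failures/day
```

Under these assumptions, a 10,000-disk cluster loses somewhere between one disk every two days and about three disks per day, which is why failures at this scale are handled by automation rather than by paging a human each time.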
Fault Tolerance vs. Fault Prevention
You cannot prevent all faults, so reliable systems are designed to tolerate them. Key techniques:
- Redundancy: Duplicate critical components (primary + replica databases, multi-AZ deployments). If one fails, another takes over.
- Graceful degradation: When the recommendation service is down, show a blank recommendations panel — do not crash the entire page.
- Bulkheads: Isolate subsystems so a failure in one cannot propagate. Named after the watertight compartments on ships.
- Chaos engineering: Deliberately inject failures in production (Netflix's Chaos Monkey) to prove the system handles them before a real incident does.
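Graceful degradation from the list above might look like the sketch below. The `with_fallback` helper, the simulated failing service, and the page structure are all hypothetical; the point is only that a failure in an optional dependency is caught at the boundary and replaced with a safe default.

```python
import logging
from typing import Callable

logger = logging.getLogger("recommendations")


def with_fallback(fetch: Callable[[], list[str]], fallback: list[str]) -> list[str]:
    """Call an optional dependency; on any failure, log it and return the fallback."""
    try:
        return fetch()
    except Exception:
        logger.warning("recommendation service unavailable; using fallback")
        return fallback


def broken_recommendation_service() -> list[str]:
    # Simulates an outage of the (hypothetical) recommendation service.
    raise ConnectionError("recommendation service is down")


def render_page() -> dict[str, object]:
    # The core content does not depend on recommendations succeeding.
    recommendations = with_fallback(broken_recommendation_service, fallback=[])
    return {"main_content": "product details", "recommendations": recommendations}


page = render_page()
print(page)
```

The page still renders with an empty recommendations list. In a real service the call would typically also carry a timeout, so that a slow dependency degrades the same way a failed one does.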
3. Maintainability — Living with the System
Most of the cost of software is not in the initial build — it is in the years of operation, bug fixes, and feature additions that follow. Maintainability is about making that ongoing work manageable. It has three sub-properties:
Operability
Make it easy for operations teams to keep the system running smoothly. Good systems expose metrics and logs, provide dashboards, support graceful restarts, and make it easy to understand the current health state. A system that is a "black box" is expensive to operate.
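One way to expose "current health state" is a single report that aggregates per-dependency checks, as in this minimal sketch. The check functions and the report shape are assumptions made for illustration, not a standard format.

```python
from typing import Callable


def run_health_checks(checks: dict[str, Callable[[], str]]) -> dict[str, object]:
    """Run every check, capture failures, and summarize overall status."""
    results: dict[str, dict[str, object]] = {}
    for name, check in checks.items():
        try:
            results[name] = {"healthy": True, "detail": check()}
        except Exception as exc:
            # A failing check is reported, not raised, so one bad dependency
            # cannot hide the state of the others.
            results[name] = {"healthy": False, "detail": str(exc)}
    all_healthy = all(r["healthy"] for r in results.values())
    return {"status": "ok" if all_healthy else "degraded", "checks": results}


# Hypothetical checks standing in for real dependency probes.
def check_database() -> str:
    return "connected"


def check_queue() -> str:
    raise TimeoutError("no response from broker in 2s")


report = run_health_checks({"database": check_database, "queue": check_queue})
print(report)
```

An operator reading this report can see both that the system is degraded and which dependency is responsible, without opening the source code.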
Simplicity
Manage complexity by hiding it behind good abstractions. When a developer must hold the entire system in their head to make a change safely, velocity collapses and bugs multiply. Abstractions that expose a clean interface — databases, message queues, service APIs — let teams work on parts of a system without needing to understand everything else.
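A small sketch of such an abstraction: callers depend on a narrow queue interface rather than on any particular broker. The `MessageQueue` protocol, the in-memory implementation, and `enqueue_welcome_email` are hypothetical names chosen for the example.

```python
from collections import deque
from typing import Optional, Protocol


class MessageQueue(Protocol):
    """The only surface callers need to understand."""

    def publish(self, message: str) -> None: ...

    def consume(self) -> Optional[str]: ...


class InMemoryQueue:
    """One possible implementation; a real broker client could replace it."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def publish(self, message: str) -> None:
        self._items.append(message)

    def consume(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


def enqueue_welcome_email(queue: MessageQueue, user: str) -> None:
    # This function knows nothing about how or where messages are stored.
    queue.publish(f"send-welcome-email:{user}")


queue = InMemoryQueue()
enqueue_welcome_email(queue, "alice")
```

Swapping `InMemoryQueue` for a networked implementation would not require touching `enqueue_welcome_email`, which is the practical payoff of a clean interface.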
Evolvability (Extensibility)
Requirements will change. Design for it. Use patterns like modular architecture, well-defined interfaces, feature flags, and backward-compatible API versioning so that adding a new feature does not require rewriting the existing ones.
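Backward-compatible versioning can be sketched as an additive change: newer clients get extra fields, while the fields older clients rely on stay in place. The field names and the `api_version` parameter here are illustrative assumptions.

```python
def build_user_response(user: dict[str, str], api_version: int) -> dict[str, str]:
    """Build a response that stays valid for v1 clients while serving v2 fields."""
    # v1 contract: a single combined "name" field. Never removed or renamed.
    response = {"name": f'{user["first_name"]} {user["last_name"]}'}
    if api_version >= 2:
        # v2 adds structured fields alongside the old one instead of replacing it.
        response["first_name"] = user["first_name"]
        response["last_name"] = user["last_name"]
    return response


user = {"first_name": "Ada", "last_name": "Lovelace"}
v1 = build_user_response(user, api_version=1)
v2 = build_user_response(user, api_version=2)
print(v1)
print(v2)
```

Because v2 only adds fields, a client written against v1 keeps working when it receives a v2 response.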
How the Three Pillars Interact
These qualities are not independent — they pull against each other in interesting ways:
- Scalability vs. Reliability: Distributing data across many nodes scales well, but you must now handle partial failures and, often, eventual consistency. A single primary database is simpler to reason about, but it hits scaling limits.
- Scalability vs. Maintainability: A microservices architecture scales individual services independently, but a system with 200 services is vastly harder to operate and reason about than a well-structured monolith.
- Reliability vs. Maintainability: Adding redundancy (multiple data centers, complex failover logic) increases operational complexity. Systems that target four nines (99.99%) of uptime are expensive to build and operate.
Practical Targets
Real-world service level objectives give these pillars concrete numbers:
- Scalability: "The system must handle 50,000 requests/sec at p99 latency under 200 ms, and must scale to 200,000 req/sec with no code changes — only by adding nodes."
- Reliability: "The service must achieve 99.9% monthly availability (about 43 minutes of downtime allowed in a 30-day month). Any single node failure must not cause user-visible errors."
- Maintainability: "Mean time to deploy a feature is under 1 hour. Any on-call engineer can diagnose and recover from a typical incident using dashboards alone, without reading source code."
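The numbers in these targets can be checked with a few lines of arithmetic. The sketch below computes a downtime budget from an availability target and a nearest-rank p99 from latency samples. The 30-day month and the synthetic latency values are assumptions, and production systems usually compute percentiles inside their metrics stack, not by sorting raw samples.

```python
import math


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window, given an availability target."""
    return (1.0 - availability) * days * 24 * 60


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


print(round(downtime_budget_minutes(0.999), 2))   # 43.2 minutes for "three nines"
print(round(downtime_budget_minutes(0.9999), 2))  # 4.32 minutes for "four nines"

# Synthetic latencies in milliseconds: 1.0, 2.0, ..., 100.0.
latencies_ms = [float(i) for i in range(1, 101)]
p99 = percentile(latencies_ms, 99)
print(p99, "meets the 200 ms target:", p99 < 200)
```

Moving from three nines to four nines shrinks the monthly budget from about 43 minutes to about 4, which leaves little room for manual recovery.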
Keeping these targets explicit — written in a design doc, tracked in dashboards — is what separates a system that lasts years from one that becomes unmaintainable the moment the original authors leave.