Sharding for Scale
Sharding for Scale
At some point, a single database server — no matter how powerful — stops being enough. You have added read replicas to handle query volume, you have tuned every index and query, you have upgraded the hardware to the biggest available instance. Yet writes keep growing, the dataset no longer fits comfortably in RAM, and a full-table scan on a 2-billion-row table takes minutes. You have reached the limit of vertical scaling for storage. The next step is sharding.
Sharding (also called horizontal partitioning) splits one logical database into multiple smaller databases called shards, each shard living on a separate machine. Every shard holds a distinct subset of the data. Together they add up to the full dataset — but no single shard needs to hold all of it, and no single server needs to serve all the writes.
Why Shard at All?
Before committing to sharding — which adds significant complexity — consider the problems it solves:
- Write throughput: A single MySQL primary can handle roughly 20,000–30,000 writes/second under ideal conditions. With 4 shards, you can sustain ~80,000–120,000 writes/second across the fleet. With 32 shards, millions per second become feasible.
- Dataset size: A dataset of 50 TB on one machine means every query scans a 50 TB index. Split across 10 shards of 5 TB each, each query touches only one shard — index scans are 10× smaller.
- Memory pressure: Databases perform best when the working set (hot data) fits in RAM. Sharding shrinks each shard's dataset so its working set can fit in the RAM of a modest instance.
- Isolation: A noisy batch job on one tenant's shard does not slow down other tenants' shards.
Choosing a Shard Key
The most critical decision in any sharded system is the shard key — the column (or combination of columns) used to decide which shard a given row belongs to. A bad shard key can cause hotspots, inefficient queries, and painful resharding. A good shard key distributes load evenly and aligns with your access patterns.
Common Shard Key Strategies
1. Hash-based sharding: A hash function is applied to the shard key value and the result is mod-divided by the number of shards to produce a shard index.
Pros: Even distribution across shards. No one shard gets a disproportionate share of data or writes by accident. Cons: Range queries (e.g. all users created in January) must scatter across all shards. Adding a shard means rehashing most keys — without consistent hashing, this is very disruptive.
2. Range-based sharding: Rows are assigned to shards by a numeric or alphanumeric range of the key — for example, user IDs 1–1,000,000 go to Shard 1, IDs 1,000,001–2,000,000 go to Shard 2, and so on.
Pros: Range queries are efficient — a query for "all orders in March" hits only the shards covering that time range. Easy to reason about. Cons: Hot spots if data is created sequentially (e.g. using an auto-increment ID as the shard key — all new writes land on the last shard).
3. Directory-based sharding: A separate lookup service (the "shard directory") maps each key to its shard. Lookups consult this directory before routing the query.
Pros: Maximum flexibility — you can move individual tenants between shards without rehashing. Cons: The directory is a new critical dependency; it must be highly available. Adds one network hop to every query path.
4. Geo/tenant-based sharding: Often used in SaaS systems. All data belonging to a given tenant (or region) lives on one shard. Queries are always single-shard. Legal compliance (data residency) is easy to enforce.
user_id, shard on user_id. Queries that align with the shard key hit one shard and are fast. Queries that do not align (cross-shard queries) must fan out to all shards, aggregate the results, and are expensive. Design to make the fan-out case rare.
Diagram: Shard Key Routing
The Hotspot Problem
A hotspot occurs when one shard receives a disproportionate share of traffic while the others sit idle. Hotspots can arise from two sources:
- Hot keys: A single key (or small set of keys) receives extremely high traffic — imagine a viral tweet by an account with 100 million followers. If all activity for that account routes to one shard, that shard is overwhelmed. Mitigation: shard at a more granular key (e.g.
tweet_idrather thanauthor_id), or add an application-level fan-out cache. - Uneven data distribution: Some shard ranges are naturally denser than others. If you shard by last name alphabetically, names starting with "S" may account for 15% of your dataset while names starting with "X" account for 0.01%. Mitigation: use hash-based sharding instead of range-based, or periodically rebalance.
Cross-Shard Queries and Joins
In a single-database world, a join between two tables is a cheap in-process operation. In a sharded world, data related by a foreign key may live on different shards. A join becomes a distributed join: fetch records from Shard A, fetch related records from Shard B, and merge in application memory. This is expensive, latency-prone, and hard to make transactional.
The practical solution is to co-locate related data on the same shard. If orders and order-items always need to be queried together, shard both on user_id so every user's orders and order-items land on the same shard. The join is now local and cheap.
Some data — like a global product catalog or a currency conversion table — cannot be meaningfully sharded. Keep such reference data on every shard (replicated), or serve it from a shared read-only cache in front of all shards.
Consistent Hashing: Adding Shards Without Full Rehashing
The fatal flaw of naive modulo hashing (hash(key) % N) is that changing N (adding or removing shards) invalidates almost every key mapping. If you go from 4 to 5 shards, ~80% of keys map to different shards — triggering a massive data migration.
Consistent hashing solves this by arranging shards on a virtual ring. Adding a new shard only displaces the keys that were previously assigned to its immediate neighbors on the ring — typically only 1/N of total keys must move. Cassandra, Amazon DynamoDB, and most modern sharded databases use consistent hashing under the hood.
Resharding and Schema Changes
Even with consistent hashing, resharding (changing the number of shards or the shard key itself) is a complex operational event. It involves:
- Running the new and old shard configurations in parallel (dual-write).
- Migrating existing data in the background without downtime.
- Verifying consistency between old and new shards.
- Atomically cutting over read traffic to the new configuration.
Platforms like Vitess (used by YouTube) and Citus (PostgreSQL sharding extension) automate much of this process. But the fundamental lesson is: choose your shard key very carefully the first time. Changing it later is one of the most expensive operations in system design.
What Sharding Cannot Fix
Sharding helps with write throughput and dataset size, but it does not automatically solve every problem:
- Cross-shard transactions: ACID guarantees across shards require distributed transactions (two-phase commit), which are slow and complex. Most sharded systems instead embrace eventual consistency for cross-shard operations.
- Global aggregations:
COUNT(*),SUM, orAVGacross the entire dataset must be computed per shard and merged. For analytics at scale, a separate analytics database (data warehouse) is usually a better fit than querying shards directly. - Joins across entities: If orders and customers are on different shards, joining them is expensive. Co-location is the fix, but not always possible.
Key Takeaways
- Sharding splits data across multiple independent database machines to overcome write and storage limits of a single node.
- The shard key is the most important design decision — choose one that distributes load evenly and aligns with your most common queries.
- Hash-based sharding spreads data evenly; range-based sharding enables efficient range queries but risks hot spots.
- Consistent hashing reduces the data migration cost when adding or removing shards.
- Cross-shard queries are expensive — co-locate related data on the same shard where possible.
- Sharding brings significant operational complexity; delay it until read replicas and caching no longer suffice.