Scaling & Load Balancing

Sharding for Scale

18 min Lesson 8 of 10

Sharding for Scale

At some point, a single database server — no matter how powerful — stops being enough. You have added read replicas to handle query volume, you have tuned every index and query, you have upgraded the hardware to the biggest available instance. Yet writes keep growing, the dataset no longer fits comfortably in RAM, and a full-table scan on a 2-billion-row table takes minutes. You have reached the limit of vertical scaling for storage. The next step is sharding.

Sharding (also called horizontal partitioning) splits one logical database into multiple smaller databases called shards, each shard living on a separate machine. Every shard holds a distinct subset of the data. Together they add up to the full dataset — but no single shard needs to hold all of it, and no single server needs to serve all the writes.

Sharding vs Partitioning: The terms are often used interchangeably, but there is a useful distinction. Partitioning splits a table within one database server (e.g. MySQL table partitioning by month). Sharding splits data across entirely separate, independent database servers. Sharding is the technique that unlocks true horizontal scale for writes and storage.

Why Shard at All?

Before committing to sharding — which adds significant complexity — consider the problems it solves:

Write throughput: A single MySQL primary can handle roughly 20,000–30,000 writes/second under ideal conditions. With 4 shards, you can sustain ~80,000–120,000 writes/second across the fleet. With 32 shards, millions per second become feasible.
Dataset size: A dataset of 50 TB on one machine means every query scans a 50 TB index. Split across 10 shards of 5 TB each, each query touches only one shard — index scans are 10× smaller.
Memory pressure: Databases perform best when the working set (hot data) fits in RAM. Sharding shrinks each shard's dataset so its working set can fit in the RAM of a modest instance.
Isolation: A noisy batch job on one tenant's shard does not slow down other tenants' shards.

Choosing a Shard Key

The most critical decision in any sharded system is the shard key — the column (or combination of columns) used to decide which shard a given row belongs to. A bad shard key can cause hotspots, inefficient queries, and painful resharding. A good shard key distributes load evenly and aligns with your access patterns.

Common Shard Key Strategies

1. Hash-based sharding: A hash function is applied to the shard key value and the result is mod-divided by the number of shards to produce a shard index.

shard_id = hash(user_id) % number_of_shards
# user_id 10001 → hash = 9283 → 9283 % 4 = 3 → Shard 3
# user_id 10002 → hash = 1147 → 1147 % 4 = 3 → Shard 3
# user_id 10003 → hash = 6721 → 6721 % 4 = 1 → Shard 1

Pros: Even distribution across shards. No one shard gets a disproportionate share of data or writes by accident. Cons: Range queries (e.g. all users created in January) must scatter across all shards. Adding a shard means rehashing most keys — without consistent hashing, this is very disruptive.

2. Range-based sharding: Rows are assigned to shards by a numeric or alphanumeric range of the key — for example, user IDs 1–1,000,000 go to Shard 1, IDs 1,000,001–2,000,000 go to Shard 2, and so on.

Pros: Range queries are efficient — a query for "all orders in March" hits only the shards covering that time range. Easy to reason about. Cons: Hot spots if data is created sequentially (e.g. using an auto-increment ID as the shard key — all new writes land on the last shard).

3. Directory-based sharding: A separate lookup service (the "shard directory") maps each key to its shard. Lookups consult this directory before routing the query.

Pros: Maximum flexibility — you can move individual tenants between shards without rehashing. Cons: The directory is a new critical dependency; it must be highly available. Adds one network hop to every query path.

4. Geo/tenant-based sharding: Often used in SaaS systems. All data belonging to a given tenant (or region) lives on one shard. Queries are always single-shard. Legal compliance (data residency) is easy to enforce.

Pick the shard key that matches your most common query pattern. If 90% of your queries filter by user_id, shard on user_id. Queries that align with the shard key hit one shard and are fast. Queries that do not align (cross-shard queries) must fan out to all shards, aggregate the results, and are expensive. Design to make the fan-out case rare.

Diagram: Shard Key Routing

Hash-based sharding: the router hashes the user_id to pick one shard. Cross-shard queries must fan out to all shards and merge results in the application.

The Hotspot Problem

A hotspot occurs when one shard receives a disproportionate share of traffic while the others sit idle. Hotspots can arise from two sources:

Hot keys: A single key (or small set of keys) receives extremely high traffic — imagine a viral tweet by an account with 100 million followers. If all activity for that account routes to one shard, that shard is overwhelmed. Mitigation: shard at a more granular key (e.g. tweet_id rather than author_id), or add an application-level fan-out cache.
Uneven data distribution: Some shard ranges are naturally denser than others. If you shard by last name alphabetically, names starting with "S" may account for 15% of your dataset while names starting with "X" account for 0.01%. Mitigation: use hash-based sharding instead of range-based, or periodically rebalance.

Never use an auto-increment ID as a range-based shard key. All new rows will always be inserted into the last shard (the highest range). That shard becomes a permanent hot-write destination while all other shards sit idle for writes. Use a hash of the ID, a UUID, or a snowflake ID that spreads writes uniformly.

Cross-Shard Queries and Joins

In a single-database world, a join between two tables is a cheap in-process operation. In a sharded world, data related by a foreign key may live on different shards. A join becomes a distributed join: fetch records from Shard A, fetch related records from Shard B, and merge in application memory. This is expensive, latency-prone, and hard to make transactional.

The practical solution is to co-locate related data on the same shard. If orders and order-items always need to be queried together, shard both on user_id so every user's orders and order-items land on the same shard. The join is now local and cheap.

Some data — like a global product catalog or a currency conversion table — cannot be meaningfully sharded. Keep such reference data on every shard (replicated), or serve it from a shared read-only cache in front of all shards.

Consistent Hashing: Adding Shards Without Full Rehashing

The fatal flaw of naive modulo hashing (hash(key) % N) is that changing N (adding or removing shards) invalidates almost every key mapping. If you go from 4 to 5 shards, ~80% of keys map to different shards — triggering a massive data migration.

Consistent hashing solves this by arranging shards on a virtual ring. Adding a new shard only displaces the keys that were previously assigned to its immediate neighbors on the ring — typically only 1/N of total keys must move. Cassandra, Amazon DynamoDB, and most modern sharded databases use consistent hashing under the hood.

Consistent hashing places shards on a virtual ring; adding a new shard only displaces nearby keys, avoiding full data migration.

Resharding and Schema Changes

Even with consistent hashing, resharding (changing the number of shards or the shard key itself) is a complex operational event. It involves:

Running the new and old shard configurations in parallel (dual-write).
Migrating existing data in the background without downtime.
Verifying consistency between old and new shards.
Atomically cutting over read traffic to the new configuration.

Platforms like Vitess (used by YouTube) and Citus (PostgreSQL sharding extension) automate much of this process. But the fundamental lesson is: choose your shard key very carefully the first time. Changing it later is one of the most expensive operations in system design.

What Sharding Cannot Fix

Sharding helps with write throughput and dataset size, but it does not automatically solve every problem:

Cross-shard transactions: ACID guarantees across shards require distributed transactions (two-phase commit), which are slow and complex. Most sharded systems instead embrace eventual consistency for cross-shard operations.
Global aggregations: COUNT(*), SUM, or AVG across the entire dataset must be computed per shard and merged. For analytics at scale, a separate analytics database (data warehouse) is usually a better fit than querying shards directly.
Joins across entities: If orders and customers are on different shards, joining them is expensive. Co-location is the fix, but not always possible.

Defer sharding as long as possible. The added complexity is significant: you lose easy joins, ACID transactions, and simple schema migrations. Many systems reaching 500 GB–1 TB of data can use a combination of read replicas, caching, and vertical scaling of the primary for years before sharding becomes genuinely necessary. Instagram ran on a single-primary PostgreSQL cluster for two years and served millions of users. Shard when the alternatives are exhausted.

Key Takeaways

Sharding splits data across multiple independent database machines to overcome write and storage limits of a single node.
The shard key is the most important design decision — choose one that distributes load evenly and aligns with your most common queries.
Hash-based sharding spreads data evenly; range-based sharding enables efficient range queries but risks hot spots.
Consistent hashing reduces the data migration cost when adding or removing shards.
Cross-shard queries are expensive — co-locate related data on the same shard where possible.
Sharding brings significant operational complexity; delay it until read replicas and caching no longer suffice.

Next up: Lesson 9 explores Auto-Scaling and Elasticity — how cloud infrastructure can add and remove capacity automatically in response to real-time demand signals, removing the need for manual capacity planning.