Partitioning & Sharding
Partitioning & Sharding
A single database server has a ceiling. At some point — whether it is 500 GB of data, 50,000 writes per second, or a table scan that now takes 40 seconds — the machine simply cannot keep up. Partitioning and sharding are the two core techniques for breaking that ceiling by splitting data across multiple storage units so that no single node holds everything.
Understanding where and how to split data — and choosing the right partition key — is one of the highest-leverage decisions in large-scale system design. Get it wrong and you trade one bottleneck for another. Get it right and your system scales almost linearly.
Partitioning vs Sharding — The Terminology
The terms are often used interchangeably, but they have a precise meaning:
- Partitioning — splitting a table into multiple segments. The segments may live on the same physical server (within one database instance) or across different servers. PostgreSQL table partitioning, MySQL partitions, and Oracle partitions are examples of intra-server partitioning.
- Sharding — a specific form of horizontal partitioning where each partition (a shard) lives on a different server (or cluster). A 10-shard setup means 10 separate database instances, each owning roughly 1/10 of the data.
In practice: sharding is always horizontal partitioning; horizontal partitioning is not always sharding. In system design interviews and engineering discussions, "sharding" almost always means splitting data across multiple database hosts.
Horizontal vs Vertical Partitioning
Before diving into sharding strategies, understand the two axes of splitting:
- Horizontal partitioning (sharding) — split rows. Each partition holds a subset of the rows but the same columns. A users table with 1 billion rows might be split so shard 1 holds users 1–100 M, shard 2 holds 100 M–200 M, etc.
- Vertical partitioning — split columns. Move hot columns (frequently accessed) to one table and cold columns (rarely accessed blobs, audit fields) to another. This reduces row width for the hot path, improves cache efficiency, and can reduce I/O dramatically on column-store engines. It is usually applied within a single database instance.
Why Shard at All?
Sharding solves four problems that arise as a dataset and traffic grow:
- Storage limits — a single SSD can hold perhaps 30–100 TB. If your dataset is 500 TB, one machine cannot hold it.
- Write throughput limits — a primary database node can handle roughly 10,000–100,000 writes/sec depending on the workload. Twitter at peak ingested ~6,000 tweets/sec; the underlying storage had to absorb fan-out writes an order of magnitude larger.
- Lock contention — on a single node, hot rows create lock hot spots. Splitting those rows across shards means each shard handles a fraction of the contention.
- Latency under load — query execution time grows as table scans or index lookups touch more data. Smaller shards mean smaller indexes and faster scans.
Sharding Strategies
There are four main strategies for deciding which shard a given row belongs to. Each trades different properties.
1. Range-Based Partitioning
Rows are assigned to shards based on a range of a key value. For example, users with id 1–1,000,000 go to shard A; 1,000,001–2,000,000 go to shard B.
- Pro: Range queries are efficient — all data for a date range or ID range is on one shard.
- Con: Creates hot spots. If IDs are assigned sequentially (auto-increment), all new writes always land on the last shard. Similarly, time-series data makes the "today" shard a hot spot.
2. Hash-Based Partitioning
Apply a hash function to the partition key: shard = hash(user_id) % N. The distribution is near-uniform, so no single shard gets all the new traffic.
- Pro: Even load distribution, no hot spots for random-access workloads.
- Con: Range queries are expensive — data for a range of user IDs is scattered across all shards. Changing
N(adding or removing shards) requires rehashing almost all data — extremely disruptive.
3. Consistent Hashing
An advanced variant of hash-based partitioning. Place shards at points on a conceptual ring (hash space 0–2³²). Each key hashes to a point on the ring and is assigned to the nearest shard clockwise. When a shard is added or removed, only the data on the adjacent arc moves — typically 1/N of the data rather than everything.
- Used by: DynamoDB, Cassandra, Amazon S3, Chord DHT.
- Pro: Minimal data movement during shard changes; elegant rebalancing.
- Con: More complex to implement; without virtual nodes (vnodes) the distribution can be uneven.
4. Directory-Based Partitioning
A lookup service (the "directory") maintains a mapping from key to shard. The routing layer queries the directory to find the correct shard for each request.
- Pro: Maximum flexibility — you can move any key to any shard at any time without a formula change.
- Con: The directory is a single point of failure and a bottleneck. Caching the directory helps, but adds staleness risk.
Choosing a Partition Key
The partition key (also called the shard key) is the single most important decision in a sharded system. It determines data distribution, query routing, and whether cross-shard operations will be needed.
A good partition key must satisfy these criteria simultaneously:
- High cardinality — there must be enough distinct values to spread data across all shards. A boolean
is_activefield has two values; you could never have more than two shards.user_idwith millions of distinct values is far better. - Even distribution — the chosen key should not cluster most rows on a few shards. If 80% of your traffic is for VIP users and you shard by user tier, most load still hits one shard.
- Query locality — the most frequent queries should be answerable by a single shard. If your application's most common query is "give me all orders for a user", sharding by
user_idmeans each user's orders live on one shard — one network hop. Sharding byorder_dateinstead scatters a user's orders everywhere, requiring fan-out. - Avoids monotonic growth — auto-increment IDs and timestamps grow in one direction, funneling new writes to the last shard. UUID v4 or hash-derived keys avoid this.
The Hotspot Problem in Detail
A hotspot occurs when one shard receives a disproportionate share of traffic. It is the most common sharding failure mode.
Cause 1 — Famous entities: On a social platform sharded by user_id, a celebrity with 100 million followers whose posts trigger fan-out writes to 100 M follower feeds will cause enormous write pressure on whichever shard holds that celebrity's data.
Mitigation: Detect celebrities/hot keys; store their data on a dedicated shard or replicate their high-read data to all shards.
Cause 2 — Temporal hot spots: A time-series table sharded by date: today's shard absorbs all writes while every other shard is idle.
Mitigation: Use a hash of (timestamp + random_salt) as the shard key, or use a write-ahead log that distributes time-series data evenly.
Cause 3 — Skewed hash: Even with a hash key, if the keyspace distribution is inherently skewed (e.g. 90% of users live in one country and the country code is part of the key), some shards will be larger.
Mitigation: Consistent hashing with virtual nodes (vnodes) — each physical shard owns multiple small arcs on the ring, smoothing out skew.
Cross-Shard Operations — The Hard Part
Once data is spread across shards, operations that touch multiple shards become expensive:
- Cross-shard queries — a
SELECTthat spans multiple shards must be issued to each shard in parallel and the results merged in the application layer (scatter-gather). This adds latency and load. - Cross-shard joins — joins across shards are not supported by the database engine; they must be done in the application. This is why sharding design emphasises query locality.
- Cross-shard transactions — true ACID transactions across shards require a distributed transaction protocol (2PC — two-phase commit), which is slow and introduces coordinator failure risk. Most sharded systems avoid distributed transactions by redesigning the data model to keep related rows on the same shard.
Rebalancing Shards
As data grows or as you add/remove shard nodes, you need to move data between shards. This is called rebalancing and it is operationally challenging:
- With simple
hash % N, adding one shard changes the formula for nearly all keys — almost every row needs to move. - With consistent hashing, only the keys on the arc adjacent to the new node migrate — roughly
1/Nof data. - During rebalancing, reads and writes must still be served. The common pattern is: start migrating data in the background, use a routing layer that reads from both old and new locations until migration is confirmed, then switch all traffic.
Sharding in the Wild
- MySQL (Vitess) — YouTube built Vitess to shard MySQL horizontally. Each shard is a standard MySQL instance; Vitess handles routing, rebalancing, and cross-shard scatter-gather. Now open-source and used by GitHub, PlanetScale, and others.
- Cassandra — uses consistent hashing with vnodes natively. Each node owns multiple token ranges on the ring. Replication factor controls how many nodes hold a copy of each token range.
- MongoDB — shard clusters distribute collections by a shard key. The
mongosrouter handles scatter-gather for queries that cannot be routed to a single shard. - DynamoDB — AWS handles sharding transparently. Your partition key choice still matters for avoiding hot partitions, but the infrastructure manages rebalancing automatically.
Summary
- Horizontal partitioning (sharding) splits rows across multiple database nodes, enabling storage and write throughput to grow beyond a single machine.
- The four main strategies are: range (simple, hot spots), hash (even load, costly resize), consistent hashing (minimal rebalancing, used in Cassandra/DynamoDB), and directory-based (flexible, adds lookup overhead).
- Choosing a good partition key requires high cardinality, even distribution, query locality, and avoidance of monotonic growth.
- Cross-shard operations (joins, transactions, range queries) are expensive and should be minimised by design.
- Shard only when necessary — the operational complexity is high; exhaust vertical scaling, replicas, caching, and indexing first.