Databases & Storage

Partitioning & Sharding

18 min Lesson 6 of 10

Partitioning & Sharding

A single database server has a ceiling. At some point — whether it is 500 GB of data, 50,000 writes per second, or a table scan that now takes 40 seconds — the machine simply cannot keep up. Partitioning and sharding are the two core techniques for breaking that ceiling by splitting data across multiple storage units so that no single node holds everything.

Understanding where and how to split data — and choosing the right partition key — is one of the highest-leverage decisions in large-scale system design. Get it wrong and you trade one bottleneck for another. Get it right and your system scales almost linearly.

Partitioning vs Sharding — The Terminology

The terms are often used interchangeably, but they have a precise meaning:

Partitioning — splitting a table into multiple segments. The segments may live on the same physical server (within one database instance) or across different servers. PostgreSQL table partitioning, MySQL partitions, and Oracle partitions are examples of intra-server partitioning.
Sharding — a specific form of horizontal partitioning where each partition (a shard) lives on a different server (or cluster). A 10-shard setup means 10 separate database instances, each owning roughly 1/10 of the data.

In practice: sharding is always horizontal partitioning; horizontal partitioning is not always sharding. In system design interviews and engineering discussions, "sharding" almost always means splitting data across multiple database hosts.

Horizontal vs Vertical Partitioning

Before diving into sharding strategies, understand the two axes of splitting:

Horizontal partitioning (sharding) — split rows. Each partition holds a subset of the rows but the same columns. A users table with 1 billion rows might be split so shard 1 holds users 1–100 M, shard 2 holds 100 M–200 M, etc.
Vertical partitioning — split columns. Move hot columns (frequently accessed) to one table and cold columns (rarely accessed blobs, audit fields) to another. This reduces row width for the hot path, improves cache efficiency, and can reduce I/O dramatically on column-store engines. It is usually applied within a single database instance.

This lesson focuses on horizontal partitioning / sharding because that is the technique that enables a system to scale writes and storage beyond the capacity of any single machine.

Why Shard at All?

Sharding solves four problems that arise as a dataset and traffic grow:

Storage limits — a single SSD can hold perhaps 30–100 TB. If your dataset is 500 TB, one machine cannot hold it.
Write throughput limits — a primary database node can handle roughly 10,000–100,000 writes/sec depending on the workload. Twitter at peak ingested ~6,000 tweets/sec; the underlying storage had to absorb fan-out writes an order of magnitude larger.
Lock contention — on a single node, hot rows create lock hot spots. Splitting those rows across shards means each shard handles a fraction of the contention.
Latency under load — query execution time grows as table scans or index lookups touch more data. Smaller shards mean smaller indexes and faster scans.

Sharding is a last resort, not a first step. It adds enormous operational complexity: cross-shard queries, distributed transactions, rebalancing, and schema changes all become hard problems. Exhaust every other option first — vertical scaling, read replicas, caching, query optimisation, and better indexes — before you reach for sharding.

Sharding Strategies

There are four main strategies for deciding which shard a given row belongs to. Each trades different properties.

1. Range-Based Partitioning

Rows are assigned to shards based on a range of a key value. For example, users with id 1–1,000,000 go to shard A; 1,000,001–2,000,000 go to shard B.

Pro: Range queries are efficient — all data for a date range or ID range is on one shard.
Con: Creates hot spots. If IDs are assigned sequentially (auto-increment), all new writes always land on the last shard. Similarly, time-series data makes the "today" shard a hot spot.

2. Hash-Based Partitioning

Apply a hash function to the partition key: shard = hash(user_id) % N. The distribution is near-uniform, so no single shard gets all the new traffic.

Pro: Even load distribution, no hot spots for random-access workloads.
Con: Range queries are expensive — data for a range of user IDs is scattered across all shards. Changing N (adding or removing shards) requires rehashing almost all data — extremely disruptive.

3. Consistent Hashing

An advanced variant of hash-based partitioning. Place shards at points on a conceptual ring (hash space 0–2³²). Each key hashes to a point on the ring and is assigned to the nearest shard clockwise. When a shard is added or removed, only the data on the adjacent arc moves — typically 1/N of the data rather than everything.

Used by: DynamoDB, Cassandra, Amazon S3, Chord DHT.
Pro: Minimal data movement during shard changes; elegant rebalancing.
Con: More complex to implement; without virtual nodes (vnodes) the distribution can be uneven.

4. Directory-Based Partitioning

A lookup service (the "directory") maintains a mapping from key to shard. The routing layer queries the directory to find the correct shard for each request.

Pro: Maximum flexibility — you can move any key to any shard at any time without a formula change.
Con: The directory is a single point of failure and a bottleneck. Caching the directory helps, but adds staleness risk.

Three sharding strategies side by side: range partitioning (simple but prone to hot spots), hash partitioning (even distribution but costly to resize), and consistent hashing (minimal rebalancing, used in production distributed systems).

Choosing a Partition Key

The partition key (also called the shard key) is the single most important decision in a sharded system. It determines data distribution, query routing, and whether cross-shard operations will be needed.

A good partition key must satisfy these criteria simultaneously:

High cardinality — there must be enough distinct values to spread data across all shards. A boolean is_active field has two values; you could never have more than two shards. user_id with millions of distinct values is far better.
Even distribution — the chosen key should not cluster most rows on a few shards. If 80% of your traffic is for VIP users and you shard by user tier, most load still hits one shard.
Query locality — the most frequent queries should be answerable by a single shard. If your application's most common query is "give me all orders for a user", sharding by user_id means each user's orders live on one shard — one network hop. Sharding by order_date instead scatters a user's orders everywhere, requiring fan-out.
Avoids monotonic growth — auto-increment IDs and timestamps grow in one direction, funneling new writes to the last shard. UUID v4 or hash-derived keys avoid this.

Real-world example — Instagram: Instagram shards its media table by a media ID that embeds the creation timestamp and shard ID in the top bits of a 64-bit integer (a technique similar to Twitter Snowflake IDs). This gives them: (a) time-sortable IDs without a central sequence generator, (b) even distribution because the time component keeps filling all shards, and (c) immediate routing — the shard is encoded in the ID, so no lookup is needed.

The Hotspot Problem in Detail

A hotspot occurs when one shard receives a disproportionate share of traffic. It is the most common sharding failure mode.

Cause 1 — Famous entities: On a social platform sharded by user_id, a celebrity with 100 million followers whose posts trigger fan-out writes to 100 M follower feeds will cause enormous write pressure on whichever shard holds that celebrity's data.

Mitigation: Detect celebrities/hot keys; store their data on a dedicated shard or replicate their high-read data to all shards.

Cause 2 — Temporal hot spots: A time-series table sharded by date: today's shard absorbs all writes while every other shard is idle.

Mitigation: Use a hash of (timestamp + random_salt) as the shard key, or use a write-ahead log that distributes time-series data evenly.

Cause 3 — Skewed hash: Even with a hash key, if the keyspace distribution is inherently skewed (e.g. 90% of users live in one country and the country code is part of the key), some shards will be larger.

Mitigation: Consistent hashing with virtual nodes (vnodes) — each physical shard owns multiple small arcs on the ring, smoothing out skew.

Cross-Shard Operations — The Hard Part

Once data is spread across shards, operations that touch multiple shards become expensive:

Cross-shard queries — a SELECT that spans multiple shards must be issued to each shard in parallel and the results merged in the application layer (scatter-gather). This adds latency and load.
Cross-shard joins — joins across shards are not supported by the database engine; they must be done in the application. This is why sharding design emphasises query locality.
Cross-shard transactions — true ACID transactions across shards require a distributed transaction protocol (2PC — two-phase commit), which is slow and introduces coordinator failure risk. Most sharded systems avoid distributed transactions by redesigning the data model to keep related rows on the same shard.

Single-shard lookup (left) routes directly using the partition key — fast, one hop. Scatter-gather (right) fans out to every shard when no partition key is available — slow and expensive at scale.

Rebalancing Shards

As data grows or as you add/remove shard nodes, you need to move data between shards. This is called rebalancing and it is operationally challenging:

With simple hash % N, adding one shard changes the formula for nearly all keys — almost every row needs to move.
With consistent hashing, only the keys on the arc adjacent to the new node migrate — roughly 1/N of data.
During rebalancing, reads and writes must still be served. The common pattern is: start migrating data in the background, use a routing layer that reads from both old and new locations until migration is confirmed, then switch all traffic.

Sharding in the Wild

MySQL (Vitess) — YouTube built Vitess to shard MySQL horizontally. Each shard is a standard MySQL instance; Vitess handles routing, rebalancing, and cross-shard scatter-gather. Now open-source and used by GitHub, PlanetScale, and others.
Cassandra — uses consistent hashing with vnodes natively. Each node owns multiple token ranges on the ring. Replication factor controls how many nodes hold a copy of each token range.
MongoDB — shard clusters distribute collections by a shard key. The mongos router handles scatter-gather for queries that cannot be routed to a single shard.
DynamoDB — AWS handles sharding transparently. Your partition key choice still matters for avoiding hot partitions, but the infrastructure manages rebalancing automatically.

For most applications, you do not need to shard your own database. Managed databases like Aurora, Cloud Spanner, and CockroachDB provide horizontal scalability while hiding the sharding details. The goal of this lesson is to understand the mechanics so you can reason about trade-offs, design partition keys correctly, and interpret system design discussions intelligently — not necessarily to implement raw sharding from scratch.

Summary

Horizontal partitioning (sharding) splits rows across multiple database nodes, enabling storage and write throughput to grow beyond a single machine.
The four main strategies are: range (simple, hot spots), hash (even load, costly resize), consistent hashing (minimal rebalancing, used in Cassandra/DynamoDB), and directory-based (flexible, adds lookup overhead).
Choosing a good partition key requires high cardinality, even distribution, query locality, and avoidance of monotonic growth.
Cross-shard operations (joins, transactions, range queries) are expensive and should be minimised by design.
Shard only when necessary — the operational complexity is high; exhaust vertical scaling, replicas, caching, and indexing first.