Real-World System Design Case Studies

Design a Ride-Sharing System

18 min Lesson 7 of 10

Design a Ride-Sharing System

Uber, Lyft, and DiDi process hundreds of thousands of trip requests per minute, match riders to drivers in under two seconds, and track millions of moving vehicles in real time. At the core of every ride-sharing platform are two uniquely hard sub-problems: geospatial indexing (where are all my drivers right now?) and real-time matching (which driver should I assign to this rider?). This lesson dissects both.

Requirements and Scale

A back-of-the-envelope sketch before touching architecture:

Daily active riders: 10 million
Daily active drivers: 1 million — all sending GPS pings every 4 seconds
Peak ride requests: ~100,000 per minute (~1,700 rps)
Driver location writes: 1,000,000 / 4 = 250,000 writes/second (the hardest number)
Match latency SLA: under 2 seconds end-to-end
Geo precision: 10–50 m accuracy is sufficient for matching; street-level turn-by-turn navigation is handled by a separate mapping service

250k location writes per second is the design pressure point. A traditional RDBMS cannot absorb this. Every architectural choice flows from this constraint.

Geospatial Indexing — The Core Problem

When a rider requests a trip at coordinates (lat, lng), the system must answer: find all drivers within 5 km, sorted by ETA. Scanning every driver record in a database is O(n) over 1 million rows — far too slow. You need a spatial index.

Approach 1 — Geohash

A geohash encodes a (lat, lng) pair into a short alphanumeric string. The encoding is a Z-order (Morton) space-filling curve that interleaves the bits of latitude and longitude. The key property: longer common prefix = closer geographic proximity.

Geohash length 6 → ~1.2 km × 0.6 km cell
Geohash length 7 → ~153 m × 153 m cell
Geohash length 8 → ~38 m × 19 m cell

A driver at (37.7749, -122.4194) (downtown San Francisco) hashes to 9q8yy at precision 5. Every driver updating their location writes their geohash to Redis as a key in a hash map: HSET drivers:geo <driver_id> "9q8yy4n". A proximity query fetches the 9 neighbouring geohash cells (the cell itself plus the 8 surrounding cells) and scans only those buckets — typically a few hundred entries, not a million.

Edge case: geohash cells are rectangular and their neighbours can be non-contiguous near the antimeridian and poles. The correct solution is always to query the cell plus its 8 neighbours.

Approach 2 — Quadtree

A quadtree recursively subdivides the map into four quadrants. Each leaf node holds a list of drivers in that cell. Cells with many drivers are subdivided further; sparse cells stay large. This gives an adaptive index — the density of nodes mirrors the density of drivers.

Uber's older architecture used a custom in-memory quadtree service (internally called the Dispatch system). It kept the entire active-driver set in RAM across a cluster of servers, allowing sub-millisecond spatial lookups. As the quadtree is mutable (drivers move), updates must propagate quickly — this is where write amplification becomes a concern at scale.

Approach 3 — Redis GEO

Redis provides a native GEO API (commands GEOADD, GEOSEARCH) backed by a sorted set whose score is a 52-bit geohash integer. This is the pragmatic modern choice:

GEOADD drivers 37.7749 -122.4194 "driver:42" — update position in O(log n)
GEOSEARCH drivers FROMMEMBER "rider:9" BYRADIUS 5 km ASC COUNT 10 — nearest 10 drivers within 5 km in O(n+log n)
Redis Cluster shards the sorted set horizontally if needed

Left: a geohash grid showing a rider's cell and the 8 neighbouring cells that must be queried. Right: how Redis stores drivers in a sorted set scored by geohash integer.

High-Level Architecture

The ride-sharing platform decomposes into several independent services:

Location Service — ingests 250k GPS pings/sec from driver apps, writes to Redis GEO (primary) and a time-series store (for trip replay/analytics).
Matching Service — stateless workers that, on a ride request, query Redis for nearby drivers, rank them by ETA (calling the ETA Service), and assign the best candidate.
Trip Service — owns the trip lifecycle state machine (REQUESTED → ACCEPTED → EN_ROUTE → ARRIVED → IN_TRIP → COMPLETED) and persists to a relational DB (Postgres/MySQL with sharding by city).
ETA Service — wraps a routing engine (OSRM, Google Maps, or a custom graph engine) to compute drive time between two points.
Notification Service — pushes real-time updates to rider and driver apps via WebSocket / APNS / FCM.
Payment Service — handles fare calculation and charges at trip completion.

High-level ride-sharing architecture. The Location Service absorbs 250k GPS writes/sec into Redis GEO. The Matching Service reads from Redis, ranks by ETA, and assigns the best driver via the Trip Service.

Rider-Driver Matching Algorithm

Matching is not simply "nearest driver". The real ranking function weighs several signals:

ETA to pickup — the primary signal. A driver 1.2 km away but stuck in traffic is worse than one 2 km away on a clear road.
Driver rating — penalise low-rated drivers slightly to protect rider experience.
Trip direction alignment — a driver already heading toward the rider's general zone incurs lower idle deadhead cost.
Surge fairness — distribute high-multiplier trips to drivers who have waited longest to avoid starvation.

The matching loop runs like this:

Rider requests trip at (lat, lng).
Matching Service queries Redis: GEOSEARCH drivers BYRADIUS 5 km ASC COUNT 30 → returns up to 30 candidates.
Batch-call ETA Service with all 30 candidates in parallel (fan-out).
Score each candidate. Select top candidate.
Send offer to driver via the Notification Service. Driver has 10 seconds to accept.
If declined or timed out, select the next candidate and repeat.

Batch ETA calls: issuing 30 ETA requests in parallel (fan-out) instead of sequentially cuts latency from ~30 × 50 ms = 1,500 ms down to ~50–100 ms (one round trip). This is the key trick that keeps matching under 2 seconds.

Handling Location Write Throughput

250,000 Redis GEOADD writes per second exceed what a single Redis node can handle (a single Redis node tops out around 100k–200k simple ops/sec under real conditions). Solutions:

Redis Cluster with geographic sharding — partition the keyspace by city or region. Each shard handles one city's drivers, cutting per-node load dramatically.
Client-side batching — the driver app sends one ping every 4 seconds; the Location Service batches pings from multiple drivers and issues GEOADD in pipelines of 100–500 commands, amortising network round-trip cost.
Write-through buffer — put a small in-memory buffer (Kafka topic or a ring buffer) in front of Redis to absorb bursts during sudden traffic spikes without dropping updates.

Do not use a relational DB as the primary location store. Even with a PostGIS extension, a Postgres instance cannot absorb 250k updates/sec at low latency without massive vertical scaling. Redis is purpose-built for high-frequency, small-payload, keyed updates.

Consistency and Failure Handling

Location data is inherently stale — by the time a match decision is made, the driver may have moved. The system accepts eventual consistency on location. What must be strongly consistent is the trip assignment (you cannot assign the same driver to two riders). The Trip Service uses an optimistic lock (a CAS update on the driver status column) or a distributed lock (Redis SETNX on driver:<id>:lock) to prevent double-assignment.

If the Location Service crashes, drivers' Redis entries expire via a TTL (set to ~10 seconds). Expired entries are automatically excluded from GEOSEARCH results, so a rider will never be matched to a driver whose app has disconnected.

Summary: Key Design Decisions

Use Redis GEO (or geohash + Redis) for all driver location reads and writes — it is the only practical option at 250k writes/sec.
The matching service is stateless — horizontal scale by adding workers behind a load balancer.
Use fan-out ETA calls in the matching loop to meet the 2-second SLA.
Shard Redis by city/region to keep per-node write rate manageable.
Use a TTL on driver location keys to automatically remove disconnected drivers from the available pool.
Protect driver assignment with a distributed lock or optimistic concurrency control — location can be eventually consistent, assignment cannot.