Project: Design the Data Layer
Project: Design the Data Layer
Everything in this tutorial — SQL vs NoSQL, ACID guarantees, indexing strategies, replication topologies, sharding approaches, denormalization trade-offs, blob storage, and database selection frameworks — converges in a single practical skill: designing a data layer for a real system from scratch. This capstone lesson walks through that process end-to-end for a concrete, non-trivial system, showing the reasoning behind every decision.
The System: a Ride-Hailing Platform
We will design the data layer for a service similar to Uber or Lyft. It has five major subsystems, each with distinct data characteristics:
- User & Driver Accounts — registration, profiles, authentication, document verification
- Trip Lifecycle — booking, matching, routing, fare calculation, completion, history
- Real-Time Location — driver GPS positions updated every 3–5 seconds for millions of active drivers
- Payments & Billing — charges, refunds, driver payouts, invoice history
- Analytics & Reporting — business dashboards, fraud detection, supply/demand heatmaps
Scale targets: 10 million daily active riders, 2 million active drivers, 5 million trips per day, peak 50,000 concurrent active trips.
Step 1 — Identify Access Patterns Before Picking Storage
The single most important step is to enumerate the dominant read and write patterns for each subsystem before touching a database name. Storage technology should follow access patterns, not the other way around.
- Accounts: Read by user ID on every authenticated request; write on registration/update. Strongly consistent. Few writes, many reads.
- Trips: Created on booking; updated multiple times during the ride (matching, start, end, rating); queried by rider/driver ID for history. Multi-step state machine — atomicity matters.
- Locations: Write-heavy — 2 million drivers × 1 update/4 seconds = ~500,000 writes/second at peak. Queries are spatial: "find all drivers within 2 km of (lat, lng)". Data is ephemeral — we only care about the current position, not history.
- Payments: Double-entry ledger. Every charge must be paired with a debit; partial failures are catastrophic. Requires the strongest consistency guarantees.
- Analytics: Heavy aggregations over large historical datasets. Latency tolerance is high (seconds to minutes). Query patterns are exploratory and unpredictable.
Step 2 — Assign Each Subsystem to the Right Storage Technology
With access patterns clear, we can now match each subsystem to the storage technology it genuinely needs:
- Accounts → PostgreSQL (primary + read replicas). Structured data, stable schema, complex queries (JOIN users → documents → preferences), strong consistency for auth flows. A single primary easily handles account writes at this scale. Add 2–3 read replicas to distribute profile lookups.
- Trips → PostgreSQL (same cluster, or a dedicated cluster for isolation). Trips are relational (riders, drivers, routes, fares, ratings all reference each other), require multi-row atomic updates (state machine transitions), and benefit from rich SQL queries for dispute resolution and support. Shard by
city_idorrider_idif write volume outgrows a single primary — at 5 million trips/day that is ~58 writes/second, well within a single PostgreSQL primary. - Real-Time Locations → Redis (cluster mode, geo commands). Redis
GEOADD/GEORADIUSprovide sub-millisecond geospatial queries. In cluster mode across 6 nodes (3 primaries, 3 replicas), Redis handles 1+ million writes/second. Positions are stored with a TTL of ~30 seconds — if a driver stops sending updates, their record expires automatically. This is purely an ephemeral cache; no durability needed. - Payments → PostgreSQL with strict serializable isolation. A double-entry accounting ledger. Every row in the
ledgertable represents one side of a transaction. A payment triggers two rows atomically (debit rider, credit driver escrow). Serializable isolation prevents phantom reads. Consider a dedicated PostgreSQL cluster isolated from the trip database so a heavy analytical query cannot starve payment writes. - Analytics → Amazon S3 + Apache Parquet + Query Engine (Athena / BigQuery / Redshift). Trip and payment records are streamed to S3 in Parquet format via a change-data-capture (CDC) pipeline. Analysts query with Athena or BigQuery — pay-per-query, no warehouse to manage. For real-time dashboards (fraud signals, supply heatmaps) add Apache Kafka → Apache Flink → a time-series store (InfluxDB or TimescaleDB).
- Driver Documents & Profile Photos → Object Storage (S3 / GCS). Immutable binary blobs; accessed infrequently after upload; CDN-distributed for fast global download. Store only the S3 key in PostgreSQL, never the binary itself in the database.
Step 3 — Model the Core Schemas
With storage technologies chosen, we model the two most critical relational schemas: Trips and Payments.
Trips schema (PostgreSQL):
Payments schema (PostgreSQL, separate cluster):
SELECT * FROM trips WHERE status = 'active' runs constantly (matching engine, ETA calculations). At any moment only ~50,000 of 5 million daily trips are active. A partial index — WHERE status NOT IN ('completed', 'cancelled') — covers only those rows. The index stays tiny (fits in L2 cache on the DB server), making driver-assignment queries near-instant regardless of total trip volume.
Step 4 — Design the Replication and Sharding Strategy
At 5 million trips/day (~58 writes/second), a single PostgreSQL primary handles trips comfortably. But we plan ahead:
- Replication first: Deploy 2 synchronous streaming replicas for the trip and accounts clusters. Replicas absorb all read traffic (history queries, support lookups, analytics exports). Automatic failover with Patroni ensures the primary is replaced within 30 seconds if it fails.
- Sharding trigger: If write throughput consistently exceeds 5,000 writes/second on the trips table, shard by
city_id. Each city cluster is an independent primary + 2 replicas. Rides in Lagos never touch the same database nodes as rides in London. Cross-city queries (global dashboards) use the analytics pipeline, not the OLTP clusters. - Redis geo cluster: Already horizontally partitioned by default. Use 6 nodes (3 primary shards, 3 replicas). Location keys are sharded by driver ID hash across the three primary shards automatically by the cluster.
- Object storage (S3): Inherently distributed; no sharding needed. Enable versioning for document audit trail; lifecycle rules move files older than 90 days to S3 Glacier Instant Retrieval to cut storage costs by ~70%.
Step 5 — Apply Denormalization Strategically
Our schemas are fully normalized — good for correctness, but some read paths benefit from selective denormalization:
- Denormalize fare_cents into the trips table (already done above). Calculating fare from route data on every trip-history read would require a join to a fare_rules table and recomputation. Storing it once at trip completion makes the rider history screen a simple single-row fetch.
- Rider trip count on the users table: A
total_trips INTcolumn, incremented on trip completion, avoids aCOUNT(*)aggregate on the trips table for every profile load. Accept eventual consistency — if the increment fails, a nightly reconciliation job fixes it. - Do NOT denormalize driver name into the trips table. Driver profiles change (legal name corrections). Joining at read time keeps trip history consistent with the current profile. The join is cheap: one row by primary key.
Step 6 — Handle the Analytics Pipeline Separately
A critical principle: never run heavy analytics on your OLTP database. A 10-minute dashboard query scanning 100 million trip rows will starve your matching engine of I/O, causing trip assignments to time out for real users.
The correct architecture separates the two planes entirely:
- Change-Data Capture (CDC): Debezium reads the PostgreSQL WAL and publishes every row change to Kafka topics (
trips.changes,payments.changes) within seconds, with zero load on the primary. - Batch path: A Kafka consumer (Spark Structured Streaming or simple Kafka Connect S3 Sink) writes Parquet files to S3, partitioned by
city/date/hour. Amazon Athena or BigQuery can query petabytes of this data for $5 per TB scanned. - Real-time path: Apache Flink consumes the Kafka stream and maintains rolling aggregations (trips per minute, revenue per hour, active driver count by city) in TimescaleDB. Dashboards query TimescaleDB with sub-second latency.
Step 7 — Validate the Design Against Requirements
Always close the loop by checking the design against your original scale targets:
- 50,000 concurrent active trips: The partial index on
trips(status)covers only ~50k rows. A matching-engine query scanning active trips completes in microseconds. - 500,000 location writes/second: Redis Cluster with 3 primary shards handles this comfortably (~167k writes/shard). Benchmark showed a 6-node cluster sustaining 800k ops/second in production at Uber.
- Payment correctness: Serializable PostgreSQL transactions prevent double-charges. The ledger model makes any imbalance detectable with a
SUM(amount_cents) GROUP BY account_idreconciliation query. - Analytics isolation: CDC → Kafka → S3 means no analytical query ever touches the OLTP primaries.
- Failure resilience: Every tier has a replica. PostgreSQL fails over in under 30 seconds. Redis Cluster promotes a replica automatically. S3 is 11-nines durable. No single point of failure.
Summary: The Design Process
- Enumerate access patterns per subsystem before naming any technology.
- Match storage to pattern: relational → ACID/joins, key-value/geo → speed, blob → immutable files, column/time-series → analytics.
- Model schemas with correctness first; add denormalization only where a specific read path demands it and you have a clear update contract.
- Plan replication (synchronous replicas for HA) before sharding; shard only when write throughput exceeds a measured single-primary limit.
- Isolate analytics from OLTP via CDC/Kafka. Never run heavy aggregates on your production primary.
- Validate every design decision against the original scale targets. Traceability from requirement to decision is what makes a design defensible in an interview or an architecture review.