Databases & Storage

Project: Design the Data Layer

18 min Lesson 10 of 10

Project: Design the Data Layer

Everything in this tutorial — SQL vs NoSQL, ACID guarantees, indexing strategies, replication topologies, sharding approaches, denormalization trade-offs, blob storage, and database selection frameworks — converges in a single practical skill: designing a data layer for a real system from scratch. This capstone lesson walks through that process end-to-end for a concrete, non-trivial system, showing the reasoning behind every decision.

The System: a Ride-Hailing Platform

We will design the data layer for a service similar to Uber or Lyft. It has five major subsystems, each with distinct data characteristics:

User & Driver Accounts — registration, profiles, authentication, document verification
Trip Lifecycle — booking, matching, routing, fare calculation, completion, history
Real-Time Location — driver GPS positions updated every 3–5 seconds for millions of active drivers
Payments & Billing — charges, refunds, driver payouts, invoice history
Analytics & Reporting — business dashboards, fraud detection, supply/demand heatmaps

Scale targets: 10 million daily active riders, 2 million active drivers, 5 million trips per day, peak 50,000 concurrent active trips.

Step 1 — Identify Access Patterns Before Picking Storage

The single most important step is to enumerate the dominant read and write patterns for each subsystem before touching a database name. Storage technology should follow access patterns, not the other way around.

Accounts: Read by user ID on every authenticated request; write on registration/update. Strongly consistent. Few writes, many reads.
Trips: Created on booking; updated multiple times during the ride (matching, start, end, rating); queried by rider/driver ID for history. Multi-step state machine — atomicity matters.
Locations: Write-heavy — 2 million drivers × 1 update/4 seconds = ~500,000 writes/second at peak. Queries are spatial: "find all drivers within 2 km of (lat, lng)". Data is ephemeral — we only care about the current position, not history.
Payments: Double-entry ledger. Every charge must be paired with a debit; partial failures are catastrophic. Requires the strongest consistency guarantees.
Analytics: Heavy aggregations over large historical datasets. Latency tolerance is high (seconds to minutes). Query patterns are exploratory and unpredictable.

Write out your access patterns as concrete sentences before designing any schema. "Find all active drivers within 2 km" is a spatial query — it immediately tells you to consider geospatial indexing or a specialised geo-store. "Deduct balance and record ledger atomically" tells you ACID is non-negotiable. Most storage mistakes happen because engineers pick a database first and then try to make access patterns fit.

Step 2 — Assign Each Subsystem to the Right Storage Technology

With access patterns clear, we can now match each subsystem to the storage technology it genuinely needs:

Accounts → PostgreSQL (primary + read replicas). Structured data, stable schema, complex queries (JOIN users → documents → preferences), strong consistency for auth flows. A single primary easily handles account writes at this scale. Add 2–3 read replicas to distribute profile lookups.
Trips → PostgreSQL (same cluster, or a dedicated cluster for isolation). Trips are relational (riders, drivers, routes, fares, ratings all reference each other), require multi-row atomic updates (state machine transitions), and benefit from rich SQL queries for dispute resolution and support. Shard by city_id or rider_id if write volume outgrows a single primary — at 5 million trips/day that is ~58 writes/second, well within a single PostgreSQL primary.
Real-Time Locations → Redis (cluster mode, geo commands). Redis GEOADD / GEORADIUS provide sub-millisecond geospatial queries. In cluster mode across 6 nodes (3 primaries, 3 replicas), Redis handles 1+ million writes/second. Positions are stored with a TTL of ~30 seconds — if a driver stops sending updates, their record expires automatically. This is purely an ephemeral cache; no durability needed.
Payments → PostgreSQL with strict serializable isolation. A double-entry accounting ledger. Every row in the ledger table represents one side of a transaction. A payment triggers two rows atomically (debit rider, credit driver escrow). Serializable isolation prevents phantom reads. Consider a dedicated PostgreSQL cluster isolated from the trip database so a heavy analytical query cannot starve payment writes.
Analytics → Amazon S3 + Apache Parquet + Query Engine (Athena / BigQuery / Redshift). Trip and payment records are streamed to S3 in Parquet format via a change-data-capture (CDC) pipeline. Analysts query with Athena or BigQuery — pay-per-query, no warehouse to manage. For real-time dashboards (fraud signals, supply heatmaps) add Apache Kafka → Apache Flink → a time-series store (InfluxDB or TimescaleDB).
Driver Documents & Profile Photos → Object Storage (S3 / GCS). Immutable binary blobs; accessed infrequently after upload; CDN-distributed for fast global download. Store only the S3 key in PostgreSQL, never the binary itself in the database.

Complete data layer architecture for the ride-hailing platform: each subsystem uses the storage technology matched to its access pattern.

Step 3 — Model the Core Schemas

With storage technologies chosen, we model the two most critical relational schemas: Trips and Payments.

Trips schema (PostgreSQL):

-- Core entities
users        (id, email, phone, name, created_at)
drivers      (id, user_id FK, vehicle_id FK, status, rating, created_at)
vehicles     (id, driver_id FK, plate, model, year)

-- Trip lifecycle
trips (
  id            BIGSERIAL PRIMARY KEY,
  rider_id      BIGINT REFERENCES users(id),
  driver_id     BIGINT REFERENCES drivers(id),
  status        TEXT CHECK(status IN ('requested','matching','active','completed','cancelled')),
  pickup_lat    DECIMAL(9,6),  pickup_lng DECIMAL(9,6),
  dropoff_lat   DECIMAL(9,6),  dropoff_lng DECIMAL(9,6),
  fare_cents    INT,
  distance_m    INT,
  requested_at  TIMESTAMPTZ,
  started_at    TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ
)

-- Indexes
CREATE INDEX idx_trips_rider_id    ON trips(rider_id, completed_at DESC);
CREATE INDEX idx_trips_driver_id   ON trips(driver_id, completed_at DESC);
CREATE INDEX idx_trips_status      ON trips(status) WHERE status NOT IN ('completed','cancelled');
-- Partial index on active trips only — dramatically smaller, fits in RAM

Payments schema (PostgreSQL, separate cluster):

-- Double-entry ledger
ledger_entries (
  id          BIGSERIAL PRIMARY KEY,
  trip_id     BIGINT NOT NULL,
  account_id  BIGINT NOT NULL,       -- rider or driver account
  direction   TEXT CHECK(direction IN ('debit','credit')),
  amount_cents INT NOT NULL,
  currency    CHAR(3) DEFAULT 'USD',
  created_at  TIMESTAMPTZ DEFAULT now()
)

-- Every payment inserts TWO rows atomically (rider debit + driver credit)
-- Balance = SUM(credits) - SUM(debits) for an account_id
-- Transaction isolation: SERIALIZABLE to prevent phantom reads on balance checks

Partial indexes are a powerful optimization. The query SELECT * FROM trips WHERE status = 'active' runs constantly (matching engine, ETA calculations). At any moment only ~50,000 of 5 million daily trips are active. A partial index — WHERE status NOT IN ('completed', 'cancelled') — covers only those rows. The index stays tiny (fits in L2 cache on the DB server), making driver-assignment queries near-instant regardless of total trip volume.

Step 4 — Design the Replication and Sharding Strategy

At 5 million trips/day (~58 writes/second), a single PostgreSQL primary handles trips comfortably. But we plan ahead:

Replication first: Deploy 2 synchronous streaming replicas for the trip and accounts clusters. Replicas absorb all read traffic (history queries, support lookups, analytics exports). Automatic failover with Patroni ensures the primary is replaced within 30 seconds if it fails.
Sharding trigger: If write throughput consistently exceeds 5,000 writes/second on the trips table, shard by city_id. Each city cluster is an independent primary + 2 replicas. Rides in Lagos never touch the same database nodes as rides in London. Cross-city queries (global dashboards) use the analytics pipeline, not the OLTP clusters.
Redis geo cluster: Already horizontally partitioned by default. Use 6 nodes (3 primary shards, 3 replicas). Location keys are sharded by driver ID hash across the three primary shards automatically by the cluster.
Object storage (S3): Inherently distributed; no sharding needed. Enable versioning for document audit trail; lifecycle rules move files older than 90 days to S3 Glacier Instant Retrieval to cut storage costs by ~70%.

Step 5 — Apply Denormalization Strategically

Our schemas are fully normalized — good for correctness, but some read paths benefit from selective denormalization:

Denormalize fare_cents into the trips table (already done above). Calculating fare from route data on every trip-history read would require a join to a fare_rules table and recomputation. Storing it once at trip completion makes the rider history screen a simple single-row fetch.
Rider trip count on the users table: A total_trips INT column, incremented on trip completion, avoids a COUNT(*) aggregate on the trips table for every profile load. Accept eventual consistency — if the increment fails, a nightly reconciliation job fixes it.
Do NOT denormalize driver name into the trips table. Driver profiles change (legal name corrections). Joining at read time keeps trip history consistent with the current profile. The join is cheap: one row by primary key.

Denormalization creates write-time obligations. Every denormalized field (fare_cents, total_trips) must be updated in the same transaction as the source data — or via a reliable asynchronous process with a reconciliation job. If you denormalize without a clear update contract, you will have stale data in production within weeks.

Left: PostgreSQL primary with two synchronous replicas and Patroni failover for transactional data. Right: Redis Cluster with three geo-shards and replicas for real-time driver location.

Step 6 — Handle the Analytics Pipeline Separately

A critical principle: never run heavy analytics on your OLTP database. A 10-minute dashboard query scanning 100 million trip rows will starve your matching engine of I/O, causing trip assignments to time out for real users.

The correct architecture separates the two planes entirely:

Change-Data Capture (CDC): Debezium reads the PostgreSQL WAL and publishes every row change to Kafka topics (trips.changes, payments.changes) within seconds, with zero load on the primary.
Batch path: A Kafka consumer (Spark Structured Streaming or simple Kafka Connect S3 Sink) writes Parquet files to S3, partitioned by city/date/hour. Amazon Athena or BigQuery can query petabytes of this data for $5 per TB scanned.
Real-time path: Apache Flink consumes the Kafka stream and maintains rolling aggregations (trips per minute, revenue per hour, active driver count by city) in TimescaleDB. Dashboards query TimescaleDB with sub-second latency.

Step 7 — Validate the Design Against Requirements

Always close the loop by checking the design against your original scale targets:

50,000 concurrent active trips: The partial index on trips(status) covers only ~50k rows. A matching-engine query scanning active trips completes in microseconds.
500,000 location writes/second: Redis Cluster with 3 primary shards handles this comfortably (~167k writes/shard). Benchmark showed a 6-node cluster sustaining 800k ops/second in production at Uber.
Payment correctness: Serializable PostgreSQL transactions prevent double-charges. The ledger model makes any imbalance detectable with a SUM(amount_cents) GROUP BY account_id reconciliation query.
Analytics isolation: CDC → Kafka → S3 means no analytical query ever touches the OLTP primaries.
Failure resilience: Every tier has a replica. PostgreSQL fails over in under 30 seconds. Redis Cluster promotes a replica automatically. S3 is 11-nines durable. No single point of failure.

The complete picture of polyglot persistence in one system: this design uses five storage technologies simultaneously — PostgreSQL (OLTP, ACID), Redis (ephemeral geo cache), S3 (blob and data lake), TimescaleDB (real-time time-series), and Kafka (durable event bus). Each one does exactly what it is best at. Forcing all of this into a single PostgreSQL database would have been technically possible at small scale, but would have collapsed under the location-write load and analytics query load simultaneously. Choosing the right tool for each workload is the defining competency of a systems architect.

Summary: The Design Process

Enumerate access patterns per subsystem before naming any technology.
Match storage to pattern: relational → ACID/joins, key-value/geo → speed, blob → immutable files, column/time-series → analytics.
Model schemas with correctness first; add denormalization only where a specific read path demands it and you have a clear update contract.
Plan replication (synchronous replicas for HA) before sharding; shard only when write throughput exceeds a measured single-primary limit.
Isolate analytics from OLTP via CDC/Kafka. Never run heavy aggregates on your production primary.
Validate every design decision against the original scale targets. Traceability from requirement to decision is what makes a design defensible in an interview or an architecture review.

Previous Lesson Choosing a Database

Back to Course System Design

Back to Course

Project: Design the Data Layer

Project: Design the Data Layer

The System: a Ride-Hailing Platform

Step 1 — Identify Access Patterns Before Picking Storage

Step 2 — Assign Each Subsystem to the Right Storage Technology

Step 3 — Model the Core Schemas

Step 4 — Design the Replication and Sharding Strategy

Step 5 — Apply Denormalization Strategically

Step 6 — Handle the Analytics Pipeline Separately

Step 7 — Validate the Design Against Requirements

Summary: The Design Process

Tutorial Complete!