Databases & Storage

Choosing a Database

18 min Lesson 9 of 10

Choosing a Database

Every system design interview, every architecture review, and every greenfield project eventually arrives at the same question: which database should we use? The answer is rarely obvious, and the wrong choice can cost months of painful migration work. This lesson gives you a practical, repeatable decision framework you can apply immediately — grounded in real trade-offs and concrete examples.

Step 1 — Understand Your Data Model

Before you think about scale or availability, ask what shape your data is. The mismatch between data shape and database model is the single most common cause of painful rewrites.

  • Strongly relational data with complex joins — user accounts, orders, inventory, financial transactions. Each entity references others through foreign keys. Use a relational (SQL) database: PostgreSQL or MySQL.
  • Hierarchical or semi-structured documents — product catalogs, CMS content, user profiles with arbitrary fields. The schema varies per record. Use a document store: MongoDB, Couchbase.
  • Key-based lookups with no complex queries — session storage, feature flags, rate-limit counters, leaderboards. Use a key-value store: Redis, DynamoDB.
  • Time-ordered measurements — metrics, IoT sensor readings, application logs. Use a time-series database: InfluxDB, TimescaleDB, Prometheus.
  • Deeply connected relationships — social graphs, recommendation engines, fraud detection. Use a graph database: Neo4j, Amazon Neptune.
  • Analytical queries over large historical datasets — reporting, business intelligence, data warehousing. Use a columnar / OLAP store: Redshift, BigQuery, Snowflake, ClickHouse.
Practical rule: If you find yourself writing three-way JOINs in MongoDB aggregation pipelines, or storing JSON blobs in PostgreSQL columns to avoid schema changes, your data model and database type are mismatched.

Step 2 — Characterise Your Access Patterns

Data shape tells you the family; access patterns narrow the choice within that family. Ask:

  • Read-heavy or write-heavy? A social media feed is read-heavy (1,000 reads per write). An IoT pipeline collecting telemetry is write-heavy (100 writes per read). PostgreSQL, Cassandra, and DynamoDB all handle each direction differently.
  • What is the latency requirement? Sub-millisecond lookups (shopping cart, session) need an in-memory store like Redis. A financial report that runs once a day can tolerate seconds on Redshift.
  • What queries do you need? Full-text search, geospatial queries, graph traversal — if your query pattern is exotic, pick a database that treats it as a first-class citizen rather than bolting it on later.
  • What is the read/write ratio and the QPS? A small read replica on PostgreSQL handles 10,000 read QPS. Cassandra is designed to saturate 100,000+ write QPS across a cluster. Choosing PostgreSQL for the second workload will work — until it won't.
Database selection decision tree What is your data shape? (Pick primary pattern) Relational Document Key-Value Time-series / Graph SQL DB PostgreSQL / MySQL Document Store MongoDB / Couchbase Key-Value Redis / DynamoDB Graph / Time-Series Neo4j / InfluxDB ACID needed? Yes → Keep SQL Strong consistency Flex schema? Yes → Document Evolving fields Sub-ms latency? Yes → Cache KV Redis in-memory Traversal heavy? Yes → Graph DB Neo4j / Neptune Most systems use multiple databases — one per workload type. Polyglot persistence is the norm at scale.
Database selection decision tree: data shape → access pattern → concrete technology choice.

Step 3 — Know Your Consistency and Availability Requirements

The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. In practice, network partitions happen, so the real trade-off is CP vs AP:

  • CP systems sacrifice availability during a partition to ensure every read returns the latest write. Use for financial transactions, inventory management, anything where stale data causes real-world harm. Examples: PostgreSQL, CockroachDB, Spanner.
  • AP systems stay available during a partition and offer eventual consistency. Use for user timelines, shopping carts, view counters — places where briefly stale data is acceptable. Examples: Cassandra, DynamoDB, CouchDB.
Concrete example: A bank transfer must be CP — you cannot show Alice $500 on one replica and $0 on another. A Twitter like counter can be AP — showing 10,241 likes when the true count is 10,243 costs nothing.

Step 4 — Estimate Volume and Velocity

Raw numbers drive the decision more than any principle. Use envelope math:

  • Under 10 TB, under 10,000 QPS — a well-tuned single PostgreSQL primary with read replicas handles almost any web application. Do not add complexity you do not need.
  • 10,000–100,000 write QPS, eventual consistency acceptable — Cassandra or DynamoDB. Both are designed for this range and partition horizontally with minimal ops overhead.
  • 100+ TB analytical data, complex aggregations — a columnar OLAP warehouse: Snowflake, BigQuery, or Redshift. Row-oriented databases like PostgreSQL perform orders of magnitude worse for full-table scans.
  • Sub-millisecond read latency at any scale — Redis (in-memory). Disk-based databases, even SSDs, cannot compete on latency for hot data.

Step 5 — Factor in Operational Realities

The best database for your data model is worthless if your team cannot operate it. Ask:

  • Team expertise — a team of SQL experts will ship faster on PostgreSQL than on Cassandra even if Cassandra is technically a better fit. Expertise compounds over time.
  • Managed vs self-hosted — AWS RDS, Google Cloud SQL, PlanetScale, and Supabase remove most operational burden from relational databases. DynamoDB, Cosmos DB, and Firestore are fully managed NoSQL options. Self-hosting Cassandra or Elasticsearch requires dedicated infrastructure engineers.
  • Ecosystem and tooling — PostgreSQL has decades of tooling: pgAdmin, pg_dump, logical replication, extensions (PostGIS for geo, pgvector for vector search). MongoDB has a mature driver ecosystem. Niche databases may lack ORM support or migration tooling.
  • Cost — DynamoDB charges per read/write capacity unit. BigQuery charges per byte scanned. PostgreSQL on RDS charges per instance-hour. Model your expected workload through each pricing model before committing.
Polyglot persistence architecture for a large e-commerce platform Application Services (Order / User / Search) PostgreSQL Orders, Payments MongoDB Product Catalog Redis Sessions, Cart Cache Elasticsearch Full-text Search Redshift Analytics / BI ACID, CP Flex schema Sub-ms reads Full-text, facets Columnar OLAP Real pattern: Amazon, Shopify, and Alibaba all use polyglot persistence.
Polyglot persistence: a single e-commerce platform uses five different databases, each matched to one workload.

The Polyglot Persistence Pattern

At scale, no single database is the right answer for every workload in a system. The industry-standard solution is polyglot persistence — using the best-fit database for each distinct access pattern within the same application. Amazon's product detail page, for example, reads from a relational database for pricing/inventory, a document store for the product description, a key-value cache for the session, a search index for the search bar, and a graph database for the "customers also bought" recommendations — all within a single page load.

The cost of polyglot persistence is operational complexity: more systems to monitor, more failure modes, more data synchronization to manage. This is why startups often start with a single PostgreSQL instance — premature polyglot persistence is as harmful as premature optimization.

Common mistake: Choosing a database because it is new or trendy, rather than because it fits the workload. MongoDB earned a bad reputation in the early 2010s because developers used it for highly relational data that needed transactions. The database was not wrong — the fit was. Always let the workload drive the choice.

A Practical Decision Checklist

  1. Draw your data model — identify entities and their relationships.
  2. List your top 3 most frequent query patterns.
  3. Estimate peak QPS, data volume, and acceptable latency (p99).
  4. Decide on consistency requirements — CP or AP?
  5. Ask what your team already knows how to operate.
  6. Consider whether a managed cloud service removes operational burden.
  7. Model the cost at your expected scale before committing.
In system design interviews: Interviewers expect you to justify your database choice. "I would use PostgreSQL because the data is strongly relational, we need ACID transactions for the payment flow, the team has deep PostgreSQL expertise, and our estimated 5,000 write QPS is well within a single primary with read replicas" is a complete answer. "I would use MongoDB because it is flexible" is not.

Database selection is a multi-dimensional trade-off. There is no universally correct answer — only the answer that best fits your data shape, access patterns, consistency requirements, volume, and team. Master these five dimensions and you will rarely choose wrong.