NoSQL Types

The term "NoSQL" covers a broad family of databases that abandon the rigid table-and-row model of relational systems in exchange for flexibility, horizontal scalability, or specialised access patterns. Understanding the four main families — key-value, document, wide-column, and graph — is essential because each solves a different set of problems. Picking the wrong family is one of the most expensive data-layer mistakes you can make: it is not something you fix with an index or a query rewrite.

NoSQL does NOT mean "no schema". Every NoSQL store has a schema — it is just often implicit (enforced by the application) rather than declared in a DDL statement. Ignoring this leads to messy, unqueryable data at scale.

[Figure: The Four NoSQL Database Families]
  • Key-Value (Redis, DynamoDB, Memcached): O(1) lookup
  • Document (MongoDB, Firestore, CouchDB): rich queries
  • Wide-Column (Cassandra, HBase, Bigtable): time-series / write-heavy
  • Graph (Neo4j, Amazon Neptune): relationship traversal
The four NoSQL database families — each optimised for a distinct data model and access pattern.

Key-Value Stores

A key-value store is the simplest possible data model: a giant distributed hash table where every value is addressed by an opaque key. The database makes no assumptions about the structure of the value — it could be a string, a binary blob, a serialised object, or even a complex data structure (as Redis demonstrates with its native types like sorted sets and streams).

Canonical examples: Redis, Amazon DynamoDB (in its simplest usage pattern), Memcached, Riak KV.

Access pattern: GET key, SET key value, DEL key. Each of these is an O(1) hash lookup on average. You can look up by key; you cannot efficiently scan or filter by value without a secondary index.
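
A minimal in-process sketch of this access pattern, assuming a plain Python dict as a stand-in for the store. The class name TinyKV and its scan_by_value helper are hypothetical and do not correspond to any real client API; the point is only to contrast a key lookup with filtering by value.

```python
from typing import Any, Dict, List, Optional


class TinyKV:
    """In-memory stand-in for a key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value            # average O(1) hash insert

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)         # average O(1) hash lookup

    def delete(self, key: str) -> bool:
        if key in self._data:
            del self._data[key]
            return True
        return False

    def scan_by_value(self, field: str, wanted: Any) -> List[str]:
        # The store has no index on values, so every entry is inspected: O(n).
        return [k for k, v in self._data.items()
                if isinstance(v, dict) and v.get(field) == wanted]


store = TinyKV()
store.set("session:abc", {"user": "ada", "plan": "pro"})
store.set("session:def", {"user": "bob", "plan": "free"})

print(store.get("session:abc"))            # direct lookup by key
print(store.scan_by_value("plan", "pro"))  # full scan to filter by value
```

The lookup cost stays flat as the store grows, while the scan grows with the number of keys, which is why filtering by value needs a secondary index.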

Where they shine:

  • Session storage: a 64-character session token maps to a JSON user blob. Server-side reads typically complete in well under a millisecond, and a single Redis node can hold millions of sessions, bounded mainly by available memory.
  • Caching: cache the result of an expensive database query keyed by the query hash. A 95% cache hit rate means the database sees only 1 in 20 requests.
  • Rate limiting: Redis atomic increments (INCR + EXPIRE) implement fixed-window rate limiters without a transaction. A true sliding window usually needs more state, for example a sorted set of request timestamps.
  • Feature flags: a boolean or JSON blob keyed by feature:{name}:{userId}.
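
The INCR + EXPIRE pattern from the rate-limiting bullet can be sketched as a fixed-window counter. This is an in-process simulation, not Redis client code: the class FixedWindowLimiter, its parameters, and the injectable clock are assumptions made for illustration, and a dict entry with an expiry time stands in for a Redis key with a TTL.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Mimics the INCR + EXPIRE pattern with an in-process dict."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # key -> (count, expiry time); stands in for a Redis key with a TTL
        self._counters: Dict[str, Tuple[int, float]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Key is missing or expired: start a new window (first INCR + EXPIRE).
            count, expires_at = 0, now + self.window_s
        count += 1                                    # the INCR step
        self._counters[key] = (count, expires_at)
        return count <= self.limit


fake_now = [0.0]                                      # controllable clock for the demo
limiter = FixedWindowLimiter(limit=3, window_s=60.0, clock=lambda: fake_now[0])

print([limiter.allow("user:42") for _ in range(4)])   # [True, True, True, False]
fake_now[0] = 61.0                                    # the window has expired
print(limiter.allow("user:42"))                       # True
```

Against a real Redis server, INCR and EXPIRE are separate commands, so implementations commonly wrap the pair in a script or MULTI block to avoid leaving a counter without a TTL.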

Limitations: No joins, no secondary indexes by default, no rich queries. If you later need to ask "give me all users whose plan is 'pro'", a pure key-value store forces a full scan or requires you to maintain your own inverted index.
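
Maintaining your own inverted index might look roughly like the sketch below. It assumes two in-memory dicts in place of real key-value namespaces; put_user and users_on_plan are hypothetical helper names. Note that every write now has to update two structures, and in a real store those two updates would not be atomic unless the store offers transactions.

```python
from collections import defaultdict
from typing import Dict, Set

users: Dict[str, dict] = {}                          # primary data: user key -> blob
plan_index: Dict[str, Set[str]] = defaultdict(set)   # hand-rolled index: plan -> user keys


def put_user(key: str, record: dict) -> None:
    old = users.get(key)
    if old is not None:
        plan_index[old["plan"]].discard(key)         # drop the stale index entry
    users[key] = record
    plan_index[record["plan"]].add(key)              # index the new value


def users_on_plan(plan: str) -> Set[str]:
    # Answered from the index, without scanning every user blob.
    return set(plan_index.get(plan, set()))


put_user("user:1", {"name": "Ada", "plan": "pro"})
put_user("user:2", {"name": "Bob", "plan": "free"})
put_user("user:2", {"name": "Bob", "plan": "pro"})   # Bob upgrades

print(sorted(users_on_plan("pro")))                  # ['user:1', 'user:2']
print(sorted(users_on_plan("free")))                 # []
```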

DynamoDB is not just key-value. Amazon DynamoDB supports a composite key (partition key + sort key) and secondary indexes, making it more of a wide-column store in practice. It is common to use DynamoDB as both a key-value store and a lightweight document store in the same application.

Document Stores

A document store persists semi-structured documents — most commonly JSON or BSON objects — and lets you query and index arbitrary fields within those documents. Unlike a key-value store, the database understands the internal structure of the value.

Canonical examples: MongoDB, Google Firestore, CouchDB, Amazon DocumentDB.

Data model: A document is a self-contained unit. An e-commerce order document might embed the line items, shipping address, and payment method all in one object rather than spreading them across five normalised tables. This is denormalisation by design: you optimise for the common read path at the cost of update complexity.
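
As an illustration of the embedded order document, here is a sketch using plain Python dicts. The field names, the sample orders, and the helpers get_path, find, and order_total are all invented for this example; a real document store would evaluate the field query inside the database, usually with an index.

```python
from typing import Any, Dict, List

orders: List[Dict[str, Any]] = [
    {
        "_id": "o1001",
        "customer": {"id": 7, "name": "Ada"},
        "shipping_address": {"city": "London"},
        "payment": {"method": "card"},
        "items": [
            {"sku": "TV-55", "qty": 1, "price": 499.0},
            {"sku": "HDMI-2M", "qty": 2, "price": 9.5},
        ],
    },
    {
        "_id": "o1002",
        "customer": {"id": 8, "name": "Bob"},
        "shipping_address": {"city": "Leeds"},
        "payment": {"method": "voucher"},
        "items": [{"sku": "TSHIRT-M", "qty": 3, "price": 12.0}],
        "gift_note": "Happy birthday",     # extra field: documents may differ in shape
    },
]


def get_path(doc: Dict[str, Any], dotted: str) -> Any:
    """Resolve a dotted path such as 'shipping_address.city'."""
    current: Any = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: List[Dict[str, Any]], dotted: str, value: Any) -> List[Dict[str, Any]]:
    # Query on a nested field; the store "understands" the value's structure.
    return [d for d in collection if get_path(d, dotted) == value]


def order_total(doc: Dict[str, Any]) -> float:
    # Everything needed is inside the one document, so no join is required.
    return sum(item["qty"] * item["price"] for item in doc["items"])


london = find(orders, "shipping_address.city", "London")
print([d["_id"] for d in london])   # ['o1001']
print(order_total(london[0]))       # 518.0
```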

Where they shine:

  • Content management — a blog post has a title, body, tags (array), author (embedded object), and an arbitrary set of metadata fields. The schema varies per post type. A document store handles this naturally; a relational schema requires nullable columns or EAV tables.
  • Product catalogs — a TV has different attributes than a t-shirt. Storing products as documents avoids a 200-column table where 180 columns are NULL for any given row.
  • User profiles — preferences, social links, and activity history that differ between user types.
  • Event logs — each event type has a different payload shape.

Limitations: Joins across collections are expensive. They are done either in the application layer or via the $lookup aggregation stage (MongoDB), which can be slow at scale. Consistency guarantees vary: MongoDB has supported multi-document ACID transactions since v4.0 (v4.2 for sharded clusters), but they carry a significant performance cost. Strong consistency across shards is still harder to achieve than in a relational system.

The embedding vs. referencing trap: Embedding related data (e.g. comments inside a post document) gives fast reads but creates unbounded document growth. A popular post with 50,000 comments becomes a huge document that is expensive to load and rewrite, and in MongoDB it can run into the 16 MB document size limit. Reference by ID and fetch separately when the nested collection is large or updated frequently.
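
The referencing alternative can be sketched as follows, assuming two in-memory collections. The names posts, comments, add_comment, and comments_page are hypothetical. The post document keeps only a small counter, while comments live in their own collection and are fetched a page at a time.

```python
from typing import Any, Dict, List

# The post stays small no matter how many comments it attracts.
posts: Dict[str, Dict[str, Any]] = {
    "p1": {"title": "Getting Started", "comment_count": 0},
}
# Separate collection; each comment references its post by ID.
comments: List[Dict[str, Any]] = []


def add_comment(post_id: str, author: str, text: str) -> None:
    comments.append({"post_id": post_id, "author": author, "text": text})
    posts[post_id]["comment_count"] += 1     # small, bounded update to the post


def comments_page(post_id: str, page: int, page_size: int = 2) -> List[Dict[str, Any]]:
    # A real store would serve this from an index on post_id instead of a scan.
    matching = [c for c in comments if c["post_id"] == post_id]
    start = page * page_size
    return matching[start:start + page_size]


for i in range(5):
    add_comment("p1", f"user{i}", f"comment {i}")

print(posts["p1"])                                       # title plus a counter only
print([c["text"] for c in comments_page("p1", page=1)])  # ['comment 2', 'comment 3']
```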

Wide-Column Stores

Wide-column stores organise data as a map of rows, where each row can have a different set of columns and columns are grouped into column families. The physical storage layout — storing all values for a column family together on disk — makes sequential reads of a column family extremely fast, even across billions of rows.

Canonical examples: Apache Cassandra, Google Bigtable, Apache HBase, ScyllaDB.

Data model: Think of a giant sparse table. Each row has a primary key. Within a row you can have thousands of columns, and different rows can have completely different columns. In Cassandra, a partition key determines which node owns the data, and a clustering key controls the sort order within a partition — this makes range scans within a partition very fast.
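
A toy model of the partition key / clustering key split, assuming ISO-style timestamp strings as clustering keys so that lexical order matches time order. PartitionedTable and its methods are illustrative names, and the CRC-based placement is a simplification standing in for a real token ring.

```python
import bisect
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class PartitionedTable:
    def __init__(self, num_nodes: int = 3) -> None:
        self.num_nodes = num_nodes
        self._keys: Dict[str, List[str]] = defaultdict(list)        # sorted clustering keys
        self._rows: Dict[str, Dict[str, dict]] = defaultdict(dict)  # clustering key -> columns

    def node_for(self, partition_key: str) -> int:
        # Toy placement: the partition key alone decides the owning node.
        return zlib.crc32(partition_key.encode()) % self.num_nodes

    def insert(self, partition_key: str, clustering_key: str, columns: dict) -> None:
        if clustering_key not in self._rows[partition_key]:
            bisect.insort(self._keys[partition_key], clustering_key)  # keep sort order
        self._rows[partition_key][clustering_key] = columns           # columns vary per row

    def range_scan(self, partition_key: str, start: str, end: str) -> List[Tuple[str, dict]]:
        keys = self._keys.get(partition_key, [])
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        # One contiguous slice of a single partition: no other partition is touched.
        return [(k, self._rows[partition_key][k]) for k in keys[lo:hi]]


table = PartitionedTable()
table.insert("sensor_001", "2024-01-01T09:01", {"temp": 22.3, "hum": 55})
table.insert("sensor_001", "2024-01-01T09:00", {"temp": 22.1})
table.insert("sensor_002", "2024-01-01T09:00", {"temp": 18.5, "psi": 1.2})

for row in table.range_scan("sensor_001", "2024-01-01T09:00", "2024-01-01T09:59"):
    print(row)
```

Rows come back ordered by clustering key even though they were inserted out of order, which is the property that makes in-partition range scans cheap.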

Where they shine:

  • Time-series data: sensor readings, stock prices, application metrics. Model the partition key as the device/metric ID and the clustering key as the timestamp. A query such as "all readings for sensor X in the last hour" then becomes a range scan over contiguous, already-sorted rows in a single partition, which is typically answered in a few milliseconds.
  • Write-heavy workloads: Cassandra uses an LSM-tree storage engine. Each write is appended to a commit log and applied to an in-memory memtable, so the write path involves only sequential disk I/O. This lets a node sustain tens to hundreds of thousands of writes per second, depending on hardware and payload size.
  • Messaging inboxes: Facebook's Messages platform (launched in 2010) was built on HBase. Partition by user ID, cluster by message timestamp.
  • IoT at scale: billions of events per day from millions of devices.
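
The LSM-tree write path mentioned above can be sketched in a few lines. This is a heavily simplified model, not Cassandra's implementation: TinyLSM, its memtable_limit parameter, and the in-memory lists standing in for the commit log and on-disk SSTables are all assumptions, and compaction, tombstones, and bloom filters are left out.

```python
import bisect
from typing import Any, Dict, List, Optional, Tuple


class TinyLSM:
    def __init__(self, memtable_limit: int = 3) -> None:
        self.memtable_limit = memtable_limit
        self.commit_log: List[Tuple[str, Any]] = []      # append-only; stands in for a disk log
        self.memtable: Dict[str, Any] = {}
        self.sstables: List[List[Tuple[str, Any]]] = []  # immutable sorted runs, newest last

    def put(self, key: str, value: Any) -> None:
        self.commit_log.append((key, value))   # 1. sequential append for durability
        self.memtable[key] = value             # 2. in-memory update, no disk seek
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Write the memtable out as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()                # flushed entries no longer need replay

    def get(self, key: str) -> Optional[Any]:
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest run wins
            # (key,) sorts just before (key, value), so this finds the key's slot.
            idx = bisect.bisect_left(run, (key,))
            if idx < len(run) and run[idx][0] == key:
                return run[idx][1]
        return None


db = TinyLSM(memtable_limit=3)
for k, v in [("c", 3), ("a", 1), ("b", 2)]:
    db.put(k, v)                               # third put triggers a flush
db.put("a", 100)                               # newer value lives in the memtable

print(db.sstables)                             # [[('a', 1), ('b', 2), ('c', 3)]]
print(db.get("a"), db.get("b"), db.get("zzz")) # 100 2 None
```

Writes never modify existing runs, which is why they are cheap; the trade-off is that a read may need to consult the memtable and several runs.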

Limitations: Your data model must be designed around your queries, not around your entities. Adding a new query pattern often requires a new table (denormalised copy of the data). There is no flexible ad-hoc querying like SQL. Cassandra has no joins and limited aggregation support.
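
A small sketch of what designing around queries means in practice, assuming two query patterns and one in-memory table per pattern. The table names and the record_reading helper are hypothetical. Each write goes to every table that serves a query, so storage and write cost grow with the number of query patterns.

```python
from collections import defaultdict
from typing import Dict, List

# Query 1: "readings for one sensor"    -> partitioned by sensor ID
readings_by_sensor: Dict[str, List[dict]] = defaultdict(list)
# Query 2: "all readings on a given day" -> partitioned by date
readings_by_day: Dict[str, List[dict]] = defaultdict(list)


def record_reading(sensor_id: str, timestamp: str, temp: float) -> None:
    row = {"sensor_id": sensor_id, "ts": timestamp, "temp": temp}
    # One logical write fans out to a denormalised copy per query pattern.
    readings_by_sensor[sensor_id].append(row)
    readings_by_day[timestamp[:10]].append(row)    # 'YYYY-MM-DD' prefix of the timestamp


record_reading("sensor_001", "2024-01-01T09:00", 22.1)
record_reading("sensor_002", "2024-01-01T09:00", 18.5)
record_reading("sensor_001", "2024-01-02T09:00", 21.7)

print(len(readings_by_sensor["sensor_001"]))   # 2
print(len(readings_by_day["2024-01-01"]))      # 2
```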

[Figure: Wide-Column vs Document Data Model Comparison]

Wide-Column (Cassandra):

  partition_key | clustering_key   | columns
  sensor_001    | 2024-01-01 09:00 | temp=22.1            (same partition)
  sensor_001    | 2024-01-01 09:01 | temp=22.3, hum=55    (same partition)
  sensor_002    | 2024-01-01 09:00 | temp=18.5, psi=1.2

Fast range scan within partition. Each row can have different columns. Optimised for write-heavy workloads.

Document Store (MongoDB):

  { _id: "p001", title: "Getting Started", tags: ["intro","beginner"], author: { name:"Ada", id:7 } }   (nested object, variable fields)
  { _id: "p002", title: "Advanced Topics", videoUrl: "https://..." }   (different shape, no tags/author)

Flexible schema, rich queries.

Wide-column stores model data around query access patterns; document stores allow flexible per-document schemas.

Graph Databases

Graph databases are built to store and traverse relationships as first-class citizens. The data model consists of nodes (entities) and edges (relationships between entities), each of which can carry arbitrary properties. In a relational database, a JOIN is computed at query time by matching foreign keys, usually through an index lookup per hop. A native graph database instead stores direct pointers between connected nodes, so the cost of a multi-hop traversal such as "find all friends of friends who live in the same city and share at least two skills" depends on how much of the graph the query touches rather than on the total size of the dataset.
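
A two-hop traversal over an adjacency structure can be sketched like this. The sample people, the friends and city dicts, and the function names are invented for illustration; the sets of neighbours play the role of the stored pointers between nodes.

```python
from typing import Dict, Set

# Each node holds direct references to its neighbours (undirected friendships).
friends: Dict[str, Set[str]] = {
    "ada": {"bob", "cy"},
    "bob": {"ada", "dee"},
    "cy": {"ada", "dee", "eve"},
    "dee": {"bob", "cy"},
    "eve": {"cy"},
}
city: Dict[str, str] = {
    "ada": "London", "bob": "Paris", "cy": "London", "dee": "London", "eve": "Paris",
}


def friends_of_friends(person: str) -> Set[str]:
    direct = friends.get(person, set())
    second_hop: Set[str] = set()
    for friend in direct:                       # hop 1
        second_hop |= friends.get(friend, set())  # hop 2
    # Exclude the person themselves and anyone who is already a direct friend.
    return second_hop - direct - {person}


def fof_in_same_city(person: str) -> Set[str]:
    return {p for p in friends_of_friends(person) if city[p] == city[person]}


print(sorted(friends_of_friends("ada")))   # ['dee', 'eve']
print(sorted(fof_in_same_city("ada")))     # ['dee']
```

The work done is proportional to the neighbourhoods visited around "ada", not to the number of people in the whole graph.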

Canonical examples: Neo4j, Amazon Neptune, JanusGraph, TigerGraph.

Where they shine:

  • Social networks: friend recommendations, degree-of-separation queries, mutual connections. LinkedIn uses graph technology to power "People You May Know".
  • Fraud detection: a fraudster reuses an address, device, or phone number across multiple accounts. Graph traversals detect these rings in milliseconds; SQL self-joins struggle at the same depth.
  • Knowledge graphs: Google's Knowledge Graph, Wikimedia's Wikidata. Entities and their typed relationships form the backbone of semantic search.
  • Recommendation engines: "users who bought X also bought Y" modelled as a bipartite graph of users and products.
  • Access control / IAM: hierarchical role inheritance in complex organisations.

Limitations: Graph databases do not scale horizontally as easily as wide-column or key-value stores because partitioning a highly connected graph across nodes introduces expensive cross-shard edge traversals. They are also poor fits for aggregate analytics (count/sum/group-by) over large datasets — use a data warehouse for that.

Do not reach for a graph DB prematurely. A relational database with proper foreign keys and a recursive CTE handles most graph-like queries up to hundreds of millions of edges. Switch to a dedicated graph database when you are doing 4+ hop traversals over billions of edges in latency-sensitive paths.
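
The relational alternative can be tried with nothing more than SQLite, which supports recursive CTEs. The sketch below uses Python's built-in sqlite3 module; the follows table, its sample rows, and the reachable function are assumptions made up for the example.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        PRIMARY KEY (src, dst)
    );
    INSERT INTO follows VALUES
        ('ada', 'bob'), ('bob', 'cy'), ('cy', 'dee'), ('dee', 'eve'), ('cy', 'ada');
""")


def reachable(start: str, max_hops: int) -> List[Tuple[str, int]]:
    """Nodes reachable from `start` within `max_hops`, with their shortest hop count."""
    return conn.execute(
        """
        WITH RECURSIVE walk(node, hops) AS (
            SELECT ?, 0
            UNION
            SELECT f.dst, w.hops + 1
            FROM walk AS w
            JOIN follows AS f ON f.src = w.node
            WHERE w.hops < ?          -- hop limit also guarantees termination on cycles
        )
        SELECT node, MIN(hops) AS min_hops
        FROM walk
        WHERE node <> ?
        GROUP BY node
        ORDER BY min_hops, node
        """,
        (start, max_hops, start),
    ).fetchall()


print(reachable("ada", 3))   # [('bob', 1), ('cy', 2), ('dee', 3)]
```

Each additional hop is another join against the edge table, which is the cost that grows as traversals get deeper.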

Side-by-Side Summary

The list below condenses the four families into a quick reference for system design decisions:

  • Key-Value: Model = opaque blob per key. Latency = sub-millisecond. Scale = horizontal (consistent hashing). Best for caching, sessions, feature flags.
  • Document: Model = JSON/BSON documents with indexed fields. Latency = single-digit milliseconds. Scale = horizontal sharding. Best for content, catalogs, flexible schemas.
  • Wide-Column: Model = sparse table, rows with variable column sets. Latency = single-digit ms writes, fast range scans. Scale = linear horizontal. Best for time-series, IoT, write-heavy.
  • Graph: Model = nodes + edges with properties. Latency = fast multi-hop traversal. Scale = limited horizontal. Best for social graphs, fraud, recommendations.
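
The consistent hashing mentioned for key-value scaling can be sketched as a hash ring with virtual nodes. HashRing, the vnodes parameter, the node names, and the use of MD5 as the token function are illustrative assumptions, not how any particular store implements it.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _token(value: str) -> int:
    # Stable 64-bit position on the ring (MD5 used only for its spread, not security).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes
        self._tokens: List[int] = []        # sorted ring positions
        self._owner: Dict[int, str] = {}    # ring position -> node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):        # several positions per node smooth the load
            token = _token(f"{node}#{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's token, wrapping at the end.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"session:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys went to node-d:", all(after[k] == "node-d" for k in moved))
```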

Polyglot persistence is the practice of using multiple NoSQL (and SQL) stores in the same application, each chosen for the access pattern it handles best. A typical large-scale system might use PostgreSQL for transactional data, Redis for caching and sessions, Cassandra for event logs, and Neo4j for the recommendation graph. This is normal and encouraged, but every additional store adds operational overhead: another system to deploy, monitor, back up, and keep consistent with the others.