Real-World System Design Case Studies

Design a Notification System

18 min Lesson 8 of 10

Design a Notification System

Notifications are the pulse of every modern product. They tell a user their ride has arrived, their package has shipped, and their password reset is waiting. At scale — think 1 billion push notifications per day at Meta, or 4 billion emails per day at Gmail — the naive approach of calling a vendor SDK inline during a request is a reliability disaster. This lesson designs a production-grade, multi-channel notification service from first principles.

Phase 1 — Requirements

Functional Requirements

Send notifications via three channels: push (iOS APNs / Android FCM), email (SendGrid / SES), and SMS (Twilio / SNS).
Callers (other internal services) trigger notifications via an API — they specify the user, the template, the channel priority, and optional data payload.
Support per-user preference management: a user can opt out of a channel or mute a category entirely.
Support scheduled notifications (send at a specific time, e.g. a reminder at 9 AM in the user's local timezone).
Provide a delivery receipt: callers can query whether a notification was delivered, failed, or is pending.

Non-Functional Requirements

Throughput: 10 million notifications/day (~116/second average, ~1,000/second peak).
Latency: Push and SMS must reach the vendor within 1 second of trigger. Email can be best-effort up to 30 seconds.
Availability: 99.9% uptime — the notification service itself must not become a single point of failure for callers.
At-least-once delivery: a failed send must be retried automatically; duplicates are tolerable (idempotent at the vendor level).
Soft real-time analytics: delivery success/failure rates visible in an ops dashboard within 1 minute.

Key constraint: Notifications are a side-effect of business logic, not the primary response. The API that triggers a notification (e.g. POST /orders) must never block waiting for a vendor to accept the message. The notification pipeline must be entirely asynchronous from the caller's perspective.

Phase 2 — Scale Estimate

10M notifications/day — broken down roughly as 60 % push, 30 % email, 10 % SMS.
Push: 6 M/day = 70/s avg, 600/s peak — comfortably within a single FCM connection pool but needs fan-out logic.
Email: 3 M/day = 35/s — SES throughput limit is 14 emails/s per account by default (raise via quota request); need multiple sending identities or a pool.
SMS: 1 M/day = 12/s — Twilio's default rate limit is 1 message/second per phone number; you need a long-code pool or a short code at this volume.
Storage: each notification log row ~500 B; 10 M/day × 500 B × 90 days retention ≈ 450 GB. A single Postgres instance handles this; partition by day for fast pruning.

Phase 3 — API Design

The internal trigger API is the only public surface of this service. Keep it lean:

POST /v1/notifications
Authorization: Bearer <service-token>
{
  "user_id": "u_8821",
  "template_id": "order_shipped",
  "channels": ["push", "email"],          // ordered preference; fallback if first fails
  "priority": "high",                      // high | normal | low
  "data": { "order_id": "O-9912", "eta": "2 hours" },
  "scheduled_at": null                     // ISO-8601 or null for immediate
}

Response 202 Accepted:
{ "notification_id": "n_7fa3b2", "status": "queued" }

GET /v1/notifications/{notification_id}
Response: { "status": "delivered|failed|pending", "channel_results": [...] }

Notice the 202 Accepted — not 200 OK. The service acknowledges receipt and hands the job to the async pipeline. The caller is never blocked waiting for a vendor response.

Phase 4 — High-Level Architecture

The system has four logical layers: an API gateway, a job queue, channel workers, and vendor integrations. An additional preference service and a notification log cut across all of them.

Multi-channel notification architecture: callers enqueue jobs via the API; per-channel workers consume jobs and call vendor APIs; results are written to the notification log.

Phase 5 — Deep Dives and Trade-offs

1. Queue Topic Design — One Queue or Many?

Use separate topics per channel (e.g. notifications.push, notifications.email, notifications.sms). This lets you scale each consumer group independently: push workers need very high throughput (600/s peak); SMS workers are rate-limited by vendor and need throttling. A single mixed queue would force the slowest channel to block the fastest.

Also add a high-priority topic for each channel (e.g. notifications.push.high). Security alerts, OTP codes, and payment confirmations should never queue behind a marketing bulk send. Workers always drain the .high partition first (Kafka partition ordering, or SQS FIFO with message groups).

2. Preference Service — Caching is Non-Negotiable

Before any notification is enqueued, the API must check whether the user has opted out of that channel or muted that category. At 1,000 req/s peak this check would hammer the preferences DB. Cache user preferences in Redis with a 5-minute TTL. When a user updates their preferences, write through to both Postgres and Redis immediately. The 5-minute staleness window means a user who opts out might receive at most a few more notifications — acceptable, and stated in your SLA.

Preference check flow: Redis is checked first; a cache miss falls back to Postgres and populates the cache. Opted-out users are silently suppressed before anything hits the queue.

3. Retry and Dead-Letter Queue (DLQ)

Vendor APIs fail. FCM returns 503; SendGrid throttles. Build exponential back-off into every worker:

Retry 1 after 10 s, retry 2 after 30 s, retry 3 after 2 min — then move to the DLQ.
A separate DLQ consumer alerts on-call engineers, writes to the notification log with status = failed, and optionally escalates to the next channel (push failed → try email).
Make retries idempotent: include the notification_id as the vendor's idempotency key where the API supports it (SES, Twilio both support this).

Best practice: Add a device-token staleness check before sending push. Device tokens go stale when a user uninstalls the app. FCM returns a registration_not_registered error — handle it by deleting the token from your database immediately, not on the next batch job. Otherwise you will waste capacity retrying dead tokens.

4. Template Rendering — Where Does It Happen?

Render the notification body in the worker, not in the API service. The API service receives a template_id and a data payload. The worker fetches the template (from a cache-backed template store), merges the data, and sends the rendered string to the vendor. This keeps the API service thin and fast, and lets you update templates without redeploying the API tier.

5. Analytics and Observability

Every state transition (queued → dispatched → delivered / failed) writes an event to a Kafka analytics topic. A stream-processing job (Flink or Kinesis Data Analytics) aggregates delivery rates per channel per template in 1-minute windows and writes to a time-series store (InfluxDB or CloudWatch Metrics). The ops dashboard queries this for SLA monitoring. This is also where you detect degraded vendor performance before it triggers customer complaints.

Pitfall — fanout to inactive devices: If your push topic has 10 million registered device tokens but 40 % of users have not opened the app in 6 months, you are burning queue capacity on tokens that will silently fail. Run a nightly cleanup job: query FCM / APNs feedback endpoints for unregistered tokens and prune them. This can reduce push volume by 30–50 % and dramatically improve your delivery ratio metric.

Summary — Key Design Decisions

Async by design: callers always get 202; the queue decouples trigger from delivery.
Per-channel topics with priority lanes: isolates throughput limits; high-priority messages are never delayed by bulk marketing sends.
Preference cache in Redis: opt-out checks never slow the critical path; 5-minute TTL is an acceptable consistency trade-off.
Exponential back-off + DLQ + channel escalation: ensures at-least-once delivery without infinite retry loops.
Template rendering in the worker: keeps the API tier stateless and allows template hot-swapping.
Analytics via stream processing: sub-minute delivery dashboards without querying the transactional DB.