Design a Notification System
Design a Notification System
Notifications are the pulse of every modern product. They tell a user their ride has arrived, their package has shipped, and their password reset is waiting. At scale — think 1 billion push notifications per day at Meta, or 4 billion emails per day at Gmail — the naive approach of calling a vendor SDK inline during a request is a reliability disaster. This lesson designs a production-grade, multi-channel notification service from first principles.
Phase 1 — Requirements
Functional Requirements
- Send notifications via three channels: push (iOS APNs / Android FCM), email (SendGrid / SES), and SMS (Twilio / SNS).
- Callers (other internal services) trigger notifications via an API — they specify the user, the template, the channel priority, and optional data payload.
- Support per-user preference management: a user can opt out of a channel or mute a category entirely.
- Support scheduled notifications (send at a specific time, e.g. a reminder at 9 AM in the user's local timezone).
- Provide a delivery receipt: callers can query whether a notification was delivered, failed, or is pending.
Non-Functional Requirements
- Throughput: 10 million notifications/day (~116/second average, ~1,000/second peak).
- Latency: Push and SMS must reach the vendor within 1 second of trigger. Email can be best-effort up to 30 seconds.
- Availability: 99.9% uptime — the notification service itself must not become a single point of failure for callers.
- At-least-once delivery: a failed send must be retried automatically; duplicates are tolerable (idempotent at the vendor level).
- Soft real-time analytics: delivery success/failure rates visible in an ops dashboard within 1 minute.
POST /orders) must never block waiting for a vendor to accept the message. The notification pipeline must be entirely asynchronous from the caller's perspective.
Phase 2 — Scale Estimate
- 10M notifications/day — broken down roughly as 60 % push, 30 % email, 10 % SMS.
- Push: 6 M/day = 70/s avg, 600/s peak — comfortably within a single FCM connection pool but needs fan-out logic.
- Email: 3 M/day = 35/s — SES throughput limit is 14 emails/s per account by default (raise via quota request); need multiple sending identities or a pool.
- SMS: 1 M/day = 12/s — Twilio's default rate limit is 1 message/second per phone number; you need a long-code pool or a short code at this volume.
- Storage: each notification log row ~500 B; 10 M/day × 500 B × 90 days retention ≈ 450 GB. A single Postgres instance handles this; partition by day for fast pruning.
Phase 3 — API Design
The internal trigger API is the only public surface of this service. Keep it lean:
Notice the 202 Accepted — not 200 OK. The service acknowledges receipt and hands the job to the async pipeline. The caller is never blocked waiting for a vendor response.
Phase 4 — High-Level Architecture
The system has four logical layers: an API gateway, a job queue, channel workers, and vendor integrations. An additional preference service and a notification log cut across all of them.
Phase 5 — Deep Dives and Trade-offs
1. Queue Topic Design — One Queue or Many?
Use separate topics per channel (e.g. notifications.push, notifications.email, notifications.sms). This lets you scale each consumer group independently: push workers need very high throughput (600/s peak); SMS workers are rate-limited by vendor and need throttling. A single mixed queue would force the slowest channel to block the fastest.
Also add a high-priority topic for each channel (e.g. notifications.push.high). Security alerts, OTP codes, and payment confirmations should never queue behind a marketing bulk send. Workers always drain the .high partition first (Kafka partition ordering, or SQS FIFO with message groups).
2. Preference Service — Caching is Non-Negotiable
Before any notification is enqueued, the API must check whether the user has opted out of that channel or muted that category. At 1,000 req/s peak this check would hammer the preferences DB. Cache user preferences in Redis with a 5-minute TTL. When a user updates their preferences, write through to both Postgres and Redis immediately. The 5-minute staleness window means a user who opts out might receive at most a few more notifications — acceptable, and stated in your SLA.
3. Retry and Dead-Letter Queue (DLQ)
Vendor APIs fail. FCM returns 503; SendGrid throttles. Build exponential back-off into every worker:
- Retry 1 after 10 s, retry 2 after 30 s, retry 3 after 2 min — then move to the DLQ.
- A separate DLQ consumer alerts on-call engineers, writes to the notification log with
status = failed, and optionally escalates to the next channel (push failed → try email). - Make retries idempotent: include the
notification_idas the vendor's idempotency key where the API supports it (SES, Twilio both support this).
registration_not_registered error — handle it by deleting the token from your database immediately, not on the next batch job. Otherwise you will waste capacity retrying dead tokens.
4. Template Rendering — Where Does It Happen?
Render the notification body in the worker, not in the API service. The API service receives a template_id and a data payload. The worker fetches the template (from a cache-backed template store), merges the data, and sends the rendered string to the vendor. This keeps the API service thin and fast, and lets you update templates without redeploying the API tier.
5. Analytics and Observability
Every state transition (queued → dispatched → delivered / failed) writes an event to a Kafka analytics topic. A stream-processing job (Flink or Kinesis Data Analytics) aggregates delivery rates per channel per template in 1-minute windows and writes to a time-series store (InfluxDB or CloudWatch Metrics). The ops dashboard queries this for SLA monitoring. This is also where you detect degraded vendor performance before it triggers customer complaints.
Summary — Key Design Decisions
- Async by design: callers always get 202; the queue decouples trigger from delivery.
- Per-channel topics with priority lanes: isolates throughput limits; high-priority messages are never delayed by bulk marketing sends.
- Preference cache in Redis: opt-out checks never slow the critical path; 5-minute TTL is an acceptable consistency trade-off.
- Exponential back-off + DLQ + channel escalation: ensures at-least-once delivery without infinite retry loops.
- Template rendering in the worker: keeps the API tier stateless and allows template hot-swapping.
- Analytics via stream processing: sub-minute delivery dashboards without querying the transactional DB.