Networking & Communication

Project: Design the Communication Layer

18 min Lesson 10 of 10

You have spent the past nine lessons building a vocabulary: IP routing, DNS, HTTP/HTTPS, TCP vs UDP, REST, gRPC, WebSockets, API Gateways, SSE, and long polling. This capstone lesson puts all of it to work. You will work through a concrete system — a real-time collaborative productivity app (think Notion or Figma) — and make every protocol and API decision from scratch, with explicit trade-off reasoning for each choice.

Goal of this lesson: develop the habit of treating protocol selection as an engineering decision, not an implementation detail. For every boundary in your architecture, ask: What data moves here? How often? How latency-sensitive? What failure modes matter? Your answers drive the choice.

The System We Are Designing

Product brief: A collaborative document editor where up to 500 users can co-edit the same workspace in real time. Features include live cursors, instant text sync, comments, file attachments, and a notification feed. The system must support 10 million registered users with peak concurrent editors around 50,000.

We will design the communication layer — not the storage engine or the CDN — just the protocols, API shapes, and message flows that connect clients to servers and services to services.

Step 1 — Map Every Communication Boundary

Before picking any protocol, enumerate what actually communicates. In our system there are five distinct boundaries:

  1. Browser ↔ API Gateway — user actions (create doc, invite collaborator)
  2. Browser ↔ Collaboration Service — real-time edits and cursor positions
  3. Browser ↔ Notification Service — feed updates (comment added, mention)
  4. API Gateway ↔ Internal Microservices — service-to-service calls (auth, search, billing)
  5. Services ↔ Message Bus — async event fan-out (doc saved → trigger indexer, notify subscribers)

Mapping boundaries first prevents the common mistake of choosing one protocol for everything and then retrofitting it to cases where it fits poorly.

Step 2 — Apply Decision Criteria to Each Boundary

Boundary 1 — Browser to API Gateway (CRUD Actions)

Data: create/read/update/delete documents, manage users, upload metadata. Pattern: request-response, driven by explicit user action. Latency target: 200–500 ms is acceptable. Choice: REST over HTTPS.

REST wins here because: resources map cleanly to URL paths (/docs/{id}, /workspaces/{id}/members); HTTP caching (ETags, Cache-Control) reduces read load; browser fetch() handles it natively; and error semantics (400, 401, 404, 409, 422) are well understood by every frontend developer. Use JSON request/response bodies. Paginate large collections with cursor-based pagination (?after=cursor&limit=50) rather than offset pagination, which can skip or repeat items when rows are inserted or deleted between page requests.
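
The difference between the two pagination styles can be sketched in a few lines. This is a minimal in-memory illustration, not the API's actual implementation: the Doc type, the function names, and the choice of a numeric id as the cursor are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    id: int      # assumed to increase with creation time, so it can act as the cursor
    title: str


def page_by_offset(docs: List[Doc], offset: int, limit: int) -> List[Doc]:
    """Offset pagination: a page is defined by how many rows precede it."""
    ordered = sorted(docs, key=lambda d: d.id, reverse=True)  # newest first
    return ordered[offset:offset + limit]


def page_by_cursor(
    docs: List[Doc], after: Optional[int], limit: int
) -> Tuple[List[Doc], Optional[int]]:
    """Cursor pagination: a page is defined by the last id the client saw."""
    ordered = sorted(docs, key=lambda d: d.id, reverse=True)  # newest first
    if after is not None:
        ordered = [d for d in ordered if d.id < after]
    page = ordered[:limit]
    next_cursor = page[-1].id if len(page) == limit else None
    return page, next_cursor


docs = [Doc(i, f"doc-{i}") for i in range(1, 7)]  # ids 1..6

# The client reads page 1 both ways and gets ids 6, 5, 4.
offset_page1 = page_by_offset(docs, offset=0, limit=3)
cursor_page1, cursor = page_by_cursor(docs, after=None, limit=3)

# A concurrent write lands before the client requests page 2.
docs.append(Doc(7, "doc-7"))

offset_page2 = page_by_offset(docs, offset=3, limit=3)           # id 4 appears again
cursor_page2, _ = page_by_cursor(docs, after=cursor, limit=3)    # continues at id 3
```

With offset pagination the new row shifts every later row down by one, so the client sees id 4 twice. The cursor version is anchored to a value rather than a position, so the concurrent insert does not disturb it.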

Version your REST API from day one. Use a URL prefix (/v1/) not a header. Headers are invisible in browser URLs and harder to route in API gateways. When you break a contract, increment the version — do not silently change responses.

Boundary 2 — Browser to Collaboration Service (Real-Time Edits)

Data: operational transforms or CRDTs — tiny delta operations like "insert character X at position 47", dozens per second per active user. Pattern: bidirectional, continuous, low latency, high frequency. Choice: WebSockets.

This is the textbook WebSocket use case. The client and server both need to push at any moment; polling would add unacceptable latency (even 1-second polling means up to 1 s of lag before a co-editor sees your cursor move). HTTP/2 server push only preloads resources into the browser cache; it is not a message channel. SSE is server-to-client only. A single WebSocket connection per tab carries all document operations, cursor events, and presence signals with sub-100 ms round-trip latency on a good connection.

Use a compact binary message format — MessagePack or a small custom schema — rather than JSON. At 50,000 concurrent editors generating 20 operations per second each, JSON verbosity adds real CPU and bandwidth cost: a MessagePack delta might be 12 bytes where the equivalent JSON is 60 bytes.
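
To make the cost concrete: 50,000 editors × 20 operations per second is 1,000,000 operations per second, so 60-byte payloads amount to roughly 60 MB/s of inbound operation data while 12-byte payloads amount to roughly 12 MB/s (before framing overhead and fan-out to co-editors). The sketch below uses Python's standard struct module as a stand-in for "a small custom schema"; the field layout and function names are hypothetical, and MessagePack itself would need a third-party library.

```python
import json
import struct

# Hypothetical wire layout for one edit operation (little-endian, no padding):
#   op code (1 byte) | position (4 bytes) | sequence (4 bytes) | code point (4 bytes)
OP_LAYOUT = struct.Struct("<BIII")
OP_INSERT = 1


def encode_binary(pos: int, seq: int, char: str) -> bytes:
    """Pack an insert operation into a fixed 13-byte record."""
    return OP_LAYOUT.pack(OP_INSERT, pos, seq, ord(char))


def decode_binary(payload: bytes) -> dict:
    """Unpack a record produced by encode_binary."""
    op, pos, seq, code_point = OP_LAYOUT.unpack(payload)
    return {"op": op, "pos": pos, "seq": seq, "char": chr(code_point)}


def encode_json(pos: int, seq: int, char: str) -> bytes:
    """The same operation as compact JSON, for comparison."""
    body = {"op": "insert", "pos": pos, "seq": seq, "char": char}
    return json.dumps(body, separators=(",", ":")).encode("utf-8")


def bandwidth_mb_per_s(bytes_per_op: int, editors: int = 50_000, ops_per_editor: int = 20) -> float:
    """Inbound payload bandwidth in MB/s, ignoring framing and fan-out."""
    return bytes_per_op * editors * ops_per_editor / 1_000_000


binary_msg = encode_binary(pos=47, seq=1482, char="X")
json_msg = encode_json(pos=47, seq=1482, char="X")
print(len(binary_msg), len(json_msg))
print(bandwidth_mb_per_s(12), bandwidth_mb_per_s(60))
```

A fixed layout like this trades flexibility for size: adding a field means versioning the record, which is one reason a self-describing format such as MessagePack can be the more practical choice.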

Boundary 3 — Browser to Notification Service (Feed Updates)

Data: events like "Ana commented on your doc", "Your export is ready". Pattern: server pushes occasionally; client never pushes. Latency target: 5–30 seconds is acceptable. Choice: SSE (Server-Sent Events).

SSE is simpler than WebSockets for unidirectional push: a plain HTTP response that stays open and sends text/event-stream chunks. The browser's native EventSource API reconnects automatically on disconnect. You avoid the WebSocket upgrade handshake, and over HTTP/2 the stream is multiplexed onto the same connection as REST calls to the same origin, consuming fewer file descriptors on the server. Reserve WebSockets for boundaries that require bidirectional flow.
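
The text/event-stream wire format is simple enough to build by hand. The helper below is a minimal sketch of how a notification service might serialise one event; the function name and the "comment" event type are assumptions for illustration, while the field names (event, id, data) and the blank-line terminator come from the SSE format itself.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialise one Server-Sent Event as a text/event-stream chunk."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")   # lets the client listen per event type
    if event_id is not None:
        lines.append(f"id: {event_id}")   # echoed back as Last-Event-ID on reconnect
    for chunk in data.split("\n"):        # multi-line payloads need one data: line each
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"      # a blank line terminates the event


chunk = format_sse("Ana commented on your doc", event="comment", event_id="42")
print(chunk, end="")
```

Setting an id on every event is what makes the automatic reconnect useful: the browser sends the last id it saw in the Last-Event-ID header, so the server can resume from that point.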

Boundary 4 — API Gateway to Internal Microservices

Data: structured RPC calls — "validate this auth token", "search documents for this user", "charge this subscription". Pattern: internal, synchronous, latency-sensitive, schema-strict. Choice: gRPC.

gRPC over HTTP/2 serialises with Protocol Buffers (typically 3–10× smaller than JSON, faster to encode/decode) and gives you strongly typed contracts enforced by generated code. Internal services trust each other's schemas; you do not need REST's loose JSON tolerance. gRPC also supports streaming natively (server, client, and bidirectional), which is useful if the search service streams partial results back to the gateway. As an illustrative figure, latency for a token-validation call might drop from ~2 ms (REST/JSON) to ~0.3 ms (gRPC/Protobuf) at moderate load; measure in your own environment, but any saving matters when the call sits in the critical path of every request.

Boundary 5 — Services to Message Bus (Async Events)

Data: domain events — DocumentSaved, CommentCreated, UserInvited. Pattern: fan-out, decoupled producers and consumers, at-least-once delivery. Choice: Async messaging (Kafka or similar) with a defined event schema.

This boundary is not HTTP at all. When a document is saved, the collaboration service should not make synchronous HTTP calls to the indexer, the notifier, and the audit logger — that tight coupling slows down the save and cascades failures. Instead, publish one event to a topic; each downstream consumer reads independently. Use Avro or Protobuf schemas registered in a schema registry to prevent producers and consumers from drifting out of sync.
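
The publish-once, consume-independently pattern can be sketched with an in-memory log. This is a toy stand-in for a single topic partition, not a Kafka client: the Topic class, its method names, and the consumer group names are hypothetical, and offsets are committed on every poll purely to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List


class Topic:
    """In-memory stand-in for one partition of a log-based topic."""

    def __init__(self) -> None:
        self._log: List[dict] = []
        # Each consumer group tracks its own read position in the shared log.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, event: dict) -> int:
        """Append an event and return its offset; the producer never waits on consumers."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, group: str, max_events: int = 10) -> List[dict]:
        """Return the next batch for one group and advance only that group's offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
topic.publish({"type": "DocumentSaved", "doc_id": "d1"})
topic.publish({"type": "CommentCreated", "doc_id": "d1"})

indexer_batch = topic.poll("indexer")        # the indexer reads both events
topic.publish({"type": "UserInvited", "doc_id": "d1"})
notifier_batch = topic.poll("notifier")      # the notifier reads all three, independently
indexer_next = topic.poll("indexer")         # the indexer sees only the new event
```

Because each group keeps its own offset, a slow or offline consumer delays only itself. With at-least-once delivery a consumer may see the same event twice after a failure, so handlers should be safe to run more than once.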

Step 3 — The Complete Architecture Diagram

[Figure: Communication layer architecture for the collaborative document editor. Components: Browser Client (User), API Gateway (TLS termination, rate limiting), Collaboration Service, Notification Service, Auth Service, Message Bus (Kafka / queue), Indexer Consumer. Connections: REST (HTTPS), WebSocket, SSE, gRPC (internal), async events (Kafka).]
Complete communication layer: each boundary uses the protocol that best fits its data shape and latency requirements.

Step 4 — Handle the Hard Cases

Reconnection and State Recovery for WebSockets

Mobile clients drop connections frequently. When a client reconnects to the collaboration service, it sends its last-seen sequence number (e.g., { "reconnect": true, "last_seq": 1482 }). The server, which must retain a log of recent operations for this purpose, replays any operations the client missed since that sequence. Without this, a reconnected client silently diverges from the document state: a data-corruption bug masquerading as a network issue.
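
A server-side replay handler might look like the sketch below. It assumes the server keeps a bounded in-memory log of recent operations per document; the OperationLog class, the response shapes, and the "resync" fallback for clients that have been offline longer than the retained history are all assumptions added for the example.

```python
from typing import List


class OperationLog:
    """Bounded log of recent operations for one document."""

    def __init__(self, max_retained: int = 1000) -> None:
        self._ops: List[dict] = []
        self._next_seq = 1
        self._max_retained = max_retained

    def append(self, op: dict) -> int:
        """Assign the next sequence number, store the op, and trim old history."""
        seq = self._next_seq
        self._ops.append({"seq": seq, **op})
        self._next_seq += 1
        if len(self._ops) > self._max_retained:
            self._ops.pop(0)
        return seq

    def handle_reconnect(self, message: dict) -> dict:
        """Replay everything after last_seq, or ask for a full resync if history is gone."""
        last_seq = message["last_seq"]
        oldest = self._ops[0]["seq"] if self._ops else self._next_seq
        if last_seq < oldest - 1:
            # Some missed operations were already trimmed; replay would leave a gap.
            return {"type": "resync", "reason": "history_trimmed"}
        missed = [op for op in self._ops if op["seq"] > last_seq]
        return {"type": "replay", "ops": missed}


log = OperationLog(max_retained=3)
for ch in "hello":
    log.append({"op": "insert", "char": ch})   # seq 1..5, only 3..5 retained

recent = log.handle_reconnect({"reconnect": True, "last_seq": 3})
stale = log.handle_reconnect({"reconnect": True, "last_seq": 1})
```

The explicit resync branch matters as much as the replay: returning a partial replay to a client whose gap exceeds the retained history would reintroduce the silent divergence this mechanism exists to prevent.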

Graceful Degradation for SSE

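If a user is behind an HTTP/1.1 proxy that buffers the response body, SSE breaks silently. Implement a 30-second heartbeat. A comment line (: keepalive\n\n) is enough to keep intermediaries from closing an idle connection, but the browser's EventSource API does not surface comment lines to scripts, so send the heartbeat as a named event if the client must observe it. If the client has not received a heartbeat within 45 seconds, fall back to long polling at 15-second intervals. Surface this degradation in your monitoring: a spike in long-poll traffic signals an infrastructure problem.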
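
The fallback decision reduces to a small watchdog. The sketch below shows the timing logic only, under the assumption that the client calls on_heartbeat whenever it observes a heartbeat; the class name, the transport labels, and the injectable clock (used here so the example runs without waiting) are hypothetical.

```python
import time
from typing import Callable


class SseWatchdog:
    """Decides which transport the client should use based on heartbeat age."""

    def __init__(self, timeout_s: float = 45.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()   # treat connection start as the first heartbeat

    def on_heartbeat(self) -> None:
        """Record that a heartbeat arrived on the SSE stream."""
        self._last_seen = self._clock()

    def transport(self) -> str:
        """Return 'sse' while heartbeats are fresh, 'long_poll' once they go stale."""
        silent_for = self._clock() - self._last_seen
        return "long_poll" if silent_for > self._timeout_s else "sse"


# Simulated timeline using a fake clock instead of real waiting.
now = [0.0]
watchdog = SseWatchdog(clock=lambda: now[0])

now[0] = 30.0
watchdog.on_heartbeat()              # heartbeat arrives on schedule
now[0] = 70.0
healthy = watchdog.transport()       # 40 s since last heartbeat
now[0] = 76.0
degraded = watchdog.transport()      # 46 s since last heartbeat
```

The 45-second timeout is 1.5 times the heartbeat interval, so a single delayed heartbeat does not trigger a fallback. Emitting a metric each time transport() flips to long polling gives monitoring the signal described above.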

Backpressure on the Message Bus

If the indexer consumer falls behind (e.g., during a reindex), the Kafka topic lag grows. Do not let producers block — instead, set a maximum lag alert threshold (e.g., 100,000 messages) and scale indexer consumers horizontally. The collaboration service must never wait on downstream consumers; async decoupling only works if the message bus absorbs bursts.
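
Lag is the difference between the newest offset in the topic and the offset the consumer group has committed. The sketch below turns that into an alert and a scaling target; the function name, the throughput and drain-time parameters, and the decision to ignore the incoming message rate are simplifying assumptions. The cap at the partition count reflects how Kafka consumer groups work: consumers beyond the number of partitions sit idle.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPlan:
    lag: int
    alert: bool
    target_consumers: int


def plan_indexer_scaling(
    log_end_offset: int,
    committed_offset: int,
    current_consumers: int,
    msgs_per_consumer_per_s: int,
    drain_target_s: int,
    partitions: int,
    max_lag: int = 100_000,
) -> ScalingPlan:
    """Alert when lag exceeds max_lag and size the group to drain it in drain_target_s."""
    lag = max(0, log_end_offset - committed_offset)
    if lag <= max_lag:
        return ScalingPlan(lag, False, current_consumers)
    # Consumers needed to clear the backlog in time, ignoring newly arriving messages.
    needed = math.ceil(lag / (msgs_per_consumer_per_s * drain_target_s))
    target = min(partitions, max(current_consumers, needed))
    return ScalingPlan(lag, True, target)


plan = plan_indexer_scaling(
    log_end_offset=1_250_000,
    committed_offset=1_000_000,
    current_consumers=2,
    msgs_per_consumer_per_s=1_000,
    drain_target_s=60,
    partitions=12,
)
print(plan)
```

Note that nothing in this calculation touches the producer: the collaboration service keeps publishing at full speed while the consumer side scales to catch up.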

Step 5 — Decision Summary Table

Protocol decision summary

Boundary                      Protocol          Key Reason
Browser → API Gateway         REST / HTTPS      Cacheable, standard HTTP errors
Browser → Collab Service      WebSocket         Bidirectional, sub-100 ms, high freq
Browser → Notification Svc    SSE               Server-push only, simpler than WS
Gateway → Microservices       gRPC              Typed contract, low latency, binary
Services → Event Bus          Async Messaging   Decoupled fan-out, burst absorption

No single protocol is correct for all boundaries; match the tool to the communication pattern.
Protocol selection summary: each boundary in the system maps to the protocol that best fits its interaction pattern.

The Mindset to Take Forward

The real output of this exercise is not a specific protocol choice — it is the decision-making framework:

  • Direction: unidirectional (SSE) vs bidirectional (WebSocket)?
  • Frequency: occasional (REST) vs continuous (WebSocket / gRPC streaming)?
  • Coupling: synchronous (REST, gRPC) vs asynchronous (message bus)?
  • Audience: external clients needing interoperability (REST) vs internal services needing performance (gRPC)?
  • Failure mode: what happens if the downstream is slow or down? Does the upstream block?
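
The questions above can be written down as a small rule table. This is a deliberately crude sketch meant to show that the framework is mechanical enough to encode, not a substitute for judgement: the Boundary fields, the rule order, and the function name are assumptions, and the failure-mode question is left out because it shapes how a protocol is operated more than which one is picked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    name: str
    asynchronous: bool      # coupling: can the caller continue without an answer?
    internal: bool          # audience: service-to-service rather than external client
    server_initiated: bool  # does the server need to send without being asked?
    bidirectional: bool     # do both sides send at any moment?


def suggest_protocol(b: Boundary) -> str:
    """Apply the coupling, audience, and direction questions in that order."""
    if b.asynchronous:
        return "async messaging"
    if b.internal:
        return "gRPC"
    if b.server_initiated and b.bidirectional:
        return "WebSocket"
    if b.server_initiated:
        return "SSE"
    return "REST"


boundaries = [
    Boundary("Browser -> API Gateway", False, False, False, False),
    Boundary("Browser -> Collaboration Service", False, False, True, True),
    Boundary("Browser -> Notification Service", False, False, True, False),
    Boundary("API Gateway -> Microservices", False, True, False, False),
    Boundary("Services -> Message Bus", True, True, False, False),
]

choices = {b.name: suggest_protocol(b) for b in boundaries}
for name, protocol in choices.items():
    print(f"{name}: {protocol}")
```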

The most common mistake in system design interviews is using the same protocol everywhere: "REST for everything" or "WebSockets for everything". Interviewers look for candidates who can justify why they chose a specific protocol at a specific boundary, including what they ruled out and why.

Practice drill: take any system you use daily (Slack, Google Docs, Uber, Twitter) and map out its communication boundaries. For each one, hypothesize what protocol it likely uses and why. You will start noticing patterns: real-time position tracking is often WebSocket; notification badges are often SSE or long-poll; internal service calls are often gRPC; public APIs are usually REST. Building this intuition now makes protocol decisions in interviews and at work feel natural rather than arbitrary.

Networking and communication is the connective tissue of every distributed system. You now have the full vocabulary and the decision framework. Apply it relentlessly every time you draw an architecture diagram.
