1. The Question
Design a Slack-like messaging system that supports workspaces, channels, direct messages, presence, typing indicators, and message history. The system should provide low-latency real-time delivery, durable storage of messages, offline delivery/retry, and scale to millions of users and hundreds of thousands of concurrent connections.
2. Clarifying Questions
- Are we building for a single region or globally distributed users? (Assume global distribution.)
- What latency targets are required for delivery and presence updates? (Aim for <200ms for typical cases.)
- Do we require strict global ordering of messages per channel, or is eventual ordering acceptable? (Per-channel causal ordering preferred; global ordering not required.)
- How long must messages be retained? (Assume configurable retention: short-term recent messages + long-term archival.)
- Do we need end-to-end encryption? (Out of scope for initial design; note as an extension.)
3. Requirements
Functional:
- User accounts, workspaces, channels (public/private), direct messages (DMs)
- Real-time messaging with typing indicators and presence
- Message history retrieval and search (basic)
- Offline delivery with retries and dead-letter handling
Non-functional:
- Low latency for real-time interactions (<200ms typical)
- High availability and durability of messages
- Horizontal scalability to millions of users and hundreds of thousands of concurrent connections (~300k peak)
- Fault tolerance and graceful degradation
- Reasonable consistency: per-channel ordering and at-least-once delivery with deduplication
4. Scale Estimates
Assumptions for capacity planning:
- Total registered users: 50M
- Daily active users (DAU): 10M
- Peak concurrent connections: 300k
- Average messages per DAU per day: 200 -> 2B messages/day
- Average message size: 1 KB
- Retention: 30 days hot storage, 1 year cold archive
Derived factors:
- Peak writes per second: ~30k-40k writes/s (2B messages/day ≈ 23k/s average; assume peaks run roughly 1.5x average)
- Read traffic: timeline reads when users open a channel; can lead to heavy fan-out reads
- Storage: 2B messages/day * 1 KB ≈ 2 TB/day before replication; with Cassandra replication (RF=3), ~6 TB/day of raw (replicated) storage
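A quick back-of-envelope script for the figures above; the 1.5x peak-to-average ratio is an assumption, not a measured value:

```python
# Back-of-envelope check of the capacity estimates (pure arithmetic, no deps).
DAU = 10_000_000
MSGS_PER_USER_PER_DAY = 200
AVG_MSG_SIZE_BYTES = 1024
REPLICATION_FACTOR = 3
PEAK_TO_AVG = 1.5                                        # assumed peak ratio

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY               # 2.0e9
avg_writes_per_sec = msgs_per_day / 86_400               # ~23k/s
peak_writes_per_sec = avg_writes_per_sec * PEAK_TO_AVG   # ~35k/s
raw_bytes_per_day = msgs_per_day * AVG_MSG_SIZE_BYTES    # ~2 TB
replicated_bytes_per_day = raw_bytes_per_day * REPLICATION_FACTOR  # ~6 TB

print(f"messages/day:        {msgs_per_day:,.0f}")
print(f"avg writes/s:        {avg_writes_per_sec:,.0f}")
print(f"peak writes/s (est): {peak_writes_per_sec:,.0f}")
print(f"storage/day (raw):   {raw_bytes_per_day / 1e12:.1f} TB")
print(f"storage/day (RF=3):  {replicated_bytes_per_day / 1e12:.1f} TB")
```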
5. Data Model
Relational (PostgreSQL) - metadata and relational entities:
- users(id PK, username, email, profile, created_at)
- workspaces(id PK, name, owner_id, created_at)
- channels(id PK, workspace_id FK, name, type, created_at)
- memberships(id PK, workspace_id, user_id, role)
- channel_memberships(channel_id, user_id, joined_at)
NoSQL (Cassandra) - messages (high write throughput, partition tolerance):
- messages_by_channel: partition key (channel_id, time_bucket, e.g. YYYYMMDD), clustering column message_ts DESC; columns: message_id, sender_id, body, attachments_meta, metadata (edited flag, reactions)
Partitioning strategy: partition by (channel_id, time_bucket) to bound partition size. Use clustering order by message_ts DESC for efficient reads of recent messages. Use a unique message_id (UUID or time-based) for idempotency.
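A minimal sketch of this table driven through the Python cassandra-driver; the keyspace name, seed host, daily bucket format, and SimpleStrategy replication are illustrative placeholders, not fixed choices:

```python
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed.internal"]).connect()   # placeholder seed host

# SimpleStrategy keeps the sketch self-contained; production would use
# NetworkTopologyStrategy with RF=3 per region (see section 8).
session.execute("""
CREATE KEYSPACE IF NOT EXISTS chat
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Partition key (channel_id, time_bucket) bounds each partition to one day of
# one channel; clustering by message_ts DESC makes "latest N messages" cheap.
session.execute("""
CREATE TABLE IF NOT EXISTS chat.messages_by_channel (
    channel_id       text,
    time_bucket      text,              -- e.g. '20240115'
    message_ts       timestamp,
    message_id       timeuuid,          -- unique, time-based: used for idempotency
    sender_id        text,
    body             text,
    attachments_meta text,
    metadata         map<text, text>,   -- edited flag, reaction counts, etc.
    PRIMARY KEY ((channel_id, time_bucket), message_ts, message_id)
) WITH CLUSTERING ORDER BY (message_ts DESC, message_id DESC)
""")

def time_bucket(ts: datetime) -> str:
    return ts.strftime("%Y%m%d")

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO chat.messages_by_channel "
    "(channel_id, time_bucket, message_ts, message_id, sender_id, body) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ("c123", time_bucket(now), now, uuid.uuid1(), "u1", "hello"),
)
```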
Offline delivery/queue metadata (Cassandra or Kafka + durable store):
- delivery_state(message_id, recipient_id, status, last_attempt_ts, attempts)
Secondary storage: S3 for attachments and long-term archive.
6. API Design
REST endpoints (control plane):
- POST /login -> returns auth token
- GET /workspaces/{id}/channels
- POST /workspaces/{id}/channels -> create channel
- GET /channels/{id}/messages?limit=50&before=ts -> history (see the read-path sketch after this list)
- POST /channels/{id}/messages -> persistent send (also used by WebSocket gateway)
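One possible read path behind the history endpoint, assuming the messages_by_channel table from section 5; walking backwards across daily buckets until the page fills is an assumption about how bucket boundaries are handled:

```python
from datetime import datetime, timedelta

HISTORY_CQL = """
SELECT message_id, sender_id, body, message_ts
FROM chat.messages_by_channel
WHERE channel_id = %s AND time_bucket = %s AND message_ts < %s
LIMIT %s
"""

def fetch_history(session, channel_id: str, before: datetime, limit: int = 50):
    """Read newest-first; step back one daily bucket at a time until filled."""
    rows, day = [], before
    while len(rows) < limit and day > before - timedelta(days=30):  # hot window
        bucket = day.strftime("%Y%m%d")
        rows += list(session.execute(
            HISTORY_CQL, (channel_id, bucket, before, limit - len(rows))))
        day -= timedelta(days=1)
    return rows
```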
WebSocket (data plane) — persistent real-time connection:
- Connect: wss://gateway.example.com/socket?token=...
- Client -> Gateway messages (JSON):
  - {"type":"message.send","channel_id":"c123","body":"...","client_msg_id":"uuid"}
  - {"type":"typing.start","channel_id":"c123"}
- Gateway -> Client messages:
  - {"type":"message.deliver","channel_id":"c123","message":{...}}
  - {"type":"presence.update","user_id":"u1","status":"online"}
Idempotency: client_msg_id used to dedupe retries. Server returns ack with server_msg_id and receipt status.
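A minimal server-side sketch of that dedup/ack contract; the in-process dict and the persist_and_fanout placeholder stand in for Redis (or the message store itself) and the real write/fan-out path:

```python
import uuid

# (sender_id, client_msg_id) -> server_msg_id; entries should expire after the
# client's retry window in a real deployment (e.g. kept in Redis with a TTL).
_seen: dict[tuple[str, str], str] = {}

def persist_and_fanout(frame: dict, server_msg_id: str) -> None:
    """Placeholder for the real work: append to Kafka, persist, push to members."""

def handle_message_send(sender_id: str, frame: dict) -> dict:
    key = (sender_id, frame["client_msg_id"])
    if key in _seen:
        # Retry of a frame we already accepted: ack again, do not re-deliver.
        return {"type": "message.ack", "client_msg_id": frame["client_msg_id"],
                "server_msg_id": _seen[key], "status": "duplicate"}
    server_msg_id = str(uuid.uuid1())   # time-based id, also a sort tiebreaker
    _seen[key] = server_msg_id
    persist_and_fanout(frame, server_msg_id)
    return {"type": "message.ack", "client_msg_id": frame["client_msg_id"],
            "server_msg_id": server_msg_id, "status": "accepted"}
```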
7. High-Level Architecture
Core components:
- WebSocket Gateway / Edge: handles auth, TLS termination, and maintains client connections. Holds no business state beyond the live connections; forwards events to chat servers or the message broker.
- Distributed Chat Servers: cluster of servers responsible for specific channel partitions; maintain in-memory presence and routing tables. Peer-to-peer communication between chat servers for inter-server routing.
- Service Registry (ZooKeeper): tracks chat server instances, their partitions, and availability for routing. Chat servers register themselves in ZooKeeper.
- Message Broker (Kafka): durable commit log for messages and events; enables fan-out, replay, and decoupling producers/consumers.
- Durable Storage (Cassandra): primary store for messages with high write throughput and partitioning.
- Relational DB (Postgres): user/workspace/channel metadata and transactional operations.
- Delivery Queue / Retry System: Kafka + worker pool for offline delivery and DLQ handling. Dead-letter queue for permanently failed deliveries.
- Presence Service: aggregated presence stored in an in-memory store (Redis/Redis Cluster) for quick lookups (see the sketch after this component list).
- Attachment Store: S3 or equivalent for large blobs.
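One possible presence scheme, assuming redis-py and TTL-based heartbeats: the gateway refreshes a short-lived key while the socket stays healthy, so a missed heartbeat naturally decays to offline. The host name and TTL are illustrative:

```python
import redis

r = redis.Redis(host="presence-redis.internal")   # placeholder host
PRESENCE_TTL_S = 60                               # ~2 missed 30s heartbeats

def heartbeat(user_id: str, status: str = "online") -> None:
    # Refresh the key on every client heartbeat; expiry implies offline.
    r.setex(f"presence:{user_id}", PRESENCE_TTL_S, status)

def get_presence(user_ids: list[str]) -> dict[str, str]:
    values = r.mget([f"presence:{u}" for u in user_ids])
    return {u: (v.decode() if v else "offline") for u, v in zip(user_ids, values)}
```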
Flow (send message):
- Client sends via WebSocket to Gateway.
- Gateway authenticates and forwards to the responsible Chat Server (based on channel partition table from ZooKeeper or consistent hash).
- Chat Server writes the message to Kafka for durability and replication; Cassandra is populated asynchronously (typically by a consumer reading from Kafka).
- Chat Server routes the message to members: local members are pushed directly over their WebSockets; remote members are reached by forwarding to peer chat servers (peer-to-peer) or publishing to Kafka delivery topics (see the sketch after these steps).
- If recipient offline, enqueue delivery in retry queue; attempt push when they reconnect. After N attempts, move to DLQ and mark as undelivered.
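A sketch of that chat-server send path, assuming kafka-python; keying the record by channel_id pins each channel to one Kafka partition, which is what preserves per-channel order. The topic, broker address, and routing helpers are placeholders:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1.internal:9092"],   # placeholder brokers
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",                                    # don't ack the client until replicated
)

def handle_send(message: dict) -> None:
    channel_id = message["channel_id"]
    # 1) Durable append, keyed by channel so a single ordered Kafka partition
    #    owns each channel; a consumer persists the log to Cassandra.
    producer.send("chat-messages", key=channel_id, value=message)
    # 2) Real-time push to channel members connected to this chat server.
    for user_id, socket in local_members(channel_id):
        socket.send(json.dumps({"type": "message.deliver",
                                "channel_id": channel_id, "message": message}))
    # 3) Peer-to-peer forward to chat servers hosting the remaining members.
    for peer in remote_servers_for(channel_id):
        forward_to_peer(peer, message)

def local_members(channel_id):
    """Placeholder: locally connected (user_id, websocket) pairs for the channel."""
    return []

def remote_servers_for(channel_id):
    """Placeholder: peer servers from the ZooKeeper-backed routing table."""
    return []

def forward_to_peer(peer, message):
    """Placeholder: server-to-server RPC or a per-server Kafka delivery topic."""
```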
Use ZooKeeper for membership and partition leader election; chat servers coordinate leader responsibilities for channel partitions. Kafka ensures ordered append and durable commit; Cassandra handles scalable reads/writes for timeline retrieval.
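For the consistent-hash routing option, a small illustrative hash ring mapping channels to chat servers; in practice the live server list would come from ZooKeeper watches rather than a hard-coded list:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers: list[str], vnodes: int = 100):
        # Virtual nodes smooth load and limit reshuffling when servers change.
        self._ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, channel_id: str) -> str:
        """First ring position clockwise of the channel's hash owns the channel."""
        idx = bisect.bisect(self._keys, self._hash(channel_id)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["chat-1.internal", "chat-2.internal", "chat-3.internal"])
print(ring.server_for("c123"))   # stable until the server set changes
```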
8. Detailed Design Decisions
- WebSockets for low-latency bi-directional communication; use heartbeat and reconnect strategies.
- ZooKeeper for service registry and leader election: lightweight, proven for coordination (can replace with etcd/Consul).
- Peer-to-peer server communication: reduces central broker load for fan-out; servers directly forward messages to each other when they host recipients.
- Kafka as the durability and fan-out backbone: producers (chat servers) write to topic per channel partition; consumers persist to Cassandra and trigger deliveries. Guarantees ordered writes per partition.
- Cassandra for message storage: optimized for write-heavy workloads, tunable consistency, and partitioning. Use replication factor 3 and appropriate consistency level (LOCAL_QUORUM for reads/writes in region).
- PostgreSQL for metadata where relational integrity and transactions are important.
- Dead-letter/retry queue mechanism: consumer workers attempt delivery and retry failures with exponential backoff. After max attempts, the message goes to the DLQ for manual/operator review or alternate routing (see the retry sketch after this list).
- Delivery guarantees: aim for at-least-once delivery with idempotency at the client via message IDs; dedupe on server using message_id or client_msg_id.
- Ordering: per-channel ordering preserved via Kafka partitioning and by having channel leaders coordinate writes; eventual consistency across regions is acceptable with causal ordering per channel.
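A sketch of the retry/DLQ behaviour described in the list above: capped attempts with full-jitter exponential backoff, then park the message for operator review. Queue wiring is abstracted behind the two callables; in the design the attempts would be driven from a Kafka retry topic plus the delivery_state table:

```python
import random
import time

MAX_ATTEMPTS = 5
BASE_DELAY_S = 2.0
MAX_DELAY_S = 300.0

def backoff_delay(attempt: int) -> float:
    """Full-jitter exponential backoff: random in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt)))

def deliver_with_retry(message: dict, try_push, send_to_dlq) -> bool:
    for attempt in range(MAX_ATTEMPTS):
        if try_push(message):               # e.g. push over the recipient's WebSocket
            return True
        time.sleep(backoff_delay(attempt))  # real workers would reschedule, not sleep
    send_to_dlq(message)                    # give up: park for operator review
    return False
```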
9. Bottlenecks & Scaling
- Fan-out to large channels (thousands of members): heavy network and CPU cost. Mitigations: use multicast via message broker, batch pushes, or progressive fan-out (store & let clients pull for very large channels).
- Hot channels (very popular channels): shard channel by time buckets or split heavy channels into sub-partitions with a metadata layer to reassemble.
- Single points: ZooKeeper cluster must be highly available; run across multiple nodes and regions. Kafka and Cassandra clusters must be replicated and monitored.
- Storage I/O: high write throughput to Cassandra; ensure adequate cluster size and disk throughput; use SSDs.
- Message ordering vs availability trade-offs: stricter ordering reduces availability; prefer per-channel order and eventual consistency across region with cross-region replication strategies.
- Connection scale: WebSocket gateways must be horizontally scalable with sticky routing or distributed session placement; use a load balancer that supports long-lived connections and route to the right gateway based on consistent hashing or ephemeral registration.
- Retry storms: when a region fails and many clients reconnect at once, limit the reconnection rate and use jittered backoff (see the token-bucket sketch below).
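One way to implement that reconnection rate limit: a per-gateway token bucket that sheds excess handshakes, which clients then retry with jittered backoff (as in the retry sketch in section 8). The budget numbers are illustrative:

```python
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller rejects the handshake; client backs off with jitter

accepts = TokenBucket(rate_per_s=500, burst=2000)   # illustrative per-gateway budget
```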
Monitoring & mitigation:
- Metrics: message latency, consumer lag, delivery success rate, DLQ rate.
- Autoscaling: chat server pools, gateway pools, broker partitions.
- Backpressure: if downstream (Cassandra/Kafka) overloaded, apply backpressure at gateway/chat server and degrade non-critical features.
10. Follow-up Questions / Extensions
- Add search across messages (indexing pipeline, eventual index in Elasticsearch with privacy controls).
- End-to-end encryption for private channels and DMs.
- Threaded conversations and message edits/deletes — impact on storage and indexes.
- Rich presence signals and read receipts — extra event types and storage needs.
- Multi-device sync: ensure ordered delivery and consistent read markers across devices.
- Compliance and retention policies: legal holds, export, and deletion workflows.
- Optimize for very large channels (e.g., #general with 100k users): design CDN-like distribution or server-side deduped fan-out.
11. Wrap-up
Proposed a WebSocket-based real-time architecture using distributed chat servers, ZooKeeper for service registry, Kafka for durable commit logs and fan-out, Cassandra for high-throughput message storage, and PostgreSQL for relational metadata. Included dead-letter/retry queue for offline delivery and discussed availability trade-offs, write paths, replication, and retry logic to ensure durable, low-latency delivery. Key trade-offs: per-channel ordering and at-least-once delivery with idempotency vs strict global consistency.