Design Slack — Real-time Messaging System

HARD · 20 min · messaging · real-time · websockets · cassandra · postgres · zookeeper · architecture
Asked at: Coinbase, OpenAI

Design a Slack-like real-time messaging system supporting channels, DMs, presence, and durable message delivery with low latency and high availability.

1. The Question

Design a Slack-like messaging system that supports workspaces, channels, direct messages, presence, typing indicators, and message history. The system should provide low-latency real-time delivery, durable storage of messages, offline delivery/retry, and scale to millions of users and hundreds of thousands of concurrent connections.

2. Clarifying Questions

  • Are we building for a single region or globally distributed users? (Assume global distribution.)
  • What latency targets are required for delivery and presence updates? (Aim for <200ms for typical cases.)
  • Do we require strict global ordering of messages per channel, or is eventual ordering acceptable? (Per-channel causal ordering preferred; global ordering not required.)
  • How long must messages be retained? (Assume configurable retention: short-term recent messages + long-term archival.)
  • Do we need end-to-end encryption? (Out of scope for initial design; note as an extension.)

3. Requirements

Functional:

  • User accounts, workspaces, channels (public/private), direct messages (DMs)
  • Real-time messaging with typing indicators and presence
  • Message history retrieval and search (basic)
  • Offline delivery with retries and dead-letter handling

Non-functional:

  • Low latency for real-time interactions (<200ms typical)
  • High availability and durability of messages
  • Horizontal scalability to millions of users and hundreds of thousands of concurrent connections
  • Fault tolerance and graceful degradation
  • Reasonable consistency: per-channel ordering and at-least-once delivery with deduplication

4. Scale Estimates

Assumptions for capacity planning:

  • Total registered users: 50M
  • Daily active users (DAU): 10M
  • Peak concurrent connections: 300k
  • Average messages per DAU per day: 200 -> 2B messages/day
  • Average message size: 1 KB
  • Retention: 30 days hot storage, 1 year cold archive

Derived factors:

  • Peak writes per second: ~30k-40k writes/s (2B messages/day averages ~23k/s; peaks run higher)
  • Read traffic: timeline reads when users open a channel; can lead to heavy fan-out reads
  • Storage: 2B messages/day * 1 KB ≈ 2 TB/day before replication; with Cassandra replication (RF=3) that is ~6 TB/day of on-disk storage (see the back-of-envelope check below)
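
To sanity-check these figures, a quick back-of-envelope calculation; the peak-to-average factor of 1.5 is an assumption, the other inputs come from the list above:

    # Back-of-envelope capacity check for the assumptions above.
    DAU = 10_000_000
    MSGS_PER_USER_PER_DAY = 200
    AVG_MSG_BYTES = 1024          # ~1 KB per message
    REPLICATION_FACTOR = 3        # Cassandra RF=3
    PEAK_FACTOR = 1.5             # assumed peak-to-average ratio

    msgs_per_day = DAU * MSGS_PER_USER_PER_DAY               # 2e9 messages/day
    avg_writes_per_sec = msgs_per_day / 86_400                # ~23k writes/s
    peak_writes_per_sec = avg_writes_per_sec * PEAK_FACTOR    # ~35k writes/s

    raw_tb_per_day = msgs_per_day * AVG_MSG_BYTES / 1e12      # ~2 TB/day before replication
    replicated_tb_per_day = raw_tb_per_day * REPLICATION_FACTOR  # ~6 TB/day on disk

    print(f"avg writes/s:  {avg_writes_per_sec:,.0f}")
    print(f"peak writes/s: {peak_writes_per_sec:,.0f}")
    print(f"storage/day:   {raw_tb_per_day:.1f} TB raw, {replicated_tb_per_day:.1f} TB replicated")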

5. Data Model

Relational (PostgreSQL) - metadata and relational entities:

  • users(id PK, username, email, profile, created_at)
  • workspaces(id PK, name, owner_id, created_at)
  • channels(id PK, workspace_id FK, name, type, created_at)
  • memberships(id PK, workspace_id, user_id, role)
  • channel_memberships(channel_id, user_id, joined_at)

NoSQL (Cassandra) - messages (high write throughput, partition tolerance):

  • messages_by_channel ( (channel_id, time_bucket) composite partition key (time_bucket e.g. YYYYMMDD), message_ts clustering DESC, message_id, sender_id, body, attachments_meta, metadata (edited, reactions) )

Partitioning strategy: partition by (channel_id, time_bucket) to bound partition size. Use clustering order by message_ts DESC for efficient reads of recent messages. Use a unique message_id (UUID or time-based) for idempotency.
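
A minimal sketch of that table, with the CQL embedded as a Python string for use with a Cassandra driver (e.g., session.execute(MESSAGES_BY_CHANNEL_DDL)); the column names and types mirror the bullet above and are illustrative, not a fixed schema:

    # Illustrative CQL for the messages table described above.
    MESSAGES_BY_CHANNEL_DDL = """
    CREATE TABLE IF NOT EXISTS messages_by_channel (
        channel_id        text,
        time_bucket       text,                 -- e.g. 'YYYYMMDD', bounds partition size
        message_ts        timestamp,
        message_id        timeuuid,             -- unique id, used for idempotent writes
        sender_id         text,
        body              text,
        attachments_meta  map<text, text>,
        metadata          map<text, text>,      -- edited flag, reaction counts, etc.
        PRIMARY KEY ((channel_id, time_bucket), message_ts, message_id)
    ) WITH CLUSTERING ORDER BY (message_ts DESC, message_id DESC);
    """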

Offline delivery/queue metadata (Cassandra or Kafka + durable store):

  • delivery_state(message_id, recipient_id, status, last_attempt_ts, attempts)

Secondary storage: S3 for attachments and long-term archive.

6. API Design

REST endpoints (control plane):

  • POST /login -> returns auth token
  • GET /workspaces/{id}/channels
  • POST /workspaces/{id}/channels -> create channel
  • GET /channels/{id}/messages?limit=50&before=ts -> history (see the pagination sketch after this list)
  • POST /channels/{id}/messages -> persistent send (also used by WebSocket gateway)
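
The history endpoint could page backwards across time buckets roughly as follows; `session` is assumed to be a Cassandra driver session, and the 30-bucket lookback cap is an arbitrary choice:

    from datetime import datetime, timedelta

    def fetch_history(session, channel_id: str, before_ts: datetime, limit: int = 50):
        """Page backwards through messages_by_channel, walking older time buckets
        until `limit` rows are collected or the lookback cap is hit."""
        rows, day = [], before_ts
        for _ in range(30):                      # cap lookback at 30 buckets (days)
            bucket = day.strftime("%Y%m%d")
            result = session.execute(
                "SELECT message_id, sender_id, body, message_ts "
                "FROM messages_by_channel "
                "WHERE channel_id = %s AND time_bucket = %s AND message_ts < %s "
                "LIMIT %s",
                (channel_id, bucket, before_ts, limit - len(rows)),
            )
            rows.extend(result)
            if len(rows) >= limit:
                break
            day -= timedelta(days=1)             # fall back to the previous bucket
        return rows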

WebSocket (data plane) — persistent real-time connection:

  • Connect: ws://gateway.example.com/socket?token=...

  • Client -> Gateway messages (JSON):

    • {"type":"message.send","channel_id":"c123","body":"...","client_msg_id":"uuid"}
    • {"type":"typing.start","channel_id":"c123"}
  • Gateway -> Client messages:

    • {"type":"message.deliver","channel_id":"c123","message":{...}}
    • {"type":"presence.update","user_id":"u1","status":"online"}
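
A minimal client-side sketch of this protocol using the third-party websockets package; the URL shape follows the connect example above, and the token is a placeholder:

    import asyncio
    import json
    import uuid

    import websockets   # pip install websockets

    async def send_message(token: str, channel_id: str, body: str) -> None:
        async with websockets.connect(f"ws://gateway.example.com/socket?token={token}") as ws:
            await ws.send(json.dumps({
                "type": "message.send",
                "channel_id": channel_id,
                "body": body,
                "client_msg_id": str(uuid.uuid4()),   # lets the server dedupe retries
            }))
            ack = json.loads(await ws.recv())         # wait for the gateway's ack event
            print(ack)

    asyncio.run(send_message("TOKEN", "c123", "hello"))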

Idempotency: client_msg_id is used to dedupe retries; the server returns an ack with server_msg_id and receipt status.
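
A sketch of server-side deduplication keyed on (sender_id, client_msg_id); an in-process dict stands in for what would more likely be Redis or a conditional write against the message table:

    import uuid

    # Maps (sender_id, client_msg_id) -> server_msg_id so retried sends ack the
    # same server id instead of creating a duplicate message.
    _dedupe_cache: dict[tuple[str, str], str] = {}

    def handle_message_send(sender_id: str, channel_id: str, body: str, client_msg_id: str) -> dict:
        key = (sender_id, client_msg_id)
        if key in _dedupe_cache:
            # Retry of a message we already accepted: re-ack, do not persist again.
            return {"type": "message.ack", "client_msg_id": client_msg_id,
                    "server_msg_id": _dedupe_cache[key], "status": "duplicate"}

        server_msg_id = str(uuid.uuid1())        # time-based id, also used as message_id
        _dedupe_cache[key] = server_msg_id
        # ... persist to Kafka/Cassandra and fan out to channel members here ...
        return {"type": "message.ack", "client_msg_id": client_msg_id,
                "server_msg_id": server_msg_id, "status": "accepted"}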

7. High-Level Architecture

Core components:

  • WebSocket Gateway / Edge: terminates TLS, authenticates clients, and maintains their connections; holds no chat state itself, forwarding events to chat servers or the message broker.
  • Distributed Chat Servers: cluster of servers responsible for specific channel partitions; maintain in-memory presence and routing tables. Peer-to-peer communication between chat servers for inter-server routing.
  • Service Registry (ZooKeeper): tracks chat server instances, their partitions, and availability for routing. Chat servers register themselves in ZooKeeper.
  • Message Broker (Kafka): durable commit log for messages and events; enables fan-out, replay, and decoupling producers/consumers.
  • Durable Storage (Cassandra): primary store for messages with high write throughput and partitioning.
  • Relational DB (Postgres): user/workspace/channel metadata and transactional operations.
  • Delivery Queue / Retry System: Kafka + worker pool for offline delivery and DLQ handling. Dead-letter queue for permanently failed deliveries.
  • Presence Service: aggregated presence kept in an in-memory store (Redis / Redis Cluster) for quick lookups (see the sketch after this list).
  • Attachment Store: S3 or equivalent for large blobs.
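
A sketch of the presence service built on Redis keys with a TTL refreshed by client heartbeats; the key naming and the 60-second TTL are assumptions:

    import redis

    r = redis.Redis(host="localhost", port=6379)    # assumed Redis endpoint

    PRESENCE_TTL_SECONDS = 60                       # online if a heartbeat arrived within TTL

    def heartbeat(user_id: str, status: str = "online") -> None:
        # Gateway calls this on every client heartbeat; the key expires if heartbeats stop.
        r.set(f"presence:{user_id}", status, ex=PRESENCE_TTL_SECONDS)

    def get_presence(user_ids: list[str]) -> dict[str, str]:
        # Batch lookup when a client opens a channel member list.
        values = r.mget([f"presence:{uid}" for uid in user_ids])
        return {uid: (v.decode() if v else "offline") for uid, v in zip(user_ids, values)}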

Flow (send message):

  1. Client sends via WebSocket to Gateway.
  2. Gateway authenticates and forwards to the responsible Chat Server (based on channel partition table from ZooKeeper or consistent hash).
  3. The Chat Server writes the message to Kafka for durability and replication, and persists it to Cassandra asynchronously (or via a consumer reading from Kafka).
  4. The Chat Server routes the message to members: if local, it pushes directly over their WebSockets; if remote, it forwards to the remote chat servers (peer-to-peer) or publishes to Kafka topics for delivery.
  5. If a recipient is offline, enqueue the delivery in the retry queue and attempt a push when they reconnect. After N attempts, move it to the DLQ and mark it undelivered.

Use ZooKeeper for membership and partition leader election; chat servers coordinate leader responsibilities for channel partitions. Kafka ensures ordered append and durable commit; Cassandra handles scalable reads/writes for timeline retrieval.
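
A sketch of channel-to-chat-server routing with a consistent-hash ring; in a real deployment the server list would come from the ZooKeeper registry (e.g., ephemeral znodes) rather than being hard-coded:

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Maps channel_id -> chat server; adding or removing a server only
        remaps the channels that hashed near it."""
        def __init__(self, servers: list[str], vnodes: int = 100):
            self._ring: list[tuple[int, str]] = []
            for server in servers:                 # in practice: live list from ZooKeeper
                for i in range(vnodes):
                    self._ring.append((self._hash(f"{server}#{i}"), server))
            self._ring.sort()
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def route(self, channel_id: str) -> str:
            idx = bisect.bisect(self._keys, self._hash(channel_id)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["chat-1:9000", "chat-2:9000", "chat-3:9000"])
    print(ring.route("c123"))    # the gateway forwards the send to this chat server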

8. Detailed Design Decisions

  • WebSockets for low-latency bi-directional communication; use heartbeat and reconnect strategies.
  • ZooKeeper for service registry and leader election: lightweight, proven for coordination (can replace with etcd/Consul).
  • Peer-to-peer server communication: reduces central broker load for fan-out; servers directly forward messages to each other when they host recipients.
  • Kafka as the durability and fan-out backbone: producers (chat servers) write to topics partitioned by channel; consumers persist to Cassandra and trigger deliveries. This guarantees ordered writes per partition.
  • Cassandra for message storage: optimized for write-heavy workloads, tunable consistency, and partitioning. Use replication factor 3 and appropriate consistency level (LOCAL_QUORUM for reads/writes in region).
  • PostgreSQL for metadata where relational integrity and transactions are important.
  • Dead-letter/retry queue mechanism: consumer workers attempt delivery, retrying with exponential backoff on failure (see the retry sketch after this list). After the maximum number of attempts, the message goes to a DLQ for manual/operator review or alternate routing.
  • Delivery guarantees: aim for at-least-once delivery with idempotency at the client via message IDs; dedupe on server using message_id or client_msg_id.
  • Ordering: per-channel ordering preserved via Kafka partitioning and by having channel leaders coordinate writes; eventual consistency across regions is acceptable with causal ordering per channel.
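
A sketch of the delivery retry loop with exponential backoff and a dead-letter hand-off; the attempt cap, base delay, and the `try_push`/`send_to_dlq` callables are assumptions standing in for the real delivery path and DLQ producer:

    import random
    import time

    MAX_ATTEMPTS = 5
    BASE_DELAY_SECONDS = 1.0

    def deliver_with_retry(message: dict, recipient_id: str, try_push, send_to_dlq) -> bool:
        """Attempt delivery, backing off exponentially with jitter between attempts;
        after MAX_ATTEMPTS the message is handed to the dead-letter queue."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if try_push(recipient_id, message):          # push over WebSocket / mobile push
                return True
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = BASE_DELAY_SECONDS * (2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
        send_to_dlq(recipient_id, message)               # mark undelivered for operator review
        return False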

9. Bottlenecks & Scaling

  • Fan-out to large channels (thousands of members): heavy network and CPU cost. Mitigations: multicast via the message broker, batched pushes (see the sketch after this list), or progressive fan-out (store and let clients pull for very large channels).
  • Hot channels (very popular channels): shard channel by time buckets or split heavy channels into sub-partitions with a metadata layer to reassemble.
  • Single points: ZooKeeper cluster must be highly available; run across multiple nodes and regions. Kafka and Cassandra clusters must be replicated and monitored.
  • Storage I/O: high write throughput to Cassandra; ensure adequate cluster size and disk throughput; use SSDs.
  • Message ordering vs availability trade-offs: stricter ordering reduces availability; prefer per-channel ordering and eventual consistency across regions with cross-region replication strategies.
  • Connection scale: WebSocket gateways must be horizontally scalable with sticky routing or distributed session placement; use a load balancer that supports long-lived connections and routes to the right gateway via consistent hashing or ephemeral registration.
  • Retry storms: when a region fails and many clients reconnect at once, limit the reconnection rate and use backoff with jitter.
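
A sketch of the batched fan-out idea for large channels: group recipients by the gateway holding their connection and send one payload per gateway; the `gateway_of` lookup and `push_batch` call are assumed interfaces:

    from collections import defaultdict

    def fan_out(message: dict, member_ids: list[str], gateway_of, push_batch) -> None:
        """Group channel members by their gateway so each gateway receives a single
        batched push instead of one message per member."""
        batches: dict[str, list[str]] = defaultdict(list)
        for member_id in member_ids:
            gateway = gateway_of(member_id)          # presence/session lookup, e.g. Redis
            if gateway is not None:                  # offline members go to the retry queue
                batches[gateway].append(member_id)
        for gateway, recipients in batches.items():
            push_batch(gateway, recipients, message) # one network call per gateway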

Monitoring & mitigation:

  • Metrics: message latency, consumer lag, delivery success rate, DLQ rate.
  • Autoscaling: chat server pools, gateway pools, broker partitions.
  • Backpressure: if downstream (Cassandra/Kafka) overloaded, apply backpressure at gateway/chat server and degrade non-critical features.
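
A sketch of gateway-side load shedding under backpressure: messages still queue (and block when the pipeline is saturated), while non-critical events such as typing and presence updates are dropped; the queue bound and event classification are assumptions:

    import asyncio

    CRITICAL_TYPES = {"message.send", "message.deliver"}

    outbound: asyncio.Queue = asyncio.Queue(maxsize=10_000)   # bounded per-connection queue

    async def enqueue_event(event: dict) -> bool:
        """Queue an event for the downstream pipeline; under pressure, drop
        non-critical events (typing, presence) instead of blocking."""
        if event["type"] in CRITICAL_TYPES:
            await outbound.put(event)                 # apply backpressure for messages
            return True
        try:
            outbound.put_nowait(event)                # best-effort for non-critical events
            return True
        except asyncio.QueueFull:
            return False                              # degraded: typing/presence dropped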

10. Follow-up Questions / Extensions

  • Add search across messages (indexing pipeline, eventual index in Elasticsearch with privacy controls).
  • End-to-end encryption for private channels and DMs.
  • Threaded conversations and message edits/deletes — impact on storage and indexes.
  • Rich presence signals and read receipts — extra event types and storage needs.
  • Multi-device sync: ensure ordered delivery and consistent read markers across devices.
  • Compliance and retention policies: legal holds, export, and deletion workflows.
  • Optimize for very large channels (e.g., #general with 100k users): design CDN-like distribution or server-side deduped fan-out.

11. Wrap-up

Proposed a WebSocket-based real-time architecture using distributed chat servers, ZooKeeper for service registry, Kafka for durable commit logs and fan-out, Cassandra for high-throughput message storage, and PostgreSQL for relational metadata. Included dead-letter/retry queue for offline delivery and discussed availability trade-offs, write paths, replication, and retry logic to ensure durable, low-latency delivery. Key trade-offs: per-channel ordering and at-least-once delivery with idempotency vs strict global consistency.
