1. The Question
Design a distributed logging system to ingest onboarding events (or logs) at high throughput, validate and transform them asynchronously, provide near-real-time aggregated metrics, store queryable onboarding metadata for low-latency reads, archive raw data for audits, and expose an API for querying metadata. The system must be durable, highly available, scalable, and observable.
2. Clarifying Questions
- What is an "onboarding" event shape and typical size?
- Expected peak and steady-state throughput (events/sec and bytes/sec)?
- Ordering guarantees required (per-user, per-entity)?
- Acceptable end-to-end latency for "real-time" metrics?
- Query patterns for metadata (by onboarding ID, user ID, time window)?
- Retention / compliance requirements for raw data archives?
- Required SLOs, durability, and delivery guarantees (at-least-once vs. exactly-once)?
- Allowed tech stack constraints (e.g., Kafka, Spark, Cassandra are available)?
3. Requirements
Functional:
- Ingest onboarding events via an API gateway.
- Lightweight immediate validation and fast client ACK.
- Asynchronous processing for detailed validation, enrichment, and aggregation.
- Store queryable onboarding metadata for low-latency reads.
- Durable archival of raw/processed payloads.
- Expose an API for querying metadata with filtering and pagination.
Non-functional:
- High throughput and horizontal scalability.
- No data loss (durability) and reasonable consistency for metadata.
- Near-real-time aggregation (seconds to sub-minute latency).
- Observability: metrics, logs, traces, and alerts.
- Cost-effective long-term storage.
4. Scale Estimates
- Example baseline: 10k events/sec steady, 100k events/sec spikes.
- Average event size: 1–10 KB (JSON) → 10–100 MB/sec steady, 1 GB/sec spikes.
- Partitions: size Kafka topics to keep per-partition throughput under a conservative limit (e.g., ~10 MB/s per partition) → roughly 1–10 partitions for the steady state (10–100 MB/s) and ~100 to absorb 1 GB/s spikes; provision for spikes up front, since adding partitions later changes key-to-partition mapping and breaks per-key ordering.
- Metadata DB: Cassandra cluster sized for write-heavy workload, thousands of writes/sec; provision for compaction and read hotspots.
- Storage: S3 capacity ≈ daily event count × average size (e.g., 10k events/sec × 86,400 sec/day × 5 KB ≈ 4 TB/day raw); lifecycle policies move data to cheaper tiers after the retention window.
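As a quick sanity check on these figures, a back-of-envelope sizing script; all inputs are the assumed baseline numbers above, not measurements:

```python
# Back-of-envelope sizing for the baseline in section 4.
STEADY_EPS, SPIKE_EPS = 10_000, 100_000   # events/sec
AVG_EVENT_KB = 5                          # midpoint of the 1-10 KB range
PARTITION_MBPS = 10                       # conservative per-partition limit

steady_mbps = STEADY_EPS * AVG_EVENT_KB / 1024
spike_mbps = SPIKE_EPS * AVG_EVENT_KB / 1024
print(f"steady: {steady_mbps:.0f} MB/s -> ~{steady_mbps / PARTITION_MBPS:.0f} partitions")
print(f"spike:  {spike_mbps:.0f} MB/s -> ~{spike_mbps / PARTITION_MBPS:.0f} partitions")

daily_tb = STEADY_EPS * AVG_EVENT_KB * 86_400 / 1024**3
print(f"raw archive: ~{daily_tb:.1f} TB/day before compression")
```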
5. Data Model
Primary metadata (Cassandra):
- Table: onboarding_by_id
  - Partition key: onboarding_id
  - Clustering: none
  - Columns: status, user_id, created_at, updated_at, stage_timings (map), summary_fields (json/blob)
- Table: onboarding_by_user
  - Partition key: user_id
  - Clustering: created_at DESC, onboarding_id
  - Columns: onboarding_id, status, summary_fields
Design notes:
- Keep metadata narrow and query-friendly (precomputed fields) for low-latency reads.
- Store blobs/large JSON in S3 with references (S3 path + checksum) in Cassandra.
- Use TTLs or cleanup jobs if retention requires automated deletion.
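As a concrete sketch of the two tables above, here is illustrative CQL issued through the DataStax Python driver; the keyspace name, replication settings, and column types are assumptions, not fixed requirements:

```python
# Illustrative CQL for the metadata tables, via the DataStax Python driver.
# Keyspace name, replication settings, and column types are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # assumed contact point

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS onboarding
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_id (
        onboarding_id uuid PRIMARY KEY,
        status text,
        user_id uuid,
        created_at timestamp,
        updated_at timestamp,
        stage_timings map<text, bigint>,  -- stage name -> duration in ms
        summary_fields text               -- small JSON; large payloads live in S3
    )
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_user (
        user_id uuid,
        created_at timestamp,
        onboarding_id uuid,
        status text,
        summary_fields text,
        PRIMARY KEY ((user_id), created_at, onboarding_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, onboarding_id ASC)
""")
```

Clustering by created_at DESC makes "latest onboardings for a user" a single-partition, already-sorted read.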
6. API Design
Ingestion API (Onboarding Service behind API Gateway):
- POST /onboardings
- Body: minimal required fields (user_id, onboarding_payload)
- Response: 202 Accepted with onboarding_id
- Behavior: perform lightweight validation and publish the event to the Kafka topic "onboardings-raw" (a minimal sketch follows).
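A sketch of that behavior, assuming FastAPI and the confluent-kafka client; the framework choice, broker address, and field names are illustrative:

```python
# Sketch of the ingestion endpoint: lightweight validation, publish to Kafka,
# return 202 with a generated onboarding_id. FastAPI and the broker address
# are assumptions for illustration.
import json
import uuid

from confluent_kafka import Producer
from fastapi import FastAPI, HTTPException

app = FastAPI()
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed brokers

@app.post("/onboardings", status_code=202)
def create_onboarding(body: dict):
    # Phase-1 validation: only check required fields; deep validation is async.
    if "user_id" not in body or "onboarding_payload" not in body:
        raise HTTPException(status_code=400, detail="user_id and onboarding_payload required")
    onboarding_id = str(uuid.uuid4())
    event = {"onboarding_id": onboarding_id, **body}
    # Key by user_id so all events for a user land on one partition (ordering).
    producer.produce("onboardings-raw", key=body["user_id"], value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking the request
    return {"onboarding_id": onboarding_id}
```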
Query API (Query Service):
- GET /onboardings/{onboarding_id}
- Returns metadata from Cassandra.
- GET /users/{user_id}/onboardings?status=...&limit=...&cursor=...
- Supports filtering, pagination, and sorting.
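Cursor-based pagination maps naturally onto the Cassandra driver's paging state; a sketch, assuming the onboarding_by_user table from section 5 (the base64 cursor encoding is an illustrative choice):

```python
# Sketch: cursor-based pagination over onboarding_by_user, using the
# driver's paging state as the opaque cursor. Keyspace and defaults are
# assumptions; base64 lets the cursor travel in a query string.
import base64
import uuid

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("onboarding")

def list_onboardings(user_id: str, limit: int = 50, cursor: str | None = None):
    stmt = SimpleStatement(
        "SELECT onboarding_id, status, created_at "
        "FROM onboarding_by_user WHERE user_id = %s",
        fetch_size=limit,
    )
    paging_state = base64.urlsafe_b64decode(cursor) if cursor else None
    result = session.execute(stmt, (uuid.UUID(user_id),), paging_state=paging_state)
    items = [row._asdict() for row in result.current_rows]
    next_cursor = (
        base64.urlsafe_b64encode(result.paging_state).decode()
        if result.paging_state else None
    )
    return {"items": items, "cursor": next_cursor}
```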
Operational APIs:
- GET /metrics, /health, /traces for observability integration.
Security:
- Auth: bearer tokens or mTLS; rate limiting enforced at gateway; input sanitization at ingestion service.
7. High-Level Architecture
Use a decoupled pipeline:
- API Gateway / Onboarding Service
- Accept requests, perform lightweight validation, rate-limit, publish to Kafka, and return an ack.
- Kafka (Message Broker)
- Durable, partitioned topics to buffer and order events.
- Spark Streaming (or similar stream processor)
- Consume from Kafka, perform detailed validation, enrichment, stateful processing, and aggregation.
- Output: (a) write metadata updates to Cassandra, (b) write processed/aggregated files to S3, (c) emit metrics to the Monitoring Service (a streaming sketch follows the diagram below).
- Cassandra (Metadata DB)
- Store queryable metadata for low-latency reads.
- S3 (Long-term Storage)
- Archive raw and processed payloads, compressed and partitioned by date.
- Query Service
- API layer that reads from Cassandra to answer external queries.
- Monitoring & Observability
- Export metrics, logs, and traces from all components (e.g., Prometheus/Grafana or Datadog); alert on consumer lag, error rates, and throughput anomalies.
Diagram (conceptual): Onboarding API -> Kafka -> Spark Streaming -> {Cassandra, S3, Monitoring}; Query API reads from Cassandra.
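A minimal sketch of the streaming stage, using PySpark Structured Streaming with a windowed aggregation; the schema, topic, and checkpoint path are assumptions, and the console sink stands in for the Cassandra/S3/metrics sinks:

```python
# Sketch of the stream-processing stage: read from Kafka, parse JSON, and
# maintain per-status counts over 1-minute windows. Schema fields, topic,
# and checkpoint path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("onboarding-stream").getOrCreate()

schema = StructType([
    StructField("onboarding_id", StringType()),
    StructField("user_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "onboardings-raw")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Near-real-time aggregation: counts per status per 1-minute window,
# tolerating 30 seconds of late-arriving data.
counts = (
    events.withWatermark("event_time", "30 seconds")
    .groupBy(window("event_time", "1 minute"), "status")
    .count()
)

query = (
    counts.writeStream.outputMode("update")
    .format("console")  # in production: sinks for Cassandra / S3 / metrics
    .option("checkpointLocation", "/tmp/onboarding-checkpoints")  # assumed path
    .start()
)
query.awaitTermination()
```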
8. Detailed Design Decisions
Why Kafka?
- Durability, partitioning, ordered delivery per partition, replay semantics for reprocessing.
Why Spark Streaming?
- Handles large-scale stream transformations, stateful operations, windowed aggregations, and integration with Kafka and S3.
Why Cassandra?
- Write-optimized, predictable low-latency reads for pre-modeled query patterns and horizontal scale.
Why S3?
- Cheap, durable archive storage with lifecycle management and easy integration for batch analytics.
Consistency model:
- Accept eventual consistency for read-after-write (a query immediately after ingestion may not yet reflect the event) in exchange for availability at scale.
- Use idempotent writes keyed on a unique onboarding_id (UUID) to make at-least-once delivery safe; if exactly-once is required, enable Kafka transactions (EOS) plus Spark checkpointing, and keep the Cassandra writes idempotent (sketched below).
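A sketch of the idempotent-write idea against the onboarding_by_id table from section 5 (contact point and keyspace are assumptions):

```python
# Minimal sketch of an idempotent metadata upsert. Because a Cassandra
# INSERT is an upsert on the primary key, replaying the same event after
# a consumer restart converges to the same row instead of duplicating it.
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("onboarding")  # assumed keyspace
upsert = session.prepare(
    "INSERT INTO onboarding_by_id (onboarding_id, user_id, status, updated_at) "
    "VALUES (?, ?, ?, ?)"
)

def apply_event(event: dict) -> None:
    # Re-running this for a replayed event rewrites identical values
    # instead of creating a duplicate row, so reprocessing is safe.
    session.execute(upsert, (
        uuid.UUID(event["onboarding_id"]),
        uuid.UUID(event["user_id"]),
        event["status"],
        event["updated_at"],  # the event's own timestamp (a datetime)
    ))
```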
Validation pipeline:
- Two phases: (1) lightweight at ingestion; (2) deep validation in stream processing with enrichment and corrective actions.
Backpressure & flow control:
- Rate limiting at API gateway, Kafka as buffer, consumer autoscaling and lag monitoring to adjust consumers/partitions.
9. Bottlenecks & Scaling
Potential bottlenecks:
- Kafka broker throughput and partition count limits. Mitigation: scale brokers, increase partitions, tune ISR and batch sizes.
- Spark Streaming processing lag during spikes. Mitigation: autoscale executors, use efficient serialization, tune micro-batch sizes, or use Structured Streaming with checkpointing.
- Cassandra hotspots on partition keys (e.g., time-series writes to same partition). Mitigation: design partition keys to distribute load, use bucketing, and provision nodes for expected write rate.
- S3 request latencies for many small files. Mitigation: buffer and aggregate small events into larger files before upload (Parquet + compression; see the sketch after this list).
- Network and disk IO on storage nodes. Mitigation: provision appropriate instance types, use SSDs, enable rack-awareness.
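A sketch of the small-file mitigation: accumulate events in memory and flush each batch as one compressed Parquet object (bucket name, key layout, and flush threshold are assumptions):

```python
# Sketch: coalescing many small events into one compressed Parquet object
# before uploading to S3, to avoid per-event PUTs. Bucket, key layout, and
# batch size are assumptions; the caller accumulates up to BATCH_SIZE events.
import io
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BATCH_SIZE = 10_000  # assumed flush threshold

def flush_batch(events: list[dict]) -> None:
    # Columnar batch -> one compressed Parquet file.
    table = pa.Table.from_pylist(events)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")
    key = datetime.now(timezone.utc).strftime(
        "onboardings/raw/dt=%Y-%m-%d/%H%M%S.parquet"  # date-partitioned layout
    )
    s3.put_object(Bucket="onboarding-archive", Key=key, Body=buf.getvalue())
```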
Operational considerations:
- Monitor consumer lag, Kafka under-replicated partitions, Spark job failures, Cassandra compaction/GC pauses, S3 upload errors.
- Implement dead-letter topics for poison messages and replay strategies.
10. Follow-up Questions / Extensions
- How to add exactly-once processing guarantees?
- How to support schema evolution of onboarding payloads?
- How to implement cross-region replication for disaster recovery?
- How to provide per-tenant rate limiting and multi-tenancy isolation?
- How to reduce cost for long-term retention while keeping queryability for recent data?
- How to support ad-hoc analytics (run Spark batch jobs on archived S3 data)?
11. Wrap-up
This design uses a decoupled streaming pipeline with Kafka for durable buffering, Spark Streaming for enrichment and aggregation, Cassandra for fast metadata queries, and S3 for durable archival. The key trade-offs are eventual consistency in exchange for high availability, and the operational complexity of a streaming pipeline versus the strong guarantees of tightly coupled transactional systems. Observability and careful capacity planning are critical to reliability at scale.