1. The Question
Design a distributed logging system to ingest onboarding events (or logs) at high throughput, validate and transform them asynchronously, provide near-real-time aggregated metrics, store queryable onboarding metadata for low-latency reads, archive raw data for audits, and expose an API for querying metadata. The system must be durable, highly available, scalable, and observable.
2. Clarifying Questions
- What is an "onboarding" event shape and typical size?
- Expected peak and steady-state throughput (events/sec and bytes/sec)?
- Ordering guarantees required (per-user, per-entity)?
- Acceptable end-to-end latency for "real-time" metrics?
- Query patterns for metadata (by onboarding ID, user ID, time window)?
- Retention / compliance requirements for raw data archives?
- Required SLOs, durability, and delivery guarantees (at-least-once vs. exactly-once)?
- Allowed tech stack constraints (e.g., Kafka, Spark, Cassandra are available)?
3. Requirements
Functional:
- Ingest onboarding events via an API gateway.
- Lightweight immediate validation and fast client ACK.
- Asynchronous processing for detailed validation, enrichment, and aggregation.
- Store queryable onboarding metadata for low-latency reads.
- Durable archival of raw/processed payloads.
- Expose an API for querying metadata with filtering and pagination.
Non-functional:
- High throughput and horizontal scalability.
- No data loss (durability) and reasonable consistency for metadata.
- Near-real-time aggregation (seconds to sub-minute latency).
- Observability: metrics, logs, traces, and alerts.
- Cost-effective long-term storage.
4. Scale Estimates
- Example baseline: 10k events/sec steady, 100k events/sec spikes.
- Average event size: 1–10 KB (JSON) → 10–100 MB/sec steady, 1 GB/sec spikes.
- Partitions: size Kafka topics to keep per-partition throughput under a conservative limit (e.g., ~10 MB/s per partition) → roughly 1–10 partitions for the steady state (10–100 MB/s) and ~100 to absorb 1 GB/s spikes; provision for spikes up front, since adding partitions later changes key-to-partition mapping and breaks per-key ordering.
- Metadata DB: Cassandra cluster sized for write-heavy workload, thousands of writes/sec; provision for compaction and read hotspots.
- Storage: S3 capacity ≈ daily event count × average size (e.g., 10k events/sec × 86,400 sec/day × 5 KB ≈ 4 TB/day raw); lifecycle policies move data to cheaper tiers after the retention window.
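As a quick sanity check on these figures, a back-of-envelope sizing script; all inputs are the assumed baseline numbers above, not measurements:

```python
# Back-of-envelope sizing for the baseline in section 4.
STEADY_EPS, SPIKE_EPS = 10_000, 100_000   # events/sec
AVG_EVENT_KB = 5                          # midpoint of the 1-10 KB range
PARTITION_MBPS = 10                       # conservative per-partition limit

steady_mbps = STEADY_EPS * AVG_EVENT_KB / 1024
spike_mbps = SPIKE_EPS * AVG_EVENT_KB / 1024
print(f"steady: {steady_mbps:.0f} MB/s -> ~{steady_mbps / PARTITION_MBPS:.0f} partitions")
print(f"spike:  {spike_mbps:.0f} MB/s -> ~{spike_mbps / PARTITION_MBPS:.0f} partitions")

daily_tb = STEADY_EPS * AVG_EVENT_KB * 86_400 / 1024**3
print(f"raw archive: ~{daily_tb:.1f} TB/day before compression")
```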
5. Data Model
Primary metadata (Cassandra):
- Table: onboarding_by_id
  - Partition key: onboarding_id
  - Clustering: none
  - Columns: status, user_id, created_at, updated_at, stage_timings (map), summary_fields (json/blob)
- Table: onboarding_by_user
  - Partition key: user_id
  - Clustering: created_at DESC, onboarding_id
  - Columns: onboarding_id, status, summary_fields
Design notes:
- Keep metadata narrow and query-friendly (precomputed fields) for low-latency reads.
- Store blobs/large JSON in S3 with references (S3 path + checksum) in Cassandra.
- Use TTLs or cleanup jobs if retention requires automated deletion.
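As a concrete sketch of the two tables above, here is illustrative CQL issued through the DataStax Python driver; the keyspace name, replication settings, and column types are assumptions, not fixed requirements:

```python
# Illustrative CQL for the metadata tables, via the DataStax Python driver.
# Keyspace name, replication settings, and column types are assumptions.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # assumed contact point

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS onboarding
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_id (
        onboarding_id uuid PRIMARY KEY,
        status text,
        user_id uuid,
        created_at timestamp,
        updated_at timestamp,
        stage_timings map<text, bigint>,  -- stage name -> duration in ms
        summary_fields text               -- small JSON; large payloads live in S3
    )
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_user (
        user_id uuid,
        created_at timestamp,
        onboarding_id uuid,
        status text,
        summary_fields text,
        PRIMARY KEY ((user_id), created_at, onboarding_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, onboarding_id ASC)
""")
```

Clustering by created_at DESC makes "latest onboardings for a user" a single-partition, already-sorted read.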
6. API Design
Ingestion API (Onboarding Service behind API Gateway):
- POST /onboardings
- Body: minimal required fields (user_id, onboarding_payload)
- Response: 202 Accepted with onboarding_id
- Behavior: perform lightweight validation and publish the event to the Kafka topic "onboardings-raw" (a minimal sketch follows).
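A sketch of that behavior, assuming FastAPI and the confluent-kafka client; the framework choice, broker address, and field names are illustrative:

```python
# Sketch of the ingestion endpoint: lightweight validation, publish to Kafka,
# return 202 with a generated onboarding_id. FastAPI and the broker address
# are assumptions for illustration.
import json
import uuid

from confluent_kafka import Producer
from fastapi import FastAPI, HTTPException

app = FastAPI()
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed brokers

@app.post("/onboardings", status_code=202)
def create_onboarding(body: dict):
    # Phase-1 validation: only check required fields; deep validation is async.
    if "user_id" not in body or "onboarding_payload" not in body:
        raise HTTPException(status_code=400, detail="user_id and onboarding_payload required")
    onboarding_id = str(uuid.uuid4())
    event = {"onboarding_id": onboarding_id, **body}
    # Key by user_id so all events for a user land on one partition (ordering).
    producer.produce("onboardings-raw", key=body["user_id"], value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking the request
    return {"onboarding_id": onboarding_id}
```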
Query API (Query Service):
- GET /onboardings/{onboarding_id}
- Returns metadata from Cassandra.
- GET /users/{user_id}/onboardings?status=...&limit=...&cursor=...
- Supports filtering, pagination, and sorting.
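Cursor-based pagination maps naturally onto the Cassandra driver's paging state; a sketch, assuming the onboarding_by_user table from section 5 (the base64 cursor encoding is an illustrative choice):

```python
# Sketch: cursor-based pagination over onboarding_by_user, using the
# driver's paging state as the opaque cursor. Keyspace and defaults are
# assumptions; base64 lets the cursor travel in a query string.
import base64
import uuid

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("onboarding")

def list_onboardings(user_id: str, limit: int = 50, cursor: str | None = None):
    stmt = SimpleStatement(
        "SELECT onboarding_id, status, created_at "
        "FROM onboarding_by_user WHERE user_id = %s",
        fetch_size=limit,
    )
    paging_state = base64.urlsafe_b64decode(cursor) if cursor else None
    result = session.execute(stmt, (uuid.UUID(user_id),), paging_state=paging_state)
    items = [row._asdict() for row in result.current_rows]
    next_cursor = (
        base64.urlsafe_b64encode(result.paging_state).decode()
        if result.paging_state else None
    )
    return {"items": items, "cursor": next_cursor}
```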
Operational APIs:
- GET /metrics, /health, /traces for observability integration.
Security:
- Auth: bearer tokens or mTLS; rate limiting enforced at gateway; input sanitization at ingestion service.
7. High-Level Architecture
Use a decoupled pipeline:
- API Gateway / Onboarding Service
- Accept requests, perform lightweight validation, rate-limit, publish to Kafka, and return an ack.
- Kafka (Message Broker)
- Durable, partitioned topics to buffer and order events.
- Spark Streaming (or similar stream processor)
- Consume from Kafka, perform detailed validation, enrichment, stateful processing, and aggregation.
- Output: (a) write metadata updates to Cassandra, (b) write processed/aggregated files to S3, (c) emit metrics to the Monitoring Service (a streaming sketch follows the diagram below).
- Cassandra (Metadata DB)
- Store queryable metadata for low-latency reads.
- S3 (Long-term Storage)
- Archive raw and processed payloads, compressed and partitioned by date.
- Query Service
- API layer that reads from Cassandra to answer external queries.
- Monitoring & Observability
- Export metrics, logs, and traces from all components (e.g., Prometheus/Grafana or Datadog); alert on consumer lag, error rates, and throughput anomalies.
Diagram (conceptual): Onboarding API -> Kafka -> Spark Streaming -> {Cassandra, S3, Monitoring}; Query API reads from Cassandra.
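A minimal sketch of the streaming stage, using PySpark Structured Streaming with a windowed aggregation; the schema, topic, and checkpoint path are assumptions, and the console sink stands in for the Cassandra/S3/metrics sinks:

```python
# Sketch of the stream-processing stage: read from Kafka, parse JSON, and
# maintain per-status counts over 1-minute windows. Schema fields, topic,
# and checkpoint path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("onboarding-stream").getOrCreate()

schema = StructType([
    StructField("onboarding_id", StringType()),
    StructField("user_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "onboardings-raw")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Near-real-time aggregation: counts per status per 1-minute window,
# tolerating 30 seconds of late-arriving data.
counts = (
    events.withWatermark("event_time", "30 seconds")
    .groupBy(window("event_time", "1 minute"), "status")
    .count()
)

query = (
    counts.writeStream.outputMode("update")
    .format("console")  # in production: sinks for Cassandra / S3 / metrics
    .option("checkpointLocation", "/tmp/onboarding-checkpoints")  # assumed path
    .start()
)
query.awaitTermination()
```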
8. Detailed Design Decisions
Why Kafka?
- Durability, partitioning, ordered delivery per partition, replay semantics for reprocessing.
Why Spark Streaming?
- Handles large-scale stream transformations, stateful operations, windowed aggregations, and integration with Kafka and S3.
Why Cassandra?
- Write-optimized, predictable low-latency reads for pre-modeled query patterns and horizontal scale.
Why S3?
- Cheap, durable archive storage with lifecycle management and easy integration for batch analytics.
Consistency model:
- Accept eventual consistency for read-after-write (a query immediately after ingestion may not yet reflect the event) in exchange for availability at scale.
- Use idempotent writes keyed on a unique onboarding_id (UUID) to make at-least-once delivery safe; if exactly-once is required, enable Kafka transactions (EOS) plus Spark checkpointing, and keep the Cassandra writes idempotent (sketched below).
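A sketch of the idempotent-write idea against the onboarding_by_id table from section 5 (contact point and keyspace are assumptions):

```python
# Minimal sketch of an idempotent metadata upsert. Because a Cassandra
# INSERT is an upsert on the primary key, replaying the same event after
# a consumer restart converges to the same row instead of duplicating it.
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("onboarding")  # assumed keyspace
upsert = session.prepare(
    "INSERT INTO onboarding_by_id (onboarding_id, user_id, status, updated_at) "
    "VALUES (?, ?, ?, ?)"
)

def apply_event(event: dict) -> None:
    # Re-running this for a replayed event rewrites identical values
    # instead of creating a duplicate row, so reprocessing is safe.
    session.execute(upsert, (
        uuid.UUID(event["onboarding_id"]),
        uuid.UUID(event["user_id"]),
        event["status"],
        event["updated_at"],  # the event's own timestamp (a datetime)
    ))
```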
Validation pipeline:
- Two phases: (1) lightweight at ingestion; (2) deep validation in stream processing with enrichment and corrective actions.
Backpressure & flow control:
- Rate limiting at API gateway, Kafka as buffer, consumer autoscaling and lag monitoring to adjust consumers/partitions.
9. Bottlenecks & Scaling
Potential bottlenecks:
- Kafka broker throughput and partition count limits. Mitigation: scale brokers, increase partitions, tune ISR and batch sizes.
- Spark Streaming processing lag during spikes. Mitigation: autoscale executors, use efficient serialization, tune micro-batch sizes, or use Structured Streaming with checkpointing.
- Cassandra hotspots on partition keys (e.g., time-series writes to same partition). Mitigation: design partition keys to distribute load, use bucketing, and provision nodes for expected write rate.
- S3 request latencies for many small files. Mitigation: buffer and aggregate small events into larger files before upload (Parquet + compression; see the sketch after this list).
- Network and disk IO on storage nodes. Mitigation: provision appropriate instance types, use SSDs, enable rack-awareness.
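A sketch of the small-file mitigation: accumulate events in memory and flush each batch as one compressed Parquet object (bucket name, key layout, and flush threshold are assumptions):

```python
# Sketch: coalescing many small events into one compressed Parquet object
# before uploading to S3, to avoid per-event PUTs. Bucket, key layout, and
# batch size are assumptions; the caller accumulates up to BATCH_SIZE events.
import io
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BATCH_SIZE = 10_000  # assumed flush threshold

def flush_batch(events: list[dict]) -> None:
    # Columnar batch -> one compressed Parquet file.
    table = pa.Table.from_pylist(events)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")
    key = datetime.now(timezone.utc).strftime(
        "onboardings/raw/dt=%Y-%m-%d/%H%M%S.parquet"  # date-partitioned layout
    )
    s3.put_object(Bucket="onboarding-archive", Key=key, Body=buf.getvalue())
```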
Operational considerations:
- Monitor consumer lag, Kafka under-replicated partitions, Spark job failures, Cassandra compaction/GC pauses, S3 upload errors.
- Implement dead-letter topics for poison messages and replay strategies.
10. Follow-up Questions / Extensions
- How to add exactly-once processing guarantees?
- How to support schema evolution of onboarding payloads?
- How to implement cross-region replication for disaster recovery?
- How to provide per-tenant rate limiting and multi-tenancy isolation?
- How to reduce cost for long-term retention while keeping queryability for recent data?
- How to support ad-hoc analytics (run Spark batch jobs on archived S3 data)?
11. Wrap-up
This design uses a decoupled streaming pipeline with Kafka for durable buffering, Spark Streaming for enrichment and aggregation, Cassandra for fast metadata queries, and S3 for durable archival. The key trade-offs are eventual consistency in exchange for high availability, and the operational complexity of a streaming pipeline versus the strong guarantees of tightly coupled transactional systems. Observability and careful capacity planning are critical to reliability at scale.