
Distributed Logged System

Difficulty: Hard · Duration: 20 min · Tags: distributed-systems, stream-processing, kafka, spark, cassandra, s3, observability, api-design
Asked at: Amazon

Design a scalable, durable distributed onboarding/logging pipeline with real-time aggregation, queryable metadata, and end-to-end observability.

1. The Question

Design a distributed logged system to ingest onboarding events (or logs) at high throughput, validate and transform them asynchronously, provide near-real-time aggregated metrics, store queryable onboarding metadata for low-latency reads, archive raw data for audits, and expose an API for querying metadata. The system must be durable, highly available, scalable, and observable.

2. Clarifying Questions

  • What is an "onboarding" event shape and typical size?
  • Expected peak and steady-state throughput (events/sec and bytes/sec)?
  • Ordering guarantees required (per-user, per-entity)?
  • Acceptable end-to-end latency for "real-time" metrics?
  • Query patterns for metadata (by onboarding ID, user ID, time window)?
  • Retention / compliance requirements for raw data archives?
  • SLOs / durability guarantees (at-least-once, exactly-once)?
  • Allowed tech stack constraints (e.g., Kafka, Spark, Cassandra are available)?

3. Requirements

Functional:

  • Ingest onboarding events via an API gateway.
  • Lightweight immediate validation and fast client ACK.
  • Asynchronous processing for detailed validation, enrichment, and aggregation.
  • Store queryable onboarding metadata for low-latency reads.
  • Durable archival of raw/processed payloads.
  • Expose an API for querying metadata with filtering and pagination.

Non-functional:

  • High throughput and horizontal scalability.
  • No data loss (durability) and reasonable consistency for metadata.
  • Near-real-time aggregation (seconds to sub-minute latency).
  • Observability: metrics, logs, traces, and alerts.
  • Cost-effective long-term storage.

4. Scale Estimates

  • Example baseline: 10k events/sec steady, 100k events/sec spikes.
  • Average event size: 1–10 KB (JSON) → 10–100 MB/sec steady, 0.1–1 GB/sec during spikes.
  • Partitions: size Kafka topics so per-partition throughput stays within safe limits (e.g., ~10 MB/s per partition) → on the order of 10 partitions at steady state and up to ~100, plus headroom, to absorb spikes.
  • Metadata DB: Cassandra cluster sized for write-heavy workload, thousands of writes/sec; provision for compaction and read hotspots.
  • Storage: required S3 capacity ≈ daily event count × average event size (roughly 1–9 TB/day of raw data at the baseline above); apply lifecycle policies to move objects to cheaper tiers after the retention window. A worked example follows this list.
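
To make these estimates concrete, here is a quick back-of-envelope sketch using the assumed baseline above; the 5 KB average event size and the ~10 MB/s per-partition target are illustrative choices, not requirements.

```python
# Back-of-envelope capacity math for the baseline numbers above.
# All constants are assumptions from this section, not measured values.

STEADY_EPS = 10_000          # events/sec, steady state
PEAK_EPS = 100_000           # events/sec, during spikes
AVG_EVENT_BYTES = 5 * 1024   # assume ~5 KB average (stated range is 1-10 KB)
PARTITION_MBPS = 10          # conservative per-partition throughput target

steady_mbps = STEADY_EPS * AVG_EVENT_BYTES / 1e6
peak_mbps = PEAK_EPS * AVG_EVENT_BYTES / 1e6
partitions_needed = peak_mbps / PARTITION_MBPS
daily_tb = STEADY_EPS * AVG_EVENT_BYTES * 86_400 / 1e12

print(f"steady ingest : {steady_mbps:,.0f} MB/s")
print(f"peak ingest   : {peak_mbps:,.0f} MB/s")
print(f"partitions    : ~{partitions_needed:,.0f} at peak (plus headroom)")
print(f"raw S3 volume : ~{daily_tb:.1f} TB/day before compression")
```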

5. Data Model

Primary metadata (Cassandra):

  • Table: onboarding_by_id

    • Partition key: onboarding_id
    • Clustering: none
    • Columns: status, user_id, created_at, updated_at, stage_timings (map), summary_fields (json/blob)
  • Table: onboarding_by_user

    • Partition key: user_id
    • Clustering: created_at DESC, onboarding_id
    • Columns: onboarding_id, status, summary_fields

Design notes:

  • Keep metadata narrow and query-friendly (precomputed fields) for low-latency reads.
  • Store blobs/large JSON in S3 with references (S3 path + checksum) in Cassandra.
  • Use TTLs or cleanup jobs if retention requires automated deletion.
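
A minimal schema sketch for the two tables above, written as CQL executed through the Python driver; the keyspace name, replication settings, and exact column types are assumptions rather than part of the design.

```python
# Minimal schema sketch for the metadata tables described above.
# Keyspace name, replication factor, and column types are assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-1", "cassandra-2"])  # hypothetical contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS onboarding
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_id (
        onboarding_id uuid PRIMARY KEY,
        user_id        text,
        status         text,
        created_at     timestamp,
        updated_at     timestamp,
        stage_timings  map<text, bigint>,
        summary_fields text,   -- serialized JSON; large payloads live in S3
        raw_s3_path    text,   -- reference to the archived object
        raw_checksum   text
    )
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS onboarding.onboarding_by_user (
        user_id        text,
        created_at     timestamp,
        onboarding_id  uuid,
        status         text,
        summary_fields text,
        PRIMARY KEY ((user_id), created_at, onboarding_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, onboarding_id ASC)
""")
```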

6. API Design

Ingestion API (Onboarding Service behind API Gateway):

  • POST /onboardings
    • Body: minimal required fields (user_id, onboarding_payload)
    • Response: 202 Accepted with onboarding_id
    • Behavior: do lightweight validation and publish message to Kafka topic "onboardings-raw"
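
A sketch of this ingestion path, assuming Flask and kafka-python purely for illustration (any HTTP framework and Kafka client would work); the broker addresses and the generated event fields beyond user_id and onboarding_payload are assumptions.

```python
# Sketch of the ingestion endpoint: lightweight validation, publish to Kafka,
# return 202 with the generated onboarding_id.
import json
import uuid
from flask import Flask, request, jsonify
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],   # hypothetical brokers
    acks="all",                           # wait for ISR acks for durability
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

@app.route("/onboardings", methods=["POST"])
def create_onboarding():
    body = request.get_json(silent=True) or {}
    # Phase-1 (lightweight) validation only; deep validation happens downstream.
    if "user_id" not in body or "onboarding_payload" not in body:
        return jsonify({"error": "user_id and onboarding_payload are required"}), 400

    onboarding_id = str(uuid.uuid4())
    event = {
        "onboarding_id": onboarding_id,
        "user_id": body["user_id"],
        "payload": body["onboarding_payload"],
    }
    # Keying by onboarding_id spreads load evenly; key by user_id instead if
    # per-user ordering is required.
    producer.send("onboardings-raw", key=onboarding_id, value=event)
    return jsonify({"onboarding_id": onboarding_id}), 202
```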

Query API (Query Service):

  • GET /onboardings/{onboarding_id}
    • Returns metadata from Cassandra.
  • GET /users/{user_id}/onboardings?status=...&limit=...&cursor=...
    • Supports filtering, pagination, and sorting.
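
A corresponding read-path sketch for the Query Service, using the Python Cassandra driver against the section 5 tables; encoding the driver's paging state as the API cursor is one reasonable assumption for how pagination is surfaced.

```python
# Sketch of the query service's read path against the section 5 tables.
# Contact points and cursor encoding are assumptions.
import base64
import uuid
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["cassandra-1"]).connect("onboarding")

def get_onboarding(onboarding_id):
    row = session.execute(
        "SELECT * FROM onboarding_by_id WHERE onboarding_id = %s",
        (uuid.UUID(onboarding_id),),
    ).one()
    return row._asdict() if row else None

def list_user_onboardings(user_id, limit=50, cursor=None):
    # fetch_size + paging_state gives cheap cursor-style pagination.
    stmt = SimpleStatement(
        "SELECT onboarding_id, status, created_at, summary_fields "
        "FROM onboarding_by_user WHERE user_id = %s",
        fetch_size=limit,
    )
    paging_state = base64.b64decode(cursor) if cursor else None
    result = session.execute(stmt, (user_id,), paging_state=paging_state)
    page = [r._asdict() for r in result.current_rows]
    next_cursor = (
        base64.b64encode(result.paging_state).decode("ascii")
        if result.paging_state else None
    )
    return page, next_cursor
```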

Operational APIs:

  • GET /metrics and GET /health on each service for metrics scraping and health checks; traces are exported to the tracing backend rather than served over HTTP.

Security:

  • Auth: bearer tokens or mTLS; rate limiting enforced at gateway; input sanitization at ingestion service.
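
Rate limiting is usually configured in the gateway itself; the token-bucket sketch below only illustrates the mechanism (the per-client limits and constants are assumptions).

```python
# Minimal token-bucket limiter sketch, e.g. one bucket per API key, standing in
# for the gateway's built-in rate limiting. Capacity/refill numbers are assumptions.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# e.g. 100 requests/sec with bursts up to 200 per client
limiter = TokenBucket(rate_per_sec=100, capacity=200)
```

In production this state would live in the gateway or a shared store keyed per API key or tenant, not in application memory.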

7. High-Level Architecture

Use a decoupled pipeline:

  1. API Gateway / Onboarding Service
    • Accept requests, do lightweight validation, rate-limit, push to Kafka, return ack.
  2. Kafka (Message Broker)
    • Durable, partitioned topics to buffer and order events.
  3. Spark Streaming (or similar stream processor)
    • Consume from Kafka, perform detailed validation, enrichment, stateful processing, and aggregation.
    • Output: (a) write metadata updates to Cassandra, (b) write processed/aggregated files to S3, (c) emit metrics to Monitoring Service.
  4. Cassandra (Metadata DB)
    • Store queryable metadata for low-latency reads.
  5. S3 (Long-term Storage)
    • Archive raw and processed payloads, compressed and partitioned by date.
  6. Query Service
    • API layer that reads from Cassandra to answer external queries.
  7. Monitoring & Observability
    • Export metrics, logs, and traces from all components (Prometheus/Grafana, Datadog); set alerts for lag, error rates, throughput.

Diagram (conceptual): Onboard API -> Kafka -> Spark Streaming -> {Cassandra, S3, Monitoring} -> Query API
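
A sketch of the stream-processing stage as a PySpark Structured Streaming job; the event schema, topic and bucket names, and use of the Spark Cassandra connector are illustrative assumptions.

```python
# Sketch of the stream-processing stage (PySpark Structured Streaming).
# Topic, schema, S3 paths, and connector options are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("onboarding-pipeline").getOrCreate()

event_schema = StructType([
    StructField("onboarding_id", StringType()),
    StructField("user_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-1:9092")
    .option("subscribe", "onboardings-raw")
    .load()
)

events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

def write_batch(batch_df, batch_id):
    # (a) upsert metadata into Cassandra (assumes the Spark Cassandra connector)
    (batch_df.select("onboarding_id", "user_id", "status")
        .write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="onboarding", table="onboarding_by_id")
        .mode("append").save())
    # (b) archive the processed batch to S3 as compressed Parquet,
    #     partitioned by date to avoid many small objects
    (batch_df.withColumn("dt", F.to_date("event_time"))
        .write.mode("append").partitionBy("dt")
        .parquet("s3a://onboarding-archive/processed/"))

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://onboarding-archive/checkpoints/")
    .start()
)
```

A separate windowed aggregation over events (for example, counts per status per minute via F.window) would feed the near-real-time metrics path.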

8. Detailed Design Decisions

Why Kafka?

  • Durability, partitioning, ordered delivery per partition, replay semantics for reprocessing.

Why Spark Streaming?

  • Handles large-scale stream transformations, stateful operations, windowed aggregations, and integration with Kafka and S3.

Why Cassandra?

  • Write-optimized, predictable low-latency reads for pre-modeled query patterns and horizontal scale.

Why S3?

  • Cheap, durable archive storage with lifecycle management and easy integration for batch analytics.

Consistency model:

  • Accept eventual consistency for read-after-write in exchange for availability at scale.
  • Use idempotent writes keyed by a unique onboarding_id (UUID) so that at-least-once delivery is safe to replay (see the sketch below); if exactly-once output is required, enable Kafka transactions (EOS) plus Spark checkpointing and keep the Cassandra writes idempotent.
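
As a sketch of what idempotent writes mean in practice here, assuming the section 5 tables and an event-carried updated_at timestamp: because the primary key is onboarding_id, replaying the same Kafka record rewrites the same Cassandra row instead of creating a duplicate.

```python
# Sketch of an idempotent metadata upsert: applying the same event twice
# converges to the same row, which makes at-least-once delivery safe.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("onboarding")
upsert = session.prepare(
    "INSERT INTO onboarding_by_id (onboarding_id, user_id, status, updated_at) "
    "VALUES (?, ?, ?, ?)"
)

def apply_event(event):
    session.execute(upsert, (
        uuid.UUID(event["onboarding_id"]),
        event["user_id"],
        event["status"],
        event["updated_at"],   # event-carried timestamp keeps replays deterministic
    ))
```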

Validation pipeline:

  • Two phases: (1) lightweight at ingestion; (2) deep validation in stream processing with enrichment and corrective actions.

Backpressure & flow control:

  • Rate limiting at the API gateway, Kafka as the buffer, and consumer autoscaling driven by lag monitoring (see the lag-probe sketch below) to adjust consumers and partitions.
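
A minimal lag-probe sketch, assuming kafka-python and an illustrative consumer-group name; the resulting total can feed alerts or drive consumer autoscaling.

```python
# Compares a consumer group's committed offsets with the partitions' end
# offsets; the summed lag is the backlog the group still has to process.
# Broker addresses and group name are assumptions.
from kafka import KafkaConsumer, KafkaAdminClient

BROKERS = ["kafka-1:9092"]
GROUP = "onboarding-stream-processor"

def total_lag():
    admin = KafkaAdminClient(bootstrap_servers=BROKERS)
    committed = admin.list_consumer_group_offsets(GROUP)  # {TopicPartition: OffsetAndMetadata}
    consumer = KafkaConsumer(bootstrap_servers=BROKERS)
    end_offsets = consumer.end_offsets(list(committed))   # {TopicPartition: int}
    return sum(
        max(end_offsets[tp] - meta.offset, 0) for tp, meta in committed.items()
    )

if __name__ == "__main__":
    print(f"consumer lag for {GROUP}: {total_lag()} messages")
```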

9. Bottlenecks & Scaling

Potential bottlenecks:

  • Kafka broker throughput and partition count limits. Mitigation: scale brokers, increase partitions, tune ISR and batch sizes.
  • Spark Streaming processing lag during spikes. Mitigation: autoscale executors, use efficient serialization, and tune micro-batch sizes (or use Structured Streaming with checkpointing).
  • Cassandra hotspots on partition keys (e.g., time-series writes to same partition). Mitigation: design partition keys to distribute load, use bucketing, and provision nodes for expected write rate.
  • S3 request latencies and overhead from many small files. Mitigation: buffer and aggregate small events into larger files before upload (e.g., Parquet with compression).
  • Network and disk IO on storage nodes. Mitigation: provision appropriate instance types, use SSDs, enable rack-awareness.

Operational considerations:

  • Monitor consumer lag, Kafka under-replicated partitions, Spark job failures, Cassandra compaction/GC pauses, S3 upload errors.
  • Implement dead-letter topics for poison messages and replay strategies (a minimal routing sketch follows).
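
A minimal sketch of the dead-letter pattern, assuming kafka-python; the topic names and the validate() hook are hypothetical.

```python
# Events that fail deep validation are republished to a DLQ topic with error
# context (source partition/offset) so they can be inspected and replayed.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "onboardings-raw",
    bootstrap_servers=["kafka-1:9092"],
    group_id="onboarding-validator",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def validate(event):  # hypothetical deep-validation hook
    if "onboarding_id" not in event:
        raise ValueError("missing onboarding_id")

for record in consumer:
    try:
        validate(record.value)
        # ... enrichment / downstream writes happen here ...
    except Exception as exc:
        producer.send("onboardings-dlq", {
            "original": record.value,
            "error": str(exc),
            "source_partition": record.partition,
            "source_offset": record.offset,
        })
```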

10. Follow-up Questions / Extensions

  • How to add exactly-once processing guarantees?
  • How to support schema evolution of onboarding payloads?
  • How to implement cross-region replication for disaster recovery?
  • How to provide per-tenant rate limiting and multi-tenancy isolation?
  • How to reduce cost for long-term retention while keeping queryability for recent data?
  • How to support ad-hoc analytics (run Spark batch jobs on archived S3 data)?

11. Wrap-up

This design uses a decoupled streaming pipeline with Kafka for durable buffering, Spark Streaming for enrichment and aggregation, Cassandra for fast metadata queries, and S3 for durable archival. Key trade-offs include eventual consistency for high availability and operational complexity versus the strong guarantees of tightly-coupled transactional systems. Observability and careful capacity planning are critical to ensure reliability at scale.
