
Large-scale ticketing system design

Difficulty: Hard · 18 min · Topics: scalability, microservices, caching, consistency, availability, event-driven, databases
Asked at: Amazon

Design a scalable ticketing platform for peak events that supports real-time bookings, high throughput, low latency, and transactional consistency.

1. The Question

Design a large-scale ticketing system capable of handling millions of users during peak events (e.g., Black Friday). The system should support event discovery, seat selection, real-time booking, secure payments, promo codes, mobile QR validation, and admin analytics while prioritizing availability and transactional correctness for purchases.

2. Clarifying Questions

Key clarifying questions to ask the interviewer:

  • Are tickets reserved per seat (assigned seating), or is it general admission?
  • Is the system global (multi-region) with regional ticket pools, or single-region?
  • Are refunds and hold/preview flows required?
  • What latency SLAs are expected (e.g., <2s for reads, <5s for payment confirmation)?
  • Are there third-party constraints (payment gateways, KYC, anti-fraud)?
  • How strict must consistency be for inventory? Can we allow eventual consistency for reads but require strong consistency for final purchases?

3. Requirements

Functional requirements:

  • Event discovery and search (low-latency reads).
  • Real-time seat map and selection (for assigned seating).
  • Temporary holds (seat reservation windows) and final purchase.
  • Secure payments and payment status tracking.
  • Promo codes, discounts, and limits per user.
  • Mobile QR code generation for entry.
  • Admin dashboards and analytics.

Non-functional requirements:

  • High availability during peaks (Black Friday scale).
  • Low read latency; high write throughput for purchases.
  • Strong correctness for inventory (prevent oversell).
  • Horizontal scalability and multi-region support.
  • Robust security, fraud detection, and idempotent operations.

4. Scale Estimates

Example peak estimates (adjust to interviewer-provided numbers):

  • Peak concurrent users browsing: 1,000,000.
  • Peak checkout attempts per second: 5,000–20,000 TPS (depending on event).
  • Average event size: 10k–100k seats.
  • Orders per day during event: 10M+.
  • Read-heavy workloads: search and seat map reads ~90% of traffic.
  • Write-heavy hotspots: inventory and order writes concentrated on popular events.

Design decisions should be validated with the above targets and allow easy re-sharding if higher throughput is required.
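
For example, a quick back-of-envelope pass (a sketch only; the per-partition and per-node write limits below are assumptions used to illustrate the method, not vendor figures) shows how the 20,000 TPS checkout peak drives partition and node counts:

```python
import math

# Back-of-envelope sizing for the peak targets above. The write limits
# are assumptions for illustration; swap in the interviewer's figures.
PEAK_CHECKOUT_TPS = 20_000      # upper end of the checkout estimate
WRITES_PER_CHECKOUT = 3         # inventory update + order insert + payment row (assumed)
HEADROOM = 2                    # 2x buffer for spikes and retries
PARTITION_WRITE_LIMIT = 1_000   # assumed safe writes/sec per inventory partition
DB_NODE_WRITE_LIMIT = 5_000     # assumed sustained writes/sec per order-DB node

peak_writes = PEAK_CHECKOUT_TPS * WRITES_PER_CHECKOUT * HEADROOM
inventory_partitions = math.ceil(PEAK_CHECKOUT_TPS * HEADROOM / PARTITION_WRITE_LIMIT)
order_db_nodes = math.ceil(peak_writes / DB_NODE_WRITE_LIMIT)

print(f"peak write volume (with headroom): {peak_writes:,}/s")
print(f"inventory partitions needed:       {inventory_partitions}")
print(f"order-DB write nodes needed:       {order_db_nodes}")
```

The exact limits vary by datastore and instance size; the point is to show the interviewer how the targets map onto a sharding plan and where headroom is built in.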

5. Data Model

Core entities (simplified):

  • Event
    • id, name, description, start_time, end_time, venue_id, metadata
  • Venue
    • id, name, address, seating_layout
  • Seat
    • id, venue_id, section, row, number, price_band
  • InventoryUnit (for sharded inventory tracking)
    • id, event_id, seat_id (nullable for GA), status {available, held, sold}, hold_expiry, version
  • Order
    • id, user_id, event_id, items[{seat_id, price}], total_amount, status {initiated, paid, failed, cancelled}, created_at
  • Payment
    • id, order_id, gateway_id, status, auth_token, timestamp
  • PromoCode
    • code, discount_type, constraints, usage_count
  • User
    • id, name, contact, payment_methods_ref

Notes:

  • Use compact representations for hot inventory (InventoryUnit) and partition by event_id or event_id+section for sharding.
  • Keep the write model minimal and fast; store audit/event logs in an append-only event store (Kafka) for history and analytics.
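
To make the hot-path entities concrete, here is a minimal Python sketch of InventoryUnit and Order (field names mirror the list above; the types are illustrative assumptions). The version field on InventoryUnit supports the optimistic locking discussed in section 8.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class InventoryStatus(str, Enum):
    AVAILABLE = "available"
    HELD = "held"
    SOLD = "sold"

class OrderStatus(str, Enum):
    INITIATED = "initiated"
    PAID = "paid"
    FAILED = "failed"
    CANCELLED = "cancelled"

@dataclass
class InventoryUnit:
    id: str
    event_id: str                      # partition key: event_id (plus section for large venues)
    seat_id: Optional[str]             # None for general admission
    status: InventoryStatus = InventoryStatus.AVAILABLE
    hold_expiry: Optional[datetime] = None
    version: int = 0                   # bumped on every state transition (optimistic lock)

@dataclass
class OrderItem:
    seat_id: str
    price: int                         # money in minor units (e.g., cents)

@dataclass
class Order:
    id: str
    user_id: str
    event_id: str
    items: list[OrderItem] = field(default_factory=list)
    total_amount: int = 0
    status: OrderStatus = OrderStatus.INITIATED
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```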

6. API Design

Public API endpoints (HTTP/REST or gRPC behind an API Gateway):

  • GET /events?query=... — search events (paginated)
  • GET /events/{id}/seats — seat map and current availability (cached, near-real-time)
  • POST /events/{id}/hold — place a short-term hold on selected seats (returns hold_id, TTL)
    • body: { user_id, seats: [seat_id], client_token }
  • POST /orders — create order from hold and begin payment
    • body: { hold_id, user_id, payment_method, promo_code }
  • POST /payments/webhook — payment gateway status updates (idempotent)
  • GET /orders/{id} — get order status and QR code once paid
  • POST /orders/{id}/cancel — cancel order and release inventory

API design notes:

  • Use idempotency tokens on POST /orders and payment webhooks to avoid duplicate orders and double charges (see the sketch below).
  • Enforce rate limits per IP/account and require authenticated requests for critical endpoints.
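
To illustrate the idempotency note above, here is a minimal sketch of how the Order Service might dedupe retried POST /orders calls by idempotency key. The in-memory dict stands in for Redis or a uniquely keyed orders table, and the field names are illustrative, not a fixed contract.

```python
import uuid

# In production this would be Redis or the orders table itself with a
# unique constraint on the idempotency key; a dict keeps the sketch small.
_idempotency_store: dict[str, dict] = {}

def create_order(hold_id: str, user_id: str, payment_method: str,
                 idempotency_key: str) -> dict:
    """Handle POST /orders so that retries with the same key return the
    same order instead of creating (and charging) a duplicate."""
    existing = _idempotency_store.get(idempotency_key)
    if existing is not None:
        return existing                      # retry: return the original result

    order = {
        "order_id": str(uuid.uuid4()),
        "hold_id": hold_id,
        "user_id": user_id,
        "payment_method": payment_method,
        "status": "initiated",
    }
    # Persist the order and the key in one transaction in a real system;
    # a crash between the two writes would reopen the duplicate window.
    _idempotency_store[idempotency_key] = order
    return order
```

Many payment gateways also accept an idempotency key, so forwarding the same key downstream keeps retried charges from authorizing twice.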

7. High-Level Architecture

High-level components:

  • Edge: CDN + WAF + API Gateway to absorb traffic spikes.
  • Load Balancers -> Stateless API Services (microservices):
    • Discovery Service (read-heavy): search, event listing (cached via CDN/Redis).
    • Seat/Inventory Service: holds, finalizes inventory (strong consistency required).
    • Order Service: creates orders, coordinates payment.
    • Payment Service: interacts with external payment gateways (idempotent, retry safe).
    • Promo Service: validates and applies promo codes.
    • QR & Mobile Service: generates QR codes and validates at entry.
    • Admin & Analytics Service: aggregates metrics from event streams.
  • Cache Layer: Redis cluster for hold tracking, hot reads (seat maps, pricing).
  • Message Bus: Kafka for async workflows (order events, analytics, notification, reconciliation).
  • Datastores:
    • Sharded NoSQL (Cassandra/DynamoDB) for inventory and high-write datasets.
    • Relational DB (Aurora/Postgres) for orders/payments and other strong transactional needs, or a single-purpose transactional DB per service.
    • Search index (Elasticsearch/OpenSearch) for event search.
  • Background workers: reconcile payments, expired holds, fraud checks, email/SMS notifications.
  • Monitoring/Observability: distributed tracing, metrics (Prometheus), logs (ELK), dashboards (Grafana).

Design patterns:

  • CQRS: separate read-optimized services (fast, cached) from write/transactional paths.
  • Saga pattern: manage distributed transactions across services for orders/payments/inventory (see the sketch after this list).
  • Event sourcing for audit and replay via Kafka.
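
As an illustration of the Saga pattern applied to checkout (the concrete steps are spelled out in section 8: reserve inventory, create the order, charge the gateway, then finalize or compensate), here is a minimal, hypothetical orchestrator sketch. The injected callables stand in for the Inventory, Order and Payment services; none of them are real APIs.

```python
from typing import Callable

def checkout_saga(reserve_seats: Callable[[], bool],
                  create_order: Callable[[], str],
                  charge_payment: Callable[[str], bool],
                  finalize_seats: Callable[[], None],
                  mark_paid: Callable[[str], None],
                  release_seats: Callable[[], None],
                  mark_failed: Callable[[str], None]) -> bool:
    """Orchestrated saga: each step either advances the checkout or
    triggers the compensating actions for the steps already done."""
    if not reserve_seats():                  # step 1: hold inventory
        return False

    order_id = create_order()                # step 2: local order record

    try:
        paid = charge_payment(order_id)      # step 3: external payment gateway
    except Exception:
        paid = False

    if paid:
        finalize_seats()                     # step 4a: held -> sold
        mark_paid(order_id)
        return True

    # Compensating transactions for any failure after the reservation.
    release_seats()                          # step 4b: held -> available
    mark_failed(order_id)
    return False
```

A production orchestrator would persist saga state and drive retries from the event log so an interrupted checkout can resume, but the happy path and the compensating actions have this shape.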

8. Detailed Design Decisions

Key design details and trade-offs:

  • Inventory correctness: Use a primary write path in Inventory Service that enforces atomic state transitions (available -> held -> sold). Implement optimistic locking (version numbers) or use a single-partition leader for a seat shard to serialize updates.

  • Holds: Implement short-TTL holds in Redis with a server-side lease, and use Kafka events to persist hold state asynchronously to durable storage. If a hold expires, a background worker releases the inventory (a minimal hold-and-finalize sketch appears after this list).

  • Prevent oversell: For assigned seats, map each seat to a consistent shard and ensure the Inventory Service serializes modifications for that shard. For GA tickets, maintain counters in a distributed counter system (e.g., sharded counters in DynamoDB/Cassandra) with compare-and-swap or token-bucket issuance.

  • Payments & consistency: Use a Saga orchestrator. Steps: reserve inventory (local DB transaction), create order (local), call payment gateway (external), on success mark order paid and finalize inventory; on failure, release inventory. Use idempotency and compensating transactions for partial failures.

  • Caching: Cache seat maps and pricing aggressively. Use near-real-time invalidation via Kafka events when inventory changes. Accept eventual consistency for browsing; require authoritative check at hold/purchase.

  • Scalability: Partition by event_id, and for large venues further by section/row. Use autoscaling groups and scale Kafka partitions and DB capacity as load grows.

  • Anti-fraud & bots: Use rate limiting, CAPTCHA on suspicious flows, account verification for purchases, device fingerprinting, and payment gateway risk scoring. For extremely high demand, admit users progressively through a virtual waiting room (queueing).

  • Latency: Keep user-facing services stateless; push long-running tasks to background workers. Trim the checkout path to the minimum number of RPCs, making synchronous calls only where correctness requires them.
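
Below is a minimal sketch of the hold-and-finalize flow from the Holds and Prevent-oversell bullets above. It assumes a redis-py client for the short-TTL hold and a DB-API style cursor over an assumed inventory_unit table for the version-checked transition from held to sold; both are illustrative, not a prescribed schema.

```python
import redis

r = redis.Redis()          # hold tracking; assumed redis-py client
HOLD_TTL_SECONDS = 300     # 5-minute reservation window (assumed)

def try_hold_seat(event_id: str, seat_id: str, user_id: str) -> bool:
    """Place a short-lived hold: SET with NX and EX is atomic, so only
    one user can hold a given seat; expiry releases it automatically."""
    key = f"hold:{event_id}:{seat_id}"
    return bool(r.set(key, user_id, nx=True, ex=HOLD_TTL_SECONDS))

def finalize_seat(cursor, event_id: str, seat_id: str, expected_version: int) -> bool:
    """Authoritative held -> sold transition with optimistic locking:
    the UPDATE only succeeds if nobody else changed the row since we
    read it, which is what prevents oversell on the write path."""
    cursor.execute(
        """
        UPDATE inventory_unit
           SET status = 'sold', version = version + 1
         WHERE event_id = %s AND seat_id = %s
           AND status = 'held' AND version = %s
        """,
        (event_id, seat_id, expected_version),
    )
    return cursor.rowcount == 1
```

Because the Redis SET is a single atomic command, two users racing for the same seat cannot both acquire the hold, and the version check on the final UPDATE means a stale writer simply sees rowcount 0 and must re-read.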

9. Bottlenecks & Scaling

Potential bottlenecks and mitigations:

  • Inventory hotspot for a popular event: shard inventory by event and section, use consistent hashing, and add partitions (rebalancing request routing) as demand grows.
  • Database write throughput: move hot write paths to NoSQL with high write capacity (Cassandra/Dynamo/partitioned Postgres) and use batching where appropriate.
  • Payment gateway throughput: parallelize across multiple gateway providers and implement retry/backoff and queueing; offload non-critical work.
  • Cache stampede on popular reads: use layered caching (CDN -> Redis -> DB) and request coalescing (see the sketch after this list).
  • Message bus overload: increase partition counts, scale consumer groups, and use a topic per workflow (orders, holds, analytics).
  • Network & edge limits: use multi-region failover and geolocation-based routing; pre-warm caches and provision capacity for known sale times.
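
The request-coalescing mitigation above can be sketched as a small single-flight helper: concurrent cache misses for the same key share one backend load instead of stampeding the database. This is a generic asyncio sketch, not tied to any particular cache client.

```python
import asyncio
from typing import Any, Awaitable, Callable

_inflight: dict[str, asyncio.Future] = {}

async def coalesced_fetch(key: str,
                          loader: Callable[[], Awaitable[Any]]) -> Any:
    """Single-flight: the first caller for a key runs the loader (cache
    fill / DB read); everyone else awaits the same in-flight result."""
    future = _inflight.get(key)
    if future is not None:
        return await future                 # piggyback on the existing call

    future = asyncio.get_running_loop().create_future()
    _inflight[key] = future
    try:
        result = await loader()
        future.set_result(result)
        return result
    except Exception as exc:
        future.set_exception(exc)
        raise
    finally:
        _inflight.pop(key, None)            # next miss starts a fresh load
```

Combined with the CDN and Redis layers, this keeps a hot seat-map key from turning one cache expiry into thousands of simultaneous DB reads.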

Testing & resilience:

  • Load test with realistic traffic shapes and failure injection.
  • Chaos testing for DB/node failures and network partitions.
  • Graceful degradation: allow browsing to remain available if checkout is degraded (show queues or limits).

10. Follow-up Questions / Extensions

Possible extensions to discuss:

  • Multi-region active-active availability and data replication strategies.
  • Reserving bundles (multiple seats) with atomic guarantees.
  • Dynamic pricing and inventory reservations for presales.
  • Subscription or fan-club priority queues.
  • Offline/edge validation and intermittent connectivity handling for mobile validation gates.
  • Analytics: real-time dashboards and ML for fraud/demand prediction.
  • Cost optimization: cold storage for audit logs and tiered read replicas for reporting.

11. Wrap-up

Summary:

  • Use microservices with clear separation: read-heavy discovery and cached seat maps, and a strongly consistent write path for inventory and orders.
  • Use caching and CDNs for low-latency reads; use sharding and partitioning for inventory scalability.
  • Ensure transactional correctness using optimistic locking/leader serialization and Saga patterns for cross-service workflows.
  • Protect the checkout path with idempotency, rate limiting, and fraud detection.
  • Validate the design with load testing, monitoring, and a multi-region deployment plan for resiliency.
