
Typeahead Suggestion / Autocomplete Design

Medium · 14 min · autocomplete · trie · caching · scalability · real-time · architecture

Design a low-latency, highly available autocomplete system that returns the top-3 suggestions per prefix ordered by recency and frequency at large scale.

1. The Question

Design a typeahead (autocomplete) service that returns the top-3 suggestions for the prefix the user is typing. Suggestions should be ordered by a combination of popularity (frequency) and recency. The system must provide real-time responses with very low latency, scale to large traffic, and be highly available.

2. Clarifying Questions

  • Is the top-k fixed (k=3) or configurable? (Assume fixed k=3 for the core design, but allow extension.)
  • Are suggestions global or personalized? (Assume global by default; discuss personalization later.)
  • What are the latency targets (p50/p95)? (Aim for p95 < 50ms server-side.)
  • How fresh must popularity stats be? (Near-real-time acceptable; hourly/batched updates acceptable.)
  • Read/write ratio and expected qps? (We’ll estimate from example traffic below.)
  • Acceptable consistency model? (Eventual consistency for popularity is fine.)

3. Requirements

Functional:

  • Return up to 3 suggestions for a given input prefix.
  • Order suggestions by a ranking combining frequency and recency.
  • Support incremental queries as user types (very low latency).

Non-functional:

  • Low latency (target p95 < 50ms server-side, with p50 well below that).
  • High throughput and horizontal scalability.
  • High availability and graceful degradation.
  • Reasonable freshness of ranking (near-real-time acceptable).

4. Scale Estimates

  • Example corpus: Google-like 5B queries/day.
  • Unique queries ~30% => 1.5B unique/day.
  • Raw storage: 1.5B unique/day * 15 chars/query * 2 bytes/char ≈ 45GB/day of raw query bytes.
  • Overall QPS: 5B / 86,400 ≈ 57,870 QPS average. With typing amplification and peaks, provision for 3x–5x peak => ~175k–290k QPS.
  • Autocomplete read amplification: each character typed issues a request; average session may produce multiple requests per search; plan cache-heavy reads.
  • Target cache hit rate >= 95% to meet latency targets; cold-path queries served from persistent store.
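
These figures can be sanity-checked with a quick back-of-envelope script (the constants are the assumptions above, not measurements):

```python
# Back-of-envelope check of the scale estimates above.
QUERIES_PER_DAY = 5_000_000_000
UNIQUE_RATIO = 0.30          # assumed fraction of unique queries
AVG_CHARS = 15               # assumed average query length
BYTES_PER_CHAR = 2
SECONDS_PER_DAY = 86_400

unique_per_day = QUERIES_PER_DAY * UNIQUE_RATIO          # 1.5B unique/day
raw_bytes_per_day = unique_per_day * AVG_CHARS * BYTES_PER_CHAR
avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY              # ~57,870 QPS

print(f"unique/day:  {unique_per_day:,.0f}")
print(f"raw storage: {raw_bytes_per_day / 1e9:.0f} GB/day")
print(f"average QPS: {avg_qps:,.0f}")
print(f"peak QPS:    {3 * avg_qps:,.0f} - {5 * avg_qps:,.0f}")
```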

5. Data Model

Core concepts:

  • Compressed Trie (radix tree): store phrases as a space-efficient prefix tree.
  • Node metadata: top-k suggestions precomputed for that prefix (list of suggestion IDs + score), termination flag, and optional phrase frequency + last-seen timestamp.
  • Suggestion record: {id, phrase, global_frequency, last_update_timestamp}.
  • Storage layout:
    • Persistent store for the full trie (replicated across nodes): a distributed datastore (Cassandra or a RocksDB-backed service), with large tries stored as sharded partitions; a highly available ZooKeeper-like store holds service config such as the shard mapping.
    • In-memory caches (Redis) holding nodes/top-k for hot prefixes.
  • Sharding key: first character or prefix hash (e.g., first 1-2 characters) to partition the trie across servers, preventing a single monolithic trie.
  • Ranking score: a function combining frequency and recency, e.g. score = α * normalized_freq + β * recency_boost (timestamp decay). Normalize to allow comparing phrases across buckets.
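
A minimal in-memory sketch of this data model in Python; field names are illustrative rather than a wire format, and the radix lookup returns the precomputed top-k at the deepest node matching the prefix:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Suggestion:
    # Persistent suggestion record, as described above.
    id: int
    phrase: str
    global_frequency: int
    last_update_timestamp: float  # unix seconds

@dataclass
class TrieNode:
    # One node of the compressed (radix) trie. `edge` holds the full label
    # of the incoming edge, not a single character.
    edge: str
    is_terminal: bool = False                              # a complete phrase ends here
    top_k: List[Suggestion] = field(default_factory=list)  # precomputed, len <= 3
    children: Dict[str, "TrieNode"] = field(default_factory=dict)  # first char -> child

def lookup(root: TrieNode, prefix: str) -> List[Suggestion]:
    """Descend edge labels; return the precomputed top-k at the deepest match."""
    node, i = root, 0
    while i < len(prefix):
        child = node.children.get(prefix[i])
        if child is None:
            return []
        label = child.edge
        rest = prefix[i:i + len(label)]
        # The remaining prefix must match the edge label, either fully or as
        # a prefix of it (compression means one subtree covers both cases).
        if not label.startswith(rest):
            return []
        node, i = child, i + len(rest)
    return node.top_k
```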

6. API Design

Public API (HTTP/JSON):

  • GET /v1/autocomplete?s=<prefix>&k=3&context={optional}

    • Request params: s (string), k (optional int), context (optional user or locale), lang/region.
    • Response: { "prefix": "be", "suggestions": [ {"phrase":"belt","score":123.4}, ... ] }
  • POST /v1/feedback

    • body: {"phrase": "bell", "event": "select|complete", "user_id": "..."}
    • Used to log selection/impression events for offline aggregation.
  • Admin endpoints: push new trie shard, trigger reindex, metrics.

Latency targets: the API should return cached suggestions in <10ms at the app layer when served from Redis; cold reads from the trie store should stay under 50ms.
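
A hedged sketch of the read endpoint, using FastAPI purely as an illustrative framework; get_top_k is a hypothetical helper wrapping the cache/trie lookup described in section 7:

```python
from fastapi import FastAPI, Query

app = FastAPI()

def get_top_k(prefix: str, k: int) -> list[dict]:
    # Hypothetical helper: check Redis, fall back to the trie shard on a miss.
    raise NotImplementedError

@app.get("/v1/autocomplete")
def autocomplete(
    s: str = Query(..., min_length=1, max_length=100),  # the prefix being typed
    k: int = Query(3, ge=1, le=10),                     # top-k, default 3
    context: str | None = None,                         # optional user/locale context
):
    # Lowercase/trim as a simple normalization step before lookup.
    suggestions = get_top_k(s.lower().strip(), k)
    return {"prefix": s, "suggestions": suggestions}
```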

7. High-Level Architecture

Components:

  • Load Balancer (LB): routes client requests to application tier.
  • Application Servers: stateless front-end that handles API, rate-limits, and orchestrates cache/db lookups.
  • Cache Tier (Redis): stores node-level precomputed top-3 suggestions for hot prefixes with TTL.
  • Trie Storage (sharded): persistent store holding full compressed trie partitions (Cassandra / RocksDB on dedicated servers). Each shard holds a subset of prefixes and is replicated for HA.
  • Aggregator (streaming): collects selection/impression events (Kafka) and aggregates counts + recency info into temporary storage (Cassandra / hash tables).
  • Applier: batch job (hourly/half-hourly, configurable) that consumes aggregates, recomputes affected trie shards, precomputes top-k per node, and atomically swaps shards (or pushes diffs) to storage; see the recompute sketch after this list.
  • Pub/Sub or Config Store: notify app servers / caches of trie updates (invalidate keys or warm caches).
  • Monitoring & Alerts, Metrics, and Admin UI.
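
A sketch of the applier's per-node recompute: a post-order pass that merges each child's precomputed top-k with the node's own terminal phrase and keeps the best 3. It assumes the TrieNode/Suggestion shapes from section 5, plus a hypothetical phrase_suggestion field on terminal nodes and a score function (the frequency+recency score from section 8):

```python
import heapq

def recompute_top_k(node, score, k=3):
    """Recompute node.top_k bottom-up; returns the node's new top-k list."""
    candidates = []
    if node.is_terminal:
        candidates.append(node.phrase_suggestion)        # assumed field
    for child in node.children.values():
        # Children first, so each child's top_k is fresh before merging.
        candidates.extend(recompute_top_k(child, score, k))
    node.top_k = heapq.nlargest(k, candidates, key=score)
    return node.top_k
```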

Request flow (read):

  1. Client -> LB -> App Server.
  2. App checks Redis for prefix node top-3. If hit, return suggestions.
  3. If miss, app queries the appropriate trie shard (based on the prefix→shard mapping) for node data, returns the suggestions, and populates Redis (sketched below).

Write/updates flow:

  • User selection events → App Server logs to Kafka → Aggregator updates aggregate store → Applier periodically rebuilds trie shards and publishes updates.
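
The first hop of this path might look like the following sketch (kafka-python client; the broker address, topic name, and event schema are assumptions):

```python
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_selection(phrase: str, user_id: str) -> None:
    # Fire-and-forget; the aggregator consumes these events and maintains
    # per-phrase counts plus last-seen timestamps for the applier.
    producer.send("autocomplete-feedback", {
        "phrase": phrase,
        "event": "select",
        "user_id": user_id,
        "ts": time.time(),
    })
```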

8. Detailed Design Decisions

  • Precompute top-k per node: key for low latency. Avoid deep traversal at query time.
  • Compressed Trie (radix) reduces memory and traversal steps.
  • Shard by prefix to distribute storage and read load; keep mapping consistent via consistent hashing or deterministic prefix mapping.
  • Cache hot prefixes in Redis with short TTL (seconds to minutes) and write-through warming after applier pushes updates.
  • Ranking strategy: combine long-term frequency (aggregated counts) with recency (timestamp decay). We can use exponential decay: score = freq * exp(-λ * age); a worked example follows this list.
  • Updates: use batched recompute to avoid frequent large writes. Eventual consistency is acceptable; real-time per-selection increments only update aggregates, not trie directly.
  • Atomic swap: build new shard offline and swap pointer/metadata to new version to avoid partial reads.
  • Replica reads: app servers can read any replica; choose nearest low-latency replica.
  • Hot-key mitigation: detect hot prefixes and replicate shards or use local in-memory caches on many app nodes.
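
A worked example of the decay score; the half-life is an illustrative choice, not a recommendation:

```python
import math
import time

# Choose λ so a count loses half its weight after HALF_LIFE_DAYS (assumed).
HALF_LIFE_DAYS = 7.0
LAMBDA = math.log(2) / (HALF_LIFE_DAYS * 86_400)  # decay rate per second

def score(freq: int, last_seen_ts: float, now: float | None = None) -> float:
    age = (now or time.time()) - last_seen_ts
    return freq * math.exp(-LAMBDA * age)

# A phrase seen 1,000 times but last hit 7 days ago scores the same as a
# phrase seen 500 times just now:
now = time.time()
assert abs(score(1000, now - 7 * 86_400, now) - score(500, now, now)) < 1e-6
```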

9. Bottlenecks & Scaling

  • Hot prefixes (e.g., popular terms like "youtube") can overwhelm single shard/cache key. Mitigations:
    • Replicate hot-prefix entries across multiple Redis instances (e.g., via key salting; sketched after this list) or use client-side caching.
    • Use rate limiting for abusive clients.
  • Aggregator/Applier throughput: recomputing huge trie shards can be heavy. Mitigations:
    • Incremental updates: only rebuild affected subtree(s) instead of whole trie.
    • Parallelize reindexing across shard workers.
  • Network latency: colocate app servers with caches and shards; use regional shards for geo-local traffic.
  • Memory limits: compressed trie + storing only top-k per node reduces footprint. Evict low-frequency phrases from persistent store if below threshold.
  • Cold starts: warm caches during deployment, use progressive rollout, and fall back to prefix-scan queries if a shard is temporarily unavailable.
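
For the hot-prefix mitigation above, key salting is one concrete option: replicate a hot prefix's cache entry under N salted keys so reads fan out across Redis shards. A sketch (the replica count is an assumption):

```python
import random

HOT_REPLICAS = 8  # assumed replication factor for detected hot prefixes

def reader_key(prefix: str) -> str:
    # Readers pick one replica at random, spreading load across shards.
    return f"ac:{prefix}#{random.randrange(HOT_REPLICAS)}"

def writer_keys(prefix: str) -> list[str]:
    # Writers (the cache warmer) populate every replica.
    return [f"ac:{prefix}#{i}" for i in range(HOT_REPLICAS)]
```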

10. Follow-up Questions / Extensions

  • Personalization: incorporate user history and context into ranking (privacy considerations).
  • Typo tolerance: add fuzzy matching (edit distance) or include n-gram/phonetic indexes; increases compute and memory cost.
  • Multilingual support: shard by language, normalize input with unicode normalization.
  • A/B testing and online learning: test ranking variants and adapt weights.
  • Real-time updates: if stricter freshness required, integrate an incremental streaming path that applies high-frequency updates to in-memory top-k for hot prefixes.
  • Rich suggestions: include categories, snippets, icons; store metadata alongside phrase records.

11. Wrap-up

A production-grade autocomplete service trades off freshness, memory footprint, and latency. The recommended approach:

  • Use a sharded, compressed trie with precomputed top-k per node to achieve low-latency reads.
  • Serve hot prefixes from Redis cache; store full trie in a replicated persistent store.
  • Collect events via a streaming pipeline; aggregate and periodically rebuild trie shards (applier) to update rankings.

This design achieves high availability and low query latency while keeping update complexity manageable through batched recompute and targeted incremental updates.
