1. The Question
Design a distributed API rate limiter that prevents malicious or negligent users from sending too many requests. Requirements: add minimal latency, support multiple algorithms (fixed-window and sliding-window), partition to handle large scale, cover both authenticated and unauthenticated endpoints, and stay fault-tolerant.
2. Clarifying Questions
- What resources do we limit (per-user, per-IP, per-API-key, per-endpoint)?
- Are limits per endpoint, per service, or global? Can they vary by plan?
- Do we need strict correctness on limits, or is eventual consistency acceptable?
- Expected QPS and number of distinct users (peak and average)?
- SLA for added latency from the rate check (e.g., <1 ms, <5 ms)?
- Should the limiter support multiple algorithms (fixed window, sliding window, token bucket)?
3. Requirements
Functional:
- Enforce rate limits per principal (user ID or IP) and per service/endpoint.
- Support fixed-window and sliding-window algorithms.
- Allow per-endpoint/plan config and dynamic updates.
Non-functional:
- Minimal added latency to each request.
- Scales to billions of users and many endpoints.
- Fault-tolerant and highly available.
- Low operational cost; avoid disk-thrashing on the critical path.
Edge cases:
- Handle unauthenticated endpoints (use IP) and authenticated requests (user ID).
- Mitigate false positives from shared IPs (e.g., many users behind one NAT) with hybrid keying.
4. Scale Estimates
Example capacity estimate from the prompt:
- Users: 1,000,000,000 (1B)
- Services/endpoints to limit: 20
- Per-record metadata: userID (8 bytes) + counter (4 bytes) ≈ 12 bytes
Memory estimate: 1B users × 20 services × 12 bytes ≈ 240 GB in memory.
Implication: single-node in-memory storage is infeasible; partition/shard across nodes. Expect very high QPS; design to partition by key and cache hot keys at the load balancer.
5. Data Model
Two main models depending on algorithm:
Fixed-window (fast, simple):
- Key: {partition}:{service}:{principal}:{window_start}
- Value: integer counter, TTL set to window size + small buffer.
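A minimal sketch of the fixed-window model, assuming redis-py; the partition prefix is dropped and the function name is illustrative:

```python
import time
import redis

r = redis.Redis()

def fixed_window_allow(service: str, principal: str,
                       limit: int, window_s: int) -> bool:
    # Key is scoped to the current window; INCR is atomic, so concurrent
    # requests need no application-level lock.
    window_start = int(time.time()) // window_s * window_s
    key = f"{service}:{principal}:{window_start}"
    count = r.incr(key)
    if count == 1:                       # first hit in this window
        r.expire(key, window_s + 5)      # TTL = window size + small buffer
    return count <= limit
```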
Sliding-window (accurate):
- Use a time-ordered structure per key (e.g., Redis list or sorted set)
- Key: {partition}:{service}:{principal}:requests
- Value: timestamps in a sorted set (score = timestamp). Evict entries older than (now - window).
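A sketch of the sorted-set version, again assuming redis-py. The pipeline runs as one MULTI/EXEC transaction; section 8 shows a fully atomic Lua variant:

```python
import time
import uuid
import redis

r = redis.Redis()

def sliding_window_allow(service: str, principal: str,
                         limit: int, window_s: int) -> bool:
    key = f"{service}:{principal}:requests"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # evict entries older than the window
    pipe.zadd(key, {uuid.uuid4().hex: now})        # record this request (unique member)
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s + 5)                 # garbage-collect idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit                          # denied requests simply age out
```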
Other metadata:
- Store per-endpoint config (limit, window, algorithm, tier) in a config store.
- Partition/replica metadata for routing.
Preferred storage: Redis (in-memory, rich data structures, TTL, replication).
6. API Design
Expose a simple check API used by front door or services:
POST /rate_limit/check
Payload (JSON):
- principal_id: string | null (user ID; null for unauthenticated requests)
- ip: string
- service: string (endpoint/service name)
- timestamp: ISO8601 or epoch ms
Response (JSON):
- allowed: boolean
- remaining: integer (optional)
- retry_after_seconds: integer (optional)
- quota: config metadata (optional)
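An illustrative request (all field values are made up):

```json
{"principal_id": "user_81273", "ip": "203.0.113.7", "service": "search", "timestamp": 1718000000000}
```

and a response when the quota is exhausted:

```json
{"allowed": false, "remaining": 0, "retry_after_seconds": 42}
```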
Synchronous fast path: load balancer cache → if miss, call rate limiter service → if allowed, forward request.
7. High-Level Architecture
Components:
- Load Balancer (LB): receives client traffic. It holds a small write-back cache for hot principals.
- Rate Limiter Service layer (dedicated): horizontally partitioned nodes (shards) that own ranges of principals.
- Redis partitions (co-located or per shard): single-leader replication per partition keeps counter reads and writes consistent.
- Backend application services: only receive request if rate limiter allows.
Flow:
- Client → LB.
- LB checks local cache for principal/endpoint. If cache says blocked, reject quickly.
- If cache miss or allowed, LB calls Rate Limiter Service shard responsible for principal.
- Rate Limiter reads/updates Redis (atomic increment or sorted-set ops), returns allow/deny.
- If allowed, LB forwards request to backend.
Optimizations: partition by hash(principal) to reduce coordination; use an LB-side write-back cache for hot keys; co-locate Redis shards with limiter instances to reduce network hops.
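A sketch of the routing step; a production system would use consistent hashing so that adding a shard moves few keys, but plain modulo keeps the idea visible:

```python
import hashlib

def shard_for(principal: str, num_shards: int) -> int:
    # Every request for a given principal lands on the same shard, so its
    # counters live on exactly one Redis partition and need no coordination.
    digest = hashlib.sha256(principal.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```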
8. Detailed Design Decisions
Where to run limiter:
- Local (per service): zero extra network hop, but it consumes app capacity and couples limiter scaling to app scaling.
- Dedicated distributed layer: extra hop but shields backend servers and enables independent scaling. Recommended: dedicated layer with LB caching.
Key choice:
- Hybrid principal selection: use IP for unauthenticated endpoints and user ID for authenticated ones. Support per-endpoint policy.
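A sketch of the hybrid keying (the parameter names are hypothetical); the prefix keeps user IDs and IPs from ever colliding in the store:

```python
def principal_for(user_id: str | None, ip: str) -> str:
    # Authenticated requests are keyed by user ID, anonymous ones by client IP.
    return f"user:{user_id}" if user_id else f"ip:{ip}"
```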
Storage & replication:
- Use Redis (in-memory, TTLs, sorted sets) for speed. Use single-leader replication per partition for correctness on counters; failover promotes follower to leader.
- Avoid leaderless/multi-leader replication unless you accept temporary divergence. CRDT counters could merge divergent replicas but complicate accuracy guarantees.
Algorithms:
- Fixed window: store (window, count). Very fast, but subject to boundary bursts: up to 2x the limit can pass when traffic straddles a window edge.
- Sliding window: maintain timestamps in a list/sorted-set and purge old entries before counting; accurate but costlier.
Concurrency:
- Use atomic Redis ops (INCR, INCRBY, ZADD, ZREMRANGEBYSCORE, ZCARD) to avoid application-level locks.
- If using in-process counters, guard them with locks or a single-threaded event loop per partition.
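The add-then-count pipeline in section 5 lets denied requests consume sorted-set entries; running the whole check as one server-side Lua script (a sketch assuming redis-py's register_script) makes it a single atomic decision:

```python
import time
import uuid
import redis

r = redis.Redis()

# Evict, count, and record in one atomic server-side step: the request is
# inserted only if it is allowed, and no application-level lock is needed.
SLIDING_CHECK = r.register_script("""
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
if redis.call('ZCARD', KEYS[1]) >= limit then return 0 end
redis.call('ZADD', KEYS[1], now, ARGV[4])
redis.call('EXPIRE', KEYS[1], window + 5)
return 1
""")

def sliding_allow(key: str, limit: int, window_s: int) -> bool:
    return SLIDING_CHECK(keys=[key],
                         args=[time.time(), window_s, limit, uuid.uuid4().hex]) == 1
```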
9. Bottlenecks & Scaling
- Load Balancer network: high-volume attacks still hit LB; mitigate with WAF and edge filtering.
- Rate limiter service hotspot: mitigate with consistent hashing; split hot keys across dedicated downstream caches or give them local token buckets.
- Redis write load: shard by principal and scale horizontally; use pipelining and batching where possible.
- Coordination overhead: keep counters local to partition; avoid cross-partition operations.
- Latency: LB cache avoids many network calls; strive for sub-ms Redis operations and co-locate rate limiter nodes near Redis shards.
Scaling strategies:
- Shard by principal hash range; add shards and rebalance with minimal impact.
- Cache hot principals at LB with short TTL and write-back/refresh heuristics.
- Use quota tiers to offload strict limits from the global store (e.g., treat very high-volume API keys specially).
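A sketch of the LB-side hot-key cache; the short TTL bounds how stale a verdict can be, and a production version would also cap the number of entries (e.g., with an LRU bound):

```python
import time

class HotKeyCache:
    """Caches recent allow/deny verdicts so hot principals are answered
    locally instead of hitting the limiter shard on every request."""

    def __init__(self, ttl_s: float = 0.5):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[bool, float]] = {}

    def get(self, key: str) -> bool | None:
        entry = self._entries.get(key)
        if entry is None:
            return None
        allowed, expires_at = entry
        if time.monotonic() > expires_at:   # stale: fall through to the shard
            del self._entries[key]
            return None
        return allowed

    def put(self, key: str, allowed: bool) -> None:
        self._entries[key] = (allowed, time.monotonic() + self.ttl_s)
```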
10. Follow-up Questions / Extensions
- Support token bucket/leaky bucket for burst smoothing and refill rates (see the sketch after this list).
- Per-client adaptive limits using behavioral signals.
- Global distributed counters for cross-service quotas (billing-related).
- Analytics: capture rate-limit events for dashboards and alerts.
- Abuse mitigation: automated blocking, CAPTCHA triggers, gradual throttling vs hard blocks.
- Multi-region deployment with geo-routing and global deduplication.
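For the token-bucket extension, a minimal single-node sketch; a distributed version would keep (tokens, last refill time) in Redis and update them with a script like the one in section 8:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`; refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```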
11. Wrap-up
Recommended design: dedicated, partitioned rate limiter service backed by Redis per partition, single-leader replication for correctness, and LB-side hot-key caching. Use hybrid keying (IP for unauthenticated, user ID for authenticated) and support fixed-window for cheap checks and sliding-window for accuracy. Scale by sharding, caching hot keys, and isolating hotspots.