1. The Question
Design a distributed API rate limiter that prevents malicious or negligent users from sending too many requests. Requirements: add minimal latency, support multiple algorithms (fixed-window and sliding-window), partition to handle large scale, cover both authenticated and unauthenticated endpoints, and stay fault-tolerant.
2. Clarifying Questions
- What resources do we limit (per-user, per-IP, per-API-key, per-endpoint)?
- Are limits per endpoint, per service, or global? Can they vary by plan?
- Do we need strict correctness on limits, or is eventual consistency acceptable?
- Expected QPS and number of distinct users (peak and average)?
- SLA for added latency from the rate check (e.g., <1 ms, <5 ms)?
- Should the limiter support multiple algorithms (fixed window, sliding window, token bucket)?
3. Requirements
Functional:
- Enforce rate limits per principal (user ID or IP) and per service/endpoint.
- Support fixed-window and sliding-window algorithms.
- Allow per-endpoint/plan config and dynamic updates.
Non-functional:
- Minimal added latency to each request.
- Scales to billions of users and many endpoints.
- Fault-tolerant and highly available.
- Low operational cost; avoid disk-thrashing on the critical path.
Edge cases:
- Handle unauthenticated endpoints (use IP) and authenticated requests (user ID).
- Mitigate false positives from shared IPs (e.g., many users behind one NAT) with hybrid keying.
4. Scale Estimates
Example capacity estimate from the prompt:
- Users: 1,000,000,000 (1B)
- Services/endpoints to limit: 20
- Per-record metadata: userID (8 bytes) + counter (4 bytes) ≈ 12 bytes
Memory estimate: 1B users × 20 services × 12 bytes ≈ 240 GB in memory.
Implication: single-node in-memory storage is infeasible; partition/shard across nodes. Expect very high QPS; design to partition by key and cache hot keys at the load balancer.
5. Data Model
Two main models depending on algorithm:
Fixed-window (fast, simple):
- Key: {partition}:{service}:{principal}:{window_start}
- Value: integer counter, TTL set to window size + small buffer.
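A minimal sketch of the fixed-window model, assuming redis-py; the partition prefix is dropped and the function name is illustrative:

```python
import time
import redis

r = redis.Redis()

def fixed_window_allow(service: str, principal: str,
                       limit: int, window_s: int) -> bool:
    # Key is scoped to the current window; INCR is atomic, so concurrent
    # requests need no application-level lock.
    window_start = int(time.time()) // window_s * window_s
    key = f"{service}:{principal}:{window_start}"
    count = r.incr(key)
    if count == 1:                       # first hit in this window
        r.expire(key, window_s + 5)      # TTL = window size + small buffer
    return count <= limit
```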
Sliding-window (accurate):
- Use a time-ordered structure per key (e.g., Redis list or sorted set)
- Key: {partition}:{service}:{principal}:requests
- Value: timestamps in a sorted set (score = timestamp). Evict entries older than (now - window).
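A sketch of the sorted-set version, again assuming redis-py. The pipeline runs as one MULTI/EXEC transaction; section 8 shows a fully atomic Lua variant:

```python
import time
import uuid
import redis

r = redis.Redis()

def sliding_window_allow(service: str, principal: str,
                         limit: int, window_s: int) -> bool:
    key = f"{service}:{principal}:requests"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # evict entries older than the window
    pipe.zadd(key, {uuid.uuid4().hex: now})        # record this request (unique member)
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s + 5)                 # garbage-collect idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit                          # denied requests simply age out
```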
Other metadata:
- Store per-endpoint config (limit, window, algorithm, tier) in a config store.
- Partition/replica metadata for routing.
Preferred storage: Redis (in-memory, rich data structures, TTL, replication).
6. API Design
Expose a simple check API used by front door or services:
POST /rate_limit/check
Payload (JSON):
- principal_id: string | null (user ID; null for unauthenticated requests)
- ip: string
- service: string (endpoint/service name)
- timestamp: ISO8601 or epoch ms
Response (JSON):
- allowed: boolean
- remaining: integer (optional)
- retry_after_seconds: integer (optional)
- quota: config metadata (optional)
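An illustrative request (all field values are made up):

```json
{"principal_id": "user_81273", "ip": "203.0.113.7", "service": "search", "timestamp": 1718000000000}
```

and a response when the quota is exhausted:

```json
{"allowed": false, "remaining": 0, "retry_after_seconds": 42}
```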
Synchronous fast path: load balancer cache → if miss, call rate limiter service → if allowed, forward request.
7. High-Level Architecture
Components:
- Load Balancer (LB): receives client traffic. It holds a small write-back cache for hot principals.
- Rate Limiter Service layer (dedicated): horizontally partitioned nodes (shards) that own ranges of principals.
- Redis partitions (co-located or per shard): single-leader replication per partition keeps counter reads and writes consistent.
- Backend application services: only receive request if rate limiter allows.
Flow:
- Client → LB.
- LB checks local cache for principal/endpoint. If cache says blocked, reject quickly.
- If cache miss or allowed, LB calls Rate Limiter Service shard responsible for principal.
- Rate Limiter reads/updates Redis (atomic increment or sorted-set ops), returns allow/deny.
- If allowed, LB forwards request to backend.
Optimizations: partition by hash(principal) to reduce coordination; use an LB-side write-back cache for hot keys; co-locate Redis shards with limiter instances to reduce network hops.
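A sketch of the routing step; a production system would use consistent hashing so that adding a shard moves few keys, but plain modulo keeps the idea visible:

```python
import hashlib

def shard_for(principal: str, num_shards: int) -> int:
    # Every request for a given principal lands on the same shard, so its
    # counters live on exactly one Redis partition and need no coordination.
    digest = hashlib.sha256(principal.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```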
8. Detailed Design Decisions
Where to run limiter:
- Local (per service): zero extra network hop, but it consumes app capacity and couples limiter scaling to app scaling.
- Dedicated distributed layer: extra hop but shields backend servers and enables independent scaling. Recommended: dedicated layer with LB caching.
Key choice:
- Hybrid principal selection: use IP for unauthenticated endpoints and user ID for authenticated ones. Support per-endpoint policy.
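A sketch of the hybrid keying (the parameter names are hypothetical); the prefix keeps user IDs and IPs from ever colliding in the store:

```python
def principal_for(user_id: str | None, ip: str) -> str:
    # Authenticated requests are keyed by user ID, anonymous ones by client IP.
    return f"user:{user_id}" if user_id else f"ip:{ip}"
```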
Storage & replication:
- Use Redis (in-memory, TTLs, sorted sets) for speed. Use single-leader replication per partition for correctness on counters; failover promotes follower to leader.
- Avoid leaderless/multi-leader replication unless you accept temporary divergence. CRDT counters could merge divergent replicas but complicate accuracy guarantees.
Algorithms:
- Fixed window: store (window, count). Very fast, but subject to boundary bursts: up to 2x the limit can pass when traffic straddles a window edge.
- Sliding window: maintain timestamps in a list/sorted-set and purge old entries before counting; accurate but costlier.
Concurrency:
- Use atomic Redis ops (INCR, INCRBY, ZADD, ZREMRANGEBYSCORE, ZCARD) to avoid application-level locks.
- If using in-process counters, guard them with locks or a single-threaded event loop per partition.
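The add-then-count pipeline in section 5 lets denied requests consume sorted-set entries; running the whole check as one server-side Lua script (a sketch assuming redis-py's register_script) makes it a single atomic decision:

```python
import time
import uuid
import redis

r = redis.Redis()

# Evict, count, and record in one atomic server-side step: the request is
# inserted only if it is allowed, and no application-level lock is needed.
SLIDING_CHECK = r.register_script("""
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - window)
if redis.call('ZCARD', KEYS[1]) >= limit then return 0 end
redis.call('ZADD', KEYS[1], now, ARGV[4])
redis.call('EXPIRE', KEYS[1], window + 5)
return 1
""")

def sliding_allow(key: str, limit: int, window_s: int) -> bool:
    return SLIDING_CHECK(keys=[key],
                         args=[time.time(), window_s, limit, uuid.uuid4().hex]) == 1
```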
9. Bottlenecks & Scaling
- Load Balancer network: high-volume attacks still hit LB; mitigate with WAF and edge filtering.
- Rate limiter service hotspot: mitigate with consistent hashing; split hot keys across dedicated downstream caches or give them local token buckets.
- Redis write load: shard by principal and scale horizontally; use pipelining and batching where possible.
- Coordination overhead: keep counters local to partition; avoid cross-partition operations.
- Latency: LB cache avoids many network calls; strive for sub-ms Redis operations and co-locate rate limiter nodes near Redis shards.
Scaling strategies:
- Shard by principal hash range; add shards and rebalance with minimal impact.
- Cache hot principals at LB with short TTL and write-back/refresh heuristics.
- Use quota tiers to offload strict limits from the global store (e.g., treat very high-volume API keys specially).
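A sketch of the LB-side hot-key cache; the short TTL bounds how stale a verdict can be, and a production version would also cap the number of entries (e.g., with an LRU bound):

```python
import time

class HotKeyCache:
    """Caches recent allow/deny verdicts so hot principals are answered
    locally instead of hitting the limiter shard on every request."""

    def __init__(self, ttl_s: float = 0.5):
        self.ttl_s = ttl_s
        self._entries: dict[str, tuple[bool, float]] = {}

    def get(self, key: str) -> bool | None:
        entry = self._entries.get(key)
        if entry is None:
            return None
        allowed, expires_at = entry
        if time.monotonic() > expires_at:   # stale: fall through to the shard
            del self._entries[key]
            return None
        return allowed

    def put(self, key: str, allowed: bool) -> None:
        self._entries[key] = (allowed, time.monotonic() + self.ttl_s)
```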
10. Follow-up Questions / Extensions
- Support token bucket/leaky bucket for burst smoothing and refill rates (see the sketch after this list).
- Per-client adaptive limits using behavioral signals.
- Global distributed counters for cross-service quotas (billing-related).
- Analytics: capture rate-limit events for dashboards and alerts.
- Abuse mitigation: automated blocking, CAPTCHA triggers, gradual throttling vs hard blocks.
- Multi-region deployment with geo-routing and global deduplication.
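For the token-bucket extension, a minimal single-node sketch; a distributed version would keep (tokens, last refill time) in Redis and update them with a script like the one in section 8:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`; refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```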
11. Wrap-up
Recommended design: dedicated, partitioned rate limiter service backed by Redis per partition, single-leader replication for correctness, and LB-side hot-key caching. Use hybrid keying (IP for unauthenticated, user ID for authenticated) and support fixed-window for cheap checks and sliding-window for accuracy. Scale by sharding, caching hot keys, and isolating hotspots.