
API Rate Limiter Design

Medium · 12 min · rate-limiting, distributed-systems, redis, api, architecture
Asked at: Misc

Design a scalable, low-latency distributed rate limiter supporting fixed and sliding windows, hybrid user/IP keys, caching, and Redis-backed storage.

1. The Question

Design a distributed API rate limiter that prevents malicious or negligent users from sending too many requests. Requirements: minimal added latency, support for multiple algorithms (fixed window and sliding window), partitioning to handle large scale, support for both authenticated and unauthenticated endpoints, and fault tolerance.

2. Clarifying Questions

  • What resources do we limit (per-user, per-IP, per-API-key, per-endpoint)?
  • Are limits per endpoint, per service, or global? Can they vary by plan?
  • Do we need hard real-time correctness, or is eventual consistency acceptable?
  • Expected QPS and number of distinct users (peak and average)?
  • SLA for added latency from rate check (e.g., <1ms, <5ms)?
  • Should the limiter support multiple algorithms (fixed window, sliding window, token bucket)?

3. Requirements

Functional:

  • Enforce rate limits per principal (user ID or IP) and per service/endpoint.
  • Support fixed-window and sliding-window algorithms.
  • Allow per-endpoint/plan config and dynamic updates.

Non-functional:

  • Minimal added latency to each request.
  • Scales to billions of users and many endpoints.
  • Fault-tolerant and highly available.
  • Low operational cost; avoid disk-thrashing on the critical path.

Edge cases:

  • Handle unauthenticated endpoints (use IP) and authenticated requests (user ID).
  • Mitigate false positives (shared IPs) with hybrid logic.

4. Scale Estimates

Example capacity estimate from the prompt:

  • Users: 1,000,000,000 (1B)
  • Services/endpoints to limit: 20
  • Per-record metadata: userID (8 bytes) + counter (4 bytes) ≈ 12 bytes

Memory estimate: 1B users * 20 services * 12 bytes ≈ 240 GB in-memory.

Implication: single-node in-memory storage is infeasible; partition/shard across nodes. Expect very high QPS; design to partition by key and cache hot keys at the load balancer.
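
A quick sanity check of that arithmetic (a Python sketch using the prompt's numbers; real Redis usage would be higher because of key and data-structure overhead):

  users = 1_000_000_000      # 1B principals
  services = 20              # limited services/endpoints
  bytes_per_record = 8 + 4   # userID (8 B) + counter (4 B)
  total = users * services * bytes_per_record
  print(total / 1e9)         # 240.0 -> ~240 GB before key overhead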

5. Data Model

Two main models depending on algorithm:

Fixed-window (fast, simple):

  • Key: {partition}:{service}:{principal}:{window_start}
  • Value: integer counter, TTL set to window size + small buffer.
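
A minimal fixed-window check against Redis (a Python sketch using redis-py; key layout and helper names are illustrative, and the partition prefix is omitted for brevity):

  import time
  import redis

  r = redis.Redis()

  def fixed_window_allow(service: str, principal: str, limit: int, window_s: int) -> bool:
      # Bucket time into window starts so all requests in the same window
      # share one counter key.
      window_start = int(time.time()) // window_s * window_s
      key = f"rl:{service}:{principal}:{window_start}"
      count = r.incr(key)               # atomic increment
      if count == 1:
          r.expire(key, window_s + 5)   # window size + small buffer
      return count <= limit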

Sliding-window (accurate):

  • Use a time-ordered structure per key (e.g., Redis list or sorted set)
  • Key: {partition}:{service}:{principal}:requests
  • Value: timestamps in a sorted set (score = timestamp). Evict entries older than (now - window).
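
A sliding-window variant over a sorted set (again a redis-py sketch; the pipeline cuts round trips, and in production the member should be unique per request, e.g. a request ID rather than the bare timestamp):

  import time
  import redis

  r = redis.Redis()

  def sliding_window_allow(service: str, principal: str, limit: int, window_s: int) -> bool:
      key = f"rl:{service}:{principal}:requests"
      now_ms = int(time.time() * 1000)
      pipe = r.pipeline()
      pipe.zremrangebyscore(key, 0, now_ms - window_s * 1000)  # evict old entries
      pipe.zadd(key, {str(now_ms): now_ms})                    # record this request
      pipe.zcard(key)                                          # count what remains
      pipe.expire(key, window_s + 5)
      _, _, count, _ = pipe.execute()
      return count <= limit

Note that this sketch records the request before checking the limit, so denied requests still consume entries; the Lua variant in section 8 avoids that by checking before adding.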

Other metadata:

  • Store per-endpoint config (limit, window, algorithm, tier) in a config store.
  • Partition/replica metadata for routing.

Preferred storage: Redis (in-memory, rich data structures, TTL, replication).
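
Per-endpoint config might look like the following (field names are illustrative, not a fixed schema):

  {
    "service": "search",
    "algorithm": "sliding_window",
    "limit": 100,
    "window_seconds": 60,
    "tier_overrides": { "premium": 1000 },
    "principal_policy": "user_id_or_ip"
  }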

6. API Design

Expose a simple check API used by front door or services:

POST /rate_limit/check

Payload (JSON):

  • principal_id: string | null (user ID)
  • ip: string
  • service: string (endpoint/service name)
  • timestamp: ISO8601 or epoch ms

Response (JSON):

  • allowed: boolean
  • remaining: integer (optional)
  • retry_after_seconds: integer (optional)
  • quota: config metadata (optional)
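
For example (illustrative values):

  Request:
  POST /rate_limit/check
  {
    "principal_id": "user_8412",
    "ip": "203.0.113.7",
    "service": "search",
    "timestamp": 1718000000000
  }

  Response (denied):
  {
    "allowed": false,
    "remaining": 0,
    "retry_after_seconds": 12,
    "quota": { "limit": 100, "window_seconds": 60 }
  }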

Synchronous fast path: load balancer cache → if miss, call rate limiter service → if allowed, forward request.

7. High-Level Architecture

Components:

  • Load Balancer (LB): receives client traffic. It holds a small write-back cache for hot principals.
  • Rate Limiter Service layer (dedicated): horizontally partitioned nodes (shards) that own ranges of principals.
  • Redis partitions (co-located or per shard): single-leader replication per partition so counter reads and writes stay consistent.
  • Backend application services: only receive request if rate limiter allows.

Flow:

  1. Client → LB.
  2. LB checks local cache for principal/endpoint. If cache says blocked, reject quickly.
  3. If cache miss or allowed, LB calls Rate Limiter Service shard responsible for principal.
  4. Rate Limiter reads/updates Redis (atomic increment or sorted-set ops), returns allow/deny.
  5. If allowed, LB forwards request to backend.

Optimizations: partition by hash(principal) to reduce coordination; use an LB-side write-back cache for hot keys; co-locate Redis shards with limiter instances to reduce network hops.
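
One way to route a principal to its owning shard (a sketch using simple modulo hashing; a production deployment would more likely use consistent hashing or a ring with virtual nodes so that adding shards moves fewer keys; shard addresses are hypothetical):

  import hashlib

  SHARDS = ["limiter-0:6379", "limiter-1:6379", "limiter-2:6379"]

  def shard_for(principal: str) -> str:
      # A stable hash of the principal picks the owning shard, so every
      # check for one user/IP lands on the same Redis partition.
      digest = hashlib.sha1(principal.encode()).hexdigest()
      return SHARDS[int(digest, 16) % len(SHARDS)]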

8. Detailed Design Decisions

Where to run limiter:

  • Local (per service): zero extra network hop but wastes app capacity and couples scaling.
  • Dedicated distributed layer: extra hop but shields backend servers and enables independent scaling. Recommended: dedicated layer with LB caching.

Key choice:

  • Hybrid principal selection: use IP for unauthenticated endpoints and user ID for authenticated ones. Support per-endpoint policy.
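
A sketch of that selection logic (policy and field names are assumptions for illustration):

  from typing import Optional

  def principal_key(user_id: Optional[str], ip: str, policy: str) -> str:
      # Prefer the authenticated identity; fall back to the client IP so
      # unauthenticated endpoints are still limited.
      if user_id and policy != "ip_only":
          return f"user:{user_id}"
      return f"ip:{ip}"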

Storage & replication:

  • Use Redis (in-memory, TTLs, sorted sets) for speed. Use single-leader replication per partition for correctness on counters; failover promotes follower to leader.
  • Avoid leaderless/multi-leader unless you accept temporary divergence. CRDTs could be used but complicate accuracy.

Algorithms:

  • Fixed window: store (window, count). Very fast; subject to boundary bursts.
  • Sliding window: maintain timestamps in a list/sorted-set and purge old entries before counting; accurate but costlier.

Concurrency:

  • Use atomic Redis ops (INCR, INCRBY, ZADD, ZREMRANGEBYSCORE, ZCARD) to avoid application-level locks.
  • If using in-process counters, ensure locks or single-threaded event loop per partition.
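
When the check-then-act itself must be atomic (e.g., so a denied request does not consume a slot), a small Lua script can run the whole sliding-window decision server-side; a redis-py sketch under the key layout from section 5:

  import time
  import redis

  r = redis.Redis()

  # Evict old entries, check the limit, and record the request only if allowed,
  # all in one atomic server-side step.
  SLIDING_WINDOW_LUA = """
  local key = KEYS[1]
  local now_ms = tonumber(ARGV[1])
  local window_ms = tonumber(ARGV[2])
  local limit = tonumber(ARGV[3])
  redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window_ms)
  if redis.call('ZCARD', key) < limit then
      redis.call('ZADD', key, now_ms, ARGV[4])
      redis.call('PEXPIRE', key, window_ms)
      return 1
  end
  return 0
  """
  sliding_check = r.register_script(SLIDING_WINDOW_LUA)

  def allow(key: str, limit: int, window_ms: int, request_id: str) -> bool:
      now_ms = int(time.time() * 1000)
      return sliding_check(keys=[key], args=[now_ms, window_ms, limit, request_id]) == 1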

9. Bottlenecks & Scaling

  • Load Balancer network: high-volume attacks still hit LB; mitigate with WAF and edge filtering.
  • Rate limiter service hotspots: mitigate with consistent hashing; split hot keys onto dedicated downstream caches or token buckets.
  • Redis write load: shard by principal and scale horizontally; use pipelining and batching where possible.
  • Coordination overhead: keep counters local to partition; avoid cross-partition operations.
  • Latency: LB cache avoids many network calls; strive for sub-ms Redis operations and colocate rate limiter nodes near Redis shards.

Scaling strategies:

  • Shard by principal hash range; add shards and rebalance with minimal impact.
  • Cache hot principals at LB with short TTL and write-back/refresh heuristics.
  • Use quota tiers to offload strict limits from the global store (e.g., treat very high-volume API keys specially).
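
A sketch of the LB-side hot-key cache with a short TTL (an in-process dictionary for illustration; a real load balancer would use its own cache module, and check_remote stands in for the call to the limiter service):

  import time

  CACHE_TTL_S = 1.0   # short TTL keeps decisions fresh without hammering the limiter
  _cache = {}         # key -> (allowed, cached_at)

  def cached_allow(key: str, check_remote) -> bool:
      entry = _cache.get(key)
      if entry and time.time() - entry[1] < CACHE_TTL_S:
          return entry[0]              # serve hot keys from local memory
      allowed = check_remote(key)      # miss: ask the rate limiter service
      _cache[key] = (allowed, time.time())
      return allowed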

10. Follow-up Questions / Extensions

  • Support token bucket/leaky bucket for burst smoothing and refill rates.
  • Per-client adaptive limits using behavioral signals.
  • Global distributed counters for cross-service quotas (billing-related).
  • Analytics: capture rate-limit events for dashboards and alerts.
  • Abuse mitigation: automated blocking, CAPTCHA triggers, gradual throttling vs hard blocks.
  • Multi-region deployment with geo-routing and global deduplication.

11. Wrap-up

Recommended design: dedicated, partitioned rate limiter service backed by Redis per partition, single-leader replication for correctness, and LB-side hot-key caching. Use hybrid keying (IP for unauthenticated, user ID for authenticated) and support fixed-window for cheap checks and sliding-window for accuracy. Scale by sharding, caching hot keys, and isolating hotspots.
