
Detecting and Preventing Review Abuse

MEDIUM · 20 min · Tags: reviews, abuse-detection, fraud-detection, stream-processing, machine-learning
Asked at: Amazon, DoorDash

Design a system to detect, track, and mitigate abusive reviews and coordinated review manipulation on Amazon.com.

1. The Question

Design a system for Amazon.com that detects, tracks, and helps mitigate abusive reviews and review manipulation. Abuse includes fake or purchased reviews, coordinated campaigns (sockpuppet accounts), review stuffing for specific ASINs, and targeted attacks against sellers or competitors. The system should ingest live reviews and historical data, score suspicious activity, surface alerts for automated or human actions, and allow feedback for continuous improvement.

2. Clarifying Questions

  • What types of abusive behavior are we targeting? (fake reviews, coordinated campaigns, review bombing, incentivized reviews)
  • Are we expected to take automated actions (hide, remove, suspend) or just surface alerts for human review?
  • Must detection be real-time, near-real-time, or is daily batch acceptable?
  • What accuracy/latency SLAs are required for scoring and actions?
  • Do we need to integrate external signals (IP reputation, device IDs, payment history)?

3. Requirements

Functional:

  • Ingest reviews and related events (submissions, edits, reports) in real-time.
  • Score each review and user for abuse likelihood.
  • Surface alerts and group related abusive events into incidents.
  • Provide APIs/UI for moderators to review alerts, take actions, and provide feedback.
  • Support re-scoring when new evidence arrives.

Non-functional:

  • High availability (99.9%+) and low latency for real-time scoring (sub-second to a few seconds).
  • Scalable to millions of reviews/day and hundreds of millions of users.
  • Auditable actions and explainable signals for moderation.
  • Privacy and compliance with data policies.

4. Scale Estimates

Estimated scale for Amazon-like service (ballpark):

  • Total users: 300M active shoppers
  • Reviews created: 20M new reviews/day, with bursty peak traffic
  • Review events (create/edit/report): ~30M events/day
  • Average review size: 2 KB -> ~40 GB/day raw
  • Write rate: ~350 events/sec on average (30M events/day), with bursts to several thousand events/sec during spikes
  • Read rate for scoring and UI: ~5k-50k reads/sec depending on usage and analytics
  • Storage: 20M reviews/day -> ~7.3B reviews/year (~15 TB/year of review text at ~2 KB/review, plus indexes)

Design for horizontal scalability; use sharding and eventual consistency where appropriate.
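
A quick back-of-envelope check of the figures above, using the assumed daily volumes as inputs (a sketch; the constants mirror the estimates in the list, not measured numbers):

```python
# Back-of-envelope check of the scale estimates above.
REVIEWS_PER_DAY = 20_000_000
EVENTS_PER_DAY = 30_000_000            # creates + edits + reports
AVG_REVIEW_SIZE_BYTES = 2_000          # ~2 KB per review (decimal, to match the estimate)
SECONDS_PER_DAY = 86_400

avg_event_rate = EVENTS_PER_DAY / SECONDS_PER_DAY              # ~350 events/sec
raw_bytes_per_day = REVIEWS_PER_DAY * AVG_REVIEW_SIZE_BYTES    # ~40 GB/day
reviews_per_year = REVIEWS_PER_DAY * 365                       # ~7.3B reviews/year
raw_bytes_per_year = reviews_per_year * AVG_REVIEW_SIZE_BYTES  # ~15 TB/year, before indexes

print(f"avg event rate : {avg_event_rate:,.0f} events/sec")
print(f"raw ingest     : {raw_bytes_per_day / 1e9:.1f} GB/day")
print(f"reviews/year   : {reviews_per_year / 1e9:.1f} B")
print(f"raw storage    : {raw_bytes_per_year / 1e12:.1f} TB/year (excluding indexes)")
```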

5. Data Model

Key entities and fields (simplified):

  • Review { review_id, asin, user_id, rating, text, created_at, edited_at, source, status }
  • User { user_id, created_at, credential_age, purchase_history, device_ids[], payment_methods[], flags }
  • Product { asin, category, seller_id, review_count, rating_distribution }
  • ReviewSignal { review_id, spam_score, model_version, feature_snapshot, rules_triggered }
  • Alert/Incident { incident_id, related_reviews[], aggregated_score, status, created_at, actions[] }
  • FeatureStore entries for online features: { key: user_id|review_id, feature_vector, last_updated }

Indexes: time-based indexes for reviews, inverted/indexed text for search (Elasticsearch), and wide-key stores for user and product aggregates.
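
To make the shapes concrete, a minimal sketch of three of the entities as Python dataclasses; field names follow the list above, while the example enum values in the comments and defaults are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Review:
    review_id: str
    asin: str
    user_id: str
    rating: int                          # 1-5 stars
    text: str
    created_at: datetime
    edited_at: Optional[datetime] = None
    source: str = "web"                  # e.g., "web", "mobile"
    status: str = "visible"              # e.g., "visible", "hidden", "removed"

@dataclass
class ReviewSignal:
    review_id: str
    spam_score: float                    # 0.0 (benign) .. 1.0 (abusive)
    model_version: str
    feature_snapshot: dict               # features used at scoring time, kept for explainability
    rules_triggered: list = field(default_factory=list)

@dataclass
class Incident:
    incident_id: str
    related_reviews: list                # review_ids clustered into this incident
    aggregated_score: float
    status: str                          # e.g., "open", "triaged", "closed"
    created_at: datetime
    actions: list = field(default_factory=list)
```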

6. API Design

Example REST endpoints:

  • POST /reviews

    • body: { user_id, asin, rating, text, metadata }
    • response: { review_id, status }
    • behavior: write review event to ingest queue; return quickly
  • POST /reviews/{review_id}/report

    • body: { reporter_id, reason }
    • response: { report_id }
  • GET /reviews/{review_id}/score

    • response: { review_id, spam_score, model_version, signals }
  • GET /incidents?status=open&priority=high

    • response: list of incidents for moderator UI
  • POST /incidents/{id}/action

    • body: { action: hide|remove|suspend, actor_id, reason }

APIs should be asynchronous for heavy work; provide webhooks/callbacks for updated verdicts.
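
As a sketch of the ingest path, a minimal FastAPI handler that validates the request, appends an event to the log, and returns immediately. The broker address, the "review-events" topic name, and the kafka-python client are assumptions for illustration:

```python
import json
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from kafka import KafkaProducer  # kafka-python; any durable event-log producer works

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                      # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

class ReviewRequest(BaseModel):
    user_id: str
    asin: str
    rating: int
    text: str
    metadata: dict = {}

@app.post("/reviews")
def submit_review(req: ReviewRequest):
    review_id = str(uuid.uuid4())
    # Append to the durable event log and return quickly; scoring happens downstream.
    producer.send("review-events", {"review_id": review_id, **req.dict()})
    return {"review_id": review_id, "status": "pending"}
```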

7. High-Level Architecture

Components:

  1. Ingestion layer: API gateway -> write service -> append to durable event log (Kafka/Kinesis).
  2. Real-time stream processing: streaming jobs (Flink/Spark Streaming) that enrich events, compute streaming features, and call online scoring models or rules engine.
  3. Feature Store: online low-latency store (Redis/Cassandra) for features used in real-time scoring; offline feature warehouse (BigQuery/S3) for model training.
  4. ML scoring: real-time model server (TF-Serving or custom) for low-latency inference; batch training pipelines for model updates.
  5. Rules engine: deterministic heuristics for high-precision signals (e.g., identical review text, short-lived accounts).
  6. Alerting & incident management: aggregate suspicious events, group into incidents, notify moderation UI and automated action service.
  7. Storage: OLTP store for review metadata (DynamoDB/Cassandra), search index (Elasticsearch) for content lookups, long-term S3 cold storage.
  8. Feedback loop: moderator actions and user reports are fed back into training labels and rules tuning.

Dataflow: API -> event log -> stream enrich -> feature store + scoring -> alerting + persistent storage -> moderator UI -> feedback into training.
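
A simplified consumer loop illustrating this dataflow end to end. The helper functions are stand-ins for the feature store, rules engine, model server, persistence, and alerting components described above; the alert threshold is an assumption:

```python
import json
from kafka import KafkaConsumer  # kafka-python

ALERT_THRESHOLD = 0.8  # assumed threshold; tune against labeled data

# Placeholder hooks for the components described above.
def load_features(user_id, asin):   return {"account_age_days": 2, "reviews_last_hour": 5}
def rules_score(event, features):   return 0.9 if features["reviews_last_hour"] > 3 else 0.0
def model_score(event, features):   return 0.5   # would call the model server
def persist_signal(review_id, score, features): pass
def emit_alert(event, score):       print("ALERT", event["review_id"], score)

consumer = KafkaConsumer(
    "review-events",
    bootstrap_servers="kafka:9092",                       # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode()),
)

for msg in consumer:
    event = msg.value
    features = load_features(event["user_id"], event["asin"])    # online feature store lookup
    score = max(rules_score(event, features), model_score(event, features))  # simple ensemble
    persist_signal(event["review_id"], score, features)          # snapshot for audit/explainability
    if score >= ALERT_THRESHOLD:
        emit_alert(event, score)                                  # feeds incident aggregation
```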

8. Detailed Design Decisions

  • Event log: Kafka/Kinesis for durability and replay.
  • Streaming compute: Flink for event-time processing and stateful aggregations.
  • Feature store: Redis or ScyllaDB for low-latency online features; Parquet on S3 for offline features.
  • Model serving: a combination of rules (fast, high precision) and ML models (neural or gradient-boosted trees) for probabilistic signals, with ensemble scoring that merges both.
  • Database choices: DynamoDB/Cassandra for scale and durability of review metadata; Elasticsearch for full-text search and similarity queries.
  • Grouping incidents: use graph algorithms to cluster related reviews/users (connected components) based on shared IPs, device IDs, and text similarity (see the union-find sketch after this list).
  • Explainability: store feature snapshots and rule triggers for every scored review to justify actions to moderators and users.
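
To illustrate the incident-grouping decision, a minimal union-find sketch that clusters reviews sharing an IP or device ID into connected components. In production this would run as a graph or streaming job over far richer signals; the sample data here is made up:

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_incidents(reviews):
    """Cluster reviews that share any identity signal (IP, device ID)."""
    uf = UnionFind()
    by_signal = defaultdict(list)
    for r in reviews:
        for sig in (r["ip"], r["device_id"]):
            by_signal[sig].append(r["review_id"])
    for ids in by_signal.values():
        for other in ids[1:]:
            uf.union(ids[0], other)
    clusters = defaultdict(list)
    for r in reviews:
        clusters[uf.find(r["review_id"])].append(r["review_id"])
    return list(clusters.values())

# Hypothetical sample: r1/r2 share an IP, r3 is unrelated.
sample = [
    {"review_id": "r1", "ip": "10.0.0.1", "device_id": "d1"},
    {"review_id": "r2", "ip": "10.0.0.1", "device_id": "d2"},
    {"review_id": "r3", "ip": "10.0.0.9", "device_id": "d3"},
]
print(group_incidents(sample))   # [['r1', 'r2'], ['r3']]
```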

9. Bottlenecks & Scaling

  • Latency: online scoring must be <1s for good UX; mitigate with cached features and cheap rules before full model scoring (see the tiered-scoring sketch after this list).
  • Model throughput: scale model servers horizontally and use batching for throughput; fallback to rules if model is overloaded.
  • Storage I/O: heavy writes to feature store and indexes require sharding and throttling; use async writes where possible.
  • False positives: aggressive rules can hurt sellers; require human-in-the-loop for high-impact actions and A/B test thresholds.
  • Data skew: popular ASINs or attack campaigns create hotspots; use consistent hashing and dynamic re-sharding.
  • Adaptive attackers: use adversarial testing and continuous retraining to detect new manipulation tactics.
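
A sketch of the tiered scoring mentioned in the latency and throughput bullets: cheap deterministic rules run first, the model is called only when the rules are inconclusive, and a neutral fallback applies if the model is unavailable. The rule thresholds and the fake client are illustrative assumptions:

```python
class FakeModelClient:
    """Stand-in for the real model-serving client."""
    def predict(self, features, timeout):
        return 0.3   # pretend probability of abuse

def score_review(event, features, model_client, timeout_s=0.2):
    """Cheap, high-precision rules first; call the model only when rules are inconclusive."""
    # Tier 1: deterministic rules (microseconds).
    if features.get("duplicate_text_count", 0) > 3:
        return 0.95, "rule:duplicate_text"
    if features.get("account_age_days", 999) < 1 and event["rating"] in (1, 5):
        return 0.85, "rule:new_account_extreme_rating"

    # Tier 2: ML model (milliseconds), with a neutral fallback if the model is overloaded.
    try:
        return model_client.predict(features, timeout=timeout_s), "model"
    except TimeoutError:
        return 0.5, "fallback:model_unavailable"   # re-score later by replaying the event log

print(score_review({"rating": 5}, {"account_age_days": 0}, FakeModelClient()))
# -> (0.85, 'rule:new_account_extreme_rating')
```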

10. Follow-up Questions / Extensions

  • How to combine cross-product signals (orders, returns, browsing) to improve detection?
  • Detecting coordinated networks: implement graph analytics and community detection for sockpuppet networks.
  • Privacy & compliance: how to anonymize training data and handle deletion requests?
  • Rate-limiting and throttling for suspicious accounts (a token-bucket sketch follows this list).
  • Automated appeals flow: let sellers or reviewers contest decisions and incorporate their feedback into model retraining.
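
For the rate-limiting extension, a minimal per-account token-bucket sketch; the burst capacity and refill rate are made-up values:

```python
import time

class TokenBucket:
    """Per-account bucket: allow a small burst, then a slow steady rate."""
    def __init__(self, capacity=3, refill_per_sec=1 / 3600):   # burst of 3, ~1 review/hour refill
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
print([bucket.allow() for _ in range(5)])   # [True, True, True, False, False]
```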

11. Wrap-up

Combine deterministic rules with ML models in a streaming architecture: ingest reviews -> compute features -> score -> aggregate incidents -> take action with human oversight. Ensure scalability, explainability, and feedback loops for continuous improvement while minimizing false positives.
