
Detect Weapon Ads with ML

HARD · 12 min · ml · system-design · ads-moderation · multimodal · scalability · availability
Asked at: Meta

End-to-end ML system to detect ads selling weapons across web and mobile (text, image, video). Minimize false negatives; meet 99.9% availability and scale to ~100K ads/day.

1. The Question

Design an end-to-end machine learning system to detect ads that are selling weapons. The system should ingest ads from Web and Mobile, process text, images, and video (video details skipped for storage estimates), support multiple languages, and prioritize minimizing false negatives (i.e., high recall for weapon-selling ads). It must be updatable (retrain/deploy/validate), achieve ~99.9% availability with cost awareness, and scale to roughly 100K ads/day.

2. Clarifying Questions

  • Which ad fields are available via the Ad API? (e.g., title, description, images, video links, seller metadata, contact info)
  • Do we need real-time blocking or near-real-time flagging for review? (assume both: prediction API for instant decisions + async review flow)
  • What actions should follow a positive detection? (auto-take-down, escalate to human review, shadow block)
  • Acceptable latency for a prediction? (assume <500ms p95 for online)
  • Are ground-truth labels available and how frequently? (assume periodic labeling + user reports)
  • Any legal/regulatory requirements per region? (assume yes — must support audits and explainability)

3. Requirements

Functional Requirements:

  • Track ads on Web and Mobile.
  • Detect if an ad is selling weapons (text, images, video).
  • Support multiple languages.
  • API endpoints: Predict Ad (POST), Log Prediction (POST), Trigger Training (POST), Deploy Model (POST).
  • Ingest data from existing Ad API.

Non-Functional Requirements:

  • Minimize false negatives (high recall); balance precision to limit review load.
  • Model performance tracked by precision/recall.
  • Availability: 99.9% (cost-aware).
  • Scalability: ~100K ads/day.
  • Updatability: retrain, validate, deploy with canary/rollback.
  • Online feature store: 1 month of ad data stored (NoSQL).
  • Prediction logging throughput (as specified in the prompt): "4/sec * 3600" entries per day; see the scale estimates for how this is interpreted.
  • Skip detailed video storage estimates for interview.

4. Scale Estimates

Assumptions and derived estimates:

  • Ads/day: 100,000 ads.
  • Peak arrival rate: assuming ~10,000 ads land in the busiest hour, 10,000/3600 ≈ 2.8 QPS at peak; the daily average is 100,000/86,400 ≈ 1.2 QPS. Plan headroom for up to 100 QPS to absorb bursts and alternative interpretations.
  • Online prediction QPS target: plan for 5-100 QPS to handle bursts; p95 latency budget <500ms.
  • Prediction logging: the prompt states "4/sec * 3600" entries per day. Literal interpretation: 4 * 3600 = 14,400 entries/day. If 4/sec is sustained for a full day: 4 * 86,400 = 345,600/day. Design to support at least 350K logged events/day.
  • Online feature store window: 1 month of ad data. If 100K ads/day -> ~3M ad records retained in online store.
  • Storage and inference: model serving instances sized to keep 99.9% uptime with autoscaling; use GPU/CPU mix depending on model complexity.

Safety headroom: provision autoscaling for 3x baseline to tolerate spikes and degradation.
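
The arithmetic behind these estimates can be sanity-checked with a few lines of Python; the peak-hour share and per-vector size below are assumptions, not values given in the prompt.

```python
# Back-of-the-envelope capacity math matching the assumptions above.
ADS_PER_DAY = 100_000
PEAK_HOUR_ADS = 10_000          # assumed share of daily volume in the busiest hour
LOG_RATE_PER_SEC = 4            # sustained prediction-log rate (upper-bound interpretation)
RETENTION_DAYS = 30             # online feature store window
FEATURE_VECTOR_BYTES = 4_000    # assumed ~4 KB per serialized feature vector

avg_qps = ADS_PER_DAY / 86_400                  # ~1.2 QPS average
peak_qps = PEAK_HOUR_ADS / 3_600                # ~2.8 QPS at peak
logged_per_day = LOG_RATE_PER_SEC * 86_400      # 345,600 events/day upper bound
online_rows = ADS_PER_DAY * RETENTION_DAYS      # ~3M feature rows retained
online_store_gb = online_rows * FEATURE_VECTOR_BYTES / 1e9  # ~12 GB

print(f"avg {avg_qps:.1f} QPS, peak {peak_qps:.1f} QPS")
print(f"logged events/day (upper bound): {logged_per_day:,}")
print(f"online store: ~{online_rows/1e6:.1f}M rows, ~{online_store_gb:.0f} GB")
```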

5. Data Model

Primary objects and their fields:

  • Ad:

    • ad_id (string)
    • platform (web|mobile)
    • language (ISO code)
    • title (text)
    • description (text)
    • media: [ { type: image|video, url, mime, extracted_text (OCR) } ]
    • seller_metadata: { user_id, region, account_age }
    • timestamps: { created_at, updated_at }
  • PredictionRecord:

    • prediction_id
    • ad_id
    • model_version
    • timestamp
    • score (0-1)
    • label (weapon|non-weapon|uncertain)
    • decision (block|flag|allow)
    • features_snapshot (feature vector reference)
  • Feature Vector (stored in online feature store NoSQL):

    • ad_id, computed_text_embeddings, image_embeddings, ocr_text, metadata_features, timestamp

Notes:

  • Use compact serialized feature vectors for quick retrieval.
  • Store 1 month of online feature data; archive older data to offline stores for retraining.
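
A minimal sketch of these records as Python dataclasses, assuming string timestamps and a free-form dict for seller metadata; field names follow the lists above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Media:
    type: str                              # "image" | "video"
    url: str
    mime: str
    extracted_text: Optional[str] = None   # OCR output, filled in by preprocessing

@dataclass
class Ad:
    ad_id: str
    platform: str                          # "web" | "mobile"
    language: str                          # ISO code, e.g. "en"
    title: str
    description: str
    media: list[Media] = field(default_factory=list)
    seller_metadata: dict = field(default_factory=dict)   # user_id, region, account_age
    created_at: Optional[str] = None
    updated_at: Optional[str] = None

@dataclass
class PredictionRecord:
    prediction_id: str
    ad_id: str
    model_version: str
    timestamp: str
    score: float                           # 0.0 - 1.0
    label: str                             # "weapon" | "non-weapon" | "uncertain"
    decision: str                          # "block" | "flag" | "allow"
    features_snapshot: str                 # key/reference into the online feature store
```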

6. API Design

Provide simple POST endpoints:

  • POST /predict

    • Body: { ad_id, title, description, media: [urls], platform, language, seller_metadata }
    • Response: { ad_id, model_version, score, label, decision, latency_ms }
  • POST /log_prediction

    • Body: { ad_id, model_version, score, label, decision, features_snapshot, timestamp }
    • Response: { status: ok }
  • POST /trigger_training

    • Body: { training_job_name, config: { data_range, hyperparams, validate: true } }
    • Response: { job_id, status }
  • POST /deploy_model

    • Body: { model_version, strategy: canary|rollout, traffic_percent }
    • Response: { deployment_id, status }

Design notes:

  • Use authentication (service tokens) and rate limiting.
  • /predict should be idempotent keyed on ad_id to avoid duplicate processing.
  • Log predictions asynchronously to avoid increasing p95 latency.
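
A minimal sketch of the /predict endpoint, assuming FastAPI (any HTTP framework works); score_ad and log_prediction are hypothetical placeholders for the model ensemble and the asynchronous log write described in the architecture section.

```python
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    ad_id: str
    title: str
    description: str
    media: list[str] = []
    platform: str
    language: str
    seller_metadata: dict = {}

def score_ad(req: PredictRequest):
    """Placeholder for the model ensemble (see High-Level Architecture)."""
    return 0.97, "weapon", "block", "model-v12"

def log_prediction(ad_id, model_version, score, label, decision):
    """Placeholder for the append-only write to the prediction log stream."""
    pass

@app.post("/predict")
async def predict(req: PredictRequest, background_tasks: BackgroundTasks):
    score, label, decision, model_version = score_ad(req)
    # Log asynchronously so the write does not add to the p95 prediction latency.
    background_tasks.add_task(log_prediction, req.ad_id, model_version,
                              score, label, decision)
    return {"ad_id": req.ad_id, "model_version": model_version,
            "score": score, "label": label, "decision": decision}
```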

7. High-Level Architecture

High-level flow:

  • Ad API -> Ingest Service

    • performs lightweight validation, enrichment (language detect), and synchronous call to Predict API.
  • Predict API (stateless microservice)

    • Preprocessor (text cleaning, OCR for images, frame sampling for video -> video storage skipped)
    • Feature fetcher (reads recent features from online feature store NoSQL)
    • Model Ensemble: multimodal transformer/text classifier + image classifier + OCR classifier + heuristics/rules
    • Decision logic: ensemble scoring tuned for low false negatives, thresholding, policy rules (a decision-logic sketch follows at the end of this section)
    • Returns decision; triggers async Log Prediction
  • Log Service

    • Append prediction to logging store (append-only), stream events to offline store and monitoring pipelines
  • Online Feature Store (NoSQL)

    • Stores feature vectors for 1 month; supports fast key-value lookups by ad_id
  • Model Training Pipeline (offline)

    • Batch data from logs + labeled data -> feature extraction -> training (multimodal) -> validation -> model registry
  • Model Deployment Service

    • Deploy model versions with canary/rollout; shadow traffic for validation
  • Monitoring & Alerting

    • Model metrics (precision/recall), latency, throughput, drift detectors
  • Human Review Queue

    • For uncertain/high-risk detections, route to moderators

Scaling and availability:

  • Front doors behind load balancers and autoscaling groups
  • Use caching for feature fetches
  • Use regional replicas for availability 99.9%
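
A minimal sketch of the decision logic from the Predict API flow above, assuming a single ensemble score in [0, 1] plus rule-based signals; the threshold values are illustrative and would be tuned on validation data to hit the recall target.

```python
BLOCK_THRESHOLD = 0.90   # high confidence -> auto-block
REVIEW_THRESHOLD = 0.40  # deliberately low so few weapon ads slip through as "allow"

def decide(model_score: float, rule_hits: list[str]) -> tuple[str, str]:
    """Return (label, decision), biased toward recall."""
    # Rule-based weak detectors (keywords, contact/price patterns) can only
    # escalate the score, never lower it.
    score = max(model_score, 0.5 if rule_hits else 0.0)

    if score >= BLOCK_THRESHOLD:
        return "weapon", "block"
    if score >= REVIEW_THRESHOLD:
        # Borderline: flag and route to the human review queue.
        return "uncertain", "flag"
    return "non-weapon", "allow"

# Example: a mid-confidence model score plus a keyword hit still gets reviewed.
print(decide(0.35, ["keyword:firearm_for_sale"]))   # -> ('uncertain', 'flag')
```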

8. Detailed Design Decisions

  • Multimodal model choice: use a fusion approach combining a text transformer (fine-tuned multilingual BERT/XLM), a CNN/vision transformer for images, and a lightweight video frame model; concatenate embeddings and use a small classifier head (a minimal sketch follows this list).
  • Emphasize recall: set thresholds to reduce false negatives; route borderline cases to human review to contain precision loss.
  • Use rule-based signals (keywords, contact info patterns, price patterns) as high-recall weak detectors combined in ensemble.
  • Feature store: NoSQL key-value store (e.g., DynamoDB/Cassandra) for 1-month retention and low-latency reads.
  • Serving: model servers in microservices with autoscaling (CPU for text-only, GPU for heavy multimodal). Use batching where possible for throughput-cost tradeoff.
  • Retraining: scheduled retrain (weekly) + event-driven retrain when drift detected. Use validation with holdout and offline metrics.
  • Deployment: model registry + canary rollout (start 1-5% traffic), compare metrics (recall/precision) and rollback if regressions.
  • Explainability: log top contributing features and model attention snippets to support appeals/audits.
  • Privacy and compliance: redact PII before storage; region-aware data handling.
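
A minimal PyTorch sketch of the fusion head described in the first bullet, assuming precomputed text and image embeddings; the embedding dimensions and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates text and image embeddings and scores P(weapon ad)."""

    def __init__(self, text_dim: int = 768, image_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),          # single logit
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_emb, image_emb], dim=-1)
        return torch.sigmoid(self.head(fused)).squeeze(-1)

# Example with dummy embeddings; in production these come from the fine-tuned
# multilingual text encoder and the vision model.
model = FusionClassifier()
scores = model(torch.randn(4, 768), torch.randn(4, 512))   # shape: (4,)
```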

9. Bottlenecks & Scaling

  • Preprocessing (OCR, image decoding, video frame extraction) can be CPU/GPU intensive. Offload heavy ops to async workers; perform quick heuristics inline.
  • Model inference cost: multimodal models are expensive. Mitigate with model tiers: fast lightweight model for high throughput + heavyweight for uncertain cases.
  • Online feature store read/write hot spots: use partitioning and caching (Redis) for hottest keys.
  • Logging throughput and storage: use append-only streams (Kafka) and compacted topics; batch writes to long-term stores.
  • Training data pipeline: ensure scalable ETL and bounded resource usage during retrains.
  • Availability: design regional failover and replicate model endpoints; use health checks and circuit breakers.

Mitigations:

  • Two-tier inferencing (cheap filter then heavy model), async enrichments, autoscaling, caching, and backpressure for ingestion.
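
A minimal sketch of the two-tier pattern, assuming cheap_model and heavy_model are callables returning a score in [0, 1]; the confidence band is illustrative.

```python
CHEAP_ALLOW_BELOW = 0.05   # confidently clean -> skip the heavy model
CHEAP_BLOCK_ABOVE = 0.95   # confidently weapon -> skip the heavy model

def two_tier_score(ad, cheap_model, heavy_model) -> float:
    """Run a cheap screening model first; pay for the multimodal model only
    when the cheap model is uncertain."""
    score = cheap_model(ad)
    if score < CHEAP_ALLOW_BELOW or score > CHEAP_BLOCK_ABOVE:
        return score
    return heavy_model(ad)
```

Because the uncertain band is narrow, most traffic is served by the cheap tier, which keeps GPU cost bounded while preserving recall on the hard cases.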

10. Follow-up Questions / Extensions

  • How to expand to other illicit goods (drugs, stolen property)?
  • Integrate user reports and feedback loops for online labeling.
  • Add adversarial robustness and content obfuscation detection (image steganography, text obfuscation).
  • Region-specific policies and automated appeals workflows.
  • Add active learning to prioritize labeling uncertain examples to improve recall quickly.
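
For the active-learning extension, a minimal uncertainty-sampling sketch over logged PredictionRecords; the labeling budget is an assumed parameter.

```python
def select_for_labeling(prediction_records, budget: int = 100):
    """Pick the logged predictions whose scores are closest to 0.5,
    i.e. where the model is least certain, for human labeling."""
    ranked = sorted(prediction_records, key=lambda r: abs(r.score - 0.5))
    return ranked[:budget]
```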

11. Wrap-up

Build a multimodal, ensemble-based detection pipeline with a fast prediction API, online feature store (NoSQL), logging and retraining pipelines, and robust deployment practices (canary, shadow). Prioritize minimizing false negatives via thresholds, rule-based detectors, human-in-the-loop review, and continuous monitoring for model drift while meeting availability and cost constraints.
