
Improve YouTube Recommendation Algorithm

HARD · 12 min · recommendation · machine-learning · youtube · scalability · user-experience · data-platform
Asked at: Google

Design improvements to YouTube's recommendation system focusing on relevance, diversity, freshness, trust, and scalable ML infrastructure.

1. The Question

Design improvements to YouTube's recommendation algorithm. Focus on boosting relevance and engagement while reducing the surfacing of harmful or low-quality content. Consider signals, model architecture, freshness, diversity, personalization, evaluation metrics, and how to serve recommendations at YouTube scale.

2. Clarifying Questions

  • What does "improve" mean: higher watch time, user satisfaction, retention, or trust?
  • Are we targeting all users globally or a subset (new users, logged-in users)?
  • Are we allowed to change UI/UX or only backend ranking?
  • Constraints: latency SLOs, compute budget, privacy/regulatory limits?

3. Requirements

  • Functional: produce ranked candidate videos tailored to a user context (homepage, up-next, search follow-up).
  • Non-functional: end-to-end latency of a few hundred milliseconds for user-facing requests, high availability, per-region scaling.
  • Metrics: increase long-term user satisfaction, reduce clickbait and harmful content, maintain or improve watch time.
  • Constraints: user privacy, minimal degradation during rollout, limited compute cost growth.

4. Scale Estimates

  • Users: ~2B+ monthly users globally.
  • Content: hundreds of millions of active videos, millions uploaded per day.
  • Requests: on the order of hundreds of thousands to a few million recommendation requests per second at peak across endpoints, with higher internal fan-out (see the back-of-envelope sketch after this list).
  • Storage: petabytes for raw logs, model features and embeddings; exabytes over time for full ingestion.
  • Serving: need to serve ranked lists with sub-200ms p95 latency per request.
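A quick back-of-envelope under assumed inputs to sanity-check the request-rate and log-volume figures above; the DAU ratio, fetch rate, and record size are illustrative assumptions, not published numbers.

```python
# Back-of-envelope for recommendation QPS and raw log volume.
# All inputs below are assumptions for illustration only.
monthly_users = 2_000_000_000
daily_active_fraction = 0.4          # assumed DAU/MAU ratio
fetches_per_user_per_day = 30        # home, up-next, search follow-ups (assumed)
peak_to_average = 3                  # assumed peak multiplier
bytes_per_log_record = 200           # assumed compressed interaction record size

dau = monthly_users * daily_active_fraction
requests_per_day = dau * fetches_per_user_per_day
avg_qps = requests_per_day / 86_400
peak_qps = avg_qps * peak_to_average
log_tb_per_day = requests_per_day * bytes_per_log_record / 1e12

print(f"avg QPS  ~ {avg_qps:,.0f}")               # ~278,000
print(f"peak QPS ~ {peak_qps:,.0f}")              # ~833,000
print(f"raw logs ~ {log_tb_per_day:.1f} TB/day")  # ~4.8 TB/day, petabyte-scale per year
```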

5. Data Model

  • User profile: user_id (or anonymous id), demographics (opt-in), long-term watch preferences, interest embeddings, subscription/follow lists.
  • Item metadata: video_id, creator_id, title, tags, categories, language, upload_time, duration, metadata embeddings, moderation flags.
  • Interaction logs: (user_id, video_id, timestamp, action_type, watch_fraction, device, context). Record shapes for these entities are sketched after this list.
  • Models: store user and item embeddings, feature snapshots, metadata indexes.
  • Privacy: store hashed IDs, support deletion and retention policies.
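A minimal sketch of the record shapes for these entities, using illustrative field names (an assumption for this sketch, not a production schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UserProfile:
    user_id: str                      # hashed or anonymous session id
    interest_embedding: List[float]   # dense long-term preference vector
    subscriptions: List[str] = field(default_factory=list)
    language: Optional[str] = None

@dataclass
class VideoMetadata:
    video_id: str
    creator_id: str
    title: str
    categories: List[str]
    language: str
    upload_time: int                  # epoch seconds
    duration_s: int
    embedding: List[float]            # content/metadata embedding
    moderation_flags: List[str] = field(default_factory=list)

@dataclass
class Interaction:
    user_id: str
    video_id: str
    timestamp: int
    action_type: str                  # impression | click | watch | like | share
    watch_fraction: float             # 0.0 - 1.0
    device: str
    context: str                      # home | upnext | search
```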

6. API Design

GET /recommendations?user_id={id}&context={home|upnext|search}&limit=10

Response: JSON list of { video_id, score, reason_tags, freshness_ts, metadata }
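A sample response body with illustrative values (the field names follow the shape above; the specific values and the experiment_flags field are assumptions):

```json
{
  "recommendations": [
    {
      "video_id": "vid_8f3a",
      "score": 0.87,
      "reason_tags": ["subscribed_creator", "topic:cooking"],
      "freshness_ts": 1717200000,
      "metadata": { "title": "…", "duration_s": 642, "language": "en" }
    },
    {
      "video_id": "vid_c21d",
      "score": 0.81,
      "reason_tags": ["trending_in_region"],
      "freshness_ts": 1717195000,
      "metadata": { "title": "…", "duration_s": 318, "language": "en" }
    }
  ],
  "experiment_flags": ["diversity_rerank_v2"]
}
```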

Notes:

  • Support anonymous users via session_id.
  • Include opaque reason_tags to support lightweight client-side personalization and to make personalization behavior easier to debug.
  • Provide feature flags in API for staged rollouts and A/B experiments.

7. High-Level Architecture

  1. Ingestion layer: stream interaction logs (Kafka), perform enrichment and deduplication.
  2. Offline training: batch jobs compute candidate generation models, ranking models, embeddings, and long-term features.
  3. Feature store: serve point-in-time ("time-travel") feature snapshots for training and low-latency feature lookups for online serving.
  4. Candidate generation: lightweight recall systems (retrieval by collaborative filtering, content similarity, trending).
  5. Ranking: multi-stage ranking (fast model for online scoring + heavy model for re-ranking).
  6. Serving layer: low-latency cache + online feature fetch + scoring cluster behind the API (see the serving-path sketch after this list).
  7. Feedback loop: continuous training pipelines and online evaluation/telemetry for production metrics.
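A minimal sketch of how a single request could flow through these components; the interfaces (feature_store, ann_index, light_ranker, reranker, cache) are hypothetical stand-ins for illustration:

```python
from typing import Any, Dict, List

def serve_recommendations(user_id: str, context: str, limit: int,
                          feature_store: Any, ann_index: Any,
                          light_ranker: Any, reranker: Any, cache: Any) -> List[Dict]:
    """Illustrative serving path: cache -> recall -> pre-rank -> rank/re-rank."""
    cache_key = (user_id, context)
    cached = cache.get(cache_key)
    if cached is not None:
        return cached[:limit]

    # Online feature fetch: user embedding plus recent-interaction features.
    user_features = feature_store.get_user_features(user_id)

    # Recall: ANN retrieval over video embeddings (other recall sources would be merged here).
    candidates = ann_index.search(user_features["embedding"], top_k=1000)

    # Pre-rank: cheap model trims the candidate set before heavy scoring.
    shortlist = light_ranker.score_and_trim(user_features, candidates, keep=200)

    # Rank + re-rank: heavier model plus diversity/freshness/safety rules.
    ranked = reranker.rank(user_features, shortlist, context=context)

    cache.set(cache_key, ranked, ttl_s=60)   # short TTL keeps results reasonably fresh
    return ranked[:limit]
```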

8. Detailed Design Decisions

  • Multi-stage pipeline: recall -> pre-rank -> rank -> re-rank to balance latency and model complexity.
  • Use representation learning: dense embeddings for users and videos produced by two-tower models, enabling fast ANN retrieval (see the two-tower sketch after this list).
  • Mix objectives: combine short-term engagement with long-term satisfaction via multi-objective loss or counterfactual policy evaluation.
  • Diversity & debiasing: implement determinantal point processes (DPP) or constrained optimization to enforce topical diversity and limit creator overrepresentation (a simpler greedy re-rank is sketched after this list).
  • Freshness: apply time-decay features and an exploration budget to surface new uploads.
  • Safety: integrate moderation signals and a trust score to demote harmful content; apply hard blocks for policy violations.
  • Personalization vs. trends: ensemble of signals with dynamic weighting based on session intent and cold-start heuristics.
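A hedged sketch of the two-tower retrieval idea, assuming PyTorch and an in-batch-negatives softmax loss (a common simplification, not necessarily YouTube's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """User and video towers map IDs into a shared embedding space;
    the dot product of the two embeddings is the retrieval score."""
    def __init__(self, n_users: int, n_videos: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.video_emb = nn.Embedding(n_videos, dim)
        self.user_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.video_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_ids: torch.Tensor, video_ids: torch.Tensor):
        u = F.normalize(self.user_mlp(self.user_emb(user_ids)), dim=-1)
        v = F.normalize(self.video_mlp(self.video_emb(video_ids)), dim=-1)
        return u, v

def in_batch_softmax_loss(u: torch.Tensor, v: torch.Tensor, temperature: float = 0.05):
    """In-batch negatives: each user's watched video is the diagonal entry."""
    logits = (u @ v.T) / temperature
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)
```

At serving time only the user tower runs online; video-tower embeddings are precomputed in batch and loaded into an ANN index (for example ScaNN or FAISS) for fast recall.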
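DPPs are one option for diversity; a simpler greedy maximal-marginal-relevance (MMR) style re-rank, sketched below under assumed inputs, captures the same relevance-vs-redundancy trade-off and also folds in the time-decay freshness boost described above:

```python
import math
import time

def rerank(candidates, sim, lambda_div=0.3, freshness_half_life_h=72, top_k=20):
    """Greedy MMR-style re-rank: trade relevance against similarity to already
    selected items, with an exponential time-decay freshness boost.

    candidates: dicts with 'video_id', 'score', 'upload_ts', 'embedding' (assumed shape)
    sim(a, b):  similarity between two embeddings in [0, 1]
    """
    now = time.time()
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_k:
        best, best_val = None, float("-inf")
        for c in remaining:
            age_h = (now - c["upload_ts"]) / 3600
            freshness = math.exp(-math.log(2) * age_h / freshness_half_life_h)
            redundancy = max((sim(c["embedding"], s["embedding"]) for s in selected),
                             default=0.0)
            value = (1 - lambda_div) * (c["score"] + 0.1 * freshness) - lambda_div * redundancy
            if value > best_val:
                best, best_val = c, value
        selected.append(best)
        remaining.remove(best)
    return selected
```

Creator-overrepresentation caps can be enforced in the same loop by skipping candidates whose creator already appears more than N times in the selected list.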

9. Bottlenecks & Scaling

  • Candidate generation at scale: mitigate by sharding ANN indices, using per-region indices, and hybrid recall (embedding + inverted indexes).
  • Feature freshness vs. cost: use hot-path feature caching for recent interactions and fall back to approximations for older features (see the cache sketch after this list).
  • Model training time: accelerate with incremental / online updates and prioritized sampling for critical slices.
  • Latency: keep heavy re-rank offline or asynchronous; use smaller distilled models online.
  • Evaluation signal latency: rely on nearline metrics for rapid iteration and long-run causal metrics (experiment cohorts) for final decisions.
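A minimal in-process sketch of hot-path feature caching with a stale fallback, using hypothetical interfaces (a production system would use a distributed cache such as Redis or Memcache):

```python
import time
from typing import Any, Callable, Dict, Tuple

class FeatureCache:
    """Serve recently updated features from a hot-path cache; when the fresh
    value is missing or expired, fall back to an approximate/stale snapshot."""
    def __init__(self, ttl_s: int = 300):
        self.ttl_s = ttl_s
        self._store: Dict[str, Tuple[Any, float]] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = (value, time.time())

    def get(self, key: str, fallback: Callable[[str], Any]) -> Any:
        entry = self._store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl_s:
            return entry[0]      # fresh hot-path hit
        return fallback(key)     # miss or stale: approximate (e.g., batch snapshot) lookup
```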

10. Follow-up Questions / Extensions

  • How to evaluate long-term satisfaction vs. immediate engagement? Describe experiments and instrumentation.
  • Discuss cold-start for new creators and new users.
  • Propose approaches for cross-device personalization and privacy-preserving personalization (federated learning, differential privacy).
  • How to handle multilingual and regional content fairness?

11. Wrap-up

Propose a phased rollout: start with instrumentation and offline evaluation, then A/B test improvements (safety, diversity, freshness), and finally scale with monitoring, cost controls, and governance. Highlight trade-offs between latency, relevance, and safety.
