
Improve YouTube Recommendation Algorithm

HARD · 12 min · recommendation · machine-learning · youtube · scalability · user-experience · data-platform
Asked at: Google

Design improvements to YouTube's recommendation system focusing on relevance, diversity, freshness, trust, and scalable ML infrastructure.

1. The Question

Design improvements to YouTube's recommendation algorithm. Focus on boosting relevance and engagement while reducing the surfacing of harmful or low-quality content. Consider signals, model architecture, freshness, diversity, personalization, evaluation metrics, and how to serve recommendations at YouTube scale.

2. Clarifying Questions

  • What does "improve" mean: higher watch time, user satisfaction, retention, or trust?
  • Are we targeting all users globally or a subset (new users, logged-in users)?
  • Are we allowed to change UI/UX or only backend ranking?
  • Constraints: latency SLOs, compute budget, privacy/regulatory limits?

3. Requirements

  • Functional: produce ranked candidate videos tailored to a user context (homepage, up-next, search follow-up).
  • Non-functional: end-to-end latency of a few hundred milliseconds for user-facing requests, high availability, per-region scaling.
  • Metrics: increase long-term user satisfaction, reduce clickbait and harmful content, maintain or improve watch time.
  • Constraints: user privacy, minimal degradation during rollout, limited compute cost growth.

4. Scale Estimates

  • Users: ~2B+ monthly users globally.
  • Content: hundreds of millions of active videos, millions uploaded per day.
  • Requests: on the order of hundreds of thousands to a few million recommendation requests per second at peak across endpoints, with higher internal fan-out (see the back-of-envelope sketch after this list).
  • Storage: petabytes for raw logs, model features and embeddings; exabytes over time for full ingestion.
  • Serving: need to serve ranked lists with sub-200ms p95 latency per request.
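A quick back-of-envelope under assumed inputs to sanity-check the request-rate and log-volume figures above; the DAU ratio, fetch rate, and record size are illustrative assumptions, not published numbers.

```python
# Back-of-envelope for recommendation QPS and raw log volume.
# All inputs below are assumptions for illustration only.
monthly_users = 2_000_000_000
daily_active_fraction = 0.4          # assumed DAU/MAU ratio
fetches_per_user_per_day = 30        # home, up-next, search follow-ups (assumed)
peak_to_average = 3                  # assumed peak multiplier
bytes_per_log_record = 200           # assumed compressed interaction record size

dau = monthly_users * daily_active_fraction
requests_per_day = dau * fetches_per_user_per_day
avg_qps = requests_per_day / 86_400
peak_qps = avg_qps * peak_to_average
log_tb_per_day = requests_per_day * bytes_per_log_record / 1e12

print(f"avg QPS  ~ {avg_qps:,.0f}")               # ~278,000
print(f"peak QPS ~ {peak_qps:,.0f}")              # ~833,000
print(f"raw logs ~ {log_tb_per_day:.1f} TB/day")  # ~4.8 TB/day, petabyte-scale per year
```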

5. Data Model

  • User profile: user_id (or anonymous id), demographics (opt-in), long-term watch preferences, interest embeddings, subscription/follow lists.
  • Item metadata: video_id, creator_id, title, tags, categories, language, upload_time, duration, metadata embeddings, moderation flags.
  • Interaction logs: (user_id, video_id, timestamp, action_type, watch_fraction, device, context). Record shapes for these entities are sketched after this list.
  • Models: store user and item embeddings, feature snapshots, metadata indexes.
  • Privacy: store hashed IDs, support deletion and retention policies.
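A minimal sketch of the record shapes for these entities, using illustrative field names (an assumption for this sketch, not a production schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UserProfile:
    user_id: str                      # hashed or anonymous session id
    interest_embedding: List[float]   # dense long-term preference vector
    subscriptions: List[str] = field(default_factory=list)
    language: Optional[str] = None

@dataclass
class VideoMetadata:
    video_id: str
    creator_id: str
    title: str
    categories: List[str]
    language: str
    upload_time: int                  # epoch seconds
    duration_s: int
    embedding: List[float]            # content/metadata embedding
    moderation_flags: List[str] = field(default_factory=list)

@dataclass
class Interaction:
    user_id: str
    video_id: str
    timestamp: int
    action_type: str                  # impression | click | watch | like | share
    watch_fraction: float             # 0.0 - 1.0
    device: str
    context: str                      # home | upnext | search
```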

6. API Design

GET /recommendations?user_id={id}&context={home|upnext|search}&limit=10

Response: JSON list of { video_id, score, reason_tags, freshness_ts, metadata }
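A sample response body with illustrative values (the field names follow the shape above; the specific values and the experiment_flags field are assumptions):

```json
{
  "recommendations": [
    {
      "video_id": "vid_8f3a",
      "score": 0.87,
      "reason_tags": ["subscribed_creator", "topic:cooking"],
      "freshness_ts": 1717200000,
      "metadata": { "title": "…", "duration_s": 642, "language": "en" }
    },
    {
      "video_id": "vid_c21d",
      "score": 0.81,
      "reason_tags": ["trending_in_region"],
      "freshness_ts": 1717195000,
      "metadata": { "title": "…", "duration_s": 318, "language": "en" }
    }
  ],
  "experiment_flags": ["diversity_rerank_v2"]
}
```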

Notes:

  • Support anonymous users via session_id.
  • Include opaque reason_tags to support lightweight client-side personalization and to make personalization behavior easier to debug.
  • Provide feature flags in API for staged rollouts and A/B experiments.

7. High-Level Architecture

  1. Ingestion layer: stream interaction logs (Kafka), perform enrichment and deduplication.
  2. Offline training: batch jobs compute candidate generation models, ranking models, embeddings, and long-term features.
  3. Feature store: serve point-in-time ("time-travel") feature snapshots for training and low-latency feature lookups for online serving.
  4. Candidate generation: lightweight recall systems (retrieval by collaborative filtering, content similarity, trending).
  5. Ranking: multi-stage ranking (fast model for online scoring + heavy model for re-ranking).
  6. Serving layer: low-latency cache + online feature fetch + scoring cluster behind the API (see the serving-path sketch after this list).
  7. Feedback loop: continuous training pipelines and online evaluation/telemetry for production metrics.
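A minimal sketch of how a single request could flow through these components; the interfaces (feature_store, ann_index, light_ranker, reranker, cache) are hypothetical stand-ins for illustration:

```python
from typing import Any, Dict, List

def serve_recommendations(user_id: str, context: str, limit: int,
                          feature_store: Any, ann_index: Any,
                          light_ranker: Any, reranker: Any, cache: Any) -> List[Dict]:
    """Illustrative serving path: cache -> recall -> pre-rank -> rank/re-rank."""
    cache_key = (user_id, context)
    cached = cache.get(cache_key)
    if cached is not None:
        return cached[:limit]

    # Online feature fetch: user embedding plus recent-interaction features.
    user_features = feature_store.get_user_features(user_id)

    # Recall: ANN retrieval over video embeddings (other recall sources would be merged here).
    candidates = ann_index.search(user_features["embedding"], top_k=1000)

    # Pre-rank: cheap model trims the candidate set before heavy scoring.
    shortlist = light_ranker.score_and_trim(user_features, candidates, keep=200)

    # Rank + re-rank: heavier model plus diversity/freshness/safety rules.
    ranked = reranker.rank(user_features, shortlist, context=context)

    cache.set(cache_key, ranked, ttl_s=60)   # short TTL keeps results reasonably fresh
    return ranked[:limit]
```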

8. Detailed Design Decisions

  • Multi-stage pipeline: recall -> pre-rank -> rank -> re-rank to balance latency and model complexity.
  • Use representation learning: dense embeddings for users and videos produced by two-tower models, enabling fast ANN retrieval (see the two-tower sketch after this list).
  • Mix objectives: combine short-term engagement with long-term satisfaction via multi-objective loss or counterfactual policy evaluation.
  • Diversity & debiasing: implement determinantal point processes (DPP) or constrained optimization to enforce topical diversity and limit creator overrepresentation (a simpler greedy re-rank is sketched after this list).
  • Freshness: apply time-decay features and an exploration budget to surface new uploads.
  • Safety: integrate moderation signals and a trust score to demote harmful content; apply hard blocks for policy violations.
  • Personalization vs. trends: ensemble of signals with dynamic weighting based on session intent and cold-start heuristics.
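A hedged sketch of the two-tower retrieval idea, assuming PyTorch and an in-batch-negatives softmax loss (a common simplification, not necessarily YouTube's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """User and video towers map IDs into a shared embedding space;
    the dot product of the two embeddings is the retrieval score."""
    def __init__(self, n_users: int, n_videos: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.video_emb = nn.Embedding(n_videos, dim)
        self.user_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.video_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_ids: torch.Tensor, video_ids: torch.Tensor):
        u = F.normalize(self.user_mlp(self.user_emb(user_ids)), dim=-1)
        v = F.normalize(self.video_mlp(self.video_emb(video_ids)), dim=-1)
        return u, v

def in_batch_softmax_loss(u: torch.Tensor, v: torch.Tensor, temperature: float = 0.05):
    """In-batch negatives: each user's watched video is the diagonal entry."""
    logits = (u @ v.T) / temperature
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)
```

At serving time only the user tower runs online; video-tower embeddings are precomputed in batch and loaded into an ANN index (for example ScaNN or FAISS) for fast recall.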
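DPPs are one option for diversity; a simpler greedy maximal-marginal-relevance (MMR) style re-rank, sketched below under assumed inputs, captures the same relevance-vs-redundancy trade-off and also folds in the time-decay freshness boost described above:

```python
import math
import time

def rerank(candidates, sim, lambda_div=0.3, freshness_half_life_h=72, top_k=20):
    """Greedy MMR-style re-rank: trade relevance against similarity to already
    selected items, with an exponential time-decay freshness boost.

    candidates: dicts with 'video_id', 'score', 'upload_ts', 'embedding' (assumed shape)
    sim(a, b):  similarity between two embeddings in [0, 1]
    """
    now = time.time()
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_k:
        best, best_val = None, float("-inf")
        for c in remaining:
            age_h = (now - c["upload_ts"]) / 3600
            freshness = math.exp(-math.log(2) * age_h / freshness_half_life_h)
            redundancy = max((sim(c["embedding"], s["embedding"]) for s in selected),
                             default=0.0)
            value = (1 - lambda_div) * (c["score"] + 0.1 * freshness) - lambda_div * redundancy
            if value > best_val:
                best, best_val = c, value
        selected.append(best)
        remaining.remove(best)
    return selected
```

Creator-overrepresentation caps can be enforced in the same loop by skipping candidates whose creator already appears more than N times in the selected list.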

9. Bottlenecks & Scaling

  • Candidate generation at scale: mitigate by sharding ANN indices, using per-region indices, and hybrid recall (embedding + inverted indexes).
  • Feature freshness vs. cost: use hot-path feature caching for recent interactions and fall back to approximations for older features (see the cache sketch after this list).
  • Model training time: accelerate with incremental / online updates and prioritized sampling for critical slices.
  • Latency: keep heavy re-rank offline or asynchronous; use smaller distilled models online.
  • Evaluation signal latency: rely on nearline metrics for rapid iteration and long-run causal metrics (experiment cohorts) for final decisions.
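A minimal in-process sketch of hot-path feature caching with a stale fallback, using hypothetical interfaces (a production system would use a distributed cache such as Redis or Memcache):

```python
import time
from typing import Any, Callable, Dict, Tuple

class FeatureCache:
    """Serve recently updated features from a hot-path cache; when the fresh
    value is missing or expired, fall back to an approximate/stale snapshot."""
    def __init__(self, ttl_s: int = 300):
        self.ttl_s = ttl_s
        self._store: Dict[str, Tuple[Any, float]] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = (value, time.time())

    def get(self, key: str, fallback: Callable[[str], Any]) -> Any:
        entry = self._store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl_s:
            return entry[0]      # fresh hot-path hit
        return fallback(key)     # miss or stale: approximate (e.g., batch snapshot) lookup
```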

10. Follow-up Questions / Extensions

  • How to evaluate long-term satisfaction vs. immediate engagement? Describe experiments and instrumentation.
  • Discuss cold-start for new creators and new users.
  • Propose approaches for cross-device personalization and privacy-preserving personalization (federated learning, differential privacy).
  • How to handle multilingual and regional content fairness?

11. Wrap-up

Propose a phased rollout: start with instrumentation and offline evaluation, then A/B test improvements (safety, diversity, freshness), and finally scale with monitoring, cost controls, and governance. Highlight trade-offs between latency, relevance, and safety.
