
Evaluation Framework for Ads Ranking

Difficulty: Medium | 12 min read | Tags: ads ranking, evaluation, metrics, ML, A/B testing, attribution, privacy
Asked at: Meta

A framework to evaluate and optimize ads ranking using KPIs, A/B testing, ML model evaluation, attribution, and ethical/privacy considerations.

1. The Question

Design an evaluation framework for ads ranking. The framework should define objectives and KPIs, describe required data, propose evaluation methods (offline and online), propose experiment strategies (A/B testing, holdouts), outline ML evaluation and attribution approaches, address scale and data quality, and include ethical/privacy considerations and reporting.

2. Clarifying Questions

  • What business objective dominates (revenue, ROAS, engagement, long-term retention)?
  • Are we evaluating ad-level ranking or campaign-level allocation?
  • What surfaces (newsfeed, stories, search, sidebar) and placements must be covered?
  • Do we have ground-truth conversions or use proxy signals (clicks, viewability)?
  • What latency and privacy constraints exist (PII restrictions, differential privacy)?
  • Is offline historical data available with treatment labels and impression logs?

3. Requirements

Functional:

  • Provide per-ad and per-segment KPIs (CTR, CVR, ROAS, Quality Score).
  • Support offline evaluation (batch) and online experiments (A/B, multi-armed bandits).
  • Produce dashboards and automated alerts for KPI regressions.

Non-functional:

  • Scalable to billions of impressions/day.
  • Low-cost reproducible offline pipelines.
  • Privacy-preserving (compliant with GDPR/CCPA).
  • Explainable metrics for stakeholders.

4. Scale Estimates

  • Impressions: 1B+/day (example at Meta scale).
  • Active ads: 10M+ creatives running concurrently.
  • Unique users: 100M+ daily active users.
  • Events: 10–100B raw events/day (impressions, clicks, conversions).
  • Storage: petabyte-scale event stores; aggregated telemetry in data warehouses.

Implication: sampling, aggregation, and streaming pipelines are required.
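A quick back-of-envelope calculation makes the implication concrete; the event count and per-event size below are assumptions picked from within the ranges above, not measured values:

```python
# Back-of-envelope storage estimate (assumed figures, not measurements).
events_per_day = 50e9    # midpoint of the 10-100B range above
bytes_per_event = 500    # assumed average serialized event size

daily_tb = events_per_day * bytes_per_event / 1e12
yearly_pb = daily_tb * 365 / 1000

print(f"~{daily_tb:.0f} TB/day raw, ~{yearly_pb:.1f} PB/year before compression")
# -> ~25 TB/day raw, ~9.1 PB/year before compression
```

At that volume, keeping full-fidelity raw events queryable indefinitely is impractical, which is what pushes the design toward sampling, rollups, and streaming aggregation.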

5. Data Model

Key entities:

  • impression {id, user_id_hash, ad_id, campaign_id, placement, timestamp, context_features}
  • click {impression_id, timestamp}
  • conversion {event_type, value, timestamp, attribution_hint}
  • creative {ad_id, headline, image_features, landing_url, advertiser_id}
  • user_features {user_id_hash, demographics, segments, recency}
  • model_preds {ad_id, user_id_hash, pred_ctr, pred_cvr, pred_value}

Aggregates:

  • hourly/daily rollups by ad/campaign/placement/segment with impressions, clicks, conversions, spend, revenue, and ROAS (a rollup sketch follows the notes below).

Notes:

  • Use hashed user IDs and join keys to avoid PII.
  • Keep raw event logs immutable and build derived tables for fast queries.
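A minimal sketch of deriving the hourly rollups described above, assuming impressions, clicks, and conversions have already been joined into one impression-level table; the column names (clicked, converted, spend, revenue) are illustrative:

```python
import pandas as pd

def hourly_rollup(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate an impression-level table into hourly ad/placement rollups.

    Expected columns (illustrative): timestamp (epoch seconds), ad_id,
    campaign_id, placement, clicked (0/1), converted (0/1), spend, revenue.
    """
    events = events.assign(
        hour=pd.to_datetime(events["timestamp"], unit="s").dt.floor("h")
    )
    agg = (
        events.groupby(["hour", "ad_id", "campaign_id", "placement"])
        .agg(
            impressions=("ad_id", "size"),
            clicks=("clicked", "sum"),
            conversions=("converted", "sum"),
            spend=("spend", "sum"),
            revenue=("revenue", "sum"),
        )
        .reset_index()
    )
    agg["ctr"] = agg["clicks"] / agg["impressions"]
    agg["cvr"] = agg["conversions"] / agg["clicks"].clip(lower=1)
    # ROAS is undefined where there is no spend; leave those rows as NaN.
    agg["roas"] = agg["revenue"] / agg["spend"].where(agg["spend"] > 0)
    return agg
```

The same aggregation would typically run in Spark or Beam at production scale; pandas is used here only to keep the sketch self-contained.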

6. API Design

Expose APIs for evaluation, experiment control, and reporting.

  • POST /eval/offline/run

    • payload: {start_ts, end_ts, model_version, metrics: ["ctr","cvr","roas"], filters}
    • response: job_id
  • GET /eval/offline/{job_id}/status

    • response: {status, progress, result_url}
  • POST /experiments/create

    • payload: {name, cohorts, treatment_config, start_ts, end_ts}
    • response: experiment_id
  • GET /reports/metrics

    • query params: {metric, granularity, start, end, group_by}
    • response: timeseries or aggregated table

APIs must enforce authentication and rate limiting. Results should reference dataset versions and model artifacts.
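A hedged example of driving these endpoints from a client. Only the paths, payload fields, and response fields come from the spec above; the base URL, bearer-token auth scheme, and terminal status values are assumptions:

```python
import time
import requests

BASE = "https://eval.internal.example"           # assumed host
HEADERS = {"Authorization": "Bearer <token>"}    # assumed auth scheme

# Kick off an offline evaluation run for a candidate model version.
resp = requests.post(
    f"{BASE}/eval/offline/run",
    json={
        "start_ts": 1735689600,
        "end_ts": 1735776000,
        "model_version": "ranker-v42",           # illustrative version label
        "metrics": ["ctr", "cvr", "roas"],
        "filters": {"placement": "newsfeed"},
    },
    headers=HEADERS,
    timeout=30,
)
job_id = resp.json()["job_id"]

# Poll job status until it reaches a terminal state, then print the result
# location. The "succeeded"/"failed" values are assumed, not specified above.
while True:
    status = requests.get(
        f"{BASE}/eval/offline/{job_id}/status", headers=HEADERS, timeout=30
    ).json()
    if status["status"] in ("succeeded", "failed"):
        print(status.get("result_url"))
        break
    time.sleep(30)
```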

7. High-Level Architecture

Components:

  • Event Collection: client SDKs -> streaming layer (Kafka) -> raw event lake.
  • Preprocessing: streaming ETL -> cleaning, deduplication, enrichment (user segments), producing Parquet/columnar tables (the dedup/enrich step is sketched after the data-flow summary below).
  • Feature Store: online (low-latency) and offline (training) feature stores.
  • Offline Evaluator: batch job framework (Spark/Beam) to compute KPIs, cohort metrics, counterfactual replay.
  • Experiment Platform: traffic split, assignment service, treatment enforcement, logging of exposure and outcomes.
  • Online Evaluator / Real-time Metrics: near-real-time aggregations for monitoring and alerts.
  • Model Evaluation: scoring pipelines to produce pred distributions, calibration checks, uplift modeling.
  • Reporting & Dashboarding: BI layer and alerting.

Data flow: events -> preprocess -> feature store / offline tables -> evaluator & model scoring -> dashboards/experiments.
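As a narrow, framework-agnostic illustration of the preprocessing step's deduplication and enrichment, here is a minimal sketch assuming each raw event is a dict carrying an `id` and a `user_id_hash`, and that segments come from a simple lookup table (both assumptions):

```python
from collections import OrderedDict
from typing import Dict, Iterable, Iterator

def dedupe_and_enrich(
    raw_events: Iterable[Dict],
    user_segments: Dict[str, str],
    max_seen: int = 1_000_000,
) -> Iterator[Dict]:
    """Drop events with duplicate ids and attach a coarse user segment.

    Keeps at most `max_seen` recent ids in memory; a production pipeline
    would use a TTL'd state store in the stream processor instead.
    """
    seen: OrderedDict = OrderedDict()
    for event in raw_events:
        event_id = event["id"]
        if event_id in seen:
            continue                      # duplicate delivery, skip
        seen[event_id] = None
        if len(seen) > max_seen:
            seen.popitem(last=False)      # evict the oldest id
        enriched = dict(event)
        enriched["segment"] = user_segments.get(event["user_id_hash"], "unknown")
        yield enriched
```

In practice this logic would live inside the streaming layer's consumers, with the event lake receiving the raw stream untouched so that derived tables can always be rebuilt.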

8. Detailed Design Decisions

Key decisions and rationale:

  • Metrics: prioritize a small set (CTR, CVR, ROAS, long-term retention, quality score). ROAS and long-term metrics prevent short-term click-chasing.
  • Attribution model: support multiple (last-click, multi-touch, algorithmic) and record attribution metadata; choose algorithmic attribution for final billing when possible.
  • Offline evaluation: use counterfactual policy evaluation (inverse propensity scoring, doubly robust estimators) to estimate a new ranker's impact from logged data (an IPS sketch follows this list).
  • Experimentation: run randomized A/B tests for causal inference; use stratified randomization by region/device/segment to reduce variance.
  • ML evaluation: report calibration, AUC, precision-recall, and business metrics (predicted vs. actual ROAS). Use uplift and policy learning metrics for ranking policies.
  • Data quality: enforce schema checks, anomaly detection, and backfilling policies.
  • Privacy: rely on hashed identifiers, aggregate-level reporting, and DP mechanisms for public releases.
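As referenced in the offline evaluation item above, a minimal IPS sketch, assuming each logged impression records the logging policy's propensity for the ad it showed and the observed reward; the self-normalized variant (SNIPS) is included because plain IPS can have high variance:

```python
import numpy as np

def ips_estimate(rewards, logged_propensities, new_policy_probs):
    """Estimate the value of a candidate ranking policy from logged data.

    rewards:             observed outcome per logged impression (e.g., conversion value)
    logged_propensities: probability the logging policy showed that ad
    new_policy_probs:    probability the candidate policy would show the same ad
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(new_policy_probs, dtype=float) / np.asarray(
        logged_propensities, dtype=float
    )
    ips = np.mean(weights * rewards)                      # unbiased, high variance
    snips = np.sum(weights * rewards) / np.sum(weights)   # self-normalized, lower variance
    return ips, snips

# Illustrative usage with made-up logs.
ips, snips = ips_estimate(
    rewards=[0.0, 1.0, 0.0, 1.0],
    logged_propensities=[0.5, 0.2, 0.8, 0.4],
    new_policy_probs=[0.6, 0.5, 0.1, 0.4],
)
```

A doubly robust estimator would additionally plug in a reward model to reduce variance further; the propensity logging requirement is the same.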

Trade-offs: counterfactual methods reduce the cost of online testing but require accurate propensity estimates; A/B tests are the gold standard but are expensive and time-consuming.

9. Bottlenecks & Scaling

  • Event ingestion throughput: shard Kafka topics, provision consumers, backpressure strategies.
  • Storage cost: retain raw events for a bounded period and store compressed aggregated rollups long-term.
  • Join complexity: user-feature joins at scale; use precomputed feature vectors and online feature caching.
  • Variance in experiments: low-signal metrics (conversions) need larger sample sizes; use sequential testing and pre-registered analysis plans (a sample-size sketch closes this section).
  • Real-time metrics freshness vs. cost: provide tiered SLAs (real-time for monitoring, hourly for analysis).

Mitigations: sampling frameworks, approximate algorithms (HyperLogLog), partitioned compute, autoscaling, and efficient columnar formats.
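To make the variance point concrete, here is the standard two-proportion sample-size approximation (normal approximation); the baseline rates and the 5% relative lift are assumed purely for illustration:

```python
from scipy.stats import norm

def samples_per_arm(p_baseline: float, rel_lift: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples per arm to detect a relative lift in a rate metric."""
    p2 = p_baseline * (1 + rel_lift)
    delta = p2 - p_baseline
    p_bar = (p_baseline + p2) / 2
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_b = norm.ppf(power)           # desired power
    n = 2 * (z_a + z_b) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return int(round(n))

# A 0.5% conversion rate needs ~4x more traffic than a 2% CTR for the same 5% lift.
print(samples_per_arm(0.005, 0.05))   # ~1.3M per arm
print(samples_per_arm(0.02, 0.05))    # ~315K per arm
```

This is why conversion-based experiments often run longer (or use CUPED-style variance reduction and sequential monitoring) than click-based ones.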

10. Follow-up Questions / Extensions

  • How to evaluate long-term impact (e.g., user lifetime value) vs. short-term conversions?
  • Should we integrate counterfactual policy learning (offline RL) into the evaluation pipeline?
  • How to surface explainability for advertisers (why an ad ranked lower)?
  • How to incorporate creative quality (NLP/image scoring) into ranking metrics?
  • Can we use multi-armed bandits for exploration while minimizing regret? (A Thompson sampling sketch follows this list.)
  • What privacy-preserving ML techniques (federated learning, differential privacy) should be adopted?
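On the bandit question, a minimal Beta-Bernoulli Thompson sampling sketch over a fixed set of ads; real ranking would use value-weighted rewards and contextual models rather than single-ad arms, so treat this purely as an illustration (ad IDs and CTRs are made up):

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of ads (arms)."""

    def __init__(self, ad_ids):
        # One (successes, failures) pair of Beta parameters per ad.
        self.params = {ad: [1.0, 1.0] for ad in ad_ids}

    def choose(self) -> str:
        # Sample a plausible CTR for each ad and show the highest draw.
        draws = {ad: random.betavariate(a, b) for ad, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, ad_id: str, clicked: bool) -> None:
        # Bayesian update of the chosen ad's Beta posterior.
        if clicked:
            self.params[ad_id][0] += 1
        else:
            self.params[ad_id][1] += 1

# Illustrative loop against a simulated environment.
sampler = ThompsonSampler(["ad_a", "ad_b", "ad_c"])
true_ctr = {"ad_a": 0.02, "ad_b": 0.05, "ad_c": 0.01}
for _ in range(10_000):
    ad = sampler.choose()
    sampler.update(ad, random.random() < true_ctr[ad])
```

Exploration traffic would normally be capped and logged with propensities so the counterfactual evaluation described earlier can reuse it.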

11. Wrap-up

An effective evaluation framework combines clear business KPIs, robust data pipelines, both offline counterfactual and online randomized evaluations, model-centric metrics, and operational monitoring. Emphasize reproducibility, privacy, and iterative improvements (A/B tests -> model updates -> re-evaluation). Provide clear dashboards and communicate trade-offs to stakeholders.
