1. The Question
Design an evaluation framework for ads ranking. The framework should define objectives and KPIs, describe the required data, propose offline and online evaluation methods and experiment strategies (A/B testing, holdouts), outline ML evaluation and attribution approaches, address scale and data quality, and cover ethical/privacy considerations and reporting.
2. Clarifying Questions
- What business objective dominates (revenue, ROAS, engagement, long-term retention)?
- Are we evaluating ad-level ranking or campaign-level allocation?
- What surfaces (newsfeed, stories, search, sidebar) and placements must be covered?
- Do we have ground-truth conversions or use proxy signals (clicks, viewability)?
- What latency and privacy constraints exist (PII restrictions, differential privacy)?
- Is offline historical data available with treatment labels and impression logs?
3. Requirements
Functional:
- Provide per-ad and per-segment KPIs (CTR, CVR, ROAS, Quality Score).
- Support offline evaluation (batch) and online experiments (A/B, multi-armed bandits).
- Produce dashboards and automated alerts for KPI regressions.
Non-functional:
- Scalable to billions of impressions/day.
- Low-cost reproducible offline pipelines.
- Privacy-preserving (compliant with GDPR/CCPA).
- Explainable metrics for stakeholders.
4. Scale Estimates
- Impressions: 1B+/day (example at Meta scale).
- Active ads: 10M+ creatives running concurrently.
- Unique users: 100M+ daily active users.
- Events: 10–100B raw events/day (impressions, clicks, conversions).
- Storage: petabyte-scale event stores; aggregated telemetry in data warehouses.
Implication: sampling, aggregation, and streaming pipelines are required.
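A back-of-envelope check of the storage implication, using the figures above; the ~1 KB/event size and the 30-day raw retention window are assumptions:

```python
# Back-of-envelope storage estimate; event size and retention period are illustrative assumptions.
events_per_day = 50e9        # midpoint of the 10-100B raw events/day range above
bytes_per_event = 1_000      # assumed ~1 KB per enriched event record
raw_tb_per_day = events_per_day * bytes_per_event / 1e12
raw_pb_30d = raw_tb_per_day * 30 / 1_000

print(f"~{raw_tb_per_day:.0f} TB/day raw, ~{raw_pb_30d:.1f} PB for a 30-day retention window")
# -> ~50 TB/day raw, ~1.5 PB for a 30-day retention window
```

At this volume, keeping raw events hot indefinitely is impractical, hence the bounded raw retention plus long-lived aggregated rollups described in the scaling section.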
5. Data Model
Key entities (sketched as dataclasses at the end of this section):
- impression {id, user_id_hash, ad_id, campaign_id, placement, timestamp, context_features}
- click {impression_id, timestamp}
- conversion {event_type, value, timestamp, attribution_hint}
- creative {ad_id, headline, image_features, landing_url, advertiser_id}
- user_features {user_id_hash, demographics, segments, recency}
- model_preds {ad_id, user_id_hash, pred_ctr, pred_cvr, pred_value}
Aggregates:
- hourly/day rollups by ad/campaign/placement/segment with impressions, clicks, conversions, spend, revenue, ROAS.
Notes:
- Use hashed user IDs and join keys to avoid PII.
- Keep raw event logs immutable and build derived tables for fast queries.
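A minimal sketch of the core entities as Python dataclasses; field names follow the list above, while the types, epoch-millisecond timestamps, and defaults are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Impression:
    """One ad impression; user_id_hash is a hashed identifier, never raw PII."""
    id: str
    user_id_hash: str
    ad_id: str
    campaign_id: str
    placement: str
    timestamp: int                          # epoch millis (assumed)
    context_features: dict = field(default_factory=dict)

@dataclass
class Click:
    impression_id: str
    timestamp: int

@dataclass
class Conversion:
    event_type: str                         # e.g. "purchase", "signup"
    value: float                            # conversion value feeding ROAS
    timestamp: int
    attribution_hint: Optional[str] = None  # e.g. click-through vs. view-through

@dataclass
class ModelPreds:
    ad_id: str
    user_id_hash: str
    pred_ctr: float
    pred_cvr: float
    pred_value: float
```

The same schemas can double as the contract for the derived offline tables, keeping raw event logs immutable while fast queries hit the derived rollups noted above.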
6. API Design
Expose APIs for evaluation, experiment control, and reporting.
- POST /eval/offline/run
  - payload: {start_ts, end_ts, model_version, metrics: ["ctr","cvr","roas"], filters}
  - response: job_id
- GET /eval/offline/{job_id}/status
  - response: {status, progress, result_url}
- POST /experiments/create
  - payload: {name, cohorts, treatment_config, start_ts, end_ts}
  - response: experiment_id
- GET /reports/metrics
  - query params: {metric, granularity, start, end, group_by}
  - response: timeseries or aggregated table
All APIs must require authentication and enforce rate limits. Results should reference dataset versions and model artifacts so runs are reproducible.
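A hedged sketch of a client driving an offline evaluation run through these endpoints; the host, auth token, and polling interval are assumptions, and the payloads mirror the list above:

```python
import time
import requests

BASE = "https://ads-eval.internal.example.com"          # hypothetical internal host
HEADERS = {"Authorization": "Bearer <service-token>"}   # endpoints require auth

# Kick off an offline evaluation of a candidate ranker over a one-week window.
resp = requests.post(
    f"{BASE}/eval/offline/run",
    headers=HEADERS,
    json={
        "start_ts": 1735689600,
        "end_ts": 1736294400,
        "model_version": "ranker-v42",                  # hypothetical model artifact
        "metrics": ["ctr", "cvr", "roas"],
        "filters": {"placement": "newsfeed"},
    },
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["job_id"]

# Poll the batch job until it completes, then pick up the result location.
while True:
    status = requests.get(f"{BASE}/eval/offline/{job_id}/status",
                          headers=HEADERS, timeout=30).json()
    if status["status"] in ("SUCCEEDED", "FAILED"):
        print(status["status"], status.get("result_url"))
        break
    time.sleep(60)
```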
7. High-Level Architecture
Components:
- Event Collection: client SDKs -> streaming layer (Kafka) -> raw event lake.
- Preprocessing: streaming ETL -> cleaning, deduplication, enrichment (user segments), produce parquet/columnar tables.
- Feature Store: online (low-latency) and offline (training) feature stores.
- Offline Evaluator: batch-job framework (Spark/Beam) to compute KPIs, cohort metrics, and counterfactual replay (a rollup sketch follows this list).
- Experiment Platform: traffic split, assignment service, treatment enforcement, logging of exposure and outcomes.
- Online Evaluator / Real-time Metrics: near-real-time aggregations for monitoring and alerts.
- Model Evaluation: scoring pipelines to produce pred distributions, calibration checks, uplift modeling.
- Reporting & Dashboarding: BI layer and alerting.
Data flow: events -> preprocess -> feature store / offline tables -> evaluator & model scoring -> dashboards/experiments.
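To make the Offline Evaluator concrete, a minimal PySpark sketch of the hourly rollup described in the data model; table names, the epoch-millisecond timestamps, and the output path are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ads-eval-rollup").getOrCreate()

# Derived tables produced by the preprocessing layer (illustrative names).
# Clicks/conversions are assumed deduplicated upstream to at most one row per impression.
imps  = spark.table("ads.impressions").select("id", "ad_id", "campaign_id", "placement", "timestamp")
clks  = spark.table("ads.clicks").select(F.col("impression_id").alias("clk_imp_id"))
convs = spark.table("ads.conversions").select(F.col("impression_id").alias("cnv_imp_id"), "value")

rollup = (
    imps
    .join(clks,  imps.id == clks.clk_imp_id,  "left")
    .join(convs, imps.id == convs.cnv_imp_id, "left")
    # epoch-millis timestamps assumed; truncate to the hour for hourly rollups
    .withColumn("hour", F.date_trunc("hour", (F.col("timestamp") / 1000).cast("timestamp")))
    .groupBy("ad_id", "campaign_id", "placement", "hour")
    .agg(
        F.count("id").alias("impressions"),
        F.count("clk_imp_id").alias("clicks"),        # count() skips the nulls from the left join
        F.count("cnv_imp_id").alias("conversions"),
        F.sum(F.coalesce(F.col("value"), F.lit(0.0))).alias("revenue"),
    )
    .withColumn("ctr", F.col("clicks") / F.col("impressions"))
    .withColumn("cvr", F.col("conversions") / F.col("clicks"))
)

rollup.write.mode("overwrite").parquet("s3://warehouse/ads_eval/hourly_rollup/")  # hypothetical path
```

Daily and per-segment rollups follow the same pattern with different grouping keys.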
8. Detailed Design Decisions
Key decisions and rationale:
- Metrics: prioritize a small set (CTR, CVR, ROAS, long-term retention, quality score). ROAS and long-term metrics prevent short-term click-chasing.
- Attribution model: support multiple (last-click, multi-touch, algorithmic) and record attribution metadata; choose algorithmic attribution for final billing when possible.
- Offline evaluation: use counterfactual policy evaluation (IPS, doubly robust) to estimate new-ranker impact from logged data (an IPS sketch follows at the end of this section).
- Experimentation: run randomized A/B tests for causal inference; use stratified randomization by region/device/segment to reduce variance.
- ML evaluation: report calibration, AUC, precision-recall, and business metrics (predicted vs. actual ROAS). Use uplift and policy learning metrics for ranking policies.
- Data quality: enforce schema checks, anomaly detection, and backfilling policies.
- Privacy: rely on hashed identifiers, aggregate-level reporting, and DP mechanisms for public releases.
Trade-offs: counterfactual methods reduce the cost of online testing but require accurate propensity estimates; A/B tests are the gold standard for causal inference but are expensive and slow.
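A minimal sketch of the IPS idea from the offline-evaluation bullet above: reweight logged rewards by how much more (or less) likely the candidate ranker is to show the same ad than the logging policy was. The self-normalized variant, the weight clipping, and the toy numbers are assumptions:

```python
import numpy as np

def ips_estimate(rewards, logging_propensity, new_policy_prob, clip=10.0, self_normalize=True):
    """Off-policy value estimate of a candidate ranker from logged impressions.

    rewards            -- observed outcome per logged impression (click, or conversion value)
    logging_propensity -- probability the production policy showed the logged ad
    new_policy_prob    -- probability the candidate policy would show that same ad
    clip               -- cap on importance weights to control variance (heuristic)
    """
    w = np.clip(new_policy_prob / logging_propensity, 0.0, clip)
    if self_normalize:                        # SNIPS: lower variance at the cost of slight bias
        return float(np.sum(w * rewards) / np.sum(w))
    return float(np.mean(w * rewards))        # plain IPS: unbiased if propensities are correct

# Toy usage on three synthetic logged impressions.
rewards = np.array([1.0, 0.0, 0.0])           # e.g. 1 = click
p_log   = np.array([0.5, 0.2, 0.3])           # logging-policy propensities (must be logged)
p_new   = np.array([0.7, 0.1, 0.4])           # candidate-policy probabilities for the same ads
print(ips_estimate(rewards, p_log, p_new))
```

The doubly robust estimator adds a learned reward model to cut variance further, but both hinge on accurate logged propensities, which is exactly the trade-off noted above.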
9. Bottlenecks & Scaling
- Event ingestion throughput: shard Kafka topics, provision consumers, backpressure strategies.
- Storage cost: retain raw events for a bounded period and store compressed aggregated rollups long-term.
- Join complexity: user-feature joins at scale; use precomputed feature vectors and online feature caching.
- Variance in experiments: low-signal metrics (conversions) need larger sample sizes; use sequential testing and pre-registered analysis plans (see the sample-size sketch after this list).
- Real-time metrics freshness vs. cost: provide tiered SLAs (real-time for monitoring, hourly for analysis).
Mitigations: sampling frameworks, approximate algorithms (HyperLogLog), partitioned compute, autoscaling, and efficient columnar formats.
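To quantify the variance point for low-signal metrics, a rough two-proportion sample-size sketch; the baseline rates and the 2% relative lift are assumed figures:

```python
from statistics import NormalDist

def samples_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to detect a relative lift in a rate metric."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return int((z_alpha + z_beta) ** 2 * variance / (p_base - p_treat) ** 2)

# Detecting a 2% relative lift on a 0.5% conversion rate vs. the same lift on a 2% click rate.
print(samples_per_arm(0.005, 0.02))   # ~8M users per arm for the conversion metric
print(samples_per_arm(0.02, 0.02))    # ~2M users per arm for the click metric
```

This is why conversion-driven experiments lean on stratified randomization, sequential testing, and longer run times, while CTR changes can be detected on far less traffic.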
10. Follow-up Questions / Extensions
- How to evaluate long-term impact (e.g., user lifetime value) vs. short-term conversions?
- Should we integrate counterfactual policy learning (offline RL) into the evaluation pipeline?
- How to surface explainability for advertisers (why an ad ranked lower)?
- How to incorporate creative quality (NLP/image scoring) into ranking metrics?
- Can we use multi-armed bandits for exploration while minimizing regret?
- What privacy-preserving ML techniques (federated learning, differential privacy) should be adopted?
11. Wrap-up
An effective evaluation framework combines clear business KPIs, robust data pipelines, both offline counterfactual and online randomized evaluations, model-centric metrics, and operational monitoring. Emphasize reproducibility, privacy, and iterative improvements (A/B tests -> model updates -> re-evaluation). Provide clear dashboards and communicate trade-offs to stakeholders.