1. The Question
Design a system that removes noise from audio streams and audio files. The system should support both real-time low-latency use cases (e.g., video calls) and high-quality batch processing (e.g., post-processing of uploaded videos). Consider accuracy, latency, throughput, model deployment, monitoring, and upgrade paths.
2. Clarifying Questions
- What are the target applications? (real-time conferencing, live streaming, post-processing)
- What noise types must be handled? (stationary: hum, hiss; non-stationary: traffic, music, other speakers)
- Latency SLO for real-time path (e.g., <20 ms, <100 ms)?
- Throughput targets: concurrent streams, batch jobs/hour?
- Allowed compute: CPU-only edge, mobile, GPUs in cloud?
- Success metrics: objective (SNR, PESQ) and subjective (MOS) thresholds?
- Privacy/regulatory constraints on sending raw audio to cloud?
3. Requirements
Functional:
- Real-time denoising API with low end-to-end latency.
- Batch denoising API for higher-quality processing.
- Support mono/stereo, sample rates up to 48 kHz.
- Model update and A/B testing support.
Non-functional:
- Real-time latency: typical target < 50 ms added processing.
- Throughput: scale to 100k+ concurrent real-time streams (example target).
- Accuracy: measurable improvements in SNR/PESQ and human MOS.
- Deployability: run on server GPU/CPU and optionally on-device/mobile.
- Robustness: handle packet loss, jitter, and variable input levels.
Observability & Ops:
- Metrics for latency, error rates, quality metrics.
- Logging and tools for subjective evaluation.
4. Scale Estimates
Example scale targets (adjust for your company):
- Concurrent real-time streams: 100,000
- Average bitrate per stream: ~16 kbps (voice); inference budget of 10–50 ms per 20–40 ms frame
- Batch jobs: 10,000 files/hour, average file 5 minutes
- Model size: 5–200 MB depending on architecture
- Storage: store models and metadata; audio storage depends on retention. Storing raw before/after audio for 1% of traffic for quality evaluation runs to tens of TB.
Estimate server requirements:
- Real-time: optimized CPU instances or small GPU/TPU pods with model quantization or DSP front-end to reduce CPU/GPU cost.
- Batch: GPU-backed workers running high-quality models (Demucs/Conv-TasNet) to maximize throughput.
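A rough back-of-envelope check of the real-time fleet size, using the example numbers above; the per-frame inference cost and cores-per-machine figures are assumptions chosen to make the arithmetic concrete:

```python
# Back-of-envelope sizing for the real-time path (illustrative numbers only).
CONCURRENT_STREAMS = 100_000
FRAME_MS = 20                  # frame hop in milliseconds
INFER_MS_PER_FRAME = 5         # assumed cost of a quantized model on one CPU core
CORES_PER_MACHINE = 32         # assumed instance size

frames_per_second_per_stream = 1000 / FRAME_MS                            # 50 frames/s
cpu_seconds_per_stream_second = frames_per_second_per_stream * INFER_MS_PER_FRAME / 1000
streams_per_core = 1 / cpu_seconds_per_stream_second                      # ~4 streams/core
machines = CONCURRENT_STREAMS / (streams_per_core * CORES_PER_MACHINE)

print(f"{streams_per_core:.1f} streams/core, ~{machines:.0f} machines before headroom")
```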
5. Data Model
Primary entities (simplified):
- audio_job
  - id
  - user_id
  - type: "real_time" | "batch"
  - sample_rate
  - channels
  - status
  - created_at
  - input_location (for batch)
  - output_location
  - model_version
- audio_frame (ephemeral for streaming; not persisted to the DB at scale)
  - stream_id
  - sequence_no
  - timestamp
  - payload (encoded PCM/float)
- model_metadata
  - model_version
  - architecture
  - quantization
  - training_data_summary
  - metrics (PESQ, SNR on validation sets)
Use object storage (e.g., GCS/S3) for audio blobs and a small metadata DB (Cloud Spanner/Cloud SQL/managed NoSQL) for job state and metrics.
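A minimal sketch of the job and model-registry records as Python dataclasses; the field names mirror the entities above, while the concrete types and enum values are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class AudioJob:
    id: str
    user_id: str
    type: str                       # "real_time" | "batch"
    sample_rate: int                # e.g., 16000 or 48000
    channels: int                   # 1 (mono) or 2 (stereo)
    status: str                     # e.g., "queued" | "running" | "done" | "failed"
    created_at: datetime
    model_version: str
    input_location: Optional[str] = None   # object-store URI, batch only
    output_location: Optional[str] = None

@dataclass
class ModelMetadata:
    model_version: str
    architecture: str               # e.g., "rnnoise_rnn", "conv_tasnet"
    quantization: str               # e.g., "int8", "fp16", "none"
    training_data_summary: str
    metrics: dict = field(default_factory=dict)   # {"pesq": ..., "snr_db": ...}
```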
6. API Design
Proposed APIs:
- Real-time (gRPC bidirectional streaming):
  - StartStream(stream_metadata) -> stream_id
  - SendFrame(stream_id, frame_seq, pcm_chunk) -> ack
  - ReceiveFrame(stream_id) -> denoised_pcm_chunk
  - EndStream(stream_id)
- Batch (REST API):
  - POST /v1/denoise: { input_uri, output_uri, model_version, options } -> job_id
  - GET /v1/denoise/{job_id} -> { status, progress, output_uri }
  - GET /v1/models -> list of model versions and their metrics
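A hedged sketch of how a client might drive the batch endpoints above; the host name, auth header, and model version string are placeholders, and the JSON fields follow the proposal:

```python
import time
import requests

API = "https://denoise.example.com/v1"          # placeholder host
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# Submit a batch job pointing at object-store URIs.
resp = requests.post(f"{API}/denoise", headers=HEADERS, json={
    "input_uri": "gs://bucket/raw/talk.wav",
    "output_uri": "gs://bucket/clean/talk.wav",
    "model_version": "batch-demucs-v3",
    "options": {"sample_rate": 48000, "channels": 2},
})
job_id = resp.json()["job_id"]

# Poll until the job finishes.
while True:
    status = requests.get(f"{API}/denoise/{job_id}", headers=HEADERS).json()
    if status["status"] in ("done", "failed"):
        break
    time.sleep(5)

print(status["status"], status.get("output_uri"))
```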
Security & quotas:
- Auth via OAuth or API keys, per-project quotas, rate limiting.
- Streaming tokens short-lived; mutual TLS optional for enterprise.
7. High-Level Architecture
Components:
- Ingress Layer
  - API gateway: handles auth and rate limiting
  - Stream router: routes real-time streams to processing instances
- Preprocessing
  - Gain normalization, resampling, VAD (voice activity detection), packet reassembly
  - Lightweight DSP (noise gating, spectral subtraction) as a low-latency pre-filter
- Inference Layer
  - Real-time path: low-latency models (RNNoise-style RNNs, small conv models) deployed on CPU or small GPU; served via low-overhead gRPC servers
  - Batch path: high-quality models (Demucs, SEGAN, Conv-TasNet) on GPU workers with request batching
- Postprocessing
  - Overlap-add, smoothing, adaptive gain, reverberation correction
- Storage & Metadata
  - Object storage for inputs/outputs
  - Metadata DB for jobs and the model registry
- Monitoring & Evaluation
  - Real-time metrics, an audio quality metrics pipeline, human-in-the-loop evaluation, and A/B testing
- CI/CD & Model Training
  - Training pipeline, validation datasets, model rollout with canary and shadow testing
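To make the real-time path concrete, here is a minimal per-frame handler that strings the stages together (preprocess -> VAD -> model -> postprocess); the `vad`, `denoise_model`, and `smooth_gain` callables are placeholders for the components named above:

```python
import numpy as np

def process_frame(pcm: np.ndarray,
                  vad,                # returns True if the frame contains speech
                  denoise_model,      # causal model: noisy frame -> denoised frame
                  smooth_gain,        # postprocessing gain smoother
                  target_rms: float = 0.05) -> np.ndarray:
    """One hop of the real-time path: normalize -> VAD -> model -> postprocess."""
    # Gain normalization (preprocessing).
    rms = np.sqrt(np.mean(pcm ** 2)) + 1e-8
    frame = pcm * (target_rms / rms)

    # Skip the model on non-speech frames to save compute; just attenuate.
    if not vad(frame):
        return smooth_gain(frame * 0.1)

    # ML denoising with the causal, low-latency model.
    denoised = denoise_model(frame)

    # Postprocessing: smooth gain changes to avoid audible pumping.
    return smooth_gain(denoised)
```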
8. Detailed Design Decisions
Hybrid DSP + ML: use a DSP front-end to cheaply attenuate simple stationary noise (reducing the work left for the model), and an ML back-end for complex/non-stationary noise.
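As an example of the DSP front-end, a minimal single-channel spectral-subtraction pass; estimating the noise from the first few frames and the over-subtraction factor are simplifying assumptions, not a production design:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x: np.ndarray, fs: int, noise_frames: int = 10,
                      alpha: float = 1.5, floor: float = 0.05) -> np.ndarray:
    """Simple spectral subtraction: estimate the noise magnitude from the first
    frames (assumed speech-free), subtract it, and keep the noisy phase."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)

    # Noise magnitude estimate from the leading frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Over-subtract and clamp to a spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)

    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y[:len(x)]
```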
Model choices:
- Real-time: RNNoise-inspired RNN or small causal Conv models; quantized int8 model for CPU.
- Batch: Demucs or Conv-TasNet for best perceptual quality; non-causal, larger networks.
Latency optimization:
- Frame size: 20–40 ms with overlap-add; smaller frames reduce latency at the cost of frequency resolution (see the framing sketch after this list).
- Use causal architectures for real-time.
- Frame-level batching in the server to exploit SIMD/TPU parallelism without increasing per-stream latency.
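The framing sketch referenced above: a causal 50% overlap-add loop around a per-frame denoiser. The 20 ms frame / 10 ms hop at 16 kHz and the sqrt-Hann analysis/synthesis windows are assumed example values; reconstruction at 50% overlap is approximate here, which is acceptable for a sketch:

```python
import numpy as np

def stream_overlap_add(hops, denoise_frame, hop=160):
    """Causal 50% overlap-add at 16 kHz (assumed): 20 ms frames, 10 ms hop.
    `hops` yields hop-sized sample chunks; output lags input by one hop (10 ms)."""
    frame_len = 2 * hop
    window = np.sqrt(np.hanning(frame_len))   # sqrt-Hann analysis + synthesis
    prev_hop = np.zeros(hop)   # previous input hop (first half of current frame)
    carry = np.zeros(hop)      # second half of the previous denoised frame

    for chunk in hops:                              # chunk has `hop` samples
        frame = np.concatenate([prev_hop, chunk]) * window
        denoised = denoise_frame(frame) * window    # apply synthesis window
        yield carry + denoised[:hop]                # overlap-add the completed hop
        carry = denoised[hop:]
        prev_hop = chunk
```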
Resource optimization:
- Distillation, pruning, and quantization for edge deployment (int8 sketch below).
- Mixed-precision for GPU batch inference.
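A hedged int8 example using PyTorch post-training dynamic quantization. `TinyDenoiser` is a made-up placeholder for the real-time model; the quantized layer set reflects what dynamic quantization supports (Linear and recurrent layers):

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder stand-in for a small causal real-time denoiser."""
    def __init__(self, feat=257, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, feat)

    def forward(self, x):                   # x: (batch, frames, feat)
        h, _ = self.gru(x)
        return torch.sigmoid(self.mask(h))  # per-bin suppression mask

model = TinyDenoiser().eval()

# Post-training dynamic quantization of the weight-heavy layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.GRU}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "denoiser_int8.pt")
```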
Model serving:
- gRPC servers with pinned threads for deterministic latency.
- Autoscaling groups based on CPU/GPU utilization and request rates.
A/B testing / rollout:
- Shadow traffic and canary evaluation using objective metrics and sampled human MOS.
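A minimal sketch of the objective-metric gate for canary/shadow evaluation: run the production and candidate models over held-out clips with clean references and only promote if PESQ does not regress. The open-source `pesq` package and the allowed-drop threshold are assumptions:

```python
import numpy as np
from pesq import pesq   # pip install pesq (P.862 implementation)

def canary_gate(samples, fs=16000, max_pesq_drop=0.05):
    """samples: iterable of (clean_ref, prod_output, candidate_output) arrays at `fs`."""
    prod_scores, cand_scores = [], []
    for ref, prod_out, cand_out in samples:
        prod_scores.append(pesq(fs, ref, prod_out, "wb"))
        cand_scores.append(pesq(fs, ref, cand_out, "wb"))
    delta = np.mean(cand_scores) - np.mean(prod_scores)
    # Promote only if the candidate is no worse than the small allowed drop.
    return delta >= -max_pesq_drop, delta
```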
9. Bottlenecks & Scaling
Potential bottlenecks:
- Inference latency: mitigate with smaller models, quantization, dedicated inference instances, CPU vectorization.
- Network bandwidth: transmit compressed frames; perform on-device preprocessing to reduce payload.
- Cold-start and scaling: keep warm pools for low-latency path; use fast autoscaling for batch workloads.
- Quality vs. compute: high-quality models require GPUs; isolate batch GPU clusters to avoid impacting real-time.
- Monitoring and labeled data: collecting quality labels at scale is costly; use periodic human tests plus automated metrics.
Scaling strategies:
- Horizontal scaling of inference containers, partitioning by customer or region.
- Edge inference for mobile clients to offload server load and improve privacy.
- Adaptive model selection: route to a light or heavy model depending on resource availability and the audio complexity detected by a classifier, as sketched below.
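A small sketch of that routing decision; the complexity classifier is reduced here to a spectral-flatness heuristic, and the thresholds and queue-depth signal are illustrative assumptions:

```python
import numpy as np

def spectral_flatness(frame: np.ndarray) -> float:
    """Geometric-to-arithmetic mean ratio of the power spectrum: near 1 for
    noise-like frames, near 0 for tonal or clean speech."""
    psd = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(psd))) / np.mean(psd))

def choose_model(frame: np.ndarray, gpu_queue_depth: int,
                 flatness_threshold: float = 0.4, max_queue: int = 8) -> str:
    """Route easy frames (or an overloaded fleet) to the light model."""
    if gpu_queue_depth > max_queue:
        return "light"            # shed load under pressure
    if spectral_flatness(frame) > flatness_threshold:
        return "heavy"            # noisy/complex audio: spend more compute
    return "light"
```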
10. Follow-up Questions / Extensions
- Personalization: adapt models to speaker voice over time.
- Multi-microphone support: beamforming + denoising.
- Dereverberation and separation of multiple speakers.
- On-device training or federated learning for privacy-preserving improvements.
- Support for music and complex acoustic scenes.
- Cost model: estimate per-hour GPU costs vs. CPU inference cost per stream.
11. Wrap-up
Design a hybrid system that uses DSP for low-latency prefiltering and small causal ML models for live inference, while using larger non-causal ML models in the batch path for high quality. Instrument with objective and subjective metrics, provide APIs for streaming and batch, and plan deployment and scaling using autoscaling inference fleets and GPU-backed batch workers.