1. The Question
Design a system that removes noise from audio streams and audio files. The system should support both real-time low-latency use cases (e.g., video calls) and high-quality batch processing (e.g., post-processing of uploaded videos). Consider accuracy, latency, throughput, model deployment, monitoring, and upgrade paths.
2. Clarifying Questions
- What are the target applications? (real-time conferencing, live streaming, post-processing)
- What noise types must be handled? (stationary: hum, hiss; non-stationary: traffic, music, other speakers)
- Latency SLO for real-time path (e.g., <20 ms, <100 ms)?
- Throughput targets: concurrent streams, batch jobs/hour?
- Allowed compute: CPU-only edge, mobile, GPUs in cloud?
- Success metrics: objective (SNR, PESQ) and subjective (MOS) thresholds?
- Privacy/regulatory constraints on sending raw audio to cloud?
3. Requirements
Functional:
- Real-time denoising API with low end-to-end latency.
- Batch denoising API for higher-quality processing.
- Support mono/stereo, sample rates up to 48 kHz.
- Model update and A/B testing support.
Non-functional:
- Real-time latency: typical target < 50 ms added processing.
- Throughput: scale to 100k+ concurrent real-time streams (example target).
- Accuracy: measurable improvements in SNR/PESQ and human MOS.
- Deployability: run on server GPU/CPU and optionally on-device/mobile.
- Robustness: handle packet loss, jitter, and variable input levels.
Observability & Ops:
- Metrics for latency, error rates, quality metrics.
- Logging and tools for subjective evaluation.
4. Scale Estimates
Example scale targets (adjust for your company):
- Concurrent real-time streams: 100,000
- Average bitrate per stream: ~16 kbps (voice); inference budget of 10–50 ms per 20–40 ms frame
- Batch jobs: 10,000 files/hour, average file 5 minutes
- Model size: 5–200 MB depending on architecture
- Storage: store models and metadata; audio storage depends on retention. Storing raw before/after audio for 1% of traffic for quality evaluation runs to tens of TB.
Estimate server requirements:
- Real-time: optimized CPU instances or small GPU/TPU pods with model quantization or DSP front-end to reduce CPU/GPU cost.
- Batch: GPU-backed workers running high-quality models (Demucs/Conv-TasNet) to maximize throughput.
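A rough back-of-envelope check of the real-time fleet size, using the example numbers above; the per-frame inference cost and cores-per-machine figures are assumptions chosen to make the arithmetic concrete:

```python
# Back-of-envelope sizing for the real-time path (illustrative numbers only).
CONCURRENT_STREAMS = 100_000
FRAME_MS = 20                  # frame hop in milliseconds
INFER_MS_PER_FRAME = 5         # assumed cost of a quantized model on one CPU core
CORES_PER_MACHINE = 32         # assumed instance size

frames_per_second_per_stream = 1000 / FRAME_MS                            # 50 frames/s
cpu_seconds_per_stream_second = frames_per_second_per_stream * INFER_MS_PER_FRAME / 1000
streams_per_core = 1 / cpu_seconds_per_stream_second                      # ~4 streams/core
machines = CONCURRENT_STREAMS / (streams_per_core * CORES_PER_MACHINE)

print(f"{streams_per_core:.1f} streams/core, ~{machines:.0f} machines before headroom")
```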
5. Data Model
Primary entities (simplified):
- audio_job
  - id
  - user_id
  - type: "real_time" | "batch"
  - sample_rate
  - channels
  - status
  - created_at
  - input_location (for batch)
  - output_location
  - model_version
- audio_frame (ephemeral for streaming; not persisted to the DB at scale)
  - stream_id
  - sequence_no
  - timestamp
  - payload (encoded PCM/float)
- model_metadata
  - model_version
  - architecture
  - quantization
  - training_data_summary
  - metrics (PESQ, SNR on validation sets)
Use object storage (e.g., GCS/S3) for audio blobs and a small metadata DB (Cloud Spanner/Cloud SQL/managed NoSQL) for job state and metrics.
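A minimal sketch of the job and model-registry records as Python dataclasses; the field names mirror the entities above, while the concrete types and enum values are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class AudioJob:
    id: str
    user_id: str
    type: str                       # "real_time" | "batch"
    sample_rate: int                # e.g., 16000 or 48000
    channels: int                   # 1 (mono) or 2 (stereo)
    status: str                     # e.g., "queued" | "running" | "done" | "failed"
    created_at: datetime
    model_version: str
    input_location: Optional[str] = None   # object-store URI, batch only
    output_location: Optional[str] = None

@dataclass
class ModelMetadata:
    model_version: str
    architecture: str               # e.g., "rnnoise_rnn", "conv_tasnet"
    quantization: str               # e.g., "int8", "fp16", "none"
    training_data_summary: str
    metrics: dict = field(default_factory=dict)   # {"pesq": ..., "snr_db": ...}
```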
6. API Design
Proposed APIs:
- Real-time (gRPC bidirectional streaming):
  - StartStream(stream_metadata) -> stream_id
  - SendFrame(stream_id, frame_seq, pcm_chunk) -> ack
  - ReceiveFrame(stream_id) -> denoised_pcm_chunk
  - EndStream(stream_id)
- Batch (REST API):
  - POST /v1/denoise: { input_uri, output_uri, model_version, options } -> job_id
  - GET /v1/denoise/{job_id} -> { status, progress, output_uri }
  - GET /v1/models -> list of model versions and their metrics
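A hedged sketch of how a client might drive the batch endpoints above; the host name, auth header, and model version string are placeholders, and the JSON fields follow the proposal:

```python
import time
import requests

API = "https://denoise.example.com/v1"          # placeholder host
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

# Submit a batch job pointing at object-store URIs.
resp = requests.post(f"{API}/denoise", headers=HEADERS, json={
    "input_uri": "gs://bucket/raw/talk.wav",
    "output_uri": "gs://bucket/clean/talk.wav",
    "model_version": "batch-demucs-v3",
    "options": {"sample_rate": 48000, "channels": 2},
})
job_id = resp.json()["job_id"]

# Poll until the job finishes.
while True:
    status = requests.get(f"{API}/denoise/{job_id}", headers=HEADERS).json()
    if status["status"] in ("done", "failed"):
        break
    time.sleep(5)

print(status["status"], status.get("output_uri"))
```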
Security & quotas:
- Auth via OAuth or API keys, per-project quotas, rate limiting.
- Streaming tokens short-lived; mutual TLS optional for enterprise.
7. High-Level Architecture
Components:
- Ingress Layer
  - API gateway: handles auth and rate limiting
  - Stream router: routes real-time streams to processing instances
- Preprocessing
  - Gain normalization, resampling, VAD (voice activity detection), packet reassembly
  - Lightweight DSP (noise gating, spectral subtraction) as a low-latency pre-filter
- Inference Layer
  - Real-time path: low-latency models (RNNoise-style RNNs, small conv models) deployed on CPU or small GPU; served via low-overhead gRPC servers
  - Batch path: high-quality models (Demucs, SEGAN, Conv-TasNet) on GPU workers with request batching
- Postprocessing
  - Overlap-add, smoothing, adaptive gain, reverberation correction
- Storage & Metadata
  - Object storage for inputs/outputs
  - Metadata DB for jobs and the model registry
- Monitoring & Evaluation
  - Real-time metrics, an audio quality metrics pipeline, human-in-the-loop evaluation, and A/B testing
- CI/CD & Model Training
  - Training pipeline, validation datasets, model rollout with canary and shadow testing
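To make the real-time path concrete, here is a minimal per-frame handler that strings the stages together (preprocess -> VAD -> model -> postprocess); the `vad`, `denoise_model`, and `smooth_gain` callables are placeholders for the components named above:

```python
import numpy as np

def process_frame(pcm: np.ndarray,
                  vad,                # returns True if the frame contains speech
                  denoise_model,      # causal model: noisy frame -> denoised frame
                  smooth_gain,        # postprocessing gain smoother
                  target_rms: float = 0.05) -> np.ndarray:
    """One hop of the real-time path: normalize -> VAD -> model -> postprocess."""
    # Gain normalization (preprocessing).
    rms = np.sqrt(np.mean(pcm ** 2)) + 1e-8
    frame = pcm * (target_rms / rms)

    # Skip the model on non-speech frames to save compute; just attenuate.
    if not vad(frame):
        return smooth_gain(frame * 0.1)

    # ML denoising with the causal, low-latency model.
    denoised = denoise_model(frame)

    # Postprocessing: smooth gain changes to avoid audible pumping.
    return smooth_gain(denoised)
```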
8. Detailed Design Decisions
Hybrid DSP + ML: use a DSP front-end to cheaply attenuate simple stationary noise (reducing the work left for the model), and an ML back-end for complex/non-stationary noise.
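As an example of the DSP front-end, a minimal single-channel spectral-subtraction pass; estimating the noise from the first few frames and the over-subtraction factor are simplifying assumptions, not a production design:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x: np.ndarray, fs: int, noise_frames: int = 10,
                      alpha: float = 1.5, floor: float = 0.05) -> np.ndarray:
    """Simple spectral subtraction: estimate the noise magnitude from the first
    frames (assumed speech-free), subtract it, and keep the noisy phase."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)

    # Noise magnitude estimate from the leading frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Over-subtract and clamp to a spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)

    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y[:len(x)]
```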
Model choices:
- Real-time: RNNoise-inspired RNN or small causal Conv models; quantized int8 model for CPU.
- Batch: Demucs or Conv-TasNet for best perceptual quality; non-causal, larger networks.
Latency optimization:
- Frame size: 20–40 ms with overlap-add; smaller frames reduce latency at the cost of frequency resolution (see the framing sketch after this list).
- Use causal architectures for real-time.
- Frame-level batching in the server to exploit SIMD/TPU parallelism without increasing per-stream latency.
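The framing sketch referenced above: a causal 50% overlap-add loop around a per-frame denoiser. The 20 ms frame / 10 ms hop at 16 kHz and the sqrt-Hann analysis/synthesis windows are assumed example values; reconstruction at 50% overlap is approximate here, which is acceptable for a sketch:

```python
import numpy as np

def stream_overlap_add(hops, denoise_frame, hop=160):
    """Causal 50% overlap-add at 16 kHz (assumed): 20 ms frames, 10 ms hop.
    `hops` yields hop-sized sample chunks; output lags input by one hop (10 ms)."""
    frame_len = 2 * hop
    window = np.sqrt(np.hanning(frame_len))   # sqrt-Hann analysis + synthesis
    prev_hop = np.zeros(hop)   # previous input hop (first half of current frame)
    carry = np.zeros(hop)      # second half of the previous denoised frame

    for chunk in hops:                              # chunk has `hop` samples
        frame = np.concatenate([prev_hop, chunk]) * window
        denoised = denoise_frame(frame) * window    # apply synthesis window
        yield carry + denoised[:hop]                # overlap-add the completed hop
        carry = denoised[hop:]
        prev_hop = chunk
```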
Resource optimization:
- Distillation, pruning, and quantization for edge deployment (int8 sketch below).
- Mixed-precision for GPU batch inference.
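A hedged int8 example using PyTorch post-training dynamic quantization. `TinyDenoiser` is a made-up placeholder for the real-time model; the quantized layer set reflects what dynamic quantization supports (Linear and recurrent layers):

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder stand-in for a small causal real-time denoiser."""
    def __init__(self, feat=257, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, feat)

    def forward(self, x):                   # x: (batch, frames, feat)
        h, _ = self.gru(x)
        return torch.sigmoid(self.mask(h))  # per-bin suppression mask

model = TinyDenoiser().eval()

# Post-training dynamic quantization of the weight-heavy layers to int8.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.GRU}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "denoiser_int8.pt")
```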
Model serving:
- gRPC servers with pinned threads for deterministic latency.
- Autoscaling groups based on CPU/GPU utilization and request rates.
A/B testing / rollout:
- Shadow traffic and canary evaluation using objective metrics and sampled human MOS.
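A minimal sketch of the objective-metric gate for canary/shadow evaluation: run the production and candidate models over held-out clips with clean references and only promote if PESQ does not regress. The open-source `pesq` package and the allowed-drop threshold are assumptions:

```python
import numpy as np
from pesq import pesq   # pip install pesq (P.862 implementation)

def canary_gate(samples, fs=16000, max_pesq_drop=0.05):
    """samples: iterable of (clean_ref, prod_output, candidate_output) arrays at `fs`."""
    prod_scores, cand_scores = [], []
    for ref, prod_out, cand_out in samples:
        prod_scores.append(pesq(fs, ref, prod_out, "wb"))
        cand_scores.append(pesq(fs, ref, cand_out, "wb"))
    delta = np.mean(cand_scores) - np.mean(prod_scores)
    # Promote only if the candidate is no worse than the small allowed drop.
    return delta >= -max_pesq_drop, delta
```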
9. Bottlenecks & Scaling
Potential bottlenecks:
- Inference latency: mitigate with smaller models, quantization, dedicated inference instances, CPU vectorization.
- Network bandwidth: transmit compressed frames; perform on-device preprocessing to reduce payload.
- Cold-start and scaling: keep warm pools for low-latency path; use fast autoscaling for batch workloads.
- Quality vs. compute: high-quality models require GPUs; isolate batch GPU clusters to avoid impacting real-time.
- Monitoring and labeled data: collecting quality labels at scale is costly; use periodic human tests plus automated metrics.
Scaling strategies:
- Horizontal scaling of inference containers, partitioning by customer or region.
- Edge inference for mobile clients to offload server load and improve privacy.
- Adaptive model selection: route to a light or heavy model depending on resource availability and the audio complexity detected by a classifier, as sketched below.
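A small sketch of that routing decision; the complexity classifier is reduced here to a spectral-flatness heuristic, and the thresholds and queue-depth signal are illustrative assumptions:

```python
import numpy as np

def spectral_flatness(frame: np.ndarray) -> float:
    """Geometric-to-arithmetic mean ratio of the power spectrum: near 1 for
    noise-like frames, near 0 for tonal or clean speech."""
    psd = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(psd))) / np.mean(psd))

def choose_model(frame: np.ndarray, gpu_queue_depth: int,
                 flatness_threshold: float = 0.4, max_queue: int = 8) -> str:
    """Route easy frames (or an overloaded fleet) to the light model."""
    if gpu_queue_depth > max_queue:
        return "light"            # shed load under pressure
    if spectral_flatness(frame) > flatness_threshold:
        return "heavy"            # noisy/complex audio: spend more compute
    return "light"
```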
10. Follow-up Questions / Extensions
- Personalization: adapt models to speaker voice over time.
- Multi-microphone support: beamforming + denoising.
- Dereverberation and separation of multiple speakers.
- On-device training or federated learning for privacy-preserving improvements.
- Support for music and complex acoustic scenes.
- Cost model: estimate per-hour GPU costs vs. CPU inference cost per stream.
11. Wrap-up
Design a hybrid system that uses DSP for low-latency prefiltering and small causal ML models for live inference, while using larger non-causal ML models in the batch path for high quality. Instrument with objective and subjective metrics, provide APIs for streaming and batch, and plan deployment and scaling using autoscaling inference fleets and GPU-backed batch workers.