1. The Question
Design a cloud file storage service (like Dropbox or Google Drive) with the following core functionality:
- Upload files from any device
- Download files from any device
- Share files with other users and view shared files
- Automatically sync files across a user's devices
Non-functional priorities:
- Highly available (favor availability over strict consistency)
- Support files up to 50GB
- Secure and reliable (recover from loss/corruption)
- Low latency for upload/download/sync
Out of scope for this exercise: in-browser editing, viewing without download, blob store internals, per-user quotas, virus scanning, and full versioning.
2. Clarifying Questions
- Are users authenticated? (Yes — assume standard auth/JWT.)
- Is multi-user collaboration / simultaneous editing required? (No; handle concurrent edits with conflict resolution rather than real-time merging.)
- Is strong consistency required for all reads? (No — eventual consistency is acceptable; prefer availability.)
- Max file size? (Up to 50GB.)
- Target scale (users, files, daily ops)? (We provide scale estimates below.)
These answers shape choices such as presigned URLs, client-side chunking, and eventually consistent replication.
3. Requirements
Functional (core):
- Upload file
- Download file
- Share file with users
- Sync across devices (both directions)
Non-functional (core):
- High availability (AP priority)
- Support files up to 50GB
- Secure (encryption in transit & at rest) and recoverable
- Low latency; use CDN for downloads
APIs (example):
- POST /files -> create file metadata and initiate an upload session (returns fileId or presigned URL)
- POST /files/presigned-url -> request presigned URLs for uploads (multipart/chunked)
- PATCH /files/{fileId}/chunks -> report uploaded chunk status
- GET /files/{fileId} -> download metadata or a short-lived signed download URL
- POST /files/{fileId}/share -> add users to ACL
- GET /files/changes?since= -> fetch metadata changes
Authentication via JWT/headers; server enforces ACLs.
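For illustration, a minimal sketch of the JWT check, assuming a standard `Authorization: Bearer` header and the jsonwebtoken library; the claim fields here are assumptions, not fixed by this design:

```ts
import jwt from "jsonwebtoken";

// Hypothetical claim shape; the real layout depends on the auth provider.
interface Claims {
  sub: string; // userId
  exp: number; // expiry (seconds since epoch)
}

// Extracts and verifies the bearer token; throws if it is missing or invalid.
function authenticate(authorizationHeader: string | undefined, secret: string): Claims {
  if (!authorizationHeader?.startsWith("Bearer ")) {
    throw new Error("missing bearer token");
  }
  const token = authorizationHeader.slice("Bearer ".length);
  return jwt.verify(token, secret) as Claims; // checks signature and expiry
}
```

ACL enforcement then happens in the File Service using the verified userId (`sub`).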
4. Scale Estimates
Example target scale (pick realistic numbers):
- 100M users
- 10% active daily = 10M daily active users
- Average files per active user: 200
- Average file size: 10MB (many small, some large up to 50GB)
- Peak uploads per second: ~3k–10k (depends on rollout)
- Peak downloads per second: ~10k–50k (most served from the CDN)
The design therefore needs a massive object store (S3-like), a metadata DB sharded by user, a CDN for reads, and scalable upload gateways.
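A quick back-of-envelope check on these numbers, assuming roughly 200 files per user across the full 100M user base (order-of-magnitude only):

```ts
// Rough storage and ingest estimates derived from the figures above.
const users = 100e6;
const filesPerUser = 200;
const avgFileSizeMB = 10;

const totalFiles = users * filesPerUser;                    // 2e10 files
const totalStoragePB = (totalFiles * avgFileSizeMB) / 1e9;  // ~200 PB before replication and dedup

const peakUploadsPerSec = 10_000;
const peakIngestGBps = (peakUploadsPerSec * avgFileSizeMB) / 1_000; // ~100 GB/s if every upload were a full 10MB file

console.log({ totalFiles, totalStoragePB, peakIngestGBps });
```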
5. Data Model
Keep metadata lightweight; store blobs in external object store (S3).
Suggested document model (NoSQL such as DynamoDB, or SQL with appropriate indexes):
```
FileMetadata {
  id: string            // fileId; can be a content fingerprint
  name: string
  ownerUserId: string
  size: number
  mimeType: string
  createdAt: timestamp
  updatedAt: timestamp
  status: enum("uploading", "uploaded", "deleted")
  storagePath: string   // S3 key or logical path
  chunks: [{ chunkId: string, etag?: string, status: "not-uploaded" | "uploaded" }]
  acl: [{ userId: string, role: "viewer" | "editor" }]
}

User {
  id: string
  email: string
  devices: [deviceId]
}
```
Sharing is modeled via ACL in the file metadata or a separate share table mapping fileId -> userId(s). Use indexes to fetch files by user.
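One possible key layout for a DynamoDB-style metadata store is sketched below; the table and index names are illustrative assumptions rather than part of the design above:

```ts
// Primary table: partitioned by owner, so "list my files" is a single-partition query.
//   PK: ownerUserId, SK: fileId -> FileMetadata item
//
// Share table (or a GSI): partitioned by the user a file is shared with,
// so "list files shared with me" is also a single-partition query.
//   PK: sharedWithUserId, SK: fileId -> ShareRecord

interface ShareRecord {
  sharedWithUserId: string; // partition key
  fileId: string;           // sort key
  ownerUserId: string;
  role: "viewer" | "editor";
}

// Listing a user's visible files then becomes two cheap queries:
// 1) query the primary table by ownerUserId
// 2) query the share table by sharedWithUserId, then batch-get the referenced metadata
```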
6. API Design
Minimal APIs (examples):
- POST /files
  Request: { fileId?, name, size, mimeType }
  Response: { fileId, uploadSessionId, recommendedChunkSize }
- POST /files/presigned-url
  Request: { uploadSessionId, chunkIndex }
  Response: { url }
- PATCH /files/{fileId}/chunks
  Request: { uploadSessionId, chunkIndex, etag }
  Response: { status }
- POST /files/{fileId}/complete
  Request: { uploadSessionId }
  Response: { status }
- GET /files/{fileId}
  Response: { metadata, signedDownloadUrl }
- POST /files/{fileId}/share
  Request: { emails: [] }
  Response: { status }
- GET /files/changes?since={timestamp}
  Response: { [FileMetadata] }
Notes: servers issue short-lived presigned URLs so clients talk directly to the object store for large payloads; clients manage chunking and resumability.
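A client-side sketch of the chunked, resumable upload flow against the endpoints above; error handling and parallel uploads are omitted, and `api` is an assumed helper that attaches auth headers and picks the right HTTP method:

```ts
const CHUNK_SIZE = 8 * 1024 * 1024; // 8 MiB; in practice use the server's recommendedChunkSize

async function uploadFile(file: File, api: (path: string, body?: unknown) => Promise<any>): Promise<string> {
  // 1. Register metadata and open an upload session.
  const { fileId, uploadSessionId } = await api("/files", {
    name: file.name, size: file.size, mimeType: file.type,
  });

  // 2. Upload each chunk directly to the object store via a presigned URL.
  const chunkCount = Math.ceil(file.size / CHUNK_SIZE);
  for (let i = 0; i < chunkCount; i++) {
    const blob = file.slice(i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
    const { url } = await api("/files/presigned-url", { uploadSessionId, chunkIndex: i });
    const res = await fetch(url, { method: "PUT", body: blob });

    // 3. Report the chunk so the server can track progress and support resume.
    //    (The ETag header must be exposed by the bucket's CORS config.)
    await api(`/files/${fileId}/chunks`, { uploadSessionId, chunkIndex: i, etag: res.headers.get("ETag") });
  }

  // 4. Ask the server to validate all parts and finalize the object.
  await api(`/files/${fileId}/complete`, { uploadSessionId });
  return fileId;
}
```

On reconnect, the client can re-read the per-chunk statuses from the metadata and resume from the first chunk not marked uploaded.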
7. High-Level Architecture
Components:
- Clients: web/mobile/desktop with a sync agent that watches local FS and coordinates uploads/downloads.
- Load Balancer & API Gateway: routes API calls, terminates TLS, rate limiting.
- Auth Service: validates JWTs and issues signed tokens.
- File Service (metadata service): handles metadata, ACL checks, issues presigned URLs, coordinates multipart completion.
- Object Store (S3 or equivalent): stores file chunks / assembled objects.
- CDN (CloudFront or similar): caches downloads near users.
- Metadata DB: NoSQL (DynamoDB/Cassandra) or sharded SQL for metadata, indexing by user.
- Notification Service: WebSocket/SSE cluster or push notifications to signal changes to clients.
- Sync Coordinator: handles device sync logic, conflict detection/resolution policies.
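A sketch of how a client might consume change notifications, combining the real-time and polling paths; the /notifications/stream SSE endpoint is an assumption, while the changes endpoint is the one from the API section:

```ts
// Hybrid change feed: SSE while a connection can be held, periodic polling as the fallback.
function watchChanges(onChange: (changes: unknown[]) => void): void {
  let lastSync = new Date().toISOString();

  async function poll(): Promise<void> {
    const res = await fetch(`/files/changes?since=${encodeURIComponent(lastSync)}`);
    const changed = await res.json();
    if (changed.length > 0) onChange(changed);
    lastSync = new Date().toISOString();
  }

  if (typeof EventSource !== "undefined") {
    // Hypothetical SSE endpoint fed by the Notification Service.
    const source = new EventSource("/notifications/stream");
    source.onmessage = (event) => onChange([JSON.parse(event.data)]);
    source.onerror = () => source.close(); // the polling path below picks up the slack
  }
  setInterval(poll, 30_000); // the "stale files" polling path
}
```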
Flow (upload): client requests presigned URLs -> uploads chunks directly to S3 -> reports chunk etags to File Service -> on completion File Service validates parts and marks metadata uploaded -> CDN invalidation / notify devices.
Flow (download): client asks for a file -> server validates ACL -> issues a short-lived signed URL served via the CDN or directly from S3.
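A server-side sketch of that download path, assuming the AWS SDK v3 presigner; the metadata lookup is a placeholder for whatever the metadata store provides:

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// Hypothetical metadata lookup; resolves to null if the user is not on the ACL.
declare function loadMetadataIfAllowed(fileId: string, userId: string): Promise<{ storagePath: string } | null>;

async function getDownloadUrl(fileId: string, userId: string): Promise<string> {
  const meta = await loadMetadataIfAllowed(fileId, userId); // ACL enforced here
  if (!meta) throw new Error("forbidden");

  // Short-lived signed URL so the client fetches directly from S3 (or via the CDN).
  return getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: "file-blobs", Key: meta.storagePath }),
    { expiresIn: 300 } // seconds
  );
}
```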
8. Detailed Design Decisions
- Storage: use managed object store (S3) for blobs; store metadata in DynamoDB/Postgres.
- Uploads: presigned URLs + client-side chunking (multipart upload) for resumability and to offload traffic.
- Identification: fileId can be a content fingerprint (SHA-256) to detect duplicates; chunks are also fingerprinted to support resumable uploads and chunk-level dedup (see the sketch after this list).
- Consistency: eventual consistency for replication; strong consistency not required for most reads. For metadata updates (ACL changes) use transactional DB ops where needed.
- Sync: hybrid real-time + polling. Use WebSockets/SSE for active files and polling for stale files.
- Conflict resolution: last-write-wins by default; optionally keep copies (out of scope) for versioning.
- Security: HTTPS in transit; server-side encryption at rest; short-lived signed URLs; validate ACL on metadata service.
- Performance: use CDN for reads; parallel chunk uploads; adaptive chunk size based on network.
- Reliability: multipart upload state persisted in metadata DB; periodic cleanup for abandoned uploads; replication and backups for metadata DB.
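To make the fingerprinting point concrete, a sketch of per-chunk hashing on the client (Node's crypto module here; a browser client would use WebCrypto instead):

```ts
import { createHash } from "crypto";

// SHA-256 per chunk lets the server skip chunks it already has (resume / dedup),
// and a hash over the whole file can double as a content-addressed fileId.
function fingerprintChunks(data: Buffer, chunkSize: number): { index: number; sha256: string }[] {
  const fingerprints: { index: number; sha256: string }[] = [];
  for (let offset = 0, index = 0; offset < data.length; offset += chunkSize, index++) {
    const chunk = data.subarray(offset, offset + chunkSize);
    fingerprints.push({ index, sha256: createHash("sha256").update(chunk).digest("hex") });
  }
  return fingerprints;
}

// Before requesting presigned URLs, the client can send these fingerprints and
// upload only the chunks the server reports as missing.
```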
9. Bottlenecks & Scaling
- Metadata DB hot keys: shard by userId and/or fileId. Use read replicas and caching for hot metadata.
- Upload gateways: presigned URLs reduce load; scale API servers horizontally for metadata operations.
- Object store limits: rely on cloud provider's scale (S3); enforce multipart uploads for large objects.
- Notification & WebSocket scale: partition connections by user hash; fall back to mobile push notifications where holding a persistent WebSocket per device is impractical.
- CDN cache churn: large frequently-changing files may cause cache misses; tune cache-control and invalidation.
- Large file uploads on mobile/unstable networks: rely on resumable chunked uploads and client retry logic.
Mitigations: shard indexes, autoscale services, rate-limit clients, backoff & retry, proactive cleanup jobs, and monitoring/alerting on latency and error rates.
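As one concrete mitigation, a small retry helper with exponential backoff and jitter, as might wrap each chunk upload; the parameters and the uploadChunk helper in the usage comment are illustrative:

```ts
// Retries a flaky operation (e.g. a single chunk PUT) with exponential backoff plus jitter.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 5, baseDelayMs = 500): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random()); // jittered backoff
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Example: await withRetry(() => uploadChunk(uploadSessionId, chunkIndex, blob));
```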
10. Follow-up Questions / Extensions
- Add versioning: store immutable object per version and reference current version in metadata.
- Add deduplication: dedupe at chunk level using chunk fingerprints to save bandwidth and storage.
- Add end-to-end encryption: manage encryption keys per user; consider how sharing works with encrypted blobs.
- Add stronger collaboration: real-time document editing requires operational transforms or CRDTs and stronger consistency for small edits.
- Add quotas & billing: per-user storage accounting, quotas, and soft/hard limits.
- Add virus scanning: scan uploads before completing multipart assembly; consider async flow and UX impact.
11. Wrap-up
Summary:
- Use client-side chunking + presigned URLs to support large, resumable uploads.
- Store blobs in a scalable object store and metadata in a sharded DB.
- Use CDN for fast downloads and a hybrid real-time/polling system for sync notifications.
- Prioritize availability and low latency; ensure security with HTTPS, signed URLs, and at-rest encryption.
This design satisfies the core functional requirements while keeping the system scalable and resilient; many details (versioning, E2E encryption, malware scanning) are natural extensions.