
Design Dropbox / Google Drive

HARD · 12 min · Tags: files, storage, upload, sync, distributed-systems, cdn, security
Asked at: Misc

Design a cloud file storage service supporting upload, download, sharing and device sync for files up to 50GB with high availability and low latency.

1. The Question

Design a cloud file storage service (like Dropbox or Google Drive) with the following core functionality:

  • Upload files from any device
  • Download files from any device
  • Share files with other users and view shared files
  • Automatically sync files across a user's devices

Non-functional priorities:

  • Highly available (favor availability over strict consistency)
  • Support files up to 50GB
  • Secure and reliable (recover from loss/corruption)
  • Low latency for upload/download/sync

Out of scope for this exercise: in-browser editing, viewing without download, blob store internals, per-user quotas, virus scanning, and full versioning.

2. Clarifying Questions

  • Are users authenticated? (Yes — assume standard auth/JWT.)
  • Is multi-user collaboration / simultaneous editing required? (No — handle concurrent edits with conflict resolution instead.)
  • Is strong consistency required for all reads? (No — eventual consistency is acceptable; prefer availability.)
  • Max file size? (Up to 50GB.)
  • Target scale (users, files, daily ops)? (We provide scale estimates below.)

These answers shape design choices such as signed URLs, chunking, and eventual consistency for replication.

3. Requirements

Functional (core):

  • Upload file
  • Download file
  • Share file with users
  • Sync across devices (both directions)

Non-functional (core):

  • High availability (AP priority)
  • Support files up to 50GB
  • Secure (encryption in transit & at rest) and recoverable
  • Low latency; use CDN for downloads

APIs (example):

  • POST /files -> initiate metadata (returns fileId or presigned URL)
  • POST /files/presigned-url -> request presigned URLs for uploads (multipart/chunked)
  • PATCH /files/{fileId}/chunks -> report uploaded chunk status
  • GET /files/{fileId} -> download metadata or a short-lived signed download URL
  • POST /files/{fileId}/share -> add users to ACL
  • GET /files/changes?since= -> fetch metadata changes

Authentication via JWT/headers; server enforces ACLs.

4. Scale Estimates

Example target scale (pick realistic numbers):

  • 100M users
  • 10% active daily = 10M daily active users
  • Average files per active user: 200
  • Average file size: 10MB (many small, some large up to 50GB)
  • Peak uploads per second: ~3k–10k (depends on rollout)
  • Peak downloads per second: ~10k–50k (largely absorbed by the CDN)

The design must therefore combine a massive object store (S3-like), a metadata DB sharded by user, a CDN for reads, and scalable upload gateways.
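As a sanity check, the figures above can be multiplied out. This is a rough estimate assuming the 200-files average applies only to daily-active users (assumption: inactive users' storage is ignored here):

```python
# Back-of-envelope storage estimate from the scale figures above.
users = 100_000_000
daily_active = int(users * 0.10)          # 10M daily active users
files_per_active_user = 200
avg_file_mb = 10

total_files = daily_active * files_per_active_user           # 2 billion files
total_storage_pb = total_files * avg_file_mb / 1_000_000_000 # MB -> PB (decimal)

print(total_files)        # 2,000,000,000 files
print(total_storage_pb)   # 20.0 PB before replication
```

With 3x replication that is roughly 60 PB, which is why the blob tier must be a purpose-built object store rather than a database.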

5. Data Model

Keep metadata lightweight; store blobs in external object store (S3).

Suggested document model (NoSQL such as DynamoDB, or SQL with appropriate indexes):

FileMetadata {
  id: string            // fileId; can be a content fingerprint
  name: string
  ownerUserId: string
  size: number
  mimeType: string
  createdAt: timestamp
  updatedAt: timestamp
  status: enum("uploading", "uploaded", "deleted")
  storagePath: string   // S3 key or logical path
  chunks: [{ chunkId: string, etag?: string, status: "not-uploaded" | "uploaded" }]
  acl: [{ userId: string, role: "viewer" | "editor" }]
}

User {
  id: string
  email: string
  devices: [deviceId]
}

Sharing is modeled via ACL in the file metadata or a separate share table mapping fileId -> userId(s). Use indexes to fetch files by user.
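As a sketch, the records above map naturally onto typed structures. Field names follow the model above; `Chunk` and `AclEntry` are illustrative helper types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Chunk:
    chunk_id: str
    status: str = "not-uploaded"   # "not-uploaded" | "uploaded"
    etag: Optional[str] = None     # set once the object store confirms the part

@dataclass
class AclEntry:
    user_id: str
    role: str                      # "viewer" | "editor"

@dataclass
class FileMetadata:
    id: str                        # fileId; can be a content fingerprint
    name: str
    owner_user_id: str
    size: int
    mime_type: str
    created_at: float
    updated_at: float
    status: str = "uploading"      # "uploading" | "uploaded" | "deleted"
    storage_path: str = ""         # S3 key or logical path
    chunks: List[Chunk] = field(default_factory=list)
    acl: List[AclEntry] = field(default_factory=list)
```

New files start in `status="uploading"`; the File Service flips the status to `"uploaded"` only after every chunk reports an etag.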

6. API Design

Minimal APIs (examples):

  • POST /files
    Request: { fileId?, name, size, mimeType }
    Response: { fileId, uploadSessionId, recommendedChunkSize }

  • POST /files/presigned-url
    Request: { uploadSessionId, chunkIndex }
    Response: { url }

  • PATCH /files/{fileId}/chunks
    Request: { uploadSessionId, chunkIndex, etag }
    Response: { status }

  • POST /files/{fileId}/complete
    Request: { uploadSessionId }
    Response: { status }

  • GET /files/{fileId}
    Response: { metadata, signedDownloadUrl }

  • POST /files/{fileId}/share
    Request: { emails: [] }
    Response: { status }

  • GET /files/changes?since={timestamp}
    Response: [FileMetadata]

Notes: the server issues short-lived presigned URLs so clients transfer large payloads directly to and from the object store, keeping bulk traffic off the API tier. Clients manage chunking and resumability.
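The client-side chunking described above can be sketched as a pure planning step; `CHUNK_SIZE` is an assumed default (real clients adapt it to network conditions), and each planned part would then be PUT to its own presigned URL:

```python
CHUNK_SIZE = 8 * 1024 * 1024  # assumed 8 MiB default part size

def plan_chunks(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Split a file of file_size bytes into (index, offset, length) parts.

    Each part maps to one presigned-URL upload. On resume, the client
    re-plans with the same chunk_size and skips parts the server
    already reports as uploaded.
    """
    parts = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        parts.append((len(parts), offset, length))
        offset += length
    return parts
```

A 20 MiB file yields three parts (8 + 8 + 4 MiB); a 50 GB file yields a few thousand, which is why progress must be tracked per chunk in the metadata DB rather than per file.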

7. High-Level Architecture

Components:

  • Clients: web/mobile/desktop with a sync agent that watches local FS and coordinates uploads/downloads.
  • Load Balancer & API Gateway: routes API calls, terminates TLS, rate limiting.
  • Auth Service: validates JWTs and issues signed tokens.
  • File Service (metadata service): handles metadata, ACL checks, issues presigned URLs, coordinates multipart completion.
  • Object Store (S3 or equivalent): stores file chunks / assembled objects.
  • CDN (CloudFront or similar): caches downloads near users.
  • Metadata DB: NoSQL (DynamoDB/Cassandra) or sharded SQL for metadata, indexing by user.
  • Notification Service: WebSocket/SSE cluster or push notifications to signal changes to clients.
  • Sync Coordinator: handles device sync logic, conflict detection/resolution policies.

Flow (upload): client requests presigned URLs -> uploads chunks directly to S3 -> reports chunk etags to File Service -> on completion File Service validates parts and marks metadata uploaded -> CDN invalidation / notify devices.

Flow (download): client asks for file -> server validates ACL -> issues short-lived signed URL pointing to CDN/S3 or returns CDN URL.

8. Detailed Design Decisions

  • Storage: use managed object store (S3) for blobs; store metadata in DynamoDB/Postgres.
  • Uploads: presigned URLs + client-side chunking (multipart upload) for resumability and to offload traffic.
  • Identification: fileId can be content fingerprint (SHA-256) to detect duplicates; chunks also fingerprinted for resumable uploads.
  • Consistency: eventual consistency for replication; strong consistency not required for most reads. For metadata updates (ACL changes) use transactional DB ops where needed.
  • Sync: hybrid real-time + polling. Use WebSockets/SSE for active files and polling for stale files.
  • Conflict resolution: last-write-wins by default; optionally keep copies (out of scope) for versioning.
  • Security: HTTPS in transit; server-side encryption at rest; short-lived signed URLs; validate ACL on metadata service.
  • Performance: use CDN for reads; parallel chunk uploads; adaptive chunk size based on network.
  • Reliability: multipart upload state persisted in metadata DB; periodic cleanup for abandoned uploads; replication and backups for metadata DB.
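The last-write-wins policy above is a one-line comparison on the metadata's update timestamp. This sketch adds `device_id` as a tiebreaker (an assumption) so that exact-tie concurrent writes resolve identically on every replica:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    updated_at: float   # timestamp of the write
    device_id: str      # tiebreaker: makes tie resolution deterministic everywhere

def resolve_lww(a: Version, b: Version) -> Version:
    """Last-write-wins: higher timestamp wins; device_id breaks exact ties."""
    return max(a, b, key=lambda v: (v.updated_at, v.device_id))
```

Without a deterministic tiebreaker, two replicas could each keep a different "winner" and never converge; the losing version can optionally be saved as a conflict copy, as Dropbox does.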

9. Bottlenecks & Scaling

  • Metadata DB hot keys: shard by userId and/or fileId. Use read replicas and caching for hot metadata.
  • Upload gateways: presigned URLs reduce load; scale API servers horizontally for metadata operations.
  • Object store limits: rely on cloud provider's scale (S3); enforce multipart uploads for large objects.
  • Notification & WebSocket scale: partition by user hash; use push notifications when WS cannot scale.
  • CDN cache churn: large frequently-changing files may cause cache misses; tune cache-control and invalidation.
  • Large file uploads on mobile/unstable networks: rely on resumable chunked uploads and client retry logic.

Mitigations: shard indexes, autoscale services, rate-limit clients, backoff & retry, proactive cleanup jobs, and monitoring/alerting on latency and error rates.
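The backoff-and-retry mitigation is typically capped exponential backoff with jitter; the base and cap values here are assumptions:

```python
import random

def backoff_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0,
                   jitter: bool = True):
    """Yield one delay per retry: min(cap, base * 2^n), with full jitter.

    Jitter spreads retries out so a fleet of clients recovering from the
    same outage does not hammer the service in synchronized waves.
    """
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        yield random.uniform(0, delay) if jitter else delay
```

A client retrying a failed chunk upload would sleep for each yielded delay before re-requesting a presigned URL for that chunk.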

10. Follow-up Questions / Extensions

  • Add versioning: store immutable object per version and reference current version in metadata.
  • Add deduplication: dedupe at chunk level using chunk fingerprints to save bandwidth and storage.
  • Add end-to-end encryption: manage encryption keys per user; consider how sharing works with encrypted blobs.
  • Add stronger collaboration: real-time document editing requires operational transforms or CRDTs and stronger consistency for small edits.
  • Add quotas & billing: per-user storage accounting, quotas, and soft/hard limits.
  • Add virus scanning: scan uploads before completing multipart assembly; consider async flow and UX impact.
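Chunk-level dedup from the extensions above hinges on content addressing: two chunks with the same SHA-256 fingerprint are stored once. A minimal in-memory sketch of the bookkeeping (a real system would keep the fingerprint index in the metadata DB):

```python
import hashlib

class ChunkStore:
    """Content-addressed chunk store: identical chunks are stored once."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        """Store a chunk and return its fingerprint (the dedup key)."""
        fp = hashlib.sha256(chunk).hexdigest()
        self._blobs.setdefault(fp, chunk)   # no-op if already stored
        return fp

    def unique_chunks(self) -> int:
        return len(self._blobs)
```

The same fingerprints also power bandwidth savings: a client can send only hashes first, and upload just the chunks the server has never seen.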

11. Wrap-up

Summary:

  • Use client-side chunking + presigned URLs to support large, resumable uploads.
  • Store blobs in a scalable object store and metadata in a sharded DB.
  • Use CDN for fast downloads and a hybrid real-time/polling system for sync notifications.
  • Prioritize availability and low latency; ensure security with HTTPS, signed URLs, and at-rest encryption.

This design satisfies the core functional requirements while keeping the system scalable and resilient; many details (versioning, E2E encryption, malware scanning) are natural extensions.
