How does a video streaming service deliver smooth playback?

Adaptive Bitrate Streaming (HLS or DASH). The video is encoded into multiple resolutions and chunked into 2–6 second segments. The client measures network speed and picks the right bitrate per segment, smoothly switching as conditions change.

How do you store petabytes of video cheaply?

Hot videos go to a multi-region object store with high egress (S3 + CloudFront, or specialized video CDNs). Cold videos move to lower-cost tiers. Frequently watched videos cache at CDN edges and ISP-level caches; only the long tail hits origin.

How is recommendation handled at scale?

Two stages: candidate generation (fast, broad — collaborative filtering, content-based) narrows billions of videos to thousands; ranking (slower, precise — learned ranker model) picks the top tens. Both run as offline-trained, online-served services with feature stores.

Design YouTube / Video Streaming — A System Design Walkthrough

Video streaming at YouTube scale touches every interesting problem: massive uploads, transcoding pipelines, petabyte storage, global CDN delivery, adaptive bitrate, and ML-driven recommendations. Here’s how I’d design it end-to-end.

Requirements

Functional

Upload a video.
Watch a video at the best bitrate the network supports.
Search.
Recommendations.
Likes, comments, subscribers.

Non-functional

Read-heavy. Watch:upload ratio of 1000:1 is conservative.
Sub-second playback start at p95.
Smooth bitrate adaptation as network conditions change.
High availability — broken playback is fatal.
Cheap storage at PB scale.

Out of scope

Auth, billing, ad serving.

Capacity

Metric	Number
MAU	2.7B
DAU	1.5B
Hours uploaded per minute	500
Hours watched per day	1B+
Concurrent live viewers (peak event)	100M
Avg upload size	~500 MB
Storage growth per day	~50 PB (raw + variants)

Yes, that’s “petabytes per day.” Storage is the most expensive line item; CDN bandwidth is second.

API surface (sketch)

POST /api/upload (multipart or chunked)
  → returns video_id + upload_url for resumable upload

GET /api/video/{id}
  → metadata + manifest URL

GET /watch/{id}.m3u8
  → HLS manifest (segment list)

GET /segments/{id}/{quality}/seg-{n}.ts
  → individual video chunk

GET /api/recommend?video_id={id}
  → list of recommended videos

The watch path is dominated by static segment files served from CDN.

Upload pipeline

Client
  │ chunked upload (resumable)
  ▼
Edge upload gateway
  │
  ▼
Raw object store (S3 / GCS / R2)  ← single source of truth for the original
  │
  ▼  (event)
Transcode queue (Kafka / SQS)
  │
  ▼
Transcoder workers (GPU)
  │ per quality variant + per segment
  ▼
Variant object store (HLS/DASH segments)
  │
  ▼
CDN origin shield → CDN PoPs → users

Steps:

1. Resumable upload via tus.io or signed multipart. A 500 MB upload over flaky 4G works.
2. Antivirus / format validation on the raw upload.
3. Transcode fan-out. One source video → variants:
   - 144p / 240p / 360p / 480p / 720p / 1080p / 4K
   - HLS (.m3u8 manifest + .ts segments) and DASH variants
   - 2–6 second segments
4. Generate manifests. Master playlist points to per-quality playlists; each per-quality is a list of segments.
5. Push to CDN origin shield. Pre-warm popular regions.
6. Update video metadata (status: ready).
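The resumable upload in the first step can be sketched as a chunk plan plus a "what is still missing" query. This is a minimal illustration in the style of part-numbered multipart uploads, not the tus.io wire protocol; `Chunk`, `plan_chunks`, `remaining_chunks`, and the 8 MB chunk size are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset


def plan_chunks(total_bytes: int, chunk_bytes: int) -> list:
    """Split an upload into fixed-size chunks; the last one may be shorter."""
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes must be > 0")
    return [
        Chunk(i, start, min(start + chunk_bytes, total_bytes))
        for i, start in enumerate(range(0, total_bytes, chunk_bytes))
    ]


def remaining_chunks(chunks: list, acked: set) -> list:
    """Chunks the client still has to send after a dropped connection."""
    return [c for c in chunks if c.index not in acked]


MB = 1024 * 1024
chunks = plan_chunks(500 * MB, 8 * MB)       # assumed 8 MB chunk size
todo = remaining_chunks(chunks, acked={0, 1, 2})
```

After a disconnect the client asks the gateway which chunk indexes were acknowledged and resends only the rest, so a flaky link costs at most one chunk of repeated work.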

The transcoder pool is the largest ongoing compute cost. Run it on spot GPU instances; because jobs are idempotent and retryable, a preempted worker only costs a retry.
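One way to get idempotent, retryable jobs is to derive each job ID deterministically from its inputs, so a retried or duplicated queue message maps to the same output. The sketch below fans one source out into per-quality, per-packaging jobs. The ladder heights, the "never upscale" rule, the `source_etag` input, and all function names are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass

# Quality labels from the variant list above; pixel heights are assumed.
LADDER = [
    ("144p", 144), ("240p", 240), ("360p", 360), ("480p", 480),
    ("720p", 720), ("1080p", 1080), ("4K", 2160),
]
PACKAGINGS = ("hls", "dash")


@dataclass(frozen=True)
class TranscodeJob:
    job_id: str
    video_id: str
    quality: str
    packaging: str


def job_id_for(video_id: str, source_etag: str, quality: str, packaging: str) -> str:
    """Same inputs always hash to the same ID, so retries overwrite, not duplicate."""
    raw = f"{video_id}|{source_etag}|{quality}|{packaging}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


def fan_out(video_id: str, source_etag: str, source_height: int) -> list:
    """Emit one job per (quality, packaging), skipping qualities above the source."""
    jobs = []
    for label, height in LADDER:
        if height > source_height:
            continue  # assumption: never upscale beyond the source resolution
        for packaging in PACKAGINGS:
            job_id = job_id_for(video_id, source_etag, label, packaging)
            jobs.append(TranscodeJob(job_id, video_id, label, packaging))
    return jobs


jobs = fan_out("vid-123", "etag-abc", source_height=720)
```

Workers can write outputs under a path that includes `job_id`, which makes a second run of the same job a harmless overwrite.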

Storage tiers

Tier	Latency	Cost	Used for
CDN edge cache	<10ms	High	Hot videos
CDN origin shield	50ms	Medium	Recently popular
Hot object store (S3 / GCS Standard)	100ms	Medium	Uploads from the last 90 days; long tail of popular videos
Cold object store (S3 IA / Glacier-ish)	seconds	Low	Old videos rarely watched
Archive	minutes-hours	Lowest	Long-tail backups, originals

Movement between tiers is automated based on view counts. A video unwatched for 6 months drops to cold; a viral spike re-promotes to hot.
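The tiering rule can be sketched as a small policy function. The 180-day idle threshold mirrors the "6 months" above; the promotion threshold, the two-tier simplification, and the names `Tier` and `target_tier` are assumptions, and a real policy would also cover the archive tier.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    COLD = "cold"


COLD_AFTER = timedelta(days=180)   # "unwatched for 6 months"
PROMOTE_VIEWS_PER_DAY = 10_000     # hypothetical "viral spike" threshold


def target_tier(current: Tier, last_viewed_at: datetime,
                views_last_24h: int, now: datetime) -> Tier:
    """Demote idle hot videos; re-promote cold videos only on a real spike."""
    idle = now - last_viewed_at
    if current is Tier.HOT and idle >= COLD_AFTER:
        return Tier.COLD
    if current is Tier.COLD and views_last_24h >= PROMOTE_VIEWS_PER_DAY:
        return Tier.HOT
    return current


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
demoted = target_tier(Tier.HOT, now - timedelta(days=200), 0, now)
promoted = target_tier(Tier.COLD, now - timedelta(hours=1), 50_000, now)
```

Using a view-count threshold for promotion, rather than any single view, avoids bouncing a video between tiers every time one person watches it.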

Adaptive Bitrate (HLS / DASH)

master.m3u8:
  #EXTM3U
  #EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240
  240p/playlist.m3u8
  #EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
  480p/playlist.m3u8
  #EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
  720p/playlist.m3u8
  #EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
  1080p/playlist.m3u8

Each per-quality playlist is a sequence of segments:

240p/playlist.m3u8:
  #EXTM3U
  #EXT-X-TARGETDURATION:6
  #EXTINF:6.0,
  seg-001.ts
  #EXTINF:6.0,
  seg-002.ts
  ...
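Playlists of this shape are simple enough to generate with string formatting. The following is a rough sketch of a VOD manifest writer, not a complete HLS packager; the function names and the fixed 6-second segment length are assumptions, and segment URIs are written relative to the playlist's own URL.

```python
import math


def master_playlist(variants: list) -> str:
    """variants: list of (label, bandwidth in bits/s, 'WIDTHxHEIGHT')."""
    lines = ["#EXTM3U"]
    for label, bandwidth, resolution in variants:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{label}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def media_playlist(duration_s: float, segment_s: float = 6.0) -> str:
    """Per-quality VOD playlist; the final segment may be shorter than the rest."""
    count = math.ceil(duration_s / segment_s)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",  # version 3 allows fractional EXTINF durations
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_s)}",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for n in range(1, count + 1):
        remaining = duration_s - (n - 1) * segment_s
        lines.append(f"#EXTINF:{min(segment_s, remaining):.1f},")
        lines.append(f"seg-{n:03d}.ts")  # resolved relative to this playlist's URL
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


master = master_playlist([
    ("240p", 400_000, "426x240"),
    ("480p", 1_000_000, "854x480"),
])
media = media_playlist(duration_s=15.0)
```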

The client (HTML5 <video> + hls.js, or AVPlayer on iOS, or ExoPlayer on Android) measures download speed per segment and picks the next quality. The server doesn’t decide bitrate — the client does.

This is why ABR is so robust: every client adapts to its own conditions.
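To make the client-side decision concrete, here is a throughput-only sketch: smooth the measured download speed, then pick the highest rung that fits under a safety margin. Production players such as hls.js or ExoPlayer also weigh buffer level and other signals, so treat this as an illustration only. The smoothing factor, the 0.8 safety margin, and all names are assumptions; the ladder values come from the master playlist above.

```python
LADDER_BPS = [400_000, 1_000_000, 3_000_000, 6_000_000]  # BANDWIDTH values from the master playlist


def measured_throughput_bps(segment_bytes: int, download_seconds: float) -> float:
    """Throughput observed while downloading one segment, in bits per second."""
    return segment_bytes * 8 / download_seconds


class ThroughputEstimator:
    """Exponentially weighted moving average of per-segment throughput."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha            # weight of the newest sample (assumed value)
        self.estimate_bps = None

    def update(self, sample_bps: float) -> float:
        if self.estimate_bps is None:
            self.estimate_bps = sample_bps
        else:
            self.estimate_bps = (
                self.alpha * sample_bps + (1 - self.alpha) * self.estimate_bps
            )
        return self.estimate_bps


def pick_bitrate(estimate_bps: float, ladder=LADDER_BPS, safety: float = 0.8) -> int:
    """Highest rung within the safety budget, else the lowest rung."""
    budget = estimate_bps * safety
    affordable = [b for b in ladder if b <= budget]
    return max(affordable) if affordable else min(ladder)


estimator = ThroughputEstimator()
first = estimator.update(measured_throughput_bps(750_000, 1.0))   # fast segment
second = estimator.update(1_000_000)                              # network degrades
next_quality = pick_bitrate(second)
```

The safety margin leaves headroom so that a small dip in throughput does not immediately stall playback.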

CDN strategy

Three layers:

CDN edge — closest to the user. Caches popular segments; <50ms.
Mid-tier / regional cache — caches per region; reduces origin load.
Origin — your object store. Last resort.

For YouTube specifically, the layered model is augmented with ISP-level caches (Google Global Cache servers placed inside ISPs). For most teams, S3 + CloudFront / Cloudflare Stream / Bunny.net is enough.

Cache keys include video ID + quality + segment number — heavily cacheable. A single popular video’s 4K stream might be served from cache 99.99% of the time.
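The layering and the cache key can be sketched with in-memory dictionaries standing in for each tier. This is a toy read-through cache with no eviction or TTLs; `cache_key`, `CacheTier`, and the `origin` function are hypothetical names for illustration.

```python
from typing import Callable


def cache_key(video_id: str, quality: str, segment: int) -> str:
    """Video ID + quality + segment number, as described above."""
    return f"{video_id}/{quality}/seg-{segment:03d}.ts"


class CacheTier:
    """Read-through cache: on a miss, fetch from the next tier up and keep a copy."""

    def __init__(self, name: str, upstream: Callable[[str], bytes]):
        self.name = name
        self.upstream = upstream
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        data = self.upstream(key)
        self.store[key] = data
        return data


origin_reads = []


def origin(key: str) -> bytes:
    """Stand-in for the object store; records every read that reaches it."""
    origin_reads.append(key)
    return b"segment-bytes-for-" + key.encode()


regional = CacheTier("regional", origin)
edge_a = CacheTier("edge-a", regional.get)
edge_b = CacheTier("edge-b", regional.get)

key = cache_key("abc123", "1080p", 1)
edge_a.get(key)   # misses edge and regional, reads origin once
edge_a.get(key)   # edge hit
edge_b.get(key)   # edge miss, regional hit, origin untouched
```

Three requests across two edges produce a single origin read, which is the effect the mid-tier exists to create.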

Live streaming

Adds two complications:

Latency target. “Low-latency HLS” gets you ~3-5s glass-to-glass. WebRTC gets you sub-second.
Fan-out. A single ingest source feeds millions of viewers.

Architecture:

Streamer (RTMP / WebRTC)
  ↓
Ingest (RTMP server / SFU)
  ↓
Transcoder (per-quality variants)
  ↓
Manifest writer (LL-HLS / DASH)
  ↓
Origin → CDN PoPs (edge transmuxing for LL-HLS)
  ↓
Viewers

Live ingest is one streamer; live distribution is millions of viewers. The fan-out is at the CDN; ingest is a small fleet.

Recommendations

The classic two-stage model:

1. Candidate generation

From billions of videos, narrow to ~thousands the user might want. Multiple sources:

Collaborative filtering. Users similar to you watched these.
Content-based. Videos similar to what you’ve watched.
Trending. Currently popular.
Subscriptions. Channels you follow.

These run offline (precomputed) or as fast online services. Each contributes a candidate pool.

2. Ranking

A learned ranker model (gradient-boosted trees or a small neural net) scores each candidate using features:

Recency, watch time, like ratio, similarity to user history.
Cross features (user × video).

Top N from the ranker → user’s home feed.
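The two stages can be sketched end to end in a few lines. Here a fixed linear scorer stands in for the learned ranker; the weights, the assumption that features are normalized to 0..1, and every name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    recency: float      # all features assumed normalized to the range 0..1
    watch_time: float
    like_ratio: float
    similarity: float


def generate_candidates(sources: dict, watched: set) -> list:
    """Stage 1: merge candidate pools, dropping duplicates and already-watched videos."""
    seen = set(watched)
    merged = []
    for pool in sources.values():
        for video_id in pool:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


def score(f: Features) -> float:
    """Stand-in for the learned ranker: a fixed linear model with made-up weights."""
    return 0.1 * f.recency + 0.4 * f.watch_time + 0.2 * f.like_ratio + 0.3 * f.similarity


def rank(candidates: list, features: dict, top_n: int) -> list:
    """Stage 2: score every candidate and keep the best top_n."""
    return sorted(candidates, key=lambda v: score(features[v]), reverse=True)[:top_n]


sources = {
    "collaborative": ["a", "b"],
    "content": ["b", "c"],
    "trending": ["d"],
    "subscriptions": ["a"],
}
features = {
    "a": Features(0.5, 0.5, 0.5, 0.5),
    "b": Features(1.0, 1.0, 1.0, 1.0),
    "c": Features(0.0, 0.0, 0.0, 0.0),
}
candidates = generate_candidates(sources, watched={"d"})
feed = rank(candidates, features, top_n=2)
```

The split matters for cost: the cheap merge runs over everything, while the expensive scorer only ever sees the survivors.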

Feedback loop

User watches → events to Kafka → train next model. Online learning + nightly retrains. Models versioned, A/B tested.

For the broader patterns, see Distributed Systems Fundamentals; Self-Hosted LLMs in 2026 covers similar ideas from the LLM side.

Search

Indexed by metadata (title, description, channel) into a search engine (Elasticsearch / OpenSearch).
Embeddings for semantic queries such as “show me videos about X.” See Build a RAG App with pgvector and FastAPI for the underlying pattern, applied at billion-row scale.
Personalization by injecting user-affinity features into ranking.

Comments and likes

These are write-heavy social features. Pattern:

Comments: Cassandra-style wide-row store keyed by video_id. Append-only writes. Pagination via last_comment_id.
Likes: counter incremented in Redis (write-behind to durable store every minute).
Notification fan-out for “your video has been commented on” — async via Kafka / NATS.
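Both patterns are easy to show in memory. In the sketch below a `Counter` stands in for Redis, a plain dict stands in for the durable store, and a sorted list stands in for one `video_id` partition of the comment store; all class and function names are hypothetical.

```python
from collections import Counter
from typing import Optional


class WriteBehindLikes:
    """Absorb like writes in a fast counter; persist them in periodic batches."""

    def __init__(self, durable: dict):
        self.durable = durable       # stand-in for the durable store
        self.pending = Counter()     # stand-in for Redis counters

    def like(self, video_id: str) -> None:
        self.pending[video_id] += 1

    def count(self, video_id: str) -> int:
        return self.durable.get(video_id, 0) + self.pending[video_id]

    def flush(self) -> int:
        """Run on a timer (every minute in the text); returns videos updated."""
        batch, self.pending = self.pending, Counter()
        for video_id, delta in batch.items():
            self.durable[video_id] = self.durable.get(video_id, 0) + delta
        return len(batch)


def page_comments(comments: list, last_comment_id: Optional[int], limit: int):
    """Cursor pagination over (comment_id, text) pairs sorted by ascending ID."""
    page = [c for c in comments
            if last_comment_id is None or c[0] > last_comment_id][:limit]
    next_cursor = page[-1][0] if len(page) == limit else None
    return page, next_cursor


durable = {}
likes = WriteBehindLikes(durable)
for _ in range(3):
    likes.like("v1")
flushed = likes.flush()

comments = [(1, "first"), (2, "nice"), (3, "thanks")]
page1, cursor = page_comments(comments, None, limit=2)
page2, end = page_comments(comments, cursor, limit=2)
```

The trade-off with write-behind is durability: likes that arrive after the last flush are lost if the counter node dies, which is usually acceptable for a like count.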

DRM and access control

For paid content:

Encrypted segments. Different keys per video; key delivery gated by license server.
Widevine / PlayReady / FairPlay for the major platforms.
Signed URLs for time-limited segment access.
Token rotation. Manifest URLs include a short-lived token.

Real DRM is complicated. For most non-premium use cases, signed URLs + encrypted-at-rest is enough.
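A time-limited signed URL can be sketched with an HMAC over the path and expiry. This is a generic scheme for illustration, not the signing format of CloudFront or any other CDN; the secret, the query parameter names, and the function names are assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-rotate-me"  # hypothetical; load from a secret store in practice


def _signature(path: str, expires_at: int, secret: bytes) -> str:
    message = f"{path}?expires={expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Append an expiry timestamp and an HMAC that covers path + expiry."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at, secret)})
    return f"{path}?{query}"


def verify_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Reject expired, malformed, or tampered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        provided = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _signature(parsed.path, expires_at, secret)
    return hmac.compare_digest(provided, expected)  # constant-time comparison


url = sign_url("/segments/abc123/720p/seg-001.ts", expires_at=1000)
```

Because the signature covers both the path and the expiry, a viewer cannot extend the lifetime or swap in a different quality or video without invalidating it.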

Operational realities

Storage is the dominant cost. Plan tier transitions aggressively.
CDN bandwidth is line item #2. Negotiate; consider multi-CDN.
Transcoding is bursty. Use spot capacity for the queue.
Hot-key problem. A viral video → CDN cache miss storms. Pre-warm + stagger keys.
Geo-blocking. Compliance requires per-country availability rules. Enforce them at the manifest layer.

Capacity arithmetic

For 1B watch hours/day, average bitrate ~3 Mbps:

~1.35 EB/day transferred (1.35 × 10^18 bytes: 10^9 hours × 3,600 s × 3 Mbps ÷ 8). Most is served from CDN edge cache.
Even at a 99% cache hit rate, that’s ~13.5 PB/day from origin → CDN. Plan accordingly.

For 50 PB/day stored, at $0.02/GB/month for hot tier, each day of uploads adds ~$1M/month of hot-storage cost, so after a month of accumulation the raw infrastructure cost is ~$30M/month and still growing. Tiering and deduplication are how the line item stays sustainable.
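A short script makes this arithmetic checkable from the stated inputs. It assumes decimal units (1 PB = 10^6 GB), a 30-day month, and that nothing is tiered down or deduplicated; the variable names are illustrative.

```python
WATCH_HOURS_PER_DAY = 1_000_000_000
AVG_BITRATE_BPS = 3_000_000
CACHE_MISS_PERCENT = 1              # i.e. a 99% cache hit rate
STORED_PB_PER_DAY = 50
HOT_PRICE_PER_GB_MONTH = 0.02       # USD
GB_PER_PB = 1_000_000               # decimal units
DAYS = 30

# Bandwidth: bits/s -> bytes per watch hour -> bytes per day.
bytes_per_watch_hour = AVG_BITRATE_BPS * 3600 // 8                 # 1.35 GB
egress_bytes_per_day = WATCH_HOURS_PER_DAY * bytes_per_watch_hour  # 1.35 EB
origin_bytes_per_day = egress_bytes_per_day * CACHE_MISS_PERCENT // 100  # 13.5 PB

# Storage: monthly hot-tier bill added by a single day of uploads,
# and the bill once 30 days of uploads have accumulated.
cost_added_per_upload_day = STORED_PB_PER_DAY * GB_PER_PB * HOT_PRICE_PER_GB_MONTH
monthly_bill_after_30_days = cost_added_per_upload_day * DAYS

print(f"egress/day: {egress_bytes_per_day / 1e18:.2f} EB")
print(f"origin/day: {origin_bytes_per_day / 1e15:.1f} PB")
print(f"hot storage after {DAYS} days: ${monthly_bill_after_30_days / 1e6:.0f}M/month")
```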

What interviewers love to dig into

“What if a video goes viral?” → CDN auto-scales; mid-tier caches absorb origin pressure; pre-warm popular regions.
“How do you prevent a single video from saturating CDN?” → Anycast; per-PoP rate limiting; diversion to peer caches.
“How is comment ordering handled?” → Either chronological with cursor pagination, or “top comments” via a learned ranker on engagement features.
“What if a transcoder fails mid-job?” → Idempotent jobs; retry with checkpointing; partial outputs discarded.
“How do you handle takedowns / DMCA?” → Soft delete in metadata; CDN invalidation; per-region geo-blocks.

What I’d actually build today

For a small video product (1k creators, 100k viewers):

Cloudflare Stream or Mux for video hosting + transcode + delivery (managed).
Postgres for metadata.
Redis for counters.
Kafka for events.
Postgres + pgvector for semantic search.

Skip the build-from-scratch transcoder until you outgrow managed. Mux/Stream get you to 1M users without thinking about transcoding pipelines.
