Introduction
I implemented video playback in a wgpu-based UI framework, so I’m documenting the pipeline and bandwidth costs.
The bottom line: the CPU decode → write_texture() approach uses about 249 MB/s of bandwidth at 1080p30fps. This isn’t a problem in practice, but there’s room for optimization with zero-copy. However, given the instability of the wgpu HAL and the amount of platform-specific code required, zero-copy isn’t something to tackle right now.
Video Playback Pipeline
The straightforward implementation looks like this:
動画ファイル
│
▼
┌─────────────────────┐
│ FFmpeg デコーダ │ CPU 上で動作
│ (libavcodec) │
└──────────┬──────────┘
│ YUV420P フレーム
▼
┌─────────────────────┐
│ swscale │ CPU 上でピクセルフォーマット変換
│ (YUV → RGBA) │
└──────────┬──────────┘
│ RGBA バッファ (8.3 MB @ 1080p)
▼
┌─────────────────────┐
│ write_texture() │ CPU → GPU 転送
│ (wgpu) │
└──────────┬──────────┘
│ GPU テクスチャ
▼
┌─────────────────────┐
│ レンダーパス │ GPU 上でテクスチャを描画
│ (wgpu) │
└─────────────────────┘
Every step runs every frame. Decoding and color conversion happen on the CPU, texture upload goes through the CPU→GPU bus, and rendering happens on the GPU.
Calculating Bandwidth Cost
Data size per frame for 1080p RGBA:
1920 × 1080 × 4 bytes (RGBA) = 8,294,400 bytes ≈ 8.3 MB/frame
Bandwidth by frame rate:
| Resolution | fps | Bandwidth |
|---|---|---|
| 1920×1080 | 30 | 249 MB/s |
| 1920×1080 | 60 | 498 MB/s |
| 3840×2160 | 30 | 995 MB/s |
| 3840×2160 | 60 | 1,990 MB/s |
249 MB/s at 1080p30fps. PCIe 3.0 x16 offers about 16 GB/s of bandwidth, so there’s plenty of headroom. But on mobile memory buses, 4K60fps gets tight.
Comparison with SVG Rasterization
Compared to the Stencil-then-Cover SVG animation approach I wrote about in the previous post, the transfer volume differs by orders of magnitude.
| SVG Animation (Stencil-then-Cover) | Video Playback | |
|---|---|---|
| What’s sent to GPU | Path control points (vertex data) | Full pixels (RGBA) |
| Transfer/frame | Hundreds of bytes to tens of KB | ~8.3 MB (1080p) |
| Bandwidth at 30fps | A few KB/s to hundreds of KB/s | 249 MB/s |
| GPU’s job | Rasterization (vertices → pixels) | Texture sampling only |
| CPU’s job | SMIL evaluation (light) | Decode + swscale (heavy) |
SVG sends “shape definitions” and lets the GPU rasterize them. Video sends “the pixels themselves.” The bandwidth cost differs by 3–4 orders of magnitude.
Why Bandwidth Cost Is Unavoidable for Video
Geometric data like SVG only needs control points sent over—the GPU handles rasterization. But for video, pixel data is the content itself.
SVG: 定義 (数 KB) → GPU がラスタライズ → ピクセル
動画: ピクセル (数 MB) → GPU にそのまま渡す → ピクセル
Video frames are a sequence of photographs. There’s no alternative where the GPU generates them from geometry. Compression codecs (H.264, VP9, AV1) help with storage and network transfer, but by the time you hand data to the GPU, you need decompressed pixel data.
In other words, as long as you use CPU decoding, the CPU→GPU bandwidth cost is structurally unavoidable.
Room for Reduction
You can’t eliminate the bandwidth cost entirely, but there are ways to reduce it.
Direct YUV Texture Upload
Instead of converting to RGBA with swscale, send YUV420P directly to the GPU and do the color conversion in a shader.
YUV420P: Y(1920×1080) + U(960×540) + V(960×540) = 3,110,400 bytes ≈ 3.0 MB
RGBA: 1920×1080×4 = 8,294,400 bytes ≈ 8.3 MB
YUV420P is about 36% the size of RGBA—a 60% reduction in transfer volume. It also eliminates the need for swscale on the CPU, reducing CPU load.
However, you need to write YUV→RGB conversion in the shader. The conversion matrix differs between BT.601 and BT.709, so you need an implementation that switches based on the codec’s metadata.
Texture Reuse
Rather than calling create_texture() every frame, pre-allocate textures and only update their contents with write_texture(). This eliminates texture allocation/deallocation costs. With double buffering, you can write the next frame while the GPU is still sampling the previous one.
The Ideal of Zero-Copy and the External Texture API
The ultimate optimization is zero-copy—eliminating the CPU→GPU transfer entirely.
【現状: CPU デコード方式】
動画ファイル → CPU デコード → CPU メモリ (RGBA) → write_texture() → GPU テクスチャ → 描画
~~~~~~~~~~~~~~~
この転送を消したい
【理想: ゼロコピー方式】
動画ファイル → HW デコーダ (GPU 上) → GPU テクスチャ → 描画
↑
デコーダが直接 GPU メモリに書く
CPU→GPU 転送がゼロ
Hardware decoders (VideoToolbox, MediaCodec, VAAPI, etc.) generate textures on the GPU. If these textures could be imported into wgpu, zero-copy would be achieved.
However, wgpu has no standard API for “importing externally created textures.” To do this, you’d need to drop down to the HAL layer and use create_texture_from_hal.
HW デコーダ
│
│ プラットフォーム固有のテクスチャハンドル
│ (Metal: MTLTexture, Vulkan: VkImage, etc.)
▼
wgpu HAL レイヤー
│
│ create_texture_from_hal()
│ ※ unsafe, バックエンド固有
▼
wgpu::Texture
│
│ 通常の wgpu API で使える
▼
レンダーパスで描画
create_texture_from_hal is treated as an internal wgpu API with no stability guarantees. It requires separate implementations for each backend (Metal / Vulkan / DX12).
Platform-Specific Realities
Every major platform has hardware decoders, but typical implementations today incur a GPU→CPU→GPU round trip.
【ありがちなパターン】
HW デコーダ (GPU) → CPU メモリにコピー → write_texture() → GPU テクスチャ → 描画
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~
GPU→CPU CPU→GPU
往復してしまっている
| Platform | HW Decoder | Decode Target | Why It Falls Back to CPU |
|---|---|---|---|
| macOS / iOS | VideoToolbox | GPU (Metal texture) | No API to directly import into wgpu |
| Android | MediaCodec | GPU (AHardwareBuffer) | Limited means to import via wgpu’s Vulkan backend |
| Linux | VAAPI / VDPAU | GPU (Vulkan / GL texture) | Same as above |
| WASM (Browser) | WebCodecs | GPU (VideoFrame) | Requires copyTo() to pull back to CPU |
On every platform, “decoding itself happens fast on the GPU, but passing data to wgpu requires going through CPU memory.”
There is a way to directly import a WebCodecs VideoFrame using WebGPU’s importExternalTexture(), but that’s the browser’s WebGPU API—not something you can use from Rust’s wgpu crate.
Conclusion
For now, the CPU decode → write_texture() approach is sufficient in practice. 249 MB/s of bandwidth at 1080p30fps is no problem in a PCIe environment.
Achieving zero-copy requires using wgpu HAL’s create_texture_from_hal to import platform-specific textures, but:
- The HAL API is unstable (may break with wgpu version upgrades)
- Requires separate implementations for Metal / Vulkan / DX12
- Hardware decoder integration code also differs per platform
Holding off is the right call until performance actually becomes a problem. Direct YUV upload and texture reuse can cut bandwidth by 60%, so exploring those first offers a much better cost-benefit ratio.