Video Understanding
Video understanding extends image understanding to the temporal dimension. Models must reason about motion, actions, and temporal structure across frames.
Video Representation
A video is a sequence of frames $\{I_t\}_{t=1}^T$, where each frame $I_t \in \mathbb{R}^{H \times W \times 3}$.
Key challenges vs. images:
- Temporal redundancy: adjacent frames are highly similar. Sampling strategies matter.
- Motion: objects and the camera move; temporal modeling must capture motion patterns.
- Long-range dependencies: recognizing some actions requires context spanning minutes.
- Computational cost: processing all frames is prohibitive; efficient sampling is essential.
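As a rough illustration of a sampling strategy, the sketch below splits a clip into equal-length segments and keeps the centre frame of each. The function name `sample_frame_indices` and the placeholder clip are assumptions made for this example; real pipelines often add random jitter within each segment during training.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick the centre frame of each of `num_samples` equal-length segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Segment boundaries in continuous frame coordinates.
    edges = np.linspace(0, num_frames, num_samples + 1)
    centres = (edges[:-1] + edges[1:]) / 2.0
    # Truncate to integer indices and clamp to the valid range.
    return np.minimum(centres.astype(int), num_frames - 1)

# Hypothetical placeholder clip with layout T x H x W x 3.
video = np.zeros((300, 8, 8, 3), dtype=np.uint8)
idx = sample_frame_indices(video.shape[0], 8)
clip = video[idx]  # 8 frames spread evenly over the 300-frame clip
```

Sampling by segment keeps coverage of the whole clip while discarding most of the redundant neighbouring frames.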
Optical Flow
Optical flow $\mathbf{u}(x, y) = (u, v)$ describes the apparent motion of each pixel between two consecutive frames.
Lucas-Kanade: assumes constant flow in a small neighborhood; solves a local least-squares system.
Horn-Schunck: global smoothness constraint; variational formulation.
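FlowNet / PWC-Net / RAFT: learned optical flow with CNNs. RAFT (2020) builds a 4D correlation volume over all pairs of pixels and iteratively refines the flow field with a recurrent update operator; it set the state of the art on standard benchmarks when published.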
Optical flow is used as an explicit motion feature in two-stream action recognition models.
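The Lucas-Kanade idea can be sketched in a few lines of NumPy. This is a minimal single-patch version under simplifying assumptions: grayscale input, one flow vector for the whole patch, small displacements, and no image pyramid. It linearizes brightness constancy as $I_x u + I_y v + I_t = 0$ at every pixel and solves the stacked system by least squares. The function name `lucas_kanade_patch` is hypothetical.

```python
import numpy as np

def lucas_kanade_patch(frame0: np.ndarray, frame1: np.ndarray) -> np.ndarray:
    """Estimate one (u, v) flow vector for a whole grayscale patch."""
    f0 = frame0.astype(np.float64)
    f1 = frame1.astype(np.float64)
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    Iy, Ix = np.gradient(f0)
    It = f1 - f0  # temporal difference between the two frames
    # One linearized brightness-constancy equation per pixel: Ix*u + Iy*v = -It.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow  # array([u, v]) in pixels per frame
```

In practice the same system is solved per pixel over a small window, and a coarse-to-fine pyramid is used to handle larger motions.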
Action Recognition
Classify the action performed in a video clip into one of $K$ categories.
Benchmarks: UCF-101 (101 classes, 13k clips), HMDB-51, Kinetics-400/600/700, Something-Something (temporal reasoning).
Two-Stream Networks
Simonyan & Zisserman (2014). Two parallel CNNs:
- Spatial stream: single RGB frame. Captures appearance.
- Temporal stream: stacked optical flow fields (e.g. 10 consecutive flow fields, each with a horizontal and a vertical component, giving 20 input channels). Captures motion.
Late fusion of softmax scores. Complementary information; fusion outperforms either stream alone.
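Late fusion by score averaging can be sketched as follows. The logits and the weight `w_temporal` are made-up illustrative values, and plain weighted averaging is only one possible fusion rule; a classifier trained on the stacked scores is another.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(spatial_logits, temporal_logits, w_temporal=0.5):
    """Weighted average of the per-stream class probabilities."""
    p_spatial = softmax(np.asarray(spatial_logits, dtype=float))
    p_temporal = softmax(np.asarray(temporal_logits, dtype=float))
    return (1.0 - w_temporal) * p_spatial + w_temporal * p_temporal

spatial = np.array([2.0, 1.9, 0.1])   # appearance stream: undecided between classes 0 and 1
temporal = np.array([0.2, 2.5, 0.1])  # motion stream: favours class 1
fused = late_fusion(spatial, temporal)
predicted_class = int(np.argmax(fused))
```

Here the spatial stream alone would narrowly pick class 0, while the fused scores pick class 1, which is the kind of complementary behaviour fusion is meant to exploit.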
3D Convolutions (C3D, I3D)
Replace 2D convolutions with 3D convolutions $W \in \mathbb{R}^{t \times k \times k \times C_\text{in} \times C_\text{out}}$ that operate on space and time jointly.
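I3D (Two-Stream Inflated 3D ConvNet): “inflate” a 2D ImageNet-pretrained model (Inception) to 3D by repeating each 2D filter $N$ times along the temporal axis and dividing by $N$. The rescaling ensures that a video made of one repeated image produces the same activations as the original 2D model, so the 3D filters start from a meaningful initialization.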
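A minimal sketch of the inflation step, assuming a kernel layout of `(k, k, C_in, C_out)` for 2D and `(t, k, k, C_in, C_out)` for 3D. The division by `t` follows the rescaling described in the I3D paper; the function name and the random test data are illustrative only.

```python
import numpy as np

def inflate_2d_kernel(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a (k, k, C_in, C_out) kernel to (t, k, k, C_in, C_out)."""
    # Repeat along a new leading time axis, then divide by t so that the
    # response to a temporally constant clip equals the 2D response.
    return np.repeat(w2d[None, ...], t, axis=0) / t

rng = np.random.default_rng(0)
w2d = rng.standard_normal((3, 3, 4, 8))
w3d = inflate_2d_kernel(w2d, t=3)

# "Boring video" check: the same image patch repeated over 3 frames.
patch = rng.standard_normal((3, 3, 4))         # k x k x C_in
clip = np.repeat(patch[None, ...], 3, axis=0)  # t x k x k x C_in
out2d = np.einsum("hwc,hwco->o", patch, w2d)   # 2D filter response
out3d = np.einsum("thwc,thwco->o", clip, w3d)  # inflated 3D filter response
```

The two responses agree on the static clip, which is exactly the property the rescaling is meant to preserve.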
Limitation: high memory and compute cost; short temporal windows (typically 16 to 64 frames per clip).
Efficient 3D Models
SlowFast: two pathways at different temporal resolutions.
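- Slow pathway: low frame rate (for example, 4 frames per clip, sampled with a large temporal stride); captures spatial semantics; many channels.
- Fast pathway: high frame rate (for example, 32 frames per clip, 8 times denser than the slow pathway); captures fine motion; few channels (about 1/8 of the slow pathway's).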
- Lateral connections fuse temporal information from fast to slow.
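The two temporal resolutions amount to sampling the same raw clip with two different strides. The sketch below assumes a slow-pathway stride `tau = 16` and a speed ratio `alpha = 8`, which are the defaults commonly quoted from the SlowFast paper; the function name and the 64-frame clip length are illustrative.

```python
import numpy as np

def slowfast_indices(clip_len: int, tau: int = 16, alpha: int = 8):
    """Return (slow, fast) frame indices for a clip of `clip_len` raw frames."""
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha")
    slow = np.arange(0, clip_len, tau)           # one frame every tau raw frames
    fast = np.arange(0, clip_len, tau // alpha)  # alpha times denser sampling
    return slow, fast

slow, fast = slowfast_indices(clip_len=64)
# slow has 4 indices, fast has 32 indices for this 64-frame clip
```

Because the fast pathway carries far fewer channels, processing 8 times more frames there adds only a modest share of the total compute.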
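X3D: family of efficient 3D CNNs obtained by progressively expanding a tiny 2D architecture one axis at a time (temporal duration, frame rate, spatial resolution, width, bottleneck width, depth), keeping at each step the expansion with the best accuracy-to-compute trade-off. Outperforms I3D with a fraction of the FLOPs.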
Video Transformers
TimeSformer (2021): apply standard ViT to video by factorizing attention.
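- Divided space-time attention: within each block, first attend over tokens at the same spatial location across frames (temporal attention), then over tokens within the same frame (spatial attention). With $T$ frames and $N$ patches per frame, each token is compared with about $N + T$ others instead of $N \cdot T$ under joint space-time attention, so cost grows as $O(NT(N + T))$ rather than $O(N^2 T^2)$.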
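A stripped-down sketch of divided attention on a `(T, N, D)` token grid, where `T` is the number of frames and `N` the number of patches per frame. To stay short it assumes a single head with Q = K = V = the input, and it omits the learned projections, layer norm, MLP, and classification token of the real model. The temporal step comes first, as in the TimeSformer paper; the per-token cost is the same in either order. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis (Q = K = V = x)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def divided_space_time_attention(tokens):
    """tokens: (T, N, D) with T frames and N spatial patches per frame."""
    # Temporal step: each patch location attends across the T frames.
    x = np.swapaxes(tokens, 0, 1)  # (N, T, D)
    x = x + attend(x)              # residual connection
    x = np.swapaxes(x, 0, 1)       # back to (T, N, D)
    # Spatial step: each frame attends across its own N patches.
    return x + attend(x)

def comparisons_per_token(T, N):
    """Keys each query is compared with under joint vs. divided attention."""
    return {"joint": T * N, "divided": T + N}
```

For 8 frames of 196 patches each, `comparisons_per_token(8, 196)` gives 1568 comparisons per token for joint attention against 204 for divided attention.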
Video Swin Transformer: 3D shifted window attention on video tubes (temporal extension of Swin). Offered a state-of-the-art accuracy/efficiency tradeoff on standard benchmarks when published (2021).
VideoMAE: mask 90% of video patches; reconstruct pixel values. High masking ratio is key because of temporal redundancy. Strong pretraining for action recognition.
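One way to build such a mask is tube masking, which the VideoMAE paper describes: the same spatial positions are masked in every time step, so a hidden patch cannot be recovered by copying it from a neighbouring frame. The sketch below is an illustration under assumed settings (an 8 x 14 x 14 token grid and a fixed seed); the function name is hypothetical.

```python
import numpy as np

def tube_mask(t: int, h: int, w: int, ratio: float = 0.9, seed: int = 0) -> np.ndarray:
    """Boolean mask of shape (t, h, w); True marks a masked token."""
    rng = np.random.default_rng(seed)
    num_spatial = h * w
    num_masked = int(round(ratio * num_spatial))
    # Choose the spatial positions once, then reuse them for every time step.
    masked = rng.choice(num_spatial, size=num_masked, replace=False)
    frame_mask = np.zeros(num_spatial, dtype=bool)
    frame_mask[masked] = True
    return np.broadcast_to(frame_mask.reshape(1, h, w), (t, h, w)).copy()

mask = tube_mask(t=8, h=14, w=14)
visible_tokens = int((~mask).sum())  # only these tokens are fed to the encoder
```

With a 90% ratio the encoder sees roughly a tenth of the tokens, which is also what makes pretraining comparatively cheap.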
Temporal Action Detection and Segmentation
Temporal action detection: find the start and end times of actions in an untrimmed video.
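Temporal action proposal networks: generate candidate action segments (analogous to an RPN for spatial detection), which a second stage classifies and refines. ActionFormer takes a single-stage, anchor-free route instead: a Transformer over multi-scale temporal features classifies every time step and regresses the distances to the action's start and end, with no separate proposal stage.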
Temporal action segmentation: label every frame with an action class. Used in procedure understanding, cooking videos, surgical workflow analysis.
MS-TCN (Multi-Stage Temporal Convolutional Network): 1D dilated convolutions over frame-level features; multiple refinement stages.
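The appeal of dilated 1D convolutions is that the receptive field grows exponentially with depth. The sketch below assumes kernel size 3 and a dilation that doubles at every layer, and it operates on a single-channel signal for brevity; real models convolve over multi-channel frame features. Both function names are hypothetical.

```python
import numpy as np

def dilated_conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int) -> np.ndarray:
    """'Same'-padded dilated 1D convolution (cross-correlation) over a (T,) signal."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)  # zero padding on both sides
    T = len(x)
    # Tap i reads the input shifted by i * dilation frames.
    return sum(kernel[i] * xp[i * dilation : i * dilation + T] for i in range(k))

def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Frames seen by one output when layer l uses dilation 2**l."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))
```

Under these assumptions, ten layers already cover `receptive_field(10)` = 2047 frames, which is why a shallow stack can label frames using context from a long stretch of video.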
Video Object Detection
Extend image detection to video, exploiting temporal consistency.
Approaches:
- Frame-by-frame: apply image detector per frame. Fast; inconsistent across frames.
- Feature aggregation: aggregate features from neighboring frames via attention or flow-warping. More stable; more compute.
- Tracking-by-detection: detect in key frames; propagate with a tracker.
Video Generation
VideoGAN / TGAN: extend GANs to generate short video clips.
VQ-VAE + Transformer: tokenize video into discrete codes, then train an autoregressive Transformer over the codes. Example: VideoGPT.
Video diffusion: extend latent diffusion to video. Key challenges: temporal consistency, long-range coherence.
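Sora (OpenAI, 2024): diffusion Transformer (DiT-style) operating on spacetime patches of a compressed latent video representation. Trained on videos of variable duration, resolution, and aspect ratio. Reported to generate videos up to about one minute long with largely coherent scene content and camera motion, although OpenAI notes that it can still get physical interactions wrong.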
Stable Video Diffusion (SVD): image-to-video diffusion model; generates short clips from a single reference image.
Video Captioning and Retrieval
Video captioning: generate a natural language description of a video clip.
Dense video captioning: caption all events in a long video.
Video-text retrieval: given a text query, retrieve the most relevant video from a database (or vice versa).
Models: CLIP4Clip extends CLIP to video; VideoCLIP; InternVideo. Text and video are projected into a shared embedding space; cosine similarity for retrieval.
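Retrieval in a shared embedding space reduces to a few array operations. The sketch below assumes the per-frame and text embeddings have already been produced by some encoder, and it aggregates frames by mean pooling, the simplest parameter-free option explored in CLIP4Clip. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def video_embedding(frame_embs):
    """Mean-pool per-frame embeddings (F, D) into one unit-length video vector."""
    return l2_normalize(frame_embs.mean(axis=0))

def rank_videos(text_emb, video_embs):
    """Return video indices sorted by decreasing cosine similarity to the text."""
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims), sims
```

Because both sides are unit-normalized, the dot product is the cosine similarity, and the whole database can be scored with one matrix-vector product.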
Evaluation
| Task | Metric |
|---|---|
| Action recognition | Top-1 / Top-5 accuracy |
| Temporal detection | mAP at various IoU thresholds |
| Video captioning | CIDEr, METEOR, BLEU |
| Video retrieval | Recall@1/5/10 |
| Video generation | FVD (Fréchet Video Distance), human evaluation |
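Two of these metrics are simple enough to sketch directly. Temporal IoU is the overlap measure behind the IoU thresholds used for temporal detection mAP, and Recall@K is the fraction of queries whose correct item appears among the top K results. The sketch assumes segments given as `(start, end)` in seconds and a square similarity matrix in which the correct item for query `i` is item `i`; function names are hypothetical.

```python
import numpy as np

def temporal_iou(pred, gt):
    """IoU of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to item j; the correct item for query i is i."""
    # Rank of the correct item = number of items scored strictly higher.
    correct = np.diag(sim)[:, None]
    ranks = (sim > correct).sum(axis=1)
    return float((ranks < k).mean())
```

A detection counts as a true positive at threshold 0.5 only if its temporal IoU with a ground-truth segment of the same class is at least 0.5; mAP is then averaged over classes, and often over several thresholds.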