Vision Transformers
Vision Transformers (ViT) apply the Transformer architecture directly to images, replacing convolutional inductive biases with global self-attention. They have become the dominant architecture for large-scale vision pretraining.
ViT: Vision Transformer
Dosovitskiy et al. (2020). “An Image is Worth 16x16 Words.”
Patch tokenization: divide the image into non-overlapping $P \times P$ patches. For a 224×224 image with $P=16$: $14 \times 14 = 196$ patches.
Patch embedding: flatten each patch to a vector; project linearly:
\[z_i = W_e \cdot \text{flatten}(x_i^P) + b_e, \quad z_i \in \mathbb{R}^D\]
[CLS] token: prepend a learnable class token $z_0$. Its final representation is used for classification.
Position embedding: add learnable 1D position embeddings to each patch embedding (standard absolute positions).
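The tokenization steps above (patch split, linear projection, [CLS] token, position embeddings) can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the function names `patchify` and `embed`, the random initialization, and the channels-last image layout are assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping P x P patches."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

def embed(image, W_e, b_e, cls_token, pos_embed, patch_size=16):
    """Return the (N + 1, D) token sequence that enters the encoder."""
    z = patchify(image, patch_size) @ W_e.T + b_e         # z_i = W_e flatten(x_i) + b_e
    z = np.concatenate([cls_token[None, :], z], axis=0)   # prepend [CLS] as z_0
    return z + pos_embed                                  # add learnable 1D positions

# Hypothetical shapes for a 224x224 RGB image with P = 16 and D = 768.
rng = np.random.default_rng(0)
D, P = 768, 16
image = rng.standard_normal((224, 224, 3))
W_e = rng.standard_normal((D, P * P * 3)) * 0.02
b_e = np.zeros(D)
cls_token = np.zeros(D)
pos_embed = rng.standard_normal((14 * 14 + 1, D)) * 0.02

tokens = embed(image, W_e, b_e, cls_token, pos_embed)
print(tokens.shape)  # (197, 768): 196 patches plus [CLS]
```

In practice the same projection is often implemented as a convolution with kernel size and stride both equal to $P$, which is mathematically equivalent.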
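Transformer encoder: $L$ layers, each applying multi-head self-attention followed by an MLP. Both sub-blocks sit inside a residual connection, with LayerNorm applied before each one (pre-LN).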
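One way to picture a single pre-LN encoder layer is the NumPy sketch below. It is a simplified illustration under stated assumptions: biases and the learnable LayerNorm scale/shift are omitted, GELU uses the tanh approximation, the MLP hidden width is 4×D, and the parameter names (`Wqkv`, `Wo`, `W1`, `W2`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learnable scale/shift here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mhsa(x, Wqkv, Wo, num_heads):
    """Multi-head self-attention over an (N, D) token sequence."""
    N, D = x.shape
    d = D // num_heads
    qkv = (x @ Wqkv).reshape(N, 3, num_heads, d).transpose(1, 2, 0, 3)  # (3, h, N, d)
    q, k, v = qkv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))               # (h, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)                   # merge heads
    return out @ Wo

def encoder_block(x, p, num_heads):
    """Pre-LN block: x + MHSA(LN(x)), then x + MLP(LN(x))."""
    x = x + mhsa(layer_norm(x), p["Wqkv"], p["Wo"], num_heads)
    return x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]

# Toy configuration (much smaller than a real ViT) to keep the example fast.
rng = np.random.default_rng(0)
N, D, heads, L = 197, 64, 4, 2

def init_block():
    s = 0.02
    return {"Wqkv": rng.standard_normal((D, 3 * D)) * s,
            "Wo": rng.standard_normal((D, D)) * s,
            "W1": rng.standard_normal((D, 4 * D)) * s,
            "W2": rng.standard_normal((4 * D, D)) * s}

x = rng.standard_normal((N, D))
for p in [init_block() for _ in range(L)]:
    x = encoder_block(x, p, heads)
print(x.shape)  # (197, 64): the sequence shape is preserved by every layer
```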
Classification head: linear layer on $z_0^L$ (the final [CLS] representation).
\[p(y|x) = \text{softmax}(W_\text{cls} z_0^L)\]
ViT Variants
| Model | Layers | Hidden dim | Heads | Params | Notes |
|---|---|---|---|---|---|
| ViT-S/16 | 12 | 384 | 6 | 22M | Small |
| ViT-B/16 | 12 | 768 | 12 | 86M | Base |
| ViT-L/16 | 24 | 1024 | 16 | 307M | Large |
| ViT-H/14 | 32 | 1280 | 16 | 632M | Huge |
The /16 or /14 notation denotes the patch size $P$.
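The parameter counts in the table can be approximately reproduced from the layer count, hidden dimension, and patch size. The sketch below is a back-of-the-envelope count under assumptions not stated in the table: a 224×224 RGB input, an MLP hidden width of 4×D, biases on every linear layer, a final LayerNorm, and a 1000-class linear head. The function name `vit_param_count` is made up for this example.

```python
def vit_param_count(layers, dim, patch, image=224, channels=3,
                    num_classes=1000, mlp_ratio=4):
    """Rough parameter count for a plain ViT classifier (assumptions in the text above)."""
    tokens = (image // patch) ** 2 + 1                   # patches + [CLS]
    patch_embed = patch * patch * channels * dim + dim   # linear projection + bias
    pos_and_cls = tokens * dim + dim                     # position embeddings + [CLS]
    attn = 4 * dim * dim + 4 * dim                       # Q, K, V, output projections
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden + hidden * dim + dim     # two linear layers
    norms = 2 * 2 * dim                                  # two LayerNorms per layer
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return (patch_embed + pos_and_cls
            + layers * (attn + mlp + norms) + final_norm + head)

variants = {
    "ViT-S/16": (12, 384, 16),
    "ViT-B/16": (12, 768, 16),
    "ViT-L/16": (24, 1024, 16),
    "ViT-H/14": (32, 1280, 14),
}
counts = {name: vit_param_count(*cfg) for name, cfg in variants.items()}
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.1f}M")
```

Under these assumptions the estimates come out at roughly 22M, 87M, 304M, and 632M. The small gaps to the table (for example ViT-L) most likely come from different conventions for the head and input resolution. Note that the head count does not enter the total: the heads split $D$ between them, so the projection sizes are unchanged.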
Key finding: ViT requires large-scale pretraining (JFT-300M or ImageNet-21k) to match CNN performance. With sufficient data, ViT surpasses CNNs.
Data-Efficient ViTs
DeiT (Data-Efficient Image Transformers)
Touvron et al. (2021). Trains ViT on ImageNet only (no JFT) using:
- Knowledge distillation: a CNN teacher (RegNet) supervises a dedicated distillation token added alongside [CLS].
- Heavy augmentation: CutMix, Mixup, RandAugment, Random Erasing.
- Repeated augmentation: include several differently augmented copies of each image in the same batch.
With distillation, DeiT-B is competitive with ViT-B/16 pretrained on JFT-300M while training on ImageNet-1k only.
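The role of the distillation token can be illustrated with the hard-label variant of the DeiT objective: the [CLS] head is trained against the ground-truth label and the distillation head against the teacher's predicted class, with equal weight. The sketch below assumes precomputed logits. The function names are hypothetical, and the random logits only stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Mean cross-entropy between (B, K) logits and (B,) integer targets."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Equal-weight sum: [CLS] head vs. true labels, distillation head vs. teacher argmax."""
    teacher_labels = teacher_logits.argmax(axis=-1)   # teacher's hard decision
    return (0.5 * cross_entropy(cls_logits, labels)
            + 0.5 * cross_entropy(dist_logits, teacher_labels))

# Stand-in logits for a batch of 8 images and 1000 classes.
rng = np.random.default_rng(0)
B, K = 8, 1000
labels = rng.integers(0, K, size=B)
loss = hard_distillation_loss(rng.standard_normal((B, K)),
                              rng.standard_normal((B, K)),
                              rng.standard_normal((B, K)),
                              labels)
print(float(loss))
```

At inference time the predictions of the two heads are typically combined, for example by averaging their softmax outputs.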
BEiT (BERT Pre-Training of Image Transformers)
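Masked image modeling pretraining: mask a subset of patches (BEiT masks them in blocks) and predict the discrete visual tokens that a pretrained image tokenizer (the discrete VAE from DALL-E) assigns to the masked positions. Analogous to BERT's masked language modeling (MLM).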
MAE (Masked Autoencoders)
He et al. (2022). Mask 75% of patches; reconstruct pixel values of masked patches with a lightweight decoder.
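Key insight: images are spatially redundant, so a high masking ratio is needed to make reconstruction a nontrivial task, and a pixel-level target needs far less decoder capacity than predicting semantically rich language tokens. MAE therefore uses an asymmetric design: a large encoder that sees only the visible patches and a lightweight decoder that reconstructs the masked ones. The approach is simple, scalable, and produces strong representations.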
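The masking step is simple enough to sketch directly. The example below drops 75% of the patch tokens uniformly at random and passes only the remaining ones on, which is why the encoder cost shrinks to about a quarter of the full sequence. It is a per-image NumPy sketch; the function name `random_masking` and its return convention are assumptions for illustration.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens.

    Returns the visible tokens, their indices, and a boolean mask
    (True = masked, i.e. to be reconstructed by the decoder).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = tokens.shape[0]
    num_keep = int(N * (1 - mask_ratio))
    order = rng.permutation(N)              # random order of patch indices
    keep_idx = np.sort(order[:num_keep])    # the first num_keep stay visible
    mask = np.ones(N, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 768))   # 14 x 14 patches, D = 768
visible, keep_idx, mask = random_masking(patch_tokens, 0.75, rng)
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

The encoder would run on `visible` only. The decoder then receives the encoded visible tokens plus a shared mask token at each masked position, and the reconstruction loss is computed on the masked patches.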
DINO / DINOv2
Self-distillation with no labels. A student network is trained to match the output distribution of a momentum-updated (EMA) teacher on different augmented views of the same image. DINOv2 scales to ViT-g (about 1B params) trained on a curated 142M-image dataset; its frozen features perform strongly on classification, depth estimation, segmentation, and retrieval without fine-tuning the backbone.
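The two core ingredients of DINO-style self-distillation, the cross-entropy between teacher and student distributions and the momentum update of the teacher, can be sketched as follows. This is a simplified NumPy illustration: multi-crop augmentation is left out, the temperatures and momentum values are illustrative defaults, parameters are represented as a plain dict of arrays, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy from the centered, sharpened teacher to the student."""
    # The teacher output is treated as a fixed target (no gradient flows through it).
    t = np.exp(log_softmax((teacher_logits - center) / tau_t))
    log_s = log_softmax(student_logits / tau_s)
    return -(t * log_s).sum(axis=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used to discourage collapse."""
    return momentum * center + (1 - momentum) * teacher_logits.mean(axis=0)

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k] for k in teacher}

rng = np.random.default_rng(0)
B, K = 4, 256                                  # batch size, projection-head outputs
s_out = rng.standard_normal((B, K))            # student on view 1 (stand-in values)
t_out = rng.standard_normal((B, K))            # teacher on view 2 (stand-in values)
center = np.zeros(K)

loss = dino_loss(s_out, t_out, center)
center = update_center(center, t_out)
teacher = ema_update({"w": np.ones(3)}, {"w": np.zeros(3)})
print(float(loss), teacher["w"])
```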
Efficient ViT Variants
Swin Transformer
Liu et al. (2021). Hierarchical Transformer with shifted windows.
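Window self-attention: restrict attention to local $M \times M$ windows. For an $H \times W$ token grid, attention cost is $O(M^2 \cdot HW)$ instead of $O((HW)^2)$, i.e. linear rather than quadratic in the number of tokens for fixed $M$.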
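To make the complexity gap concrete, the sketch below evaluates the per-layer cost formulas given in the Swin paper for global and windowed multi-head self-attention, where `h` and `w` are the token-grid size and `C` the channel dimension. The function names and the example configuration (a 56×56 grid, C = 96, M = 7) are chosen for illustration.

```python
def global_msa_flops(h, w, C):
    """Projections (4hwC^2) plus attention over all token pairs (2(hw)^2 C)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def window_msa_flops(h, w, C, M):
    """Same projections, but each token only attends within its M x M window."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56   # token grid
C, M = 96, 7
g = global_msa_flops(h, w, C)
l = window_msa_flops(h, w, C, M)
print(f"global: {g / 1e9:.2f} G, windowed: {l / 1e9:.3f} G, ratio: {g / l:.1f}x")
```

The attention term alone shrinks by a factor of $HW / M^2$ (64 in this configuration). The overall ratio is smaller because the linear projections cost the same in both cases.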
Shifted windows: alternate between regular and shifted windows to enable cross-window connections.
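Window partitioning and the shift can be sketched with a reshape and a cyclic roll. The example below is a minimal NumPy illustration for a single channels-last feature map. It omits the attention mask that keeps tokens wrapped around by the roll from attending to each other, and the function names are assumptions.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into (num_windows, M*M, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def shifted_window_partition(x, M):
    """Cyclically shift the map by M // 2 in both directions, then partition."""
    shift = M // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, M)

# Tiny 8x8 map with one channel so window contents are easy to inspect.
x = np.arange(64, dtype=float).reshape(8, 8, 1)
regular = window_partition(x, 4)
shifted = shifted_window_partition(x, 4)
print(regular.shape, shifted.shape)  # (4, 16, 1) (4, 16, 1)
```

In the regular layout the first window covers rows and columns 0 to 3. After the shift it covers rows and columns 2 to 5, so it spans all four of the original windows. Alternating the two layouts across consecutive layers is what creates the cross-window connections.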
Hierarchical stages: 4 stages with patch merging between stages (2×2 merge → halve resolution, double channels). Produces multi-scale feature maps like a CNN; plug-in for FPN-based detection/segmentation.
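Patch merging can be sketched as a space-to-depth rearrangement followed by a linear layer: each 2×2 group of neighboring tokens is concatenated into a 4C-dimensional vector and projected down to 2C. The NumPy example below is a simplified illustration. It leaves out the LayerNorm that precedes the projection, and `W_reduce` is a hypothetical weight matrix.

```python
import numpy as np

def patch_merging(x, W_reduce):
    """Merge 2x2 neighboring tokens: (H, W, C) -> (H/2, W/2, 2C)."""
    H, W, C = x.shape
    # Group each 2x2 neighborhood: (H/2, 2, W/2, 2, C) -> (H/2, W/2, 2, 2, C)
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)   # concatenate the four tokens
    return x @ W_reduce                    # linear projection 4C -> 2C

rng = np.random.default_rng(0)
feat = rng.standard_normal((56, 56, 96))             # example stage-1 feature map
W_reduce = rng.standard_normal((4 * 96, 2 * 96)) * 0.02
merged = patch_merging(feat, W_reduce)
print(merged.shape)  # (28, 28, 192)
```

Applying this between stages three times takes a 56×56×C map to 28×28×2C, 14×14×4C, and 7×7×8C, which mirrors the stride-4/8/16/32 pyramid of a typical CNN backbone.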
Swin-B achieves 83.5% ImageNet top-1 (224×224 input, ImageNet-1k training) and has served as a backbone for state-of-the-art detection and segmentation models.
PVT (Pyramid Vision Transformer)
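Produces multi-scale features with spatial-reduction attention: downsample K and V by a factor $R$ along each spatial dimension before self-attention, which cuts the attention cost by roughly $R^2$ while the queries keep full resolution. The follow-up PVT v2 reports accuracy comparable to Swin at similar model sizes.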
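The effect of spatial reduction on the attention matrix can be seen in a small single-head sketch. Here the reduction is done with R×R average pooling as a stand-in, which is a simplification: PVT uses a learned reduction of the key/value tokens. The function name, weight names, and sizes are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduction_attention(x, H, W, R, Wq, Wk, Wv):
    """Single-head attention where K and V come from an R-times coarser grid.

    x has shape (H*W, D). Queries keep full resolution.
    """
    D = x.shape[1]
    q = x @ Wq                                                 # (HW, D)
    # Stand-in reduction (assumption): average-pool each R x R neighborhood.
    reduced = x.reshape(H // R, R, W // R, R, D).mean(axis=(1, 3)).reshape(-1, D)
    k, v = reduced @ Wk, reduced @ Wv                          # (HW / R^2, D)
    attn = softmax(q @ k.T / np.sqrt(D))                       # (HW, HW / R^2)
    return attn @ v, attn

rng = np.random.default_rng(0)
H = W = 16
R, D = 4, 32
x = rng.standard_normal((H * W, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out, attn = spatial_reduction_attention(x, H, W, R, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (256, 32) (256, 16)
```

The attention matrix has $HW \times HW/R^2$ entries instead of $HW \times HW$, so both memory and compute for the attention term drop by $R^2$.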
MixFormer / ConvNeXt v2 (Hybrid models)
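Combine convolutions and attention. Convolutions capture local patterns efficiently; attention captures global context. Hybrids often outperform pure ViT at the same parameter count when training data is limited. Note that ConvNeXt V2 itself is a pure ConvNet: it adopts Transformer-era design choices and masked-autoencoder pretraining rather than attention layers.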
ViT for Dense Prediction
ViT produces a single-scale feature map (no hierarchy). Adaptations for detection/segmentation:
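ViTDet (Li et al. 2022): use a plain ViT backbone (e.g. ViT-L) with windowed attention for efficiency, plus 4 global attention blocks spread through the network to propagate information across windows. A simple feature pyramid built from the final single-scale feature map supplies multi-scale features. Achieves strong detection results without a hierarchical design.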
SAM (Segment Anything Model): ViT-H image encoder + lightweight prompt encoder + mask decoder. Trained on SA-1B (over 1 billion masks on 11 million images). Enables zero-shot segmentation from point, box, or mask prompts; text prompts were explored only as a proof of concept.
Attention Patterns in ViT
Self-attention heads in ViT learn semantically meaningful patterns:
- Some heads attend globally; some locally.
- The [CLS] token tends to attend to semantically relevant patches (foreground objects) in deeper layers.
- Different heads in the same layer capture complementary aspects (shape, texture, semantic region).
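Attention rollout: average each layer's attention over heads, add the identity to account for residual connections, and multiply the resulting matrices across layers. The product estimates how much each input patch contributes to each token at the final layer.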
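Attention rollout is short enough to write out in full. The sketch below averages attention over heads, mixes in the identity matrix to model the residual connection, re-normalizes the rows, and multiplies the per-layer matrices together. The input format (a list of per-layer arrays of shape `(heads, N, N)`) and the random stand-in attention maps are assumptions for the example.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention maps into one (N, N) input-attribution matrix.

    attentions: list of arrays with shape (heads, N, N), ordered from first to last layer.
    """
    N = attentions[0].shape[-1]
    rollout = np.eye(N)
    for A in attentions:
        A = A.mean(axis=0)                      # average over heads
        A = 0.5 * A + 0.5 * np.eye(N)           # account for the residual connection
        A = A / A.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = A @ rollout                   # compose with earlier layers
    return rollout

# Stand-in attention maps: 12 layers, 12 heads, 197 tokens ([CLS] + 196 patches).
rng = np.random.default_rng(0)
layers, heads, N = 12, 12, 197
logits = rng.standard_normal((layers, heads, N, N))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rollout = attention_rollout(list(probs))
cls_map = rollout[0, 1:].reshape(14, 14)   # [CLS] row, reshaped to the patch grid
print(cls_map.shape)  # (14, 14)
```

With attention maps taken from a trained model, `cls_map` can be upsampled to the image resolution and overlaid as a heatmap.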
Computational Comparison
| Property | CNN (ResNet) | ViT |
|---|---|---|
| Inductive biases | Strong (locality, equivariance) | Weak |
| Data efficiency | High | Lower without pretraining |
| Long-range dependencies | Limited (receptive field grows gradually with depth) | Excellent (global attention in every layer) |
| Multi-scale features | Native (feature hierarchy, FPN) | Requires adaptation |
| Inference memory | $O(HW)$ | $O((HW/P^2)^2)$ with naive attention |
| Training at scale | Strong | Better with scale |
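The quadratic memory term in the table is easy to quantify: with naive attention, each head materializes an $N \times N$ matrix, where $N = HW/P^2$ (plus one for [CLS]). The helper below is a small illustrative calculation; the function name and the example resolutions are assumptions.

```python
def attention_matrix_entries(height, width, patch, heads=1, cls_token=True):
    """Entries in the naive attention matrices of one layer for one image."""
    n = (height // patch) * (width // patch) + int(cls_token)
    return heads * n * n

small = attention_matrix_entries(224, 224, 16)     # 197 tokens
large = attention_matrix_entries(1024, 1024, 16)   # 4097 tokens
print(small, large, round(large / small))
```

Going from a 224×224 to a 1024×1024 input multiplies the pixel count by about 21 but the attention matrix size by over 400, which is why dense-prediction variants rely on windowed or reduced attention.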