Image Classification
Image classification assigns a label from a fixed set of categories to an input image. It is the canonical computer vision benchmark and the entry point for most pretrained models.
Problem Formulation
Given an image $x \in \mathbb{R}^{H \times W \times 3}$ and a label set $\mathcal{Y} = \{1, \ldots, K\}$, learn a function $f_\theta$ whose prediction $\hat{y} \in \mathcal{Y}$ for $x$ minimizes classification error.
Softmax output:
\[p(y = k \mid x) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}\]
where $z = f_\theta(x) \in \mathbb{R}^K$ are the logits. The predicted label is $\hat{y} = \arg\max_k z_k$.
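The softmax formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the logit values are made up, and subtracting the maximum logit is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of K logits into K probabilities that sum to 1."""
    # Subtracting the max logit leaves the output unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]          # hypothetical logits for K = 3 classes
probs = softmax(logits)
# The predicted label is the index of the largest probability (equivalently, largest logit).
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

Because softmax is monotonic in the logits, the argmax can be taken on the logits directly when only the predicted class is needed.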
Cross-entropy loss:
\[\mathcal{L} = -\sum_{k=1}^K y_k \log p_k = -\log p_{y^*}\]
where $y$ is the one-hot encoding of the true class $y^*$ and $p_k = p(y = k \mid x)$.
Benchmarks
| Dataset | Classes | Train images | Notes |
|---|---|---|---|
| MNIST | 10 | 60k | Handwritten digits; near-saturated |
| CIFAR-10 / 100 | 10 / 100 | 50k | 32×32 images |
| ImageNet (ILSVRC) | 1000 | 1.2M | Standard large-scale benchmark |
| ImageNet-21k | 21k | 14M | Pretraining dataset |
| JFT-300M | ~18k | 300M | Google internal; used for ViT pretraining |
Top-1 accuracy: fraction of test images where the highest-probability class is correct.
Top-5 accuracy: fraction where the correct class is in the top 5 predictions.
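Both metrics are the same computation with a different $k$. The sketch below assumes per-class scores stored as nested lists; the scores and labels are invented for illustration, and $k = 2$ stands in for top-5 because the toy example has only three classes.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [                      # hypothetical class probabilities, 4 images x 3 classes
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.4, 0.35, 0.25],
]
labels = [1, 1, 2, 2]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```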
Training Pipeline
- Data loading: random resized crop to 224×224, random horizontal flip, color jitter, normalization.
- Model: pretrained backbone + classification head.
- Optimizer: SGD with momentum (classical) or AdamW (modern).
- Learning rate schedule: cosine annealing; warmup for first few epochs.
- Regularization: weight decay, dropout, label smoothing.
- Inference: center crop 224×224 or multi-scale crop with averaging.
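The learning rate schedule in the list above can be written as a single function of the epoch index. This is a sketch under assumptions: the warmup is taken to be linear, the schedule is stepped once per epoch, and `warmup_epochs=5`, `base_lr=0.1`, and 100 total epochs are illustrative values rather than recommendations from the text.

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr, warmup_epochs=5, min_lr=0.0):
    """Linear warmup followed by cosine annealing, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Progress through the cosine phase, from 0.0 at its start toward 1.0 at the end.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_epoch(e, total_epochs=100, base_lr=0.1) for e in range(100)]
print(schedule[0], schedule[4], schedule[50], schedule[99])
```

In practice the schedule is often stepped per iteration rather than per epoch; the shape is the same.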
Transfer Learning
Pretraining on ImageNet (or larger datasets) and fine-tuning on a target task is the standard workflow.
Feature extraction: freeze the backbone; train only the classification head. Fast; good when target dataset is small.
Fine-tuning: unfreeze all or some layers; use a smaller learning rate for early layers. Better when target dataset is large enough.
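The difference between the two modes comes down to which layers receive a nonzero learning rate. The sketch below is framework-agnostic and purely illustrative: the layer names, the head learning rate, and the geometric decay factor of 0.5 per layer are all assumptions, not values from the text.

```python
def layerwise_learning_rates(layer_names, head_lr=1e-3, decay=0.5, frozen=()):
    """Map each layer to a learning rate that shrinks toward the input.

    layer_names is ordered from input (earliest) to output (head).
    Layers listed in `frozen` get a rate of 0.0, i.e. they are not updated.
    """
    rates = {}
    last = len(layer_names) - 1
    for depth, name in enumerate(layer_names):
        if name in frozen:
            rates[name] = 0.0
        else:
            rates[name] = head_lr * decay ** (last - depth)
    return rates

layers = ["stem", "stage1", "stage2", "stage3", "head"]   # hypothetical layer names

# Feature extraction: the backbone is frozen and only the head is trained.
feature_extraction = layerwise_learning_rates(layers, frozen=layers[:-1])

# Fine-tuning: every layer is trained, earlier layers more slowly.
fine_tuning = layerwise_learning_rates(layers)
print(feature_extraction)
print(fine_tuning)
```

In a deep learning framework, the resulting mapping would typically be passed to the optimizer as per-layer parameter groups.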
Rule of thumb:
| Target dataset | Similar to source | Strategy |
|---|---|---|
| Small | Yes | Feature extraction |
| Small | No | Fine-tune top layers |
| Large | Yes | Fine-tune all |
| Large | No | Fine-tune all (or train from scratch) |
Key Milestones on ImageNet
| Model | Year | Top-1 | Key idea |
|---|---|---|---|
| AlexNet | 2012 | 63.3% | Deep CNN + GPU |
| VGG-16 | 2014 | 73.0% | Uniform 3×3 convs |
| GoogLeNet | 2014 | 74.8% | Inception module |
| ResNet-152 | 2015 | 78.6% | Residual connections |
| DenseNet-264 | 2017 | 79.2% | Dense connections |
| EfficientNet-B7 | 2019 | 84.3% | Compound scaling |
| ViT-H/14 | 2021 | 88.5% | Pure attention |
| CoAtNet-7 | 2021 | 90.9% | Conv + attention hybrid |
| CoCa | 2022 | 91.0% | Contrastive + captioning |
Human top-5 error is approximately 5.1%.
Regularization Techniques
Label smoothing: replace hard one-hot labels with $1-\epsilon$ for the correct class and $\epsilon/(K-1)$ for each of the others. This discourages overconfident predictions; $\epsilon = 0.1$ is a common choice.
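As a small worked example, the snippet below builds the smoothed target described above and plugs it into the cross-entropy loss $-\sum_k y_k \log p_k$. The predicted probabilities are invented for illustration.

```python
import math

def smooth_labels(target, num_classes, eps=0.1):
    """Soft target: 1 - eps on the correct class, eps / (K - 1) on every other class."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == target else off for k in range(num_classes)]

def cross_entropy(target_dist, probs):
    """Cross-entropy -sum_k y_k log p_k between a target distribution and predictions."""
    return -sum(y * math.log(p) for y, p in zip(target_dist, probs) if y > 0)

probs = [0.7, 0.2, 0.1]                      # hypothetical softmax output, K = 3
hard = [1.0, 0.0, 0.0]                       # one-hot label for class 0
soft = smooth_labels(0, num_classes=3)       # roughly [0.9, 0.05, 0.05]

hard_loss = cross_entropy(hard, probs)       # reduces to -log p of the true class
soft_loss = cross_entropy(soft, probs)       # also penalizes low mass on other classes
print(hard_loss, soft_loss)
```

With smoothed targets the loss is no longer minimized by pushing the correct-class probability to 1, which is the mechanism behind the reduced overconfidence.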
Mixup: blend pairs of training images and their labels proportionally with $\lambda \sim \text{Beta}(\alpha, \alpha)$.
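A minimal Mixup sketch on flattened images represented as Python lists. The value `alpha=0.2` and the toy inputs are assumptions chosen for illustration; real pipelines apply the same blend to whole batches of tensors.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)                       # seeded for reproducibility
x_a, y_a = [0.0, 0.0, 0.0, 0.0], [1.0, 0.0]  # hypothetical flattened image, class 0
x_b, y_b = [1.0, 1.0, 1.0, 1.0], [0.0, 1.0]  # hypothetical flattened image, class 1
x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=rng)
print(lam, x_mix, y_mix)
```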
CutMix: cut a rectangular region from one image into another; blend labels by area ratio.
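The sketch below illustrates CutMix on single-channel images stored as nested lists. Sampling the box size from `Beta(alpha, alpha)` with `alpha=1.0` and recomputing the label weight from the clipped box are common choices, taken here as assumptions rather than details stated above.

```python
import math
import random

def cutmix(img_a, label_a, img_b, label_b, alpha=1.0, rng=random):
    """Paste a random rectangle of img_b into img_a; mix labels by area ratio."""
    h, w = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)
    # Box whose area is roughly (1 - lam) of the image, centered at a random pixel.
    cut_h, cut_w = int(h * math.sqrt(1.0 - lam)), int(w * math.sqrt(1.0 - lam))
    cy, cx = rng.randrange(h), rng.randrange(w)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = [row[:] for row in img_a]
    for r in range(top, bottom):
        mixed[r][left:right] = img_b[r][left:right]

    # Recompute the mixing weight from the clipped box so labels match the pixels.
    lam = 1.0 - (bottom - top) * (right - left) / (h * w)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label

rng = random.Random(0)
img_a = [[0] * 8 for _ in range(8)]          # hypothetical 8x8 image of class 0
img_b = [[1] * 8 for _ in range(8)]          # hypothetical 8x8 image of class 1
mixed, label = cutmix(img_a, [1.0, 0.0], img_b, [0.0, 1.0], rng=rng)
print(label)
```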
Dropout: randomly zero activations during training. Typically applied after fully connected layers and rarely after convolutional layers in modern architectures.
Stochastic depth: randomly drop entire residual blocks during training. Acts as dropout over the network depth.
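One way to picture stochastic depth is as a coin flip on the residual branch. In this sketch the `branch` function, the survival probability of 0.8, and the inference-time scaling of the branch by that probability are illustrative assumptions; some implementations instead rescale during training.

```python
import random

def residual_block(x, branch, survival_prob=0.8, training=True, rng=random):
    """Residual block y = x + branch(x) with stochastic depth on the branch."""
    if training:
        if rng.random() >= survival_prob:
            return list(x)                      # block dropped: identity only
        return [a + b for a, b in zip(x, branch(x))]
    # At inference every block is kept; scale the branch by its survival probability.
    return [a + survival_prob * b for a, b in zip(x, branch(x))]

def branch(v):                                  # hypothetical stand-in for the block's layers
    return [2.0 * a for a in v]

x = [1.0, 2.0]
train_out = residual_block(x, branch, rng=random.Random(0))
eval_out = residual_block(x, branch, training=False)
print(train_out, eval_out)
```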
Self-Supervised Pretraining for Classification
Pretrain on unlabeled images; fine-tune on labeled data.
Contrastive learning (SimCLR, MoCo): create two augmented views of the same image; maximize agreement between their representations while pushing other images apart.
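\[\mathcal{L}_\text{SimCLR} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}\]
where $z_i$ and $z_j$ are the embeddings of the two views of one image, $\text{sim}$ is cosine similarity, and $\tau$ is a temperature.
BYOL / SimSiam: no negative pairs; use a momentum encoder (BYOL) or stop-gradient (SimSiam) to prevent collapse.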
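Returning to the SimCLR loss, it can be evaluated for a single positive pair in a few lines. This sketch assumes cosine similarity for $\text{sim}$ and a temperature of 0.5; the two-dimensional embeddings are made up, and a real implementation would average the loss over all positive pairs in a batch.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def simclr_loss(z, i, j, tau=0.5):
    """Contrastive loss for the positive pair (i, j) among the embeddings z."""
    numerator = math.exp(cosine_sim(z[i], z[j]) / tau)
    # The denominator runs over every embedding except z[i] itself.
    denominator = sum(math.exp(cosine_sim(z[i], z[k]) / tau)
                      for k in range(len(z)) if k != i)
    return -math.log(numerator / denominator)

z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]  # hypothetical embeddings
aligned = simclr_loss(z, 0, 1)       # views 0 and 1 point in nearly the same direction
misaligned = simclr_loss(z, 0, 2)    # views 0 and 2 are orthogonal
print(aligned, misaligned)
```

The loss is lower when the positive pair is more similar than the other embeddings, which is the "maximize agreement" behavior described in this section.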
MAE (Masked Autoencoders, He et al. 2022): mask 75% of image patches; train a ViT encoder-decoder to reconstruct masked pixels. Simple and highly effective for ViT pretraining.
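The masking step of MAE amounts to a random split of patch indices. The sketch below assumes 16×16 patches on a 224×224 image (a 14×14 grid); the patch size is an assumption, and the encoder and decoder themselves are out of scope here.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, rng=random):
    """Split patch indices into a visible set (encoder input) and a masked set (targets)."""
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked, visible = sorted(order[:num_masked]), sorted(order[num_masked:])
    return visible, masked

# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
visible, masked = random_patch_mask(14 * 14, rng=random.Random(0))
print(len(visible), len(masked))
```

Only the visible patches are passed through the encoder, so a high mask ratio also reduces pretraining compute.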
DINO / DINOv2: self-distillation with no labels. A student network matches the output of a momentum teacher. DINOv2 produces features that are competitive with supervised pretraining for many downstream tasks.
Multi-Label Classification
Each image can have multiple correct labels (e.g., “dog”, “grass”, “outdoors”).
Output: sigmoid per class (not softmax); each $p_k = \sigma(z_k)$.
Loss: binary cross-entropy over all classes.
Metrics: mean Average Precision (mAP), precision@k, F1.
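The per-class sigmoid and binary cross-entropy can be sketched as follows. The logits, the extra "cat" class, and the 0.5 decision threshold are illustrative assumptions; the three positive labels reuse the example above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over K independent per-class decisions."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

classes = ["dog", "grass", "outdoors", "cat"]
logits = [2.0, 1.5, 0.5, -3.0]               # hypothetical per-class logits
targets = [1, 1, 1, 0]
probs = [sigmoid(z) for z in logits]
predicted = [c for c, p in zip(classes, probs) if p > 0.5]   # threshold is an assumption
loss = multilabel_bce(logits, targets)
print(predicted, sum(probs), loss)
```

Unlike softmax outputs, these probabilities need not sum to 1, which is what allows several labels to be active at once.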
Fine-Grained Classification
Distinguish between similar subcategories (bird species, car models, aircraft types).
Challenges: high inter-class similarity and high intra-class variation (pose, viewpoint, lighting), so models must localize the discriminative parts.
Approaches: attention-based part localization, bilinear pooling (outer product of two feature maps captures pairwise feature interactions), specialized augmentation.
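Bilinear pooling can be illustrated on tiny feature maps stored as lists of per-location vectors. Averaging over locations is an assumption made here for scale; published variants often use sum pooling followed by normalization steps that this sketch omits.

```python
def bilinear_pool(features_a, features_b):
    """Average over locations of the outer product of two per-location feature vectors.

    features_a: list of L vectors of length Ca; features_b: list of L vectors of length Cb.
    Returns a flattened descriptor of length Ca * Cb.
    """
    ca, cb = len(features_a[0]), len(features_b[0])
    pooled = [[0.0] * cb for _ in range(ca)]
    for fa, fb in zip(features_a, features_b):
        for i in range(ca):
            for j in range(cb):
                pooled[i][j] += fa[i] * fb[j]   # pairwise feature interaction
    n = len(features_a)
    return [v / n for row in pooled for v in row]

feats_a = [[1.0, 0.0], [0.0, 2.0]]              # hypothetical: 2 locations, 2 channels
feats_b = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]    # hypothetical: 2 locations, 3 channels
descriptor = bilinear_pool(feats_a, feats_b)
print(descriptor)
```

The descriptor length grows as the product of the two channel counts, which is why compact approximations are often used in practice.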