Self-Supervised Learning
Self-supervised learning (SSL) trains a model on a pretext task derived from unlabeled data, using the data itself to generate supervisory signals. The learned representations then transfer to downstream tasks with few or no labels.
Unlike semi-supervised learning, there is no labeled set at pretraining time. Labels are created automatically from structure in the data.
Why Self-Supervision Works
Good representations should capture semantic structure. Self-supervised objectives that require understanding data content (rather than memorizing surface statistics) force the model to learn such structure.
Transfer learning pipeline:
- Pretrain encoder $f_\theta$ on large unlabeled corpus via pretext task.
- Fine-tune $f_\theta$ (or a linear head on top of it) on small labeled downstream dataset.
Pretext Tasks
For Images
| Pretext Task | Description |
|---|---|
| Rotation prediction | Predict rotation angle ($0^\circ, 90^\circ, 180^\circ, 270^\circ$) |
| Jigsaw puzzle | Predict permutation of shuffled image patches |
| Colorization | Predict color channels from grayscale |
| Inpainting | Reconstruct masked regions |
| Relative patch location | Predict spatial relationship between two patches |
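To make "labels are created automatically" concrete, here is a minimal sketch of the rotation-prediction task from the table above. The function name `make_rotation_batch` and the random stand-in images are assumptions for illustration; a real pipeline would feed `x` to an encoder and train a 4-way classifier on `y`.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Return rotated copies of `images` and the rotation class (0-3) of each.

    images: array of shape (B, H, W, C) with H == W.
    Class k means the image was rotated by k * 90 degrees.
    """
    labels = rng.integers(0, 4, size=len(images))
    # np.rot90 rotates in the plane of the first two axes, here (H, W).
    rotated = np.stack([np.rot90(img, k=int(k)) for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))   # stand-in for unlabeled images
x, y = make_rotation_batch(images, rng)
```

No human annotation is involved: the target `y` is determined entirely by the transformation applied to each image.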
For Text
| Pretext Task | Description |
|---|---|
| Masked Language Modeling (MLM) | Predict masked tokens (BERT) |
| Causal Language Modeling (CLM) | Predict next token (GPT) |
| Sentence order prediction (SOP) | Predict whether two consecutive sentences appear in their original order or swapped (ALBERT) |
| Next sentence prediction (NSP) | Predict if sentence B follows sentence A |
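The first two text objectives in the table can be sketched in a few lines. This is a simplified illustration, not BERT's or GPT's actual preprocessing: `mask_tokens`, `next_token_pairs`, the `[MASK]` string, and the masking probability are assumptions, and whitespace-split words stand in for real subword tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM: hide random tokens; the hidden originals become the targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)     # model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)    # position is ignored by the loss
    return inputs, targets

def next_token_pairs(tokens):
    """CLM: every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.3)
pairs = next_token_pairs(tokens)
```

In both cases the supervision comes from the text itself: MLM targets are the tokens that were hidden, and CLM targets are simply the sequence shifted by one position.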
Contrastive Learning
A dominant paradigm for visual SSL. Learns representations by pulling positive pairs (different views of the same image) together and pushing negative pairs apart.
View generation: two augmentations $v, v'$ of the same image $x$ form a positive pair. All other images in the batch form negatives.
InfoNCE Loss (NT-Xent)
\[\mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_i') / \tau)}{\sum_{j=1}^{2N} \mathbf{1}[j \neq i] \exp(\text{sim}(z_i, z_j) / \tau)}\]
where $z_i = g(f_\theta(v_i))$ is the projected representation, $z_i'$ is the projection of the other view of the same image, $\text{sim}$ is cosine similarity, $\tau$ is the temperature, and $N$ is the batch size (each image contributes two views, giving $2N$ projections in total).
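The loss above can be sketched in NumPy as follows. The function name `nt_xent` and the row layout (rows $i$ and $i + N$ hold the two views of image $i$) are assumptions of this sketch; the loss is averaged over all $2N$ views.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss averaged over all 2N views.

    z: array of shape (2N, D); rows i and i + N are the two views of image i.
    """
    n2 = z.shape[0]
    n = n2 // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine sim = dot product
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)                  # drop the j == i term
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    # Numerically stable log-sum-exp over all j != i.
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - logits[np.arange(n2), pos]))

# Four images whose two views coincide and whose embeddings are orthogonal.
aligned = np.concatenate([np.eye(4), np.eye(4)])
loss_aligned = nt_xent(aligned, tau=0.5)

# Same embeddings, but each view is paired with the wrong image.
mismatched = np.concatenate([np.eye(4), np.eye(4)[::-1]])
loss_mismatched = nt_xent(mismatched, tau=0.5)
```

The aligned batch yields the lower loss, since each positive has similarity 1 while every negative has similarity 0. Lowering `tau` sharpens the softmax and weights hard negatives more heavily.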
SimCLR
Framework: augment $x$ twice, encode with shared $f_\theta$, project with small MLP $g$, apply NT-Xent. Projection head is discarded after pretraining; representations from $f_\theta$ are used for downstream tasks.
Key findings:
- Strong augmentations (random crop, color jitter, Gaussian blur) are critical.
- Larger batch sizes and longer training improve quality.
- Nonlinear projection head significantly boosts representation quality.
MoCo (Momentum Contrast)
Maintains a queue of negative keys from past batches, decoupling batch size from number of negatives. Key encoder updated as exponential moving average (EMA) of query encoder:
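\[\theta_k \leftarrow m \theta_k + (1 - m) \theta_q\]
where $\theta_q$ and $\theta_k$ are the query and key encoder parameters and $m \in [0, 1)$ is the momentum coefficient. This allows a large effective number of negatives without huge batch sizes.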
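A minimal sketch of the two MoCo ingredients, the EMA update and the key queue. The names `momentum_update` and `KeyQueue` are hypothetical, parameters are represented as plain lists of arrays, and the queue is sized in batches rather than in individual keys to keep the example short.

```python
import numpy as np
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update: theta_k <- m * theta_k + (1 - m) * theta_q, per parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

class KeyQueue:
    """FIFO of past key batches; the oldest batch is dropped when full."""

    def __init__(self, max_batches):
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, keys):
        self.batches.append(np.asarray(keys))

    def negatives(self):
        return np.concatenate(list(self.batches), axis=0)

# One EMA step with a deliberately small m so the change is visible.
theta_k = momentum_update([np.zeros((2, 2))], [np.ones((2, 2))], m=0.9)

# Five batches of 4 keys pushed through a queue that holds 3 batches.
queue = KeyQueue(max_batches=3)
for step in range(5):
    queue.enqueue(np.full((4, 8), float(step)))
negatives = queue.negatives()   # 12 keys, from the three most recent batches
```

With $m$ close to 1 the key encoder changes slowly, so keys encoded several steps ago remain roughly consistent with fresh ones. That consistency is what makes it reasonable to reuse queued keys as negatives.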
BYOL
Bootstrap Your Own Latent. Eliminates explicit negatives. Two networks: online ($\theta$) and target ($\xi$, EMA of $\theta$). Online predicts target representation:
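\[\mathcal{L} = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z_\xi \rangle}{\|q_\theta(z_\theta)\| \cdot \|z_\xi\|}\]
where $z_\theta$ and $z_\xi$ are the online and target projections and $q_\theta$ is the online predictor. Collapse is avoided via the asymmetric architecture and the stop-gradient on the target.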
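The BYOL loss is short enough to write out directly. The sketch below (function name `byol_loss` is an assumption) takes the predictor output and the target projection as plain arrays; NumPy has no autodiff, so the stop-gradient is implicit here, whereas in a training framework the target would be explicitly detached.

```python
import numpy as np

def byol_loss(q, z_target):
    """2 - 2 * cosine similarity, averaged over the batch.

    q:        online predictor outputs, shape (B, D)
    z_target: target projections, shape (B, D), treated as constants
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(q * z, axis=1)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
z = rng.normal(size=(4, 16))
loss = byol_loss(q, z)
```

The loss ranges from 0 (prediction and target point in the same direction) to 4 (opposite directions), and equals the squared Euclidean distance between the two L2-normalized vectors.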
SimSiam
Similar to BYOL without EMA. Stop-gradient on one branch is the key to avoiding collapse. Simple and effective: no large batches, no negatives, no momentum encoder.
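For reference, the SimSiam objective can be written as a symmetrized negative cosine similarity. The symbols below are not defined elsewhere in these notes: $z_1, z_2$ are the projections of the two views, $p_1, p_2$ are the corresponding predictor outputs, and $\text{sg}(\cdot)$ denotes stop-gradient.

\[\mathcal{L} = \tfrac{1}{2} D(p_1, \text{sg}(z_2)) + \tfrac{1}{2} D(p_2, \text{sg}(z_1)), \qquad D(p, z) = -\frac{\langle p, z \rangle}{\|p\| \cdot \|z\|}\]

Since $2 - 2\cos = 2 + 2D$, each term differs from the BYOL loss only by an affine rescaling. The practical difference is that both branches share the same encoder weights, and $\text{sg}$ simply blocks gradients through the branch that serves as the target.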
Masked Autoencoding
Masked Autoencoders (MAE)
For images: mask a large fraction ($\sim 75\%$) of patches, encode only visible patches with a ViT, decode to reconstruct masked patches.
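\[\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{p \in \mathcal{M}} \|x_p - \hat{x}_p\|^2\]
where $\mathcal{M}$ is the set of masked patches, $x_p$ is the original patch, and $\hat{x}_p$ is its reconstruction. A high masking ratio forces the encoder to learn semantic representations rather than interpolating local statistics.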
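The masking step and the loss above can be sketched without any model. The helper names `random_mask` and `masked_mse`, the 196-patch grid, and the 768-dimensional flattened patches are illustrative assumptions; the "reconstruction" is just noisy input standing in for a decoder's output.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask of shape (num_patches,); True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(patches, recon, mask):
    """Mean over masked patches of the squared L2 reconstruction error."""
    err = np.sum((patches[mask] - recon[mask]) ** 2, axis=1)
    return float(err.mean())

rng = np.random.default_rng(0)
patches = rng.random((196, 768))            # 14 x 14 grid of flattened patches
mask = random_mask(len(patches), 0.75, rng)
visible = patches[~mask]                    # only these would reach the encoder
recon = patches + 0.1 * rng.normal(size=patches.shape)
loss = masked_mse(patches, recon, mask)
```

Only the visible quarter of the patches is encoded, which is also where much of MAE's compute saving comes from. Errors on visible patches do not contribute to the loss.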
Comparison of Approaches
| Method | Negatives | Momentum Encoder | Projection Head | Key Idea |
|---|---|---|---|---|
| SimCLR | Large batch | No | MLP | NT-Xent with strong augmentations |
| MoCo | Queue | Yes | MLP | Decoupled negatives via a queue of past keys |
| BYOL | No | Yes (EMA) | MLP + predictor | Asymmetric network prevents collapse |
| SimSiam | No | No | MLP + predictor | Stop-gradient avoids collapse |
| MAE | No | No | Decoder | Reconstruction of masked patches |
Evaluation Protocol
Linear evaluation: freeze pretrained encoder; train a linear classifier on top. Measures representation quality independent of fine-tuning.
Fine-tuning: update all parameters. Measures transfer performance; typically higher than linear evaluation.
Few-shot: fine-tune with very few labeled examples per class. Tests sample efficiency.
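The linear evaluation protocol described above can be sketched as follows. Everything here is a stand-in: `encode` is a hypothetical frozen encoder (a fixed random feature map rather than a pretrained network), the data are synthetic, and the probe is plain softmax regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(20, 16))   # stand-in for pretrained, frozen weights

def encode(x):
    """Hypothetical frozen encoder f_theta; never updated below."""
    return np.tanh(x @ W_enc)

def train_linear_probe(feats, labels, num_classes, lr=0.1, steps=300):
    """Softmax regression on fixed features; only W and b are learned."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(loss) / d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic two-class data: class means at -2 and +2 in every input dimension.
y = rng.integers(0, 2, size=200)
x = rng.normal(size=(200, 20)) + 2.0 * (2 * y - 1)[:, None]

feats = encode(x)                   # computed once; the encoder stays fixed
W, b = train_linear_probe(feats, y, num_classes=2)
accuracy = float(np.mean((feats @ W + b).argmax(axis=1) == y))
```

Fine-tuning would differ only in that the encoder weights also receive gradient updates. A few-shot variant would run the same procedure with a handful of labeled examples per class.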
Key Augmentations for Visual SSL
| Augmentation | Role |
|---|---|
| Random crop + resize | Invariance to scale and location |
| Color jitter | Invariance to color and brightness |
| Gaussian blur | Invariance to sharpness |
| Grayscale | Removes color-based shortcuts |
| Horizontal flip | Invariance to left-right mirroring |
Augmentation strength must be calibrated: too weak and the task is trivial; too strong and positives become semantically different.
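A minimal NumPy sketch of generating a positive pair with three of the augmentations in the table. The function names, the nearest-neighbour resize, and the probabilities and crop scale are illustrative assumptions; color jitter and Gaussian blur are omitted for brevity, and real pipelines typically rely on an image library.

```python
import numpy as np

def random_resized_crop(img, out_size, rng, min_scale=0.3):
    """Crop a random region covering at least `min_scale` of the area, then resize."""
    h, w, _ = img.shape
    scale = rng.uniform(min_scale, 1.0)
    ch = max(1, int(h * np.sqrt(scale)))
    cw = max(1, int(w * np.sqrt(scale)))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # Nearest-neighbour resize to (out_size, out_size).
    rows = np.arange(out_size) * ch // out_size
    cols = np.arange(out_size) * cw // out_size
    return crop[rows][:, cols]

def augment(img, rng, out_size=32):
    view = random_resized_crop(img, out_size, rng)
    if rng.random() < 0.5:
        view = view[:, ::-1]                       # horizontal flip
    if rng.random() < 0.2:
        gray = view.mean(axis=2, keepdims=True)    # channel-average grayscale
        view = np.repeat(gray, 3, axis=2)
    return view

def two_views(img, rng):
    """Two independent augmentations of the same image form a positive pair."""
    return augment(img, rng), augment(img, rng)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))      # stand-in for an RGB image in [0, 1]
v1, v2 = two_views(img, rng)
```

Here `min_scale` is the knob that controls augmentation strength for cropping: values near 1 make the two views nearly identical, while very small values can yield crops that no longer share any content.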