Image Segmentation
Image segmentation assigns a label to every pixel in an image. Its output is denser than that of classification (one label per image) or detection (one box per object).
Segmentation Types
| Task | Output | Example |
|---|---|---|
| Semantic segmentation | Class label per pixel (no instances) | All cars = “car”, no distinction |
| Instance segmentation | Per-instance binary mask + class | Each car gets its own mask |
| Panoptic segmentation | Semantic for stuff + instance for things | Unified output over all categories |
| Medical image segmentation | Organ or lesion boundaries | CT, MRI scans |
“Things”: countable objects (people, cars, chairs).
“Stuff”: amorphous regions (sky, road, grass).
Semantic Segmentation
Fully Convolutional Network (FCN)
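Long et al. (2015). Replace the fully connected layers of a classification network with convolutions (equivalent to $1 \times 1$ convolutions over the final feature map), so the network produces a spatial output map at a coarse resolution; upsample that map with transposed convolutions (often called deconvolutions). First end-to-end trainable semantic segmentation model.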
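The upsampling step can be illustrated with a minimal single-channel transposed convolution in plain Python. This is only a sketch: the function name `conv_transpose2d`, the absence of padding, and the fixed all-ones kernel are assumptions made for illustration, whereas a real FCN learns its kernels and works on multi-channel tensors.

```python
def conv_transpose2d(x, kernel, stride):
    """Transposed convolution of one 2D map with a square kernel, no padding."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    out_h = (h - 1) * stride + k
    out_w = (w - 1) * stride + k
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            # Each input value "stamps" a scaled copy of the kernel into the output.
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


coarse = [[1.0, 2.0],
          [3.0, 4.0]]
# With a 2x2 all-ones kernel and stride 2 the stamps do not overlap,
# so the result is plain nearest-neighbour upsampling.
kernel = [[1.0, 1.0],
          [1.0, 1.0]]
up = conv_transpose2d(coarse, kernel, stride=2)
# up == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

With a learned kernel the stamps can overlap and blend, which is what lets the network produce smoother upsampled maps than nearest-neighbour copying.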
Encoder-Decoder: U-Net
Ronneberger et al. (2015). Designed for biomedical image segmentation.
Encoder (contracting path): each stage applies two 3×3 convolutions, each followed by ReLU, then a 2×2 max pool. Channels double at each stage.
Decoder (expanding path): 2×2 transposed conv upsampling; concatenate skip connection from the corresponding encoder stage; 3×3 conv.
Skip connections preserve fine spatial detail lost during downsampling. U-Net is standard for medical image segmentation and many dense prediction tasks.
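The shape bookkeeping of the contracting and expanding paths can be traced with a small sketch. It assumes "same"-padded convolutions (the original U-Net uses unpadded ones, so its maps shrink slightly), an input size divisible by $2^{\text{depth}}$, and a hypothetical helper name `unet_shapes`; no actual convolution is performed.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Return (encoder, bottleneck, decoder) shapes as (channels, height, width).

    Assumes 'same'-padded convolutions, so only pooling and upsampling
    change the spatial size.
    """
    encoder = []
    c, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((c, h, w))          # features saved for the skip connection
        c, h, w = c * 2, h // 2, w // 2    # 2x2 max pool, next stage doubles channels
    bottleneck = (c, h, w)

    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        c, h, w = c // 2, h * 2, w * 2     # 2x2 transposed conv halves channels
        assert (h, w) == (skip_h, skip_w), "input size must be divisible by 2**depth"
        decoder.append({
            "after_concat": (c + skip_c, h, w),  # channel-wise concatenation with the skip
            "after_conv": (skip_c, h, w),        # 3x3 convs reduce channels again
        })
        c = skip_c
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes(256, 256)
# encoder[0] == (64, 256, 256); bottleneck == (1024, 16, 16)
# decoder[0]["after_concat"] == (1024, 32, 32)
```

The concatenation doubles the channel count at each decoder stage, which is why the following convolutions are needed to bring it back down.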
Dilated Convolutions: DeepLab Family
Chen et al. Use atrous (dilated) convolutions to enlarge the receptive field without further downsampling, so feature maps stay at a higher resolution.
Atrous Spatial Pyramid Pooling (ASPP): apply dilated convolutions with multiple dilation rates (e.g., $d = 6, 12, 18$) in parallel; concatenate their outputs. Captures context at multiple scales.
DeepLabv3+: combines ASPP encoder with a lightweight U-Net-style decoder to recover sharp boundaries.
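A one-dimensional sketch shows how dilation widens the span of a kernel without adding weights. The helper names `dilated_conv1d` and `effective_kernel_size` are illustrative, and the example uses "valid" boundaries and a hand-picked kernel rather than anything taken from DeepLab itself.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent covered by a kernel whose taps are `dilation` samples apart."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D cross-correlation with gaps of `dilation` between kernel taps."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(kernel[t] * signal[i + t * dilation] for t in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


signal = list(range(10))      # 0, 1, ..., 9
kernel = [1, 0, -1]           # difference between the first and last tap

dense = dilated_conv1d(signal, kernel, dilation=1)    # taps 2 samples apart in total
spread = dilated_conv1d(signal, kernel, dilation=3)   # taps 6 samples apart in total
# dense  == [-2, -2, -2, -2, -2, -2, -2, -2]
# spread == [-6, -6, -6, -6]

# A 3-tap kernel at the ASPP rates covers 13, 25 and 37 positions per axis.
aspp_extents = [effective_kernel_size(3, d) for d in (6, 12, 18)]
```

The same three weights see a wider context as the dilation grows, which is the multi-scale effect ASPP exploits by running several rates in parallel.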
Evaluation: Mean Intersection over Union (mIoU)
For class $k$:
\[\text{IoU}_k = \frac{TP_k}{TP_k + FP_k + FN_k}\]
mIoU = mean of $\text{IoU}_k$ over all classes. Standard benchmark metric.
Pixel accuracy: fraction of correctly labeled pixels. Biased toward large classes.
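A toy example makes the difference between the two metrics concrete. The sketch below works on flattened label lists; the function names are illustrative, and skipping classes that appear in neither prediction nor ground truth is an assumption (benchmarks differ in how they handle that case).

```python
def per_class_iou(pred, target, num_classes):
    """IoU_k = TP_k / (TP_k + FP_k + FN_k) for each class k."""
    ious = []
    for k in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        fp = sum(1 for p, t in zip(pred, target) if p == k and t != k)
        fn = sum(1 for p, t in zip(pred, target) if p != k and t == k)
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)  # None: class absent everywhere
    return ious


def mean_iou(ious):
    """Average over the classes that have a defined IoU."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


# 90 background pixels (class 0) and 10 object pixels (class 1);
# the model predicts background everywhere.
target = [0] * 90 + [1] * 10
pred = [0] * 100

ious = per_class_iou(pred, target, num_classes=2)   # [0.9, 0.0]
miou = mean_iou(ious)                               # 0.45
acc = pixel_accuracy(pred, target)                  # 0.9
```

Pixel accuracy stays at 0.9 even though the small class is missed entirely, while mIoU drops to 0.45 because each class counts equally.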
Instance Segmentation
Mask R-CNN
He et al. (2017). Extends Faster R-CNN with a mask head.
Pipeline:
- FPN backbone extracts multi-scale features.
- RPN generates region proposals.
- RoIAlign: extracts fixed-size features per proposal. Uses bilinear interpolation (vs. RoI Pooling which uses quantization); preserves spatial precision.
- Box head: class + bounding box.
- Mask head: $28 \times 28$ binary mask per class, independent per class. Only the ground-truth class mask is used during training.
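Mask head and box head run in parallel during training. The mask head predicts a mask for every class without choosing among them, which decouples mask prediction from classification and avoids competition between classes; at inference, the mask for the class chosen by the box head is kept.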
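The difference between RoIAlign and RoI Pooling comes down to how a fractional coordinate is read from the feature map. The sketch below shows only that sampling step on a single-channel map; the function names are illustrative, snapping is done with `floor` as a simplifying assumption, and a full RoIAlign additionally averages several sample points per output bin.

```python
import math


def bilinear_sample(feature, y, x):
    """Read a 2D feature map at fractional (y, x), with 0 <= y <= H-1 and 0 <= x <= W-1."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature) - 1)      # clamp neighbours at the border
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """Snap the coordinate to the integer grid first, as quantizing pooling does."""
    return feature[int(math.floor(y))][int(math.floor(x))]


feature = [[0.0, 10.0],
           [20.0, 30.0]]

aligned = bilinear_sample(feature, 0.5, 0.5)    # 15.0, blends all four neighbours
snapped = quantized_sample(feature, 0.5, 0.5)   # 0.0, the half-pixel offset is lost
```

For box classification a half-pixel shift rarely matters, but for pixel-level masks the misalignment introduced by snapping is visible, which is the motivation for interpolation.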
YOLACT (Real-Time Instance Segmentation)
Generates $k$ global prototype masks; predicts per-instance linear combination coefficients. Assembles instance masks as:
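\[M_\text{inst} = \sigma(PC^T)\]
where $P \in \mathbb{R}^{H \times W \times k}$ are the prototypes, $C \in \mathbb{R}^{n \times k}$ holds the predicted coefficients for $n$ instances (one length-$k$ vector per instance), and $\sigma$ is the sigmoid.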
Real-time inference on GPUs.
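For a single instance, the assembly step reduces to a per-pixel dot product between the $k$ prototype values and that instance's $k$ coefficients, followed by a sigmoid. The sketch below uses nested lists and a hypothetical function name `assemble_mask`; the cropping to the predicted box and the final thresholding used in practice are left out or simplified.

```python
import math


def assemble_mask(prototypes, coeffs):
    """Combine prototypes (H x W x k nested lists) with one instance's k coefficients."""
    return [
        [
            # Linear combination of the k prototype values, then sigmoid.
            1.0 / (1.0 + math.exp(-sum(p * c for p, c in zip(pixel, coeffs))))
            for pixel in row
        ]
        for row in prototypes
    ]


# Two prototypes on a 1 x 2 image: the first fires on the left pixel,
# the second on the right pixel.
prototypes = [[[1.0, 0.0], [0.0, 1.0]]]
coeffs = [4.0, -4.0]    # keep the first prototype, suppress the second

mask = assemble_mask(prototypes, coeffs)
binary = [[value > 0.5 for value in row] for row in mask]
# binary == [[True, False]]
```

Negative coefficients let an instance subtract a prototype, so a small shared set of prototypes can still express many different masks.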
CondInst / SOLOv2
Dynamically generate per-instance convolutional kernels conditioned on instance-level features. Apply the generated kernel to a feature map to produce the mask.
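The idea of a generated kernel can be sketched as a tiny mask head whose weights arrive as a flat parameter vector, one vector per instance. Everything concrete here is an assumption for illustration: the helper names, the two-layer layout, and the hand-written parameters stand in for what a controller network would predict.

```python
def split_params(flat, layer_dims):
    """Slice a flat vector into (weights, bias) per 1x1 conv layer.

    layer_dims is a list of (in_channels, out_channels) pairs.
    """
    layers, pos = [], 0
    for cin, cout in layer_dims:
        weights = [flat[pos + o * cin: pos + (o + 1) * cin] for o in range(cout)]
        pos += cin * cout
        bias = flat[pos: pos + cout]
        pos += cout
        layers.append((weights, bias))
    assert pos == len(flat), "parameter count must match the head layout"
    return layers


def dynamic_head(features, flat_params, layer_dims):
    """Apply instance-specific 1x1 convs to an H x W x C map; return mask logits."""
    layers = split_params(flat_params, layer_dims)
    logits = []
    for row in features:
        out_row = []
        for pixel in row:
            act = pixel
            for idx, (weights, bias) in enumerate(layers):
                act = [sum(w * a for w, a in zip(w_out, act)) + b
                       for w_out, b in zip(weights, bias)]
                if idx < len(layers) - 1:
                    act = [max(0.0, v) for v in act]   # ReLU between layers
            out_row.append(act[0])                     # last layer has one output channel
        logits.append(out_row)
    return logits


layer_dims = [(2, 2), (2, 1)]                 # 2*2+2 + 2*1+1 = 9 parameters
params = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0,       # layer 1: identity weights, zero bias
          1.0, -1.0, 0.0]                     # layer 2: channel 0 minus channel 1
features = [[[3.0, 1.0], [1.0, 3.0]]]         # H=1, W=2, C=2

logits = dynamic_head(features, params, layer_dims)
# logits == [[2.0, -2.0]]
```

Because the feature map is shared and only the small parameter vector changes per instance, the per-instance cost stays low.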
Panoptic Segmentation
Unifies semantic and instance segmentation into a single task.
Output: each pixel is assigned either a (class, instance_id) pair for “things” or a class label for “stuff”.
Panoptic Quality (PQ):
\[PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p, g)}{|TP|}}_{\text{segmentation quality}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{recognition quality}}\]
Panoptic FPN
Adds a semantic segmentation head alongside the instance segmentation head on top of FPN features. Merges outputs with heuristic rules: overlaps between instances are resolved by confidence, and things override stuff wherever a predicted instance mask covers a pixel.
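A merging heuristic of this kind can be sketched in a few lines. The version below is a simplification under stated assumptions: the function name, the dictionary layout of an instance, and the exact rule order are illustrative, and extra steps used in practice (score thresholds, discarding tiny stuff regions) are omitted.

```python
def merge_panoptic(semantic, instances, stuff_classes):
    """Merge a semantic map and instance masks into one panoptic map.

    semantic: H x W class ids.
    instances: dicts with 'mask' (H x W bools), 'class_id', 'score'.
    Returns an H x W grid of (class_id, instance_id); instance_id is None for
    stuff, and the whole entry is None for pixels left unlabeled.
    """
    h, w = len(semantic), len(semantic[0])
    out = [[None] * w for _ in range(h)]

    # Higher-scoring instances claim pixels first; later ones only fill free pixels.
    ordered = sorted(instances, key=lambda inst: inst["score"], reverse=True)
    for inst_id, inst in enumerate(ordered, start=1):
        for y in range(h):
            for x in range(w):
                if inst["mask"][y][x] and out[y][x] is None:
                    out[y][x] = (inst["class_id"], inst_id)

    # Stuff fills only the pixels that no instance claimed.
    for y in range(h):
        for x in range(w):
            if out[y][x] is None and semantic[y][x] in stuff_classes:
                out[y][x] = (semantic[y][x], None)
    return out


ROAD, CAR = 0, 5
semantic = [[ROAD, ROAD, ROAD, ROAD]]
instances = [
    {"mask": [[False, True, True, False]], "class_id": CAR, "score": 0.6},
    {"mask": [[True, True, False, False]], "class_id": CAR, "score": 0.9},
]

panoptic = merge_panoptic(semantic, instances, stuff_classes={ROAD})
# panoptic == [[(5, 1), (5, 1), (5, 2), (0, None)]]
```

The contested second pixel goes to the higher-scoring car, and road survives only where no car mask is present.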
Mask2Former
Cheng et al. (2021). Universal architecture for all three tasks. A Transformer decoder with $N$ learned queries; each query attends to multi-scale features and predicts (class, binary mask).
Masked attention: restricts each query’s attention to the predicted foreground region of the previous layer, not the full feature map. Improves training convergence and prediction quality.
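Masked attention can be illustrated for one query as a softmax in which background positions are excluded before normalization. The sketch assumes the foreground mask is already given as booleans (obtained by thresholding the previous layer's mask prediction), and the fallback to full attention when the mask is empty is a choice made for this sketch.

```python
import math


def masked_attention_weights(scores, foreground):
    """Softmax over key positions, restricted to positions flagged as foreground."""
    if not any(foreground):
        # Assumption of this sketch: attend everywhere if the mask is empty,
        # so the softmax stays well defined.
        foreground = [True] * len(scores)
    masked = [s if keep else -math.inf for s, keep in zip(scores, foreground)]
    peak = max(masked)                             # subtract the max for stability
    exps = [math.exp(s - peak) for s in masked]    # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 2.0, 0.0]                 # query-key logits for 4 positions
foreground = [True, False, True, False]       # previous layer's predicted mask

weights = masked_attention_weights(scores, foreground)
# weights == [0.5, 0.0, 0.5, 0.0]
```

Background positions receive exactly zero weight, so the query's update is computed only from features inside its current mask estimate.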
State of the art at the time of publication for panoptic, instance, and semantic segmentation on COCO and ADE20K.
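Results on these panoptic benchmarks are reported with the Panoptic Quality metric defined earlier in this section, and a toy computation makes its two factors concrete. The sketch assumes the usual matching rule that a predicted and a ground-truth segment count as a true positive only when their IoU exceeds 0.5; the function name and the input format (one best-overlap IoU per candidate pair) are illustrative.

```python
def panoptic_quality(candidate_ious, num_pred, num_gt):
    """Return (PQ, SQ, RQ) for one class.

    candidate_ious: IoU of each candidate (prediction, ground truth) pair.
    A pair is a true positive only if its IoU is above 0.5.
    """
    matched = [iou for iou in candidate_ious if iou > 0.5]
    tp = len(matched)
    fp = num_pred - tp          # predictions left unmatched
    fn = num_gt - tp            # ground-truth segments left unmatched
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched) / tp                      # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)        # recognition quality
    return sq * rq, sq, rq


# Three predictions and three ground-truth segments; one pair overlaps too little.
pq, sq, rq = panoptic_quality([0.8, 0.6, 0.4], num_pred=3, num_gt=3)
# sq is about 0.70, rq is about 0.667, pq is about 0.467
```

The pair with IoU 0.4 is penalized twice, once as an unmatched prediction and once as an unmatched ground-truth segment, which lowers the recognition quality term.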
Medical Image Segmentation
Challenges: limited labeled data; fine-grained boundaries; 3D volumes (CT/MRI); class imbalance.
3D U-Net: 3D convolutions for volumetric segmentation.
nnU-Net: automated, self-configuring U-Net pipeline. Sets preprocessing, architecture, and training configuration automatically from the properties of a given medical dataset. Strong empirical baseline; winner of many medical segmentation challenges.
SAM (Segment Anything Model): Kirillov et al. (2023). Large-scale pretraining for prompt-based segmentation (point, box, or mask prompt). Aims at zero-shot generalization to new images and segmentation tasks without task-specific training.