Image Representation

An image is a discrete 2D signal. How it is represented in memory and how features are computed from it form the foundation of all computer vision algorithms.

Pixel Representation

A grayscale image is a 2D array $I \in \mathbb{R}^{H \times W}$, where each element $I(y, x)$ is the intensity at row $y$, column $x$.

A color image (RGB) is a 3D tensor $I \in \mathbb{R}^{H \times W \times 3}$, where the three channels are red, green, and blue intensities.

Bit depth:

8-bit: pixel values in $[0, 255]$. Standard for most images.
16-bit: $[0, 65535]$. Medical imaging, HDR.
Float32: $[0.0, 1.0]$. Neural network inputs after normalization.

Channel conventions:

RGB: standard (Pillow, torchvision).
BGR: OpenCV default. Swap channels when mixing libraries.
NCHW: (batch, channels, height, width). PyTorch convention.
NHWC: (batch, height, width, channels). TensorFlow convention.

Color Spaces

RGB: additive color model. Intuitive but channels are highly correlated.

HSV (Hue, Saturation, Value): separates color (hue) from intensity (value). More robust for color-based segmentation; hue is invariant to lighting changes.

LAB (CIELAB): perceptually uniform. L = lightness, A = green-red, B = blue-yellow. Euclidean distance in LAB correlates with perceived color difference. Used for color normalization in histopathology.

YCbCr: separates luma (Y) from chroma (Cb, Cr). Used in JPEG compression; separates lighting from color.

Grayscale conversion:

$$ I_\text{gray} = 0.299 R + 0.587 G + 0.114 B $$

Coefficients reflect human perception (higher sensitivity to green).

Image Normalization

Before feeding into a neural network, normalize pixel values:

Scale to [0, 1]: divide by 255.

Standardize (ImageNet statistics):

$$ x_\text{norm} = \frac{x - \mu}{\sigma} $$

where $\mu = [0.485, 0.456, 0.406]$ and $\sigma = [0.229, 0.224, 0.225]$ for RGB channels (computed on ImageNet). Standard for pretrained models.

Image Transforms and Augmentation

Geometric transforms:

Resize: bilinear or bicubic interpolation to a target resolution.
Crop: random crop during training; center crop at inference.
Flip: horizontal flip (label-preserving for most tasks).
Rotation: rotate by $[-\theta, \theta]$ degrees.
Affine: combined scale, shear, translate.
Perspective warp: simulate viewpoint change.

Photometric transforms:

Brightness / contrast / saturation / hue jitter (ColorJitter).
Gaussian blur, sharpen.
Random grayscale.
Gaussian noise.
Random erasing (Cutout): mask out rectangular patches.

Advanced augmentation:

Mixup: blend two images and their labels: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$.
CutMix: paste a rectangular patch from one image into another; blend labels proportionally to the patch area.
RandAugment: automatically select from a large set of augmentation policies.
AugMix: mix multiple augmented versions to improve consistency.

Spatial Pyramid and Multi-Scale Representation

Many vision features are scale-dependent. Multi-scale representations capture structure at different resolutions.

Image pyramid: compute the image at multiple scales by repeated downsampling.

$$ I_0 = I, \quad I_{k+1} = \text{downsample}(I_k) $$

Gaussian pyramid: smooth before downsampling to prevent aliasing (anti-aliased).

Laplacian pyramid: difference of Gaussian pyramid levels; captures bandpass features at each scale. Used in image blending and compression.

Feature pyramid (FPN): compute feature maps at multiple CNN stages and combine top-down pathways with lateral connections to produce multi-scale feature maps. Foundational for multi-scale object detection.

Frequency Domain Representation

Discrete Fourier Transform (DFT): decomposes the image into sinusoidal frequency components.

$$ F(u, v) = \sum_{y=0}^{H-1} \sum_{x=0}^{W-1} I(y, x) \, e^{-2\pi i (uy/H + vx/W)} $$

Low frequencies (center of spectrum): global structure. High frequencies (periphery): edges, fine detail.

DCT (Discrete Cosine Transform): real-valued frequency decomposition. Basis of JPEG compression: transform 8×8 blocks, quantize high-frequency coefficients.

High-frequency components carry edge information. Low-pass filtering blurs; high-pass filtering sharpens / detects edges.

Classical Feature Descriptors (Pre-Deep Learning)

Before neural networks, features were hand-engineered.

SIFT (Scale-Invariant Feature Transform): detect keypoints at multiple scales via difference-of-Gaussians; describe each with a 128-dim histogram of gradient orientations. Invariant to scale and rotation.

HOG (Histogram of Oriented Gradients): compute gradient orientation histograms in local cells; normalize over blocks. Standard feature for pedestrian detection (DPM, SVM+HOG).

LBP (Local Binary Pattern): encode the relationship of each pixel to its neighbors as a binary pattern. Efficient; used in face recognition.

These descriptors are largely replaced by CNN features, but remain useful for interpretability, lightweight deployment, and understanding what neurons learn.

Depth and Range Images

Depth image: each pixel stores the distance from the camera to the scene point. $D \in \mathbb{R}^{H \times W}$.

Sources: structured light (Intel RealSense), time-of-flight (Azure Kinect), stereo disparity, LiDAR.

RGBD: RGB image paired with a depth map. Used for 3D scene understanding, robotics, AR.

Point cloud: unordered set of 3D points ${(x_i, y_i, z_i)}$ from LiDAR or depth image unprojection. See 3D Vision.