Distance Metrics

A metric $d: X \times X \to \mathbb{R}$ satisfies:

Non-negativity: $d(x, y) \geq 0$
Identity of indiscernibles: $d(x, y) = 0 \iff x = y$
Symmetry: $d(x, y) = d(y, x)$
Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$

Similarity measures (e.g., cosine similarity) are related but don’t always satisfy all metric axioms.

Euclidean Distance (L2)

\[d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{\sum_i (x_i - y_i)^2}\]

Most common; natural in geometric space
Sensitive to scale → normalize features first
Grows with dimension (curse of dimensionality)

Squared Euclidean: $d^2 = \lVert\mathbf{x} - \mathbf{y}\rVert_2^2$, often used since it avoids the square root and has the same minimizers.

Manhattan Distance (L1, City Block)

\[d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_1 = \sum_i |x_i - y_i|\]

More robust to outliers than L2
Encourages sparsity (L1 regularization analogy)
Preferred in high dimensions

Minkowski Distance (Lp)

Generalization of L1 and L2:

\[d_p(\mathbf{x}, \mathbf{y}) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}\]

$p=1$: Manhattan
$p=2$: Euclidean
$p \to \infty$: Chebyshev ($\max_i \lvert x_i - y_i \rvert$)

Cosine Similarity and Distance

Similarity:

\[\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}\]

Range: $[-1, 1]$. Measures the angle between vectors, ignoring magnitude.

Cosine distance: $d = 1 - \cos(\mathbf{x}, \mathbf{y})$

Not a proper metric (violates triangle inequality in general), but widely used
Preferred for high-dimensional sparse data: NLP embeddings, document similarity
Equivalent to L2 distance on unit-normalized vectors

Dot Product Similarity

\[s(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i\]

Used in attention mechanisms (scaled dot product)
Combines both angle and magnitude
Fast to compute via matrix multiplication

Hamming Distance

Number of positions where two binary strings (or sequences) differ:

\[d_H(\mathbf{x}, \mathbf{y}) = \sum_i \mathbf{1}[x_i \neq y_i]\]

Used in: error-correcting codes, comparing categorical vectors, locality-sensitive hashing.

Jaccard Similarity

For sets $A$ and $B$:

\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}\]

Jaccard distance: $1 - J(A, B)$

Used in: document similarity (token sets), recommendation systems, near-duplicate detection.

Edit Distance (Levenshtein)

Minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another.

$O(mn)$ dynamic programming algorithm
Used in: spell checking, NLP, DNA sequence alignment

Mahalanobis Distance

Generalizes Euclidean distance to account for correlations between features:

\[d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T \Sigma^{-1} (\mathbf{x}-\mathbf{y})}\]

where $\Sigma$ is the covariance matrix.

Reduces to Euclidean when $\Sigma = I$
Scale-invariant and accounts for feature correlations
Used in: anomaly detection, Gaussian distribution (exponent), metric learning

KL Divergence (Probabilistic Distance)

Not a true metric (asymmetric), but widely used as a distributional distance:

\[D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]

See Information Theory for full details.

Jensen-Shannon Divergence

Symmetrized, smoothed version of KL:

\[\text{JSD}(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m), \quad m = \frac{p+q}{2}\]

Bounded in $[0, 1]$ (with $\log_2$)
$\sqrt{\text{JSD}}$ is a proper metric
Used in: GANs (original GAN objective minimizes JSD)

Wasserstein Distance (Earth Mover’s Distance)

Minimum “work” to transform one distribution into another:

\[W_1(p, q) = \inf_{\gamma \in \Gamma(p,q)} \mathbb{E}_{(x,y)\sim\gamma}[\|x - y\|]\]

where $\Gamma(p, q)$ = set of joint distributions with marginals $p$ and $q$.

Meaningful even when distributions have non-overlapping support (unlike KL)
$W_1$: used in Wasserstein GAN (WGAN) for stable training
Computationally expensive for high dimensions (approximated via Sinkhorn algorithm)

Kernel-Based Distances

Maximum Mean Discrepancy (MMD):

\[\text{MMD}(p, q) = \|\mu_p - \mu_q\|_\mathcal{H}\]

where $\mu_p$ is the mean embedding of $p$ in a reproducing kernel Hilbert space.

Used in: MMD GAN, two-sample testing, domain adaptation.

Summary Table

Metric	Input Type	Key Property	ML Use
Euclidean (L2)	Vectors	Geometric distance	k-NN, clustering
Manhattan (L1)	Vectors	Robust to outliers	Sparse settings
Cosine	Vectors	Angle only	NLP embeddings
Dot product	Vectors	Angle + magnitude	Attention
Hamming	Bits/strings	Discrete	Hashing, codes
Jaccard	Sets	Overlap ratio	Document similarity
Mahalanobis	Vectors	Accounts for covariance	Anomaly detection
KL divergence	Distributions	Information loss	Variational inference
Wasserstein	Distributions	Geometry-aware	GANs
MMD	Distributions	Kernel-based	Domain adaptation

Curse of Dimensionality

In high dimensions:

Distance concentrates: all points become nearly equidistant
$\min d / \max d \to 1$ as $d \to \infty$
k-NN becomes unreliable

Mitigations: dimensionality reduction (PCA, t-SNE, UMAP), cosine distance, learned metrics.

Metric Learning

Learn a distance function $d_\theta(\mathbf{x}, \mathbf{y})$ from data (rather than using a fixed one).

Approaches:

Siamese networks: learn embeddings where $d(x, y)$ is small iff $x$ and $y$ are similar
Triplet loss: $d(a, p) + \text{margin} < d(a, n)$ (anchor, positive, negative)
Contrastive loss: minimize distance between positive pairs, maximize for negatives

Used in: face recognition, few-shot learning, retrieval.