Distance Metrics
A metric $d: X \times X \to \mathbb{R}$ satisfies:
- Non-negativity: $d(x, y) \geq 0$
- Identity of indiscernibles: $d(x, y) = 0 \iff x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$
Similarity measures (e.g., cosine similarity) are related but don’t always satisfy all metric axioms.
Euclidean Distance (L2)
\[d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{\sum_i (x_i - y_i)^2}\]- Most common; natural in geometric space
- Sensitive to scale → normalize features first
- Grows with dimension (curse of dimensionality)
Squared Euclidean: $d^2 = \lVert\mathbf{x} - \mathbf{y}\rVert_2^2$, often used since it avoids the square root and has the same minimizers.
Manhattan Distance (L1, City Block)
\[d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_1 = \sum_i |x_i - y_i|\]- More robust to outliers than L2
- Encourages sparsity (L1 regularization analogy)
- Preferred in high dimensions
Minkowski Distance (Lp)
Generalization of L1 and L2:
\[d_p(\mathbf{x}, \mathbf{y}) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}\]- $p=1$: Manhattan
- $p=2$: Euclidean
- $p \to \infty$: Chebyshev ($\max_i \lvert x_i - y_i \rvert$)
Cosine Similarity and Distance
Similarity:
\[\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}\]Range: $[-1, 1]$. Measures the angle between vectors, ignoring magnitude.
Cosine distance: $d = 1 - \cos(\mathbf{x}, \mathbf{y})$
- Not a proper metric (violates triangle inequality in general), but widely used
- Preferred for high-dimensional sparse data: NLP embeddings, document similarity
- Equivalent to L2 distance on unit-normalized vectors
Dot Product Similarity
\[s(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i\]- Used in attention mechanisms (scaled dot product)
- Combines both angle and magnitude
- Fast to compute via matrix multiplication
Hamming Distance
Number of positions where two binary strings (or sequences) differ:
\[d_H(\mathbf{x}, \mathbf{y}) = \sum_i \mathbf{1}[x_i \neq y_i]\]Used in: error-correcting codes, comparing categorical vectors, locality-sensitive hashing.
Jaccard Similarity
For sets $A$ and $B$:
\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}\]Jaccard distance: $1 - J(A, B)$
Used in: document similarity (token sets), recommendation systems, near-duplicate detection.
Edit Distance (Levenshtein)
Minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another.
- $O(mn)$ dynamic programming algorithm
- Used in: spell checking, NLP, DNA sequence alignment
Mahalanobis Distance
Generalizes Euclidean distance to account for correlations between features:
\[d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T \Sigma^{-1} (\mathbf{x}-\mathbf{y})}\]where $\Sigma$ is the covariance matrix.
- Reduces to Euclidean when $\Sigma = I$
- Scale-invariant and accounts for feature correlations
- Used in: anomaly detection, Gaussian distribution (exponent), metric learning
KL Divergence (Probabilistic Distance)
Not a true metric (asymmetric), but widely used as a distributional distance:
\[D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}\]See Information Theory for full details.
Jensen-Shannon Divergence
Symmetrized, smoothed version of KL:
\[\text{JSD}(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m), \quad m = \frac{p+q}{2}\]- Bounded in $[0, 1]$ (with $\log_2$)
- $\sqrt{\text{JSD}}$ is a proper metric
- Used in: GANs (original GAN objective minimizes JSD)
Wasserstein Distance (Earth Mover’s Distance)
Minimum “work” to transform one distribution into another:
\[W_1(p, q) = \inf_{\gamma \in \Gamma(p,q)} \mathbb{E}_{(x,y)\sim\gamma}[\|x - y\|]\]where $\Gamma(p, q)$ = set of joint distributions with marginals $p$ and $q$.
- Meaningful even when distributions have non-overlapping support (unlike KL)
- $W_1$: used in Wasserstein GAN (WGAN) for stable training
- Computationally expensive for high dimensions (approximated via Sinkhorn algorithm)
Kernel-Based Distances
Maximum Mean Discrepancy (MMD):
\[\text{MMD}(p, q) = \|\mu_p - \mu_q\|_\mathcal{H}\]where $\mu_p$ is the mean embedding of $p$ in a reproducing kernel Hilbert space.
Used in: MMD GAN, two-sample testing, domain adaptation.
Summary Table
| Metric | Input Type | Key Property | ML Use |
|---|---|---|---|
| Euclidean (L2) | Vectors | Geometric distance | k-NN, clustering |
| Manhattan (L1) | Vectors | Robust to outliers | Sparse settings |
| Cosine | Vectors | Angle only | NLP embeddings |
| Dot product | Vectors | Angle + magnitude | Attention |
| Hamming | Bits/strings | Discrete | Hashing, codes |
| Jaccard | Sets | Overlap ratio | Document similarity |
| Mahalanobis | Vectors | Accounts for covariance | Anomaly detection |
| KL divergence | Distributions | Information loss | Variational inference |
| Wasserstein | Distributions | Geometry-aware | GANs |
| MMD | Distributions | Kernel-based | Domain adaptation |
Curse of Dimensionality
In high dimensions:
- Distance concentrates: all points become nearly equidistant
- $\min d / \max d \to 1$ as $d \to \infty$
- k-NN becomes unreliable
Mitigations: dimensionality reduction (PCA, t-SNE, UMAP), cosine distance, learned metrics.
Metric Learning
Learn a distance function $d_\theta(\mathbf{x}, \mathbf{y})$ from data (rather than using a fixed one).
Approaches:
- Siamese networks: learn embeddings where $d(x, y)$ is small iff $x$ and $y$ are similar
- Triplet loss: $d(a, p) + \text{margin} < d(a, n)$ (anchor, positive, negative)
- Contrastive loss: minimize distance between positive pairs, maximize for negatives
Used in: face recognition, few-shot learning, retrieval.