Unsupervised Learning

Unsupervised learning discovers structure in data ${x_i}_{i=1}^n$ without any labels. The goal is to model the underlying distribution $P(X)$ or find meaningful groupings, representations, or lower-dimensional embeddings.

Clustering

Groups data points such that intra-cluster similarity is high and inter-cluster similarity is low.

k-Means

Partitions $n$ points into $k$ clusters by minimizing within-cluster variance:

\[\min_{\{C_k\}, \{\mu_k\}} \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2\]

Algorithm (Lloyd’s):

Initialize $k$ centroids (random, k-means++, etc.)
Assign each point to nearest centroid.
Recompute centroids as mean of assigned points.
Repeat until convergence.

Complexity: $O(nkd)$ per iteration. Converges to local optimum; sensitive to initialization.

Choosing $k$: elbow method on inertia, silhouette score, gap statistic.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Does not require specifying $k$; discovers clusters of arbitrary shape.

Key parameters:

$\epsilon$: neighborhood radius
$\text{minPts}$: minimum neighbors to be a core point

Point types:

Core point: at least minPts points within $\epsilon$-neighborhood.
Border point: within $\epsilon$ of a core point but not a core itself.
Noise point: neither core nor border.

Complexity: $O(n \log n)$ with spatial indexing.

Hierarchical Clustering

Builds a tree (dendrogram) of nested clusters.

Agglomerative (bottom-up): each point starts as its own cluster; merge closest pairs.
Divisive (top-down): start with all points; recursively split.

Linkage criteria:

Linkage	Distance Between Clusters
Single	Minimum pairwise distance
Complete	Maximum pairwise distance
Average	Mean pairwise distance
Ward	Minimizes variance increase on merge

Gaussian Mixture Models (GMM)

Assumes data is generated from a mixture of $K$ Gaussians:

\[P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \Sigma_k)\]

Fit via EM algorithm:

E-step: compute soft cluster assignments $\gamma_{ik} = P(z_i = k \mid x_i)$
M-step: update $\pi_k$, $\mu_k$, $\Sigma_k$ using weighted MLEs

Generalizes k-Means (which is GMM with spherical, equal-variance clusters and hard assignments).

Cluster Evaluation Metrics

Metric	Notes
Silhouette score	Compares intra-cluster to nearest-cluster distance. Range $[-1, 1]$, higher is better.
Davies-Bouldin index	Ratio of within-cluster scatter to between-cluster separation. Lower is better.
Adjusted Rand Index (ARI)	Compares to ground truth if available.
Normalized Mutual Information (NMI)	Information-theoretic comparison to ground truth.

Dimensionality Reduction

Maps high-dimensional data to a lower-dimensional space while preserving structure.

Principal Component Analysis (PCA)

Finds orthogonal directions of maximum variance.

Objective: find projection matrix $W \in \mathbb{R}^{d \times k}$ such that variance in projected space is maximized.

Solution: eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^T X$ corresponding to the $k$ largest eigenvalues.

Equivalently: via SVD of the centered data matrix $X = U \Sigma V^T$; principal components are columns of $V$.

Explained variance ratio:

\[\text{EVR}_j = \frac{\lambda_j}{\sum_i \lambda_i}\]

Cumulative EVR guides choice of $k$.

t-SNE

t-Distributed Stochastic Neighbor Embedding. Nonlinear reduction for visualization (typically $k = 2$ or $3$).

Defines joint probabilities $p_{ij}$ (Gaussian in high-dim) and $q_{ij}$ (Student-t in low-dim), then minimizes their KL divergence:

\[\mathcal{L} = \text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\]

Heavy-tailed $q_{ij}$ prevents crowding. Not suitable for new points (no projection formula). Perplexity hyperparameter controls neighborhood size.

UMAP

Uniform Manifold Approximation and Projection. Faster than t-SNE; preserves more global structure. Constructs a fuzzy topological representation, then finds low-dimensional equivalent.

Method	Global structure	Scalability	New points
PCA	Yes	$O(nd^2)$	Yes
t-SNE	Partial	$O(n^2)$ (approx $O(n \log n)$)	No
UMAP	Better than t-SNE	$O(n)$ approx	Yes

Density Estimation

Models the underlying data distribution $P(X)$.

Kernel Density Estimation (KDE)

Non-parametric estimate:

\[\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)\]

Common kernels: Gaussian, Epanechnikov. Bandwidth $h$ controls smoothness (bias-variance tradeoff).

Autoencoders

Neural network encoder $z = f_\phi(x)$ and decoder $\hat{x} = g_\theta(z)$ trained to minimize reconstruction loss:

\[\mathcal{L} = \frac{1}{n}\sum_{i=1}^n \|x_i - g_\theta(f_\phi(x_i))\|^2\]

Bottleneck dimension forces a compressed representation. Variants: sparse autoencoders, denoising autoencoders, variational autoencoders (VAEs).

Practical Considerations

Curse of dimensionality: distances become uninformative in high dimensions; apply dimensionality reduction before clustering.
Feature scaling: required for distance-based methods (k-Means, DBSCAN, KDE).
Evaluating without labels: use internal metrics (silhouette, Davies-Bouldin) or downstream task performance.