Eigenvalues and Eigenvectors
Definition
For a square matrix $A \in \mathbb{R}^{n \times n}$, a non-zero vector $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$ if:
\[A\mathbf{v} = \lambda\mathbf{v}\]Geometrically: $A$ only scales $\mathbf{v}$ (does not rotate it). $\lambda$ tells you by how much.
Computing Eigenvalues
Rewrite $A\mathbf{v} = \lambda\mathbf{v}$ as $(A - \lambda I)\mathbf{v} = \mathbf{0}$.
This has a non-trivial solution iff $A - \lambda I$ is singular:
\[\det(A - \lambda I) = 0\]This is the characteristic polynomial of $A$. Its roots are the eigenvalues.
Example for 2×2:
\[\det\begin{pmatrix}a-\lambda & b \\ c & d-\lambda\end{pmatrix} = (a-\lambda)(d-\lambda) - bc = 0\]Properties
- An $n \times n$ matrix has exactly $n$ eigenvalues (counting multiplicity, in $\mathbb{C}$)
- $\text{tr}(A) = \sum_i \lambda_i$
- $\det(A) = \prod_i \lambda_i$
- Eigenvectors for distinct eigenvalues are linearly independent
- Real symmetric matrices have real eigenvalues and orthogonal eigenvectors
Eigendecomposition
If $A$ has $n$ linearly independent eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$:
\[A = Q \Lambda Q^{-1}\]where $Q$ is the matrix with eigenvectors as columns and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$.
For symmetric matrices: $A = Q \Lambda Q^T$ (since $Q$ is orthogonal: $Q^{-1} = Q^T$)
When does it exist? Only for diagonalizable matrices (all $n$ eigenvectors linearly independent).
Spectral Theorem
Any real symmetric matrix $A$ can be decomposed as:
\[A = Q \Lambda Q^T = \sum_i \lambda_i \mathbf{q}_i \mathbf{q}_i^T\]where $Q$ is orthogonal and $\Lambda$ is real diagonal.
Covariance matrices and kernel matrices are symmetric positive semi-definite → always diagonalizable with real non-negative eigenvalues.
Powers and Functions of Matrices
Eigendecomposition makes matrix powers easy:
\[A^k = Q \Lambda^k Q^{-1}\] \[f(A) = Q f(\Lambda) Q^{-1}\]where $f(\Lambda) = \text{diag}(f(\lambda_1), \ldots, f(\lambda_n))$.
Used in: computing matrix exponential $e^A$ for ODEs, RNNs, state-space models.
Positive Definite Matrices
$A$ is positive definite (PD) iff all eigenvalues $\lambda_i > 0$.
$A$ is positive semi-definite (PSD) iff all $\lambda_i \geq 0$.
Equivalently: $A$ is PD iff $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$.
In ML:
- Covariance matrices are always PSD
- Gram matrices $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ must be PSD for valid kernel functions
- The Hessian at a local minimum is PD
Relationship to SVD
For $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^T$:
- Left singular vectors $U$ = eigenvectors of $AA^T$
- Right singular vectors $V$ = eigenvectors of $A^T A$
- Singular values $\sigma_i = \sqrt{\lambda_i(A^T A)}$
For square symmetric matrices: singular values equal the absolute values of eigenvalues.
Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in data.
Given centered data matrix $X \in \mathbb{R}^{n \times d}$:
- Compute covariance matrix: $C = \frac{1}{n} X^T X$
- Eigendecompose: $C = Q\Lambda Q^T$
- Principal components = eigenvectors $\mathbf{q}_i$ (ordered by decreasing $\lambda_i$)
- Project: $Z = XQ_k$ where $Q_k$ = top-$k$ eigenvectors
The eigenvalue $\lambda_i$ = variance explained along $\mathbf{q}_i$.
Spectral Methods in Graphs
For a graph with adjacency matrix $A$ or Laplacian $L = D - A$:
- Eigenvalues of $L$ (spectrum) encode graph structure
- Second smallest eigenvalue (Fiedler value) measures connectivity
- Eigenvectors of $L$ = smooth graph signals, used in spectral GNNs
Eigenvalues in Optimization
The Hessian’s eigenvalues determine the loss landscape curvature:
- All positive → convex local bowl → gradient descent converges
- Mix of positive/negative → saddle point
- Smallest eigenvalue → sharpest direction → sets gradient descent step size limit: $\eta < 2/\lambda_{\max}$
Condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ determines convergence rate.
Power Iteration
Iterative algorithm to find the dominant eigenvector:
v = random unit vector
repeat:
v = Av
v = v / norm(v)
Converges to the eigenvector for $\lambda_{\max}$. Rate depends on $\lvert \lambda_1 / \lambda_2 \rvert$.
Used in: PageRank, spectral clustering initialization.