Numerical Methods
Why Numerical Methods?
Most real-world problems in ML have no closed-form solution. Numerical methods find approximate solutions efficiently on a computer.
Key concerns:
- Accuracy: how close is the approximation to the true answer?
- Stability: does small error in input cause large error in output?
- Efficiency: how many operations does it take?
Floating Point Arithmetic
Computers represent real numbers in IEEE 754 floating point (32-bit float, 64-bit double).
- Machine epsilon: $\epsilon_{\text{mach}} \approx 1.2 \times 10^{-7}$ (float32), $\approx 2.2 \times 10^{-16}$ (float64)
- Underflow / overflow: very small or large numbers lose precision
- Catastrophic cancellation: subtracting two nearly equal numbers amplifies relative error
In ML: float32 standard for training; bfloat16 or float16 used in mixed-precision training to save memory and speed up computation.
Numerical Stability
An algorithm is numerically stable if errors do not amplify through computation.
Example: computing variance:
- Naive: $\text{Var}(X) = \frac{1}{n}\sum x_i^2 - \bar{x}^2$ → catastrophic cancellation
- Stable: Welford’s online algorithm
Softmax stability: avoid overflow by subtracting max:
\[\text{softmax}(x_i) = \frac{e^{x_i - \max(\mathbf{x})}}{\sum_j e^{x_j - \max(\mathbf{x})}}\]Log-sum-exp trick: $\log \sum_i e^{x_i} = \max(\mathbf{x}) + \log \sum_i e^{x_i - \max(\mathbf{x})}$
Solving Linear Systems
$A\mathbf{x} = \mathbf{b}$
Gaussian Elimination
Forward elimination + back substitution. $O(n^3)$ flops.
LU Decomposition
$A = LU$ (with partial pivoting for stability). Solve $L\mathbf{y} = \mathbf{b}$, then $U\mathbf{x} = \mathbf{y}$.
- Efficient when solving multiple systems with the same $A$
Cholesky Decomposition
For symmetric positive definite $A$: $A = LL^T$. Twice as fast as LU.
- Used in Gaussian processes, Kalman filters, covariance matrix operations
Iterative Methods
For large sparse systems: Conjugate Gradient (CG), GMRES. These converge without forming the full matrix, which is critical for large-scale ML.
Numerical Differentiation
Approximate derivatives when analytical form is unavailable.
Forward difference: $f’(x) \approx \frac{f(x+h) - f(x)}{h}$, error $O(h)$
Central difference: $f’(x) \approx \frac{f(x+h) - f(x-h)}{2h}$, error $O(h^2)$
Choice of $h$: too large → truncation error; too small → cancellation error. Typically $h \approx \sqrt{\epsilon_{\text{mach}}}$.
In ML: used for gradient checking, to verify analytic gradients by comparing to numerical estimates.
Automatic Differentiation (Autograd)
Distinct from numerical differentiation. Computes exact gradients by applying the chain rule to elementary operations.
Forward mode: compute $\dot{z} = \frac{\partial z}{\partial x}$ alongside the value. Efficient for few inputs.
Reverse mode (backpropagation): efficient for many inputs, few outputs (common in ML: one scalar loss). One forward pass + one backward pass.
In practice: PyTorch, JAX, TensorFlow implement reverse-mode autograd.
Eigenvalue Computation
Exact formulas only exist for $n \leq 4$. For larger matrices:
- Power iteration: converges to the dominant eigenvector
- QR algorithm: iteratively applies QR decomposition to converge to eigenvalues
- Lanczos algorithm: efficient for large sparse symmetric matrices
Used in PCA, spectral methods, graph neural networks.
Numerical Integration (Quadrature)
Approximate $\int_a^b f(x)\, dx$.
| Method | Rule | Error |
|---|---|---|
| Midpoint | $f(\frac{a+b}{2})(b-a)$ | $O(h^2)$ |
| Trapezoidal | $\frac{b-a}{2}[f(a)+f(b)]$ | $O(h^2)$ |
| Simpson’s | $\frac{b-a}{6}[f(a)+4f(\frac{a+b}{2})+f(b)]$ | $O(h^4)$ |
| Gaussian quadrature | Optimal node placement | very high accuracy |
Monte Carlo integration: $\int f(x)p(x)dx \approx \frac{1}{N}\sum_i f(x_i)$, $x_i \sim p$.
- Error $O(1/\sqrt{N})$, dimension-independent. The only practical method in high dimensions.
Interpolation and Approximation
Polynomial interpolation: fit a polynomial through data points.
- Lagrange interpolation, Newton’s divided differences
- Runge’s phenomenon: high-degree polynomials oscillate badly at boundaries
Splines: piecewise polynomials. Cubic splines are smooth ($C^2$) and stable.
In ML: used in feature engineering, upsampling, and signal processing.
Root Finding
Find $x$ such that $f(x) = 0$.
| Method | Convergence | Notes |
|---|---|---|
| Bisection | Linear | Always converges if bracketed |
| Newton-Raphson | Quadratic | Requires $f’$, may diverge |
| Secant | Superlinear | No $f’$ needed, two starting points |
Newton-Raphson: $x_{t+1} = x_t - \frac{f(x_t)}{f’(x_t)}$
Condition Number
Measures sensitivity of a problem to perturbations in input:
\[\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}\]- $\kappa \approx 1$: well-conditioned
- $\kappa \gg 1$: ill-conditioned, small changes in $\mathbf{b}$ cause large changes in $\mathbf{x}$
In ML: ill-conditioned Gram matrices ($X^T X$) make training unstable → addressed via regularization or batch normalization.