Neural Networks
A neural network is a parametric function $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ composed of alternating linear transformations and element-wise nonlinearities. It is a universal function approximator: given sufficient width, a single hidden layer can approximate any continuous function on a compact domain to arbitrary precision.
The Neuron (Perceptron)
The basic computational unit:
\[z = \mathbf{w}^T \mathbf{x} + b, \quad a = \sigma(z)\]- $\mathbf{x} \in \mathbb{R}^d$: input vector
- $\mathbf{w} \in \mathbb{R}^d$: weight vector
- $b \in \mathbb{R}$: bias term
- $\sigma$: activation function (nonlinearity)
- $a$: activation (output)
Without the nonlinearity $\sigma$, stacking neurons is equivalent to a single linear transformation.
Feedforward Neural Network (MLP)
A Multi-Layer Perceptron is a sequence of $L$ layers:
\[\mathbf{h}^{(0)} = \mathbf{x}\] \[\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \ldots, L\] \[\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})\] \[\hat{\mathbf{y}} = \mathbf{h}^{(L)}\]where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$.
Parameter count: $\sum_{l=1}^L (n_l \cdot n_{l-1} + n_l)$
Layer Types
| Layer | Operation | Parameters |
|---|---|---|
| Fully connected (Dense) | $W\mathbf{h} + \mathbf{b}$ | $n_l \cdot n_{l-1} + n_l$ |
| Convolutional | Sliding kernel over spatial dims | $k \times k \times C_\text{in} \times C_\text{out}$ |
| Recurrent (RNN) | $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$ | $n_h(n_h + n_x) + n_h$ |
| Attention | $\text{softmax}(QK^T/\sqrt{d_k})V$ | $3 \cdot d \cdot d_k + d_v \cdot d$ |
Network Architecture Terminology
- Width: number of neurons per layer ($n_l$)
- Depth: number of layers ($L$)
- Hidden layer: any layer that is not input or output
- Input layer: $\mathbf{h}^{(0)} = \mathbf{x}$; no parameters
- Output layer: $\mathbf{h}^{(L)}$; activation depends on task
Output activations by task:
| Task | Output activation | Loss |
|---|---|---|
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class classification | Softmax | Categorical cross-entropy |
| Regression | None (linear) | MSE |
| Multi-label classification | Sigmoid per output | Binary cross-entropy |
Universal Approximation Theorem
A feedforward network with a single hidden layer of width $N$ and a non-polynomial activation function can approximate any continuous function $f: [0,1]^d \to \mathbb{R}$ to within $\epsilon$ in the sup-norm, for any $\epsilon > 0$.
Limitations of the theorem:
- Existence does not imply learnability (gradient descent may not find the weights).
- The required width $N$ may be exponential in $d$.
- Depth provides exponential expressiveness gains over width for certain functions.
Computational Graph
A neural network is a directed acyclic graph (DAG) where:
- Nodes represent operations or variables.
- Edges represent data flow (tensors).
This representation enables automatic differentiation via the chain rule. See Backpropagation.
Matrix Form (Batch Processing)
For a batch of $n$ examples, stack inputs row-wise: $X \in \mathbb{R}^{n \times d}$.
\[Z^{(l)} = H^{(l-1)} W^{(l)T} + \mathbf{1}\mathbf{b}^{(l)T}\] \[H^{(l)} = \sigma(Z^{(l)})\]Modern frameworks (PyTorch, JAX, TensorFlow) use this batched form with broadcasting.
Expressiveness vs. Depth
For piecewise linear networks (ReLU), a depth-$L$ network with width $n$ can represent up to $O((n/d)^{(L-1)d} \cdot n^d)$ linear regions in input space. Depth increases expressiveness exponentially; doubling width increases it polynomially.
Functions with hierarchical structure (images, language) are efficiently represented by deep networks but require exponentially wide shallow networks.