Neural Networks

A neural network is a parametric function $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ composed of alternating linear transformations and element-wise nonlinearities. It is a universal function approximator: given sufficient width, a single hidden layer can approximate any continuous function on a compact domain to arbitrary precision.

The Neuron (Perceptron)

The basic computational unit:

\[z = \mathbf{w}^T \mathbf{x} + b, \quad a = \sigma(z)\]
  • $\mathbf{x} \in \mathbb{R}^d$: input vector
  • $\mathbf{w} \in \mathbb{R}^d$: weight vector
  • $b \in \mathbb{R}$: bias term
  • $\sigma$: activation function (nonlinearity)
  • $a$: activation (output)

Without the nonlinearity $\sigma$, stacking neurons is equivalent to a single linear transformation.

Feedforward Neural Network (MLP)

A Multi-Layer Perceptron is a sequence of $L$ layers:

\[\mathbf{h}^{(0)} = \mathbf{x}\] \[\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \ldots, L\] \[\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})\] \[\hat{\mathbf{y}} = \mathbf{h}^{(L)}\]

where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$.

Parameter count: $\sum_{l=1}^L (n_l \cdot n_{l-1} + n_l)$

Layer Types

Layer Operation Parameters
Fully connected (Dense) $W\mathbf{h} + \mathbf{b}$ $n_l \cdot n_{l-1} + n_l$
Convolutional Sliding kernel over spatial dims $k \times k \times C_\text{in} \times C_\text{out}$
Recurrent (RNN) $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$ $n_h(n_h + n_x) + n_h$
Attention $\text{softmax}(QK^T/\sqrt{d_k})V$ $3 \cdot d \cdot d_k + d_v \cdot d$

Network Architecture Terminology

  • Width: number of neurons per layer ($n_l$)
  • Depth: number of layers ($L$)
  • Hidden layer: any layer that is not input or output
  • Input layer: $\mathbf{h}^{(0)} = \mathbf{x}$; no parameters
  • Output layer: $\mathbf{h}^{(L)}$; activation depends on task

Output activations by task:

Task Output activation Loss
Binary classification Sigmoid Binary cross-entropy
Multi-class classification Softmax Categorical cross-entropy
Regression None (linear) MSE
Multi-label classification Sigmoid per output Binary cross-entropy

Universal Approximation Theorem

A feedforward network with a single hidden layer of width $N$ and a non-polynomial activation function can approximate any continuous function $f: [0,1]^d \to \mathbb{R}$ to within $\epsilon$ in the sup-norm, for any $\epsilon > 0$.

Limitations of the theorem:

  • Existence does not imply learnability (gradient descent may not find the weights).
  • The required width $N$ may be exponential in $d$.
  • Depth provides exponential expressiveness gains over width for certain functions.

Computational Graph

A neural network is a directed acyclic graph (DAG) where:

  • Nodes represent operations or variables.
  • Edges represent data flow (tensors).

This representation enables automatic differentiation via the chain rule. See Backpropagation.

Matrix Form (Batch Processing)

For a batch of $n$ examples, stack inputs row-wise: $X \in \mathbb{R}^{n \times d}$.

\[Z^{(l)} = H^{(l-1)} W^{(l)T} + \mathbf{1}\mathbf{b}^{(l)T}\] \[H^{(l)} = \sigma(Z^{(l)})\]

Modern frameworks (PyTorch, JAX, TensorFlow) use this batched form with broadcasting.

Expressiveness vs. Depth

For piecewise linear networks (ReLU), a depth-$L$ network with width $n$ can represent up to $O((n/d)^{(L-1)d} \cdot n^d)$ linear regions in input space. Depth increases expressiveness exponentially; doubling width increases it polynomially.

Functions with hierarchical structure (images, language) are efficiently represented by deep networks but require exponentially wide shallow networks.