Shallow Neural Networks
A shallow neural network has exactly one hidden layer between input and output. Despite their simplicity, shallow networks are the setting in which the universal approximation theorem is stated, and they serve as the building block for understanding deeper architectures. With an elementwise activation $\sigma$, the network computes:
\[\hat{\mathbf{y}} = W^{(2)} \sigma(W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}\]
Architecture
Input layer      Hidden layer               Output layer
x_1 ─┐
x_2 ─┤
...  ├──► [h_1, h_2, ..., h_n] ──► [y_1, ..., y_k]
x_d ─┘
- Input: $\mathbf{x} \in \mathbb{R}^d$
- Hidden: $\mathbf{h} = \sigma(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}) \in \mathbb{R}^n$
- Output: $\hat{\mathbf{y}} = W^{(2)}\mathbf{h} + \mathbf{b}^{(2)} \in \mathbb{R}^k$
Total parameters: $d \cdot n + n + n \cdot k + k = n(d + k + 1) + k$
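The layer definitions and the parameter count above can be checked with a short sketch. It uses plain Python lists to stay dependency-free; the helper names (`init_params`, `affine`, `forward`, `count_params`), the Gaussian initialisation, and the choice of `tanh` for $\sigma$ are illustrative assumptions, not something the text prescribes.

```python
import math
import random


def init_params(d, n, k, seed=0):
    """Randomly initialise W1 (n x d), b1 (n), W2 (k x n), b2 (k)."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    b1 = [0.0] * n
    W2 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    b2 = [0.0] * k
    return W1, b1, W2, b2


def affine(W, b, v):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def forward(params, x, sigma=math.tanh):
    """Compute h = sigma(W1 x + b1), then y_hat = W2 h + b2."""
    W1, b1, W2, b2 = params
    h = [sigma(z) for z in affine(W1, b1, x)]
    return affine(W2, b2, h)


def count_params(params):
    """Count every scalar weight and bias in the network."""
    W1, b1, W2, b2 = params
    return sum(len(r) for r in W1) + len(b1) + sum(len(r) for r in W2) + len(b2)


d, n, k = 4, 8, 3
params = init_params(d, n, k)
y_hat = forward(params, [0.5, -1.0, 2.0, 0.0])
print(len(y_hat))            # output dimension k
print(count_params(params))  # equals n * (d + k + 1) + k
```

For $d = 4$, $n = 8$, $k = 3$ the count is $8 \cdot (4 + 3 + 1) + 3 = 67$, matching the closed form.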
Forward Pass
For a single input $\mathbf{x} \in \mathbb{R}^d$:
Layer 1 (hidden):
\[z_j^{(1)} = \sum_{i=1}^d w_{ji}^{(1)} x_i + b_j^{(1)}, \quad j = 1, \ldots, n\]
\[h_j = \sigma(z_j^{(1)})\]
Layer 2 (output):
\[\hat{y}_m = \sum_{j=1}^n w_{mj}^{(2)} h_j + b_m^{(2)}, \quad m = 1, \ldots, k\]
Learning in a Shallow Network
Training minimizes the empirical risk over $N$ training pairs $(\mathbf{x}_i, y_i)$:
\[\hat{\theta} = \arg\min_{W^{(1)}, \mathbf{b}^{(1)}, W^{(2)}, \mathbf{b}^{(2)}} \frac{1}{N} \sum_{i=1}^N \mathcal{L}(\hat{y}_i, y_i)\]
For fixed hidden activations $\mathbf{h}$, the prediction is affine in $W^{(2)}$ and $\mathbf{b}^{(2)}$, so with a convex loss $\mathcal{L}$ the problem is convex in the output-layer parameters. The hidden weights $W^{(1)}$ pass through the nonlinearity $\sigma$ and make the overall problem non-convex.
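One way to make the empirical risk objective concrete is full-batch gradient descent on a tiny regression problem. The sketch below assumes scalar input and output ($d = k = 1$), a `tanh` activation, and squared-error loss; the target $\sin(\pi x)$, the width, the learning rate, and the step count are arbitrary illustrative choices. The number of training samples is written `N` to keep it distinct from the hidden width.

```python
import math
import random


def train(xs, ys, width=8, lr=0.05, steps=300, seed=0):
    """Full-batch gradient descent on mean squared error.

    Returns the final parameters and the loss recorded before each update.
    """
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(width)]  # hidden weights W1
    b1 = [0.0] * width                                # hidden biases b1
    w2 = [rng.gauss(0.0, 0.5) for _ in range(width)]  # output weights W2
    b2 = 0.0                                          # output bias b2
    N = len(xs)
    history = []
    for _ in range(steps):
        g_w1 = [0.0] * width
        g_b1 = [0.0] * width
        g_w2 = [0.0] * width
        g_b2 = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: h = tanh(w1 * x + b1), y_hat = w2 . h + b2.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(width)]
            err = sum(w2[j] * h[j] for j in range(width)) + b2 - y
            loss += err * err / N
            # Backward pass: chain rule through the output and hidden layers.
            g_out = 2.0 * err / N
            g_b2 += g_out
            for j in range(width):
                g_w2[j] += g_out * h[j]
                g_z = g_out * w2[j] * (1.0 - h[j] ** 2)  # tanh'(z) = 1 - h^2
                g_w1[j] += g_z * x
                g_b1[j] += g_z
        history.append(loss)
        for j in range(width):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), history


xs = [i / 10.0 - 1.0 for i in range(21)]
ys = [math.sin(math.pi * x) for x in xs]
params, history = train(xs, ys)
print(history[0], history[-1])  # loss before the first and the last update
```

The gradient with respect to `w2` depends on the data only through `h`, which is the linear (convex) part of the problem; the `g_z` term is where the hidden weights introduce non-convexity.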
Representational Capacity
For a width-$n$ ReLU network:
- In 1-D: creates at most $n$ “kinks” (breakpoints), so the output has at most $n + 1$ linear pieces; it can represent any continuous piecewise linear function with $\leq n$ pieces.
- In $d$ dimensions: partitions input space into at most $\sum_{j=0}^d \binom{n}{j}$ regions, on each of which the network is an affine function.
Width needed for $\epsilon$-approximation: Typically exponential in $d$ for multivariate functions. This curse of dimensionality motivates depth.
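The region bound and the 1-D picture above can be sketched in a few lines. The function names and the particular tent-shaped example are illustrative assumptions; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def max_linear_regions(n, d):
    """Upper bound sum_{j=0}^{d} C(n, j) on the linear regions of a width-n ReLU layer."""
    return sum(comb(n, j) for j in range(d + 1))


def relu(z):
    return max(0.0, z)


def tent(x):
    """Width-3 ReLU network in 1-D with kinks at 0, 0.5 and 1."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


print(max_linear_regions(3, 1))   # 4: three kinks split the line into four pieces
print(max_linear_regions(10, 2))  # 1 + 10 + 45 = 56
print([tent(x) for x in (-1.0, 0.0, 0.25, 0.5, 0.75, 1.0, 2.0)])
# [0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
```

The tent function has slopes 0, 2, -2, 0 on its four pieces, so three hidden units attain the 1-D bound of four regions.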
Universal Approximation (Formal Statement)
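Cybenko (1989): For any continuous $f: [0,1]^d \to \mathbb{R}$ and $\epsilon > 0$, there exist a width $n$ and parameters $W^{(1)}$, $\mathbf{b}^{(1)}$, $W^{(2)}$ such that the resulting network $\hat{f}$ satisfies:
\[\sup_{x \in [0,1]^d} |f(x) - \hat{f}(x)| < \epsilon\]
for any continuous sigmoidal activation $\sigma$, that is, one with $\sigma(t) \to 1$ as $t \to \infty$ and $\sigma(t) \to 0$ as $t \to -\infty$.
Hornik (1991): Extends the result to any continuous, bounded, non-constant $\sigma$.
Leshno et al. (1993): Extends it further to any non-polynomial $\sigma$ (under mild regularity conditions), which covers ReLU.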
The theorem guarantees existence but not efficient constructibility or learnability via gradient descent.
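In one dimension the existence claim can be illustrated constructively. The sketch below is not the argument used in the proofs; it simply builds a width-$n$ ReLU network that linearly interpolates a target at evenly spaced knots on $[0, 1]$ and measures the sup-norm error on a dense grid. The target $\sin(2\pi x)$, the knot placement, and the helper names are illustrative assumptions.

```python
import math


def relu(z):
    return max(0.0, z)


def relu_interpolant(f, n):
    """Return a width-n ReLU network that interpolates f at the knots i/n on [0, 1]."""
    knots = [i / n for i in range(n + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) * n for i in range(n)]
    # The output weight of unit i is the change in slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    bias = f(0.0)

    def f_hat(x):
        # zip stops after n terms, so only the knots 0, 1/n, ..., (n-1)/n are used.
        return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

    return f_hat


def sup_error(f, f_hat, grid=2001):
    """Approximate the sup-norm error on [0, 1] with a dense evaluation grid."""
    return max(abs(f(i / (grid - 1)) - f_hat(i / (grid - 1))) for i in range(grid))


def target(x):
    return math.sin(2.0 * math.pi * x)


for n in (4, 16, 64):
    print(n, sup_error(target, relu_interpolant(target, n)))
```

The error shrinks as the width grows, which is the 1-D version of the guarantee. Note that the weights here are written down directly rather than learned, so the sketch says nothing about whether gradient descent would find them.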
When Shallow Networks Suffice
- Low-dimensional inputs ($d$ up to a few hundred).
- Smooth, simple target functions.
- Small datasets where deep networks overfit.
- Interpretability is required.
Examples: one-hidden-layer MLPs for tabular data and radial basis function networks. Logistic regression, which has no hidden layer, is the limiting case below a shallow network and a natural baseline.
Comparison to Deep Networks
| Property | Shallow | Deep |
|---|---|---|
| Parameters for same function | Can be exponentially more (for some function families) | Polynomial |
| Optimization landscape | Fewer saddle points | More complex |
| Feature learning | One level of features | Hierarchical |
| Generalization (typical) | Worse on structured data | Better |
| Interpretability | Easier | Harder |
Shallow networks are best understood as the foundation: understanding their forward pass, loss landscape, and gradient flow directly extends to deeper architectures. See Deep Neural Networks.