Deep Neural Networks

What makes a Neural Network deep?

A deep neural network (DNN) has multiple hidden layers between input and output. Depth enables hierarchical feature learning: early layers detect low-level patterns, and later layers compose them into more abstract representations. This is the key advantage over shallow networks on data with compositional structure, such as images, text, and audio.

Architecture

For an $L$-layer network:

\[\mathbf{h}^{(0)} = \mathbf{x}\]
\[\mathbf{z}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad l = 1, \ldots, L\]
\[\mathbf{h}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)}), \quad l = 1, \ldots, L-1\]
\[\hat{\mathbf{y}} = \text{out}(\mathbf{z}^{(L)})\]

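where $\sigma^{(l)}$ is the activation function of hidden layer $l$ and $\text{out}$ is task-specific (softmax for multiclass classification, sigmoid for binary classification, linear for regression).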
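The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random initialization, and the choice of ReLU for the hidden layers and softmax for the output are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def mlp_forward(x, weights, biases, hidden_act=relu, out=softmax):
    """Forward pass: h^(0) = x, z^(l) = W^(l) h^(l-1) + b^(l)."""
    h = x
    num_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ h + b
        if l < num_layers:
            h = hidden_act(z)  # hidden layers l = 1..L-1
    return out(z)  # output layer applies the task-specific function

# Hypothetical sizes: 4 inputs, two hidden layers of width 8, 3 classes.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.normal(0.0, 0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
y_hat = mlp_forward(rng.normal(size=4), weights, biases)
```

Here `y_hat` is a length-3 probability vector; swapping `out` for the identity function would give a regression head instead.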

Depth vs. Width

Depth advantage: some function families can be represented with exponentially fewer neurons by a deep network than by a shallow one.

For Boolean functions: there are functions computable by polynomial-size depth-$k$ circuits that require exponentially large circuits at depth $k-1$.

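For ReLU networks: a network with $d$-dimensional input and $L$ hidden layers of width $n \ge d$ can partition input space into $\Omega((n/d)^{(L-1)d}\, n^d)$ linear regions. This count grows exponentially in $L$, whereas a single hidden layer of width $n$ yields at most $O(n^d)$ regions, which is only polynomial in $n$.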
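A one-dimensional toy case makes the exponential growth concrete. The sketch below is not the construction behind the bound above; it uses a standard sawtooth example (an assumption chosen for illustration) in which each layer is a width-2 ReLU "tent" map on $[0, 1]$, and composing $k$ such layers gives $2^k$ linear pieces.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # Width-2 ReLU layer: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_size=2**12 + 1):
    """Count linear pieces of tent composed `depth` times on [0, 1].

    Breakpoints fall on the dyadic grid for depth <= 12, so every grid
    interval lies inside a single linear piece and slope signs alternate.
    """
    x = np.linspace(0.0, 1.0, grid_size)
    y = x
    for _ in range(depth):
        y = tent(y)
    slope_signs = np.sign(np.diff(y))
    return int(np.sum(slope_signs[1:] != slope_signs[:-1])) + 1

pieces = [count_linear_pieces(k) for k in range(1, 6)]
```

Each added layer doubles the number of pieces while adding only two neurons, whereas a single hidden layer needs roughly one neuron per breakpoint in one dimension.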

Practical width choices:

Architecture	Typical hidden width
Small MLP (tabular)	64–512
Medium MLP	512–2048
ResNet-50	256–2048 per block
Transformer (GPT-2)	768–3072
LLM (GPT-4 class)	8192–16384+

Key Deep Architectures

Convolutional Neural Networks (CNNs)

Exploit spatial locality and translation equivariance via learnable filters; pooling adds approximate translation invariance.

Convolution layer: $Z_{i,j,c} = \sum_{m,n,c'} K_{m,n,c',c} \cdot H_{i+m, j+n, c'} + b_c$

Weight sharing: same filter applied at every spatial location.
Receptive field: grows with depth as convolutions are stacked; pooling and strided convolutions make it grow faster.
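The convolution formula can be written out directly as a loop over output positions. This is a deliberately naive sketch for readability, assuming no padding, stride 1, and a channels-last layout; the function name `conv2d_valid` is made up for this example.

```python
import numpy as np

def conv2d_valid(H, K, b):
    """Z[i, j, c] = sum_{m, n, c'} K[m, n, c', c] * H[i+m, j+n, c'] + b[c].

    H: input of shape (height, width, c_in)
    K: filters of shape (kh, kw, c_in, c_out)
    b: bias of shape (c_out,)
    """
    kh, kw, c_in, c_out = K.shape
    out_h = H.shape[0] - kh + 1
    out_w = H.shape[1] - kw + 1
    Z = np.empty((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = H[i:i + kh, j:j + kw, :]
            # Weight sharing: the same K is applied at every (i, j).
            Z[i, j, :] = np.tensordot(patch, K, axes=([0, 1, 2], [0, 1, 2])) + b
    return Z

H = np.ones((5, 5, 2))
K = np.ones((3, 3, 2, 4))
b = np.zeros(4)
Z = conv2d_valid(H, K, b)  # each output sums 3 * 3 * 2 = 18 ones
```

Production libraries compute the same quantity with far faster kernels; the loop form is only meant to mirror the indices in the formula.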

Architecture pattern: [Conv → BN → ReLU]* → Pool → Flatten → Dense

Key variants: LeNet, AlexNet, VGG, ResNet, EfficientNet.

Residual Networks (ResNets)

Introduce skip connections (residual connections) that add a block's input to its output:

\[\mathbf{h}^{(l+1)} = \sigma(\mathbf{z}^{(l+1)} + \mathbf{h}^{(l)})\]

The block learns the residual $F(\mathbf{h}) = \mathbf{z}^{(l+1)}$ rather than the full mapping, so an identity mapping is easy to represent by driving $F$ toward zero. This mitigates the vanishing gradient and degradation problems.
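A small sketch of the residual update, assuming a single affine map as the residual branch and ReLU as $\sigma$ (real ResNet blocks typically stack several convolutional layers in the branch). With the branch set to zero and a non-negative input, the block passes its input through unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(h, W, b):
    z = W @ h + b       # residual branch F(h)
    return relu(z + h)  # add the identity shortcut, then activate

h = np.array([0.5, 1.0, 2.0])
W_zero = np.zeros((3, 3))
b_zero = np.zeros(3)
h_next = residual_block(h, W_zero, b_zero)  # equals h
```

This is one way to read the degradation argument: an extra block only has to learn $F \approx 0$ to do no harm, which is easier than learning an identity map through a plain layer.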

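Vanishing gradient: backpropagation multiplies the gradient by $W^{(l)T}$ and by the activation derivative at each layer. In deep nets with saturating activations (sigmoid, tanh), these repeated factors are often smaller than one, so the gradient shrinks toward zero in early layers.

With skip connections: the gradient has a direct path through the identity shortcut. Ignoring the activation applied after the addition (as in pre-activation ResNets), so that $\mathbf{h}^{(k+1)} = \mathbf{h}^{(k)} + F_k(\mathbf{h}^{(k)})$: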

\[\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \prod_{k=l}^{L-1} \left(I + \frac{\partial F_k}{\partial \mathbf{h}^{(k)}}\right)\]

Expanding the product yields a term $I$ that passes the gradient from layer $L$ to layer $l$ unattenuated, which counteracts vanishing. This enables training of networks with 100–1000+ layers.
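The effect of the identity term can be illustrated numerically. In the sketch below the per-layer Jacobians $\partial F_k / \partial \mathbf{h}^{(k)}$ are random stand-ins with small norm, not Jacobians from a trained network; the dimension, depth, and scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers = 16, 50

plain = np.eye(dim)
residual = np.eye(dim)
for _ in range(num_layers):
    # Random stand-in for dF_k/dh^(k), scaled to have small norm.
    J = 0.1 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    plain = J @ plain                        # plain net: product of J_k
    residual = (np.eye(dim) + J) @ residual  # residual net: product of (I + J_k)

plain_norm = np.linalg.norm(plain)
residual_norm = np.linalg.norm(residual)
```

Under these assumptions `plain_norm` collapses to a value indistinguishable from zero after 50 layers, while `residual_norm` stays of order one.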

Recurrent Neural Networks (RNNs)

Process sequences by maintaining a hidden state $\mathbf{h}_t$ that is updated at each time step:

\[\mathbf{h}_t = \sigma(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})\]
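The recurrence can be unrolled over a sequence as follows. This is a minimal sketch that assumes $\sigma = \tanh$ and a zero initial state; the sizes and random weights are placeholders for illustration.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0=None):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b) along a sequence.

    xs: inputs of shape (seq_len, input_size)
    Returns all hidden states, shape (seq_len, hidden_size).
    """
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    states = []
    for x_t in xs:  # inherently sequential: step t needs h_{t-1}
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes: sequence length 5, input size 3, hidden size 4.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
W_h = rng.normal(0.0, 0.5, size=(4, 4))
W_x = rng.normal(0.0, 0.5, size=(4, 3))
b = np.zeros(4)
states = rnn_forward(xs, W_h, W_x, b)
```

The same weights `W_h`, `W_x`, and `b` are reused at every step, so backpropagation through time multiplies by `W_h` repeatedly along the sequence.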

Vanishing/exploding gradients over long sequences led to:

LSTM: adds cell state $C_t$ with input, forget, and output gates.
GRU: simplified gating; fewer parameters than LSTM.

Largely superseded by Transformers for sequence modeling.

Transformers

See NLP and Attention mechanism. Core operation:

\[\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

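Uses self-attention instead of recurrence, so all positions can be processed in parallel during training. Standard attention costs $O(n^2)$ time and memory in sequence length $n$, so scaling to very long sequences requires architectural modifications (e.g., sparse or linear attention).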
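The attention formula maps onto a few matrix operations. The sketch below covers a single head with no masking or batching, and the shapes are arbitrary assumptions for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 5))
output, weights = scaled_dot_product_attention(Q, K, V)
```

The `scores` matrix has one entry per query-key pair, which is where the quadratic cost in sequence length comes from when queries and keys are the same sequence.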

Training Challenges

Problem	Cause	Solution
Vanishing gradients	Repeated small multiplications in backprop	ReLU, skip connections, careful init
Exploding gradients	Repeated large multiplications	Gradient clipping, careful init
Degradation	Deeper plain nets perform worse	Residual connections
Internal covariate shift	Changing activation distributions during training	Batch Normalization
Overfitting	High capacity	Dropout, weight decay, early stopping
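As one concrete instance of the remedies in the table, gradient clipping by global norm can be sketched as below. The function name and the small epsilon are assumptions for this example; deep learning frameworks ship their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within bounds; epsilon avoids 0/0.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its length is capped.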

Representation Learning

Intermediate layers learn representations (features) that are:

Distributed: concepts encoded across many neurons.
Hierarchical: lower layers capture simple features; higher layers abstract patterns.
Transferable: representations learned on one task often transfer to related tasks (transfer learning, fine-tuning).

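In CNNs trained on ImageNet, layer-wise visualization shows a rough progression: edges (layer 1) → textures (layer 2) → parts (layer 3) → objects (layer 4+). The exact layer at which each feature type appears depends on the architecture.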

Depth in Practice

Start with proven architectures (ResNet, EfficientNet, ViT) rather than designing from scratch.
Depth is not always better: for small datasets, shallow networks or fine-tuned pre-trained models often generalize better than deep networks trained from scratch.
Residual connections and normalization layers are nearly always beneficial in deep networks.
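When scaling up, grow depth and width together rather than one in isolation, guided by empirical scaling laws: EfficientNet scales depth, width, and input resolution jointly, and Chinchilla balances parameter count against training tokens.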