Deep Q Networks
Deep Q-Networks (DQN) extend Q-learning to high-dimensional state spaces (e.g., raw pixels) by approximating the Q-function with a deep neural network. DQN was the first algorithm to learn control policies directly from raw pixel inputs across a wide range of Atari games.
DQN Architecture
Q-network: $Q(s, a; \theta) \approx Q^*(s, a)$, parameterized by weights $\theta$.
Input: stack of 4 grayscale frames (84×84×4) to provide temporal information.
Network: 3 convolutional layers + 2 fully connected layers. Output: one Q-value per action.
Key innovation: instead of outputting a single $Q(s,a)$ for a given $(s,a)$ pair, output all $\lvert\mathcal{A}\rvert$ Q-values simultaneously. Single forward pass covers all actions.
Instability of Naive Deep Q-Learning
Three sources of instability when naively combining Q-learning with neural networks:
-
Correlated updates: consecutive transitions $(s_t, a_t, r_t, s_{t+1})$ are highly correlated; networks trained on correlated sequences diverge.
-
Non-stationary targets: the target $r + \gamma \max_{a’} Q(s’, a’; \theta)$ changes every step because $\theta$ changes. Chasing a moving target causes oscillations.
-
Overestimation bias: the $\max$ operator amplifies noise in Q-value estimates.
Experience Replay
Store transitions $(s_t, a_t, r_t, s_{t+1}, \text{done})$ in a replay buffer $\mathcal{D}$ of fixed size $N$.
At each training step, sample a random mini-batch of size $B$ from $\mathcal{D}$:
\[\{(s_j, a_j, r_j, s_j')\}_{j=1}^B \sim \mathcal{D}\]Benefits:
- Breaks temporal correlations: mini-batch samples are approximately i.i.d.
- Data efficiency: each transition is used multiple times.
- Off-policy: old data can be replayed even after the policy has changed.
Buffer size: typically $10^5$–$10^6$ transitions. Too small: recent bias; too large: stale data.
Target Network
Maintain a separate target network $Q(s, a; \theta^-)$ with weights $\theta^-$ that are updated slowly.
Loss:
\[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta))^2\right]\]Update schedule: copy $\theta \to \theta^-$ every $C$ steps (e.g., $C = 10000$). Keeps the target approximately fixed during updates, reducing oscillations.
DQN Training Loop
Initialize Q(·; θ), Q(·; θ−) = Q(·; θ), replay buffer D
for each step t:
a_t ← ε-greedy policy from Q(s_t; θ)
execute a_t, observe r_t, s_{t+1}
store (s_t, a_t, r_t, s_{t+1}) in D
sample mini-batch from D
compute targets: y_j = r_j + γ max_{a'} Q(s_j', a'; θ−)
update θ by gradient descent on (y_j − Q(s_j, a_j; θ))²
every C steps: θ− ← θ
DQN Improvements: Rainbow
Double DQN: decouple action selection (use $\theta$) and evaluation (use $\theta^-$):
\[y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)\]Reduces overestimation bias.
Dueling DQN: decompose the Q-value into state value and advantage:
\[Q(s, a; \theta) = V(s; \theta_V) + \left(A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta_A)\right)\]$V(s)$ captures the state value; $A(s,a)$ captures action-relative advantage. Better generalization when many actions are equally good.
Prioritized Experience Replay (PER): sample transitions with priority proportional to their TD error magnitude $\lvert\delta\rvert^\omega$. Transitions with large errors are revisited more; compensate with importance sampling weights.
Multi-step returns: use $n$-step Bellman targets instead of one-step. Reduces bootstrap bias.
Distributional RL (C51): instead of learning the expected return $\mathbb{E}[G]$, learn the full return distribution $P(G \mid s,a)$ parameterized as a categorical distribution over $N$ atoms. Minimizes KL divergence.
Noisy Nets: replace $\epsilon$-greedy with learned stochastic weights in the network. Exploration is state-dependent and learnable.
Rainbow DQN: combine all six improvements. Substantially outperforms individual components on Atari.
Hyperparameters
| Parameter | Typical value |
|---|---|
| Replay buffer size | $10^6$ |
| Batch size | 32 |
| Target update frequency | 10000 steps |
| $\epsilon$ initial | 1.0 |
| $\epsilon$ final | 0.1 |
| $\epsilon$ annealing steps | $10^6$ |
| Learning rate | $2.5 \times 10^{-4}$ |
| $\gamma$ | 0.99 |
Limitations of DQN
Discrete actions only: the $\arg\max_a$ over the output requires a finite action set. Does not extend to continuous actions.
Solution: for continuous actions, use policy gradient methods (SAC, TD3) or discretize the action space.
Sample efficiency: DQN requires $10^7$–$10^8$ environment steps on Atari to reach human-level performance. Model-based methods can be orders of magnitude more sample-efficient.