Q-Learning

Q-learning is a model-free, off-policy temporal difference algorithm that directly estimates the optimal action-value function $Q^*$ without requiring knowledge of the environment’s transition model.

The Q-Learning Update

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \underbrace{\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]}_{\delta_t \text{ (TD error)}} $$

Target: $r_t + \gamma \max_{a’} Q(s_{t+1}, a’)$. This is the Bellman optimality target: take the best action at the next step.

Off-policy: the $\max$ over $a’$ in the target uses the greedy policy, regardless of which action was actually taken. The behavior policy can be anything that ensures sufficient exploration (e.g., $\epsilon$-greedy).

Update rule: move $Q(s_t, a_t)$ toward the target by step size $\alpha$.

Convergence Guarantees

Theorem: for a finite MDP, Q-learning converges to $Q^*$ with probability 1 if:

All state-action pairs are visited infinitely often.
Step sizes satisfy the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

The conditions ensure sufficient exploration (1) and that the learning rate decays appropriately (2). In practice, a fixed small $\alpha$ works for most applications.

SARSA: On-Policy TD Control

SARSA is the on-policy counterpart to Q-learning:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right] $$

The target uses the actual next action $a_{t+1}$ taken under the current policy, not the greedy action.

SARSA converges to $Q^{\pi_{\epsilon}}$ (the Q-function of the $\epsilon$-greedy policy), not $Q^{\ast}$. As $\epsilon \to 0$, $Q^{\pi_{\epsilon}} \to Q^{\ast}$.

Cliff Walking: a standard example where SARSA is safer than Q-learning. Q-learning learns the optimal path along the cliff (risky); SARSA learns a safer path away from the cliff because it accounts for exploratory actions.

Expected SARSA

Replace the sampled next action with the expected value:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma \sum_{a'} \pi(a'|s_{t+1}) Q(s_{t+1}, a') - Q(s_t, a_t)\right] $$

Lower variance than SARSA; more stable. Reduces to Q-learning when $\pi$ is greedy.

$n$-Step Q-Learning

Extend the target to $n$ steps:

$$ G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q(s_{t+n}, a') $$

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[G_t^{(n)} - Q(s_t, a_t)] $$

Longer $n$: lower bias, higher variance; less sensitive to bootstrapping errors.

Double Q-Learning

Standard Q-learning overestimates values because the $\max$ operator is applied to noisy estimates:

$$ \mathbb{E}[\max_a Q(s', a)] \geq \max_a \mathbb{E}[Q(s', a)] $$

Double Q-learning decouples action selection from action evaluation using two Q-networks $Q_A$ and $Q_B$:

$$ \text{target} = r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a')) $$

Select the action with $Q_A$; evaluate it with $Q_B$ (and vice versa). Reduces overestimation bias significantly.

Tabular Q-Learning Algorithm

Initialize Q(s, a) = 0 for all s, a
for each episode:
    s ← initial state
    while s is not terminal:
        a ← ε-greedy(Q, s)
        take action a; observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') - Q(s, a)]
        s ← s'

Tabular Q-table: $\lvert\mathcal{S}\rvert \times \lvert\mathcal{A}\rvert$ matrix. Feasible only for small state spaces.

Limitations and the Need for Function Approximation

State space explosion: a typical Atari game has $256^{3 \times 84 \times 84}$ possible pixel states; tabular representation is impossible.

Solution: approximate $Q(s,a) \approx Q(s,a;\theta)$ with a neural network. This leads to Deep Q-Networks (DQN); see Deep Q Networks.

Deadly triad: combining function approximation + bootstrapping + off-policy learning can cause divergence. DQN addresses this with experience replay and a target network.