Bellman Equations

The Bellman equations are recursive consistency conditions for value functions. They are the central equations of RL theory: every major algorithm is either a direct solution to, or an approximation of, the Bellman equations.

Bellman Expectation Equation

The value of a state equals the expected immediate reward plus the discounted value of the next state:

For $V^\pi$:

\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V^\pi(s')\right]\]

For $Q^\pi$:

\[Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a')\]

In compact operator notation, defining the Bellman expectation operator $\mathcal{T}^\pi$:

\[(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V(s')]\]

$V^\pi$ is the unique fixed point: $\mathcal{T}^\pi V^\pi = V^\pi$.

$\mathcal{T}^\pi$ is a contraction mapping with modulus $\gamma$ under the $\ell_\infty$ norm. Repeated application converges to $V^\pi$ from any initialization.

Bellman Optimality Equation

The optimal value function satisfies a nonlinear equation: the agent takes the best action:

For $V^{\ast}$:

\[V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a) + \gamma V^*(s')\right]\]

For $Q^{\ast}$:

\[Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')\]

The Bellman optimality operator $\mathcal{T}^{\ast}$:

\[(\mathcal{T}^* V)(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V(s')]\]

$V^{\ast}$ is the unique fixed point of $\mathcal{T}^{\ast}$. Contraction with modulus $\gamma$. Value iteration applies $\mathcal{T}^{\ast}$ repeatedly to converge to $V^{\ast}$.

TD Error

The temporal difference (TD) error is the difference between the current value estimate and the Bellman target:

\[\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]

Interpretation: $r_t + \gamma V(s_{t+1})$ is a one-step estimate of $V(s_t)$ (the TD target). $\delta_t$ is the prediction error; a non-zero TD error means the value estimate is inconsistent with the Bellman equation.

TD(0) update:

\[V(s_t) \leftarrow V(s_t) + \alpha \delta_t\]

Drives $V$ toward the TD target. Converges to $V^\pi$ for tabular MDPs with appropriate step sizes.

SARSA (on-policy Q-update):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]\]

Q-learning (off-policy):

\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]\]

Multi-Step Bellman Targets

The one-step TD target can be extended to $n$ steps:

\[G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\]

$n$-step TD update:

\[V(s_t) \leftarrow V(s_t) + \alpha [G_t^{(n)} - V(s_t)]\]

$n=1$: standard TD; low variance, biased.
$n = T$ (end of episode): Monte Carlo; zero bias, high variance.
Intermediate $n$: tradeoff.

Multi-step returns are used in A2C, PPO, and Rainbow DQN.

Bellman Equations in Matrix Form

For a finite MDP with $\lvert\mathcal{S}\rvert$ states under a fixed policy $\pi$:

\[\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma P^\pi \mathbf{v}^\pi\] \[\mathbf{v}^\pi = (I - \gamma P^\pi)^{-1} \mathbf{r}^\pi\]

where $P^\pi_{ss’} = \sum_a \pi(a \mid s) P(s’ \mid s,a)$ and $r^\pi_s = \sum_a \pi(a \mid s) R(s,a)$.

Direct solution is $O(\lvert\mathcal{S}\rvert^3)$; feasible only for small state spaces. Dynamic programming uses iterative methods instead.

Contraction and Convergence

The Bellman operators $\mathcal{T}^\pi$ and $\mathcal{T}^*$ are $\gamma$-contractions under $|\cdot|_\infty$:

\[\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty \leq \gamma \|V - U\|_\infty\]

By Banach’s fixed point theorem, repeated application converges to the unique fixed point at a geometric rate.

Error propagation: if $V$ is an $\epsilon$-approximate fixed point ($|\mathcal{T}^\pi V - V|_\infty \leq \epsilon$), then:

\[\|V - V^\pi\|_\infty \leq \frac{\epsilon}{1-\gamma}\]

This $\frac{1}{1-\gamma}$ factor explains why small discount factors make RL easier: errors amplify less.