Policies and Value Functions
Policies define how an agent acts; value functions quantify how good it is to be in a state or to take an action. Together they are the central objects of RL theory and the building blocks of most algorithms.
The Optimal Policy
The goal is to find a policy $\pi^*$ that maximizes the expected return from every state:
\[\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S}\]Theorem (existence): for any finite MDP with discount $\gamma < 1$, there exists an optimal policy $\pi^*$ that is deterministic and stationary.
Optimal value function:
\[V^*(s) = \max_\pi V^\pi(s)\] \[Q^*(s, a) = \max_\pi Q^\pi(s, a)\]Given $Q^*$, the optimal policy is:
\[\pi^*(s) = \arg\max_a Q^*(s, a)\]State Value Function $V^\pi$
\[V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; S_t = s\right]\]$V^\pi(s)$ tells us how good it is to be in state $s$ when following policy $\pi$.
Computing $V^\pi$ (policy evaluation):
For a finite MDP, $V^\pi$ satisfies a linear system:
\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a) + \gamma V^\pi(s')]\]This can be solved directly (matrix inversion) or iteratively (iterative policy evaluation).
Action Value Function $Q^\pi$
\[Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; S_t = s, A_t = a\right]\]$Q^\pi(s,a)$ is the expected return when taking action $a$ in state $s$, then following $\pi$ thereafter.
Relationship to $V^\pi$:
\[V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)\] \[Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')\]Advantage Function
The advantage function measures how much better action $a$ is compared to the average action under $\pi$:
\[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]$A^\pi(s,a) > 0$ means $a$ is better than average; $A^\pi(s,a) < 0$ means worse.
The advantage function is central to policy gradient methods because it reduces variance while keeping the policy update unbiased.
Policy Improvement
Policy improvement theorem: given a policy $\pi$, define $\pi’$ as the greedy policy w.r.t. $Q^\pi$:
\[\pi'(s) = \arg\max_a Q^\pi(s,a)\]Then $V^{\pi’}(s) \geq V^\pi(s)$ for all $s$. If equality holds everywhere, $\pi$ is already optimal.
Policy iteration: alternate between policy evaluation (compute $V^\pi$) and policy improvement (compute greedy $\pi’$). Converges to $\pi^*$ in finite steps.
Generalized Policy Iteration (GPI)
The unifying principle behind most RL algorithms. Maintain two interacting processes:
- Policy evaluation: make the value function consistent with the current policy.
- Policy improvement: make the policy greedy with respect to the current value function.
These two processes work against each other in the short term but converge together to optimality.
V ←→ π
evaluation improvement
GPI: the processes don’t need to run to completion each time. Take one step of each, alternating (TD learning + greedy policy update).
On-Policy vs. Off-Policy
On-policy: the policy being evaluated and improved is the same as the policy generating data.
Off-policy: the policy generating data (behavior policy $b$) is different from the policy being learned (target policy $\pi$).
Off-policy learning enables:
- Learning from demonstrations (expert behavior policy).
- Reusing old experience (experience replay).
- Learning multiple policies simultaneously.
Importance sampling correction for off-policy value estimation:
\[V^\pi(s) = \mathbb{E}_b\left[\frac{\pi(A_t|S_t)}{b(A_t|S_t)} G_t \;\Big|\; S_t = s\right]\]The ratio $\rho_t = \pi(A_t \mid S_t)/b(A_t \mid S_t)$ is the importance weight.
Eligibility Traces and $\text{TD}(\lambda)$
Eligibility traces blend multi-step returns:
$\lambda$-return:
\[G_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}\]where $G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(S_{t+n})$ is the $n$-step return.
- $\lambda = 0$: one-step TD (TD(0)).
- $\lambda = 1$: full Monte Carlo return.
- Intermediate $\lambda$: balances bias (low $\lambda$) and variance (high $\lambda$).
Eligibility trace $e_t(s)$: accumulates recent visit frequency of a state; used to propagate the TD error backward efficiently.
\[e_t(s) = \gamma \lambda e_{t-1}(s) + \mathbf{1}[S_t = s]\] \[V(S_t) \leftarrow V(S_t) + \alpha \delta_t e_t(s)\]