Policy Gradients
Policy gradient methods directly optimize the policy $\pi_\theta$ by gradient ascent on the expected return. Unlike value-based methods, they handle continuous actions and stochastic policies naturally and can represent multi-modal policies.
Policy Gradient Theorem
Parameterize the policy as $\pi_\theta(a \mid s)$. The objective is:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T r_t\right] = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)]\]
Policy gradient theorem (Sutton et al. 2000):
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]\]
Here $G_t$ is the return from step $t$ onward. The gradient is an expectation over trajectories, so it can be estimated from sampled episodes; no model of $P$ or $R$ is required.
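As a sanity check on the theorem, the sketch below compares the score-function form of the gradient with a finite-difference gradient of $J$ on a one-step problem with a softmax policy. The logits, rewards and function names are illustrative assumptions, not part of the notes above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_return(theta, rewards):
    """J(theta) for a one-step problem: sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def score_function_gradient(theta, rewards):
    """Exact E_a[grad log pi(a) * r(a)] for a softmax policy.

    For softmax logits, d log pi(a) / d theta_j = 1[a == j] - pi(j).
    """
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            dlogp = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += pi[a] * dlogp * rewards[a]
    return grad

def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central-difference estimate of grad J, used only as a reference."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[j] += h
        dn[j] -= h
        diff = expected_return(up, rewards) - expected_return(dn, rewards)
        grad.append(diff / (2 * h))
    return grad

theta = [0.2, -0.5, 1.0]    # hypothetical logits
rewards = [1.0, 0.0, 2.0]   # hypothetical per-action rewards
pg = score_function_gradient(theta, rewards)
fd = finite_difference_gradient(theta, rewards)
```

The two gradients agree up to finite-difference error, and the score-function version never differentiates the reward itself.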
Derivation sketch (log-derivative trick):
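\[\nabla_\theta J = \int \nabla_\theta p_\theta(\tau) G(\tau) d\tau = \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} G(\tau) d\tau = \mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau) \cdot G(\tau)]\]
Since $\log p_\theta(\tau) = \sum_t \log \pi_\theta(a_t \mid s_t) + \text{const}$, where the constant collects the initial-state and transition log-probabilities (which do not depend on $\theta$), only the policy log-probs survive the gradient. Replacing $G(\tau)$ with $G_t$ in each term is valid because rewards received before step $t$ do not depend on $a_t$ and contribute zero in expectation.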
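Written out, the factorization used in the step above is as follows, with $\rho_0$ denoting the initial-state distribution (a symbol assumed here for illustration):
\[\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log P(s_{t+1} \mid s_t, a_t)\]
Only the middle sum depends on $\theta$, so:
\[\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\]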
REINFORCE (Monte Carlo Policy Gradient)
Sample full episodes; use the trajectory return as the signal:
    for each episode:
        sample τ = (s0, a0, r0, ..., sT) from π_θ
        for each step t:
            G_t ← Σ_{k≥t} γ^(k-t) r_k
            θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t
Intuition: increase the log-probability of actions that led to high returns and decrease it for actions that led to low returns.
High variance: $G_t$ sums many random rewards, so Monte Carlo return estimates have high variance, which makes REINFORCE slow to converge.
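The REINFORCE loop above can be sketched in plain Python for a tabular softmax policy. The parameter layout `theta[s][a]` (one logit per state-action pair), the function names and the default step size are assumptions made for illustration; the episode is assumed to be already sampled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k>=t} gamma^(k-t) * r_k for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, states, actions, rewards, alpha=0.1, gamma=0.99):
    """Apply the per-step REINFORCE update in place for one episode.

    theta[s][a] is the logit of action a in state s (hypothetical layout).
    For softmax logits, d log pi(a|s) / d theta[s][j] = 1[j == a] - pi(j|s).
    """
    returns = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, returns):
        pi = softmax(theta[s])
        for j in range(len(pi)):
            dlogp = (1.0 if j == a else 0.0) - pi[j]
            theta[s][j] += alpha * dlogp * g
    return theta

# One state, two actions, a single step with reward 1 for action 1.
theta = reinforce_update([[0.0, 0.0]], states=[0], actions=[1],
                         rewards=[1.0], alpha=0.1, gamma=1.0)
```

After this single update the logit of the rewarded action rises and the other falls by the same amount.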
Baseline: Variance Reduction
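Subtracting a baseline $b(s_t)$ from the return does not bias the gradient and, for a well-chosen $b$, reduces variance:
\[\nabla_\theta J = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) (G_t - b(s_t))\right]\]
Proof that the baseline doesn’t bias the gradient: $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(a \mid s) b(s)] = b(s) \mathbb{E}_a[\nabla_\theta \log \pi_\theta(a \mid s)] = b(s) \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$ (since $\sum_a \pi_\theta(a \mid s) = 1$, its gradient is 0). The argument requires that $b$ depend only on the state, not on the action.
Standard baseline: $b(s) = V^\pi(s)$ (the state value function). It is not exactly the variance-minimizing baseline, which weights returns by the squared norm of $\nabla_\theta \log \pi_\theta$, but it is close in practice and turns the gradient signal into an estimate of the advantage function $A(s,a) = Q(s,a) - V(s)$.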
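The effect of a baseline can be checked exactly on a small one-step problem: the mean of the single-sample gradient estimator is unchanged while its variance shrinks. The sketch below uses hypothetical rewards and looks only at the gradient component for the first logit of a softmax policy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_mean_and_variance(theta, rewards, baseline):
    """Exact mean and variance of g(a) = dlogpi(a)/dtheta_0 * (r(a) - b), a ~ pi."""
    pi = softmax(theta)
    samples = []
    for a in range(len(pi)):
        # Softmax score function with respect to the first logit.
        dlogp = (1.0 if a == 0 else 0.0) - pi[0]
        samples.append(dlogp * (rewards[a] - baseline))
    mean = sum(p * g for p, g in zip(pi, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(pi, samples))
    return mean, var

theta = [0.0, 0.0, 0.0]          # uniform policy (hypothetical)
rewards = [10.0, 11.0, 12.0]     # hypothetical rewards with a large common offset
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # plays the role of V

mean_plain, var_plain = gradient_mean_and_variance(theta, rewards, baseline=0.0)
mean_base, var_base = gradient_mean_and_variance(theta, rewards, baseline=value)
```

Both estimators have the same mean, but the large common offset in the rewards inflates the variance of the version without a baseline.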
Trust Region Policy Optimization (TRPO)
Schulman et al. (2015). Policy improvement by maximizing a surrogate objective subject to a KL divergence constraint:
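\[\max_\theta \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A^{\pi_{\theta_\text{old}}}(s,a)\right]\]
\[\text{subject to} \quad \mathbb{E}\left[D_\text{KL}(\pi_{\theta_\text{old}}(\cdot|s) \| \pi_\theta(\cdot|s))\right] \leq \delta\]
The ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ is the importance weight. It corrects for the data having been sampled under $\pi_{\theta_\text{old}}$ rather than $\pi_\theta$, which allows limited reuse of data collected under the old policy.
TRPO computes the update direction with the conjugate gradient method and then uses a backtracking line search to satisfy the constraint. It is complex and computationally expensive but stable.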
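To make the two quantities in the constrained problem concrete, the sketch below evaluates the surrogate objective and the mean KL divergence for categorical policies, and applies the kind of acceptance test a backtracking line search would use. It does not implement the conjugate gradient step; the function names and the acceptance rule as written are illustrative assumptions.

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(old_probs, new_probs):
    """Sample average of D_KL(pi_old(.|s) || pi_new(.|s)) over visited states."""
    total = sum(kl_categorical(o, n) for o, n in zip(old_probs, new_probs))
    return total / len(old_probs)

def surrogate(new_probs, old_probs, actions, advantages):
    """Sample average of rho * A, with rho = pi_new(a|s) / pi_old(a|s)."""
    total = 0.0
    for new_p, old_p, a, adv in zip(new_probs, old_probs, actions, advantages):
        total += new_p[a] / old_p[a] * adv
    return total / len(actions)

def accept_step(new_probs, old_probs, actions, advantages, delta):
    """Accept a candidate policy if the KL constraint holds and the surrogate improves.

    At theta_old every ratio is 1, so the surrogate equals the mean advantage.
    """
    old_value = sum(advantages) / len(advantages)
    new_value = surrogate(new_probs, old_probs, actions, advantages)
    return mean_kl(old_probs, new_probs) <= delta and new_value > old_value

old = [[0.5, 0.5]]   # pi_old(.|s) for one visited state (hypothetical)
new = [[0.6, 0.4]]   # candidate pi_new(.|s)
ok_loose = accept_step(new, old, actions=[0], advantages=[1.0], delta=0.05)
ok_tight = accept_step(new, old, actions=[0], advantages=[1.0], delta=0.01)
```

With the looser trust region the candidate is accepted; with the tighter one the same step violates the KL constraint and a line search would shrink it.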
Proximal Policy Optimization (PPO)
Schulman et al. (2017). Simplifies TRPO by replacing the hard constraint with a clipped surrogate objective:
\[\mathcal{L}^\text{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t, \;\text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]
Clipping: once $\rho_t$ leaves $[1-\epsilon, 1+\epsilon]$ in the direction that would increase the objective, the clipped term caps the objective and removes the incentive for a larger policy update. Typical values are $\epsilon = 0.1$–$0.2$.
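A minimal sketch of the clipped surrogate as a sample average, assuming the ratios $\rho_t$ and advantage estimates $A_t$ have already been computed (the function name and default $\epsilon$ are illustrative):

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample average of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        # The min makes the objective a pessimistic bound on the unclipped term.
        total += min(rho * adv, clipped * adv)
    return total / len(ratios)

# Positive advantage, ratio above 1 + eps: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate([1.5], [1.0])
# Negative advantage, ratio below 1 - eps: the objective is capped at (1 - eps) * A.
capped_loss = clipped_surrogate([0.5], [-1.0])
# Negative advantage, ratio above 1 + eps: not capped, the full penalty applies.
full_penalty = clipped_surrogate([1.5], [-1.0])
```

The asymmetry in the last case is deliberate: clipping only removes the incentive to move further in the favorable direction and never hides a move that makes the objective worse.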
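Combined objective (maximized):
\[\mathcal{L} = \mathcal{L}^\text{CLIP} - c_1 \mathcal{L}^\text{VF} + c_2 \mathcal{H}[\pi_\theta]\]
Here $c_1, c_2 > 0$ are weighting coefficients. Value function loss $\mathcal{L}^\text{VF}$: MSE of value predictions against return targets; it enters with a minus sign because the overall objective is maximized. Entropy bonus $\mathcal{H}$: encourages exploration.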
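The combination above can be sketched as follows for a discrete action space. The coefficient defaults `c1=0.5` and `c2=0.01` are commonly seen values assumed here for illustration, and the clipped term is taken as already computed.

```python
import math

def entropy(probs):
    """Shannon entropy of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def value_loss(predictions, targets):
    """Mean squared error between value predictions and return targets."""
    errors = [(v - g) ** 2 for v, g in zip(predictions, targets)]
    return sum(errors) / len(errors)

def ppo_objective(clip_term, predictions, targets, action_probs, c1=0.5, c2=0.01):
    """L = L_CLIP - c1 * L_VF + c2 * H, with H averaged over visited states."""
    mean_entropy = sum(entropy(p) for p in action_probs) / len(action_probs)
    return clip_term - c1 * value_loss(predictions, targets) + c2 * mean_entropy

# Hypothetical batch: two value predictions and one action distribution.
objective = ppo_objective(1.0, [1.0, 2.0], [1.0, 4.0], [[0.5, 0.5]], c1=0.5, c2=0.0)
```

Implementations that minimize with a gradient-descent optimizer typically negate this quantity to obtain a loss.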
PPO is one of the most widely used on-policy RL algorithms. It is simple to implement, stable, and effective across a wide range of tasks including robotics, games, and RLHF for LLMs.
Generalized Advantage Estimation (GAE)
Schulman et al. (2016). Compute the advantage as an exponentially weighted average of $n$-step advantages:
\[\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
- $\lambda = 0$: one-step TD advantage $\hat{A}_t = \delta_t$; low variance, but biased when $V$ is inaccurate.
- $\lambda = 1$: Monte Carlo advantage $\hat{A}_t = \sum_{l \ge 0} \gamma^l r_{t+l} - V(s_t)$; high variance, no bias from $V$.
- $\lambda \approx 0.95$: standard choice, balances both.
GAE is used in PPO, TRPO, and most actor-critic implementations.
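For a finite episode, the GAE sum can be computed with a single backward pass using the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The sketch below assumes `values` holds one more entry than `rewards`, the last being the bootstrap value of the final state (0 if the episode terminated); that layout is an assumption of this example.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], where V(s_T) is the bootstrap value.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        # TD error delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]   # hypothetical critic outputs, terminal value 0

monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)  # returns-to-go minus V(s_t)
one_step = gae(rewards, values, gamma=1.0, lam=0.0)     # plain TD errors
```

The two calls reproduce the limiting cases listed above: $\lambda = 1$ gives return minus value, and $\lambda = 0$ gives the TD error at each step.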
Comparison: Value-Based vs. Policy Gradient
| Property | DQN (value-based) | PPO (policy gradient) |
|---|---|---|
| Action space | Discrete | Discrete or continuous |
| Policy type | Implicit (greedy) | Explicit (parameterized) |
| On/off-policy | Off-policy | On-policy |
| Variance | Low | Higher (with baseline: moderate) |
| Sample efficiency | Better | Lower |
| Stability | Moderate | High (PPO) |
| Stochastic policies | Indirect | Natural |