Actor-Critic Methods
Actor-critic methods combine the strengths of policy gradient methods (the actor) and value-based methods (the critic). The critic estimates a value function, which reduces the variance of the policy gradient estimate; the actor updates the policy using the critic’s feedback.
Architecture
Actor: the policy $\pi_\theta(a \mid s)$. Takes actions and is updated to maximize expected return.
Critic: a value function $V_\phi(s)$ or $Q_\phi(s,a)$. Evaluates the actor’s actions and is updated by TD learning.
The actor uses the critic’s estimate as a baseline, producing the advantage:
\[A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \quad (\text{one-step TD advantage})\]
Actor update:
\[\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) A_t\]
Critic update (a semi-gradient step: the target $r_t + \gamma V_\phi(s_{t+1})$ is treated as a constant, so the gradient acts only on $V_\phi(s_t)$):
\[\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2\]
A2C / A3C
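Both variants build on the actor and critic updates above. A minimal single-worker sketch of one such update follows, assuming a tabular softmax policy (`theta[s][a]` are action preferences) and a tabular critic (`V[s]`). The function name, the table layout, and the step sizes are illustrative assumptions, not part of any particular implementation.

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """Apply one actor and one critic update in place; return the TD advantage.

    theta[s][a]: softmax action preferences (actor parameters, hypothetical layout).
    V[s]: tabular state-value estimates (critic parameters, hypothetical layout).
    """
    # One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    bootstrap = 0.0 if done else gamma * V[s_next]
    advantage = r + bootstrap - V[s]

    # Actor: for a softmax policy, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += lr_actor * grad_log_pi * advantage

    # Critic: the TD target is held fixed, so only V[s] moves.
    # The factor 2 from the squared error is absorbed into lr_critic.
    V[s] += lr_critic * advantage
    return advantage

# Usage: two states, two actions, one observed transition (s=0, a=1, r=1, s'=1).
theta = [[0.0, 0.0], [0.0, 0.0]]
V = [0.0, 0.0]
adv = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive advantage raises the preference for the action that was taken and lowers the others; a negative advantage does the opposite.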
A3C (Asynchronous Advantage Actor-Critic, Mnih et al. 2016): run multiple agents in parallel environments asynchronously; each agent computes gradients and applies them to a shared network. Because the agents visit different states at the same time, their updates are less correlated, which removes the need for a replay buffer.
A2C: the synchronous version of A3C. It waits for all parallel workers to complete a rollout, then applies a single batched update. This is more reproducible and often performs comparably.
$n$-step returns: use $n$-step TD targets for both actor and critic. GAE (see Policy Gradients) replaces fixed $n$-step returns in most implementations.
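As a rough illustration of $n$-step TD targets, the sketch below computes them for one rollout segment. It assumes that no episode boundary falls inside the segment and that targets near the end of the segment are truncated and bootstrapped from the value of the final state. The function name and argument layout are hypothetical.

```python
def n_step_targets(rewards, values, bootstrap_value, n, gamma=0.99):
    """Return the n-step TD target for every step of one rollout segment.

    rewards[t] is r_t and values[t] is V(s_t) for t = 0..T-1.
    bootstrap_value is V(s_T), the critic's estimate for the state after the
    last step (pass 0.0 if the episode terminated there).
    """
    T = len(rewards)
    all_values = list(values) + [bootstrap_value]
    targets = []
    for t in range(T):
        end = min(t + n, T)  # truncate at the rollout boundary
        g = 0.0
        for k in range(t, end):
            g += gamma ** (k - t) * rewards[k]
        g += gamma ** (end - t) * all_values[end]  # bootstrap from the critic
        targets.append(g)
    return targets

# Usage: the advantages for the actor are the targets minus the critic's values.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
targets = n_step_targets(rewards, values, bootstrap_value=2.0, n=2, gamma=0.5)
advantages = [g - v for g, v in zip(targets, values)]
```

With `n=1` this reduces to the one-step target $r_t + \gamma V_\phi(s_{t+1})$; larger `n` relies more on observed rewards and less on the critic.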
Soft Actor-Critic (SAC)
Haarnoja et al. (2018). One of the most widely used off-policy actor-critic algorithms for continuous control.
Maximum entropy RL: augment the objective with an entropy bonus to encourage exploration and robustness:
\[J(\pi) = \mathbb{E}\left[\sum_t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]\]
The optimal policy is a Boltzmann policy, $\pi^*(a|s) \propto \exp(Q^*(s,a)/\alpha)$, which allocates probability mass to all near-optimal actions, not just the single best.
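To make the Boltzmann form concrete, here is a small sketch for a single state with a finite action set, where the policy is a softmax of the action values divided by $\alpha$. The Q-values and temperatures are made-up numbers chosen only for illustration.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Return pi(a) proportional to exp(Q(a) / alpha) for a finite action set."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Usage: two near-optimal actions and one clearly bad action (hypothetical values).
q = [1.0, 0.9, -1.0]
pi_low_temp = boltzmann_policy(q, alpha=0.1)   # closer to greedy
pi_high_temp = boltzmann_policy(q, alpha=1.0)  # closer to uniform
```

At the lower temperature the near-optimal second action still keeps a substantial share of the probability while the bad action gets almost none; raising $\alpha$ spreads mass more evenly and increases the entropy.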
Soft Bellman equations:
\[Q^*(s,a) = r + \gamma \mathbb{E}_{s'}[V^*(s')]\]
\[V^*(s) = \mathbb{E}_{a \sim \pi^*}[Q^*(s,a) - \alpha \log \pi^*(a|s)]\]
SAC components:
- Stochastic actor $\pi_\theta$: parameterized as a Gaussian; actions sampled via reparameterization trick.
- Two Q-networks $Q_{\phi_1}$, $Q_{\phi_2}$: reduce overestimation bias by using $\min(Q_1, Q_2)$ for targets.
- Soft value network (optional; the soft value can instead be computed implicitly from the Q-networks and the policy).
- Target networks $\bar{\phi}_1$, $\bar{\phi}_2$: exponential moving average of Q-network weights.
- Automatic entropy tuning: adjust $\alpha$ so that the policy entropy matches a target entropy $\mathcal{H}_\text{target}$.
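The following sketch shows, for scalar values and plain Python lists, how three of these components are typically computed: the min-double-Q soft target, the moving-average target update, and one common form of the entropy-coefficient update. All function names and default values (`tau`, `lr`) are illustrative assumptions; real implementations apply the same arithmetic to batched tensors.

```python
import math

def soft_q_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """Target r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')) for one transition.

    q1_next and q2_next are the target networks' values at (s', a'), where a' is
    sampled from the current policy and log_prob_next = log pi(a'|s').
    """
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + (0.0 if done else gamma * soft_value)

def polyak_update(target_params, online_params, tau=0.005):
    """Exponential moving average: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def entropy_coefficient_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)].

    alpha is parameterized as exp(log_alpha) so that it stays positive.
    """
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap  # dJ / d(log_alpha)
    return log_alpha - lr * grad

# Usage with made-up numbers.
y = soft_q_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0,
                  log_prob_next=-1.0, alpha=0.5, gamma=0.9)
new_target = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
new_log_alpha = entropy_coefficient_step(0.0, [-0.5, -0.5], target_entropy=1.0)
```

When the sampled log-probabilities imply an entropy below the target, the step increases $\alpha$, which strengthens the entropy bonus; when the entropy is above the target, $\alpha$ shrinks.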
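Actor objective: maximize
\[\mathbb{E}_{a \sim \pi_\theta}\left[\min(Q_{\phi_1}, Q_{\phi_2})(s,a) - \alpha \log \pi_\theta(a \mid s)\right]\]
Q-network loss: mean squared error between each $Q_{\phi_i}(s,a)$ and the soft Bellman target $y = r + \gamma\left(\min(Q_{\bar{\phi}_1}, Q_{\bar{\phi}_2})(s',a') - \alpha \log \pi_\theta(a' \mid s')\right)$ with $a' \sim \pi_\theta(\cdot \mid s')$, which combines the min-double-Q trick with the target networks.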
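The actor objective can be estimated by sampling actions with the reparameterization trick. The sketch below does this for a single fixed state and a one-dimensional Gaussian policy. The two Q-functions are hypothetical stand-ins passed in as plain Python callables, and the sketch only evaluates the objective; an actual implementation would differentiate through the sampled action with an autodiff library.

```python
import math
import random

def sample_gaussian_action(mean, log_std, rng):
    """Reparameterized sample a = mean + std * eps with eps ~ N(0, 1), plus its log-density."""
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)
    action = mean + std * eps
    # log N(action; mean, std^2), written in terms of eps = (action - mean) / std.
    log_prob = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    return action, log_prob

def actor_objective(mean, log_std, q1, q2, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_a[min(Q1, Q2)(a) - alpha * log pi(a)] for one state."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        a, log_prob = sample_gaussian_action(mean, log_std, rng)
        total += min(q1(a), q2(a)) - alpha * log_prob
    return total / n_samples

# Usage: hypothetical critics that both peak at a = 1.
q1 = lambda a: -(a - 1.0) ** 2
q2 = lambda a: -(a - 1.0) ** 2 + 0.1
good = actor_objective(mean=1.0, log_std=-1.0, q1=q1, q2=q2, alpha=0.2)
bad = actor_objective(mean=-1.0, log_std=-1.0, q1=q1, q2=q2, alpha=0.2)
```

Because the noise `eps` is sampled independently of the policy parameters, the sampled action is a deterministic, differentiable function of `mean` and `log_std`, which is what lets gradients of the objective reach the actor.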
SAC is sample-efficient and comparatively stable, and it works well on continuous-control tasks such as manipulation and locomotion.
Twin Delayed Deep Deterministic Policy Gradient (TD3)
Fujimoto et al. (2018). Extends DDPG (deterministic off-policy actor-critic) with three key fixes.
DDPG base: deterministic actor $\mu_\theta(s)$; Q-critic; experience replay; target networks. Policy gradient:
\[\nabla_\theta J = \mathbb{E}_s[\nabla_a Q_\phi(s,a)|_{a=\mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s)]\]
TD3 improvements:
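- Clipped double Q-learning: two Q-networks; target uses $\min(Q_1, Q_2)$ to reduce overestimation.
- Delayed policy updates: update the actor every $d$ steps (typically $d=2$) but update critics every step. The policy update is more reliable when the critic is more accurate.
- Target policy smoothing: add small clipped noise to target actions to smooth the Q-landscape:
\[\tilde{a} = \mu_{\bar{\theta}}(s') + \text{clip}(\epsilon, -c, c), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)\]
where $\mu_{\bar{\theta}}$ is the target actor and $\tilde{a}$ replaces the target action in the critic's Bellman target.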
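A minimal sketch of how the three TD3 modifications fit together, for a one-dimensional action and a single transition. The target actor and target critics are passed in as plain callables; their names, the noise scale, the clipping range, and the action bound are illustrative assumptions.

```python
import random

def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))

def td3_target(reward, done, next_state, target_actor, target_q1, target_q2, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Critic target with target policy smoothing and clipped double Q-learning."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -max_action, max_action)
    # Clipped double Q-learning: bootstrap from the smaller of the two target critics.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))
    return reward + (0.0 if done else gamma * q_next)

def should_update_actor(step, policy_delay=2):
    """Delayed policy updates: the actor is updated once every policy_delay critic steps."""
    return step % policy_delay == 0

# Usage with hypothetical stand-in networks.
rng = random.Random(0)
y = td3_target(reward=1.0, done=False, next_state=0.0,
               target_actor=lambda s: 0.9,
               target_q1=lambda s, a: a,
               target_q2=lambda s, a: a + 1.0,
               rng=rng, gamma=0.5)
actor_steps = [t for t in range(1, 7) if should_update_actor(t)]
```

Both critics regress toward the same target `y`, while the actor is only updated on the steps selected by `should_update_actor`.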
TD3 and SAC are the two standard off-policy baselines for continuous control. SAC is often preferred because its entropy regularization tends to make training more stable.
Proximal Policy Optimization (PPO) as Actor-Critic
PPO (covered in Policy Gradients) shares the actor-critic structure. The critic estimates $V_\phi(s)$ for GAE. The actor is updated with the clipped surrogate objective.
PPO is the standard on-policy algorithm; SAC/TD3 are the standard off-policy algorithms.
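For reference, the clipped surrogate used by the PPO actor can be sketched per sample as below. The advantage would come from the critic via GAE; here it is passed in as a number, and the clip range of 0.2 is a commonly used value assumed for illustration.

```python
import math

def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)  # r = pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Usage: the new policy makes the action 1.5 times as likely as the old one did.
gain = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
loss = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

With a positive advantage the objective stops rewarding the ratio once it exceeds $1 + \epsilon$; with a negative advantage the unclipped term is kept, so moving toward a bad action is still fully penalized.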
Comparison
| Algorithm | On/Off-policy | Action space | Key feature |
|---|---|---|---|
| A2C | On-policy | Discrete/Continuous | Synchronous parallel |
| PPO | On-policy | Discrete/Continuous | Clipped surrogate |
| DDPG | Off-policy | Continuous | Deterministic policy |
| TD3 | Off-policy | Continuous | Double Q + delay |
| SAC | Off-policy | Continuous | Max entropy, sample-efficient |