Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) studies settings where multiple agents interact in a shared environment, each learning its own policy. The agents may cooperate, compete, or both.

Problem Setup

$N$ agents interact in a shared environment. At each step $t$:

Each agent $i$ observes $o_t^i$ (may be partial or full state).
Each agent selects action $a_t^i \sim \pi^i(\cdot \mid o_t^i)$.
The environment transitions to $s_{t+1} \sim P(s_{t+1} \mid s_t, \mathbf{a}_t)$.
Each agent receives reward $r_t^i = R^i(s_t, \mathbf{a}_t)$.

The joint action $\mathbf{a}_t = (a_t^1, \ldots, a_t^N)$ affects the transition and rewards of all agents.
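The step loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `ToyMarkovGame` and `uniform_policy` are hypothetical names, the game is fully observable, and the transition is deterministic (a degenerate case of $P$).

```python
import numpy as np

class ToyMarkovGame:
    """Hypothetical N-agent game used only to illustrate the step loop."""

    def __init__(self, n_agents=2, n_states=2, n_actions=2, seed=0):
        self.n_agents, self.n_states, self.n_actions = n_agents, n_states, n_actions
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def observe(self, i):
        # Full observability in this toy case: o_t^i is just the state index.
        return self.state

    def step(self, joint_action):
        # R^i(s_t, a_t): every agent earns 1 when all agents pick the same action.
        agree = float(len(set(joint_action)) == 1)
        rewards = [agree] * self.n_agents
        # P(s_{t+1} | s_t, a_t): deterministic here, (sum of actions) mod n_states.
        self.state = int(sum(joint_action)) % self.n_states
        return rewards

def uniform_policy(obs, n_actions, rng):
    # Placeholder for pi^i(. | o_t^i): ignores the observation and samples uniformly.
    return int(rng.integers(n_actions))

env = ToyMarkovGame()
for t in range(3):
    observations = [env.observe(i) for i in range(env.n_agents)]
    joint_action = tuple(uniform_policy(o, env.n_actions, env.rng) for o in observations)
    rewards = env.step(joint_action)
```

Note that `step` takes the whole joint action: no agent's reward or next state can be computed from its own action alone.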

Interaction Types

Cooperative: all agents share the same reward ($r^i = r^j$ for all $i,j$). Goal: maximize joint return. Examples: multi-robot coordination, multi-agent navigation, team sports.

Competitive (zero-sum): one agent’s gain is another’s loss ($\sum_i r^i = 0$). Examples: chess, poker, StarCraft.

Mixed (general-sum): agents have partially aligned interests. Examples: autonomous driving, economics, social dilemmas.
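The three categories can be checked mechanically from a reward table. The sketch below assumes rewards are given as an array with one slice per agent; `interaction_type` is a hypothetical helper, and the example payoffs are standard textbook games rather than anything from a specific benchmark.

```python
import numpy as np

def interaction_type(rewards, tol=1e-9):
    """rewards: array of shape (n_agents, ...), one reward table per agent."""
    rewards = np.asarray(rewards, dtype=float)
    if np.all(np.abs(rewards - rewards[0]) < tol):
        return "cooperative"      # r^i = r^j for all i, j
    if np.all(np.abs(rewards.sum(axis=0)) < tol):
        return "zero-sum"         # sum_i r^i = 0 for every outcome
    return "general-sum"

# Two-player examples; entry [a1, a2] is the reward for that joint action.
coordination = np.array([[1.0, 0.0], [0.0, 1.0]])
pennies_row = np.array([[1.0, -1.0], [-1.0, 1.0]])
dilemma_row = np.array([[-1.0, -3.0], [0.0, -2.0]])

kinds = [
    interaction_type([coordination, coordination]),
    interaction_type([pennies_row, -pennies_row]),
    interaction_type([dilemma_row, dilemma_row.T]),
]
```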

Non-Stationarity

The primary challenge in MARL: from agent $i$’s perspective, the environment is non-stationary because other agents’ policies change during training. Standard single-agent convergence guarantees break down.

Even in a two-player game, if both agents take simultaneous gradient steps on their own objectives, the joint dynamics may oscillate or diverge instead of converging.
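A minimal sketch of this failure mode, assuming a hypothetical bilinear zero-sum objective $f(x, y) = xy$ in which one player minimizes over $x$ and the other maximizes over $y$. The unique equilibrium is $(0, 0)$, yet each simultaneous step multiplies the distance from it by $\sqrt{1 + \eta^2}$, where $\eta$ is the step size.

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
norms = [np.hypot(x, y)]            # distance from the equilibrium (0, 0)

for _ in range(100):
    grad_x, grad_y = y, x           # df/dx = y, df/dy = x for f(x, y) = x * y
    # Simultaneous update: x descends on f, y ascends on f, both using old values.
    x, y = x - lr * grad_x, y + lr * grad_y
    norms.append(np.hypot(x, y))
```

The iterates spiral outward: `norms` is strictly increasing even though each player is following its own gradient.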

Centralized Training with Decentralized Execution (CTDE)

The dominant paradigm for cooperative MARL. During training, agents share information (global state, other agents’ observations and actions). At execution, each agent acts using only its local observation.

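Motivation: training usually happens offline (for example in simulation), where global information is cheap to access; at execution time, communication bandwidth or latency may be limited, so each policy must work from local observations alone.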

QMIX

Rashid et al. (2018). Value decomposition for cooperative MARL with per-agent Q-functions.

Individual Q-functions: each agent $i$ has $Q_i(o^i, a^i; \theta_i)$.

Monotonicity constraint: the joint action that maximizes the mixed value $Q_\text{tot}$ should coincide with each agent independently maximizing its own $Q_i$. A sufficient condition, which QMIX enforces, is:

$$ \frac{\partial Q_\text{tot}}{\partial Q_i} \geq 0 \quad \forall i $$

via a hypernetwork-parameterized mixing network with non-negative weights.

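Result: when each agent greedily maximizes its own $Q_i$, the resulting joint action maximizes $Q_\text{tot}$, so decentralized greedy execution is optimal w.r.t. $Q_\text{tot}$. The price is that only value functions monotonic in every $Q_i$ can be represented.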
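The consistency between per-agent and joint greedy actions can be checked numerically. The sketch below is an illustration, not the reference implementation: the hypernetworks are single random linear maps (`H_w1`, `H_b1`, `H_w2`, `H_b2` are hypothetical names), the per-agent Q-values are random numbers standing in for network outputs, and non-negativity is imposed with an absolute value.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions, state_dim, embed = 3, 4, 5, 8

# Hypothetical hypernetwork parameters: linear maps from global state to mixer weights.
H_w1 = rng.normal(size=(state_dim, n_agents * embed))
H_b1 = rng.normal(size=(state_dim, embed))
H_w2 = rng.normal(size=(state_dim, embed))
H_b2 = rng.normal(size=(state_dim,))

def elu(x):
    # Monotonically increasing activation, so it preserves the constraint.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def q_tot(agent_qs, state):
    w1 = np.abs(state @ H_w1).reshape(n_agents, embed)   # non-negative mixing weights
    w2 = np.abs(state @ H_w2)                            # non-negative mixing weights
    hidden = elu(agent_qs @ w1 + state @ H_b1)           # biases may take any sign
    return float(hidden @ w2 + state @ H_b2)

state = rng.normal(size=state_dim)
agent_q_table = rng.normal(size=(n_agents, n_actions))   # stand-in for Q_i(o^i, a^i)

def joint_value(actions):
    chosen = agent_q_table[np.arange(n_agents), list(actions)]
    return q_tot(chosen, state)

greedy = tuple(int(a) for a in agent_q_table.argmax(axis=1))   # decentralized choice
best = max(itertools.product(range(n_actions), repeat=n_agents), key=joint_value)
```

Because every path from $Q_i$ to $Q_\text{tot}$ passes only through non-negative weights and a monotone activation, the decentralized choice `greedy` attains the same value as the brute-force maximizer `best` over all 64 joint actions.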

MAPPO

Multi-agent PPO. Each agent runs PPO with a centralized critic (observes global state) and a decentralized actor. Simple extension of PPO; strong cooperative baseline.
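A rough sketch of the two pieces, under simplifying assumptions: the advantage is a one-step TD error from a critic evaluated on the global state (practical implementations commonly use GAE instead), and the function names are hypothetical. Each agent would plug log-probabilities from its own local-observation actor into the clipped objective.

```python
import numpy as np

def centralized_advantage(rewards, values, gamma=0.99):
    """One-step TD advantage from a critic V(s) that sees the global state.

    values has len(rewards) + 1 entries; the last is the value of the final state.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized) for one agent's actor."""
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# Positive advantage with a ratio of 2.0: the gain is capped at 1 + clip_eps.
capped_gain = ppo_clip_objective(np.log([2.0]), np.log([1.0]), np.array([1.0]))
# Negative advantage with a ratio of 0.5: the pessimistic (clipped) term is used.
capped_loss = ppo_clip_objective(np.log([0.5]), np.log([1.0]), np.array([-1.0]))
```

Only the critic needs global information, and it is discarded at execution time, which is what makes the scheme an instance of CTDE.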

Independent Learners

Each agent runs a single-agent algorithm independently, treating other agents as part of the environment.

Independent Q-Learning (IQL): each agent runs DQN independently. Simple; often works in practice when there are few agents and their policies change slowly.

Issues: non-stationarity from other learning agents; no theoretical guarantees.
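A minimal sketch of independent learners, using tabular Q-values in a stateless matrix game instead of DQN to keep it short. The payoff matrix and hyperparameters are hypothetical. Each agent updates only the estimate for its own action and never sees the other agent's choice.

```python
import numpy as np

rng = np.random.default_rng(0)
payoff = np.array([[1.0, 0.0],
                   [0.0, 0.5]])    # shared reward R[a1, a2] in a toy cooperative game
q = np.zeros((2, 2))               # q[i, a]: agent i's estimate for its own action a
alpha, epsilon = 0.1, 0.1

for step in range(2000):
    actions = []
    for i in range(2):
        if rng.random() < epsilon:
            actions.append(int(rng.integers(2)))     # explore
        else:
            actions.append(int(np.argmax(q[i])))     # exploit own estimates
    reward = payoff[actions[0], actions[1]]
    for i in range(2):
        # The other agent's behaviour is folded into the reward signal, so the
        # target for q[i, a] drifts whenever the other agent changes its policy.
        q[i, actions[i]] += alpha * (reward - q[i, actions[i]])
```

The drift noted in the comment is the non-stationarity issue in miniature: the value of an action for agent 0 depends on how often agent 1 currently plays the matching action.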

Competitive MARL and Game Theory

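Nash equilibrium: a joint policy $(\pi^{*1}, \ldots, \pi^{*N})$ where no agent can improve its expected return by unilaterally changing its policy. Writing $\pi^{*-i}$ for the equilibrium policies of all agents other than $i$: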

$$ V^i(\pi^{*i}, \pi^{*-i}) \geq V^i(\pi^i, \pi^{*-i}) \quad \forall i, \forall \pi^i $$

Every finite game has at least one Nash equilibrium in mixed strategies (Nash 1950); a pure-strategy equilibrium need not exist.
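For two-player matrix games the definition can be checked by enumerating unilateral deviations. The sketch below finds pure-strategy equilibria only; `pure_nash_equilibria` is a hypothetical helper and the payoffs are the standard prisoner's dilemma and matching pennies.

```python
import numpy as np

def pure_nash_equilibria(payoff_row, payoff_col):
    """Return all (a, b) where neither player gains by deviating alone."""
    payoff_row = np.asarray(payoff_row, dtype=float)
    payoff_col = np.asarray(payoff_col, dtype=float)
    equilibria = []
    for a in range(payoff_row.shape[0]):
        for b in range(payoff_row.shape[1]):
            row_ok = payoff_row[a, b] >= payoff_row[:, b].max()   # row cannot improve
            col_ok = payoff_col[a, b] >= payoff_col[a, :].max()   # column cannot improve
            if row_ok and col_ok:
                equilibria.append((a, b))
    return equilibria

# Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
pd_eq = pure_nash_equilibria([[-1, -3], [0, -2]], [[-1, 0], [-3, -2]])
# Matching pennies: zero-sum, so the column payoffs are the negated row payoffs.
mp_eq = pure_nash_equilibria([[1, -1], [-1, 1]], [[-1, 1], [1, -1]])
```

Mutual defection is the only pure equilibrium of the prisoner's dilemma, while matching pennies has none: its only equilibrium is mixed, with both players randomizing uniformly.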

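Self-play: train agents by playing against current copies or past versions of themselves. Sampling opponents from a pool of past versions guards against strategy cycling and forgetting. In two-player zero-sum games, self-play variants that average over past strategies (such as fictitious play) converge to a Nash equilibrium, whereas naive self-play against only the latest copy can cycle.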

Fictitious play: each agent best-responds to the empirical distribution of its opponents’ past actions. In two-player zero-sum games, the empirical action frequencies converge to a Nash equilibrium.
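A short sketch of fictitious play on matching pennies, whose only Nash equilibrium has both players mixing uniformly. The initial counts of one per action and the tie-breaking rule (lowest index) are arbitrary choices made for this illustration.

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])        # row player's payoff; the column player receives -A
row_counts = np.ones(2)            # how often each row action has been played
col_counts = np.ones(2)            # how often each column action has been played

for _ in range(5000):
    belief_about_col = col_counts / col_counts.sum()
    belief_about_row = row_counts / row_counts.sum()
    row_action = int(np.argmax(A @ belief_about_col))    # best response for row
    col_action = int(np.argmin(belief_about_row @ A))    # column minimizes row's payoff
    row_counts[row_action] += 1
    col_counts[col_action] += 1

row_freq = row_counts / row_counts.sum()
col_freq = col_counts / col_counts.sum()
```

The actions themselves keep cycling; it is the empirical frequencies `row_freq` and `col_freq` that approach the uniform mixed strategy.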

AlphaStar

DeepMind (2019). Grandmaster-level StarCraft II play, rated above 99.8% of officially ranked human players. Uses:

League training: main agents, main exploiters, and league exploiters evolve together.
Each agent uses UPGO (upgoing policy update) + V-trace for off-policy correction.
Pointer network for unit selection; Transformer for global context.
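To make the off-policy correction concrete, here is a sketch of V-trace value targets as defined in the IMPALA paper (Espeholt et al. 2018), computed for a single trajectory. It is not AlphaStar's code: the function name and argument layout are assumptions, and `ratios[t]` stands for the importance ratio between the learner's policy and the behaviour policy at step `t`.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_t for one trajectory of length T.

    values[t] is V(x_t); bootstrap_value is V(x_T) for the state after the last step.
    """
    rewards, values, ratios = (np.asarray(x, dtype=float)
                               for x in (rewards, values, ratios))
    rhos = np.minimum(rho_bar, ratios)     # truncated weights on the TD errors
    cs = np.minimum(c_bar, ratios)         # truncated weights on the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    targets = np.zeros_like(values)
    acc = 0.0                              # holds v_{t+1} - V(x_{t+1})
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets

on_policy = vtrace_targets([1.0, 1.0, 1.0], [0.5, 0.2, -0.1], 2.0,
                           [1.0, 1.0, 1.0], gamma=0.9)
```

With all ratios equal to 1 (on-policy data) the targets reduce to ordinary bootstrapped n-step returns, which is a convenient sanity check; ratios above the truncation levels are simply clipped.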

Communication in MARL

Agents can share information via communication channels.

CommNet (Sukhbaatar et al. 2016): each agent receives the mean of the other agents’ hidden states and conditions on this aggregate message.
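The aggregation step is a one-liner. This sketch assumes hidden states are stacked in an array of shape `(n_agents, dim)` and excludes each agent's own state from its incoming message, as in the original CommNet formulation; `commnet_messages` is a hypothetical name.

```python
import numpy as np

def commnet_messages(hidden):
    """For each agent, return the mean of the other agents' hidden states."""
    hidden = np.asarray(hidden, dtype=float)
    n_agents = hidden.shape[0]
    total = hidden.sum(axis=0, keepdims=True)
    # Subtracting an agent's own state leaves the sum over the other n_agents - 1.
    return (total - hidden) / (n_agents - 1)

hidden = np.array([[1.0, 0.0],
                   [3.0, 2.0],
                   [5.0, 4.0]])
messages = commnet_messages(hidden)
```

Since averaging is differentiable and indifferent to agent ordering, the same module can be trained end to end and applied to a varying number of agents.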

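DIAL (Differentiable Inter-Agent Learning; Foerster et al. 2016): messages are real-valued during centralized training, so gradients flow from the receiver back to the sender through the communication channel; at execution, messages are discretized.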

QMIX / MAPPO with communication: agents share observations or messages during centralized training.

Multi-Agent Challenges

Challenge	Description
Non-stationarity	Other agents’ policies change during training
Credit assignment	Who contributed to the joint reward?
Scalability	Exponential joint action space with many agents
Emergent behavior	Complex strategies not explicitly programmed
Coordination	Agents must align without explicit communication

Emergent Communication

When communication protocols are not pre-specified, agents can develop emergent communication systems (grounded language-like protocols). Studied in referential games (speaker describes an object; listener must identify it). Emergent protocols are efficient but often not human-interpretable.
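A toy referential game gives a feel for how a protocol can emerge from reward alone. The sketch below uses simple urn-style reinforcement (successful choices are made more likely) rather than neural agents; the sizes, names, and episode count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects = n_messages = 3
speaker_w = np.ones((n_objects, n_messages))    # urn weights: object -> message
listener_w = np.ones((n_messages, n_objects))   # urn weights: message -> guessed object

def sample(weights):
    # Draw an index with probability proportional to its weight.
    return int(rng.choice(len(weights), p=weights / weights.sum()))

successes = []
for episode in range(5000):
    target = int(rng.integers(n_objects))       # object shown to the speaker
    message = sample(speaker_w[target])         # speaker emits a symbol
    guess = sample(listener_w[message])         # listener picks an object
    reward = float(guess == target)
    # Reinforce the choices that led to a correct identification.
    speaker_w[target, message] += reward
    listener_w[message, guess] += reward
    successes.append(reward)

late_success_rate = float(np.mean(successes[-1000:]))
```

Neither agent is told what any symbol means; any mapping that `speaker_w` and `listener_w` settle on is a convention shaped only by shared reward. Learners of this kind can settle on ambiguous codes where two objects share a symbol, so a perfect protocol is not guaranteed.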

Applications

Domain	Type	Algorithm
StarCraft II	Competitive/cooperative	AlphaStar
Multi-robot tasks	Cooperative	QMIX, MAPPO
Traffic signal control	Cooperative	MAPPO
OpenAI Five (Dota 2)	Cooperative team vs. team	PPO + self-play
Poker	Competitive	Counterfactual Regret Minimization