Reinforcement Learning Overview
Reinforcement learning (RL) is a framework for learning to make sequences of decisions by interacting with an environment. An agent takes actions, receives rewards, and learns a policy that maximizes cumulative reward over time. Unlike supervised learning, no labeled examples are provided; the agent must discover good behavior through trial and error.
The RL Framework
The interaction between agent and environment proceeds in discrete time steps:
- At time $t$, the agent observes state $s_t \in \mathcal{S}$.
- The agent selects action $a_t \in \mathcal{A}$ according to its policy $\pi$.
- The environment transitions to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
- The agent receives reward $r_t = R(s_t, a_t)$.
The agent’s goal is to maximize the expected cumulative discounted reward:
\[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\]where $\gamma \in [0, 1)$ is the discount factor. $\gamma$ close to 1 values future rewards nearly as much as immediate ones; $\gamma = 0$ is purely greedy.
Key Components
State $s \in \mathcal{S}$: the complete description of the environment relevant to decision-making.
Observation $o$: what the agent actually perceives. May be a partial view of the full state (partially observable).
Action $a \in \mathcal{A}$: the agent’s decision. May be discrete (games, robotics joints) or continuous (torques, velocity).
Reward $r = R(s, a)$: scalar feedback signal. The only learning signal; must be carefully designed.
Policy $\pi(a \mid s)$: the agent’s behavior. Stochastic: a distribution over actions; deterministic: $a = \pi(s)$.
Value function: expected cumulative reward from a state; see Policies and Value Functions.
Model (optional): the agent’s internal estimate of $P(s’ \mid s,a)$ and $R(s,a)$. Model-free RL does not use a model; model-based RL learns or is given one.
RL vs. Supervised and Unsupervised Learning
| Property | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Labels | Yes (input-output pairs) | No | Reward signal |
| Feedback timing | Immediate | None | Delayed |
| Data distribution | Fixed dataset | Fixed dataset | Generated by policy |
| Goal | Generalize on held-out data | Discover structure | Maximize return |
RL’s unique challenges: temporal credit assignment (which past actions caused a reward?), exploration vs. exploitation, and non-stationary data distributions (since the data distribution changes as the policy improves).
The Exploration-Exploitation Tradeoff
Exploration: try new actions to gather information.
Exploitation: use current knowledge to maximize reward.
$\epsilon$-greedy: with probability $\epsilon$ take a random action; otherwise take the greedy action. Anneal $\epsilon$ over training.
UCB (Upper Confidence Bound): select the action with the highest upper confidence bound on its value:
\[A_t = \arg\max_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right]\]Thompson sampling: maintain a posterior over action values; sample from the posterior.
Taxonomy of RL Methods
RL
├── Model-free
│ ├── Value-based
│ │ ├── Q-learning, SARSA
│ │ └── DQN, Rainbow
│ ├── Policy-based
│ │ └── REINFORCE, TRPO, PPO
│ └── Actor-Critic
│ └── A2C, A3C, SAC, TD3
└── Model-based
├── Given model: AlphaZero, MuZero
└── Learned model: Dyna, MBPO, Dreamer
Applications
| Domain | Task | Algorithm |
|---|---|---|
| Games | Atari, Chess, Go | DQN, AlphaZero, MuZero |
| Robotics | Manipulation, locomotion | SAC, TD3, PPO |
| Language | RLHF for LLMs | PPO, DPO |
| Finance | Portfolio optimization | PPO, SAC |
| Recommendation | Sequential recommendation | DQN, policy gradients |
| Science | Protein folding, drug discovery | AlphaFold (supervised+RL) |
Markov Property
The environment is Markovian if the future depends only on the current state, not the history:
\[P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)\]Most RL theory assumes the Markov property. When the state is partially observed, the agent may maintain a history or a recurrent hidden state as a sufficient statistic.