Reinforcement Learning Overview

Reinforcement learning (RL) is a framework for learning to make sequences of decisions by interacting with an environment. An agent takes actions, receives rewards, and learns a policy that maximizes cumulative reward over time. Unlike supervised learning, no labeled examples are provided; the agent must discover good behavior through trial and error.

The RL Framework

The interaction between agent and environment proceeds in discrete time steps:

  1. At time $t$, the agent observes state $s_t \in \mathcal{S}$.
  2. The agent selects action $a_t \in \mathcal{A}$ according to its policy $\pi$.
  3. The environment transitions to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
  4. The agent receives reward $r_t = R(s_t, a_t)$.

The agent’s goal is to maximize the expected cumulative discounted reward:

\[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\]

where $\gamma \in [0, 1)$ is the discount factor. $\gamma$ close to 1 values future rewards nearly as much as immediate ones; $\gamma = 0$ is purely greedy.

Key Components

State $s \in \mathcal{S}$: the complete description of the environment relevant to decision-making.

Observation $o$: what the agent actually perceives. May be a partial view of the full state (partially observable).

Action $a \in \mathcal{A}$: the agent’s decision. May be discrete (games, robotics joints) or continuous (torques, velocity).

Reward $r = R(s, a)$: scalar feedback signal. The only learning signal; must be carefully designed.

Policy $\pi(a \mid s)$: the agent’s behavior. Stochastic: a distribution over actions; deterministic: $a = \pi(s)$.

Value function: expected cumulative reward from a state; see Policies and Value Functions.

Model (optional): the agent’s internal estimate of $P(s’ \mid s,a)$ and $R(s,a)$. Model-free RL does not use a model; model-based RL learns or is given one.

RL vs. Supervised and Unsupervised Learning

Property Supervised Unsupervised Reinforcement
Labels Yes (input-output pairs) No Reward signal
Feedback timing Immediate None Delayed
Data distribution Fixed dataset Fixed dataset Generated by policy
Goal Generalize on held-out data Discover structure Maximize return

RL’s unique challenges: temporal credit assignment (which past actions caused a reward?), exploration vs. exploitation, and non-stationary data distributions (since the data distribution changes as the policy improves).

The Exploration-Exploitation Tradeoff

Exploration: try new actions to gather information.

Exploitation: use current knowledge to maximize reward.

$\epsilon$-greedy: with probability $\epsilon$ take a random action; otherwise take the greedy action. Anneal $\epsilon$ over training.

UCB (Upper Confidence Bound): select the action with the highest upper confidence bound on its value:

\[A_t = \arg\max_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right]\]

Thompson sampling: maintain a posterior over action values; sample from the posterior.

Taxonomy of RL Methods

RL
├── Model-free
│   ├── Value-based
│   │   ├── Q-learning, SARSA
│   │   └── DQN, Rainbow
│   ├── Policy-based
│   │   └── REINFORCE, TRPO, PPO
│   └── Actor-Critic
│       └── A2C, A3C, SAC, TD3
└── Model-based
    ├── Given model: AlphaZero, MuZero
    └── Learned model: Dyna, MBPO, Dreamer

Applications

Domain Task Algorithm
Games Atari, Chess, Go DQN, AlphaZero, MuZero
Robotics Manipulation, locomotion SAC, TD3, PPO
Language RLHF for LLMs PPO, DPO
Finance Portfolio optimization PPO, SAC
Recommendation Sequential recommendation DQN, policy gradients
Science Protein folding, drug discovery AlphaFold (supervised+RL)

Markov Property

The environment is Markovian if the future depends only on the current state, not the history:

\[P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)\]

Most RL theory assumes the Markov property. When the state is partially observed, the agent may maintain a history or a recurrent hidden state as a sufficient statistic.