Reinforcement Learning Overview

Reinforcement learning (RL) is a framework for learning to make sequences of decisions by interacting with an environment. An agent takes actions, receives rewards, and learns a policy that maximizes cumulative reward over time. Unlike supervised learning, no labeled examples are provided; the agent must discover good behavior through trial and error.

The RL Framework

The interaction between agent and environment proceeds in discrete time steps:

At time $t$, the agent observes state $s_t \in \mathcal{S}$.
The agent selects action $a_t \in \mathcal{A}$ according to its policy $\pi$.
The environment transitions to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
The agent receives reward $r_t = R(s_t, a_t)$.
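
The four steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names, and the environment (a five-state corridor with a goal at the right end) is an assumption chosen only to keep the example small.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: states 0..n_states-1, goal at the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clip the move.
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        # Transitions are deterministic, so this reward is a function of (s_t, a_t).
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    """Uniform random policy over the two actions (ignores the state)."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return the list of (s_t, a_t, r_t) tuples."""
    s = env.reset()                      # observe the initial state
    trajectory = []
    for _ in range(max_steps):
        a = policy(s, rng)               # select a_t according to the policy
        s_next, r, done = env.step(a)    # environment transitions and emits r_t
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory
```

Calling `run_episode(CorridorEnv(), random_policy, random.Random(0))` returns one trajectory; replacing `random_policy` with a learned policy is where an RL algorithm would plug in.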

The agent’s goal is to maximize the expected value of the discounted return $G_t$, the cumulative discounted reward from time $t$ onward:

$$ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} $$

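where $\gamma \in [0, 1)$ is the discount factor; with bounded rewards, $\gamma < 1$ keeps the infinite sum finite. A $\gamma$ close to 1 values future rewards nearly as much as immediate ones; $\gamma = 0$ makes the agent myopic, counting only the immediate reward $r_t$.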
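
For a finite reward sequence, the return can be computed with the recursion $G_t = r_t + \gamma G_{t+1}$, working backward from the last step. The sketch below assumes an episode that ends after a finite number of steps (so the infinite sum is truncated); the function names are illustrative.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma^k * r_k for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g


def returns_to_go(rewards, gamma):
    """Return the list [G_0, G_1, ..., G_{T-1}] for a finite episode."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out
```

With rewards `[1, 1, 1]` and $\gamma = 0.5$, the return from the first step is $1 + 0.5 + 0.25 = 1.75$; with $\gamma = 0$ it is just the first reward.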

Key Components

State $s \in \mathcal{S}$: the complete description of the environment relevant to decision-making.

Observation $o$: what the agent actually perceives. It may be only a partial view of the full state, in which case the environment is partially observable.

Action $a \in \mathcal{A}$: the agent’s decision. May be discrete (e.g., moves in a game) or continuous (e.g., joint torques or velocities in robotics).

Reward $r = R(s, a)$: a scalar feedback signal. It is the agent’s only learning signal, so it must be designed carefully: a poorly specified reward can be maximized by unintended behavior.

Policy $\pi(a \mid s)$: the agent’s behavior. Stochastic: a distribution over actions; deterministic: $a = \pi(s)$.

Value function: expected cumulative reward from a state; see Policies and Value Functions.

Model (optional): the agent’s internal estimate of $P(s' \mid s,a)$ and $R(s,a)$. Model-free RL does not use a model; model-based RL learns or is given one.
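
To make the stochastic versus deterministic distinction in the policy entry concrete, here is a small sketch for a single state with a handful of discrete actions. The softmax parameterization over per-action preferences is an assumption for illustration (the text does not prescribe one), and all function names are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn a list of real-valued preferences into action probabilities."""
    m = max(prefs)                          # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(prefs, rng=random):
    """Stochastic policy: draw a ~ pi(. | s) from the softmax distribution."""
    probs = softmax(prefs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(prefs):
    """Deterministic policy: a = pi(s) is the highest-preference action."""
    return max(range(len(prefs)), key=prefs.__getitem__)
```

The stochastic version can return different actions on repeated calls for the same state, while the deterministic version always returns the same one.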

RL vs. Supervised and Unsupervised Learning

Property	Supervised	Unsupervised	Reinforcement
Labels	Yes (input-output pairs)	No	Reward signal
Feedback timing	Immediate	None	Often delayed
Data distribution	Fixed dataset	Fixed dataset	Generated by policy
Goal	Generalize on held-out data	Discover structure	Maximize return

RL’s distinctive challenges are temporal credit assignment (which past actions caused a reward?), the exploration-exploitation tradeoff, and non-stationary data (the data distribution shifts whenever the policy changes).

The Exploration-Exploitation Tradeoff

Exploration: try new actions to gather information.

Exploitation: use current knowledge to maximize reward.

$\epsilon$-greedy: with probability $\epsilon$ take a uniformly random action; otherwise take the greedy action. $\epsilon$ is commonly annealed toward a small value over training, so that the agent explores early and exploits later.
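
A minimal sketch of $\epsilon$-greedy action selection over a list of estimated action values follows. The linear annealing schedule and its default parameters (`start`, `end`, `decay_steps`) are assumptions for illustration; other schedules, such as exponential decay, are equally common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit


def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

In a training loop this would be called as `epsilon_greedy(q_values, linear_epsilon(step))`, where `q_values` holds the current estimates for the state at hand.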

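UCB (Upper Confidence Bound): select the action with the highest upper confidence bound on its value, where $Q_t(a)$ is the current value estimate, $N_t(a)$ is the number of times action $a$ has been selected so far, and $c > 0$ controls how strongly rarely tried actions are favored: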

$$ A_t = \arg\max_a \left[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right] $$
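
The UCB rule above can be sketched for a multi-armed bandit as follows. Two details are assumptions not stated in the formula: actions with $N_t(a) = 0$ are tried first (the bonus is undefined for them), and the value estimate is updated as a running average. The function names and the default `c=2.0` are illustrative.

```python
import math


def ucb_action(q, counts, t, c=2.0):
    """Return argmax_a [ q[a] + c * sqrt(ln t / counts[a]) ], for t >= 1."""
    # Try every action once before applying the formula (avoids division by zero).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / counts[a]) for a in range(len(q))]
    return max(range(len(q)), key=scores.__getitem__)


def update_estimate(q, counts, action, reward):
    """Incremental sample-average update of the value estimate for `action`."""
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]
```

The bonus term shrinks as an action is selected more often, so a frequently chosen action must keep a high estimate to stay preferred.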

Thompson sampling: maintain a posterior over action values; at each step, draw one sample from the posterior for every action and take the action whose sampled value is highest.
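
As an illustration of Thompson sampling, the sketch below handles the Bernoulli bandit case, where each action pays reward 1 with an unknown probability. It assumes a uniform Beta(1, 1) prior per action, so the posterior after `s` successes and `f` failures is Beta(s + 1, f + 1); `thompson_action` and `run_bandit` are hypothetical names.

```python
import random


def thompson_action(successes, failures, rng=random):
    """Sample one value per action from its Beta posterior; act greedily on the samples."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_probs, n_steps, seed=0):
    """Simulate Thompson sampling on a Bernoulli bandit; return per-action counts."""
    rng = random.Random(seed)
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    for _ in range(n_steps):
        a = thompson_action(successes, failures, rng)
        if rng.random() < true_probs[a]:   # simulated Bernoulli reward
            successes[a] += 1
        else:
            failures[a] += 1
    return successes, failures
```

Exploration here comes from posterior uncertainty: an action with few trials has a wide posterior and is still sampled high occasionally, while well-explored poor actions are chosen less and less often.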

Taxonomy of RL Methods

RL
├── Model-free
│   ├── Value-based
│   │   ├── Q-learning, SARSA
│   │   └── DQN, Rainbow
│   ├── Policy-based
│   │   └── REINFORCE, TRPO, PPO
│   └── Actor-Critic
│       └── A2C, A3C, SAC, TD3
└── Model-based
    ├── Given model: AlphaZero
    └── Learned model: Dyna, MBPO, Dreamer, MuZero

Applications

Domain	Task	Algorithm
Games	Atari, Chess, Go	DQN, AlphaZero, MuZero
Robotics	Manipulation, locomotion	SAC, TD3, PPO
Language	RLHF for LLMs	PPO (DPO is a related RL-free alternative)
Finance	Portfolio optimization	PPO, SAC
Recommendation	Sequential recommendation	DQN, policy gradients
Science	Drug discovery (molecule generation)	Policy gradients

Markov Property

The environment is Markovian if the next state depends only on the current state and action, not on the earlier history:

$$ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) $$

Most RL theory assumes the Markov property. When the state is only partially observed, a single observation generally does not satisfy it, so the agent may condition on the observation history or on a recurrent hidden state that summarizes that history.
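
One simple way to maintain a history is to feed the agent a fixed window of the most recent observations. The sketch below is an assumption-laden illustration: `HistoryStack` is a hypothetical helper, and a window of length `k` restores the Markov property only if the last `k` observations happen to carry all the relevant information.

```python
from collections import deque


class HistoryStack:
    """Keep the last k observations and expose them as one tuple."""

    def __init__(self, k):
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self, first_obs):
        # At the start of an episode, pad the window with the first observation.
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(first_obs)
        return tuple(self.buffer)

    def push(self, obs):
        # Appending to a full deque drops the oldest observation automatically.
        self.buffer.append(obs)
        return tuple(self.buffer)
```

For example, if each observation is only a position, a window of two positions lets the agent infer velocity, which a single observation cannot provide.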