Markov Decision Processes

A Markov Decision Process (MDP) is the formal mathematical framework for sequential decision-making under uncertainty. It provides the foundation for virtually all of reinforcement learning theory.

Formal Definition

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

$\mathcal{S}$: state space (finite, countably infinite, or continuous).
$\mathcal{A}$: action space.
$P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$: transition probability function. $P(s’ \mid s,a)$ is the probability of reaching $s’$ from $s$ via action $a$.
$R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function. $R(s,a)$ is the expected immediate reward.
$\gamma \in [0,1)$: discount factor.

Markov property: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$. The future is independent of the past given the present state.

Finite vs. Infinite Horizon

Finite horizon: the episode terminates after exactly $T$ steps. Return is $G_t = \sum_{k=0}^{T-t-1} r_{t+k}$ (undiscounted or discounted).

Infinite horizon with discount: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. Discounting ensures $G_t$ is bounded as long as rewards are bounded.

Episodic MDP: finite horizon with a terminal state. After reaching the terminal state, the environment resets. Atari games, board games.

Continuing MDP: no terminal state; the agent interacts indefinitely. Average reward criterion may be used instead of discounted sum.

Policies

A policy $\pi$ specifies the agent’s behavior.

Stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$. The probability of taking action $a$ in state $s$.

Deterministic policy: $\pi: \mathcal{S} \to \mathcal{A}$.

Stationary policy: does not depend on time $t$. For infinite-horizon discounted MDPs, the optimal policy is always stationary.

Markov policy: depends only on the current state (not on history). Sufficient for MDPs.

Return and Value Functions

Return from step $t$:

\[G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = r_t + \gamma G_{t+1}\]

State value function under policy $\pi$:

\[V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]\]

Action value (Q) function under policy $\pi$:

\[Q^\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]\]

Relationship: $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s,a)$

Partially Observable MDPs (POMDPs)

When the agent cannot directly observe the full state, the process is a POMDP:

$(\mathcal{S}, \mathcal{A}, \mathcal{O}, P, R, Z, \gamma)$

where $\mathcal{O}$ is the observation space and $Z(o \mid s’,a)$ is the observation emission probability.

The agent maintains a belief state $b(s) = P(S_t = s \mid o_1, a_1, \ldots, o_t)$, updated by Bayes’ rule after each observation.

Exact POMDP solution is PSPACE-hard. Practical approaches:

Maintain a history of observations as a proxy for the belief state.
Use a recurrent neural network to maintain an implicit belief state.
Frame-stacking (Atari DQN): input the last 4 frames to provide temporal context.

Reward Design

The reward function fully specifies the agent’s objective. Poor design leads to reward hacking.

Sparse rewards: reward only at goal achievement. Hard to learn with; requires efficient exploration.

Dense rewards: shaped rewards at intermediate steps. Easier to learn; may misalign with the true objective.

Reward shaping: add a potential-based shaping term $F(s,a,s’) = \gamma \Phi(s’) - \Phi(s)$ without changing the optimal policy (potential-based shaping theorem).

Inverse RL: infer the reward function from expert demonstrations. Applicable when specifying rewards is difficult.

Absorbing States and Termination

Terminal (absorbing) state $s_\text{term}$: $P(s_\text{term} \mid s_\text{term}, a) = 1$ for all $a$; $R(s_\text{term}, a) = 0$.

Once reached, the episode ends. Modeling termination as an absorbing state with zero reward preserves the standard MDP framework.

Finite MDP Example: Grid World

A grid of cells. States: cell positions. Actions: up, down, left, right.

Moving into a wall: stay in place.
Terminal states: goal (reward +1), pit (reward -1).
All other transitions: reward -0.01 (step cost to encourage efficiency).

This is fully specified as a finite MDP and can be solved exactly with dynamic programming.