Markov Decision Processes (MDP)
Key Points
- A Markov Decision Process (MDP) models decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
- An MDP is defined by states S, actions A, transition probabilities P, rewards R, and a discount factor \(\gamma\).
- The goal is to find a policy that maximizes expected discounted return, usually via the Bellman equations.
- Value Iteration and Policy Iteration are fundamental dynamic programming algorithms for solving MDPs when P and R are known.
- Q-learning learns optimal behavior from experience without knowing the transition model, making it model-free.
- Discounting with \(0 \le \gamma < 1\) ensures that rewards further in the future are worth progressively less, stabilizing solutions and guaranteeing convergence.
- Contraction properties of the Bellman operator guarantee convergence of Value Iteration to the optimal value function.
- C++ implementations typically represent P as sparse adjacency lists of (next state, probability, reward) triples to save memory.
Prerequisites
- Basic Probability — Transition dynamics use conditional probabilities and expectations.
- Linear Algebra — Policy evaluation can be written and solved as linear systems.
- Dynamic Programming — Bellman recursions are dynamic programming equations over states.
- Algorithmic Complexity — Understanding runtime and memory of iterative solvers guides practical choices.
- C++ Vectors and Iteration — Implementing tabular MDP algorithms requires manipulating arrays and loops.
- Optimization Basics — Policies are obtained by maximizing expected returns over actions.
- Stochastic Processes — MDPs are discrete-time stochastic control models with Markov transitions.
Detailed Explanation
01 Overview
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision making under uncertainty. It captures the idea that at each time step, the system is in some state, an agent chooses an action, the system moves to a new state according to stochastic transition rules, and a numerical reward is provided. The Markov property says the next state and reward depend only on the current state and action, not on the full past. An MDP is specified by a 5-tuple (S, A, P, R, \(\gamma\)), where S is the set of states, A is the set of actions, P encodes transition probabilities, R encodes rewards, and \(\gamma\) is a discount factor in [0,1). The central objective is to find a policy—a mapping from states to actions (possibly probabilistic)—that maximizes the expected sum of discounted rewards over time. This problem underlies many applications: robot navigation, inventory management, game playing, and medical treatment planning. Solutions hinge on the Bellman equations, which relate the value of a state to the immediate reward plus the expected value of future states. Algorithms like Value Iteration and Policy Iteration exploit these equations to compute optimal policies when the model is known, while reinforcement learning methods like Q-learning learn from sampled interactions when it is not.
02 Intuition & Analogies
Imagine you are navigating a city with uncertain traffic. Each intersection is a state, and choosing a road to take is an action. Even if you pick the same road twice, travel time varies due to random traffic conditions—this randomness is captured by transition probabilities. After each move, you get a reward: perhaps negative minutes (so smaller is better) or a toll fee. You care about getting to your destination quickly while avoiding tolls, which creates a trade-off between immediate and future outcomes. The discount factor \(\gamma\) is like valuing a minute saved now slightly more than a minute saved later—it prevents you from overcommitting to distant gains at the expense of what you can gain soon, and it mathematically ensures everything sums to a finite number. A policy is your driving strategy: at each intersection, which road do you take? The value of an intersection under a given strategy is the expected total minutes saved (or cost incurred) if you keep following that strategy. The Bellman equation is the common-sense idea that the worth of a place equals what you get immediately when you take an action plus the expected worth of where you’ll end up next. Value Iteration repeatedly updates your estimates of how good each intersection is, improving your strategy as you learn which roads reliably lead to better neighborhoods. Policy Iteration alternates between asking, "If I keep using this route plan, how good is it?" and "Given those values, can I pick even better roads?" If you don’t know the traffic model, Q-learning learns from experience by trying routes, noting outcomes, and gradually favoring choices that work out well.
03 Formal Definition
A (finite, discounted) MDP is a tuple \((S, A, P, R, \gamma)\): S is a finite set of states; A is a finite set of actions; \(P(s' \mid s, a)\) is the probability of moving to state \(s'\) when action a is taken in state s; \(R(s, a, s')\) is the reward received on that transition; and \(\gamma \in [0, 1)\) is the discount factor. The Markov property requires \(\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = \Pr(s_{t+1} \mid s_t, a_t)\). A policy \(\pi\) maps each state to a distribution over actions (a deterministic policy is the special case that puts all mass on one action). The objective is to find a policy maximizing the expected discounted return \(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]\).
04 When to Use
Use MDPs when modeling sequential decisions with uncertainty and the Markov property holds—future depends only on the current state and action. If you know the transition model and rewards (model-based), planning methods like Value Iteration or Policy Iteration are appropriate: for example, deterministic scheduling with random machine failures, inventory control with stochastic demand, or robot path planning with slip probabilities. When the model is unknown but you can interact with the environment (sample transitions), model-free reinforcement learning like Q-learning or SARSA is suitable: for instance, online ad selection with delayed stochastic clicks, game playing with unknown opponent tendencies, or autonomous control learned through simulation. If long-term consequences matter but you must balance them with immediate payoffs, discounting via \(\gamma\) helps enforce stability and tractability. Choose finite-horizon dynamic programming for problems with a fixed end time (e.g., planning over the next 12 months), and infinite-horizon discounted MDPs for continuing tasks (e.g., never-ending maintenance scheduling). For very large state spaces, consider approximate methods and function approximation (e.g., value function approximation, Monte Carlo tree search) rather than exact tabular solutions.
⚠️ Common Mistakes
- Ignoring the Markov property: Including irrelevant history in the state can bloat the state space; omitting necessary information breaks the Markov assumption. Define states so that the future is conditionally independent of the past, given the present state.
- Mixing reward definitions: Some texts use \(R(s,a)\) while others use \(R(s,a,s')\). Be consistent in code—if you store rewards on transitions, ensure you sum over next states when computing expectations.
- Using \(\gamma = 1\) without care: Without terminal states or bounded returns, convergence may fail. Prefer \(\gamma < 1\) or prove boundedness/termination.
- Incorrect stopping criteria: In Value Iteration, stopping when the maximum change \(\delta\) is below \(\epsilon \frac{1-\gamma}{2\gamma}\) ensures an \(\epsilon\)-optimal policy. Stopping too early yields suboptimal policies; too late wastes computation.
- Not normalizing transition probabilities: Each \(\sum_{s'} P(s' \mid s, a)\) must equal 1. Small floating-point errors can accumulate; renormalize or validate inputs.
- Ties in argmax: Greedy policy extraction with ties can oscillate. Break ties consistently (e.g., smallest index) for determinism.
- Overfitting to sparse rewards: In Q-learning, without sufficient exploration (e.g., \(\epsilon\)-greedy), the agent may never discover rewarding states. Schedule exploration decay thoughtfully.
- Ignoring sparsity: Storing full \(|S| \times |A| \times |S|\) arrays is wasteful when transitions are sparse. Use adjacency lists to cut memory and runtime.
Key Formulas
Discounted Return
\[ G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \]
Explanation: The total value from time t is the sum of future rewards, each discounted by \(\gamma\) per step. This ensures distant rewards contribute less and the sum is finite when \(\gamma < 1\).
Bellman Expectation (State Value)
\[ V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi}(s') \right] \]
Explanation: The value of a state under policy \(\pi\) equals the expected immediate reward plus discounted value of next states. Expectation is taken over the policy’s action probabilities and the transition model.
Bellman Expectation (Action Value)
\[ Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^{\pi}(s', a') \right] \]
Explanation: The value of taking action a in state s, then following \(\pi\), includes the immediate reward and the expected value of subsequent actions under \(\pi\).
Bellman Optimality (State Value)
\[ V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{*}(s') \right] \]
Explanation: The optimal value of a state equals the best expected one-step return achievable by choosing the best action now and behaving optimally thereafter.
Bellman Optimality (Action Value)
\[ Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right] \]
Explanation: The optimal action value is immediate reward plus discounted optimal value of next states. It underpins Q-learning and greedy policies.
Bellman Optimality Operator
\[ (TV)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right] \]
Explanation: Applying T to a value function performs a one-step optimal lookahead. Repeated application converges to \(V^{*}\) because T is a contraction.
Contraction Property
\[ \lVert TU - TV \rVert_{\infty} \le \gamma \lVert U - V \rVert_{\infty} \]
Explanation: The Bellman optimality operator shrinks the maximum absolute difference between two value functions by at least a factor \(\gamma\). This guarantees uniqueness of the fixed point and convergence of Value Iteration.
Value Iteration Update
\[ V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V_{k}(s') \right] \]
Explanation: This recurrence updates value estimates using one-step optimal lookahead from the previous estimates. Iterating this for all states yields better approximations to \(V^{*}\).
Probability Normalization
\[ \sum_{s'} P(s' \mid s, a) = 1 \quad \text{for all } s \in S,\ a \in A \]
Explanation: Transition probabilities from a given state-action pair must sum to one. This encodes that some next state will occur with certainty.
Policy Improvement
\[ \pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi}(s') \right] \]
Explanation: Given a value function for a policy \(\pi\), choosing actions greedily with respect to it yields a policy \(\pi'\) that is at least as good. Repeating evaluation and improvement converges to an optimal policy.
Stopping Criterion for Value Iteration
\[ \lVert V_{k+1} - V_{k} \rVert_{\infty} < \epsilon \,\frac{1-\gamma}{2\gamma} \ \Longrightarrow\ \lVert V^{\pi_{k+1}} - V^{*} \rVert_{\infty} < \epsilon \]
Explanation: If successive value estimates differ by less than a scaled \(\epsilon\), the greedy policy is within \(\epsilon\) of optimal. This provides a principled termination rule.
Complexity Analysis
One sweep of Value Iteration costs \(O(|S| \cdot |A| \cdot b)\), where b is the average number of successor states per (s, a) pair—up to \(|S|\) for dense models, giving \(O(|S|^2 |A|)\) per sweep. Because the error contracts by \(\gamma\) each sweep, the number of sweeps needed for an \(\epsilon\)-optimal result grows roughly like \(\frac{\log(1/\epsilon)}{1-\gamma}\), so convergence slows as \(\gamma \to 1\). Policy Iteration performs fewer but more expensive iterations: exact evaluation of a fixed policy solves a linear system in \(O(|S|^3)\) (or iteratively, like Value Iteration), and the number of improvement steps is typically small in practice. Tabular Q-learning costs \(O(|A|)\) per step (for the max over actions) and \(O(|S| \cdot |A|)\) memory. Sparse adjacency lists keep model storage at \(O(|S| \cdot |A| \cdot b)\) versus \(O(|S|^2 |A|)\) for dense arrays.
Code Examples
#include <iostream>
#include <vector>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

// Transition represented as (next_state, probability, reward)
struct Transition {
    int next;
    double prob;
    double reward;
};

// Value Iteration for a finite MDP with sparse transitions and rewards R(s,a,s')
pair<vector<double>, vector<int>> value_iteration(
    const vector<vector<vector<Transition>>> &P,  // P[s][a] -> list of transitions
    int S, int A, double gamma, double theta, int max_iters = 10000) {

    vector<double> V(S, 0.0);   // value function initialized to zero
    vector<int> policy(S, 0);   // greedy policy derived from V

    for (int iter = 0; iter < max_iters; ++iter) {
        double delta = 0.0;
        vector<double> Vnext = V;  // keep old V for synchronous updates
        for (int s = 0; s < S; ++s) {
            double best = -numeric_limits<double>::infinity();
            int best_a = 0;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            Vnext[s] = best;
            policy[s] = best_a;  // track current greedy action
            delta = max(delta, fabs(Vnext[s] - V[s]));
        }
        V.swap(Vnext);
        // Stop when updates fall below theta; passing
        // theta = epsilon * (1 - gamma) / (2 * gamma) yields an epsilon-optimal policy
        if (delta < theta) break;
    }
    return {V, policy};
}

int main() {
    // Example MDP with 3 states (0,1,2) and 2 actions (0,1)
    // Deterministic transitions for clarity; rewards on transitions
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));

    double gamma = 0.9;

    // State 0:
    P[0][0].push_back({1, 1.0, 5.0});  // Action 0: go to state 1, reward +5
    P[0][1].push_back({0, 1.0, 1.0});  // Action 1: stay in 0, reward +1

    // State 1:
    P[1][0].push_back({2, 1.0, 2.0});  // Action 0: go to state 2, reward +2
    P[1][1].push_back({0, 1.0, 0.0});  // Action 1: go to state 0, reward 0

    // State 2 (terminal-like): both actions loop with 0 reward
    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta = 1e-6;  // absolute tolerance on value updates

    auto [V, policy] = value_iteration(P, S, A, gamma, theta);

    cout << fixed << setprecision(6);
    cout << "Optimal state values:\n";
    for (int s = 0; s < S; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    cout << "\nGreedy optimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) {
        cout << "pi[" << s << "] = " << policy[s] << "\n";
    }
    return 0;
}
This program encodes a small MDP with rewards on transitions and runs Value Iteration. For each state, it computes the best expected one-step return over actions using the current value function and updates V. It stops when value changes are below a tolerance, then reports the optimal values and the greedy policy.
#include <iostream>
#include <vector>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

struct Transition { int next; double prob; double reward; };

// Evaluate a fixed (deterministic) policy via Gauss-Seidel updates
vector<double> policy_evaluation(
    const vector<int> &policy,
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta, int max_iters = 100000) {
    vector<double> V(S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        double delta = 0.0;
        for (int s = 0; s < S; ++s) {
            int a = policy[s];
            double v_old = V[s];
            double v_new = 0.0;
            for (const auto &t : P[s][a]) {
                v_new += t.prob * (t.reward + gamma * V[t.next]);
            }
            V[s] = v_new;  // Gauss-Seidel: immediate write-back
            delta = max(delta, fabs(v_new - v_old));
        }
        if (delta < theta) break;
    }
    return V;
}

pair<vector<double>, vector<int>> policy_iteration(
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta_eval) {

    vector<int> policy(S, 0);  // initial arbitrary policy
    vector<double> V(S, 0.0);

    while (true) {
        // 1) Policy evaluation
        V = policy_evaluation(policy, P, S, A, gamma, theta_eval);

        // 2) Policy improvement
        bool policy_stable = true;
        for (int s = 0; s < S; ++s) {
            int old_a = policy[s];
            double best = -numeric_limits<double>::infinity();
            int best_a = old_a;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            policy[s] = best_a;
            if (best_a != old_a) policy_stable = false;
        }
        if (policy_stable) break;  // converged to optimal policy
    }
    return {V, policy};
}

int main() {
    // Reuse the same 3-state, 2-action MDP as in the Value Iteration example
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));
    double gamma = 0.9;

    P[0][0].push_back({1, 1.0, 5.0});
    P[0][1].push_back({0, 1.0, 1.0});

    P[1][0].push_back({2, 1.0, 2.0});
    P[1][1].push_back({0, 1.0, 0.0});

    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta_eval = 1e-8;  // tolerance for policy evaluation

    auto [V, policy] = policy_iteration(P, S, A, gamma, theta_eval);

    cout << fixed << setprecision(6);
    cout << "Optimal values (Policy Iteration):\n";
    for (int s = 0; s < S; ++s) cout << "V[" << s << "] = " << V[s] << "\n";

    cout << "\nOptimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) cout << "pi[" << s << "] = " << policy[s] << "\n";

    return 0;
}
This code alternates between evaluating the current deterministic policy using Gauss–Seidel iterative updates and improving the policy greedily with respect to the evaluated values. It terminates when the policy no longer changes, yielding an optimal policy for the given MDP.
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <algorithm>
using namespace std;

struct StepResult { int next_s; double reward; bool done; };

struct Gridworld {
    int W, H;
    int goal_r, goal_c;
    double step_reward;

    Gridworld(int W_, int H_, int gr, int gc, double step_r)
        : W(W_), H(H_), goal_r(gr), goal_c(gc), step_reward(step_r) {}

    int to_state(int r, int c) const { return r * W + c; }
    pair<int,int> from_state(int s) const { return {s / W, s % W}; }

    // Actions: 0=up, 1=right, 2=down, 3=left
    StepResult step(int s, int a) const {
        auto [r, c] = from_state(s);
        int nr = r, nc = c;
        if (a == 0) nr = max(0, r - 1);
        if (a == 1) nc = min(W - 1, c + 1);
        if (a == 2) nr = min(H - 1, r + 1);
        if (a == 3) nc = max(0, c - 1);
        int ns = to_state(nr, nc);
        bool done = (nr == goal_r && nc == goal_c);
        double rwd = done ? 0.0 : step_reward;  // 0 at goal, negative elsewhere
        return {ns, rwd, done};
    }
};

int main() {
    const int W = 4, H = 4;
    Gridworld env(W, H, H - 1, W - 1, -1.0);  // goal at bottom-right
    const int S = W * H, A = 4;

    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    // Hyperparameters
    double gamma = 0.95;
    double alpha = 0.5;   // learning rate (can decay)
    double eps = 0.2;     // epsilon-greedy exploration (can decay)
    int episodes = 2000;
    int max_steps = 200;

    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);
    uniform_int_distribution<int> act_dist(0, A - 1);

    auto greedy_action = [&](int s) {
        int best_a = 0;
        double best_q = Q[s][0];
        for (int a = 1; a < A; ++a)
            if (Q[s][a] > best_q) { best_q = Q[s][a]; best_a = a; }
        return best_a;
    };

    for (int ep = 0; ep < episodes; ++ep) {
        int s = 0;  // start at (0,0)
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            // epsilon-greedy action selection
            int a = (uni(rng) < eps) ? act_dist(rng) : greedy_action(s);
            auto sr = env.step(s, a);
            // Q-learning update: Q(s,a) <- Q(s,a) + alpha [ r + gamma max_{a'} Q(s',a') - Q(s,a) ]
            double max_next = *max_element(Q[sr.next_s].begin(), Q[sr.next_s].end());
            double target = sr.reward + (sr.done ? 0.0 : gamma * max_next);
            Q[s][a] += alpha * (target - Q[s][a]);

            s = sr.next_s;
            done = sr.done;
        }
        // optional decay
        eps = max(0.01, eps * 0.995);
        alpha = max(0.05, alpha * 0.999);
    }

    // Derive greedy policy from learned Q
    vector<int> policy(S, 0);
    for (int s = 0; s < S; ++s) {
        policy[s] = int(max_element(Q[s].begin(), Q[s].end()) - Q[s].begin());
    }

    cout << fixed << setprecision(2);
    cout << "Greedy policy (0=U,1=R,2=D,3=L):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            cout << policy[env.to_state(r, c)] << ' ';
        }
        cout << '\n';
    }

    cout << "\nState values V(s)=max_a Q(s,a):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            int s = env.to_state(r, c);
            double v = *max_element(Q[s].begin(), Q[s].end());
            cout << setw(6) << v << ' ';
        }
        cout << '\n';
    }
    return 0;
}
This program learns a shortest-path-like policy in a 4x4 grid using Q-learning. It repeatedly runs episodes from the start, selecting actions with an \(\epsilon\)-greedy rule, and updates Q-values toward the target r + \(\gamma\) max Q at the next state. After training, it prints the greedy policy and the induced state values.