How I Study AI - Learn AI Papers & Lectures the Easy Way

Markov Decision Processes (MDP)

Key Points

  • A Markov Decision Process (MDP) models decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
  • An MDP is defined by states S, actions A, transition probabilities P, rewards R, and a discount factor \(γ\).
  • The goal is to find a policy that maximizes expected discounted return, usually via Bellman equations.
  • Value Iteration and Policy Iteration are fundamental dynamic programming algorithms for solving MDPs when P and R are known.
  • Q-learning learns optimal behavior from experience without knowing the transition model, making it model-free.
  • Discounting with \(0 ≤ γ < 1\) ensures that future rewards are worth less, stabilizing solutions and guaranteeing convergence.
  • Contraction properties of the Bellman operator guarantee convergence of value iteration to the optimal value function.
  • C++ implementations typically represent P as sparse adjacency lists over (next state, probability, reward) triples to save memory.

Prerequisites

  • Basic Probability — Transition dynamics use conditional probabilities and expectations.
  • Linear Algebra — Policy evaluation can be written and solved as linear systems.
  • Dynamic Programming — Bellman recursions are dynamic programming equations over states.
  • Algorithmic Complexity — Understanding runtime and memory of iterative solvers guides practical choices.
  • C++ Vectors and Iteration — Implementing tabular MDP algorithms requires manipulating arrays and loops.
  • Optimization Basics — Policies are obtained by maximizing expected returns over actions.
  • Stochastic Processes — MDPs are discrete-time stochastic control models with Markov transitions.

Detailed Explanation


01 Overview

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision making under uncertainty. It captures the idea that at each time step, the system is in some state, an agent chooses an action, the system moves to a new state according to stochastic transition rules, and a numerical reward is provided. The Markov property says the next state and reward depend only on the current state and action, not on the full past. An MDP is specified by a 5-tuple \((S, A, P, R, γ)\), where S is the set of states, A is the set of actions, P encodes transition probabilities, R encodes rewards, and \(γ\) is a discount factor in [0,1). The central objective is to find a policy—a mapping from states to actions (possibly probabilistic)—that maximizes the expected sum of discounted rewards over time. This problem underlies many applications: robot navigation, inventory management, game playing, and medical treatment planning. Solutions hinge on the Bellman equations, which relate the value of a state to the immediate reward plus the expected value of future states. Algorithms like Value Iteration and Policy Iteration exploit these equations to compute optimal policies when the model is known, while reinforcement learning methods like Q-learning learn from sampled interactions when it is not.

02 Intuition & Analogies

Imagine you are navigating a city with uncertain traffic. Each intersection is a state, and choosing a road to take is an action. Even if you pick the same road twice, travel time varies due to random traffic conditions—this randomness is captured by transition probabilities. After each move, you get a reward: perhaps negative minutes (so smaller is better) or a toll fee. You care about getting to your destination quickly while avoiding tolls, which creates a trade-off between immediate and future outcomes. The discount factor \(γ\) is like valuing a minute saved now slightly more than a minute saved later—it prevents you from overcommitting to distant gains at the expense of what you can gain soon, and it mathematically ensures everything sums to a finite number. A policy is your driving strategy: at each intersection, which road do you take? The value of an intersection under a given strategy is the expected total minutes saved (or cost incurred) if you keep following that strategy. The Bellman equation is the common-sense idea that the worth of a place equals what you get immediately when you take an action plus the expected worth of where you’ll end up next. Value Iteration repeatedly updates your estimates of how good each intersection is, improving your strategy as you learn which roads reliably lead to better neighborhoods. Policy Iteration alternates between asking, "If I keep using this route plan, how good is it?" and "Given those values, can I pick even better roads?" If you don’t know the traffic model, Q-learning learns from experience by trying routes, noting outcomes, and gradually favoring choices that work out well.

03 Formal Definition

An MDP is a tuple \(M = (S, A, P, R, γ)\) where: (1) \(S\) is a (finite or countable) set of states; (2) \(A\) is a set of actions (possibly state-dependent \(A(s)\)); (3) \(P(s' \mid s, a)\) is a transition kernel satisfying \(\sum_{s' \in S} P(s' \mid s, a) = 1\) for all \(s, a\); (4) \(R(s, a, s')\) is the (possibly stochastic) immediate reward when transitioning from \(s\) to \(s'\) via \(a\); (5) \(γ \in [0,1)\) is a discount factor. A (stationary) policy \(\pi\) can be deterministic (\(\pi: S \to A\)) or stochastic (\(\pi(a \mid s)\) a distribution over actions). The return from time \(t\) is \(G_t = \sum_{k=0}^{\infty} γ^k R_{t+k+1}\). The state-value function under policy \(\pi\) is \(V^\pi(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]\) and the action-value function is \(Q^\pi(s,a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]\). These satisfy the Bellman expectation equations. The optimal value functions \(V^*(s) = \sup_\pi V^\pi(s)\) and \(Q^*(s,a) = \sup_\pi Q^\pi(s,a)\) satisfy the Bellman optimality equations, and an optimal policy can be chosen greedily with respect to \(V^*\) or \(Q^*\). The Bellman optimality operator \(T\), defined by \((TV)(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V(s')]\), is a \(γ\)-contraction in the \(\ell_\infty\)-norm, guaranteeing convergence of Value Iteration.

04 When to Use

Use MDPs when modeling sequential decisions with uncertainty and the Markov property holds—future depends only on the current state and action. If you know the transition model and rewards (model-based), planning methods like Value Iteration or Policy Iteration are appropriate: for example, deterministic scheduling with random machine failures, inventory control with stochastic demand, or robot path planning with slip probabilities. When the model is unknown but you can interact with the environment (sample transitions), model-free reinforcement learning like Q-learning or SARSA is suitable: for instance, online ad selection with delayed stochastic clicks, game playing with unknown opponent tendencies, or autonomous control learned through simulation. If long-term consequences matter but you must balance them with immediate payoffs, discounting via \(γ\) helps enforce stability and tractability. Choose finite-horizon dynamic programming for problems with a fixed end time (e.g., planning over the next 12 months), and infinite-horizon discounted MDPs for continuing tasks (e.g., never-ending maintenance scheduling). For very large state spaces, consider approximate methods and function approximation (e.g., value function approximation, Monte Carlo tree search) rather than exact tabular solutions.

⚠️Common Mistakes

• Ignoring the Markov property: Including irrelevant history in the state can bloat the state space; omitting necessary information breaks the Markov assumption. Define states so that the future is conditionally independent of the past, given the present state.
• Mixing reward definitions: Some texts use \(R(s,a)\) while others use \(R(s,a,s')\). Be consistent in code—if you store rewards on transitions, ensure you sum over next states when computing expectations.
• Using \(γ = 1\) without care: Without terminal states or bounded returns, convergence may fail. Prefer \(γ < 1\) or prove boundedness/termination.
• Incorrect stopping criteria: In Value Iteration, stopping when the maximum change \(\delta\) is below \(\epsilon \frac{1-γ}{2γ}\) ensures an \(\epsilon\)-optimal policy. Stopping too early yields suboptimal policies; too late wastes computation.
• Not normalizing transition probabilities: Each \(\sum_{s'} P(s' \mid s,a)\) must equal 1. Small floating-point errors can accumulate; renormalize or validate inputs.
• Ties in argmax: Greedy policy extraction with ties can oscillate. Break ties consistently (e.g., smallest index) for determinism.
• Overfitting to sparse rewards: In Q-learning, without sufficient exploration (e.g., \(\epsilon\)-greedy), the agent may never discover rewarding states. Schedule exploration decay thoughtfully.
• Ignoring sparsity: Storing full \(|S| \times |A| \times |S|\) arrays is wasteful when transitions are sparse. Use adjacency lists to cut memory and runtime.

Key Formulas

Discounted Return

\(G_t = \sum_{k=0}^{\infty} γ^k R_{t+k+1}\)

Explanation: The total value from time t is the sum of future rewards, each discounted by \(γ\) per step. This ensures distant rewards contribute less and the sum is finite when \(γ<1\).

Bellman Expectation (State Value)

\(V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ V^\pi(s')]\)

Explanation: The value of a state under policy \(π\) equals the expected immediate reward plus discounted value of next states. Expectation is taken over the policy’s action probabilities and the transition model.

Bellman Expectation (Action Value)

\(Q^\pi(s,a) = \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ \sum_{a' \in A} \pi(a' \mid s') Q^\pi(s', a')]\)

Explanation: The value of taking action a in state s, then following \(π\), includes the immediate reward and the expected value of subsequent actions under \(π\).

Bellman Optimality (State Value)

\(V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ V^*(s')]\)

Explanation: The optimal value of a state equals the best expected one-step return achievable by choosing the best action now and behaving optimally thereafter.

Bellman Optimality (Action Value)

\(Q^*(s,a) = \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ \max_{a'} Q^*(s', a')]\)

Explanation: The optimal action value is immediate reward plus discounted optimal value of next states. It underpins Q-learning and greedy policies.

Bellman Optimality Operator

\((TV)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V(s')]\)

Explanation: Applying T to a value function performs a one-step optimal lookahead. Repeated application converges to \(V∗\) because T is a contraction.

Contraction Property

\(\|TV - TU\|_\infty \le γ \|V - U\|_\infty\)

Explanation: The Bellman optimality operator shrinks the maximum absolute difference between two value functions to at most \(γ\) times its previous size. This guarantees uniqueness of the fixed point and convergence of Value Iteration.

Value Iteration Update

\(V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V_k(s')]\)

Explanation: This recurrence updates value estimates using one-step optimal lookahead from the previous estimates. Iterating this for all states yields better approximations to \(V∗\).

Probability Normalization

\(\sum_{s' \in S} P(s' \mid s, a) = 1 \quad \text{for all } s, a\)

Explanation: Transition probabilities from a given state-action pair must sum to one. This encodes that some next state will occur with certainty.

Policy Improvement

\(\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V^\pi(s')]\)

Explanation: Given a value function for a policy \(π\), choosing actions greedily with respect to it yields a policy \(π'\) that is at least as good. Repeating evaluation and improvement converges to an optimal policy.

Stopping Criterion for Value Iteration

\(\|V_{k+1} - V_k\|_\infty < \epsilon \frac{1-γ}{2γ} \;\Rightarrow\; \text{greedy}(V_{k+1}) \text{ is } \epsilon\text{-optimal}\)

Explanation: If successive value estimates differ by less than a scaled \(ϵ\), the greedy policy is within \(ϵ\) of optimal. This provides a principled termination rule.

Complexity Analysis

Let |S| be the number of states, |A| the number of actions, and let each (s,a) have at most d nonzero next-state probabilities (sparsity).

Value Iteration performs, per sweep, for each state an action maximization over expectations that sum over at most d successors. Thus one sweep costs O(|S| · |A| · d). The number of sweeps to achieve an \(ϵ\)-optimal policy depends on the contraction factor \(γ\) and the reward scale; a common bound is O(log(1/ϵ) / (1-γ)). Space is O(|S| + |S||A|d) if P and R are stored sparsely, plus O(|S|) for V and O(|S|) for a deterministic policy.

Policy Iteration alternates: (1) policy evaluation to compute \(V^π\), and (2) policy improvement. Exact evaluation by solving \((I - γ P_π) V = R_π\) costs roughly O(|S|³) with dense linear algebra, which is prohibitive for large |S|. In practice, iterative policy evaluation (Gauss–Seidel/Jacobi) costs O(k_eval · |S| · d), where k_eval is the number of iterations to reach a given tolerance. Policy improvement costs O(|S| · |A| · d). The number of outer iterations is typically small in practice (often < 10), but can be large in the worst case.

Q-learning updates one (s,a) at a time using samples, costing O(1) per step to update Q(s,a) plus O(|A|) to compute a max for greedy behavior. Over E episodes with up to T steps each, runtime is O(E · T · |A|). Space is O(|S||A|) for tabular Q. Convergence to Q* requires sufficient exploration and a suitable learning-rate schedule. For large |S|, function approximation is used to reduce space/time at the cost of approximation error.

Code Examples

Value Iteration on a Small Tabular MDP
#include <iostream>
#include <vector>
#include <tuple>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

// Transition represented as (next_state, probability, reward)
struct Transition {
    int next;
    double prob;
    double reward;
};

// Value Iteration for finite MDP with sparse transitions R(s,a,s')
pair<vector<double>, vector<int>> value_iteration(
    const vector<vector<vector<Transition>>> &P, // P[s][a] -> list of transitions
    int S, int A, double gamma, double theta, int max_iters = 10000) {

    vector<double> V(S, 0.0);  // value function initialized to zero
    vector<int> policy(S, 0);  // greedy policy derived from V

    for (int iter = 0; iter < max_iters; ++iter) {
        double delta = 0.0;
        vector<double> Vnext = V; // keep old V for synchronous updates
        for (int s = 0; s < S; ++s) {
            double best = -numeric_limits<double>::infinity();
            int best_a = 0;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            Vnext[s] = best;
            policy[s] = best_a; // track current greedy action
            delta = max(delta, fabs(Vnext[s] - V[s]));
        }
        V.swap(Vnext);
        // Stop once the largest value change falls below the tolerance theta
        if (delta < theta) break;
    }
    return {V, policy};
}

int main() {
    // Example MDP with 3 states (0,1,2) and 2 actions (0,1)
    // Deterministic transitions for clarity; rewards on transitions
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));

    double gamma = 0.9;

    // State 0:
    // Action 0: go to state 1, reward +5
    P[0][0].push_back({1, 1.0, 5.0});
    // Action 1: stay in 0, reward +1
    P[0][1].push_back({0, 1.0, 1.0});

    // State 1:
    // Action 0: go to state 2, reward +2
    P[1][0].push_back({2, 1.0, 2.0});
    // Action 1: go to state 0, reward 0
    P[1][1].push_back({0, 1.0, 0.0});

    // State 2 (terminal-like): both actions loop with 0 reward
    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta = 1e-6; // absolute tolerance on value updates

    auto [V, policy] = value_iteration(P, S, A, gamma, theta);

    cout << fixed << setprecision(6);
    cout << "Optimal state values:\n";
    for (int s = 0; s < S; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    cout << "\nGreedy optimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) {
        cout << "pi[" << s << "] = " << policy[s] << "\n";
    }
    return 0;
}

This program encodes a small MDP with rewards on transitions and runs Value Iteration. For each state, it computes the best expected one-step return over actions using the current value function and updates V. It stops when value changes are below a tolerance, then reports the optimal values and the greedy policy.

Time: O(|S| · |A| · d · K), where d is the average number of nonzero successors per (s,a) and K the number of sweeps until convergence.
Space: O(|S| + |S||A|d) to store V and the sparse transitions.
Policy Iteration with Iterative Policy Evaluation
#include <iostream>
#include <vector>
#include <tuple>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

struct Transition { int next; double prob; double reward; };

// Evaluate a fixed (deterministic) policy via Gauss-Seidel updates
vector<double> policy_evaluation(
    const vector<int> &policy,
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta, int max_iters = 100000) {
    vector<double> V(S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        double delta = 0.0;
        for (int s = 0; s < S; ++s) {
            int a = policy[s];
            double v_old = V[s];
            double v_new = 0.0;
            for (const auto &t : P[s][a]) {
                v_new += t.prob * (t.reward + gamma * V[t.next]);
            }
            V[s] = v_new; // Gauss-Seidel: immediate write-back
            delta = max(delta, fabs(v_new - v_old));
        }
        if (delta < theta) break;
    }
    return V;
}

pair<vector<double>, vector<int>> policy_iteration(
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta_eval) {

    vector<int> policy(S, 0); // initial arbitrary policy
    vector<double> V(S, 0.0);

    while (true) {
        // 1) Policy evaluation
        V = policy_evaluation(policy, P, S, A, gamma, theta_eval);

        // 2) Policy improvement
        bool policy_stable = true;
        for (int s = 0; s < S; ++s) {
            int old_a = policy[s];
            double best = -numeric_limits<double>::infinity();
            int best_a = old_a;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            policy[s] = best_a;
            if (best_a != old_a) policy_stable = false;
        }
        if (policy_stable) break; // converged to optimal policy
    }
    return {V, policy};
}

int main() {
    // Reuse the same 3-state, 2-action MDP as in the Value Iteration example
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));
    double gamma = 0.9;

    P[0][0].push_back({1, 1.0, 5.0});
    P[0][1].push_back({0, 1.0, 1.0});

    P[1][0].push_back({2, 1.0, 2.0});
    P[1][1].push_back({0, 1.0, 0.0});

    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta_eval = 1e-8; // tolerance for policy evaluation

    auto [V, policy] = policy_iteration(P, S, A, gamma, theta_eval);

    cout << fixed << setprecision(6);
    cout << "Optimal values (Policy Iteration):\n";
    for (int s = 0; s < S; ++s) cout << "V[" << s << "] = " << V[s] << "\n";

    cout << "\nOptimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) cout << "pi[" << s << "] = " << policy[s] << "\n";

    return 0;
}

This code alternates between evaluating the current deterministic policy using Gauss–Seidel iterative updates and improving the policy greedily with respect to the evaluated values. It terminates when the policy no longer changes, yielding an optimal policy for the given MDP.

Time: O(I_outer · (k_eval · |S| · d + |S| · |A| · d)), where I_outer is the number of policy improvements and k_eval the iterations per evaluation.
Space: O(|S| + |S||A|d) for values, policy, and sparse transitions.
Tabular Q-learning on a 4x4 Gridworld
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <algorithm>
using namespace std;

struct StepResult { int next_s; double reward; bool done; };

struct Gridworld {
    int W, H;
    int goal_r, goal_c;
    double step_reward;

    Gridworld(int W_, int H_, int gr, int gc, double step_r)
        : W(W_), H(H_), goal_r(gr), goal_c(gc), step_reward(step_r) {}

    int to_state(int r, int c) const { return r * W + c; }
    pair<int,int> from_state(int s) const { return {s / W, s % W}; }

    // Actions: 0=up,1=right,2=down,3=left
    StepResult step(int s, int a) const {
        auto [r, c] = from_state(s);
        int nr = r, nc = c;
        if (a == 0) nr = max(0, r - 1);
        if (a == 1) nc = min(W - 1, c + 1);
        if (a == 2) nr = min(H - 1, r + 1);
        if (a == 3) nc = max(0, c - 1);
        int ns = to_state(nr, nc);
        bool done = (nr == goal_r && nc == goal_c);
        double rwd = done ? 0.0 : step_reward; // 0 at goal, negative elsewhere
        return {ns, rwd, done};
    }
};

int main() {
    const int W = 4, H = 4;
    Gridworld env(W, H, H-1, W-1, -1.0); // goal at bottom-right
    const int S = W * H, A = 4;

    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    // Hyperparameters
    double gamma = 0.95;
    double alpha = 0.5;  // learning rate (can decay)
    double eps = 0.2;    // epsilon-greedy exploration (can decay)
    int episodes = 2000;
    int max_steps = 200;

    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);
    uniform_int_distribution<int> act_dist(0, A-1);

    auto greedy_action = [&](int s) {
        int best_a = 0; double best_q = Q[s][0];
        for (int a = 1; a < A; ++a) if (Q[s][a] > best_q) { best_q = Q[s][a]; best_a = a; }
        return best_a;
    };

    for (int ep = 0; ep < episodes; ++ep) {
        // start at (0,0)
        int s = 0;
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            // epsilon-greedy action selection
            int a = (uni(rng) < eps) ? act_dist(rng) : greedy_action(s);
            auto sr = env.step(s, a);
            // Q-learning update: Q(s,a) <- Q(s,a) + alpha [ r + gamma max_{a'} Q(s',a') - Q(s,a) ]
            double max_next = *max_element(Q[sr.next_s].begin(), Q[sr.next_s].end());
            double target = sr.reward + (sr.done ? 0.0 : gamma * max_next);
            Q[s][a] += alpha * (target - Q[s][a]);

            s = sr.next_s;
            done = sr.done;
        }
        // optional decay
        eps = max(0.01, eps * 0.995);
        alpha = max(0.05, alpha * 0.999);
    }

    // Derive greedy policy from learned Q
    vector<int> policy(S, 0);
    for (int s = 0; s < S; ++s) {
        policy[s] = int(max_element(Q[s].begin(), Q[s].end()) - Q[s].begin());
    }

    cout << fixed << setprecision(2);
    cout << "Greedy policy (0=U,1=R,2=D,3=L):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            int s = env.to_state(r, c);
            cout << policy[s] << ' ';
        }
        cout << '\n';
    }

    cout << "\nState values V(s)=max_a Q(s,a):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            int s = env.to_state(r, c);
            double v = *max_element(Q[s].begin(), Q[s].end());
            cout << setw(6) << v << ' ';
        }
        cout << '\n';
    }
    return 0;
}

This program learns a shortest-path-like policy in a 4x4 grid using Q-learning. It repeatedly runs episodes from the start, selecting actions with an \(\epsilon\)-greedy rule, and updates Q-values toward the target r + \(\gamma\) max Q at the next state. After training, it prints the greedy policy and the induced state values.

Time: O(episodes · max_steps · |A|) due to per-step action selection and the max over actions.
Space: O(|S||A|) to store the Q-values.
Tags: markov decision process, value iteration, policy iteration, bellman equation, q-learning, discount factor, reinforcement learning, dynamic programming, state value, action value, optimal policy, gridworld, contraction mapping, policy evaluation, epsilon-greedy