How I Study AI - Learn AI Papers & Lectures the Easy Way

Markov Decision Processes (MDP)

Key Points

  • A Markov Decision Process (MDP) models decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
  • An MDP is defined by states S, actions A, transition probabilities P, rewards R, and a discount factor \(γ\).
  • The goal is to find a policy that maximizes expected discounted return, usually via Bellman equations.
  • Value Iteration and Policy Iteration are fundamental dynamic programming algorithms for solving MDPs when P and R are known.
  • Q-learning learns optimal behavior from experience without knowing the transition model, making it model-free.
  • Discounting with \(0 ≤ γ < 1\) ensures that future rewards are worth less, stabilizing solutions and guaranteeing convergence.
  • Contraction properties of the Bellman operator guarantee convergence of value iteration to the optimal value function.
  • C++ implementations typically represent P as sparse adjacency lists over (next state, probability, reward) triples to save memory.

Prerequisites

  • Basic Probability — Transition dynamics use conditional probabilities and expectations.
  • Linear Algebra — Policy evaluation can be written and solved as linear systems.
  • Dynamic Programming — Bellman recursions are dynamic programming equations over states.
  • Algorithmic Complexity — Understanding runtime and memory of iterative solvers guides practical choices.
  • C++ Vectors and Iteration — Implementing tabular MDP algorithms requires manipulating arrays and loops.
  • Optimization Basics — Policies are obtained by maximizing expected returns over actions.
  • Stochastic Processes — MDPs are discrete-time stochastic control models with Markov transitions.

Detailed Explanation


01 Overview

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision making under uncertainty. It captures the idea that at each time step, the system is in some state, an agent chooses an action, the system moves to a new state according to stochastic transition rules, and a numerical reward is provided. The Markov property says the next state and reward depend only on the current state and action, not on the full past. An MDP is specified by a 5-tuple \((S, A, P, R, γ)\), where S is the set of states, A is the set of actions, P encodes transition probabilities, R encodes rewards, and \(γ\) is a discount factor in [0,1). The central objective is to find a policy—a mapping from states to actions (possibly probabilistic)—that maximizes the expected sum of discounted rewards over time. This problem underlies many applications: robot navigation, inventory management, game playing, and medical treatment planning. Solutions hinge on the Bellman equations, which relate the value of a state to the immediate reward plus the expected value of future states. Algorithms like Value Iteration and Policy Iteration exploit these equations to compute optimal policies when the model is known, while reinforcement learning methods like Q-learning learn from sampled interactions when it is not.

02 Intuition & Analogies

Imagine you are navigating a city with uncertain traffic. Each intersection is a state, and choosing a road to take is an action. Even if you pick the same road twice, travel time varies due to random traffic conditions—this randomness is captured by transition probabilities. After each move, you get a reward: perhaps negative minutes (so smaller is better) or a toll fee. You care about getting to your destination quickly while avoiding tolls, which creates a trade-off between immediate and future outcomes. The discount factor \(γ\) is like valuing a minute saved now slightly more than a minute saved later—it prevents you from overcommitting to distant gains at the expense of what you can gain soon, and it mathematically ensures everything sums to a finite number. A policy is your driving strategy: at each intersection, which road do you take? The value of an intersection under a given strategy is the expected total minutes saved (or cost incurred) if you keep following that strategy. The Bellman equation is the common-sense idea that the worth of a place equals what you get immediately when you take an action plus the expected worth of where you’ll end up next. Value Iteration repeatedly updates your estimates of how good each intersection is, improving your strategy as you learn which roads reliably lead to better neighborhoods. Policy Iteration alternates between asking, "If I keep using this route plan, how good is it?" and "Given those values, can I pick even better roads?" If you don’t know the traffic model, Q-learning learns from experience by trying routes, noting outcomes, and gradually favoring choices that work out well.

03 Formal Definition

An MDP is a tuple \(M = (S, A, P, R, γ)\) where: (1) \(S\) is a (finite or countable) set of states; (2) \(A\) is a set of actions (possibly state-dependent \(A(s)\)); (3) \(P(s' \mid s, a)\) is a transition kernel satisfying \(\sum_{s' \in S} P(s' \mid s, a) = 1\) for all \(s, a\); (4) \(R(s, a, s')\) is the (possibly stochastic) immediate reward when transitioning from \(s\) to \(s'\) via \(a\); (5) \(γ \in [0,1)\) is a discount factor. A (stationary) policy \(\pi\) can be deterministic (\(\pi: S \to A\)) or stochastic (\(\pi(a \mid s)\) a distribution over actions). The return from time \(t\) is \(G_t = \sum_{k=0}^{\infty} γ^k R_{t+k+1}\). The state-value function under policy \(\pi\) is \(V^\pi(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]\) and the action-value function is \(Q^\pi(s,a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]\). These satisfy the Bellman expectation equations. The optimal value functions \(V^*(s) = \sup_\pi V^\pi(s)\) and \(Q^*(s,a) = \sup_\pi Q^\pi(s,a)\) satisfy the Bellman optimality equations, and an optimal policy can be chosen greedily with respect to \(V^*\) or \(Q^*\). The Bellman optimality operator \(T\), defined by \((TV)(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V(s')]\), is a \(γ\)-contraction in the \(\ell_\infty\)-norm, guaranteeing convergence of Value Iteration.

04 When to Use

Use MDPs when modeling sequential decisions with uncertainty and the Markov property holds—future depends only on the current state and action. If you know the transition model and rewards (model-based), planning methods like Value Iteration or Policy Iteration are appropriate: for example, deterministic scheduling with random machine failures, inventory control with stochastic demand, or robot path planning with slip probabilities. When the model is unknown but you can interact with the environment (sample transitions), model-free reinforcement learning like Q-learning or SARSA is suitable: for instance, online ad selection with delayed stochastic clicks, game playing with unknown opponent tendencies, or autonomous control learned through simulation. If long-term consequences matter but you must balance them with immediate payoffs, discounting via \(γ\) helps enforce stability and tractability. Choose finite-horizon dynamic programming for problems with a fixed end time (e.g., planning over the next 12 months), and infinite-horizon discounted MDPs for continuing tasks (e.g., never-ending maintenance scheduling). For very large state spaces, consider approximate methods and function approximation (e.g., value function approximation, Monte Carlo tree search) rather than exact tabular solutions.

⚠️Common Mistakes

• Ignoring the Markov property: Including irrelevant history in the state can bloat the state space; omitting necessary information breaks the Markov assumption. Define states so that the future is conditionally independent of the past, given the present state.
• Mixing reward definitions: Some texts use \(R(s,a)\) while others use \(R(s,a,s')\). Be consistent in code—if you store rewards on transitions, ensure you sum over next states when computing expectations.
• Using \(γ = 1\) without care: Without terminal states or bounded returns, convergence may fail. Prefer \(γ < 1\) or prove boundedness/termination.
• Incorrect stopping criteria: In Value Iteration, stopping when the maximum change \(\delta\) is below \(\epsilon \frac{1-γ}{2γ}\) ensures an \(\epsilon\)-optimal policy. Stopping too early yields suboptimal policies; too late wastes computation.
• Not normalizing transition probabilities: Each \(\sum_{s'} P(s' \mid s,a)\) must equal 1. Small floating-point errors can accumulate; renormalize or validate inputs.
• Ties in argmax: Greedy policy extraction with ties can oscillate. Break ties consistently (e.g., smallest index) for determinism.
• Overfitting to sparse rewards: In Q-learning, without sufficient exploration (e.g., \(\epsilon\)-greedy), the agent may never discover rewarding states. Schedule exploration decay thoughtfully.
• Ignoring sparsity: Storing full \(|S| \times |A| \times |S|\) arrays is wasteful when transitions are sparse. Use adjacency lists to cut memory and runtime.

Key Formulas

Discounted Return

\(G_t = \sum_{k=0}^{\infty} γ^k R_{t+k+1}\)

Explanation: The total value from time t is the sum of future rewards, each discounted by \(γ\) per step. This ensures distant rewards contribute less and the sum is finite when \(γ<1\).

Bellman Expectation (State Value)

\(V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ V^\pi(s')]\)

Explanation: The value of a state under policy \(π\) equals the expected immediate reward plus discounted value of next states. Expectation is taken over the policy’s action probabilities and the transition model.

Bellman Expectation (Action Value)

\(Q^\pi(s,a) = \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ \sum_{a' \in A} \pi(a' \mid s') Q^\pi(s', a')]\)

Explanation: The value of taking action a in state s, then following \(π\), includes the immediate reward and the expected value of subsequent actions under \(π\).

Bellman Optimality (State Value)

\(V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ V^*(s')]\)

Explanation: The optimal value of a state equals the best expected one-step return achievable by choosing the best action now and behaving optimally thereafter.

Bellman Optimality (Action Value)

\(Q^*(s,a) = \sum_{s' \in S} P(s' \mid s, a)\,[R(s,a,s') + γ \max_{a'} Q^*(s', a')]\)

Explanation: The optimal action value is immediate reward plus discounted optimal value of next states. It underpins Q-learning and greedy policies.

Bellman Optimality Operator

\((TV)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V(s')]\)

Explanation: Applying T to a value function performs a one-step optimal lookahead. Repeated application converges to \(V∗\) because T is a contraction.

Contraction Property

\(\|TV - TU\|_\infty \le γ \|V - U\|_\infty\)

Explanation: The Bellman optimality operator shrinks the maximum absolute difference between two value functions to at most \(γ\) times its previous size. This guarantees uniqueness of the fixed point and convergence of Value Iteration.

Value Iteration Update

\(V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V_k(s')]\)

Explanation: This recurrence updates value estimates using one-step optimal lookahead from the previous estimates. Iterating this for all states yields better approximations to \(V∗\).

Probability Normalization

\(\sum_{s' \in S} P(s' \mid s, a) = 1 \quad \text{for all } s, a\)

Explanation: Transition probabilities from a given state-action pair must sum to one. This encodes that some next state will occur with certainty.

Policy Improvement

\(\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + γ V^\pi(s')]\)

Explanation: Given a value function for a policy \(π\), choosing actions greedily with respect to it yields a policy \(π'\) that is at least as good. Repeating evaluation and improvement converges to an optimal policy.

Stopping Criterion for Value Iteration

\(\|V_{k+1} - V_k\|_\infty < \epsilon \frac{1-γ}{2γ} \;\Rightarrow\; \text{greedy}(V_{k+1}) \text{ is } \epsilon\text{-optimal}\)

Explanation: If successive value estimates differ by less than a scaled \(ϵ\), the greedy policy is within \(ϵ\) of optimal. This provides a principled termination rule.

Complexity Analysis

Let |S| be the number of states, |A| the number of actions, and let each (s,a) have at most d nonzero next-state probabilities (sparsity).

Value Iteration performs, per sweep, for each state an action maximization over expectations that sum over at most d successors. Thus one sweep costs O(|S| · |A| · d). The number of sweeps to achieve an \(ϵ\)-optimal policy depends on the contraction factor \(γ\) and the reward scale; a common bound is O(log(1/ϵ) / (1-γ)). Space is O(|S| + |S||A|d) if P and R are stored sparsely, plus O(|S|) for V and O(|S|) for a deterministic policy.

Policy Iteration alternates: (1) policy evaluation to compute \(V^π\), and (2) policy improvement. Exact evaluation by solving \((I - γ P_π) V = R_π\) costs roughly O(|S|³) with dense linear algebra, which is prohibitive for large |S|. In practice, iterative policy evaluation (Gauss–Seidel/Jacobi) costs O(k_eval · |S| · d), where k_eval is the number of iterations to reach a given tolerance. Policy improvement costs O(|S| · |A| · d). The number of outer iterations is typically small in practice (often < 10), but can be large in the worst case.

Q-learning updates one (s,a) at a time using samples, costing O(1) per step to update Q(s,a) plus O(|A|) to compute a max for greedy behavior. Over E episodes with up to T steps each, runtime is O(E · T · |A|). Space is O(|S||A|) for tabular Q. Convergence to Q* requires sufficient exploration and a suitable learning-rate schedule. For large |S|, function approximation is used to reduce space/time at the cost of approximation error.

Code Examples

Value Iteration on a Small Tabular MDP
#include <iostream>
#include <vector>
#include <tuple>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

// Transition represented as (next_state, probability, reward)
struct Transition {
    int next;
    double prob;
    double reward;
};

// Value Iteration for finite MDP with sparse transitions R(s,a,s')
pair<vector<double>, vector<int>> value_iteration(
    const vector<vector<vector<Transition>>> &P, // P[s][a] -> list of transitions
    int S, int A, double gamma, double theta, int max_iters = 10000) {

    vector<double> V(S, 0.0);  // value function initialized to zero
    vector<int> policy(S, 0);  // greedy policy derived from V

    for (int iter = 0; iter < max_iters; ++iter) {
        double delta = 0.0;
        vector<double> Vnext = V; // keep old V for synchronous updates
        for (int s = 0; s < S; ++s) {
            double best = -numeric_limits<double>::infinity();
            int best_a = 0;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            Vnext[s] = best;
            policy[s] = best_a; // track current greedy action
            delta = max(delta, fabs(Vnext[s] - V[s]));
        }
        V.swap(Vnext);
        // Stop once the largest value change falls below the tolerance theta
        if (delta < theta) break;
    }
    return {V, policy};
}

int main() {
    // Example MDP with 3 states (0,1,2) and 2 actions (0,1)
    // Deterministic transitions for clarity; rewards on transitions
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));

    double gamma = 0.9;

    // State 0:
    // Action 0: go to state 1, reward +5
    P[0][0].push_back({1, 1.0, 5.0});
    // Action 1: stay in 0, reward +1
    P[0][1].push_back({0, 1.0, 1.0});

    // State 1:
    // Action 0: go to state 2, reward +2
    P[1][0].push_back({2, 1.0, 2.0});
    // Action 1: go to state 0, reward 0
    P[1][1].push_back({0, 1.0, 0.0});

    // State 2 (terminal-like): both actions loop with 0 reward
    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta = 1e-6; // absolute tolerance on value updates

    auto [V, policy] = value_iteration(P, S, A, gamma, theta);

    cout << fixed << setprecision(6);
    cout << "Optimal state values:\n";
    for (int s = 0; s < S; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    cout << "\nGreedy optimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) {
        cout << "pi[" << s << "] = " << policy[s] << "\n";
    }
    return 0;
}

This program encodes a small MDP with rewards on transitions and runs Value Iteration. For each state, it computes the best expected one-step return over actions using the current value function and updates V. It stops when value changes are below a tolerance, then reports the optimal values and the greedy policy.

Time: O(|S| · |A| · d · K), where d is the average number of nonzero successors per (s,a) and K the number of sweeps until convergence.
Space: O(|S| + |S||A|d) to store V and the sparse transitions.
Policy Iteration with Iterative Policy Evaluation
#include <iostream>
#include <vector>
#include <tuple>
#include <limits>
#include <iomanip>
#include <cmath>
using namespace std;

struct Transition { int next; double prob; double reward; };

// Evaluate a fixed (deterministic) policy via Gauss-Seidel updates
vector<double> policy_evaluation(
    const vector<int> &policy,
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta, int max_iters = 100000) {
    vector<double> V(S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        double delta = 0.0;
        for (int s = 0; s < S; ++s) {
            int a = policy[s];
            double v_old = V[s];
            double v_new = 0.0;
            for (const auto &t : P[s][a]) {
                v_new += t.prob * (t.reward + gamma * V[t.next]);
            }
            V[s] = v_new; // Gauss-Seidel: immediate write-back
            delta = max(delta, fabs(v_new - v_old));
        }
        if (delta < theta) break;
    }
    return V;
}

pair<vector<double>, vector<int>> policy_iteration(
    const vector<vector<vector<Transition>>> &P,
    int S, int A, double gamma, double theta_eval) {

    vector<int> policy(S, 0); // initial arbitrary policy
    vector<double> V(S, 0.0);

    while (true) {
        // 1) Policy evaluation
        V = policy_evaluation(policy, P, S, A, gamma, theta_eval);

        // 2) Policy improvement
        bool policy_stable = true;
        for (int s = 0; s < S; ++s) {
            int old_a = policy[s];
            double best = -numeric_limits<double>::infinity();
            int best_a = old_a;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : P[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                if (q > best) {
                    best = q;
                    best_a = a;
                }
            }
            policy[s] = best_a;
            if (best_a != old_a) policy_stable = false;
        }
        if (policy_stable) break; // converged to optimal policy
    }
    return {V, policy};
}

int main() {
    // Reuse the same 3-state, 2-action MDP as in the Value Iteration example
    int S = 3, A = 2;
    vector<vector<vector<Transition>>> P(S, vector<vector<Transition>>(A));
    double gamma = 0.9;

    P[0][0].push_back({1, 1.0, 5.0});
    P[0][1].push_back({0, 1.0, 1.0});

    P[1][0].push_back({2, 1.0, 2.0});
    P[1][1].push_back({0, 1.0, 0.0});

    P[2][0].push_back({2, 1.0, 0.0});
    P[2][1].push_back({2, 1.0, 0.0});

    double theta_eval = 1e-8; // tolerance for policy evaluation

    auto [V, policy] = policy_iteration(P, S, A, gamma, theta_eval);

    cout << fixed << setprecision(6);
    cout << "Optimal values (Policy Iteration):\n";
    for (int s = 0; s < S; ++s) cout << "V[" << s << "] = " << V[s] << "\n";

    cout << "\nOptimal policy (action indices):\n";
    for (int s = 0; s < S; ++s) cout << "pi[" << s << "] = " << policy[s] << "\n";

    return 0;
}

This code alternates between evaluating the current deterministic policy using Gauss–Seidel iterative updates and improving the policy greedily with respect to the evaluated values. It terminates when the policy no longer changes, yielding an optimal policy for the given MDP.

Time: O(I_outer · (k_eval · |S| · d + |S| · |A| · d)), where I_outer is the number of policy improvements and k_eval the iterations per evaluation.
Space: O(|S| + |S||A|d) for values, policy, and sparse transitions.
Tabular Q-learning on a 4x4 Gridworld
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <algorithm>
using namespace std;

struct StepResult { int next_s; double reward; bool done; };

struct Gridworld {
    int W, H;
    int goal_r, goal_c;
    double step_reward;

    Gridworld(int W_, int H_, int gr, int gc, double step_r)
        : W(W_), H(H_), goal_r(gr), goal_c(gc), step_reward(step_r) {}

    int to_state(int r, int c) const { return r * W + c; }
    pair<int,int> from_state(int s) const { return {s / W, s % W}; }

    // Actions: 0=up,1=right,2=down,3=left
    StepResult step(int s, int a) const {
        auto [r, c] = from_state(s);
        int nr = r, nc = c;
        if (a == 0) nr = max(0, r - 1);
        if (a == 1) nc = min(W - 1, c + 1);
        if (a == 2) nr = min(H - 1, r + 1);
        if (a == 3) nc = max(0, c - 1);
        int ns = to_state(nr, nc);
        bool done = (nr == goal_r && nc == goal_c);
        double rwd = done ? 0.0 : step_reward; // 0 at goal, negative elsewhere
        return {ns, rwd, done};
    }
};

int main() {
    const int W = 4, H = 4;
    Gridworld env(W, H, H-1, W-1, -1.0); // goal at bottom-right
    const int S = W * H, A = 4;

    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    // Hyperparameters
    double gamma = 0.95;
    double alpha = 0.5;  // learning rate (can decay)
    double eps = 0.2;    // epsilon-greedy exploration (can decay)
    int episodes = 2000;
    int max_steps = 200;

    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);
    uniform_int_distribution<int> act_dist(0, A-1);

    auto greedy_action = [&](int s) {
        int best_a = 0; double best_q = Q[s][0];
        for (int a = 1; a < A; ++a) if (Q[s][a] > best_q) { best_q = Q[s][a]; best_a = a; }
        return best_a;
    };

    for (int ep = 0; ep < episodes; ++ep) {
        // start at (0,0)
        int s = 0;
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            // epsilon-greedy action selection
            int a = (uni(rng) < eps) ? act_dist(rng) : greedy_action(s);
            auto sr = env.step(s, a);
            // Q-learning update: Q(s,a) <- Q(s,a) + alpha [ r + gamma max_{a'} Q(s',a') - Q(s,a) ]
            double max_next = *max_element(Q[sr.next_s].begin(), Q[sr.next_s].end());
            double target = sr.reward + (sr.done ? 0.0 : gamma * max_next);
            Q[s][a] += alpha * (target - Q[s][a]);

            s = sr.next_s;
            done = sr.done;
        }
        // optional decay
        eps = max(0.01, eps * 0.995);
        alpha = max(0.05, alpha * 0.999);
    }

    // Derive greedy policy from learned Q
    vector<int> policy(S, 0);
    for (int s = 0; s < S; ++s) {
        policy[s] = int(max_element(Q[s].begin(), Q[s].end()) - Q[s].begin());
    }

    cout << fixed << setprecision(2);
    cout << "Greedy policy (0=U,1=R,2=D,3=L):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            int s = env.to_state(r, c);
            cout << policy[s] << ' ';
        }
        cout << '\n';
    }

    cout << "\nState values V(s)=max_a Q(s,a):\n";
    for (int r = 0; r < H; ++r) {
        for (int c = 0; c < W; ++c) {
            int s = env.to_state(r, c);
            double v = *max_element(Q[s].begin(), Q[s].end());
            cout << setw(6) << v << ' ';
        }
        cout << '\n';
    }
    return 0;
}

This program learns a shortest-path-like policy in a 4x4 grid using Q-learning. It repeatedly runs episodes from the start, selecting actions with an \(\epsilon\)-greedy rule, and updates Q-values toward the target r + \(\gamma\) max Q at the next state. After training, it prints the greedy policy and the induced state values.

Time: O(episodes · max_steps · |A|) due to per-step action selection and the max over actions.
Space: O(|S||A|) to store the Q-values.
Tags: markov decision process, value iteration, policy iteration, bellman equation, q-learning, discount factor, reinforcement learning, dynamic programming, state value, action value, optimal policy, gridworld, contraction mapping, policy evaluation, epsilon-greedy