📚 Theory · Intermediate

Bellman Equations

Key Points

  • Bellman equations express how the value of a state or action equals immediate reward plus discounted value of what follows.
  • They turn long-term planning into local, recursive updates that can be solved by fixed-point iteration.
  • For a fixed policy, the Bellman operator is a γ-contraction, guaranteeing a unique value function and convergence by iteration.
  • Bellman optimality equations introduce a max over actions to define the best possible long-term value.
  • Algorithms like policy evaluation, value iteration, policy iteration, and Q-learning all aim to satisfy Bellman equations.
  • In discrete MDPs, tabular methods compute values exactly; with large/continuous spaces, function approximation or HJB PDEs are used.
  • Convergence speed and stability depend on the discount factor, the transition structure, and learning rates in RL.
  • Getting terminal states, probability normalization, and the difference between expectation (policy evaluation) and max (optimality) correct is crucial.

Prerequisites

  • Probability and Expectation: Bellman equations average over actions and next states; understanding expectation is essential.
  • Markov Chains: MDP dynamics reduce to Markov transitions when the policy is fixed.
  • Markov Decision Processes (MDPs): Bellman equations are defined on top of the MDP framework of states, actions, transitions, and rewards.
  • Dynamic Programming: Bellman equations embody the optimal substructure and recursive decomposition used in DP.
  • Matrix and Linear Algebra: Policy evaluation can be written and solved as a linear system (I − γP^π)V = R^π.
  • Norms and Contraction Mappings: Convergence guarantees rely on the γ-contraction property under the sup-norm.
  • Stochastic Approximation: Q-learning convergence analysis uses stochastic approximation theory and learning-rate conditions.
  • Basic C++ (Vectors, Loops, Functions): Implementing tabular value iteration or Q-learning requires fundamental C++ skills.

Detailed Explanation


01 Overview

Bellman equations are the backbone of dynamic programming and value-based reinforcement learning. They state that the value of a decision now equals the immediate payoff plus the discounted value of the future, assuming some way of choosing actions (a policy) or choosing optimally. This recursive relationship breaks a long, uncertain future into smaller, local computations. In a Markov Decision Process (MDP), the environment is summarized by states, actions, rewards, and transition probabilities. The state-value function V measures how good it is to be in a state under a policy, while the action-value function Q measures how good it is to take a specific action in a state under a policy. The Bellman expectation equations characterize V and Q for a given policy; the Bellman optimality equations characterize the best achievable values V* and Q*. Because Bellman operators are contractions under the infinity norm when 0 ≤ γ < 1, iterative application converges to a unique fixed point. This leads directly to practical algorithms such as policy evaluation, value iteration, and Q-learning, which iteratively improve estimates until they satisfy the equations closely. In continuous-time control, the analogous concept is the Hamilton–Jacobi–Bellman (HJB) partial differential equation, which connects optimal control with calculus of variations. Bellman equations unify planning (when a model is known) and learning (when it is not), providing both theoretical guarantees and computational recipes for decision making under uncertainty.
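
To make the fixed-point idea concrete, here is a minimal sketch (an illustrative toy, not one of the worked examples below): for a single state that earns reward r every step, repeatedly applying the backup V ← r + γV converges to the closed-form value r / (1 − γ).

#include <iostream>
#include <cmath>

// Minimal illustration: for a single state with reward r each step and
// discount gamma, the Bellman update V <- r + gamma * V is a contraction
// whose unique fixed point is r / (1 - gamma).
int main() {
    double r = 1.0, gamma = 0.9;
    double V = 0.0;                        // arbitrary initial guess
    for (int k = 0; k < 1000; ++k) {
        double V_next = r + gamma * V;     // one Bellman backup
        if (std::fabs(V_next - V) < 1e-12) { V = V_next; break; }
        V = V_next;
    }
    std::cout << "Iterated value: " << V << "\n";               // approx 10
    std::cout << "Closed form:    " << r / (1.0 - gamma) << "\n";
    return 0;
}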

02 Intuition & Analogies

Imagine planning a road trip. You care about the enjoyment now (scenery, snacks) and what follows (next stops), but you discount far-future enjoyment a bit because it’s uncertain or you’re impatient. The value of your current location equals what you get now plus a fraction of what you expect later. That is the Bellman idea: break a big trip (long horizon) into small hops (immediate step + future value). Now, suppose you follow a fixed travel style (your policy): maybe you prefer scenic roads 70% of the time and fastest roads 30%. The expected future enjoyment depends on that style, so your value is the average over choices you’ll likely make, weighted by their probabilities. That’s the Bellman expectation equation. If instead you’re optimizing, you’ll always pick the best next road; then the future value at each step is the maximum over choices, leading to the Bellman optimality equation. The magic is that you can compute long-term value by repeatedly updating each location from nearby locations, like smoothing with local rules, until it stops changing. Because each step shrinks the error by γ (the discount), this local update steadily converges. In the real world, you might not know exactly how traffic behaves (the transition model), so you learn from experience: you try a route, see where you end up, and update your estimate of how good that action was. That’s Q-learning—using samples to push your action-value toward the Bellman optimality target. Whether you have a full map (planning) or learn by driving (learning), the principle is the same: today’s value equals today’s reward plus a discounted view of tomorrow.

03 Formal Definition

Consider a finite MDP with state space S, action space A, transition probabilities P(s'|s,a), expected rewards R(s,a), and discount factor γ ∈ [0,1). For a stationary policy π, the state-value function V^π and action-value function Q^π are defined by the Bellman expectation equations: V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ] and Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a'). The Bellman optimality equations define the optimal value functions: V*(s) = max_a Q*(s,a), and Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a'). Define the Bellman operators T^π and T: (T^π V)(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ] and (T V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]. For γ < 1, T^π is a γ-contraction under the sup-norm, implying a unique fixed point V^π satisfying V^π = T^π V^π; similarly, V* is the unique fixed point of T. Iterative updates V ← T^π V (policy evaluation) or V ← T V (value iteration) converge to their fixed points. In continuous-time stochastic control, with dynamics dx = f(x,a) dt + σ(x,a) dW and running reward r(x,a), the optimal value solves the Hamilton–Jacobi–Bellman PDE: 0 = max_a { r(x,a) + ∇V(x)^T f(x,a) + ½ tr(σ(x,a) σ(x,a)^T ∇²V(x)) }.
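
As a quick numerical sanity check of the contraction claim, the following sketch (with a hypothetical two-state, two-action MDP whose transition and reward numbers are made up for illustration) applies the optimality operator T to two arbitrary value vectors and confirms that the sup-norm distance shrinks by at least a factor of γ.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Numerical check of the gamma-contraction of the Bellman optimality operator T
// on a hypothetical 2-state, 2-action MDP (P[s][a][s'], R[s][a]).
int main() {
    const int S = 2, A = 2;
    double gamma = 0.9;
    double P[S][A][S] = {{{0.8, 0.2}, {0.1, 0.9}},
                         {{0.5, 0.5}, {1.0, 0.0}}};
    double R[S][A] = {{1.0, 0.0}, {2.0, 0.5}};

    // (T V)(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
    auto T = [&](const std::vector<double>& V) {
        std::vector<double> out(S);
        for (int s = 0; s < S; ++s) {
            double best = -1e100;
            for (int a = 0; a < A; ++a) {
                double exp_next = 0.0;
                for (int sp = 0; sp < S; ++sp) exp_next += P[s][a][sp] * V[sp];
                best = std::max(best, R[s][a] + gamma * exp_next);
            }
            out[s] = best;
        }
        return out;
    };

    std::vector<double> U = {5.0, -3.0}, V = {0.0, 1.0};
    auto TU = T(U), TV = T(V);
    double dUV = 0.0, dTUTV = 0.0;
    for (int s = 0; s < S; ++s) {
        dUV   = std::max(dUV,   std::fabs(U[s] - V[s]));
        dTUTV = std::max(dTUTV, std::fabs(TU[s] - TV[s]));
    }
    std::cout << "||TU - TV||_inf = " << dTUTV
              << "  <=  gamma * ||U - V||_inf = " << gamma * dUV << "\n";
    return 0;
}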

04 When to Use

Use Bellman equations when solving Markov Decision Processes or designing value-based reinforcement learning algorithms. If you know the model (P and R), you can compute exact or near-exact solutions via planning methods: policy evaluation (to assess a fixed policy), policy iteration (alternate evaluation and improvement), or value iteration (apply the optimality operator directly). These are ideal for small to medium tabular problems such as gridworlds, queueing systems, inventory control, and finite-horizon planning. When the model is unknown or too complex, use sample-based methods that still aim for Bellman consistency, such as temporal-difference learning and Q-learning; they work from experience trajectories collected by interacting with the environment. In large or continuous spaces, you approximate value functions with linear features or neural networks; while exact Bellman updates are impossible, you still minimize Bellman error (e.g., in DQN). If dynamics are continuous in time, use the Hamilton–Jacobi–Bellman framework to derive optimality conditions and numerically solve PDEs when feasible. Choose Bellman approaches when you need principled, convergent methods (γ < 1) with interpretable value functions and when optimal or near-optimal long-term performance matters rather than myopic rewards.

⚠️ Common Mistakes

Common pitfalls include:

  • (1) Mixing expectation and maximization: in policy evaluation you must average over π(a|s); using max there computes an improved policy instead.
  • (2) Mishandling terminal states: terminals should have zero continuation value and self-loop transitions with probability 1; forgetting this yields inflated values or non-convergence.
  • (3) Using γ = 1 without proper conditions: with infinite horizons, γ must be < 1 (or rewards must be bounded and the task episodic) to ensure contraction and convergence.
  • (4) Not normalizing transition probabilities P(s'|s,a) to sum to 1, which breaks the probabilistic interpretation and skews values (see the sketch after this list).
  • (5) Large or non-decaying learning rates α in Q-learning can prevent convergence; α_t should typically decrease or be small and stable.
  • (6) Off-policy instability with function approximation: naive Q-learning with powerful approximators can diverge; use target networks or double estimators.
  • (7) Stopping criteria based on small average changes rather than the max-norm can miss worst-case errors; use a sup-norm threshold.
  • (8) Confusing Q* and V*: V*(s) = max_a Q*(s,a), but during learning these may be inconsistent; derive policies greedily from Q, not partially from V.
  • (9) Ignoring reward scaling: very large rewards slow convergence and cause numerical issues; normalize rewards or adjust γ.
  • (10) Overlooking that computational cost scales with the number of nonzero transitions: dense models are far more expensive than sparse ones; use sparse representations and precompute adjacency.
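
Pitfall (4) is cheap to guard against in code. Below is a minimal sketch of such a normalization check, assuming the same Transition-list representation used in the Code Examples section; the helper name check_transition_normalization is illustrative, not from any library.

#include <iostream>
#include <vector>
#include <cmath>
#include <stdexcept>

struct Transition { int next; double prob; };

// Guard against pitfall (4): every P[s][a] distribution must sum to 1.
void check_transition_normalization(
        const std::vector<std::vector<std::vector<Transition>>>& P,
        double tol = 1e-9) {
    for (size_t s = 0; s < P.size(); ++s) {
        for (size_t a = 0; a < P[s].size(); ++a) {
            double sum = 0.0;
            for (const auto& tr : P[s][a]) sum += tr.prob;
            if (std::fabs(sum - 1.0) > tol)
                throw std::runtime_error("P[s][a] does not sum to 1");
        }
    }
}

int main() {
    std::vector<std::vector<std::vector<Transition>>> P(1);   // one state
    P[0].push_back({ {0, 0.5}, {0, 0.5} });                   // one action, two outcomes
    check_transition_normalization(P);                        // throws if any row is off
    std::cout << "All transition distributions are normalized.\n";
    return 0;
}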

Key Formulas

Bellman Expectation (State Value)
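V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]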

Explanation: The value under policy π equals the expected immediate reward plus discounted value of the next state. Average over actions using π and over next states using the transition model.

Bellman Expectation (Action Value)
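Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a')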

Explanation: The expected return after taking action a in s and then following π equals reward plus discounted next-step action-values averaged by π(a'|s').

Optimal Value from Optimal Q
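V*(s) = max_a Q*(s,a)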

Explanation: The optimal state value is achieved by taking the best possible action at that state. This ties optimal V* and Q* together.

Bellman Optimality (Action Value)
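Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a')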

Explanation: Optimal action-values satisfy a fixed point where targets use the maximum over next actions. This is the foundation for Q-learning.

Bellman Operator (Policy)
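(T^π V)(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]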

Explanation: Applies a single Bellman expectation backup to a value function. Repeatedly applying T^π converges to V^π.

Contraction Property
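‖T^π U − T^π V‖_∞ ≤ γ ‖U − V‖_∞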

Explanation: The Bellman expectation operator shrinks the sup-norm distance between two value functions by at least a factor of γ. This ensures a unique fixed point and convergence by iteration.

Linear System for Policy Evaluation
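(I − γ P^π) V^π = R^π, where P^π(s,s') = Σ_a π(a|s) P(s'|s,a) and R^π(s) = Σ_a π(a|s) R(s,a)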

Explanation: Exact policy evaluation can be written as a linear system using the state transition matrix under π and expected rewards. Solving this yields V^π in one step for small problems.

Value Iteration and Error Bound
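V_{k+1} = T V_k, with ‖V_k − V*‖_∞ ≤ (γ^k / (1 − γ)) ‖T V_0 − V_0‖_∞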

Explanation: Applying the optimality operator iteratively converges to V*. The bound quantifies how fast the error shrinks with k, depending on γ and the initial residual.

Q-learning Update
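Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]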

Explanation: A sample-based stochastic approximation to the Bellman optimality equation. With suitable conditions, Q-learning converges to Q* in the tabular setting.

Hamilton–Jacobi–Bellman (Continuous Time)
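0 = max_a { r(x,a) + ∇V(x)^T f(x,a) + ½ tr(σ(x,a) σ(x,a)^T ∇²V(x)) }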

Explanation: The continuous-time counterpart of Bellman optimality. It characterizes the optimal value via a nonlinear PDE involving drift, diffusion, and the reward.

Complexity Analysis

Let S = |S| be the number of states, A = |A| the number of actions, and E the number of nonzero transitions (sparse edges). One Bellman backup for a single state-action pair costs O(outdegree), so a full sweep over all states via policy evaluation or value iteration costs O(E). Iterative policy evaluation of a fixed policy performs K sweeps until the sup-norm change falls below a tolerance; the total time is O(K E). With high discount factors (γ close to 1), K typically increases because the contraction is weaker. Space is O(S) for V (or O(SA) if maintaining Q^π). Value iteration similarly costs O(K E) time and O(S) space, with K depending on γ and the desired accuracy ε; a standard bound is K = O( log(1/(ε(1−γ))) / (1−γ) ). Policy iteration alternates evaluation and improvement; exact evaluation by solving (I − γP^π)V^π = R^π via Gaussian elimination costs O(S^3) time and O(S^2) space, but iterative evaluation reduces this to multiple O(E) sweeps. Tabular Q-learning operates per step in O(A) time (for the max over actions) and O(1) space per update, with total time proportional to the number of environment steps. Its memory footprint is O(SA) to store Q. In practice, sparse transitions reduce E dramatically compared to the dense S·A·S bound, so using adjacency lists is critical. When transitioning to function approximation, per-update cost scales with the number of parameters, but the Bellman principle remains the computational core.

Code Examples

Policy Evaluation via Bellman Expectation Backups (Tabular MDP)
#include <iostream>
#include <vector>
#include <utility>
#include <cmath>
#include <iomanip>
#include <cassert>

struct Transition {
    int next;    // next state index
    double prob; // probability of transition
};

struct MDP {
    int S; // number of states
    int A; // number of actions
    std::vector<std::vector<std::vector<Transition>>> P; // P[s][a] = list of (s', p)
    std::vector<std::vector<double>> R;                  // R[s][a] = expected immediate reward for (s,a)
    std::vector<bool> terminal;                          // terminal states
};

// One sweep of Bellman expectation update for V under policy pi
void policy_evaluation_sweep(const MDP& mdp,
                             const std::vector<std::vector<double>>& pi,
                             double gamma,
                             const std::vector<double>& V_old,
                             std::vector<double>& V_new) {
    for (int s = 0; s < mdp.S; ++s) {
        if (mdp.terminal[s]) { V_new[s] = 0.0; continue; }
        double val = 0.0;
        for (int a = 0; a < mdp.A; ++a) {
            double expected_next = 0.0;
            for (const auto& tr : mdp.P[s][a]) {
                expected_next += tr.prob * V_old[tr.next];
            }
            val += pi[s][a] * (mdp.R[s][a] + gamma * expected_next);
        }
        V_new[s] = val;
    }
}

std::vector<double> policy_evaluation(const MDP& mdp,
                                      const std::vector<std::vector<double>>& pi,
                                      double gamma,
                                      int max_iters = 1000,
                                      double theta = 1e-8) {
    std::vector<double> V(mdp.S, 0.0), V_new(mdp.S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        policy_evaluation_sweep(mdp, pi, gamma, V, V_new);
        double delta = 0.0;
        for (int s = 0; s < mdp.S; ++s) delta = std::max(delta, std::fabs(V_new[s] - V[s]));
        V.swap(V_new);
        if (delta < theta) break; // sup-norm stopping criterion
    }
    return V;
}

int main() {
    // Build a small MDP with 4 states (state 3 is terminal) and 2 actions
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    // Define transitions and rewards
    // State 0
    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} };                   mdp.R[0][1] = 1.0;
    // State 1
    mdp.P[1][0] = { {3,1.0} };                   mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} };                   mdp.R[1][1] = 0.0;
    // State 2
    mdp.P[2][0] = { {3,1.0} };                   mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} };                   mdp.R[2][1] = 0.5;
    // State 3 (terminal)
    mdp.terminal[3] = true;
    mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} }; // self-loop
    mdp.R[3][0] = mdp.R[3][1] = 0.0;

    // Define a stochastic policy pi[s][a] = pi(a|s)
    std::vector<std::vector<double>> pi(mdp.S, std::vector<double>(mdp.A, 0.5));
    pi[0] = {0.0, 1.0}; // in s=0 choose action 1 deterministically
    pi[1] = {1.0, 0.0}; // in s=1 choose action 0 deterministically
    pi[2] = {0.0, 1.0}; // in s=2 choose action 1 deterministically
    pi[3] = {0.5, 0.5}; // terminal doesn't matter

    // Validate that each pi[s] sums to 1
    for (int s = 0; s < mdp.S; ++s) {
        double sum = pi[s][0] + pi[s][1];
        assert(std::fabs(sum - 1.0) < 1e-9);
    }

    double gamma = 0.9;
    auto V = policy_evaluation(mdp, pi, gamma, 10000, 1e-12);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "State values under the given policy (gamma=0.9):\n";
    for (int s = 0; s < mdp.S; ++s) {
        std::cout << "V[" << s << "] = " << V[s] << "\n";
    }

    return 0;
}

This program builds a small tabular MDP and evaluates a fixed policy by iteratively applying the Bellman expectation backup. Terminal states are handled as zero continuation value with self-loops. The sup-norm change (max absolute difference across states) is used for convergence.

Time: O(K E), where K is the number of sweeps until convergence and E is the number of nonzero transitions.
Space: O(S) for the value vector V, plus O(E) to store the model.

Value Iteration (Bellman Optimality) with Greedy Policy Extraction
#include <iostream>
#include <vector>
#include <utility>
#include <cmath>
#include <iomanip>

struct Transition { int next; double prob; };
struct MDP {
    int S, A;
    std::vector<std::vector<std::vector<Transition>>> P;
    std::vector<std::vector<double>> R;
    std::vector<bool> terminal;
};

std::vector<double> value_iteration(const MDP& mdp, double gamma, int max_iters, double theta) {
    std::vector<double> V(mdp.S, 0.0), V_new(mdp.S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        double delta = 0.0;
        for (int s = 0; s < mdp.S; ++s) {
            if (mdp.terminal[s]) { V_new[s] = 0.0; continue; }
            double best = -1e100;
            for (int a = 0; a < mdp.A; ++a) {
                double expected_next = 0.0;
                for (const auto& tr : mdp.P[s][a]) expected_next += tr.prob * V[tr.next];
                best = std::max(best, mdp.R[s][a] + gamma * expected_next);
            }
            V_new[s] = best;
            delta = std::max(delta, std::fabs(V_new[s] - V[s]));
        }
        V.swap(V_new);
        if (delta < theta) break;
    }
    return V;
}

std::vector<int> greedy_policy_from_V(const MDP& mdp, const std::vector<double>& V, double gamma) {
    std::vector<int> policy(mdp.S, 0);
    for (int s = 0; s < mdp.S; ++s) {
        if (mdp.terminal[s]) { policy[s] = 0; continue; }
        double best = -1e100; int besta = 0;
        for (int a = 0; a < mdp.A; ++a) {
            double expected_next = 0.0;
            for (const auto& tr : mdp.P[s][a]) expected_next += tr.prob * V[tr.next];
            double q = mdp.R[s][a] + gamma * expected_next;
            if (q > best) { best = q; besta = a; }
        }
        policy[s] = besta;
    }
    return policy;
}

int main() {
    // Same MDP as in the previous example
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} };                   mdp.R[0][1] = 1.0;
    mdp.P[1][0] = { {3,1.0} };                   mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} };                   mdp.R[1][1] = 0.0;
    mdp.P[2][0] = { {3,1.0} };                   mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} };                   mdp.R[2][1] = 0.5;
    mdp.terminal[3] = true;
    mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} }; mdp.R[3][0] = mdp.R[3][1] = 0.0;

    double gamma = 0.9;
    auto V = value_iteration(mdp, gamma, 10000, 1e-12);
    auto pi_greedy = greedy_policy_from_V(mdp, V, gamma);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Optimal state values (approx):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "V*[" << s << "] = " << V[s] << "\n";

    std::cout << "Greedy policy (0-based action indices):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "pi*(" << s << ") = action " << pi_greedy[s] << "\n";

    return 0;
}

This program applies the Bellman optimality operator until convergence to approximate V*. It then extracts a greedy policy by choosing, in each state, the action that maximizes the one-step lookahead using the converged V.

Time: O(K E), where K is the number of value-iteration sweeps to reach tolerance and E is the number of nonzero transitions.
Space: O(S) for V plus O(E) for the model.

Tabular Q-learning (Sample-based Bellman Optimality)
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <iomanip>

struct Transition { int next; double prob; };
struct MDP {
    int S, A;
    std::vector<std::vector<std::vector<Transition>>> P;
    std::vector<std::vector<double>> R;
    std::vector<bool> terminal;
};

int sample_next_state(const std::vector<Transition>& tr_list, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double u = dist(rng), cdf = 0.0;
    for (const auto& tr : tr_list) { cdf += tr.prob; if (u <= cdf) return tr.next; }
    return tr_list.back().next; // fallback for numerical issues
}

int epsilon_greedy_action(const std::vector<std::vector<double>>& Q, int s, double epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    if (dist(rng) < epsilon) {
        std::uniform_int_distribution<int> a_dist(0, (int)Q[s].size() - 1);
        return a_dist(rng);
    }
    int besta = 0; double bestq = Q[s][0];
    for (int a = 1; a < (int)Q[s].size(); ++a) if (Q[s][a] > bestq) { bestq = Q[s][a]; besta = a; }
    return besta;
}

int main() {
    // Build the same MDP
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} };                   mdp.R[0][1] = 1.0;
    mdp.P[1][0] = { {3,1.0} };                   mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} };                   mdp.R[1][1] = 0.0;
    mdp.P[2][0] = { {3,1.0} };                   mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} };                   mdp.R[2][1] = 0.5;
    mdp.terminal[3] = true; mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} };

    // Q-learning parameters
    double gamma = 0.9;
    double alpha = 0.1;    // learning rate
    double epsilon = 0.1;  // exploration
    int episodes = 2000;
    int max_steps = 100;   // per episode cap

    std::vector<std::vector<double>> Q(mdp.S, std::vector<double>(mdp.A, 0.0));
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> start_dist(0, mdp.S - 2); // avoid starting in terminal

    for (int ep = 0; ep < episodes; ++ep) {
        int s = start_dist(rng);
        for (int t = 0; t < max_steps; ++t) {
            if (mdp.terminal[s]) break;
            int a = epsilon_greedy_action(Q, s, epsilon, rng);
            int s_next = sample_next_state(mdp.P[s][a], rng);
            double r = mdp.R[s][a];
            // TD target: r + gamma * max_a' Q(s', a')
            double max_q_next = *std::max_element(Q[s_next].begin(), Q[s_next].end());
            double td_target = r + (mdp.terminal[s_next] ? 0.0 : gamma * max_q_next);
            double td_error = td_target - Q[s][a];
            Q[s][a] += alpha * td_error;
            s = s_next;
        }
    }

    // Derive greedy policy from learned Q
    std::vector<int> policy(mdp.S, 0);
    for (int s = 0; s < mdp.S; ++s) {
        policy[s] = std::distance(Q[s].begin(), std::max_element(Q[s].begin(), Q[s].end()));
    }

    std::cout << std::fixed << std::setprecision(4);
    std::cout << "Learned Q-values:\n";
    for (int s = 0; s < mdp.S; ++s) {
        std::cout << "s=" << s << ": ";
        for (int a = 0; a < mdp.A; ++a) std::cout << "Q[" << a << "]=" << Q[s][a] << " ";
        std::cout << "\n";
    }

    std::cout << "Greedy policy after Q-learning (0-based actions):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "pi(" << s << ") = " << policy[s] << "\n";

    return 0;
}

This example learns Q* from experience using the Q-learning update, an off-policy stochastic approximation to the Bellman optimality equation. We simulate transitions from the MDP, apply an ε-greedy behavior policy for exploration, and iteratively update Q toward the one-step optimality target.

Time: O(T A) total for T environment steps (A for the max over actions per update).
Space: O(SA) to store the Q-table, plus O(E) for the model if used for simulation.