Bellman Equations
Key Points
- Bellman equations express how the value of a state or action equals the immediate reward plus the discounted value of what follows.
- They turn long-term planning into local, recursive updates that can be solved by fixed-point iteration.
- For a fixed policy, the Bellman operator is a γ-contraction (for 0 ≤ γ < 1), guaranteeing a unique value function and convergence by iteration.
- Bellman optimality equations introduce a max over actions to define the best possible long-term value.
- Algorithms like policy evaluation, value iteration, policy iteration, and Q-learning all aim to satisfy Bellman equations.
- In discrete MDPs, tabular methods compute values exactly; with large or continuous spaces, function approximation or HJB PDEs are used.
- Convergence speed and stability depend on the discount factor, the transition structure, and (in RL) the learning rates.
- Getting terminal states, probability normalization, and the difference between expectation (policy evaluation) and max (optimality) right is crucial.
Prerequisites
- Probability and Expectation — Bellman equations average over actions and next states; understanding expectation is essential.
- Markov Chains — MDP dynamics reduce to Markov transitions when the policy is fixed.
- Markov Decision Processes (MDPs) — Bellman equations are defined on top of the MDP framework of states, actions, transitions, and rewards.
- Dynamic Programming — Bellman equations embody optimal substructure and recursive decomposition used in DP.
- Matrix and Linear Algebra — Policy evaluation can be written and solved as a linear system (I − γP^π)V = R^π.
- Norms and Contraction Mappings — Convergence guarantees rely on the γ-contraction property under the sup-norm.
- Stochastic Approximation — Q-learning convergence analysis uses stochastic approximation theory and learning-rate conditions.
- Basic C++ (Vectors, Loops, Functions) — Implementing tabular value iteration or Q-learning requires fundamental C++ skills.
Detailed Explanation
01 Overview
Bellman equations are the backbone of dynamic programming and value-based reinforcement learning. They state that the value of a decision now equals the immediate payoff plus the discounted value of the future, assuming some way of choosing actions (a policy) or choosing optimally. This recursive relationship breaks a long, uncertain future into smaller, local computations. In a Markov Decision Process (MDP), the environment is summarized by states, actions, rewards, and transition probabilities. The state-value function V measures how good it is to be in a state under a policy, while the action-value function Q measures how good it is to take a specific action in a state under a policy. The Bellman expectation equations characterize V and Q for a given policy; the Bellman optimality equations characterize the best achievable values V* and Q*. Because Bellman operators are contractions under the infinity norm when 0 ≤ γ < 1, iterative application converges to a unique fixed point. This leads directly to practical algorithms such as policy evaluation, value iteration, and Q-learning, which iteratively improve estimates until they satisfy the equations closely. In continuous-time control, the analogous concept is the Hamilton–Jacobi–Bellman (HJB) partial differential equation, which connects optimal control with calculus of variations. Bellman equations unify planning (when a model is known) and learning (when it is not), providing both theoretical guarantees and computational recipes for decision making under uncertainty.
02 Intuition & Analogies
Imagine planning a road trip. You care about the enjoyment now (scenery, snacks) and what follows (next stops), but you discount far-future enjoyment a bit because it’s uncertain or you’re impatient. The value of your current location equals what you get now plus a fraction of what you expect later. That is the Bellman idea: break a big trip (long horizon) into small hops (immediate step + future value). Now, suppose you follow a fixed travel style (your policy): maybe you prefer scenic roads 70% of the time and fastest roads 30%. The expected future enjoyment depends on that style, so your value is the average over choices you’ll likely make, weighted by their probabilities. That’s the Bellman expectation equation. If instead you’re optimizing, you’ll always pick the best next road; then the future value at each step is the maximum over choices, leading to the Bellman optimality equation. The magic is that you can compute long-term value by repeatedly updating each location from nearby locations, like smoothing with local rules, until it stops changing. Because each step shrinks the error by γ (the discount), this local update steadily converges. In the real world, you might not know exactly how traffic behaves (the transition model), so you learn from experience: you try a route, see where you end up, and update your estimate of how good that action was. That’s Q-learning—using samples to push your action-value toward the Bellman optimality target. Whether you have a full map (planning) or learn by driving (learning), the principle is the same: today’s value equals today’s reward plus a discounted view of tomorrow.
03 Formal Definition
Consider a discounted MDP (S, A, P, R, γ) with transition probabilities P(s'|s,a), expected rewards R(s,a), and discount factor 0 ≤ γ < 1. Under a policy π, the return from time t is the discounted sum of rewards G_t = Σ_{k≥0} γ^k r_{t+k}, and the value functions are V^π(s) = E_π[G_t | s_t = s] and Q^π(s,a) = E_π[G_t | s_t = s, a_t = a]. The Bellman expectation equations characterize these as fixed points: V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ] and Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a'). The Bellman optimality equations replace the average over actions with a maximum: V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ] and Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a'). Each equation is the fixed-point condition of a corresponding Bellman operator; the Key Formulas section lists these operators and their contraction properties.
04 When to Use
Use Bellman equations when solving Markov Decision Processes or designing value-based reinforcement learning algorithms. If you know the model (P and R), you can compute exact or near-exact solutions via planning methods: policy evaluation (to assess a fixed policy), policy iteration (alternate evaluation and improvement), or value iteration (apply the optimality operator directly). These are ideal for small to medium tabular problems such as gridworlds, queueing systems, inventory control, and finite-horizon planning. When the model is unknown or too complex, use sample-based methods that still aim for Bellman consistency, such as temporal-difference learning and Q-learning; they work from experience trajectories collected by interacting with the environment. In large or continuous spaces, you approximate value functions with linear features or neural networks; while exact Bellman updates are impossible, you still minimize Bellman error (e.g., in DQN). If dynamics are continuous in time, use the Hamilton–Jacobi–Bellman framework to derive optimality conditions and numerically solve PDEs when feasible. Choose Bellman approaches when you need principled, convergent methods (γ < 1) with interpretable value functions and when optimal or near-optimal long-term performance matters rather than myopic rewards.
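Policy evaluation, value iteration, and Q-learning all appear in the Code Examples section below, but policy iteration does not, so here is a minimal sketch of it on a made-up 3-state, 2-action MDP (the dense model, rewards, and constants are illustrative assumptions, not part of the later examples):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Minimal policy iteration on a tiny, made-up 3-state, 2-action MDP.
// P[s][a][s'] is a dense transition tensor; R[s][a] is the expected reward.
int main() {
    const int S = 3, A = 2;
    const double gamma = 0.9;
    double P[S][A][S] = {
        { {0.0, 1.0, 0.0}, {0.0, 0.0, 1.0} },  // from state 0: a0 -> s1, a1 -> s2
        { {0.0, 0.0, 1.0}, {1.0, 0.0, 0.0} },  // from state 1: a0 -> s2, a1 -> s0
        { {0.0, 0.0, 1.0}, {0.0, 0.0, 1.0} }   // state 2 is absorbing
    };
    double R[S][A] = { {0.0, 1.0}, {2.0, 0.0}, {0.0, 0.0} };

    std::vector<int> pi(S, 0);       // deterministic policy: one action per state
    std::vector<double> V(S, 0.0);

    bool stable = false;
    while (!stable) {
        // 1) Policy evaluation: iterate the Bellman expectation backup for pi.
        for (int sweep = 0; sweep < 1000; ++sweep) {
            double delta = 0.0;
            for (int s = 0; s < S; ++s) {
                double v = R[s][pi[s]];
                for (int sn = 0; sn < S; ++sn) v += gamma * P[s][pi[s]][sn] * V[sn];
                delta = std::max(delta, std::fabs(v - V[s]));
                V[s] = v;  // in-place (Gauss-Seidel style) update
            }
            if (delta < 1e-10) break;
        }
        // 2) Policy improvement: act greedily with respect to V.
        stable = true;
        for (int s = 0; s < S; ++s) {
            int best_a = pi[s];
            double best_q = -1e100;
            for (int a = 0; a < A; ++a) {
                double q = R[s][a];
                for (int sn = 0; sn < S; ++sn) q += gamma * P[s][a][sn] * V[sn];
                if (q > best_q + 1e-12) { best_q = q; best_a = a; }
            }
            if (best_a != pi[s]) { pi[s] = best_a; stable = false; }
        }
    }

    for (int s = 0; s < S; ++s)
        std::cout << "pi(" << s << ") = " << pi[s] << ", V = " << V[s] << "\n";
    return 0;
}

The outer loop alternates evaluating the current policy with greedy improvement and stops when no state changes its action, at which point V approximately satisfies the Bellman optimality equation.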
⚠️ Common Mistakes
Common pitfalls include:
1. Mixing expectation and maximization: in policy evaluation you must average over π(a|s); using a max there computes an improved policy instead.
2. Mishandling terminal states: terminals should have zero continuation value and self-loop transitions with probability 1; forgetting this yields inflated values or non-convergence.
3. Using γ = 1 without proper conditions: with infinite horizons, γ must be < 1 (or rewards must be bounded and the task episodic) to ensure contraction and convergence.
4. Not normalizing transition probabilities P(s'|s,a) to sum to 1, which breaks the probabilistic interpretation and skews values (a small check is sketched after this list).
5. Large learning rates or a non-decaying α in Q-learning can prevent convergence; α_t should typically decrease or be small and stable.
6. Off-policy instability with function approximation: naive Q-learning with powerful approximators can diverge; use target networks or double estimators.
7. Stopping criteria based on small average changes rather than the max-norm can miss worst-case errors; use a sup-norm threshold.
8. Confusing Q* and V*: V*(s) = max_a Q*(s,a), but during learning these may be inconsistent; derive policies greedily from Q, not partially from V.
9. Ignoring reward scaling: very large rewards slow convergence and cause numerical issues; normalize rewards or adjust γ.
10. Overlooking that computational cost scales with the number of nonzero transitions: dense models are far more expensive than sparse ones; use sparse representations and precompute adjacency.
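As a concrete guard against pitfall 4, the following minimal sketch asserts that every transition distribution P(·|s,a) sums to 1; it mirrors the Transition layout used in the code examples below, and the toy probabilities are made up for illustration:

#include <cassert>
#include <cmath>
#include <vector>

// A transition entry mirroring the layout used in the code examples below.
struct Transition { int next; double prob; };

// Assert that every P(.|s,a) is a proper probability distribution.
void check_normalized(const std::vector<std::vector<std::vector<Transition>>>& P) {
    for (std::size_t s = 0; s < P.size(); ++s) {
        for (std::size_t a = 0; a < P[s].size(); ++a) {
            double sum = 0.0;
            for (const auto& tr : P[s][a]) sum += tr.prob;
            assert(std::fabs(sum - 1.0) < 1e-9 && "P(.|s,a) must sum to 1");
        }
    }
}

int main() {
    // Toy model: one state, one action, two possible next states (0.4 + 0.6 = 1).
    std::vector<std::vector<std::vector<Transition>>> P = { { { {0, 0.4}, {0, 0.6} } } };
    check_normalized(P);  // passes; a misspecified model would abort here
    return 0;
}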
Key Formulas
Bellman Expectation (State Value)
V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]
Explanation: The value under policy π equals the expected immediate reward plus discounted value of the next state. Average over actions using π and over next states using the transition model.
Bellman Expectation (Action Value)
Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a')
Explanation: The expected return after taking action a in s and then following π equals the reward plus discounted next-step action-values, averaged over next states by the transition model and over next actions by π.
Optimal Value from Optimal Q
V*(s) = max_a Q*(s,a)
Explanation: The optimal state value is achieved by taking the best possible action at that state. This ties optimal V* and Q* together.
Bellman Optimality (Action Value)
Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a')
Explanation: Optimal action-values satisfy a fixed point where targets use the maximum over next actions. This is the foundation for Q-learning.
Bellman Operator (Policy)
(T^π V)(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
Explanation: Applies a single Bellman expectation backup to a value function. Repeatedly applying T^π converges to V^π.
Contraction Property
∥T^π V − T^π W∥_∞ ≤ γ ∥V − W∥_∞
Explanation: The Bellman expectation operator shrinks the sup-norm distance between two value functions by at least a factor of γ. This ensures a unique fixed point and convergence by iteration.
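To see the contraction numerically, the sketch below applies the expectation backup of an assumed 2-state policy-induced chain (Ppi, Rpi, and the two test value functions are made-up numbers) and prints the sup-norm distances before and after one backup; the ratio stays below γ:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Numerically illustrate the gamma-contraction of the Bellman expectation
// operator T^pi on a made-up 2-state chain induced by a fixed policy.
int main() {
    const double gamma = 0.9;
    double Ppi[2][2] = { {0.5, 0.5}, {0.2, 0.8} };  // policy-induced transitions (rows sum to 1)
    double Rpi[2]    = { 1.0, -0.5 };               // policy-induced expected rewards

    // (T^pi V)(s) = Rpi[s] + gamma * sum_s' Ppi[s][s'] * V[s']
    auto apply_T = [&](const std::vector<double>& V) {
        std::vector<double> out(2, 0.0);
        for (int s = 0; s < 2; ++s) {
            out[s] = Rpi[s];
            for (int sn = 0; sn < 2; ++sn) out[s] += gamma * Ppi[s][sn] * V[sn];
        }
        return out;
    };
    auto sup_dist = [](const std::vector<double>& a, const std::vector<double>& b) {
        double d = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) d = std::max(d, std::fabs(a[i] - b[i]));
        return d;
    };

    std::vector<double> V = {3.0, -1.0}, W = {-2.0, 5.0};  // two arbitrary value functions
    auto TV = apply_T(V), TW = apply_T(W);

    std::cout << "||V - W||_inf     = " << sup_dist(V, W)   << "\n";
    std::cout << "||TV - TW||_inf   = " << sup_dist(TV, TW) << "\n";
    std::cout << "ratio (<= gamma)  = " << sup_dist(TV, TW) / sup_dist(V, W) << "\n";
    return 0;
}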
Linear System for Policy Evaluation
(I − γP^π) V^π = R^π,  i.e.  V^π = (I − γP^π)^{−1} R^π
Explanation: Exact policy evaluation can be written as a linear system using the state transition matrix under π and expected rewards. Solving this yields V^π in one step for small problems.
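For small models the system can be solved directly rather than iterated. The sketch below assembles I − γP^π for the deterministic policy used in the iterative policy-evaluation example further down (s0 takes action 1, s1 action 0, s2 action 1, s3 self-loops) and solves it with plain Gaussian elimination; the result should agree with the iterative values (V ≈ 3.07, 2.0, 2.3, 0.0):

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

// Solve A x = b by Gaussian elimination with partial pivoting (dense, O(n^3)).
std::vector<double> solve_linear(std::vector<std::vector<double>> A, std::vector<double> b) {
    const int n = (int)b.size();
    for (int col = 0; col < n; ++col) {
        int piv = col;
        for (int r = col + 1; r < n; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col])) piv = r;
        std::swap(A[piv], A[col]); std::swap(b[piv], b[col]);
        for (int r = col + 1; r < n; ++r) {
            double f = A[r][col] / A[col][col];
            for (int c = col; c < n; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }
    std::vector<double> x(n, 0.0);
    for (int r = n - 1; r >= 0; --r) {  // back substitution
        double s = b[r];
        for (int c = r + 1; c < n; ++c) s -= A[r][c] * x[c];
        x[r] = s / A[r][r];
    }
    return x;
}

int main() {
    const int S = 4;
    const double gamma = 0.9;
    // Policy-induced transition matrix Ppi and reward vector Rpi for the policy
    // s0 -> s2 (r=1), s1 -> s3 (r=2), s2 -> s1 (r=0.5), s3 self-loops (r=0).
    double Ppi[S][S] = {
        {0, 0, 1, 0},
        {0, 0, 0, 1},
        {0, 1, 0, 0},
        {0, 0, 0, 1}
    };
    std::vector<double> Rpi = {1.0, 2.0, 0.5, 0.0};

    // Assemble A = I - gamma * Ppi, then solve A V = Rpi.
    std::vector<std::vector<double>> A(S, std::vector<double>(S, 0.0));
    for (int i = 0; i < S; ++i)
        for (int j = 0; j < S; ++j)
            A[i][j] = (i == j ? 1.0 : 0.0) - gamma * Ppi[i][j];

    std::vector<double> V = solve_linear(A, Rpi);
    for (int s = 0; s < S; ++s) std::cout << "V[" << s << "] = " << V[s] << "\n";
    return 0;
}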
Value Iteration and Error Bound
V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V_k(s') ],  with  ∥V_k − V*∥_∞ ≤ γ^k ∥V_0 − V*∥_∞
Explanation: Applying the optimality operator iteratively converges to V*. The bound quantifies how fast the error shrinks with k, depending on γ and the initial residual.
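A worked instance of the bound (the numbers are assumed for illustration): with γ = 0.9 and an initial error ∥V_0 − V*∥_∞ = 10, guaranteeing ∥V_k − V*∥_∞ ≤ 10⁻³ requires γ^k · 10 ≤ 10⁻³, i.e. k ≥ ln(10⁴) / ln(1/0.9) ≈ 87.4, so about 88 iterations; with γ = 0.99 the same accuracy needs roughly 917 iterations, which is why convergence slows sharply as γ approaches 1.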
Q-learning Update
Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
Explanation: A sample-based stochastic approximation to the Bellman optimality equation. With suitable conditions, Q-learning converges to Q* in the tabular setting.
Hamilton–Jacobi–Bellman (Continuous Time)
ρ V(x) = max_u [ r(x,u) + ∇V(x)·f(x,u) + ½ tr( σ(x,u) σ(x,u)ᵀ ∇²V(x) ) ]
Explanation: The continuous-time counterpart of Bellman optimality (stationary, discounted form). It characterizes the optimal value via a nonlinear PDE involving the drift f, the diffusion σ, the reward r, and the discount rate ρ.
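For intuition, in the deterministic special case (no diffusion, dynamics ẋ = f(x,u), discount rate ρ) the equation reduces to ρV(x) = max_u [ r(x,u) + ∇V(x)·f(x,u) ], which mirrors the discrete Bellman optimality equation with γ playing the role of e^(−ρΔt) over a small time step Δt.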
Complexity Analysis
For a tabular MDP with |S| states, |A| actions, and N nonzero transition entries, one sweep of policy evaluation or value iteration costs O(|S||A| + N), which is O(|S|²|A|) for dense models and much less for sparse ones. Reaching sup-norm accuracy ε takes on the order of log(1/ε) / log(1/γ) sweeps, so convergence slows as γ approaches 1. Exact policy evaluation through the linear system (I − γP^π)V = R^π costs O(|S|³) with a dense solver. Policy iteration adds an O(|S||A| + N) improvement step per round and typically needs few rounds. A tabular Q-learning update is O(|A|) per step (for the max over next actions), with O(|S||A|) memory for the table.
Code Examples
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <cassert>

struct Transition {
    int next;     // next state index
    double prob;  // probability of transition
};

struct MDP {
    int S;  // number of states
    int A;  // number of actions
    std::vector<std::vector<std::vector<Transition>>> P;  // P[s][a] = list of (s', p)
    std::vector<std::vector<double>> R;                   // R[s][a] = expected immediate reward for (s,a)
    std::vector<bool> terminal;                           // terminal states
};

// One sweep of Bellman expectation update for V under policy pi
void policy_evaluation_sweep(const MDP& mdp,
                             const std::vector<std::vector<double>>& pi,
                             double gamma,
                             const std::vector<double>& V_old,
                             std::vector<double>& V_new) {
    for (int s = 0; s < mdp.S; ++s) {
        if (mdp.terminal[s]) { V_new[s] = 0.0; continue; }
        double val = 0.0;
        for (int a = 0; a < mdp.A; ++a) {
            double expected_next = 0.0;
            for (const auto& tr : mdp.P[s][a]) {
                expected_next += tr.prob * V_old[tr.next];
            }
            val += pi[s][a] * (mdp.R[s][a] + gamma * expected_next);
        }
        V_new[s] = val;
    }
}

std::vector<double> policy_evaluation(const MDP& mdp,
                                      const std::vector<std::vector<double>>& pi,
                                      double gamma,
                                      int max_iters = 1000,
                                      double theta = 1e-8) {
    std::vector<double> V(mdp.S, 0.0), V_new(mdp.S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        policy_evaluation_sweep(mdp, pi, gamma, V, V_new);
        double delta = 0.0;
        for (int s = 0; s < mdp.S; ++s) delta = std::max(delta, std::fabs(V_new[s] - V[s]));
        V.swap(V_new);
        if (delta < theta) break;  // sup-norm stopping criterion
    }
    return V;
}

int main() {
    // Build a small MDP with 4 states (state 3 is terminal) and 2 actions
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    // Define transitions and rewards
    // State 0
    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} };                   mdp.R[0][1] = 1.0;
    // State 1
    mdp.P[1][0] = { {3,1.0} }; mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} }; mdp.R[1][1] = 0.0;
    // State 2
    mdp.P[2][0] = { {3,1.0} }; mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} }; mdp.R[2][1] = 0.5;
    // State 3 (terminal)
    mdp.terminal[3] = true;
    mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} };  // self-loop
    mdp.R[3][0] = mdp.R[3][1] = 0.0;

    // Define a stochastic policy pi[a|s]
    std::vector<std::vector<double>> pi(mdp.S, std::vector<double>(mdp.A, 0.5));
    pi[0] = {0.0, 1.0};  // in s=0 choose action 1 deterministically
    pi[1] = {1.0, 0.0};  // in s=1 choose action 0 deterministically
    pi[2] = {0.0, 1.0};  // in s=2 choose action 1 deterministically
    pi[3] = {0.5, 0.5};  // terminal doesn't matter

    // Validate that each pi[s] sums to 1
    for (int s = 0; s < mdp.S; ++s) {
        double sum = pi[s][0] + pi[s][1];
        assert(std::fabs(sum - 1.0) < 1e-9);
    }

    double gamma = 0.9;
    auto V = policy_evaluation(mdp, pi, gamma, 10000, 1e-12);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "State values under the given policy (gamma=0.9):\n";
    for (int s = 0; s < mdp.S; ++s) {
        std::cout << "V[" << s << "] = " << V[s] << "\n";
    }

    return 0;
}
This program builds a small tabular MDP and evaluates a fixed policy by iteratively applying the Bellman expectation backup. Terminal states are handled as zero continuation value with self-loops. The sup-norm change (max absolute difference across states) is used for convergence.
#include <iostream>
#include <vector>
#include <utility>
#include <algorithm>
#include <cmath>
#include <iomanip>

struct Transition { int next; double prob; };
struct MDP {
    int S, A;
    std::vector<std::vector<std::vector<Transition>>> P;
    std::vector<std::vector<double>> R;
    std::vector<bool> terminal;
};

std::vector<double> value_iteration(const MDP& mdp, double gamma, int max_iters, double theta) {
    std::vector<double> V(mdp.S, 0.0), V_new(mdp.S, 0.0);
    for (int it = 0; it < max_iters; ++it) {
        double delta = 0.0;
        for (int s = 0; s < mdp.S; ++s) {
            if (mdp.terminal[s]) { V_new[s] = 0.0; continue; }
            double best = -1e100;
            for (int a = 0; a < mdp.A; ++a) {
                double expected_next = 0.0;
                for (const auto& tr : mdp.P[s][a]) expected_next += tr.prob * V[tr.next];
                best = std::max(best, mdp.R[s][a] + gamma * expected_next);
            }
            V_new[s] = best;
            delta = std::max(delta, std::fabs(V_new[s] - V[s]));
        }
        V.swap(V_new);
        if (delta < theta) break;
    }
    return V;
}

std::vector<int> greedy_policy_from_V(const MDP& mdp, const std::vector<double>& V, double gamma) {
    std::vector<int> policy(mdp.S, 0);
    for (int s = 0; s < mdp.S; ++s) {
        if (mdp.terminal[s]) { policy[s] = 0; continue; }
        double best = -1e100; int besta = 0;
        for (int a = 0; a < mdp.A; ++a) {
            double expected_next = 0.0;
            for (const auto& tr : mdp.P[s][a]) expected_next += tr.prob * V[tr.next];
            double q = mdp.R[s][a] + gamma * expected_next;
            if (q > best) { best = q; besta = a; }
        }
        policy[s] = besta;
    }
    return policy;
}

int main() {
    // Same MDP as in the previous example
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} }; mdp.R[0][1] = 1.0;
    mdp.P[1][0] = { {3,1.0} }; mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} }; mdp.R[1][1] = 0.0;
    mdp.P[2][0] = { {3,1.0} }; mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} }; mdp.R[2][1] = 0.5;
    mdp.terminal[3] = true;
    mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} }; mdp.R[3][0] = mdp.R[3][1] = 0.0;

    double gamma = 0.9;
    auto V = value_iteration(mdp, gamma, 10000, 1e-12);
    auto pi_greedy = greedy_policy_from_V(mdp, V, gamma);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Optimal state values (approx):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "V*[" << s << "] = " << V[s] << "\n";

    std::cout << "Greedy policy (0-based action indices):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "pi*(" << s << ") = action " << pi_greedy[s] << "\n";

    return 0;
}
This program applies the Bellman optimality operator until convergence to approximate V*. It then extracts a greedy policy by choosing, in each state, the action that maximizes the one-step lookahead using the converged V.
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <iterator>
#include <iomanip>

struct Transition { int next; double prob; };
struct MDP {
    int S, A;
    std::vector<std::vector<std::vector<Transition>>> P;
    std::vector<std::vector<double>> R;
    std::vector<bool> terminal;
};

int sample_next_state(const std::vector<Transition>& tr_list, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double u = dist(rng), cdf = 0.0;
    for (const auto& tr : tr_list) { cdf += tr.prob; if (u <= cdf) return tr.next; }
    return tr_list.back().next;  // fallback for numerical issues
}

int epsilon_greedy_action(const std::vector<std::vector<double>>& Q, int s, double epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    if (dist(rng) < epsilon) {
        std::uniform_int_distribution<int> a_dist(0, (int)Q[s].size() - 1);
        return a_dist(rng);
    }
    int besta = 0; double bestq = Q[s][0];
    for (int a = 1; a < (int)Q[s].size(); ++a) if (Q[s][a] > bestq) { bestq = Q[s][a]; besta = a; }
    return besta;
}

int main() {
    // Build the same MDP
    MDP mdp; mdp.S = 4; mdp.A = 2;
    mdp.P.assign(mdp.S, std::vector<std::vector<Transition>>(mdp.A));
    mdp.R.assign(mdp.S, std::vector<double>(mdp.A, 0.0));
    mdp.terminal.assign(mdp.S, false);

    mdp.P[0][0] = { {0,0.1}, {1,0.6}, {2,0.3} }; mdp.R[0][0] = 0.0;
    mdp.P[0][1] = { {2,1.0} }; mdp.R[0][1] = 1.0;
    mdp.P[1][0] = { {3,1.0} }; mdp.R[1][0] = 2.0;
    mdp.P[1][1] = { {0,1.0} }; mdp.R[1][1] = 0.0;
    mdp.P[2][0] = { {3,1.0} }; mdp.R[2][0] = 3.0;
    mdp.P[2][1] = { {1,1.0} }; mdp.R[2][1] = 0.5;
    mdp.terminal[3] = true; mdp.P[3][0] = { {3,1.0} }; mdp.P[3][1] = { {3,1.0} };

    // Q-learning parameters
    double gamma = 0.9;
    double alpha = 0.1;    // learning rate
    double epsilon = 0.1;  // exploration
    int episodes = 2000;
    int max_steps = 100;   // per episode cap

    std::vector<std::vector<double>> Q(mdp.S, std::vector<double>(mdp.A, 0.0));
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> start_dist(0, mdp.S - 2);  // avoid starting in terminal

    for (int ep = 0; ep < episodes; ++ep) {
        int s = start_dist(rng);
        for (int t = 0; t < max_steps; ++t) {
            if (mdp.terminal[s]) break;
            int a = epsilon_greedy_action(Q, s, epsilon, rng);
            int s_next = sample_next_state(mdp.P[s][a], rng);
            double r = mdp.R[s][a];
            // TD target: r + gamma * max_a' Q(s', a')
            double max_q_next = *std::max_element(Q[s_next].begin(), Q[s_next].end());
            double td_target = r + (mdp.terminal[s_next] ? 0.0 : gamma * max_q_next);
            double td_error = td_target - Q[s][a];
            Q[s][a] += alpha * td_error;
            s = s_next;
        }
    }

    // Derive greedy policy from learned Q
    std::vector<int> policy(mdp.S, 0);
    for (int s = 0; s < mdp.S; ++s) {
        policy[s] = (int)std::distance(Q[s].begin(), std::max_element(Q[s].begin(), Q[s].end()));
    }

    std::cout << std::fixed << std::setprecision(4);
    std::cout << "Learned Q-values:\n";
    for (int s = 0; s < mdp.S; ++s) {
        std::cout << "s=" << s << ": ";
        for (int a = 0; a < mdp.A; ++a) std::cout << "Q[" << a << "]=" << Q[s][a] << " ";
        std::cout << "\n";
    }

    std::cout << "Greedy policy after Q-learning (0-based actions):\n";
    for (int s = 0; s < mdp.S; ++s) std::cout << "pi(" << s << ") = " << policy[s] << "\n";

    return 0;
}
This example learns Q* from experience using the Q-learning update, an off-policy stochastic approximation to the Bellman optimality equation. We simulate transitions from the MDP, apply an ε-greedy behavior policy for exploration, and iteratively update Q toward the one-step optimality target.