
Reinforcement Learning Theory

Key Points

  • Reinforcement Learning (RL) studies how an agent learns to act in an environment to maximize long-term cumulative reward.
  • Markov Decision Processes (MDPs) formalize RL problems with states, actions, transition probabilities, rewards, and a discount factor.
  • The Bellman equations relate the value of a state to the values of successor states and are the backbone of dynamic programming methods.
  • Value iteration and policy iteration solve known MDPs exactly, while Q-learning learns optimal action values from samples without a model.
  • Policy gradient and actor-critic methods directly optimize a parameterized policy, scaling to continuous actions and large state spaces.
  • Tabular Q-learning converges under sufficient exploration, but function approximation can diverge when combined with bootstrapping and off-policy learning (the deadly triad).
  • Sample complexity in RL is captured by PAC-MDP bounds, which scale polynomially in |S|, |A|, 1/(1−γ), 1/ε, and log(1/δ).
  • RL is foundational for RLHF, game-playing AIs, and robotics control, where sequential decision-making under uncertainty is critical.

Prerequisites

  • Probability Theory (expectations, conditional probability): Value functions and Bellman equations are expectations over stochastic transitions and policies.
  • Markov Chains: MDPs extend Markov chains by adding actions; understanding state transitions and stationary behavior is essential.
  • Dynamic Programming: Bellman optimality and iterative improvement are DP principles applied to decision processes.
  • Linear Algebra and Norms: Contraction mappings, fixed points, and convergence analyses use vector norms and operators.
  • Calculus and Gradient Methods: The policy gradient theorem and optimization require gradients and step-size reasoning.
  • Stochastic Approximation: Convergence of TD and Q-learning relies on stochastic approximation and diminishing step sizes.
  • Programming with Randomness (PRNGs): RL algorithms require sampling actions and stochastic transitions during training.

Detailed Explanation


01 Overview

Reinforcement Learning (RL) is the study of how an autonomous agent learns to make sequences of decisions by interacting with an environment to maximize cumulative reward. Problems are modeled as Markov Decision Processes (MDPs), which specify the set of states S, actions A, a transition model P describing how actions change the state, a reward function R quantifying immediate utility, and a discount factor γ that balances immediate and future rewards. The central objects are value functions: V^π(s) (state value) and Q^π(s,a) (action value), which predict expected discounted returns under a policy π. The Bellman equations tie these values together recursively, enabling algorithms that propagate information about immediate rewards to estimates of long-term outcomes. When the MDP is known, dynamic programming (DP) methods like value iteration and policy iteration compute optimal policies. When the model is unknown, model-free methods like Q-learning and SARSA learn from sampled transitions. For large or continuous spaces, policies and value functions are parameterized (e.g., with neural networks), and optimized with policy gradients or actor-critic methods. RL theory studies convergence guarantees, sample complexity, stability, and bias–variance trade-offs, informing practical algorithm design and safe deployment.

02 Intuition & Analogies

Imagine teaching a dog tricks. You say “sit,” and if the dog sits, you give a treat. Over time, the dog learns which actions (sit, stay, roll over) in which situations (hearing specific commands) lead to treats. The dog doesn’t get a rulebook up front; it learns from trial and error, guided by rewards. That’s RL: an agent (the dog) observes the state (the command and context), takes an action, gets feedback (a reward), and adjusts behavior to get more rewards in the future. Now picture a GPS navigator choosing a route. Each turn has a small time cost, traffic conditions add uncertainty, and the goal is to minimize total travel time. The GPS must weigh immediate gains (a quick turn now) against long-term outcomes (avoiding a jam later). The discount factor γ is like how much the navigator cares about the time 5 minutes from now compared to right now. The Bellman equations are like common sense rules: the value of being on a road equals the immediate cost of that road plus the best expected value of the roads that follow. Value iteration is like repeatedly recalculating best paths as new traffic data comes in until things settle. Q-learning is like learning, from experience alone, how good each turn is without ever seeing the full map. Policy gradients are like directly tuning the GPS’s “style” of choosing turns—aggressive vs. cautious—guided by how well recent trips went. Actor-critic separates these roles: the actor proposes turns (policy) while the critic scores how good a location is (value), just like a driver (actor) guided by a traffic analyst (critic).

03 Formal Definition

An MDP is a tuple (S, A, P, R, γ), where S is a (finite or measurable) state space, A is an action space, P(s'|s,a) is the transition kernel, R(s,a) is the expected immediate reward, and γ ∈ [0,1) is the discount factor. A policy π(a|s) is a conditional distribution over actions given states. The discounted return from time t is G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}. The value functions are V^π(s) = E_π[ G_t | S_t = s ] and Q^π(s,a) = E_π[ G_t | S_t = s, A_t = a ]. The Bellman expectation equations are V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ] and Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a'). The optimal value functions satisfy V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ] and Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a'). The Bellman optimality operator is a γ-contraction in the ℓ_∞ norm, implying existence and uniqueness of V*. Tabular Q-learning uses the stochastic approximation Q(s_t,a_t) ← Q(s_t,a_t) + α_t [ r_{t+1} + γ max_{a'} Q(s_{t+1},a') − Q(s_t,a_t) ] and converges to Q* under standard conditions (sufficient exploration, diminishing step sizes). Policy gradient methods optimize a differentiable objective J(θ) via the policy gradient theorem: ∇_θ J(θ) = E_π [ ∇_θ log π_θ(a|s) Q^π(s,a) ], often using baselines to reduce variance.
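
To make the contraction claim concrete, here is a short sketch of the standard argument, written in LaTeX; T denotes the Bellman optimality operator defined above, and the first inequality uses the elementary fact that |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|.

(TV)(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big]

\big| (TV)(s) - (TV')(s) \big| \;\le\; \max_a \, \gamma \sum_{s'} P(s' \mid s,a)\, \big| V(s') - V'(s') \big| \;\le\; \gamma \, \| V - V' \|_\infty

\Rightarrow \quad \| TV - TV' \|_\infty \;\le\; \gamma \, \| V - V' \|_\infty

By Banach's fixed-point theorem, T therefore has a unique fixed point V*, and value iteration converges to it geometrically.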

04 When to Use

Use dynamic programming (value or policy iteration) when you know the full MDP model: transition probabilities and rewards. This is common in planning problems, small gridworlds, and simulated environments where the model is explicit. Use model-free temporal-difference methods (Q-learning, SARSA) when the model is unknown but you can interact with the environment: robotics with simulators, online recommendation systems, or games where you can play many episodes. When state/action spaces are large or continuous (e.g., image observations, torque control), prefer function approximation and policy-based methods: actor-critic, proximal policy optimization (PPO), or deep Q-networks (DQN) for discrete actions. For continuous actions or stochastic optimal control (robot arm, autonomous driving throttle/steering), policy gradients and actor-critic are the natural fit. If exploration is costly or risky, consider model-based RL to learn a dynamics model and plan conservatively, or use safer exploration strategies and off-policy evaluation. In multi-agent or competitive settings (poker, StarCraft), extend to stochastic games and equilibrium concepts, with algorithms like policy gradient in self-play. For aligning AI systems with human preferences (RLHF), use RL theory to construct reward models from feedback and optimize policies under constraints (e.g., KL-regularization).
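
As a sketch of the KL-regularized objective mentioned for RLHF: with a learned reward model r_φ, a fixed reference policy π_ref, a prompt distribution D, and a regularization weight β (these symbols are introduced here purely for illustration), the optimization commonly takes a form like:

\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

The KL term keeps the optimized policy close to the reference policy, limiting reward-model exploitation.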

⚠️ Common Mistakes

• Misusing the discount factor γ: setting γ too high (near 1) without enough iterations or careful step sizes causes instability and slow convergence; too low dismisses long-term rewards. Start with γ in [0.95, 0.99] and verify numerical stability.
• Insufficient exploration: purely greedy action selection, or decaying ε too fast, traps learning in suboptimal policies. Use ε-greedy with persistent exploration, optimistic initialization, or exploration bonuses.
• Mixing off-policy data with function approximation and bootstrapping (the deadly triad) can diverge. If you must be off-policy, prefer algorithms with stability mechanisms (target networks, double estimators, compatible function approximation, or gradient TD methods).
• High-variance returns in policy gradients: using raw returns without baselines leads to noisy updates. Subtract a value-function baseline, normalize advantages, and use reward-to-go.
• Target/behavior mismatch in Q-learning: using the same network for bootstrapped targets overestimates action values. Use target networks and double Q-learning (a tabular sketch follows this list).
• Poor learning rates: a constant large α prevents convergence; one that is too small stalls learning. Use diminishing step sizes or adaptive schedules and monitor the TD error.
• Reward hacking and shaping pitfalls: naive shaping can change the optimal policy. Use potential-based shaping to preserve optimality and carefully validate metrics.
• Incorrect terminal handling: forgetting to zero out bootstrapping at terminal states biases updates.
• Ignoring variance–bias trade-offs: Monte Carlo has high variance, TD has bias; actor-critic balances both.
• Inadequate evaluation: evaluating on-policy while training off-policy, or not separating training/evaluation seeds, leads to misleading conclusions.
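
To illustrate the double-estimator and terminal-handling points above, here is a minimal, self-contained sketch of one tabular double Q-learning update. The single hard-coded transition (s, a, r, s2, done) stands in for real environment interaction and is an assumption made only for illustration.

#include <bits/stdc++.h>
using namespace std;

// Tabular double Q-learning update with explicit terminal handling.
// QA and QB are two independent Q-tables; each update randomly picks one
// table to update and uses the other to evaluate the greedy next action,
// which reduces the overestimation bias of the single-estimator max.
int main(){
    const int S = 4, A = 2;
    const double alpha = 0.1, gamma = 0.99;
    vector<vector<double>> QA(S, vector<double>(A, 0.0));
    vector<vector<double>> QB(S, vector<double>(A, 0.0));

    mt19937 rng(42);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // One illustrative transition; in a real agent these come from the environment.
    int s = 0, a = 1, s2 = 2;
    double r = 1.0;
    bool done = false;

    auto argmax_row = [](const vector<double>& row){
        return (int)(max_element(row.begin(), row.end()) - row.begin());
    };

    if (unif(rng) < 0.5) {
        // Update QA: select the argmax with QA, evaluate it with QB.
        int a_star = argmax_row(QA[s2]);
        double bootstrap = done ? 0.0 : QB[s2][a_star];  // zero bootstrap at terminal states
        QA[s][a] += alpha * (r + gamma * bootstrap - QA[s][a]);
    } else {
        // Update QB: select the argmax with QB, evaluate it with QA.
        int a_star = argmax_row(QB[s2]);
        double bootstrap = done ? 0.0 : QA[s2][a_star];
        QB[s][a] += alpha * (r + gamma * bootstrap - QB[s][a]);
    }

    // For action selection, act greedily with respect to QA + QB (or their average).
    cout << "QA[0][1] = " << QA[0][1] << ", QB[0][1] = " << QB[0][1] << "\n";
    return 0;
}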

Key Formulas

Discounted Return

G_t = Σ_{k=0}^∞ γ^k R_{t+k+1}

Explanation: The total future reward from time t, discounted by γ each step. It formalizes long-term objectives while keeping sums finite for γ < 1.

State Value

V^π(s) = E_π[ G_t | S_t = s ]

Explanation: Expected discounted return starting at state s and following policy π. It measures how good it is to be in state s under π.

Action Value

Q^π(s,a) = E_π[ G_t | S_t = s, A_t = a ]

Explanation: Expected discounted return starting at state s, taking action a first, then following π. It compares immediate choices.

Bellman Expectation (State Value)

V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]

Explanation: Relates a state’s value to the expected immediate reward plus the discounted value of successor states under policy π.

Bellman Optimality (State Value)

V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]

Explanation: Defines the optimal value as the best immediate action plus the discounted optimal value of next states.

Bellman Optimality (Action Value)

Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a')

Explanation: Expresses optimal action values via a one-step lookahead and greedy continuation, foundational for Q-learning.

Q-Learning Update

Q(s_t,a_t) ← Q(s_t,a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1},a') − Q(s_t,a_t) ]

Explanation: Stochastic approximation that moves Q toward a bootstrapped target using a learning rate α. Converges in the tabular case under standard conditions.

TD(0) Update

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

Explanation: Bootstrapped update for value prediction using the TD error. Balances bias and variance versus Monte Carlo methods.

Policy Gradient Theorem

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s,a) ]

Explanation: Gives an unbiased gradient estimator depending only on the policy and action-value, not on environment dynamics. Enables direct policy optimization.

Advantage

A^π(s,a) = Q^π(s,a) − V^π(s)

Explanation: Measures how much better action a is compared to the average action at s. Using advantages reduces variance in policy gradients.

Contraction of Bellman Optimality Operator

||T V − T V'||_∞ ≤ γ ||V − V'||_∞

Explanation: Shows that repeated application of the Bellman operator shrinks errors by a factor of γ, guaranteeing a unique fixed point V* and convergence of value iteration.

Gradient Ascent Update

θ ← θ + α ∇_θ J(θ)

Explanation: Updates parameters in the direction of the objective’s gradient with step size α. Used in policy gradient and actor-critic methods.

PAC-MDP Sample Complexity (Informal)

N = poly( |S|, |A|, 1/(1−γ), 1/ε, log(1/δ) )

Explanation: An informal bound on the number of samples needed to achieve an ε-optimal policy with probability at least 1−δ. Emphasizes dependence on |S|, |A|, the effective horizon 1/(1−γ), and the accuracy/confidence parameters ε and δ.

Entropy-Regularized Objective

J(θ) = E_{π_θ}[ Σ_t γ^t ( r_t + β H(π_θ(·|s_t)) ) ]

Explanation: Adds policy entropy weighted by β to encourage exploration and flatter policies early in learning. Common in modern policy optimization.

Complexity Analysis

In dynamic programming for finite MDPs, value iteration performs Bellman backups over all state–action pairs per sweep. If |S| is the number of states and |A| the number of actions, each sweep costs O(|S|^2 |A|) in the dense case (O(|S|) successors per state) or O(|S| |A| d) if each (s,a) has at most d nonzero transitions. Convergence to within ε in the ℓ_∞ norm requires O( log(1/ε) / (1−γ) ) sweeps due to the γ-contraction, yielding total time O( |S|^2 |A| · log(1/ε) / (1−γ) ). Space is O(|S|) for V plus O(|S| |A| d) to store the model. Policy iteration alternates evaluation and improvement; exact evaluation solves a linear system in O(|S|^3) time (dense), while iterative evaluation reduces cost but may require multiple passes. Overall complexity depends on problem structure, but policy iteration often converges in fewer iterations than value iteration.

Tabular Q-learning updates one (s,a) entry per step at O(1) time and O(|S| |A|) space. While per-step cost is constant, the sample complexity (number of environment interactions) to reach an ε-optimal policy scales polynomially in |S|, |A|, 1/(1−γ), 1/ε, and log(1/δ) under PAC-MDP analyses. In practice, exploration and reward structure can dominate effective complexity.

Policy gradient and actor-critic methods with function approximation have per-update cost proportional to the number of parameters p and batch size B. A typical minibatch gradient takes O(Bp) time and O(p) space, with additional cost for computing advantages or returns (O(B)). For continuous action spaces, computing log-probabilities and their gradients adds O(a) per sample, where a is the action dimension. Convergence rates are problem-dependent; variance reduction (baselines, GAE) improves sample efficiency. When using neural networks, training is dominated by forward/backward passes, and stability enhancements (target networks, clipping, trust regions) add modest overhead but can dramatically reduce total sample complexity.
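
A quick numerical sketch of how the O( log(1/ε) / (1−γ) ) sweep bound behaves as γ approaches 1; constants are omitted, so the printed values are orders of magnitude rather than exact iteration counts.

#include <bits/stdc++.h>
using namespace std;

// Prints the order-of-magnitude number of value-iteration sweeps suggested by
// the log(1/eps) / (1 - gamma) bound for a few discount factors.
int main(){
    const double eps = 1e-6;
    const double gammas[] = {0.9, 0.99, 0.999};
    for (double gamma : gammas) {
        double sweeps = log(1.0 / eps) / (1.0 - gamma);
        cout << "gamma = " << gamma << "  ->  ~" << (long long)ceil(sweeps) << " sweeps\n";
    }
    return 0;
}

The effective horizon 1/(1−γ) multiplies the sweep count directly, which is why γ near 1 makes exact dynamic programming markedly more expensive.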

Code Examples

Value Iteration on a Small MDP (Bellman Optimality)
#include <bits/stdc++.h>
using namespace std;

// A small finite MDP with 3 states and 2 actions.
// Transitions are stored as adjacency lists with probabilities.
// We compute V* and the greedy policy via value iteration.

struct Transition { int next; double prob; double reward; };

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int S = 3;            // states: 0,1,2
    const int A = 2;            // actions: 0,1
    const double gamma = 0.9;
    const double theta = 1e-8;  // convergence threshold

    // MDP model: P and R together in transitions[s][a]
    vector<vector<vector<Transition>>> transitions(S, vector<vector<Transition>>(A));

    // Define a simple MDP:
    // State 0: a=0 goes to 1 with reward +1 (deterministic)
    //          a=1 goes to 2 with reward 0 (deterministic)
    transitions[0][0] = { {1, 1.0, 1.0} };
    transitions[0][1] = { {2, 1.0, 0.0} };

    // State 1: a=0 stays with small reward 0.1, a=1 to state 2 with reward 2
    transitions[1][0] = { {1, 1.0, 0.1} };
    transitions[1][1] = { {2, 1.0, 2.0} };

    // State 2: terminal-like (self-loop, zero reward)
    transitions[2][0] = { {2, 1.0, 0.0} };
    transitions[2][1] = { {2, 1.0, 0.0} };

    vector<double> V(S, 0.0), Vnew(S, 0.0);

    // Value iteration loop: apply the Bellman optimality operator until convergence
    int iters = 0;
    while (true) {
        iters++;
        double delta = 0.0;
        for (int s = 0; s < S; ++s) {
            double best = -1e100;
            for (int a = 0; a < A; ++a) {
                double q = 0.0;
                for (const auto &t : transitions[s][a]) {
                    q += t.prob * (t.reward + gamma * V[t.next]);
                }
                best = max(best, q);
            }
            Vnew[s] = best;
            delta = max(delta, fabs(Vnew[s] - V[s]));
        }
        V.swap(Vnew);
        if (delta < theta * (1 - gamma) / (2 * gamma + 1e-12)) break; // conservative stopping
        if (iters > 100000) break; // safety
    }

    // Derive greedy policy π*(s) = argmax_a Q*(s,a)
    vector<int> policy(S, 0);
    for (int s = 0; s < S; ++s) {
        double best = -1e100; int besta = 0;
        for (int a = 0; a < A; ++a) {
            double q = 0.0;
            for (const auto &t : transitions[s][a]) {
                q += t.prob * (t.reward + gamma * V[t.next]);
            }
            if (q > best) { best = q; besta = a; }
        }
        policy[s] = besta;
    }

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Converged in iterations: " << iters << "\n";
    for (int s = 0; s < S; ++s) cout << "V*[" << s << "] = " << V[s] << "\n";
    for (int s = 0; s < S; ++s) cout << "pi*[" << s << "] = action " << policy[s] << "\n";
    return 0;
}

This program encodes a tiny MDP and applies value iteration by repeatedly evaluating the Bellman optimality backup until convergence. It then extracts the greedy optimal policy. The example illustrates how the optimal value of a state is the best expected immediate reward plus discounted optimal values of successor states.

Time: O(I · |S| · |A| · d), where I is the number of iterations until convergence and d is the average number of nonzero successors per (s,a). For dense transitions, O(I · |S|^2 · |A|).
Space: O(|S| + |S||A|·d) to store the values and the transition model.

Tabular Q-Learning on a Small GridWorld
#include <bits/stdc++.h>
using namespace std;

// 5x5 GridWorld with start at (0,0) and goal at (4,4). Each step has reward -1;
// reaching the goal gives +10 and the episode ends.
// Actions: 0=up, 1=right, 2=down, 3=left. Moves that hit walls stay in place.
// We learn Q(s,a) using epsilon-greedy exploration.

struct Env {
    int H = 5, W = 5;
    pair<int,int> start{0,0};
    pair<int,int> goal{4,4};
    pair<int,int> s;

    void reset() { s = start; }
    bool is_goal() const { return s == goal; }
    int state_id() const { return s.first * W + s.second; }

    tuple<int,double,bool> step(int a){
        static int dr[4] = {-1, 0, 1, 0};
        static int dc[4] = { 0, 1, 0, -1};
        int nr = s.first + dr[a];
        int nc = s.second + dc[a];
        if (nr < 0 || nr >= H || nc < 0 || nc >= W) { nr = s.first; nc = s.second; }
        s = {nr, nc};
        bool done = is_goal();
        double r = done ? 10.0 : -1.0;
        return { state_id(), r, done };
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Env env;
    const int S = env.H * env.W;
    const int A = 4;
    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    // Hyperparameters
    double alpha = 0.1;     // learning rate
    double gamma = 0.99;    // discount
    double eps = 0.2;       // epsilon-greedy
    int episodes = 4000;    // number of episodes
    int max_steps = 200;    // safety cap per episode

    std::mt19937 rng(123);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> randA(0, A - 1);

    auto greedy = [&](int s){
        int besta = 0; double bestq = Q[s][0];
        for (int a = 1; a < A; ++a) { if (Q[s][a] > bestq) { bestq = Q[s][a]; besta = a; } }
        return besta;
    };

    for (int ep = 0; ep < episodes; ++ep) {
        env.reset();
        int s = env.state_id();
        for (int t = 0; t < max_steps; ++t) {
            int a = (unif(rng) < eps) ? randA(rng) : greedy(s);
            auto [s2, r, done] = env.step(a);
            // Q-learning update: move Q toward r + gamma * max_a' Q(s2, a')
            double maxq_next = *max_element(Q[s2].begin(), Q[s2].end());
            double td = r + gamma * (done ? 0.0 : maxq_next) - Q[s][a];
            Q[s][a] += alpha * td;
            s = s2;
            if (done) break;
        }
        // simple epsilon decay
        eps = max(0.01, eps * 0.999);
    }

    // Derive greedy policy arrows
    const char* arrows = "^>v<";
    vector<char> policy(S, '?');
    for (int r = 0; r < env.H; ++r) {
        for (int c = 0; c < env.W; ++c) {
            env.s = {r, c};
            int id = env.state_id();
            if (env.is_goal()) { policy[id] = 'G'; continue; }
            policy[id] = arrows[(int)(max_element(Q[id].begin(), Q[id].end()) - Q[id].begin())];
        }
    }

    // Print learned policy grid
    for (int r = 0; r < env.H; ++r) {
        for (int c = 0; c < env.W; ++c) {
            cout << policy[r * env.W + c] << ' ';
        }
        cout << '\n';
    }
    return 0;
}

We learn a tabular Q-function on a deterministic 5×5 grid. Each update pushes Q(s,a) toward the bootstrapped target r + γ max_a' Q(s',a'). Epsilon-greedy drives exploration. After training, we extract the greedy policy and display it as arrows pointing toward the goal.

Time: O(E · T · A), where E is the number of episodes and T the steps per episode; each step uses O(A) to compute max_a' Q(s',a').
Space: O(|S||A|) to store the Q-table.
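
For comparison with the on-policy SARSA method mentioned in the overview, only the bootstrapped target changes: SARSA bootstraps from the action actually selected next rather than from the max. A small, self-contained sketch of the two targets on a single transition; the Q-table values and the transition (s, a, r, s2, a2) are made up purely for illustration.

#include <bits/stdc++.h>
using namespace std;

// Contrast of the two tabular TD-control targets on one transition (s, a, r, s2),
// where a2 is the next action chosen by the behavior policy.
// Q-learning (off-policy): bootstrap from max_a' Q(s2, a').
// SARSA     (on-policy):   bootstrap from Q(s2, a2), the action actually taken next.
int main(){
    const double alpha = 0.1, gamma = 0.99;
    vector<vector<double>> Q = {{0.0, 0.5}, {1.0, 0.2}, {0.0, 0.0}};

    int s = 0, a = 1, s2 = 1, a2 = 1;   // a2: next action from the epsilon-greedy policy
    double r = -1.0;
    bool done = false;

    double q_learning_target = r + gamma * (done ? 0.0 : *max_element(Q[s2].begin(), Q[s2].end()));
    double sarsa_target      = r + gamma * (done ? 0.0 : Q[s2][a2]);

    cout << "Q-learning target: " << q_learning_target << "\n";  // uses the max over a'
    cout << "SARSA target:      " << sarsa_target      << "\n";  // uses the sampled a2

    // Either target plugs into the same update rule:
    Q[s][a] += alpha * (sarsa_target - Q[s][a]);
    cout << "Updated Q[s][a]:   " << Q[s][a] << "\n";
    return 0;
}
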
REINFORCE (Policy Gradient) on a K-Armed Bandit
#include <bits/stdc++.h>
using namespace std;

// Stateless RL: K-armed Bernoulli bandit. We optimize a softmax policy via REINFORCE with a baseline.

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int K = 5;                                           // number of arms
    vector<double> p_true = {0.1, 0.3, 0.5, 0.2, 0.8};   // success probabilities (unknown to the learner)

    // Policy parameters: preferences theta for a softmax over arms
    vector<double> theta(K, 0.0);

    double alpha = 0.05;     // learning rate
    double baseline = 0.0;   // moving-average baseline to reduce variance
    double beta = 0.01;      // baseline update rate

    std::mt19937 rng(123);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    auto softmax = [&](const vector<double>& z){
        vector<double> p(K);
        double m = *max_element(z.begin(), z.end());
        double Z = 0.0;
        for (int i = 0; i < K; ++i) { p[i] = exp(z[i] - m); Z += p[i]; }
        for (int i = 0; i < K; ++i) p[i] /= Z;
        return p;
    };

    int steps = 5000;
    for (int t = 0; t < steps; ++t) {
        vector<double> pi = softmax(theta);

        // Sample an action from the softmax policy
        double u = unif(rng), cum = 0.0;
        int a = 0;
        for (int i = 0; i < K; ++i) { cum += pi[i]; if (u <= cum) { a = i; break; } }

        // Sample a reward from Bernoulli(p_true[a])
        double r = (unif(rng) < p_true[a]) ? 1.0 : 0.0;

        // Update the baseline
        baseline = (1.0 - beta) * baseline + beta * r;

        // Gradient of log-softmax: grad_i = 1_{i=a} - pi[i]
        for (int i = 0; i < K; ++i) {
            double grad_log = (i == a ? 1.0 : 0.0) - pi[i];
            theta[i] += alpha * (r - baseline) * grad_log;  // REINFORCE with baseline
        }
    }

    vector<double> pi_final = softmax(theta);

    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Final policy probabilities:" << '\n';
    for (int i = 0; i < K; ++i)
        cout << "arm " << i << ": p= " << pi_final[i] << " (true success=" << p_true[i] << ")\n";
    return 0;
}

This is a minimal policy gradient example without states. The policy is a softmax over arms with preferences θ. REINFORCE updates θ in the direction of (r − baseline) times the gradient of log π(a). The baseline reduces variance. Over time, probability mass concentrates on the best arm.

Time: O(T · K) for T steps and K arms, dominated by the softmax and parameter updates.
Space: O(K) for parameters and probabilities.
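
The entropy-regularized objective from the Key Formulas section can be attached to this bandit example with a few extra lines. For a softmax policy, the entropy gradient is ∇_{θ_i} H(π) = −π_i (log π_i + H(π)). A minimal sketch of one regularized update, with the entropy weight beta_ent and the parameter values chosen purely for illustration:

#include <bits/stdc++.h>
using namespace std;

// One entropy-regularized REINFORCE update for a softmax policy over K arms:
// grad_i = (r - baseline) * (1_{i=a} - pi[i]) + beta_ent * dH/dtheta_i,
// where dH/dtheta_i = -pi[i] * (log(pi[i]) + H) for a softmax parameterization.
int main(){
    const int K = 3;
    vector<double> theta = {0.0, 0.9, 0.4};
    double alpha = 0.05, beta_ent = 0.01;
    int a = 1;                         // sampled action
    double r = 1.0, baseline = 0.4;    // observed reward and running baseline

    // softmax(theta)
    vector<double> pi(K);
    double m = *max_element(theta.begin(), theta.end()), Z = 0.0;
    for (int i = 0; i < K; ++i) { pi[i] = exp(theta[i] - m); Z += pi[i]; }
    for (int i = 0; i < K; ++i) pi[i] /= Z;

    // Policy entropy H(pi)
    double H = 0.0;
    for (int i = 0; i < K; ++i) H -= pi[i] * log(pi[i]);

    // Regularized update
    for (int i = 0; i < K; ++i) {
        double grad_log = (i == a ? 1.0 : 0.0) - pi[i];
        double grad_entropy = -pi[i] * (log(pi[i]) + H);
        theta[i] += alpha * ((r - baseline) * grad_log + beta_ent * grad_entropy);
    }

    for (int i = 0; i < K; ++i) cout << "theta[" << i << "] = " << theta[i] << "\n";
    return 0;
}

The entropy term nudges preferences toward a flatter distribution early in training, which complements the exploration advice in the Common Mistakes list.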