Policy Gradient Theorem
Key Points
• The policy gradient theorem gives the gradient of expected return with respect to a stochastic policy’s parameters, so performance can be improved by gradient ascent.
• It converts the hard-to-differentiate objective into an expectation of score functions (gradients of log-probabilities) times a return-like signal, using the likelihood ratio trick.
• You can plug in different signals (return, Q-value, advantage, or TD residual) that keep the gradient unbiased, or introduce only controlled bias, while trading off variance.
• Subtracting a baseline such as the value function does not change the expectation of the gradient but can dramatically reduce variance.
• Actor-critic methods learn the baseline (critic) and use TD errors to stabilize and speed up learning compared to pure REINFORCE.
• Trust region and clipped objectives constrain policy updates to avoid destructive large steps and improve reliability.
• Natural gradients precondition the update with the Fisher information matrix, following the steepest ascent in distribution space rather than raw parameter space.
• Deterministic policy gradients extend the idea to continuous actions by differentiating through the policy into the action-value function.
Prerequisites
→ Markov Decision Processes (MDPs) — Understanding states, actions, rewards, transitions, and discounting is necessary to define policies and returns.
→ Probability and Expectation — The theorem relies on expectations over trajectories and gradients of log-probabilities.
→ Calculus and Gradients — Policy gradients differentiate objectives with respect to parameters using the chain rule and score functions.
→ Stochastic Gradient Descent — Updates are noisy estimates of the true gradient obtained by sampling.
→ Value Functions and TD Learning — Actor-critic methods use value estimates and TD errors as lower-variance signals.
→ Linear Algebra — Natural gradients and trust regions involve the Fisher information matrix and quadratic approximations of the KL divergence.
Detailed Explanation
01 Overview
Reinforcement learning (RL) seeks policies that choose actions to maximize long-term rewards. When a policy is parameterized, we want to adjust its parameters to improve performance. The policy gradient theorem provides a direct recipe for computing the gradient of expected return with respect to policy parameters, even when the environment’s dynamics are unknown. It uses the likelihood ratio (score function) trick to express the gradient as an expectation over trajectories of a product between the policy’s log-probability gradients and a return-like signal. This turns a seemingly intractable differentiation problem into something estimable by sampling episodes. Different choices of the signal—total return, action-value (Q), advantage (Q − V), or TD residual—yield algorithms with different variance–bias trade-offs, such as REINFORCE and actor-critic. The theorem underpins modern on-policy optimization methods like A3C, PPO, and TRPO, and connects to geometric ideas via natural gradients that follow the steepest ascent in the space of probability distributions. It also generalizes to deterministic policies for continuous actions, where we backpropagate through the action into the Q-function rather than through a log-probability. Overall, the theorem is a cornerstone for sampling-based policy optimization without requiring a model of the environment.
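The likelihood ratio trick mentioned above can be sketched in a few lines. Writing p_\theta(\tau) for the trajectory distribution and R(\tau) for a return-like signal, and assuming the gradient and integral can be interchanged:

\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big].

Because the initial-state and transition factors of p_\theta(\tau) do not depend on \theta, the score \nabla_\theta \log p_\theta(\tau) reduces to \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t), which is exactly the quantity the sampled estimators below use.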
02 Intuition & Analogies
Imagine teaching a beginner to play a game by giving them nudges whenever a move seems to help or hurt the final score. You don’t explain the whole game’s physics; you simply say, “Do more of what led to wins, less of what led to losses.” That is the spirit of policy gradients. Each time the agent chooses an action, we ask: if we increase the probability of this action slightly, does performance improve? The score function (gradient of the log-probability) tells us the direction to push the parameters to make the chosen action more or less likely. Multiplying that directional push by a performance signal credits or blames the action. Using the total return credits everything in a winning episode, but that can be noisy—like praising every move after a victory, even lucky ones. Subtracting a baseline (the value of the state) says, “Only credit the part better than average,” which reduces noise—like tipping a server for service above usual. Actor-critic goes further: it trains a critic to estimate that baseline on the fly, giving steadier feedback. Trust region methods are like safety rails that keep each update from steering too far off course; they say, “Change the policy, but not too much at once.” Natural gradients are like walking uphill with terrain-aware shoes: instead of stepping in raw parameter space, you step using the geometry of the policy’s probability manifold, often making progress with fewer wobbles. For continuous controls (like steering a car), deterministic policy gradients tweak the action directly through how it affects long-term value.
03 Formal Definition
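For a parameterized stochastic policy \pi_\theta(a \mid s) with objective J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \gamma^t r_t\big], one common form of the theorem (conventions for how the discount weights the state distribution vary across texts) is

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t \right],

where \Psi_t may be the return G_t, the action value Q^{\pi}(s_t, a_t), or the advantage A^{\pi}(s_t, a_t) without changing the expectation; using the TD residual \delta_t introduces a small, controlled bias when the value function is approximate. The key formulas below spell out each of these choices.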
04 When to Use
Use policy gradients when action selection must be stochastic or continuous and you have a differentiable policy. They are ideal when you cannot, or prefer not to, model the environment dynamics but can sample trajectories. REINFORCE (using returns) is simple and unbiased but often high variance; it works best in small problems or with strong variance reduction (e.g., baselines, advantage normalization). Actor-critic variants are suitable when you can learn a value function to stabilize and accelerate learning with TD errors. If training becomes unstable due to large policy updates, prefer trust-region-style algorithms like TRPO or its practical cousin PPO, which curb destructive steps by constraining or clipping the policy ratio (a minimal sketch of the clipped objective follows this paragraph). If the parameter space is poorly conditioned (updates oscillate or stall), natural gradients can help by respecting the policy’s probability geometry. For continuous control where sampling stochastic actions is awkward or you want deterministic control at test time, deterministic policy gradients (e.g., DPG/DDPG/TD3) are effective, provided you maintain exploration with noise during training. In RLHF and large-scale policy optimization (e.g., PPO for aligning language models), policy gradients are the backbone because they support black-box reward signals, on-policy updates, and principled credit assignment.
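As a concrete illustration of the clipped objective mentioned above, here is a minimal sketch (the function and variable names, and the epsilon = 0.2 default, are illustrative assumptions rather than any particular library’s API); in a real training loop this would be computed over minibatches together with a value loss and an entropy bonus:

#include <bits/stdc++.h>
using namespace std;

// Sketch of the PPO clipped surrogate loss over a batch of transitions.
// Inputs: per-step log-probabilities under the new and old policies and
// advantage estimates. Returns the negative surrogate so it can be minimized.
double ppo_clip_loss(const vector<double>& logp_new,
                     const vector<double>& logp_old,
                     const vector<double>& advantage,
                     double eps = 0.2) {
    double total = 0.0;
    for (size_t i = 0; i < logp_new.size(); ++i) {
        double ratio = exp(logp_new[i] - logp_old[i]);          // pi_new / pi_old
        double clipped = max(1.0 - eps, min(1.0 + eps, ratio)); // clip(r, 1-eps, 1+eps)
        // Take the minimum of the unclipped and clipped terms (pessimistic bound).
        total += min(ratio * advantage[i], clipped * advantage[i]);
    }
    return -total / max<size_t>(1, logp_new.size());
}

int main() {
    // Tiny made-up batch just to exercise the function.
    vector<double> logp_new{-0.5, -1.2, -0.7};
    vector<double> logp_old{-0.6, -1.0, -0.7};
    vector<double> adv{0.8, -0.3, 0.1};
    cout << "PPO clipped loss: " << ppo_clip_loss(logp_new, logp_old, adv) << "\n";
    return 0;
}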
⚠️ Common Mistakes
• Using total returns without variance reduction: Monte Carlo returns make gradients extremely noisy, slowing or preventing learning. Remedy: subtract a baseline (value function), normalize advantages, or move to actor-critic with TD errors.
• Letting the baseline leak gradient into the policy: if the baseline shares parameters with the policy or is computed from the policy logits, ensure no gradient flows from the baseline through the policy when computing \nabla_\theta \log \pi.
• Oversized learning rates: policy updates that change action probabilities too much can collapse exploration. Remedy: smaller step sizes, entropy regularization, or PPO/TRPO constraints.
• Off-policy sampling without correction: mixing replay data with on-policy gradients requires importance sampling ratios; ignoring them biases updates.
• Misusing discount factors: setting \gamma too close to 1 in long-horizon tasks without proper bootstrapping amplifies variance.
• Not normalizing advantages: large-magnitude advantages can destabilize training; standardize them to zero mean and unit variance per batch (a minimal sketch follows this list).
• Deterministic policy without exploration: DPG requires explicit exploration noise; otherwise learning stalls.
• Ignoring the Fisher geometry: plain SGD on poorly scaled logits can be slow; consider natural gradient or adaptive optimizers.
• Forgetting entropy bonuses in sparse-reward problems: without an exploration drive, policies can prematurely collapse.
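A minimal sketch of the advantage-normalization remedy from the list above (the function name and the epsilon guard are illustrative assumptions); it is typically applied once per batch, just before the policy update:

#include <bits/stdc++.h>
using namespace std;

// Standardize a batch of advantage estimates to zero mean, unit variance.
void normalize_advantages(vector<double>& adv, double eps = 1e-8) {
    if (adv.empty()) return;
    double mean = accumulate(adv.begin(), adv.end(), 0.0) / adv.size();
    double sq = 0.0;
    for (double a : adv) sq += (a - mean) * (a - mean);
    double stddev = sqrt(sq / adv.size()) + eps; // eps guards against zero variance
    for (double& a : adv) a = (a - mean) / stddev;
}

int main() {
    vector<double> adv{2.0, -1.0, 0.5, 3.5};
    normalize_advantages(adv);
    for (double a : adv) cout << a << " "; // standardized advantages
    cout << "\n";
    return 0;
}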
Key Formulas
Performance Objective
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right]
Explanation: Expected discounted return under the policy. This is the function we want to maximize by adjusting the parameters.
Policy Gradient Theorem
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t \right]
Explanation: The gradient of expected return equals the expected sum of score functions times a return-like signal \Psi_t. Different valid choices of \Psi_t trade variance and bias.
Valid Signals
\Psi_t \in \{\, G_t,\; Q^{\pi}(s_t, a_t),\; A^{\pi}(s_t, a_t),\; \delta_t \,\}
Explanation: Using the return, Q-value, advantage, or TD residual gives unbiased (or controlled-bias) estimates under the theorem and common approximations.
Discounted Return
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
Explanation: The cumulative future rewards discounted by \gamma. It provides a simple but high-variance estimator for training.
Advantage Definition
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
Explanation: The advantage measures how much better an action is than average in a state. Subtracting V reduces variance without changing the expected gradient.
TD Residual
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
Explanation: The one-step bootstrapped error used by actor-critic. It provides lower variance at the cost of some bias if V is approximate.
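Why the TD residual is a reasonable stand-in for the advantage: if the critic were exact (V = V^{\pi}), then

\mathbb{E}\big[\delta_t \mid s_t, a_t\big] = \mathbb{E}\big[r_t + \gamma V^{\pi}(s_{t+1}) \mid s_t, a_t\big] - V^{\pi}(s_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) = A^{\pi}(s_t, a_t).

With an approximate critic the equality holds only approximately, which is the source of the bias mentioned above.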
Baseline Property
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right] = 0
Explanation: Any function of the state can be subtracted from the return signal without changing the expected gradient. This is key for variance reduction.
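The baseline property follows in two lines, because b(s) does not depend on the action and the action probabilities sum to one (written for discrete actions; the continuous case replaces the sum with an integral):

\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \Big( \sum_a \pi_\theta(a \mid s) \Big) = b(s)\, \nabla_\theta 1 = 0.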
Likelihood Ratio Trick
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
Explanation: Because the environment dynamics do not depend on \theta, the gradient of the trajectory log-probability depends only on the policy terms.
Deterministic Policy Gradient
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \right]
Explanation: For deterministic policies, we differentiate the Q-function with respect to the action and chain it through the policy.
Fisher Information
F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]
Explanation: Measures the local curvature of the policy distribution manifold. It is central to natural gradient methods.
Natural Gradient
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta)
Explanation: Preconditions the gradient to follow the steepest ascent measured by KL geometry, often improving stability.
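One way to see where the preconditioner comes from: take the steepest-ascent step under a KL budget, using the standard second-order approximation of the KL divergence around the current parameters:

\max_{\Delta\theta}\; \nabla_\theta J(\theta)^{\top} \Delta\theta \quad \text{s.t.} \quad \tfrac{1}{2}\, \Delta\theta^{\top} F(\theta)\, \Delta\theta \le \epsilon \;\;\Longrightarrow\;\; \Delta\theta \propto F(\theta)^{-1}\, \nabla_\theta J(\theta).

This is the same constrained step that TRPO approximates at scale with conjugate-gradient solves.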
TRPO Constraint
\max_\theta\; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\text{old}}}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta
Explanation: Optimize a surrogate objective while capping the expected KL divergence to limit the update size, improving monotonic-improvement guarantees.
PPO Objective
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t \big) \right]
Explanation: Clips the policy ratio to prevent overly large updates, approximating a trust region with a simple, effective loss.
Policy Ratio
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
Explanation: The ratio between new and old action probabilities, used in off-policy correction and PPO/TRPO surrogates.
Complexity Analysis
For the tabular examples below, generating and processing an episode of length T costs O(T · A) arithmetic for the softmax policy updates plus O(T) for the value updates, and memory is O(S · A) for the policy parameters and O(S) for the baseline. With function approximation, the same per-step updates are scaled by the cost of one forward and one backward pass through the policy (and critic) networks; natural gradient and trust-region methods add the cost of solving a linear system involving the Fisher matrix, usually approximated with conjugate gradient.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Simple 1D Random Walk environment.
// States: 0 ... 4 (0 and 4 are terminal). Start at 2.
// Actions: 0 = Left, 1 = Right. Rewards: +1 at state 4, -1 at state 0, otherwise 0.
struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    // Step returns: next_state, reward, done
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

// Softmax policy over 2 actions with tabular parameters theta[s][a].
struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta; // parameters per state and action
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) {
        theta.assign(S, {0.0, 0.0});
    }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]); // for numerical stability
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);

    // Baseline (state-value) V(s)
    vector<double> V(S, 0.0);

    // Hyperparameters
    double gamma = 0.99;    // discount
    double alpha_pi = 0.05; // policy step size
    double alpha_v = 0.05;  // value step size
    int max_episodes = 2000;
    int max_steps = 30;     // safety cap on episode length
    mt19937 gen(42);

    for (int ep = 0; ep < max_episodes; ++ep) {
        // Generate one episode
        vector<int> states, actions;
        vector<double> rewards;
        int s = env.reset();
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            int a = pi.sample_action(s, gen);
            auto [ns, r, d] = env.step(a);
            states.push_back(s);
            actions.push_back(a);
            rewards.push_back(r);
            s = ns; done = d;
        }
        int T = (int)rewards.size();

        // Compute discounted returns G_t
        vector<double> G(T, 0.0);
        double g = 0.0;
        for (int t = T - 1; t >= 0; --t) {
            g = rewards[t] + gamma * g;
            G[t] = g;
        }

        // Policy and value updates using advantage A_t = G_t - V(s_t)
        for (int t = 0; t < T; ++t) {
            int st = states[t], at = actions[t];
            auto p = pi.probs(st);
            double Ahat = G[t] - V[st]; // baseline reduces variance

            // Gradient of log-softmax: d/dtheta(s,a') log pi(a|s) = 1_{a'=a} - pi(a'|s)
            for (int a = 0; a < A; ++a) {
                double grad_logpi = (a == at ? 1.0 : 0.0) - (a == 0 ? p[0] : p[1]);
                pi.theta[st][a] += alpha_pi * grad_logpi * Ahat;
            }
            // Update baseline towards return (value regression)
            V[st] += alpha_v * (G[t] - V[st]);
        }

        // Optional: monitor learning every 200 episodes
        if ((ep + 1) % 200 == 0) {
            // Evaluate greedy tendency from center state
            auto p = pi.probs(2);
            cout << "Episode " << (ep + 1)
                 << ": P(Left|s=2)=" << fixed << setprecision(3) << p[0]
                 << ", P(Right|s=2)=" << p[1] << "\n";
        }
    }

    // Final policy probabilities for all states
    cout << "Final policy (P(Left), P(Right)) per state:\n";
    for (int s = 0; s < S; ++s) {
        auto p = pi.probs(s);
        cout << "s=" << s << ": (" << fixed << setprecision(3) << p[0]
             << ", " << p[1] << ")\n";
    }
    return 0;
}
This program implements REINFORCE with a learned value baseline on a 1D Random Walk. It samples episodes, computes discounted returns G_t, forms advantages A_t = G_t - V(s_t), and updates a tabular softmax policy with the score-function gradient. The value baseline is fit online to reduce variance. Over time, the policy learns to move toward the rewarding terminal state.
#include <bits/stdc++.h>
using namespace std;

struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta;
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) { theta.assign(S, {0.0, 0.0}); }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]);
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);
    vector<double> V(S, 0.0);

    double gamma = 0.99;
    double alpha_pi = 0.05; // actor step size
    double alpha_v = 0.1;   // critic step size (often larger)
    int episodes = 3000;
    int max_steps = 30;
    mt19937 gen(123);

    for(int ep=0; ep<episodes; ++ep){
        int s = env.reset();
        bool done = false;
        for(int t=0; t<max_steps && !done; ++t){
            // Sample action
            int a = pi.sample_action(s, gen);
            auto p = pi.probs(s);
            auto [ns, r, d] = env.step(a);

            // TD error: delta = r + gamma * V(ns) - V(s)
            double delta = r + (d ? 0.0 : gamma * V[ns]) - V[s];

            // Critic update
            V[s] += alpha_v * delta;

            // Actor update using TD error as advantage estimate
            for(int ap=0; ap<A; ++ap){
                double grad_logpi = (ap == a ? 1.0 : 0.0) - (ap == 0 ? p[0] : p[1]);
                pi.theta[s][ap] += alpha_pi * grad_logpi * delta; // online actor-critic
            }

            s = ns; done = d;
        }
        if((ep+1) % 300 == 0){
            auto p2 = pi.probs(2);
            cout << "Ep " << (ep+1) << ": P(Right|2)=" << fixed << setprecision(3) << p2[1] << "\n";
        }
    }

    cout << "Learned policy probs at center state s=2: ";
    auto p2 = pi.probs(2);
    cout << "(" << fixed << setprecision(3) << p2[0] << ", " << p2[1] << ")\n";
    return 0;
}
This is an online actor-critic algorithm. It updates the critic via one-step TD learning and updates the actor using the TD error as an advantage estimate, step-by-step without storing whole episodes. This reduces variance and often learns faster than Monte Carlo REINFORCE.
#include <bits/stdc++.h>
using namespace std;

// Two-armed stochastic bandit with Gaussian rewards.
struct Bandit2 {
    double mu0, mu1; // mean rewards for arms 0 and 1
    mt19937 gen;
    normal_distribution<> noise;
    Bandit2(double m0, double m1, int seed=7) : mu0(m0), mu1(m1), gen(seed), noise(0.0, 1.0) {}
    // Pull arm a -> reward ~ N(mu_a, 1)
    double pull(int a){ return (a==0 ? mu0 : mu1) + noise(gen); }
};

struct Softmax2 {
    array<double,2> theta{0.0, 0.0};
    array<double,2> probs() const {
        double m = max(theta[0], theta[1]);
        double e0 = exp(theta[0]-m), e1 = exp(theta[1]-m);
        double Z = e0 + e1;
        return {e0/Z, e1/Z};
    }
    int sample(mt19937 &g) const {
        auto p = probs();
        uniform_real_distribution<> U(0.0,1.0);
        return (U(g) < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    Bandit2 env(0.0, 0.5); // arm 1 is better
    Softmax2 pi;
    mt19937 gen(1234);

    double alpha = 0.1;    // natural gradient step size
    double baseline = 0.0; // running average reward as baseline
    double beta = 0.01;    // baseline update rate
    double lambda = 0.1;   // damping: the two-parameter softmax is overparameterized,
                           // so its exact Fisher matrix is singular (det = 0); adding
                           // lambda*I keeps the inverse well-defined and bounds the step.

    for(int t=1; t<=5000; ++t){
        auto p = pi.probs();
        int a = pi.sample(gen);
        double r = env.pull(a);
        baseline += beta * (r - baseline);

        // Score function gradient g = (r - b) * (onehot(a) - p)
        array<double,2> onehot{ (a==0)?1.0:0.0, (a==1)?1.0:0.0 };
        array<double,2> g{
            (r - baseline) * (onehot[0] - p[0]),
            (r - baseline) * (onehot[1] - p[1])
        };

        // Damped Fisher for a 2-action softmax: F = diag(p) - p p^T + lambda*I.
        // Compute the 2x2 matrix and its inverse explicitly.
        double p0 = p[0], p1 = p[1];
        double F00 = p0*(1.0 - p0) + lambda;
        double F11 = p1*(1.0 - p1) + lambda;
        double F01 = -p0*p1;
        double F10 = -p0*p1;
        // Inverse of 2x2 matrix [[a,b],[c,d]] is (1/det) * [[d,-b],[-c,a]]
        double det = F00*F11 - F01*F10;
        // Extra guard against numerical issues (not strictly needed once damping is added)
        if (det < 1e-8) det = 1e-8;
        double iF00 = F11 / det;
        double iF01 = -F01 / det;
        double iF10 = -F10 / det;
        double iF11 = F00 / det;

        // Damped natural gradient: theta += alpha * (F + lambda*I)^{-1} * g
        array<double,2> ng{
            iF00 * g[0] + iF01 * g[1],
            iF10 * g[0] + iF11 * g[1]
        };
        pi.theta[0] += alpha * ng[0];
        pi.theta[1] += alpha * ng[1];

        if (t % 500 == 0) {
            auto q = pi.probs();
            cout << "t=" << t << ", P(arm1)=" << fixed << setprecision(3) << q[1]
                 << ", avgR~" << baseline << "\n";
        }
    }

    auto p = pi.probs();
    cout << "Final probabilities: P(arm0)=" << fixed << setprecision(3) << p[0]
         << ", P(arm1)=" << p[1] << "\n";
    return 0;
}
This example demonstrates a natural policy gradient step in the simplest RL setting: a 2-armed bandit. It computes the vanilla policy gradient and preconditions it with a damped inverse Fisher matrix of the 2-action softmax (a small ridge term is needed because the two-parameter softmax is overparameterized, so its exact Fisher matrix is singular), yielding geometry-aware updates that typically converge faster and more stably than plain gradient ascent.