📚 Theory · Advanced

Policy Gradient Theorem

Key Points

  • The policy gradient theorem tells us how to push a stochastic policy’s parameters to increase expected return by following the gradient of expected rewards.
  • It converts the hard-to-differentiate objective into an expectation of score functions (grad log-probabilities) times a return-like signal using the likelihood ratio trick.
  • You can plug in different signals (return, Q-value, advantage, or TD residual) that remain valid under the theorem while trading off variance and bias.
  • Subtracting a baseline such as the value function does not change the expectation of the gradient but can dramatically reduce variance.
  • Actor-critic methods learn the baseline (critic) and use TD errors to stabilize and speed up learning compared to pure REINFORCE.
  • Trust region and clipped objectives constrain policy updates to avoid destructive large steps and improve reliability.
  • Natural gradients precondition the update by the Fisher information matrix, following the steepest ascent in distribution space rather than parameter space.
  • Deterministic policy gradients extend the idea to continuous actions by differentiating through the policy into the action-value function.

Prerequisites

  • Markov Decision Processes (MDPs): Understanding states, actions, rewards, transitions, and discounting is necessary to define policies and returns.
  • Probability and Expectation: The theorem relies on expectations over trajectories and gradients of log-probabilities.
  • Calculus and Gradients: Policy gradients differentiate objectives with respect to parameters using the chain rule and score functions.
  • Stochastic Gradient Descent: Updates are noisy estimates of true gradients obtained via sampling.
  • Value Functions and TD Learning: Actor-critic methods use value estimates and TD errors as lower-variance signals.
  • Linear Algebra: Natural gradients and trust regions involve the Fisher information matrix and KL-divergence constraints.

Detailed Explanation


01 Overview

Reinforcement learning (RL) seeks policies that choose actions to maximize long-term rewards. When a policy is parameterized, we want to adjust its parameters to improve performance. The policy gradient theorem provides a direct recipe for computing the gradient of expected return with respect to policy parameters, even when the environment’s dynamics are unknown. It uses the likelihood ratio (score function) trick to express the gradient as an expectation over trajectories of a product between the policy’s log-probability gradients and a return-like signal. This turns a seemingly intractable differentiation problem into something estimable by sampling episodes. Different choices of the signal—total return, action-value (Q), advantage (Q − V), or TD residual—yield algorithms with different variance–bias trade-offs, such as REINFORCE and actor-critic. The theorem underpins modern on-policy optimization methods like A3C, PPO, and TRPO, and connects to geometric ideas via natural gradients that follow the steepest ascent in the space of probability distributions. It also generalizes to deterministic policies for continuous actions, where we backpropagate through the action into the Q-function rather than through a log-probability. Overall, the theorem is a cornerstone for sampling-based policy optimization without requiring a model of the environment.

02 Intuition & Analogies

Imagine teaching a beginner to play a game by giving them nudges whenever a move seems to help or hurt the final score. You don’t explain the whole game’s physics; you simply say, “Do more of what led to wins, less of what led to losses.” That is the spirit of policy gradients. Each time the agent chooses an action, we ask: if we increase the probability of this action slightly, does performance improve? The score function (gradient of the log-probability) tells us the direction to push the parameters to make the chosen action more or less likely. Multiplying that directional push by a performance signal credits or blames the action. Using the total return credits everything in a winning episode, but that can be noisy—like praising every move after a victory, even lucky ones. Subtracting a baseline (the value of the state) says, “Only credit the part better than average,” which reduces noise—like tipping a server for service above usual. Actor-critic goes further: it trains a critic to estimate that baseline on the fly, giving steadier feedback. Trust region methods are like safety rails that keep each update from steering too far off course; they say, “Change the policy, but not too much at once.” Natural gradients are like walking uphill with terrain-aware shoes: instead of stepping in raw parameter space, you step using the geometry of the policy’s probability manifold, often making progress with fewer wobbles. For continuous controls (like steering a car), deterministic policy gradients tweak the action directly through how it affects long-term value.

03 Formal Definition

Consider an MDP with states $s$, actions $a$, transition dynamics $p(s' \mid s, a)$, reward $r(s, a)$, and discount $\gamma \in [0, 1)$. A differentiable stochastic policy $\pi_\theta(a \mid s)$ defines the performance objective $J(\theta)$, commonly $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \gamma^t r(s_t, a_t) \big]$. The policy gradient theorem states that $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi_t \Big]$, where $\Psi_t$ can be chosen as the return $G_t$, the action-value $Q^{\pi}(s_t, a_t)$, the advantage $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$, or the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The identity relies on the likelihood ratio trick and the Markov property to remove environment dynamics from the gradient. For any baseline $b(s_t)$, $\mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \big] = 0$, so subtracting $b$ does not change the expected gradient. Deterministic policy gradients for continuous actions use a deterministic policy $a = \mu_\theta(s)$ and yield $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \big]$. Natural gradients define a preconditioned direction $\tilde{\nabla}_\theta J = F(\theta)^{-1} \nabla_\theta J$, where $F$ is the Fisher information matrix of the policy, leading to geometry-aware steps. Trust region methods maximize a surrogate objective subject to a KL-divergence constraint, while PPO approximates this with a clipped ratio objective.
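The core step is worth seeing once. Below is a compact version of the standard likelihood-ratio derivation, writing $p_\theta(\tau)$ for the trajectory density, $\rho(s_0)$ for the initial-state distribution, and $R(\tau)$ for the trajectory return (these symbols are introduced only for this derivation):

$$
\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big],
$$

$$
p_\theta(\tau) = \rho(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \;\;\Rightarrow\;\; \nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).
$$

Because the dynamics and initial-state terms do not depend on $\theta$, they vanish from the gradient, leaving an expectation that can be estimated from sampled trajectories.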

04 When to Use

Use policy gradients when the action selection must be stochastic or continuous and you have a differentiable policy. They are ideal when you cannot or prefer not to model the environment dynamics but can sample trajectories. REINFORCE (using returns) is simple and unbiased but often high variance; it works best in small problems or with strong variance reduction (e.g., baselines, advantage normalization). Actor-critic variants are suitable when you can learn a value function to stabilize and accelerate learning by using TD errors. If training becomes unstable due to large policy updates, prefer trust-region-style algorithms like TRPO or its practical cousin PPO, which curb destructive steps by constraining or clipping the policy ratio. If the parameter space is poorly conditioned (updates oscillate or stall), natural gradients can help by respecting the policy's probability geometry. For continuous control where sampling stochastic actions is awkward or you want deterministic control at test time, deterministic policy gradients (e.g., DPG/TD3) are effective, provided you maintain exploration with noise during training. In RLHF and large-scale policy optimization (e.g., PPO for aligning language models), policy gradients are the backbone because they support black-box reward signals, on-policy updates, and principled credit assignment.
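To make the clipping mechanism concrete, here is a minimal sketch of the per-sample PPO clipped surrogate; the function name, default epsilon, and example numbers are illustrative rather than taken from any particular library:

#include <bits/stdc++.h>
using namespace std;

// Per-sample PPO clipped surrogate: L = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
// where r = exp(logp_new - logp_old) is the policy ratio.
double ppo_clip_objective(double logp_new, double logp_old, double advantage, double eps = 0.2) {
    double ratio = exp(logp_new - logp_old);
    double clipped = clamp(ratio, 1.0 - eps, 1.0 + eps);
    return min(ratio * advantage, clipped * advantage); // maximize this (or minimize its negative)
}

int main() {
    // Positive advantage: the new policy raised the action's probability;
    // the clip caps the gain at (1 + eps) * A.
    cout << ppo_clip_objective(log(0.5), log(0.3), 1.0) << "\n";  // ratio ~1.667 -> clipped to 1.2
    // Negative advantage: once the ratio falls below 1 - eps, the min term is constant,
    // so there is no extra incentive (zero gradient) to push the probability further down.
    cout << ppo_clip_objective(log(0.1), log(0.3), -1.0) << "\n"; // -> (1 - eps) * A = -0.8
    return 0;
}

The two printed cases show the clip at work: improvement is capped in the direction the advantage favors, which is exactly how the clipped objective approximates a trust region.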

⚠️ Common Mistakes

  • Using total returns without variance reduction: Monte Carlo returns make gradients extremely noisy, slowing or preventing learning. Remedy: subtract a baseline (value function), normalize advantages, or move to actor-critic with TD errors.
  • Letting the baseline leak gradient into the policy: if the baseline shares parameters with the policy or is computed from the policy logits, ensure no gradient flows from the baseline through the policy when computing $\nabla_\theta \log \pi$.
  • Oversized learning rates: policy updates that change action probabilities too much can collapse exploration. Remedy: smaller step sizes, entropy regularization, or PPO/TRPO constraints.
  • Off-policy sampling without correction: mixing replay data with on-policy gradients requires importance sampling ratios; ignoring them biases updates.
  • Misusing discount factors: setting $\gamma$ too close to 1 in long-horizon tasks without proper bootstrapping amplifies variance.
  • Not normalizing advantages: large-magnitude advantages can destabilize training; standardize to zero mean and unit variance per batch (see the sketch after this list).
  • Deterministic policy without exploration: DPG requires explicit exploration noise; otherwise learning stalls.
  • Ignoring the Fisher geometry: plain SGD on poorly scaled logits can be slow; consider natural gradient or adaptive optimizers.
  • Forgetting entropy bonuses in sparse-reward problems: without an exploration drive, policies can prematurely collapse.
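As a small illustration of the advantage-normalization remedy above, here is a sketch of per-batch standardization (the function name and epsilon value are illustrative assumptions):

#include <bits/stdc++.h>
using namespace std;

// Standardize a batch of advantage estimates to zero mean and unit variance.
// The small epsilon guards against division by zero when the batch is (near-)constant.
void normalize_advantages(vector<double>& adv, double eps = 1e-8) {
    if (adv.empty()) return;
    double mean = accumulate(adv.begin(), adv.end(), 0.0) / adv.size();
    double var = 0.0;
    for (double a : adv) var += (a - mean) * (a - mean);
    var /= adv.size();
    double stdev = sqrt(var) + eps;
    for (double& a : adv) a = (a - mean) / stdev;
}

int main() {
    vector<double> adv = {5.0, -3.0, 12.0, 0.5};
    normalize_advantages(adv);
    for (double a : adv) cout << fixed << setprecision(3) << a << " ";
    cout << "\n"; // roughly zero-mean, unit-variance values multiplying the score functions
    return 0;
}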

Key Formulas

Performance Objective

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \big]$

Explanation: Expected discounted return under the policy. This is the function we want to maximize by adjusting parameters.

Policy Gradient Theorem

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi_t \big]$

Explanation: The gradient of expected return equals the expected sum of score functions times a return-like signal. Different valid choices of $\Psi_t$ trade variance and bias.

Valid Signals

$\Psi_t \in \{\, G_t, \; Q^{\pi}(s_t, a_t), \; A^{\pi}(s_t, a_t), \; \delta_t \,\}$

Explanation: Using the return, Q-value, advantage, or TD residual gives unbiased (or controlled-bias) estimates under the theorem and common approximations.

Discounted Return

$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$

Explanation: The cumulative future rewards discounted by $\gamma$. It provides a simple but high-variance estimator for training.

Advantage Definition

$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$

Explanation: Advantage measures how much better an action is than average in a state. Subtracting $V$ reduces variance without changing the expected gradient.

TD Residual

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

Explanation: One-step bootstrapped error used by actor-critic. It provides lower variance at the cost of some bias if $V$ is approximate.

Baseline Property

$\mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) \big] = 0$

Explanation: Any function of state can be subtracted from the return signal without changing the expected gradient. This is key for variance reduction.

Likelihood Ratio Trick

$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Explanation: Because environment dynamics do not depend on $\theta$, the gradient of the trajectory log-probability depends only on the policy terms.

Deterministic Policy Gradient

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \big]$

Explanation: For deterministic policies, we differentiate the Q-function with respect to the action and chain it through the policy.

Fisher Information

$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big]$

Explanation: Measures local curvature of the policy distribution manifold. It is central to natural gradient methods.

Natural Gradient

$\tilde{\nabla}_\theta J = F(\theta)^{-1} \nabla_\theta J(\theta)$

Explanation: Preconditions the gradient to follow the steepest ascent measured by KL geometry, often improving stability.

TRPO Constraint

$\max_\theta \; \mathbb{E}\big[ \tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \hat{A} \big] \quad \text{s.t.} \quad \mathbb{E}\big[ \mathrm{KL}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \big] \le \delta$

Explanation: Optimize a surrogate objective but cap the expected KL divergence to limit update size, improving monotonic improvement guarantees.

PPO Objective

$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[ \min\big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \big) \big]$

Explanation: Clips the policy ratio to prevent overly large updates, approximating a trust region with a simple, effective loss.

Policy Ratio

$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

Explanation: The ratio between new and old action probabilities, used in off-policy correction and PPO/TRPO surrogates.

Complexity Analysis

Computing a Monte Carlo policy gradient over an episode of length T with a tabular softmax policy over |A| actions requires O(T |A|) time: for each step you compute action probabilities, sample an action, and accumulate score-function terms. If the policy is a neural network with P parameters, forward and backward passes typically cost O(P) per step, leading to O(T P) per trajectory. Actor-critic introduces a critic update with its own parameter count P_c; with TD updates that are O(1) per step per parameter, the overall per-step cost becomes O(P + P_c). Trust region methods add constraints: TRPO solves a constrained optimization that often requires conjugate gradient iterations and Fisher–vector products, raising per-update time to roughly O(K P) for K iterations plus a line search; PPO replaces this with several epochs of minibatch SGD at cost O(E N P), where E is the number of epochs and N the number of collected samples (processed in minibatches). Natural gradient methods require estimating or implicitly multiplying by the inverse Fisher matrix $F(\theta)^{-1}$; explicit inversion is O(P^3) and infeasible for large P, so practical methods use approximate solves (e.g., conjugate gradient), bringing the cost to O(K P). Memory usage for on-policy methods can be O(T) to store trajectories (states, actions, rewards, log-probs, and values), though pure online actor-critic can operate in O(1) extra memory per step. Sample complexity is often the bottleneck due to gradient variance; using baselines, advantage normalization, and TD bootstrapping reduces variance and thus the number of trajectories needed for a reliable gradient estimate. Deterministic policy gradients with replay can amortize sample costs but require careful exploration noise and off-policy stability mechanisms.
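To show why implicit solves scale where explicit inversion does not, here is a minimal conjugate gradient sketch that solves F x = g using only F-times-vector products. The tiny dense matrix is purely for demonstration; real implementations estimate the Fisher–vector product from samples, and all names here are illustrative:

#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient for the symmetric positive-definite system F x = g.
// Only a function computing F * v is needed, so F is never formed or inverted explicitly.
Vec conjugate_gradient(const function<Vec(const Vec&)>& Fv, const Vec& g,
                       int max_iters = 50, double tol = 1e-10) {
    Vec x(g.size(), 0.0);
    Vec r = g;  // residual g - F x, with x = 0 initially
    Vec p = r;  // search direction
    double rs_old = dot(r, r);
    for (int k = 0; k < max_iters && rs_old > tol; ++k) {
        Vec Fp = Fv(p);
        double alpha = rs_old / dot(p, Fp);
        for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Fp[i]; }
        double rs_new = dot(r, r);
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
    }
    return x;
}

int main() {
    // Toy 2x2 symmetric positive-definite "Fisher" matrix and gradient vector.
    // The CG solution approximates F^{-1} g (the natural gradient direction) without inversion.
    vector<Vec> F = {{2.0, 0.3}, {0.3, 1.0}};
    Vec g = {1.0, -0.5};
    auto Fv = [&](const Vec& v) {
        Vec out(F.size(), 0.0);
        for (size_t i = 0; i < F.size(); ++i)
            for (size_t j = 0; j < v.size(); ++j) out[i] += F[i][j] * v[j];
        return out;
    };
    Vec x = conjugate_gradient(Fv, g);
    cout << "Natural gradient direction: " << x[0] << ", " << x[1] << "\n";
    return 0;
}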

Code Examples

REINFORCE with Value Baseline on a 1D Random Walk
#include <bits/stdc++.h>
using namespace std;

// Simple 1D Random Walk environment.
// States: 0 ... 4 (0 and 4 are terminal). Start at 2.
// Actions: 0 = Left, 1 = Right. Rewards: +1 at state 4, -1 at state 0, otherwise 0.
struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    // Step returns: next_state, reward, done
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

// Softmax policy over 2 actions with tabular parameters theta[s][a].
struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta; // parameters per state and action
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) {
        theta.assign(S, {0.0, 0.0});
    }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]); // for numerical stability
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);

    // Baseline (state-value) V(s)
    vector<double> V(S, 0.0);

    // Hyperparameters
    double gamma = 0.99;    // discount
    double alpha_pi = 0.05; // policy step size
    double alpha_v = 0.05;  // value step size
    int max_episodes = 2000;
    int max_steps = 30;     // safety cap on episode length
    mt19937 gen(42);

    for (int ep = 0; ep < max_episodes; ++ep) {
        // Generate one episode
        vector<int> states, actions;
        vector<double> rewards;
        int s = env.reset();
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            int a = pi.sample_action(s, gen);
            auto [ns, r, d] = env.step(a);
            states.push_back(s);
            actions.push_back(a);
            rewards.push_back(r);
            s = ns; done = d;
        }
        int T = (int)rewards.size();

        // Compute discounted returns G_t
        vector<double> G(T, 0.0);
        double g = 0.0;
        for (int t = T - 1; t >= 0; --t) {
            g = rewards[t] + gamma * g;
            G[t] = g;
        }

        // Policy and value updates using advantage A_t = G_t - V(s_t)
        for (int t = 0; t < T; ++t) {
            int st = states[t], at = actions[t];
            auto p = pi.probs(st);
            double Ahat = G[t] - V[st]; // baseline reduces variance

            // Gradient of log-softmax: d/dtheta(s,a') log pi(a|s) = 1_{a'=a} - pi(a'|s)
            for (int a = 0; a < A; ++a) {
                double grad_logpi = (a == at ? 1.0 : 0.0) - (a == 0 ? p[0] : p[1]);
                pi.theta[st][a] += alpha_pi * grad_logpi * Ahat;
            }
            // Update baseline towards return (value regression)
            V[st] += alpha_v * (G[t] - V[st]);
        }

        // Optional: monitor learning every 200 episodes
        if ((ep + 1) % 200 == 0) {
            // Evaluate greedy tendency from center state
            auto p = pi.probs(2);
            cout << "Episode " << (ep + 1)
                 << ": P(Left|s=2)=" << fixed << setprecision(3) << p[0]
                 << ", P(Right|s=2)=" << p[1] << "\n";
        }
    }

    // Final policy probabilities for all states
    cout << "Final policy (P(Left), P(Right)) per state:\n";
    for (int s = 0; s < S; ++s) {
        auto p = pi.probs(s);
        cout << "s=" << s << ": (" << fixed << setprecision(3) << p[0]
             << ", " << p[1] << ")\n";
    }
    return 0;
}

This program implements REINFORCE with a learned value baseline on a 1D Random Walk. It samples episodes, computes discounted returns G_t, forms advantages A_t = G_t - V(s_t), and updates a tabular softmax policy with the score-function gradient. The value baseline is fit online to reduce variance. Over time, the policy learns to move toward the rewarding terminal state.

Time: O(E T |A|) for E episodes and average episode length T
Space: O(SA + S + T) for policy parameters, value table, and an episode buffer
Online Actor-Critic with TD Advantage
#include <bits/stdc++.h>
using namespace std;

struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta;
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) { theta.assign(S, {0.0, 0.0}); }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]);
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);
    vector<double> V(S, 0.0);

    double gamma = 0.99;
    double alpha_pi = 0.05; // actor step size
    double alpha_v = 0.1;   // critic step size (often larger)
    int episodes = 3000;
    int max_steps = 30;
    mt19937 gen(123);

    for(int ep=0; ep<episodes; ++ep){
        int s = env.reset();
        bool done = false;
        for(int t=0; t<max_steps && !done; ++t){
            // Sample action
            int a = pi.sample_action(s, gen);
            auto p = pi.probs(s);
            auto [ns, r, d] = env.step(a);

            // TD error: delta = r + gamma * V(ns) - V(s)
            double delta = r + (d ? 0.0 : gamma * V[ns]) - V[s];

            // Critic update
            V[s] += alpha_v * delta;

            // Actor update using TD error as advantage estimate
            for(int ap=0; ap<A; ++ap){
                double grad_logpi = (ap == a ? 1.0 : 0.0) - (ap == 0 ? p[0] : p[1]);
                pi.theta[s][ap] += alpha_pi * grad_logpi * delta; // online actor-critic
            }

            s = ns; done = d;
        }
        if((ep+1) % 300 == 0){
            auto p2 = pi.probs(2);
            cout << "Ep " << (ep+1) << ": P(Right|2)=" << fixed << setprecision(3) << p2[1] << "\n";
        }
    }

    cout << "Learned policy probs at center state s=2: ";
    auto p2 = pi.probs(2);
    cout << "(" << fixed << setprecision(3) << p2[0] << ", " << p2[1] << ")\n";
    return 0;
}

This is an online actor-critic algorithm. It updates the critic via one-step TD learning and updates the actor using the TD error as an advantage estimate, step-by-step without storing whole episodes. This reduces variance and often learns faster than Monte Carlo REINFORCE.

Time: O(E T |A|) for E episodes and average length T
Space: O(SA + S) in parameters; O(1) additional per-step memory
Natural Policy Gradient in a 2-Armed Bandit
#include <bits/stdc++.h>
using namespace std;

// Two-armed stochastic bandit with Gaussian rewards.
struct Bandit2 {
    double mu0, mu1; // mean rewards for arms 0 and 1
    mt19937 gen;
    normal_distribution<> noise;
    Bandit2(double m0, double m1, int seed=7) : mu0(m0), mu1(m1), gen(seed), noise(0.0, 1.0) {}
    // Pull arm a -> reward ~ N(mu_a, 1)
    double pull(int a){ return (a==0 ? mu0 : mu1) + noise(gen); }
};

struct Softmax2 {
    array<double,2> theta{0.0, 0.0};
    array<double,2> probs() const {
        double m = max(theta[0], theta[1]);
        double e0 = exp(theta[0]-m), e1 = exp(theta[1]-m);
        double Z = e0 + e1;
        return {e0/Z, e1/Z};
    }
    int sample(mt19937 &g) const {
        auto p = probs();
        uniform_real_distribution<> U(0.0,1.0);
        return (U(g) < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    Bandit2 env(0.0, 0.5); // arm 1 is better
    Softmax2 pi;
    mt19937 gen(1234);

    double alpha = 0.1;     // natural gradient step size
    double baseline = 0.0;  // running average reward as baseline
    double beta = 0.01;     // baseline update rate
    double lambda = 1e-3;   // damping: the exact 2-logit softmax Fisher is singular

    for(int t=1; t<=5000; ++t){
        auto p = pi.probs();
        int a = pi.sample(gen);
        double r = env.pull(a);
        baseline += beta * (r - baseline);

        // Score function gradient g = (r - b) * (onehot(a) - p)
        array<double,2> onehot{ (a==0)?1.0:0.0, (a==1)?1.0:0.0 };
        array<double,2> g{
            (r - baseline) * (onehot[0] - p[0]),
            (r - baseline) * (onehot[1] - p[1])
        };

        // Fisher for a 2-action softmax: F = diag(p) - p p^T.
        // With two free logits this matrix is singular (its rows sum to zero), so we add
        // a small ridge term lambda * I before inverting (damped natural gradient).
        double p0 = p[0], p1 = p[1];
        double F00 = p0*(1.0 - p0) + lambda;
        double F11 = p1*(1.0 - p1) + lambda;
        double F01 = -p0*p1;
        double F10 = -p0*p1;
        // Inverse of 2x2 matrix [[a,b],[c,d]] is (1/det) * [[d,-b],[-c,a]];
        // the damping keeps det = 2*lambda*p0*p1 + lambda^2 strictly positive.
        double det = F00*F11 - F01*F10;
        double iF00 = F11 / det;
        double iF01 = -F01 / det;
        double iF10 = -F10 / det;
        double iF11 = F00 / det;

        // Natural gradient: theta += alpha * (F + lambda*I)^{-1} * g
        array<double,2> ng{
            iF00 * g[0] + iF01 * g[1],
            iF10 * g[0] + iF11 * g[1]
        };
        pi.theta[0] += alpha * ng[0];
        pi.theta[1] += alpha * ng[1];

        if (t % 500 == 0) {
            auto q = pi.probs();
            cout << "t=" << t << ", P(arm1)=" << fixed << setprecision(3) << q[1]
                 << ", avgR~" << baseline << "\n";
        }
    }

    auto p = pi.probs();
    cout << "Final probabilities: P(arm0)=" << fixed << setprecision(3) << p[0]
         << ", P(arm1)=" << p[1] << "\n";
    return 0;
}

This example demonstrates a natural policy gradient step in the simplest RL setting: a 2-armed bandit. It computes the vanilla policy gradient and preconditions it by the inverse of a damped Fisher matrix for a 2-action softmax (the exact Fisher of two free logits is singular, so a small ridge term is added before inversion), yielding geometry-aware updates that typically converge faster and more stably than plain gradient ascent.

Time: O(T) overall; O(1) per step (2x2 inverse and a few vector ops)
Space: O(1) parameters and buffers