📚 Theory · Advanced

Policy Gradient Theorem

Key Points

  • The policy gradient theorem tells us how to push a stochastic policy’s parameters to increase expected return by following the gradient of expected rewards.
  • It converts the hard-to-differentiate objective into an expectation of score functions (grad log-probabilities) times a return-like signal using the likelihood ratio trick.
  • You can plug in different signals (return, Q-value, advantage, or TD residual) that remain valid under the theorem while trading off variance and bias.
  • Subtracting a baseline such as the value function does not change the expectation of the gradient but can dramatically reduce variance.
  • Actor-critic methods learn the baseline (critic) and use TD errors to stabilize and speed up learning compared to pure REINFORCE.
  • Trust region and clipped objectives constrain policy updates to avoid destructive large steps and improve reliability.
  • Natural gradients precondition the update by the Fisher information matrix, following the steepest ascent in distribution space rather than parameter space.
  • Deterministic policy gradients extend the idea to continuous actions by differentiating through the policy into the action-value function.

Prerequisites

  • Markov Decision Processes (MDPs): Understanding states, actions, rewards, transitions, and discounting is necessary to define policies and returns.
  • Probability and Expectation: The theorem relies on expectations over trajectories and gradients of log-probabilities.
  • Calculus and Gradients: Policy gradients differentiate objectives with respect to parameters using the chain rule and score functions.
  • Stochastic Gradient Descent: Updates are noisy estimates of true gradients obtained via sampling.
  • Value Functions and TD Learning: Actor-critic methods use value estimates and TD errors as lower-variance signals.
  • Linear Algebra: Natural gradients and trust regions involve the Fisher information matrix and KL-divergence constraints.

Detailed Explanation


01 Overview

Reinforcement learning (RL) seeks policies that choose actions to maximize long-term rewards. When a policy is parameterized, we want to adjust its parameters to improve performance. The policy gradient theorem provides a direct recipe for computing the gradient of expected return with respect to policy parameters, even when the environment’s dynamics are unknown. It uses the likelihood ratio (score function) trick to express the gradient as an expectation over trajectories of a product between the policy’s log-probability gradients and a return-like signal. This turns a seemingly intractable differentiation problem into something estimable by sampling episodes. Different choices of the signal—total return, action-value (Q), advantage (Q − V), or TD residual—yield algorithms with different variance–bias trade-offs, such as REINFORCE and actor-critic. The theorem underpins modern on-policy optimization methods like A3C, PPO, and TRPO, and connects to geometric ideas via natural gradients that follow the steepest ascent in the space of probability distributions. It also generalizes to deterministic policies for continuous actions, where we backpropagate through the action into the Q-function rather than through a log-probability. Overall, the theorem is a cornerstone for sampling-based policy optimization without requiring a model of the environment.

02 Intuition & Analogies

Imagine teaching a beginner to play a game by giving them nudges whenever a move seems to help or hurt the final score. You don’t explain the whole game’s physics; you simply say, “Do more of what led to wins, less of what led to losses.” That is the spirit of policy gradients. Each time the agent chooses an action, we ask: if we increase the probability of this action slightly, does performance improve? The score function (gradient of the log-probability) tells us the direction to push the parameters to make the chosen action more or less likely. Multiplying that directional push by a performance signal credits or blames the action. Using the total return credits everything in a winning episode, but that can be noisy—like praising every move after a victory, even lucky ones. Subtracting a baseline (the value of the state) says, “Only credit the part better than average,” which reduces noise—like tipping a server for service above usual. Actor-critic goes further: it trains a critic to estimate that baseline on the fly, giving steadier feedback. Trust region methods are like safety rails that keep each update from steering too far off course; they say, “Change the policy, but not too much at once.” Natural gradients are like walking uphill with terrain-aware shoes: instead of stepping in raw parameter space, you step using the geometry of the policy’s probability manifold, often making progress with fewer wobbles. For continuous controls (like steering a car), deterministic policy gradients tweak the action directly through how it affects long-term value.

03 Formal Definition

Consider an MDP with states $s$, actions $a$, transition dynamics $p(s' \mid s, a)$, reward $r(s, a)$, and discount $\gamma \in [0, 1)$. A differentiable stochastic policy $\pi_\theta(a \mid s)$ defines the performance objective $J(\theta)$, commonly $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \gamma^t r(s_t, a_t) \big]$. The policy gradient theorem states that $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi_t \Big]$, where $\Psi_t$ can be chosen as the return $G_t$, the action-value $Q^{\pi}(s_t, a_t)$, the advantage $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$, or the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The identity relies on the likelihood ratio trick and the Markov property to remove environment dynamics from the gradient. For any baseline $b(s_t)$, $\mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \big] = 0$, so subtracting $b$ does not change the expected gradient. Deterministic policy gradients for continuous actions use a deterministic policy $a = \mu_\theta(s)$ and yield $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \big]$. Natural gradients define a preconditioned direction $\tilde{\nabla}_\theta J = F(\theta)^{-1} \nabla_\theta J$, where $F$ is the Fisher information matrix of the policy, leading to geometry-aware steps. Trust region methods maximize a surrogate objective subject to a KL-divergence constraint, while PPO approximates this with a clipped ratio objective.
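The core step is worth seeing once. Below is a compact version of the standard likelihood-ratio derivation, writing $p_\theta(\tau)$ for the trajectory density, $\rho(s_0)$ for the initial-state distribution, and $R(\tau)$ for the trajectory return (these symbols are introduced only for this derivation):

$$
\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big],
$$

$$
p_\theta(\tau) = \rho(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \;\;\Rightarrow\;\; \nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t).
$$

Because the dynamics and initial-state terms do not depend on $\theta$, they vanish from the gradient, leaving an expectation that can be estimated from sampled trajectories.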

04 When to Use

Use policy gradients when the action selection must be stochastic or continuous and you have a differentiable policy. They are ideal when you cannot or prefer not to model the environment dynamics but can sample trajectories. REINFORCE (using returns) is simple and unbiased but often high variance; it works best in small problems or with strong variance reduction (e.g., baselines, advantage normalization). Actor-critic variants are suitable when you can learn a value function to stabilize and accelerate learning by using TD errors. If training becomes unstable due to large policy updates, prefer trust-region-style algorithms like TRPO or its practical cousin PPO, which curb destructive steps by constraining or clipping the policy ratio. If the parameter space is poorly conditioned (updates oscillate or stall), natural gradients can help by respecting the policy's probability geometry. For continuous control where sampling stochastic actions is awkward or you want deterministic control at test time, deterministic policy gradients (e.g., DPG/TD3) are effective, provided you maintain exploration with noise during training. In RLHF and large-scale policy optimization (e.g., PPO for aligning language models), policy gradients are the backbone because they support black-box reward signals, on-policy updates, and principled credit assignment.
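To make the clipping mechanism concrete, here is a minimal sketch of the per-sample PPO clipped surrogate; the function name, default epsilon, and example numbers are illustrative rather than taken from any particular library:

#include <bits/stdc++.h>
using namespace std;

// Per-sample PPO clipped surrogate: L = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
// where r = exp(logp_new - logp_old) is the policy ratio.
double ppo_clip_objective(double logp_new, double logp_old, double advantage, double eps = 0.2) {
    double ratio = exp(logp_new - logp_old);
    double clipped = clamp(ratio, 1.0 - eps, 1.0 + eps);
    return min(ratio * advantage, clipped * advantage); // maximize this (or minimize its negative)
}

int main() {
    // Positive advantage: the new policy raised the action's probability;
    // the clip caps the gain at (1 + eps) * A.
    cout << ppo_clip_objective(log(0.5), log(0.3), 1.0) << "\n";  // ratio ~1.667 -> clipped to 1.2
    // Negative advantage: once the ratio falls below 1 - eps, the min term is constant,
    // so there is no extra incentive (zero gradient) to push the probability further down.
    cout << ppo_clip_objective(log(0.1), log(0.3), -1.0) << "\n"; // -> (1 - eps) * A = -0.8
    return 0;
}

The two printed cases show the clip at work: improvement is capped in the direction the advantage favors, which is exactly how the clipped objective approximates a trust region.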

⚠️ Common Mistakes

  • Using total returns without variance reduction: Monte Carlo returns make gradients extremely noisy, slowing or preventing learning. Remedy: subtract a baseline (value function), normalize advantages, or move to actor-critic with TD errors.
  • Letting the baseline leak gradient into the policy: if the baseline shares parameters with the policy or is computed from the policy logits, ensure no gradient flows from the baseline through the policy when computing $\nabla_\theta \log \pi$.
  • Oversized learning rates: policy updates that change action probabilities too much can collapse exploration. Remedy: smaller step sizes, entropy regularization, or PPO/TRPO constraints.
  • Off-policy sampling without correction: mixing replay data with on-policy gradients requires importance sampling ratios; ignoring them biases updates.
  • Misusing discount factors: setting $\gamma$ too close to 1 in long-horizon tasks without proper bootstrapping amplifies variance.
  • Not normalizing advantages: large-magnitude advantages can destabilize training; standardize to zero mean and unit variance per batch (see the sketch after this list).
  • Deterministic policy without exploration: DPG requires explicit exploration noise; otherwise learning stalls.
  • Ignoring the Fisher geometry: plain SGD on poorly scaled logits can be slow; consider natural gradient or adaptive optimizers.
  • Forgetting entropy bonuses in sparse-reward problems: without an exploration drive, policies can prematurely collapse.
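As a small illustration of the advantage-normalization remedy above, here is a sketch of per-batch standardization (the function name and epsilon value are illustrative assumptions):

#include <bits/stdc++.h>
using namespace std;

// Standardize a batch of advantage estimates to zero mean and unit variance.
// The small epsilon guards against division by zero when the batch is (near-)constant.
void normalize_advantages(vector<double>& adv, double eps = 1e-8) {
    if (adv.empty()) return;
    double mean = accumulate(adv.begin(), adv.end(), 0.0) / adv.size();
    double var = 0.0;
    for (double a : adv) var += (a - mean) * (a - mean);
    var /= adv.size();
    double stdev = sqrt(var) + eps;
    for (double& a : adv) a = (a - mean) / stdev;
}

int main() {
    vector<double> adv = {5.0, -3.0, 12.0, 0.5};
    normalize_advantages(adv);
    for (double a : adv) cout << fixed << setprecision(3) << a << " ";
    cout << "\n"; // roughly zero-mean, unit-variance values multiplying the score functions
    return 0;
}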

Key Formulas

Performance Objective

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \big]$

Explanation: Expected discounted return under the policy. This is the function we want to maximize by adjusting parameters.

Policy Gradient Theorem

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \Psi_t \big]$

Explanation: The gradient of expected return equals the expected sum of score functions times a return-like signal. Different valid choices of $\Psi_t$ trade variance and bias.

Valid Signals

$\Psi_t \in \{\, G_t, \; Q^{\pi}(s_t, a_t), \; A^{\pi}(s_t, a_t), \; \delta_t \,\}$

Explanation: Using the return, Q-value, advantage, or TD residual gives unbiased (or controlled-bias) estimates under the theorem and common approximations.

Discounted Return

$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$

Explanation: The cumulative future rewards discounted by $\gamma$. It provides a simple but high-variance estimator for training.

Advantage Definition

$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$

Explanation: Advantage measures how much better an action is than average in a state. Subtracting $V$ reduces variance without changing the expected gradient.

TD Residual

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

Explanation: One-step bootstrapped error used by actor-critic. It provides lower variance at the cost of some bias if $V$ is approximate.

Baseline Property

$\mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) \big] = 0$

Explanation: Any function of state can be subtracted from the return signal without changing the expected gradient. This is key for variance reduction.

Likelihood Ratio Trick

$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Explanation: Because environment dynamics do not depend on $\theta$, the gradient of the trajectory log-probability depends only on the policy terms.

Deterministic Policy Gradient

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \big]$

Explanation: For deterministic policies, we differentiate the Q-function with respect to the action and chain it through the policy.

Fisher Information

$F(\theta) = \mathbb{E}_{s, a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big]$

Explanation: Measures local curvature of the policy distribution manifold. It is central to natural gradient methods.

Natural Gradient

$\tilde{\nabla}_\theta J = F(\theta)^{-1} \nabla_\theta J(\theta)$

Explanation: Preconditions the gradient to follow the steepest ascent measured by KL geometry, often improving stability.

TRPO Constraint

$\max_\theta \; \mathbb{E}\big[ \tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \hat{A} \big] \quad \text{s.t.} \quad \mathbb{E}\big[ \mathrm{KL}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \big] \le \delta$

Explanation: Optimize a surrogate objective but cap the expected KL divergence to limit update size, improving monotonic improvement guarantees.

PPO Objective

$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[ \min\big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \big) \big]$

Explanation: Clips the policy ratio to prevent overly large updates, approximating a trust region with a simple, effective loss.

Policy Ratio

$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

Explanation: The ratio between new and old action probabilities, used in off-policy correction and PPO/TRPO surrogates.

Complexity Analysis

Computing a Monte Carlo policy gradient over an episode of length T with a tabular softmax policy over |A| actions requires O(T |A|) time: for each step you compute action probabilities, sample an action, and accumulate score-function terms. If the policy is a neural network with P parameters, forward and backward passes typically cost O(P) per step, leading to O(T P) per trajectory. Actor-critic introduces a critic update with its own parameter count P_c; with TD updates that are O(1) per step per parameter, the overall per-step cost becomes O(P + P_c). Trust region methods add constraints: TRPO solves a constrained optimization that often requires conjugate gradient iterations and Fisher–vector products, raising per-update time to roughly O(K P) for K iterations plus a line search; PPO replaces this with several epochs of minibatch SGD at cost O(E N P), where E is the number of epochs and N the number of collected samples (processed in minibatches). Natural gradient methods require estimating or implicitly multiplying by the inverse Fisher matrix $F(\theta)^{-1}$; explicit inversion is O(P^3) and infeasible for large P, so practical methods use approximate solves (e.g., conjugate gradient), bringing the cost to O(K P). Memory usage for on-policy methods can be O(T) to store trajectories (states, actions, rewards, log-probs, and values), though pure online actor-critic can operate in O(1) extra memory per step. Sample complexity is often the bottleneck due to gradient variance; using baselines, advantage normalization, and TD bootstrapping reduces variance and thus the number of trajectories needed for a reliable gradient estimate. Deterministic policy gradients with replay can amortize sample costs but require careful exploration noise and off-policy stability mechanisms.
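To show why implicit solves scale where explicit inversion does not, here is a minimal conjugate gradient sketch that solves F x = g using only F-times-vector products. The tiny dense matrix is purely for demonstration; real implementations estimate the Fisher–vector product from samples, and all names here are illustrative:

#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient for the symmetric positive-definite system F x = g.
// Only a function computing F * v is needed, so F is never formed or inverted explicitly.
Vec conjugate_gradient(const function<Vec(const Vec&)>& Fv, const Vec& g,
                       int max_iters = 50, double tol = 1e-10) {
    Vec x(g.size(), 0.0);
    Vec r = g;  // residual g - F x, with x = 0 initially
    Vec p = r;  // search direction
    double rs_old = dot(r, r);
    for (int k = 0; k < max_iters && rs_old > tol; ++k) {
        Vec Fp = Fv(p);
        double alpha = rs_old / dot(p, Fp);
        for (size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Fp[i]; }
        double rs_new = dot(r, r);
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rs_new / rs_old) * p[i];
        rs_old = rs_new;
    }
    return x;
}

int main() {
    // Toy 2x2 symmetric positive-definite "Fisher" matrix and gradient vector.
    // The CG solution approximates F^{-1} g (the natural gradient direction) without inversion.
    vector<Vec> F = {{2.0, 0.3}, {0.3, 1.0}};
    Vec g = {1.0, -0.5};
    auto Fv = [&](const Vec& v) {
        Vec out(F.size(), 0.0);
        for (size_t i = 0; i < F.size(); ++i)
            for (size_t j = 0; j < v.size(); ++j) out[i] += F[i][j] * v[j];
        return out;
    };
    Vec x = conjugate_gradient(Fv, g);
    cout << "Natural gradient direction: " << x[0] << ", " << x[1] << "\n";
    return 0;
}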

Code Examples

REINFORCE with Value Baseline on a 1D Random Walk
#include <bits/stdc++.h>
using namespace std;

// Simple 1D Random Walk environment.
// States: 0 ... 4 (0 and 4 are terminal). Start at 2.
// Actions: 0 = Left, 1 = Right. Rewards: +1 at state 4, -1 at state 0, otherwise 0.
struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    // Step returns: next_state, reward, done
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

// Softmax policy over 2 actions with tabular parameters theta[s][a].
struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta; // parameters per state and action
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) {
        theta.assign(S, {0.0, 0.0});
    }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]); // for numerical stability
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);

    // Baseline (state-value) V(s)
    vector<double> V(S, 0.0);

    // Hyperparameters
    double gamma = 0.99;    // discount
    double alpha_pi = 0.05; // policy step size
    double alpha_v = 0.05;  // value step size
    int max_episodes = 2000;
    int max_steps = 30;     // safety cap on episode length
    mt19937 gen(42);

    for (int ep = 0; ep < max_episodes; ++ep) {
        // Generate one episode
        vector<int> states, actions;
        vector<double> rewards;
        int s = env.reset();
        bool done = false;
        for (int t = 0; t < max_steps && !done; ++t) {
            int a = pi.sample_action(s, gen);
            auto [ns, r, d] = env.step(a);
            states.push_back(s);
            actions.push_back(a);
            rewards.push_back(r);
            s = ns; done = d;
        }
        int T = (int)rewards.size();

        // Compute discounted returns G_t
        vector<double> G(T, 0.0);
        double g = 0.0;
        for (int t = T - 1; t >= 0; --t) {
            g = rewards[t] + gamma * g;
            G[t] = g;
        }

        // Policy and value updates using advantage A_t = G_t - V(s_t)
        for (int t = 0; t < T; ++t) {
            int st = states[t], at = actions[t];
            auto p = pi.probs(st);
            double Ahat = G[t] - V[st]; // baseline reduces variance

            // Gradient of log-softmax: d/dtheta(s,a') log pi(a|s) = 1_{a'=a} - pi(a'|s)
            for (int a = 0; a < A; ++a) {
                double grad_logpi = (a == at ? 1.0 : 0.0) - (a == 0 ? p[0] : p[1]);
                pi.theta[st][a] += alpha_pi * grad_logpi * Ahat;
            }
            // Update baseline towards return (value regression)
            V[st] += alpha_v * (G[t] - V[st]);
        }

        // Optional: monitor learning every 200 episodes
        if ((ep + 1) % 200 == 0) {
            // Evaluate greedy tendency from center state
            auto p = pi.probs(2);
            cout << "Episode " << (ep + 1)
                 << ": P(Left|s=2)=" << fixed << setprecision(3) << p[0]
                 << ", P(Right|s=2)=" << p[1] << "\n";
        }
    }

    // Final policy probabilities for all states
    cout << "Final policy (P(Left), P(Right)) per state:\n";
    for (int s = 0; s < S; ++s) {
        auto p = pi.probs(s);
        cout << "s=" << s << ": (" << fixed << setprecision(3) << p[0]
             << ", " << p[1] << ")\n";
    }
    return 0;
}

This program implements REINFORCE with a learned value baseline on a 1D Random Walk. It samples episodes, computes discounted returns G_t, forms advantages A_t = G_t - V(s_t), and updates a tabular softmax policy with the score-function gradient. The value baseline is fit online to reduce variance. Over time, the policy learns to move toward the rewarding terminal state.

Time: O(E T |A|) for E episodes and average episode length T
Space: O(SA + S + T) for policy parameters, value table, and an episode buffer
Online Actor-Critic with TD Advantage
#include <bits/stdc++.h>
using namespace std;

struct RandomWalkEnv {
    int n_states = 5;
    int start = 2;
    int s;
    RandomWalkEnv() { reset(); }
    int reset() { s = start; return s; }
    tuple<int, double, bool> step(int a) {
        if (s == 0 || s == 4) return {s, 0.0, true};
        int ns = s + (a == 1 ? 1 : -1);
        ns = max(0, min(4, ns));
        double r = (ns == 4 ? 1.0 : (ns == 0 ? -1.0 : 0.0));
        bool done = (ns == 0 || ns == 4);
        s = ns;
        return {ns, r, done};
    }
};

struct SoftmaxPolicy {
    int S, A;
    vector<array<double,2>> theta;
    SoftmaxPolicy(int S_, int A_=2) : S(S_), A(A_) { theta.assign(S, {0.0, 0.0}); }
    array<double,2> probs(int s) const {
        double m = max(theta[s][0], theta[s][1]);
        double e0 = exp(theta[s][0] - m);
        double e1 = exp(theta[s][1] - m);
        double Z = e0 + e1;
        return { e0 / Z, e1 / Z };
    }
    int sample_action(int s, mt19937 &gen) const {
        auto p = probs(s);
        uniform_real_distribution<> dist(0.0, 1.0);
        double u = dist(gen);
        return (u < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkEnv env;
    int S = env.n_states, A = 2;
    SoftmaxPolicy pi(S, A);
    vector<double> V(S, 0.0);

    double gamma = 0.99;
    double alpha_pi = 0.05; // actor step size
    double alpha_v = 0.1;   // critic step size (often larger)
    int episodes = 3000;
    int max_steps = 30;
    mt19937 gen(123);

    for(int ep=0; ep<episodes; ++ep){
        int s = env.reset();
        bool done = false;
        for(int t=0; t<max_steps && !done; ++t){
            // Sample action
            int a = pi.sample_action(s, gen);
            auto p = pi.probs(s);
            auto [ns, r, d] = env.step(a);

            // TD error: delta = r + gamma * V(ns) - V(s)
            double delta = r + (d ? 0.0 : gamma * V[ns]) - V[s];

            // Critic update
            V[s] += alpha_v * delta;

            // Actor update using TD error as advantage estimate
            for(int ap=0; ap<A; ++ap){
                double grad_logpi = (ap == a ? 1.0 : 0.0) - (ap == 0 ? p[0] : p[1]);
                pi.theta[s][ap] += alpha_pi * grad_logpi * delta; // online actor-critic
            }

            s = ns; done = d;
        }
        if((ep+1) % 300 == 0){
            auto p2 = pi.probs(2);
            cout << "Ep " << (ep+1) << ": P(Right|2)=" << fixed << setprecision(3) << p2[1] << "\n";
        }
    }

    cout << "Learned policy probs at center state s=2: ";
    auto p2 = pi.probs(2);
    cout << "(" << fixed << setprecision(3) << p2[0] << ", " << p2[1] << ")\n";
    return 0;
}

This is an online actor-critic algorithm. It updates the critic via one-step TD learning and updates the actor using the TD error as an advantage estimate, step-by-step without storing whole episodes. This reduces variance and often learns faster than Monte Carlo REINFORCE.

Time: O(E T |A|) for E episodes and average length T
Space: O(SA + S) in parameters; O(1) additional per-step memory
Natural Policy Gradient in a 2-Armed Bandit
#include <bits/stdc++.h>
using namespace std;

// Two-armed stochastic bandit with Gaussian rewards.
struct Bandit2 {
    double mu0, mu1; // mean rewards for arms 0 and 1
    mt19937 gen;
    normal_distribution<> noise;
    Bandit2(double m0, double m1, int seed=7) : mu0(m0), mu1(m1), gen(seed), noise(0.0, 1.0) {}
    // Pull arm a -> reward ~ N(mu_a, 1)
    double pull(int a){ return (a==0 ? mu0 : mu1) + noise(gen); }
};

struct Softmax2 {
    array<double,2> theta{0.0, 0.0};
    array<double,2> probs() const {
        double m = max(theta[0], theta[1]);
        double e0 = exp(theta[0]-m), e1 = exp(theta[1]-m);
        double Z = e0 + e1;
        return {e0/Z, e1/Z};
    }
    int sample(mt19937 &g) const {
        auto p = probs();
        uniform_real_distribution<> U(0.0,1.0);
        return (U(g) < p[0]) ? 0 : 1;
    }
};

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    Bandit2 env(0.0, 0.5); // arm 1 is better
    Softmax2 pi;
    mt19937 gen(1234);

    double alpha = 0.1;     // natural gradient step size
    double baseline = 0.0;  // running average reward as baseline
    double beta = 0.01;     // baseline update rate
    double lambda = 1e-3;   // damping: the exact 2-logit softmax Fisher is singular

    for(int t=1; t<=5000; ++t){
        auto p = pi.probs();
        int a = pi.sample(gen);
        double r = env.pull(a);
        baseline += beta * (r - baseline);

        // Score function gradient g = (r - b) * (onehot(a) - p)
        array<double,2> onehot{ (a==0)?1.0:0.0, (a==1)?1.0:0.0 };
        array<double,2> g{
            (r - baseline) * (onehot[0] - p[0]),
            (r - baseline) * (onehot[1] - p[1])
        };

        // Fisher for a 2-action softmax: F = diag(p) - p p^T.
        // With two free logits this matrix is singular (its rows sum to zero), so we add
        // a small ridge term lambda * I before inverting (damped natural gradient).
        double p0 = p[0], p1 = p[1];
        double F00 = p0*(1.0 - p0) + lambda;
        double F11 = p1*(1.0 - p1) + lambda;
        double F01 = -p0*p1;
        double F10 = -p0*p1;
        // Inverse of 2x2 matrix [[a,b],[c,d]] is (1/det) * [[d,-b],[-c,a]];
        // the damping keeps det = 2*lambda*p0*p1 + lambda^2 strictly positive.
        double det = F00*F11 - F01*F10;
        double iF00 = F11 / det;
        double iF01 = -F01 / det;
        double iF10 = -F10 / det;
        double iF11 = F00 / det;

        // Natural gradient: theta += alpha * (F + lambda*I)^{-1} * g
        array<double,2> ng{
            iF00 * g[0] + iF01 * g[1],
            iF10 * g[0] + iF11 * g[1]
        };
        pi.theta[0] += alpha * ng[0];
        pi.theta[1] += alpha * ng[1];

        if (t % 500 == 0) {
            auto q = pi.probs();
            cout << "t=" << t << ", P(arm1)=" << fixed << setprecision(3) << q[1]
                 << ", avgR~" << baseline << "\n";
        }
    }

    auto p = pi.probs();
    cout << "Final probabilities: P(arm0)=" << fixed << setprecision(3) << p[0]
         << ", P(arm1)=" << p[1] << "\n";
    return 0;
}

This example demonstrates a natural policy gradient step in the simplest RL setting: a 2-armed bandit. It computes the vanilla policy gradient and preconditions it by the inverse of a damped Fisher matrix for a 2-action softmax (the exact Fisher of two free logits is singular, so a small ridge term is added before inversion), yielding geometry-aware updates that typically converge faster and more stably than plain gradient ascent.

Time: O(T) overall; O(1) per step (2x2 inverse and a few vector ops)
Space: O(1) parameters and buffers