How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Discount Factor & Return

Key Points

  • The discounted return G_t sums all future rewards but down-weights distant rewards by powers of a discount factor γ.
  • Choosing 0 ≤ γ < 1 guarantees convergence in continuing tasks and limits the effective planning horizon to about 1/(1−γ) steps.
  • For episodic tasks with a terminal state, γ can be 1 because the sum is finite.
  • A simple backward pass computes all per-time-step returns for an episode in O(T) time and O(T) space.
  • The recursive identity G_t = r_{t+1} + γ G_{t+1} enables efficient dynamic programming and TD methods.
  • n-step returns mix real rewards with a bootstrap value estimate, trading off bias against variance.
  • Proper indexing matters: rewards are r_{t+1}, not r_t, to align with the next transition.
  • If rewards are bounded by R_max and 0 ≤ γ < 1, the absolute return is bounded by R_max/(1−γ).

Prerequisites

  • Sequences and series — Understanding how infinite and finite sums behave, especially geometric series, is essential for discounted returns.
  • Basic probability and expectation — Value functions are defined as expected returns; Monte Carlo and TD rely on expectations.
  • Markov decision processes (MDPs) — States, actions, rewards, and transitions form the context in which discounted returns are used.
  • Dynamic programming basics — The recursive structure G_t = r_{t+1} + γ G_{t+1} leads to Bellman equations and DP methods.
  • Numerical computing with floating point — Discounted sums can accumulate rounding error; stability considerations matter for implementation.

Detailed Explanation


01 Overview

Discount factor and return are core ideas in reinforcement learning and many time-based optimization problems. The return, denoted G_t, is the total value an agent expects to collect from time t onward. To avoid overemphasizing far-future outcomes and to ensure mathematical convergence in problems without natural endings, each future reward is multiplied by a power of a discount factor γ (gamma), typically with 0 ≤ γ < 1. This produces a geometric weighting where immediate rewards matter most, and each additional step into the future counts slightly less. In episodic tasks that always terminate, setting γ = 1 is safe because the sum is finite; in continuing tasks, γ < 1 guarantees a finite value. The discounted return neatly connects to dynamic programming via a simple recursive decomposition, enabling efficient algorithms like policy evaluation and temporal-difference learning. Practically, γ tunes the agent’s foresight: small γ focuses on short-term gains, while γ close to 1 gives a long-term view but can increase variance and make learning slower. Beyond RL, the same mathematics underlies present value in finance and decaying influence in signal processing—anywhere we need to aggregate a time series with recency preference and convergence guarantees.

02 Intuition & Analogies

Hook: Imagine choosing between $10 now or $11 tomorrow. Most people prefer $10 now because the future is uncertain and waiting has a cost. Discounting encodes this intuition mathematically. Concept: The discount factor γ is like a “patience knob.” If γ is small (say 0.5), you are impatient: a reward tomorrow is worth only half as much as the same reward today. If γ is large (say 0.99), you’re patient: you value future rewards almost as much as immediate ones. The return G_t is the total “worth” of all future rewards, after applying this patience rule repeatedly—each step multiplies by another γ, so the value fades geometrically with delay. Example: Suppose you get a stream of allowance payments: $5 today, then $5 each day after. With γ = 0.9, today’s $5 counts as 5, tomorrow’s $5 counts as 0.9×5 = 4.5, the next day’s counts as 0.9²×5 ≈ 4.05, and so on.
Adding them up gives a finite total because those later amounts shrink quickly—like echoes that get quieter with distance. If the stream never ends, γ < 1 keeps the total bounded; if the stream stops after a few days, you can even set γ = 1 since there are only finitely many payments. In decision-making, using discounted returns means you won’t chase distant windfalls at the expense of obvious near-term gains, yet by choosing γ close to 1, you’ll still account for long-term consequences. This balance is fundamental in training agents, evaluating investments, and prioritizing work over time.

03 Formal Definition

Let {r_{t+1}, r_{t+2}, …} be a sequence of real-valued rewards obtained after times t+1, t+2, … . The (infinite-horizon) discounted return from time t is defined as G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where the discount factor satisfies 0 ≤ γ < 1 to ensure convergence when the sequence is infinite. In episodic tasks of finite length T, we use the finite sum G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1} and may allow γ = 1. A key recursive identity follows directly from the sum’s structure: G_t = r_{t+1} + γ G_{t+1}. This identity underlies Bellman equations and dynamic programming. If rewards are bounded, |r_t| ≤ R_max, and 0 ≤ γ < 1, then |G_t| ≤ R_max/(1−γ). In Markov decision processes under policy π, the state-value function is V^π(s) = E_π[G_t | S_t = s], the expected discounted return starting in state s and following π. Similarly, the action-value is Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]. These expectations provide targets for policy evaluation and improvement. Variants like n-step returns G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V(S_{t+n}) interpolate between pure Monte Carlo (large n) and one-step bootstrapping (n = 1).

04 When to Use

  • Reinforcement Learning (RL): Use discounted return to define objectives that balance immediate versus future rewards, especially in continuing tasks where episodes do not end. γ guarantees finite values and stable learning targets.
  • Dynamic Programming and TD Learning: The recursion G_t = r_{t+1} + γ G_{t+1} enables Bellman equations and efficient bootstrapping for value estimation. Choose γ based on how farsighted your agent should be and the variance you can tolerate.
  • Episodic vs. Continuing Tasks: In episodic tasks with guaranteed termination, you can safely set γ = 1 to value all rewards equally across the finite horizon. In continuing tasks, pick γ < 1 to ensure convergence and to control the effective planning horizon ≈ 1/(1−γ).
  • Finance and Economics: Present value of cash flows uses the same math with γ = 1/(1 + discount rate). Use discounting to compare options with payoffs occurring at different times.
  • Streaming/Signals: When aggregating signals where recent data should matter more, geometric discounting provides an exponentially decaying memory with simple O(1) updates.

⚠️ Common Mistakes

  • Setting γ ≥ 1 in continuing tasks: This can make G_t diverge or explode variances. Use 0 ≤ γ < 1 for infinite-horizon problems; allow γ = 1 only for finite episodes.
  • Off-by-one indexing: Rewards are r_{t+1} (after the transition), not r_t. Align arrays so that returns at index t sum rewards starting at rewards[t].
  • Double-discounting or missing a factor: When mixing Monte Carlo with bootstrapping (n-step returns), ensure exactly k discounts on the k-step-ahead reward and γ^n on the bootstrap term.
  • Using pow repeatedly in tight loops: Recomputing γ^k with pow can add both time and floating-point error. Prefer the backward recursion G[i] = r[i] + γ G[i+1].
  • Ignoring terminal states: Do not propagate value past terminal; reset return accumulation at episode boundaries. For γ = 1, forgetting to cut off at terminal can cause incorrect sums.
  • Choosing γ too close to 1 without enough data: Very large γ inflates variance and lengthens the effective horizon, slowing learning and making estimates noisy. Calibrate γ to the task’s natural timescale.
  • Mixing scales of rewards: If rewards have large magnitude or heavy tails, returns may be unstable numerically. Consider reward clipping or normalization together with appropriate γ.

Key Formulas

Discounted return

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Explanation: The total value from time t onward is the sum of future rewards, each multiplied by a power of γ. This converges for 0 ≤ γ < 1 in continuing tasks and equals the objective most RL agents optimize.

Finite-horizon return

G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1}

Explanation: In episodic tasks ending at time T, only the remaining rewards contribute. With a finite sum, γ can be 1 without divergence.

Return recursion

G_t = r_{t+1} + γ G_{t+1}

Explanation: The return decomposes into the immediate reward plus the discounted return from the next time. This identity enables dynamic programming and TD methods.

Return bound

|G_t| ≤ R_max/(1−γ)  if |r_t| ≤ R_max and 0 ≤ γ < 1

Explanation: If rewards are uniformly bounded and γ < 1, the discounted sum is absolutely bounded. This ensures numerical stability and theoretical convergence.

Geometric series

Σ_{k=0}^{∞} γ^k = 1/(1−γ)  (|γ| < 1)

Explanation: The total weight of all future steps under discounting is finite. It also motivates the effective horizon of about 1/(1−γ) steps.

State value as expected return

V^π(s) = E_π[G_t | S_t = s]

Explanation: A value function is the expected discounted return from a state under a policy. It defines the target that policy evaluation seeks to approximate.

Bellman expectation equation

V^π(s) = E_π[r_{t+1} + γ V^π(S_{t+1}) | S_t = s]

Explanation: Value equals expected immediate reward plus discounted next-state value. This recursive equation is the basis for DP and TD algorithms.

n-step return

G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V(S_{t+n})

Explanation: Combines n actual rewards with a bootstrapped estimate at step n. It trades bias (from V) for reduced variance compared to full returns.

Present value connection

NPV = Σ_{k=0}^{∞} c_{t+k+1}/(1+r)^{k+1} = Σ_{k=0}^{∞} γ^{k+1} c_{t+k+1},  γ = 1/(1+r)

Explanation: Financial present value is mathematically identical to discounted returns with γ = 1/(1+r). This links RL discounting to standard economic reasoning.

Incremental mean/TD-style update

V̂_{n+1}(s) = V̂_n(s) + α (G − V̂_n(s))

Explanation: An incremental update moves the current estimate toward a target G by a fraction α. With α = 1/N, it becomes an online average; with TD, G is a bootstrapped target.

Horizon approximation

Effective horizon ≈ 1/(1−γ)

Explanation: Most of the discounted mass lies within about 1/(1−γ) steps. This heuristic helps choose γ for the task’s timescale.

Complexity Analysis

Computing discounted returns for a single episode of length T via the backward recursion G[i] = r[i] + γ G[i+1] runs in O(T) time and O(T) space if all per-time-step returns are stored, or O(1) extra space if only the running accumulator is kept. This is optimal because every reward must be visited at least once. In contrast, a naive approach that recomputes γ^k with pow or that recomputes partial sums for each i can degrade to O(T^2) time; thus, always prefer the backward dynamic programming pass. For Monte Carlo value estimation over E episodes with average length L, total time is O(E·L), dominated by episode generation and the O(L) return computation per episode. Memory usage is O(S + L) where S is the number of states, due to value tables and temporary storage of trajectories; with streaming updates (first-visit sets cleared each episode), auxiliary memory remains linear in S. For n-step returns, computing a single target takes O(n) time and O(1) space given a bootstrap value, and sliding-window implementations over an episode can amortize to O(1) per step. Numerical stability is generally good, but with γ close to 1 and long horizons, accumulated floating-point error can grow; using double precision and the backward recursion reduces error compared to forward summation with pow. In continuing tasks, γ < 1 keeps values bounded by R_max/(1−γ), which also bounds intermediate computations in practice.

Code Examples

Compute discounted returns for an episode (backward pass)
#include <bits/stdc++.h>
using namespace std;

// Compute per-time-step discounted returns for a single episode.
// rewards[i] corresponds to r_{i+1} in the math notation.
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0; // This will hold G_{i+1} as we move backward
    for (int i = T - 1; i >= 0; --i) {
        // G_i = r_{i+1} + gamma * G_{i+1}
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    // Example: episodic rewards over 5 steps
    // r_1=1, r_2=0, r_3=2, r_4=-1, r_5=3
    vector<double> rewards = {1.0, 0.0, 2.0, -1.0, 3.0};

    double gamma1 = 0.9; // patient
    double gamma2 = 0.5; // more myopic

    auto G1 = discountedReturns(rewards, gamma1);
    auto G2 = discountedReturns(rewards, gamma2);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Discounted returns with gamma=0.9\n";
    for (size_t i = 0; i < G1.size(); ++i) cout << "G[" << i << "] = " << G1[i] << "\n";

    cout << "\nDiscounted returns with gamma=0.5\n";
    for (size_t i = 0; i < G2.size(); ++i) cout << "G[" << i << "] = " << G2[i] << "\n";

    return 0;
}

This program computes G_i = r_{i+1} + γ G_{i+1} for all time steps in an episode using a single backward pass. The accumulator holds G_{i+1} as we move left, ensuring numerical stability and avoiding repeated calls to pow. The example prints the returns for two different discount factors to show how γ changes the emphasis on near-term versus long-term rewards.

Time: O(T) for an episode of length T.
Space: O(T) to store all G; O(1) extra beyond the output array.
First-visit Monte Carlo state-value estimation on a random walk
#include <bits/stdc++.h>
using namespace std;

struct Episode {
    vector<int> states;     // includes nonterminals and terminal at end
    vector<double> rewards; // rewards[i] is r_{i+1}
};

mt19937 rng(123);

// Generate one episode for the 7-state random walk:
// States 0 and 6 are terminal. Nonterminals are 1..5.
// Start at state 3. Each step moves left/right with prob 0.5.
// Reward is +1 only when entering state 6; otherwise 0.
Episode generate_episode() {
    uniform_real_distribution<double> uni(0.0, 1.0);
    Episode ep;
    int s = 3; // start
    ep.states.push_back(s);
    while (s != 0 && s != 6) {
        double u = uni(rng);
        int a = (u < 0.5 ? -1 : +1); // left or right
        int s_next = s + a;
        double r = (s_next == 6 ? 1.0 : 0.0);
        ep.rewards.push_back(r);
        s = s_next;
        ep.states.push_back(s);
    }
    return ep;
}

// Compute per-time-step returns via backward pass
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0;
    for (int i = T - 1; i >= 0; --i) {
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    const double gamma = 1.0; // episodic, finite-horizon; safe to use γ=1 here
    const int S = 7;          // states 0..6
    const int episodes = 20000;

    vector<double> V(S, 0.0);
    vector<int> N(S, 0); // visit counts for first-visit MC

    for (int e = 0; e < episodes; ++e) {
        Episode ep = generate_episode();
        // Compute returns for each time index in the episode
        vector<double> G = discountedReturns(ep.rewards, gamma);

        // Track first visit times within this episode
        vector<int> first_visit_index(S, -1);
        for (int t = 0; t < (int)ep.states.size() - 1; ++t) { // exclude terminal state's index for updates
            int s = ep.states[t];
            if (first_visit_index[s] == -1) first_visit_index[s] = t;
        }

        // First-visit MC updates for nonterminal states 1..5
        for (int s = 1; s <= 5; ++s) {
            int t = first_visit_index[s];
            if (t != -1) {
                double Gt = G[t];
                N[s] += 1;
                // Incremental mean: V[s] <- V[s] + (G - V[s]) / N[s]
                V[s] += (Gt - V[s]) / (double)N[s];
            }
        }
    }

    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "Estimated state values (states 1..5):\n";
    for (int s = 1; s <= 5; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    return 0;
}

This program estimates state values in the classic 7-state random walk. An episode starts at state 3, moves left/right with equal probability, and terminates on reaching state 0 or 6. The only nonzero reward (+1) occurs when entering state 6. With γ = 1 (episodic), the return is the total reward accumulated until termination. First-visit Monte Carlo averages the return from the first occurrence of each state in an episode to estimate V(s). The discountedReturns function uses the backward recursion for efficient computation.

Time: O(E · L) where E is the number of episodes and L is the average episode length.
Space: O(S + L) for value tables and temporary episode storage.
Compute an n-step return with bootstrapping
#include <bits/stdc++.h>
using namespace std;

// Compute G_t^{(n)} = sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n * V(S_{t+n})
// Given: a window of future rewards starting at time t (rewards[t] = r_{t+1}),
// the discount factor gamma, the step size n, and a bootstrap value V(S_{t+n}).
double n_step_return(const vector<double>& rewards_from_t, double gamma, int n, double bootstrap_value) {
    double acc = 0.0;
    double pow_gamma = 1.0;
    int K = min(n, (int)rewards_from_t.size());
    for (int k = 0; k < K; ++k) {
        acc += pow_gamma * rewards_from_t[k];
        pow_gamma *= gamma; // prepare for the next step
    }
    // If we reached fewer than n rewards (due to episode termination),
    // the bootstrap typically should NOT be applied in episodic tasks.
    // Here we only add bootstrap if we actually have n steps available.
    if (n <= (int)rewards_from_t.size()) acc += pow_gamma * bootstrap_value;
    return acc;
}

int main() {
    // Example: suppose from time t onward we observe rewards: 2, 0, -1, 3, ...
    vector<double> future_rewards = {2.0, 0.0, -1.0, 3.0};
    double gamma = 0.9;

    // Assume a value baseline V(S_{t+3}) = 5.0 for n=3
    int n = 3;
    double V_bootstrap = 5.0;

    double G_n = n_step_return(future_rewards, gamma, n, V_bootstrap);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n-step return (n=3, gamma=0.9, V_bootstrap=5.0): " << G_n << "\n";

    // Compare with full (finite) return without bootstrapping (γ^n term omitted)
    // This equals the n->infty Monte Carlo target truncated to the episode length.
    double full = 0.0, pow_gamma = 1.0;
    for (double r : future_rewards) { full += pow_gamma * r; pow_gamma *= gamma; }
    cout << "Truncated full return (no bootstrap): " << full << "\n";
    return 0;
}

This example computes the n-step return G_t^{(n)} that mixes real rewards for n steps with a bootstrap value estimate at step n. The geometric factor is built iteratively without calling pow repeatedly. If the episode terminates before n steps, the bootstrap is typically omitted (since there is no next state), as shown in the guard condition.

Time: O(n) per n-step target.
Space: O(1) auxiliary space.
Tags: discount factor · discounted return · reinforcement learning · monte carlo · temporal-difference · bellman equation · n-step return · geometric series · episodic task · continuing task · value function · present value · bootstrapping · effective horizon · policy evaluation