Discount Factor & Return
Key Points
- The discounted return sums all future rewards but down-weights distant rewards by powers of a discount factor.
- Choosing 0 ≤ γ < 1 guarantees convergence in continuing tasks and limits the effective planning horizon to about 1/(1−γ) steps.
- For episodic tasks with a terminal state, γ can be 1 because the sum is finite.
- A simple backward pass computes all per-time-step returns for an episode in O(T) time and O(T) space.
- The recursive identity G_t = r_{t+1} + γ G_{t+1} enables efficient dynamic programming and TD methods.
- n-step returns mix real rewards with a bootstrap value estimate, trading a small amount of bias for reduced variance.
- Proper indexing matters: rewards are r_{t+1}, not r_t, to align with the next transition.
- If rewards are bounded by R_max and 0 ≤ γ < 1, the absolute return is bounded by R_max/(1−γ).
Prerequisites
- Sequences and series — Understanding how infinite and finite sums behave, especially geometric series, is essential for discounted returns.
- Basic probability and expectation — Value functions are defined as expected returns; Monte Carlo and TD rely on expectations.
- Markov decision processes (MDPs) — States, actions, rewards, and transitions form the context in which discounted returns are used.
- Dynamic programming basics — The recursive structure G_t = r_{t+1} + γ G_{t+1} leads to Bellman equations and DP methods.
- Numerical computing with floating point — Discounted sums can accumulate rounding error; stability considerations matter for implementation.
Detailed Explanation
01 Overview
Discount factor and return are core ideas in reinforcement learning and many time-based optimization problems. The return, denoted G_t, is the total value an agent expects to collect from time t onward. To avoid overemphasizing far-future outcomes and to ensure mathematical convergence in problems without natural endings, each future reward is multiplied by a power of a discount factor γ (gamma), typically with 0 ≤ γ < 1. This produces a geometric weighting where immediate rewards matter most, and each additional step into the future counts slightly less. In episodic tasks that always terminate, setting γ = 1 is safe because the sum is finite; in continuing tasks, γ < 1 guarantees a finite value. The discounted return neatly connects to dynamic programming via a simple recursive decomposition, enabling efficient algorithms like policy evaluation and temporal-difference learning. Practically, γ tunes the agent’s foresight: small γ focuses on short-term gains, while γ close to 1 gives a long-term view but can increase variance and make learning slower. Beyond RL, the same mathematics underlies present value in finance and decaying influence in signal processing—anywhere we need to aggregate a time series with recency preference and convergence guarantees.
02 Intuition & Analogies
Hook: Imagine choosing between $11 tomorrow and a slightly smaller amount today. Most people take the money today: future money feels worth a bit less. Now suppose you receive $5 every day and your personal discount factor is γ = 0.9. Today's $5 counts as 5, tomorrow's $5 counts as 0.9×5 = 4.5, the next day's counts as 0.9^2×5 ≈ 4.05, and so on. Adding them up gives a finite total because those later amounts shrink quickly—like echoes that get quieter with distance. If the stream never ends, γ < 1 keeps the total bounded; if the stream stops after a few days, you can even set γ = 1 since there are only finitely many payments. In decision-making, using discounted returns means you won't chase distant windfalls at the expense of obvious near-term gains, yet by choosing γ close to 1, you'll still account for long-term consequences. This balance is fundamental in training agents, evaluating investments, and prioritizing work over time.
03 Formal Definition
For a trajectory with rewards r_{t+1}, r_{t+2}, …, the discounted return from time t is G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ is the discount factor with 0 ≤ γ ≤ 1 (γ < 1 is required for continuing tasks). In an episodic task that terminates at time T, the sum truncates to G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1}, which is finite even when γ = 1.
04 When to Use
- Reinforcement Learning (RL): Use discounted return to define objectives that balance immediate versus future rewards, especially in continuing tasks where episodes do not end. γ guarantees finite values and stable learning targets.
- Dynamic Programming and TD Learning: The recursion G_t = r_{t+1} + γ G_{t+1} enables Bellman equations and efficient bootstrapping for value estimation. Choose γ based on how farsighted your agent should be and the variance you can tolerate.
- Episodic vs. Continuing Tasks: In episodic tasks with guaranteed termination, you can safely set γ = 1 to value all rewards equally across the finite horizon. In continuing tasks, pick γ < 1 to ensure convergence and to control the effective planning horizon ≈ 1/(1−γ).
- Finance and Economics: Present value of cash flows uses the same math with γ = 1/(1 + discount rate). Use discounting to compare options with payoffs occurring at different times.
- Streaming/Signals: When aggregating signals where recent data should matter more, geometric discounting provides an exponentially decaying memory with simple O(1) updates.
⚠️Common Mistakes
- Setting γ ≥ 1 in continuing tasks: This can make G_t diverge or explode variances. Use 0 ≤ γ < 1 for infinite-horizon problems; allow γ = 1 only for finite episodes.
- Off-by-one indexing: Rewards are r_{t+1} (after the transition), not r_t. Align arrays so that returns at index t sum rewards starting at rewards[t].
- Double-discounting or missing a factor: When mixing Monte Carlo with bootstrapping (n-step returns), ensure exactly k discounts on the k-step-ahead reward and γ^n on the bootstrap term.
- Using pow repeatedly in tight loops: Recomputing γ^k with pow can add both time and floating-point error. Prefer the backward recursion G[i] = r[i] + γ G[i+1].
- Ignoring terminal states: Do not propagate value past terminal states; reset return accumulation at episode boundaries. With γ = 1, forgetting to cut off at a terminal state can cause incorrect sums.
- Choosing γ too close to 1 without enough data: Very large γ inflates variance and lengthens the effective horizon, slowing learning and making estimates noisy. Calibrate γ to the task’s natural timescale.
- Mixing scales of rewards: If rewards have large magnitude or heavy tails, returns may be unstable numerically. Consider reward clipping or normalization together with appropriate γ.
Key Formulas
Discounted return: G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}
Explanation: The total value from time t onward is the sum of future rewards, each multiplied by a power of γ. This converges for 0 ≤ γ < 1 in continuing tasks and equals the objective most RL agents optimize.
Finite-horizon return: G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1}
Explanation: In episodic tasks ending at time T, only the remaining rewards contribute. With a finite sum, γ can be 1 without divergence.
Return recursion: G_t = r_{t+1} + γ G_{t+1}
Explanation: The return decomposes into the immediate reward plus the discounted return from the next time. This identity enables dynamic programming and TD methods.
Return bound: |G_t| ≤ R_max/(1−γ) when |r| ≤ R_max
Explanation: If rewards are uniformly bounded and γ < 1, the discounted sum is absolutely bounded. This ensures numerical stability and theoretical convergence.
Geometric series: Σ_{k=0}^{∞} γ^k = 1/(1−γ)
Explanation: The total weight of all future steps under discounting is finite. It also motivates the effective horizon of about 1/(1−γ) steps.
State value as expected return: V^π(s) = E_π[G_t | S_t = s]
Explanation: A value function is the expected discounted return from a state under a policy. It defines the target that policy evaluation seeks to approximate.
Bellman expectation equation: V^π(s) = E_π[r_{t+1} + γ V^π(S_{t+1}) | S_t = s]
Explanation: Value equals expected immediate reward plus discounted next-state value. This recursive equation is the basis for DP and TD algorithms.
n-step return: G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V(S_{t+n})
Explanation: Combines n actual rewards with a bootstrapped estimate at step n. It trades bias (from V) for reduced variance compared to full returns.
Present value connection: γ = 1/(1+r), so PV = Σ_{k=0}^{∞} c_k/(1+r)^k has the form of a discounted return
Explanation: Financial present value is mathematically identical to discounted returns with γ = 1/(1+r). This links RL discounting to standard economic reasoning.
Incremental mean/TD-style update: V ← V + α (G − V)
Explanation: An incremental update moves the current estimate toward a target G by a fraction α. With α = 1/N, it becomes an online average; with TD, G is a bootstrapped target.
Horizon approximation: effective horizon ≈ 1/(1−γ)
Explanation: Most of the discounted mass lies within about 1/(1−γ) steps. This heuristic helps choose γ for the task's timescale.
Complexity Analysis
Computing all per-time-step returns for an episode of length T via the backward recursion takes O(T) time and O(T) space (for the output array). Maintaining a single running discounted aggregate over a stream needs only O(1) extra space per update.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute per-time-step discounted returns for a single episode.
// rewards[i] corresponds to r_{i+1} in the math notation.
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0; // This will hold G_{i+1} as we move backward
    for (int i = T - 1; i >= 0; --i) {
        // G_i = r_{i+1} + gamma * G_{i+1}
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    // Example: episodic rewards over 5 steps
    // r_1=1, r_2=0, r_3=2, r_4=-1, r_5=3
    vector<double> rewards = {1.0, 0.0, 2.0, -1.0, 3.0};

    double gamma1 = 0.9; // patient
    double gamma2 = 0.5; // more myopic

    auto G1 = discountedReturns(rewards, gamma1);
    auto G2 = discountedReturns(rewards, gamma2);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Discounted returns with gamma=0.9\n";
    for (size_t i = 0; i < G1.size(); ++i) cout << "G[" << i << "] = " << G1[i] << "\n";

    cout << "\nDiscounted returns with gamma=0.5\n";
    for (size_t i = 0; i < G2.size(); ++i) cout << "G[" << i << "] = " << G2[i] << "\n";

    return 0;
}
This program computes G_i = r_{i+1} + γ G_{i+1} for all time steps in an episode using a single backward pass. The accumulator holds G_{i+1} as we move left, ensuring numerical stability and avoiding repeated calls to pow. The example prints the returns for two different discount factors to show how γ changes the emphasis on near-term versus long-term rewards.
#include <bits/stdc++.h>
using namespace std;

struct Episode {
    vector<int> states;     // includes nonterminals and terminal at end
    vector<double> rewards; // rewards[i] is r_{i+1}
};

mt19937 rng(123);

// Generate one episode for the 7-state random walk:
// States 0 and 6 are terminal. Nonterminals are 1..5.
// Start at state 3. Each step moves left/right with prob 0.5.
// Reward is +1 only when entering state 6; otherwise 0.
Episode generate_episode() {
    uniform_real_distribution<double> uni(0.0, 1.0);
    Episode ep;
    int s = 3; // start
    ep.states.push_back(s);
    while (s != 0 && s != 6) {
        double u = uni(rng);
        int a = (u < 0.5 ? -1 : +1); // left or right
        int s_next = s + a;
        double r = (s_next == 6 ? 1.0 : 0.0);
        ep.rewards.push_back(r);
        s = s_next;
        ep.states.push_back(s);
    }
    return ep;
}

// Compute per-time-step returns via backward pass
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0;
    for (int i = T - 1; i >= 0; --i) {
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    const double gamma = 1.0; // episodic, finite-horizon; safe to use γ=1 here
    const int S = 7;          // states 0..6
    const int episodes = 20000;

    vector<double> V(S, 0.0);
    vector<int> N(S, 0); // visit counts for first-visit MC

    for (int e = 0; e < episodes; ++e) {
        Episode ep = generate_episode();
        // Compute returns for each time index in the episode
        vector<double> G = discountedReturns(ep.rewards, gamma);

        // Track first visit times within this episode
        vector<int> first_visit_index(S, -1);
        for (int t = 0; t < (int)ep.states.size() - 1; ++t) { // exclude terminal state's index
            int s = ep.states[t];
            if (first_visit_index[s] == -1) first_visit_index[s] = t;
        }

        // First-visit MC updates for nonterminal states 1..5
        for (int s = 1; s <= 5; ++s) {
            int t = first_visit_index[s];
            if (t != -1) {
                double Gt = G[t];
                N[s] += 1;
                // Incremental mean: V[s] <- V[s] + (G - V[s]) / N[s]
                V[s] += (Gt - V[s]) / (double)N[s];
            }
        }
    }

    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "Estimated state values (states 1..5):\n";
    for (int s = 1; s <= 5; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    return 0;
}
This program estimates state values in the classic 7-state random walk. An episode starts at state 3, moves left/right with equal probability, and terminates on reaching state 0 or 6. The only nonzero reward (+1) occurs when entering state 6. With γ = 1 (episodic), the return is the total reward accumulated until termination. First-visit Monte Carlo averages the return from the first occurrence of each state in an episode to estimate V(s). The discountedReturns function uses the backward recursion for efficient computation.
#include <bits/stdc++.h>
using namespace std;

// Compute G_t^{(n)} = sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n * V(S_{t+n})
// Given: a window of future rewards starting at time t (rewards[t] = r_{t+1}),
// the discount factor gamma, the step size n, and a bootstrap value V(S_{t+n}).
double n_step_return(const vector<double>& rewards_from_t, double gamma, int n, double bootstrap_value) {
    double acc = 0.0;
    double pow_gamma = 1.0;
    int K = min(n, (int)rewards_from_t.size());
    for (int k = 0; k < K; ++k) {
        acc += pow_gamma * rewards_from_t[k];
        pow_gamma *= gamma; // prepare for the next step
    }
    // If we reached fewer than n rewards (due to episode termination),
    // the bootstrap typically should NOT be applied in episodic tasks.
    // Here we only add bootstrap if we actually have n steps available.
    if (n <= (int)rewards_from_t.size()) acc += pow_gamma * bootstrap_value;
    return acc;
}

int main() {
    // Example: suppose from time t onward we observe rewards: 2, 0, -1, 3, ...
    vector<double> future_rewards = {2.0, 0.0, -1.0, 3.0};
    double gamma = 0.9;

    // Assume a value baseline V(S_{t+3}) = 5.0 for n=3
    int n = 3;
    double V_bootstrap = 5.0;

    double G_n = n_step_return(future_rewards, gamma, n, V_bootstrap);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n-step return (n=3, gamma=0.9, V_bootstrap=5.0): " << G_n << "\n";

    // Compare with the full (finite) return without bootstrapping (γ^n term omitted).
    // This equals the Monte Carlo target truncated to the available rewards.
    double full = 0.0, pow_gamma = 1.0;
    for (double r : future_rewards) { full += pow_gamma * r; pow_gamma *= gamma; }
    cout << "Truncated full return (no bootstrap): " << full << "\n";
    return 0;
}
This example computes the n-step return G_t^{(n)} that mixes real rewards for n steps with a bootstrap value estimate at step n. The geometric factor is built iteratively without calling pow repeatedly. If the episode terminates before n steps, the bootstrap is typically omitted (since there is no next state), as shown in the guard condition.