How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Discount Factor & Return

Key Points

  • The discounted return G_t sums all future rewards but down-weights distant rewards by powers of a discount factor γ.
  • Choosing 0 ≤ γ < 1 guarantees convergence in continuing tasks and limits the effective planning horizon to about 1/(1−γ) steps.
  • For episodic tasks with a terminal state, γ can be 1 because the sum is finite.
  • A simple backward pass computes all per-time-step returns for an episode in O(T) time and O(T) space.
  • The recursive identity G_t = r_{t+1} + γ G_{t+1} enables efficient dynamic programming and TD methods.
  • n-step returns mix real rewards with a bootstrap value estimate, trading off bias against variance.
  • Proper indexing matters: rewards are r_{t+1}, not r_t, to align with the next transition.
  • If rewards are bounded by R_max and 0 ≤ γ < 1, the absolute return is bounded by R_max/(1−γ).

Prerequisites

  • Sequences and series — Understanding how infinite and finite sums behave, especially geometric series, is essential for discounted returns.
  • Basic probability and expectation — Value functions are defined as expected returns; Monte Carlo and TD rely on expectations.
  • Markov decision processes (MDPs) — States, actions, rewards, and transitions form the context in which discounted returns are used.
  • Dynamic programming basics — The recursive structure G_t = r_{t+1} + γ G_{t+1} leads to Bellman equations and DP methods.
  • Numerical computing with floating point — Discounted sums can accumulate rounding error; stability considerations matter for implementation.

Detailed Explanation


01 Overview

Discount factor and return are core ideas in reinforcement learning and many time-based optimization problems. The return, denoted G_t, is the total value an agent expects to collect from time t onward. To avoid overemphasizing far-future outcomes and to ensure mathematical convergence in problems without natural endings, each future reward is multiplied by a power of a discount factor γ (gamma), typically with 0 ≤ γ < 1. This produces a geometric weighting where immediate rewards matter most, and each additional step into the future counts slightly less. In episodic tasks that always terminate, setting γ = 1 is safe because the sum is finite; in continuing tasks, γ < 1 guarantees a finite value. The discounted return neatly connects to dynamic programming via a simple recursive decomposition, enabling efficient algorithms like policy evaluation and temporal-difference learning. Practically, γ tunes the agent’s foresight: small γ focuses on short-term gains, while γ close to 1 gives a long-term view but can increase variance and make learning slower. Beyond RL, the same mathematics underlies present value in finance and decaying influence in signal processing—anywhere we need to aggregate a time series with recency preference and convergence guarantees.

02 Intuition & Analogies

Hook: Imagine choosing between $10 now or $11 tomorrow. Most people prefer $10 now because the future is uncertain and waiting has a cost. Discounting encodes this intuition mathematically. Concept: The discount factor γ is like a “patience knob.” If γ is small (say 0.5), you are impatient: a reward tomorrow is worth only half as much as the same reward today. If γ is large (say 0.99), you’re patient: you value future rewards almost as much as immediate ones. The return G_t is the total “worth” of all future rewards, after applying this patience rule repeatedly—each step multiplies by another γ, so the value fades geometrically with delay. Example: Suppose you get a stream of allowance payments: $5 today, then $5 each day after. With γ = 0.9, today’s $5 counts as 5, tomorrow’s $5 counts as 0.9×5 = 4.5, the next day’s counts as 0.9²×5 ≈ 4.05, and so on.
Adding them up gives a finite total because those later amounts shrink quickly—like echoes that get quieter with distance. If the stream never ends, γ < 1 keeps the total bounded; if the stream stops after a few days, you can even set γ = 1 since there are only finitely many payments. In decision-making, using discounted returns means you won’t chase distant windfalls at the expense of obvious near-term gains, yet by choosing γ close to 1, you’ll still account for long-term consequences. This balance is fundamental in training agents, evaluating investments, and prioritizing work over time.

03 Formal Definition

Let {r_{t+1}, r_{t+2}, …} be a sequence of real-valued rewards obtained after times t+1, t+2, … . The (infinite-horizon) discounted return from time t is defined as G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where the discount factor satisfies 0 ≤ γ < 1 to ensure convergence when the sequence is infinite. In episodic tasks of finite length T, we use the finite sum G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1} and may allow γ = 1. A key recursive identity follows directly from the sum’s structure: G_t = r_{t+1} + γ G_{t+1}. This identity underlies Bellman equations and dynamic programming. If rewards are bounded, |r_t| ≤ R_max, and 0 ≤ γ < 1, then |G_t| ≤ R_max/(1−γ). In Markov decision processes under policy π, the state-value function is V^π(s) = E_π[G_t | S_t = s], the expected discounted return starting in state s and following π. Similarly, the action-value is Q^π(s,a) = E_π[G_t | S_t = s, A_t = a]. These expectations provide targets for policy evaluation and improvement. Variants like n-step returns G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V(S_{t+n}) interpolate between pure Monte Carlo (large n) and one-step bootstrapping (n = 1).

04 When to Use

  • Reinforcement Learning (RL): Use discounted return to define objectives that balance immediate versus future rewards, especially in continuing tasks where episodes do not end. γ guarantees finite values and stable learning targets.
  • Dynamic Programming and TD Learning: The recursion G_t = r_{t+1} + γ G_{t+1} enables Bellman equations and efficient bootstrapping for value estimation. Choose γ based on how farsighted your agent should be and the variance you can tolerate.
  • Episodic vs. Continuing Tasks: In episodic tasks with guaranteed termination, you can safely set γ = 1 to value all rewards equally across the finite horizon. In continuing tasks, pick γ < 1 to ensure convergence and to control the effective planning horizon ≈ 1/(1−γ).
  • Finance and Economics: Present value of cash flows uses the same math with γ = 1/(1 + discount rate). Use discounting to compare options with payoffs occurring at different times.
  • Streaming/Signals: When aggregating signals where recent data should matter more, geometric discounting provides an exponentially decaying memory with simple O(1) updates.

⚠️ Common Mistakes

  • Setting γ ≥ 1 in continuing tasks: This can make G_t diverge or explode variances. Use 0 ≤ γ < 1 for infinite-horizon problems; allow γ = 1 only for finite episodes.
  • Off-by-one indexing: Rewards are r_{t+1} (after the transition), not r_t. Align arrays so that returns at index t sum rewards starting at rewards[t].
  • Double-discounting or missing a factor: When mixing Monte Carlo with bootstrapping (n-step returns), ensure exactly k discounts on the k-step-ahead reward and γ^n on the bootstrap term.
  • Using pow repeatedly in tight loops: Recomputing γ^k with pow can add both time and floating-point error. Prefer the backward recursion G[i] = r[i] + γ G[i+1].
  • Ignoring terminal states: Do not propagate value past terminal; reset return accumulation at episode boundaries. For γ = 1, forgetting to cut off at terminal can cause incorrect sums.
  • Choosing γ too close to 1 without enough data: Very large γ inflates variance and lengthens the effective horizon, slowing learning and making estimates noisy. Calibrate γ to the task’s natural timescale.
  • Mixing scales of rewards: If rewards have large magnitude or heavy tails, returns may be unstable numerically. Consider reward clipping or normalization together with appropriate γ.

Key Formulas

Discounted return

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Explanation: The total value from time t onward is the sum of future rewards, each multiplied by a power of γ. This converges for 0 ≤ γ < 1 in continuing tasks and equals the objective most RL agents optimize.

Finite-horizon return

G_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k+1}

Explanation: In episodic tasks ending at time T, only the remaining rewards contribute. With a finite sum, γ can be 1 without divergence.

Return recursion

G_t = r_{t+1} + γ G_{t+1}

Explanation: The return decomposes into the immediate reward plus the discounted return from the next time. This identity enables dynamic programming and TD methods.

Return bound

|G_t| ≤ R_max/(1−γ)  if |r_t| ≤ R_max and 0 ≤ γ < 1

Explanation: If rewards are uniformly bounded and γ < 1, the discounted sum is absolutely bounded. This ensures numerical stability and theoretical convergence.

Geometric series

Σ_{k=0}^{∞} γ^k = 1/(1−γ)  (|γ| < 1)

Explanation: The total weight of all future steps under discounting is finite. It also motivates the effective horizon of about 1/(1−γ) steps.

State value as expected return

V^π(s) = E_π[G_t | S_t = s]

Explanation: A value function is the expected discounted return from a state under a policy. It defines the target that policy evaluation seeks to approximate.

Bellman expectation equation

V^π(s) = E_π[r_{t+1} + γ V^π(S_{t+1}) | S_t = s]

Explanation: Value equals expected immediate reward plus discounted next-state value. This recursive equation is the basis for DP and TD algorithms.

n-step return

G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V(S_{t+n})

Explanation: Combines n actual rewards with a bootstrapped estimate at step n. It trades bias (from V) for reduced variance compared to full returns.

Present value connection

NPV = Σ_{k=0}^{∞} c_{t+k+1}/(1+r)^{k+1} = Σ_{k=0}^{∞} γ^{k+1} c_{t+k+1},  γ = 1/(1+r)

Explanation: Financial present value is mathematically identical to discounted returns with γ = 1/(1+r). This links RL discounting to standard economic reasoning.

Incremental mean/TD-style update

V̂_{n+1}(s) = V̂_n(s) + α (G − V̂_n(s))

Explanation: An incremental update moves the current estimate toward a target G by a fraction α. With α = 1/N, it becomes an online average; with TD, G is a bootstrapped target.

Horizon approximation

Effective horizon ≈ 1/(1−γ)

Explanation: Most of the discounted mass lies within about 1/(1−γ) steps. This heuristic helps choose γ for the task’s timescale.

Complexity Analysis

Computing discounted returns for a single episode of length T via the backward recursion G[i] = r[i] + γ G[i+1] runs in O(T) time and O(T) space if all per-time-step returns are stored, or O(1) extra space if only the running accumulator is kept. This is optimal because every reward must be visited at least once. In contrast, a naive approach that recomputes γ^k with pow or that recomputes partial sums for each i can degrade to O(T^2) time; thus, always prefer the backward dynamic programming pass. For Monte Carlo value estimation over E episodes with average length L, total time is O(E·L), dominated by episode generation and the O(L) return computation per episode. Memory usage is O(S + L) where S is the number of states, due to value tables and temporary storage of trajectories; with streaming updates (first-visit sets cleared each episode), auxiliary memory remains linear in S. For n-step returns, computing a single target takes O(n) time and O(1) space given a bootstrap value, and sliding-window implementations over an episode can amortize to O(1) per step. Numerical stability is generally good, but with γ close to 1 and long horizons, accumulated floating-point error can grow; using double precision and the backward recursion reduces error compared to forward summation with pow. In continuing tasks, γ < 1 keeps values bounded by R_max/(1−γ), which also bounds intermediate computations in practice.

Code Examples

Compute discounted returns for an episode (backward pass)
#include <bits/stdc++.h>
using namespace std;

// Compute per-time-step discounted returns for a single episode.
// rewards[i] corresponds to r_{i+1} in the math notation.
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0; // This will hold G_{i+1} as we move backward
    for (int i = T - 1; i >= 0; --i) {
        // G_i = r_{i+1} + gamma * G_{i+1}
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    // Example: episodic rewards over 5 steps
    // r_1=1, r_2=0, r_3=2, r_4=-1, r_5=3
    vector<double> rewards = {1.0, 0.0, 2.0, -1.0, 3.0};

    double gamma1 = 0.9; // patient
    double gamma2 = 0.5; // more myopic

    auto G1 = discountedReturns(rewards, gamma1);
    auto G2 = discountedReturns(rewards, gamma2);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Discounted returns with gamma=0.9\n";
    for (size_t i = 0; i < G1.size(); ++i) cout << "G[" << i << "] = " << G1[i] << "\n";

    cout << "\nDiscounted returns with gamma=0.5\n";
    for (size_t i = 0; i < G2.size(); ++i) cout << "G[" << i << "] = " << G2[i] << "\n";

    return 0;
}

This program computes G_i = r_{i+1} + γ G_{i+1} for all time steps in an episode using a single backward pass. The accumulator holds G_{i+1} as we move left, ensuring numerical stability and avoiding repeated calls to pow. The example prints the returns for two different discount factors to show how γ changes the emphasis on near-term versus long-term rewards.

Time: O(T) for an episode of length T.
Space: O(T) to store all G; O(1) extra beyond the output array.
First-visit Monte Carlo state-value estimation on a random walk
#include <bits/stdc++.h>
using namespace std;

struct Episode {
    vector<int> states;     // includes nonterminals and terminal at end
    vector<double> rewards; // rewards[i] is r_{i+1}
};

mt19937 rng(123);

// Generate one episode for the 7-state random walk:
// States 0 and 6 are terminal. Nonterminals are 1..5.
// Start at state 3. Each step moves left/right with prob 0.5.
// Reward is +1 only when entering state 6; otherwise 0.
Episode generate_episode() {
    uniform_real_distribution<double> uni(0.0, 1.0);
    Episode ep;
    int s = 3; // start
    ep.states.push_back(s);
    while (s != 0 && s != 6) {
        double u = uni(rng);
        int a = (u < 0.5 ? -1 : +1); // left or right
        int s_next = s + a;
        double r = (s_next == 6 ? 1.0 : 0.0);
        ep.rewards.push_back(r);
        s = s_next;
        ep.states.push_back(s);
    }
    return ep;
}

// Compute per-time-step returns via backward pass
vector<double> discountedReturns(const vector<double>& rewards, double gamma) {
    int T = (int)rewards.size();
    vector<double> G(T, 0.0);
    double acc = 0.0;
    for (int i = T - 1; i >= 0; --i) {
        acc = rewards[i] + gamma * acc;
        G[i] = acc;
    }
    return G;
}

int main() {
    const double gamma = 1.0; // episodic, finite-horizon; safe to use γ=1 here
    const int S = 7;          // states 0..6
    const int episodes = 20000;

    vector<double> V(S, 0.0);
    vector<int> N(S, 0); // visit counts for first-visit MC

    for (int e = 0; e < episodes; ++e) {
        Episode ep = generate_episode();
        // Compute returns for each time index in the episode
        vector<double> G = discountedReturns(ep.rewards, gamma);

        // Track first visit times within this episode
        vector<int> first_visit_index(S, -1);
        for (int t = 0; t < (int)ep.states.size() - 1; ++t) { // exclude terminal state's index for updates
            int s = ep.states[t];
            if (first_visit_index[s] == -1) first_visit_index[s] = t;
        }

        // First-visit MC updates for nonterminal states 1..5
        for (int s = 1; s <= 5; ++s) {
            int t = first_visit_index[s];
            if (t != -1) {
                double Gt = G[t];
                N[s] += 1;
                // Incremental mean: V[s] <- V[s] + (G - V[s]) / N[s]
                V[s] += (Gt - V[s]) / (double)N[s];
            }
        }
    }

    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "Estimated state values (states 1..5):\n";
    for (int s = 1; s <= 5; ++s) {
        cout << "V[" << s << "] = " << V[s] << "\n";
    }
    return 0;
}

This program estimates state values in the classic 7-state random walk. An episode starts at state 3, moves left/right with equal probability, and terminates on reaching state 0 or 6. The only nonzero reward (+1) occurs when entering state 6. With γ = 1 (episodic), the return is the total reward accumulated until termination. First-visit Monte Carlo averages the return from the first occurrence of each state in an episode to estimate V(s). The discountedReturns function uses the backward recursion for efficient computation.

Time: O(E · L) where E is the number of episodes and L is the average episode length.
Space: O(S + L) for value tables and temporary episode storage.
Compute an n-step return with bootstrapping
#include <bits/stdc++.h>
using namespace std;

// Compute G_t^{(n)} = sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n * V(S_{t+n})
// Given: a window of future rewards starting at time t (rewards[t] = r_{t+1}),
// the discount factor gamma, the step size n, and a bootstrap value V(S_{t+n}).
double n_step_return(const vector<double>& rewards_from_t, double gamma, int n, double bootstrap_value) {
    double acc = 0.0;
    double pow_gamma = 1.0;
    int K = min(n, (int)rewards_from_t.size());
    for (int k = 0; k < K; ++k) {
        acc += pow_gamma * rewards_from_t[k];
        pow_gamma *= gamma; // prepare for the next step
    }
    // If we reached fewer than n rewards (due to episode termination),
    // the bootstrap typically should NOT be applied in episodic tasks.
    // Here we only add bootstrap if we actually have n steps available.
    if (n <= (int)rewards_from_t.size()) acc += pow_gamma * bootstrap_value;
    return acc;
}

int main() {
    // Example: suppose from time t onward we observe rewards: 2, 0, -1, 3, ...
    vector<double> future_rewards = {2.0, 0.0, -1.0, 3.0};
    double gamma = 0.9;

    // Assume a value baseline V(S_{t+3}) = 5.0 for n=3
    int n = 3;
    double V_bootstrap = 5.0;

    double G_n = n_step_return(future_rewards, gamma, n, V_bootstrap);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n-step return (n=3, gamma=0.9, V_bootstrap=5.0): " << G_n << "\n";

    // Compare with full (finite) return without bootstrapping (γ^n term omitted)
    // This equals the n->infty Monte Carlo target truncated to the episode length.
    double full = 0.0, pow_gamma = 1.0;
    for (double r : future_rewards) { full += pow_gamma * r; pow_gamma *= gamma; }
    cout << "Truncated full return (no bootstrap): " << full << "\n";
    return 0;
}

This example computes the n-step return G_t^{(n)} that mixes real rewards for n steps with a bootstrap value estimate at step n. The geometric factor is built iteratively without calling pow repeatedly. If the episode terminates before n steps, the bootstrap is typically omitted (since there is no next state), as shown in the guard condition.

Time: O(n) per n-step target.
Space: O(1) auxiliary space.
Tags: discount factor · discounted return · reinforcement learning · monte carlo · temporal-difference · bellman equation · n-step return · geometric series · episodic task · continuing task · value function · present value · bootstrapping · effective horizon · policy evaluation