Value Function Approximation
Key Points
- Value function approximation replaces a huge table of values with a small set of parameters that can generalize across similar states.
- We approximate V^\pi(s) with a parametric model V(s; \mathbf{w}) and learn \mathbf{w} from data collected under policy \pi.
- A common objective is the mean-squared value error weighted by the state distribution, optimized by gradient or semi-gradient methods.
- Linear function approximation uses feature vectors \phi(s) so that V(s; \mathbf{w}) = \mathbf{w}^\top \phi(s), yielding fast, stable updates.
- Temporal-Difference learning provides bootstrapped targets and semi-gradient updates: \mathbf{w} \leftarrow \mathbf{w} + \alpha \, \delta_t \, \nabla_{\mathbf{w}} V(s_t; \mathbf{w}).
- With linear features and on-policy sampling, TD(0) converges to the projected Bellman fixed point; off-policy sampling with bootstrapping can diverge (the "deadly triad").
- Feature scaling, bias terms, and step-size tuning are crucial for stable and efficient learning.
- C++ implementations of linear approximators are straightforward and run in O(d) time per step, where d is the number of features.
Prerequisites
- Markov Decision Processes (MDPs) and policies — Value functions are defined on MDPs; understanding transitions, rewards, and policies is essential.
- Probability and expectations — Value functions and learning objectives are defined as expectations over stochastic trajectories.
- Linear algebra and vector calculus — Features, parameter vectors, gradients, and projections rely on linear algebra.
- Stochastic Gradient Descent (SGD) — Learning updates for approximators use stochastic gradients and step-size tuning.
- Temporal-Difference learning — TD provides the bootstrapped targets and update rules used with function approximators.
Detailed Explanation
01 Overview
Value function approximation is a cornerstone technique in reinforcement learning (RL) for handling large or continuous state spaces where storing a separate value for each state is impossible. Instead of keeping a giant table of V^\pi(s), we choose a family of functions V(s; \mathbf{w}) parameterized by a small vector \mathbf{w}. Our goal is to pick \mathbf{w} so that V(s; \mathbf{w}) closely matches the true value V^\pi(s), the expected long-term return when following a fixed policy \pi. This lets the learner generalize: learning about one state also informs nearby or similar states via shared parameters. Approaches range from linear models with hand-designed features to deep neural networks that learn features automatically. Learning typically uses stochastic gradient methods driven by either Monte Carlo targets (full returns) or bootstrapped Temporal-Difference (TD) targets that combine immediate rewards with the model’s own predictions for the next state. Linear, on-policy TD has strong convergence guarantees to a specific fixed point (the projected Bellman solution), while non-linear or off-policy settings can be powerful but risk instability. In practice, value approximation reduces memory, speeds up computation, and is essential for control algorithms like actor-critic and DQN variants.
02 Intuition & Analogies
Imagine you’re estimating the market value of houses in a city. You can’t memorize a price for every address (there are too many), but you can build a model that estimates price from features like size, neighborhood, and age. When you learn the effect of 10 extra square meters in one house, that knowledge generalizes to other houses. Value function approximation is the same idea for RL: we estimate the long-term desirability (value) of states using a model with a few knobs (parameters). If two states share similar features—like similar positions in a grid or similar sensor readings in robotics—they’ll get similar value estimates automatically. Bootstrapping adds another twist: sometimes we don’t have the final selling price (full return) yet, so we use our current estimate of the neighbor’s value to update this one—like pricing a house partly by looking at the estimated prices of nearby houses. This self-consistency speeds learning but can cause feedback loops if we’re careless (bad estimates can reinforce themselves), especially if we try to learn about neighborhoods we rarely or never actually visit under our current policy. Linear models are like using a weighted sum of known features—simple and robust. Neural networks are like complex appraisal models that can discover powerful new features, but they need more care to avoid wild price swings. In all cases, the art is picking features (or architectures), step sizes, and targets so that we learn fast without destabilizing the estimates.
03 Formal Definition
Given a fixed policy \pi, the state-value function is V^\pi(s) = \mathbb{E}_\pi[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s]. Value function approximation chooses a parametric family \hat{V}(s; \mathbf{w}) with \mathbf{w} \in \mathbb{R}^d, where d is much smaller than the number of states, and seeks the \mathbf{w} that minimizes the mean-squared value error \sum_s \mu(s) [V^\pi(s) - \hat{V}(s; \mathbf{w})]^2, with \mu the state distribution induced by \pi. In the linear case, \hat{V}(s; \mathbf{w}) = \mathbf{w}^\top \phi(s) for a feature map \phi from states to \mathbb{R}^d.
04 When to Use
Use value function approximation when the state space is very large, continuous, or when you want generalization across states. It is ideal for robotics with continuous sensors, finance with high-dimensional features, and board games with too many distinct positions to tabulate. Linear approximation is preferred when you need fast, stable learning with interpretable features (e.g., tile coding, radial basis functions, hand-crafted statistics). Semi-gradient TD(0) is the go-to on-policy method for continual prediction in streaming settings where you cannot wait for complete returns. Monte Carlo methods are attractive when episodes are short and you can tolerate higher variance for unbiased targets. Least-Squares TD (LSTD) is useful when data are limited but you can afford O(d^2) updates for improved sample efficiency. For control (improving policies), value approximators power actor-critic (critic estimates V or Q) and Q-learning variants; non-linear approximators like deep networks are standard when the raw state is high-dimensional (images, audio) and you can use techniques (experience replay, target networks) to stabilize learning. Avoid bootstrapping with off-policy sampling and function approximation unless you add safeguards (importance sampling with variance control, gradient TD, emphatic TD, or target networks) due to potential divergence.
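The hand-designed features mentioned above, such as tile coding, can be sketched compactly. The following one-dimensional version is illustrative only: the function name, tiling count, tile count, and offset scheme are choices made here, not taken from the text.

```cpp
#include <vector>
#include <cstddef>

// Minimal 1D tile-coding feature map (illustrative sketch): several offset
// tilings over s in [0,1], each contributing exactly one active binary
// feature. With the defaults, phi has 4 * 8 = 32 dimensions and exactly
// 4 ones for any s in [0,1].
std::vector<double> tileFeatures(double s, int numTilings = 4, int tilesPerTiling = 8) {
    std::vector<double> phi(static_cast<size_t>(numTilings) * tilesPerTiling, 0.0);
    for (int t = 0; t < numTilings; ++t) {
        // Offset each tiling by a fraction of one tile width.
        double offset = static_cast<double>(t) / (numTilings * tilesPerTiling);
        int idx = static_cast<int>((s + offset) * tilesPerTiling);
        if (idx >= tilesPerTiling) idx = tilesPerTiling - 1; // clamp at the boundary
        phi[static_cast<size_t>(t) * tilesPerTiling + idx] = 1.0; // one-hot within this tiling
    }
    return phi;
}
```

Because each tiling is shifted slightly, nearby states share most of their active tiles, which is what gives tile coding its local generalization.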
⚠️ Common Mistakes
- Omitting a bias feature: Without a constant feature (\phi_0 = 1), linear models can be systematically biased. Add a bias term or center features.
- Poor feature scaling: Large-magnitude features cause unstable gradients. Normalize or standardize features so each dimension has similar scale.
- Step size too large: TD methods are sensitive to \alpha. Use small learning rates, decay schedules, or adaptive methods; monitor the TD error.
- Deadly triad: Combining function approximation, bootstrapping, and off-policy sampling can diverge. If off-policy, use safer algorithms (e.g., Gradient TD, Emphatic TD) or target networks with replay.
- Bootstrapping with terminal states: Forgetting to zero out the bootstrap at terminal states inflates values. Ensure V(s_{t+1}) = 0 when done = true.
- Mixing policies: Using data from one policy to evaluate another without correction biases estimates. Use importance sampling or evaluate on-policy.
- High-variance Monte Carlo: Using full returns in long or continuing tasks produces huge variance. Prefer TD methods or use eligibility traces (\lambda) to trade bias and variance.
- Data leakage and overfitting: In batch settings with limited trajectories, excessive training on the same data can overfit. Use validation, regularization, or more data.
- Wrong discounting: Ensure the same \gamma is used in targets and definitions; continuing tasks may need average-reward formulations.
- Ignoring stochasticity: Treating \delta_t as a deterministic gradient can be misleading; it is a noisy estimate, so use many samples and monitor learning curves.
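The feature-scaling advice can be implemented with an online standardizer. The RunningNormalizer below is our own illustrative sketch (the class name and the Welford-style running-variance update are not from the text): it shifts and rescales each feature dimension using statistics accumulated so far.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Online per-feature standardization using Welford's running mean/variance.
// Each incoming feature vector is shifted and scaled so every dimension has
// roughly zero mean and unit variance over the data seen so far.
struct RunningNormalizer {
    std::vector<double> mean, m2; // running mean and sum of squared deviations
    long long n = 0;              // number of vectors seen

    explicit RunningNormalizer(size_t d) : mean(d, 0.0), m2(d, 0.0) {}

    std::vector<double> normalize(const std::vector<double>& x) {
        ++n;
        std::vector<double> out(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            double delta = x[i] - mean[i];
            mean[i] += delta / static_cast<double>(n);     // update running mean
            m2[i] += delta * (x[i] - mean[i]);             // update squared deviations
            double var = (n > 1) ? m2[i] / static_cast<double>(n - 1) : 1.0;
            out[i] = (x[i] - mean[i]) / std::sqrt(var + 1e-8); // standardize
        }
        return out;
    }
};
```

A bias feature should be excluded from (or left untouched by) normalization, since a constant dimension has zero variance.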
Key Formulas
State-value definition
V^\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
Explanation: This defines the expected discounted return starting from state s and following policy \pi. The expectation is over trajectories induced by \pi and the environment dynamics.
Linear value approximator
\hat{V}(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)
Explanation: The approximated value is a weighted sum of features of the state. It is linear in parameters, enabling fast and stable updates.
MSVE objective
\overline{\mathrm{VE}}(\mathbf{w}) = \sum_s \mu(s) \left[ V^\pi(s) - \hat{V}(s; \mathbf{w}) \right]^2
Explanation: This is the mean-squared value error under the on-policy state distribution \mu. Minimizing it finds the best approximation within the function class.
Gradient of MSVE
\nabla_{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w}) = -2 \sum_s \mu(s) \left[ V^\pi(s) - \hat{V}(s; \mathbf{w}) \right] \nabla_{\mathbf{w}} \hat{V}(s; \mathbf{w})
Explanation: This gradient guides how to adjust parameters to reduce squared error. In linear models it simplifies because \nabla_{\mathbf{w}} \hat{V}(s; \mathbf{w}) is just the feature vector \phi(s).
TD error
\delta_t = R_{t+1} + \gamma \hat{V}(S_{t+1}; \mathbf{w}) - \hat{V}(S_t; \mathbf{w})
Explanation: The TD error measures the mismatch between current prediction and a one-step bootstrapped target. It drives online learning in TD methods.
Semi-gradient TD(0) update
\mathbf{w} \leftarrow \mathbf{w} + \alpha \, \delta_t \, \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w})
Explanation: We update parameters in the direction that would reduce the TD error for the current state. For linear features, the increment is proportional to the feature vector.
Projected Bellman equation
\hat{V} = \Pi \, T^\pi \hat{V}
Explanation: The solution for linear, on-policy TD equals the projection \Pi of the Bellman backup T^\pi onto the function space. This explains convergence to a particular fixed point.
MSPBE
\mathrm{MSPBE}(\mathbf{w}) = \left\| \hat{V}_{\mathbf{w}} - \Pi \, T^\pi \hat{V}_{\mathbf{w}} \right\|_\mu^2
Explanation: The mean-squared projected Bellman error measures how far a value function is from satisfying the projected Bellman equation. Gradient TD algorithms minimize this quantity.
LSTD solution
\mathbf{w} = \mathbf{A}^{-1} \mathbf{b}, \quad \mathbf{A} = \mathbb{E}\left[ \phi_t \left( \phi_t - \gamma \phi_{t+1} \right)^\top \right], \quad \mathbf{b} = \mathbb{E}\left[ R_{t+1} \phi_t \right]
Explanation: Least-Squares TD solves a linear system to find weights that best satisfy the TD fixed-point conditions. It is more sample-efficient but computationally heavier.
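As a sketch of how the LSTD solution can be computed in batch, the following reuses the 7-state random-walk setup from the code examples below. The function name and the Cramer's-rule solve for d = 2 are illustrative choices, not a general-purpose implementation (larger d calls for a regularized solver or Sherman–Morrison incremental updates).

```cpp
#include <array>
#include <random>

// Batch LSTD for the 7-state random walk with phi(s) = [1, s/6].
// Accumulates A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T and
// b = sum_t r_{t+1} phi_t over sampled transitions, then solves the
// 2x2 system A w = b in closed form.
inline std::array<double, 2> lstdRandomWalk(int episodes, unsigned seed, double gamma = 1.0) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> coin(0, 1);
    auto phi = [](int s) { return std::array<double, 2>{1.0, s / 6.0}; };

    double A[2][2] = {{0, 0}, {0, 0}}, b[2] = {0, 0};
    for (int ep = 0; ep < episodes; ++ep) {
        int s = 3; // start in the middle
        while (s != 0 && s != 6) {
            int sp = s + (coin(rng) ? +1 : -1);  // uniform random policy
            double r = (sp == 6) ? 1.0 : 0.0;    // reward only at right terminal
            auto p = phi(s);
            // Zero the bootstrap feature vector at terminal states.
            std::array<double, 2> pp = (sp == 0 || sp == 6)
                ? std::array<double, 2>{0.0, 0.0} : phi(sp);
            for (int i = 0; i < 2; ++i) {
                for (int j = 0; j < 2; ++j) A[i][j] += p[i] * (p[j] - gamma * pp[j]);
                b[i] += r * p[i];
            }
            s = sp;
        }
    }
    // Solve A w = b by Cramer's rule.
    double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    return {(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det};
}
```

Because phi(s) = [1, s/6] can represent the true values of this task exactly, the solution should approach w ≈ (0, 1), i.e. \hat{V}(s) ≈ s/6.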
Gradient Monte Carlo
\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G_t - \hat{V}(S_t; \mathbf{w}) \right] \nabla_{\mathbf{w}} \hat{V}(S_t; \mathbf{w})
Explanation: Using full returns G_t as targets yields an unbiased gradient step on MSVE. Variance can be high for long horizons, so it suits short episodes.
Complexity Analysis
For linear approximators, prediction and a semi-gradient TD(0) or SGD update each cost O(d) time and O(d) memory, where d is the number of features, independent of the number of states. LSTD is more sample-efficient but maintains the d-by-d matrix A, costing O(d^2) per update (with an incremental Sherman–Morrison inverse) and O(d^2) memory; solving the linear system directly costs up to O(d^3).
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple linear value function V(s; w) = w^T phi(s)
struct LinearValueFunction {
    vector<double> w; // parameters
    explicit LinearValueFunction(size_t d) : w(d, 0.0) {}

    // Dot product w^T x
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        size_t n = a.size();
        for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    double predict(const vector<double>& phi) const {
        return dot(w, phi);
    }

    // One SGD step on squared error with target y: w <- w + alpha * (y - w^T phi) * phi
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - predict(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

// Example feature map for a 1D state s in [0,1]: polynomial features with bias
static vector<double> features(double s) {
    return {1.0, s, s * s}; // bias + s + s^2
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We "pretend" Monte Carlo targets approximate V^pi(s) = sin(pi s).
    // In practice, these targets would come from episodic returns G_t.
    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    LinearValueFunction vf(3);
    double alpha = 0.1; // step size

    // Train with SGD on synthetic (s, target) pairs
    for (int it = 0; it < 20000; ++it) {
        double s = U(rng);
        double target = sin(M_PI * s); // stand-in for Monte Carlo return
        vector<double> phi = features(s);
        vf.update(phi, target, alpha);
    }

    // Evaluate the learned approximator
    for (double s : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        vector<double> phi = features(s);
        double pred = vf.predict(phi);
        cout << fixed << setprecision(4)
             << "s=" << s << " -> V_hat=" << pred
             << " (true ~ " << sin(M_PI * s) << ")\n";
    }

    return 0;
}
```
This C++ program implements a linear value function approximator and trains it with stochastic gradient descent on supervised targets that mimic Monte Carlo returns. The features include a bias term and polynomial terms of the state. After training, predictions are printed at several states to show generalization.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Random Walk environment (7 states: 0 and 6 terminal; start at 3)
// Actions: -1 (left) or +1 (right); policy: choose uniformly at random.
struct RandomWalk {
    int state = 3; // start in the middle
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;

    void reset() { state = 3; }

    // Step with action a in {-1, +1}
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0; // reward only on right terminal
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w; // weights
    explicit LinearVF(size_t d) : w(d, 0.0) {}

    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    double V(const vector<double>& phi) const { return dot(w, phi); }

    // Semi-gradient TD(0): w <- w + alpha * delta * phi(s)
    void td0_update(const vector<double>& phi_s, double r, const vector<double>& phi_sp,
                    bool terminal, double gamma, double alpha) {
        double v_s = V(phi_s);
        double v_sp = terminal ? 0.0 : V(phi_sp);
        double delta = r + gamma * v_sp - v_s;
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * delta * phi_s[i];
    }
};

// Simple 2D feature map with bias for discrete states 0..6 (terminals included).
// We use a coarse representation to force approximation:
// phi(s) = [1, s/6], so only two parameters must fit all nonterminal states.
static vector<double> phi(int s) {
    return {1.0, static_cast<double>(s) / 6.0};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2); // two parameters: bias and slope

    mt19937 rng(123);
    uniform_int_distribution<int> coin(0, 1); // 0 -> left, 1 -> right

    double alpha = 0.05; // step size for TD
    double gamma = 1.0;  // undiscounted episodic task

    // Train for many episodes
    for (int episode = 0; episode < 5000; ++episode) {
        env.reset();
        bool done = false;
        while (!done) {
            int a = coin(rng) ? +1 : -1; // on-policy random
            int s = env.state;
            auto phis = phi(s);
            auto [sp, r, terminal] = env.step(a);
            auto phisp = phi(sp);
            vf.td0_update(phis, r, phisp, terminal, gamma, alpha);
            done = terminal;
        }
    }

    // Report predicted values for each state (nonterminals 1..5)
    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}
```
This program evaluates a fixed random policy on a 7-state Random Walk using semi-gradient TD(0) with a very low-dimensional feature map (bias and normalized index). Because the features cannot represent the true tabular values exactly, the algorithm finds the projected Bellman fixed point in the 2D function space.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct RandomWalk {
    int state = 3;
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;
    void reset() { state = 3; }
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0;
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w;
    explicit LinearVF(size_t d) : w(d, 0.0) {}
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
    double V(const vector<double>& phi) const { return dot(w, phi); }
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - V(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

static vector<double> phi(int s) { return {1.0, static_cast<double>(s) / 6.0}; }

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2);
    mt19937 rng(7);
    uniform_int_distribution<int> coin(0, 1);
    double alpha = 0.02;
    double gamma = 1.0;

    // Train with every-visit Gradient Monte Carlo
    for (int episode = 0; episode < 4000; ++episode) {
        env.reset();
        vector<int> states;
        vector<double> rewards;
        states.reserve(20);
        rewards.reserve(20);
        bool done = false;
        states.push_back(env.state);
        while (!done) {
            int a = coin(rng) ? +1 : -1; // random policy
            auto [sp, r, terminal] = env.step(a);
            rewards.push_back(r);
            states.push_back(sp);
            done = terminal;
        }
        // Compute returns G_t backward and update once per state visit
        double G = 0.0;
        for (int t = (int)states.size() - 2; t >= 0; --t) { // last state is terminal, no update for it
            G = rewards[t] + gamma * G;
            vf.update(phi(states[t]), G, alpha);
        }
    }

    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}
```
This variant uses Gradient Monte Carlo with full returns as targets, which is unbiased with respect to MSVE but has higher variance than TD. It is stable for on-policy evaluation and short episodes, providing a useful contrast to bootstrapped TD(0).