How I Study AI
📚 Theory · Intermediate

Value Function Approximation

Key Points

  • Value function approximation replaces a huge table of values with a small set of parameters that can generalize across similar states.
  • We approximate V^π(s) with a parametric model V(s; w) and learn w from data collected under policy π.
  • A common objective is mean-squared value error weighted by the state distribution, and we optimize it by gradient or semi-gradient methods.
  • Linear function approximation uses feature vectors φ(s) so that V(s; w) = w^⊤ φ(s), yielding fast, stable updates.
  • Temporal-Difference learning provides bootstrapped targets and semi-gradient updates: w ← w + α δ_t ∇_w V(s_t; w).
  • With linear features and on-policy sampling, TD(0) converges to the projected Bellman fixed point; off-policy with bootstrapping can diverge (deadly triad).
  • Feature scaling, bias terms, and step-size tuning are crucial for stable and efficient learning.
  • C++ implementations are straightforward for linear approximators and can run in O(d) time per step, where d is the number of features.

Prerequisites

  • Markov Decision Processes (MDPs) and policies — Value functions are defined on MDPs; understanding transitions, rewards, and policies is essential.
  • Probability and expectations — Value functions and learning objectives are defined as expectations over stochastic trajectories.
  • Linear algebra and vector calculus — Features, parameter vectors, gradients, and projections rely on linear algebra.
  • Stochastic Gradient Descent (SGD) — Learning updates for approximators use stochastic gradients and step-size tuning.
  • Temporal-Difference learning — TD provides the bootstrapped targets and update rules used with function approximators.

Detailed Explanation


01 · Overview

Value function approximation is a cornerstone technique in reinforcement learning (RL) for handling large or continuous state spaces where storing a separate value for each state is impossible. Instead of keeping a giant table of V^π(s), we choose a family of functions V(s; w) parameterized by a small vector w. Our goal is to pick w so that V(s; w) closely matches the true value V^π(s), the expected long-term return when following a fixed policy π. This lets the learner generalize: learning about one state also informs nearby or similar states via shared parameters. Approaches range from linear models with hand-designed features to deep neural networks that learn features automatically. Learning typically uses stochastic gradient methods driven by either Monte Carlo targets (full returns) or bootstrapped Temporal-Difference (TD) targets that combine immediate rewards with the model's own predictions for the next state. Linear, on-policy TD has strong convergence guarantees to a specific fixed point (the projected Bellman solution), while non-linear or off-policy settings can be powerful but risk instability. In practice, value approximation reduces memory, speeds up computation, and is essential for control algorithms like actor-critic and DQN variants.

02 · Intuition & Analogies

Imagine you’re estimating the market value of houses in a city. You can’t memorize a price for every address (there are too many), but you can build a model that estimates price from features like size, neighborhood, and age. When you learn the effect of 10 extra square meters in one house, that knowledge generalizes to other houses. Value function approximation is the same idea for RL: we estimate the long-term desirability (value) of states using a model with a few knobs (parameters). If two states share similar features—like similar positions in a grid or similar sensor readings in robotics—they’ll get similar value estimates automatically. Bootstrapping adds another twist: sometimes we don’t have the final selling price (full return) yet, so we use our current estimate of the neighbor’s value to update this one—like pricing a house partly by looking at the estimated prices of nearby houses. This self-consistency speeds learning but can cause feedback loops if we’re careless (bad estimates can reinforce themselves), especially if we try to learn about neighborhoods we rarely or never actually visit under our current policy. Linear models are like using a weighted sum of known features—simple and robust. Neural networks are like complex appraisal models that can discover powerful new features, but they need more care to avoid wild price swings. In all cases, the art is picking features (or architectures), step sizes, and targets so that we learn fast without destabilizing the estimates.

03 · Formal Definition

Consider an MDP (S, A, P, r, γ) and a fixed policy π. The state-value function is V^π(s) = E_π[ ∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s ]. We approximate V^π with a parameterized family F = { V(·; w) : w ∈ R^d }. A standard objective is the mean-squared value error (MSVE) weighted by the on-policy state distribution d_π: J(w) = E_{s∼d_π}[ (V^π(s) − V(s; w))^2 ]. In practice, V^π is unknown, so we use data-driven targets. With Monte Carlo, targets are returns G_t; with TD(0), targets are r_{t+1} + γ V(s_{t+1}; w). For linear function approximation, V(s; w) = w^⊤ φ(s), where φ(s) ∈ R^d is a feature vector. Semi-gradient TD(0) performs stochastic approximation of the projected Bellman fixed point: w_{t+1} = w_t + α δ_t ∇_w V(s_t; w_t), where δ_t = r_{t+1} + γ V(s_{t+1}; w_t) − V(s_t; w_t). For linear models, ∇_w V = φ(s_t), giving w_{t+1} = w_t + α δ_t φ(s_t). Under on-policy sampling and suitable conditions, the iterates converge to the solution of the projected Bellman equation V_{w*} = Π T_π V_{w*}, where Π projects onto F under the d_π-weighted norm.

04 · When to Use

Use value function approximation when the state space is very large, continuous, or when you want generalization across states. It is ideal for robotics with continuous sensors, finance with high-dimensional features, and board games with too many distinct positions to tabulate. Linear approximation is preferred when you need fast, stable learning with interpretable features (e.g., tile coding, radial basis functions, hand-crafted statistics). Semi-gradient TD(0) is the go-to on-policy method for continual prediction in streaming settings where you cannot wait for complete returns. Monte Carlo methods are attractive when episodes are short and you can tolerate higher variance for unbiased targets. Least-Squares TD (LSTD) is useful when data are limited but you can afford O(d^2) updates for improved sample efficiency. For control (improving policies), value approximators power actor-critic (critic estimates V or Q) and Q-learning variants; non-linear approximators like deep networks are standard when the raw state is high-dimensional (images, audio) and you can use techniques (experience replay, target networks) to stabilize learning. Avoid bootstrapping with off-policy sampling and function approximation unless you add safeguards (importance sampling with variance control, gradient TD, emphatic TD, or target networks) due to potential divergence.

⚠️Common Mistakes

• Omitting a bias feature: Without a constant feature (φ₀ = 1), linear models can be systematically biased. Add a bias term or center features.
• Poor feature scaling: Large-magnitude features cause unstable gradients. Normalize or standardize features so each dimension has similar scale.
• Step size too large: TD methods are sensitive to α. Use small learning rates, decay schedules, or adaptive methods; monitor the TD error.
• Deadly triad: Combining function approximation, bootstrapping, and off-policy sampling can diverge. If off-policy, use safer algorithms (e.g., Gradient TD, Emphatic TD) or target networks with replay.
• Bootstrapping with terminal states: Forgetting to zero out the bootstrap at terminal states inflates values. Ensure V(s_{t+1}) = 0 when done = true.
• Mixing policies: Using data from one policy to evaluate another without correction biases estimates. Use importance sampling or evaluate on-policy.
• High-variance Monte Carlo: Using full returns in long or continuing tasks produces huge variance. Prefer TD methods or use eligibility traces (λ) to trade bias against variance.
• Data leakage and overfitting: In batch settings with limited trajectories, excessive training on the same data can overfit. Use validation, regularization, or more data.
• Wrong discounting: Ensure the same γ is used in targets and definitions; continuing tasks may need average-reward formulations.
• Ignoring stochasticity: Treating δ_t as a deterministic gradient can be misleading; it is a noisy estimate, so use many samples and monitor learning curves.

Key Formulas

State-value definition

V^π(s) = E_π[ ∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s ]

Explanation: This defines the expected discounted return starting from state s and following policy π. The expectation is over trajectories induced by π and the environment dynamics.

Linear value approximator

V(s; w) = w^⊤ φ(s)

Explanation: The approximated value is a weighted sum of features of the state. It is linear in parameters, enabling fast and stable updates.

MSVE objective

J(w) = E_{s∼d_π}[ (V^π(s) − V(s; w))^2 ]

Explanation: This is the mean-squared value error under the on-policy state distribution. Minimizing it finds the best approximation within the function class.

Gradient of MSVE

∇_w J(w) = −2 E_{d_π}[ (V^π(s) − V(s; w)) ∇_w V(s; w) ]

Explanation: This gradient guides how to adjust parameters to reduce squared error. In linear models, the gradient simplifies because ∇ V is just the feature vector.

TD error

δ_t = r_{t+1} + γ V(s_{t+1}; w) − V(s_t; w)

Explanation: The TD error measures the mismatch between current prediction and a one-step bootstrapped target. It drives online learning in TD methods.

Semi-gradient TD(0) update

w_{t+1} = w_t + α δ_t ∇_w V(s_t; w_t)

Explanation: We update parameters in the direction that would reduce the TD error for the current state. For linear features, the increment is proportional to the feature vector.

Projected Bellman equation

V_{w*} = Π T_π V_{w*}

Explanation: The solution for linear, on-policy TD equals the projection of the Bellman backup onto the function space. This explains convergence to a particular fixed point.

MSPBE

MSPBE(w) = ‖V_w − Π T_π V_w‖²_{d_π}

Explanation: The mean-squared projected Bellman error measures how far a value function is from satisfying the projected Bellman equation. Gradient TD algorithms minimize this quantity.

LSTD solution

A = E[ φ_t (φ_t − γ φ_{t+1})^⊤ ],  b = E[ r_{t+1} φ_t ],  w = A⁻¹ b

Explanation: Least-Squares TD solves a linear system to find weights that best satisfy the TD fixed-point conditions. It is more sample-efficient but computationally heavier.

Gradient Monte Carlo

G_t = ∑_{k=0}^∞ γ^k r_{t+k+1},  w ← w + α (G_t − V(s_t; w)) ∇_w V(s_t; w)

Explanation: Using full returns as targets yields an unbiased gradient step on MSVE. Variance can be high for long horizons, so it suits short episodes.

Complexity Analysis

Let d be the number of features and n the number of samples (transitions). For linear function approximation with either supervised targets or Monte Carlo/TD targets, each prediction and gradient evaluation is O(d) time and O(d) memory (to store the weights). A single online update thus costs O(d); processing n samples is O(nd). If features are sparse with k nonzeros, operations reduce to O(k). Batch gradient descent with mini-batches of size B costs O(Bd) per step. For TD(0), each step computes a TD error and performs one weight update; this remains O(d) with negligible constant factors. In episodic tasks with average episode length L, a pass over M episodes is O(MLd). Memory footprint is dominated by the parameter vector (O(d)) plus temporary feature storage; streaming implementations require no additional buffers.

Least-Squares TD (LSTD) and related second-order methods improve sample efficiency but increase computation. Maintaining and inverting the A matrix naïvely is O(d^3) for a batch, or O(d^2) per step with recursive updates; memory becomes O(d^2). These methods are practical for modest d (hundreds) but not for very high-dimensional features.

For non-linear approximators (e.g., neural networks with P parameters), per-step cost is O(P) for forward and backward passes, with memory O(P). Stabilization tricks (target networks, replay) can add O(nd) storage to keep a buffer of transitions. Overall, linear semi-gradient TD achieves a favorable O(d) per-step cost, making it well-suited for real-time applications and embedded systems when d is moderate.

Code Examples

Linear value function with supervised (Monte Carlo) targets via SGD
#include <bits/stdc++.h>
using namespace std;

// Simple linear value function V(s; w) = w^T phi(s)
struct LinearValueFunction {
    vector<double> w; // parameters
    explicit LinearValueFunction(size_t d) : w(d, 0.0) {}

    // Dot product w^T x
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    double predict(const vector<double>& phi) const {
        return dot(w, phi);
    }

    // One SGD step on squared error with target y: w <- w + alpha * (y - w^T phi) * phi
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - predict(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

// Example feature map for a 1D state s in [0,1]: polynomial features with bias
static vector<double> features(double s) {
    return {1.0, s, s * s}; // bias + s + s^2
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We "pretend" Monte Carlo targets approximate V^pi(s) = sin(pi s).
    // In practice, these targets would come from episodic returns G_t.
    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    LinearValueFunction vf(3);
    double alpha = 0.1; // step size

    // Train with SGD on synthetic (s, target) pairs
    for (int it = 0; it < 20000; ++it) {
        double s = U(rng);
        double target = sin(M_PI * s); // stand-in for Monte Carlo return
        vector<double> phi = features(s);
        vf.update(phi, target, alpha);
    }

    // Evaluate the learned approximator
    for (double s : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        vector<double> phi = features(s);
        double pred = vf.predict(phi);
        cout << fixed << setprecision(4)
             << "s=" << s << " -> V_hat=" << pred
             << " (true ~ " << sin(M_PI * s) << ")\n";
    }

    return 0;
}

This C++ program implements a linear value function approximator and trains it with stochastic gradient descent on supervised targets that mimic Monte Carlo returns. The features include a bias term and polynomial terms of the state. After training, predictions are printed at several states to show generalization.

Time: O(n d) for n samples and d features
Space: O(d) for parameters plus O(d) temporary storage
Semi-gradient TD(0) policy evaluation on a Random Walk with linear features
#include <bits/stdc++.h>
using namespace std;

// Random Walk environment (7 states: 0 and 6 terminal; start at 3)
// Actions: -1 (left) or +1 (right); policy: choose uniformly at random.
struct RandomWalk {
    int state = 3; // start in the middle
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;

    void reset() { state = 3; }

    // Step with action a in {-1, +1}
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0; // reward only on right terminal
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w; // weights
    explicit LinearVF(size_t d) : w(d, 0.0) {}

    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    double V(const vector<double>& phi) const { return dot(w, phi); }

    // Semi-gradient TD(0): w <- w + alpha * delta * phi(s)
    void td0_update(const vector<double>& phi_s, double r, const vector<double>& phi_sp,
                    bool terminal, double gamma, double alpha) {
        double v_s = V(phi_s);
        double v_sp = terminal ? 0.0 : V(phi_sp);
        double delta = r + gamma * v_sp - v_s;
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * delta * phi_s[i];
    }
};

// Simple 2D feature map with bias for discrete states 0..6 (terminals included)
static vector<double> phi(int s) {
    // A coarse representation to force approximation:
    // phi(s) = [1, s/6], so only two parameters must fit all nonterminal states.
    return {1.0, static_cast<double>(s) / 6.0};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2); // two parameters: bias and slope

    mt19937 rng(123);
    uniform_int_distribution<int> coin(0, 1); // 0 -> left, 1 -> right

    double alpha = 0.05; // step size for TD
    double gamma = 1.0;  // undiscounted episodic task

    // Train for many episodes
    for (int episode = 0; episode < 5000; ++episode) {
        env.reset();
        bool done = false;
        while (!done) {
            int a = coin(rng) ? +1 : -1; // on-policy random
            int s = env.state;
            auto phis = phi(s);
            auto [sp, r, terminal] = env.step(a);
            auto phisp = phi(sp);
            vf.td0_update(phis, r, phisp, terminal, gamma, alpha);
            done = terminal;
        }
    }

    // Report predicted values for each state (nonterminals 1..5)
    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}

This program evaluates a fixed random policy on a 7-state Random Walk using semi-gradient TD(0) with a very low-dimensional feature map (bias and normalized index). Because the features cannot represent the true tabular values exactly, the algorithm finds the projected Bellman fixed point in the 2D function space.

Time: O(E L d), where E is episodes, L is average episode length, and d=2 features per step
Space: O(d) for parameters
Gradient Monte Carlo policy evaluation on the same Random Walk
#include <bits/stdc++.h>
using namespace std;

struct RandomWalk {
    int state = 3;
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;
    void reset() { state = 3; }
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0;
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w;
    explicit LinearVF(size_t d) : w(d, 0.0) {}
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
    double V(const vector<double>& phi) const { return dot(w, phi); }
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - V(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

static vector<double> phi(int s) { return {1.0, static_cast<double>(s) / 6.0}; }

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2);
    mt19937 rng(7);
    uniform_int_distribution<int> coin(0, 1);
    double alpha = 0.02;
    double gamma = 1.0;

    // Train with every-visit Gradient Monte Carlo
    for (int episode = 0; episode < 4000; ++episode) {
        env.reset();
        vector<int> states;
        vector<double> rewards;
        states.reserve(20);
        rewards.reserve(20);
        bool done = false;
        states.push_back(env.state);
        while (!done) {
            int a = coin(rng) ? +1 : -1; // random policy
            auto [sp, r, terminal] = env.step(a);
            rewards.push_back(r);
            states.push_back(sp);
            done = terminal;
        }
        // Compute returns G_t backward and update once per state visit
        double G = 0.0; // gamma=1 and the terminal reward is already included
        for (int t = (int)states.size() - 2; t >= 0; --t) { // last state is terminal, no update for it
            G = rewards[t] + gamma * G;
            vf.update(phi(states[t]), G, alpha);
        }
    }

    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}

This variant uses Gradient Monte Carlo with full returns as targets, which is unbiased with respect to MSVE but has higher variance than TD. It is stable for on-policy evaluation and short episodes, providing a useful contrast to bootstrapped TD(0).

Time: O(E L d) overall; each state update is O(d)
Space: O(d) for parameters plus O(L) to store an episode
Tags: reinforcement learning · value function approximation · linear function approximator · temporal-difference learning · semi-gradient · MSVE · projected Bellman equation · policy evaluation · stochastic gradient descent · LSTD · eligibility traces · off-policy divergence · deadly triad · feature engineering · actor-critic