How I Study AI
📚 Theory · Intermediate

Value Function Approximation

Key Points

  • Value function approximation replaces a huge table of values with a small set of parameters that can generalize across similar states.
  • We approximate V^π(s) with a parametric model V(s; w) and learn w from data collected under policy π.
  • A common objective is mean-squared value error weighted by the state distribution, and we optimize it by gradient or semi-gradient methods.
  • Linear function approximation uses feature vectors φ(s) so that V(s; w) = w^⊤ φ(s), yielding fast, stable updates.
  • Temporal-Difference learning provides bootstrapped targets and semi-gradient updates: w ← w + α δ_t ∇_w V(s_t; w).
  • With linear features and on-policy sampling, TD(0) converges to the projected Bellman fixed point; off-policy with bootstrapping can diverge (deadly triad).
  • Feature scaling, bias terms, and step-size tuning are crucial for stable and efficient learning.
  • C++ implementations are straightforward for linear approximators and can run in O(d) time per step, where d is the number of features.

Prerequisites

  • Markov Decision Processes (MDPs) and policies — Value functions are defined on MDPs; understanding transitions, rewards, and policies is essential.
  • Probability and expectations — Value functions and learning objectives are defined as expectations over stochastic trajectories.
  • Linear algebra and vector calculus — Features, parameter vectors, gradients, and projections rely on linear algebra.
  • Stochastic Gradient Descent (SGD) — Learning updates for approximators use stochastic gradients and step-size tuning.
  • Temporal-Difference learning — TD provides the bootstrapped targets and update rules used with function approximators.

Detailed Explanation


01 · Overview

Value function approximation is a cornerstone technique in reinforcement learning (RL) for handling large or continuous state spaces where storing a separate value for each state is impossible. Instead of keeping a giant table of V^π(s), we choose a family of functions V(s; w) parameterized by a small vector w. Our goal is to pick w so that V(s; w) closely matches the true value V^π(s), the expected long-term return when following a fixed policy π. This lets the learner generalize: learning about one state also informs nearby or similar states via shared parameters. Approaches range from linear models with hand-designed features to deep neural networks that learn features automatically. Learning typically uses stochastic gradient methods driven by either Monte Carlo targets (full returns) or bootstrapped Temporal-Difference (TD) targets that combine immediate rewards with the model's own predictions for the next state. Linear, on-policy TD has strong convergence guarantees to a specific fixed point (the projected Bellman solution), while non-linear or off-policy settings can be powerful but risk instability. In practice, value approximation reduces memory, speeds up computation, and is essential for control algorithms like actor-critic and DQN variants.

02 · Intuition & Analogies

Imagine you’re estimating the market value of houses in a city. You can’t memorize a price for every address (there are too many), but you can build a model that estimates price from features like size, neighborhood, and age. When you learn the effect of 10 extra square meters in one house, that knowledge generalizes to other houses. Value function approximation is the same idea for RL: we estimate the long-term desirability (value) of states using a model with a few knobs (parameters). If two states share similar features—like similar positions in a grid or similar sensor readings in robotics—they’ll get similar value estimates automatically. Bootstrapping adds another twist: sometimes we don’t have the final selling price (full return) yet, so we use our current estimate of the neighbor’s value to update this one—like pricing a house partly by looking at the estimated prices of nearby houses. This self-consistency speeds learning but can cause feedback loops if we’re careless (bad estimates can reinforce themselves), especially if we try to learn about neighborhoods we rarely or never actually visit under our current policy. Linear models are like using a weighted sum of known features—simple and robust. Neural networks are like complex appraisal models that can discover powerful new features, but they need more care to avoid wild price swings. In all cases, the art is picking features (or architectures), step sizes, and targets so that we learn fast without destabilizing the estimates.

03 · Formal Definition

Consider an MDP (S, A, P, r, γ) and a fixed policy π. The state-value function is V^π(s) = E_π[ ∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s ]. We approximate V^π with a parameterized family F = { V(·; w) : w ∈ R^d }. A standard objective is the mean-squared value error (MSVE) weighted by the on-policy state distribution d_π: J(w) = E_{s∼d_π}[ (V^π(s) − V(s; w))^2 ]. In practice, V^π is unknown, so we use data-driven targets. With Monte Carlo, targets are returns G_t; with TD(0), targets are r_{t+1} + γ V(s_{t+1}; w). For linear function approximation, V(s; w) = w^⊤ φ(s), where φ(s) ∈ R^d is a feature vector. Semi-gradient TD(0) performs stochastic approximation of the projected Bellman fixed point: w_{t+1} = w_t + α δ_t ∇_w V(s_t; w_t), where δ_t = r_{t+1} + γ V(s_{t+1}; w_t) − V(s_t; w_t). For linear models, ∇_w V = φ(s_t), giving w_{t+1} = w_t + α δ_t φ(s_t). Under on-policy sampling and suitable conditions, the iterates converge to the solution of the projected Bellman equation V_{w*} = Π T_π V_{w*}, where Π projects onto F under the d_π-weighted norm.

04 · When to Use

Use value function approximation when the state space is very large, continuous, or when you want generalization across states. It is ideal for robotics with continuous sensors, finance with high-dimensional features, and board games with too many distinct positions to tabulate. Linear approximation is preferred when you need fast, stable learning with interpretable features (e.g., tile coding, radial basis functions, hand-crafted statistics). Semi-gradient TD(0) is the go-to on-policy method for continual prediction in streaming settings where you cannot wait for complete returns. Monte Carlo methods are attractive when episodes are short and you can tolerate higher variance for unbiased targets. Least-Squares TD (LSTD) is useful when data are limited but you can afford O(d^2) updates for improved sample efficiency. For control (improving policies), value approximators power actor-critic (critic estimates V or Q) and Q-learning variants; non-linear approximators like deep networks are standard when the raw state is high-dimensional (images, audio) and you can use techniques (experience replay, target networks) to stabilize learning. Avoid bootstrapping with off-policy sampling and function approximation unless you add safeguards (importance sampling with variance control, gradient TD, emphatic TD, or target networks) due to potential divergence.

⚠️Common Mistakes

• Omitting a bias feature: Without a constant feature (φ₀ = 1), linear models can be systematically biased. Add a bias term or center features.
• Poor feature scaling: Large-magnitude features cause unstable gradients. Normalize or standardize features so each dimension has similar scale.
• Step size too large: TD methods are sensitive to α. Use small learning rates, decay schedules, or adaptive methods; monitor the TD error.
• Deadly triad: Combining function approximation, bootstrapping, and off-policy sampling can diverge. If off-policy, use safer algorithms (e.g., Gradient TD, Emphatic TD) or target networks with replay.
• Bootstrapping with terminal states: Forgetting to zero out the bootstrap at terminal states inflates values. Ensure V(s_{t+1}) = 0 when done = true.
• Mixing policies: Using data from one policy to evaluate another without correction biases estimates. Use importance sampling or evaluate on-policy.
• High-variance Monte Carlo: Using full returns in long or continuing tasks produces huge variance. Prefer TD methods or use eligibility traces (λ) to trade bias against variance.
• Data leakage and overfitting: In batch settings with limited trajectories, excessive training on the same data can overfit. Use validation, regularization, or more data.
• Wrong discounting: Ensure the same γ is used in targets and definitions; continuing tasks may need average-reward formulations.
• Ignoring stochasticity: Treating δ_t as a deterministic gradient can be misleading; it is a noisy estimate, so use many samples and monitor learning curves.

Key Formulas

State-value definition

V^π(s) = E_π[ ∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s ]

Explanation: This defines the expected discounted return starting from state s and following policy π. The expectation is over trajectories induced by π and the environment dynamics.

Linear value approximator

V(s; w) = w^⊤ φ(s)

Explanation: The approximated value is a weighted sum of features of the state. It is linear in parameters, enabling fast and stable updates.

MSVE objective

J(w) = E_{s∼d_π}[ (V^π(s) − V(s; w))^2 ]

Explanation: This is the mean-squared value error under the on-policy state distribution. Minimizing it finds the best approximation within the function class.

Gradient of MSVE

∇_w J(w) = −2 E_{d_π}[ (V^π(s) − V(s; w)) ∇_w V(s; w) ]

Explanation: This gradient guides how to adjust parameters to reduce squared error. In linear models, the gradient simplifies because ∇ V is just the feature vector.

TD error

δ_t = r_{t+1} + γ V(s_{t+1}; w) − V(s_t; w)

Explanation: The TD error measures the mismatch between current prediction and a one-step bootstrapped target. It drives online learning in TD methods.

Semi-gradient TD(0) update

w_{t+1} = w_t + α δ_t ∇_w V(s_t; w_t)

Explanation: We update parameters in the direction that would reduce the TD error for the current state. For linear features, the increment is proportional to the feature vector.

Projected Bellman equation

V_{w*} = Π T_π V_{w*}

Explanation: The solution for linear, on-policy TD equals the projection of the Bellman backup onto the function space. This explains convergence to a particular fixed point.

MSPBE

MSPBE(w) = ‖V_w − Π T_π V_w‖²_{d_π}

Explanation: The mean-squared projected Bellman error measures how far a value function is from satisfying the projected Bellman equation. Gradient TD algorithms minimize this quantity.

LSTD solution

A = E[ φ_t (φ_t − γ φ_{t+1})^⊤ ],  b = E[ r_{t+1} φ_t ],  w = A⁻¹ b

Explanation: Least-Squares TD solves a linear system to find weights that best satisfy the TD fixed-point conditions. It is more sample-efficient but computationally heavier.

Gradient Monte Carlo

G_t = ∑_{k=0}^∞ γ^k r_{t+k+1},  w ← w + α (G_t − V(s_t; w)) ∇_w V(s_t; w)

Explanation: Using full returns as targets yields an unbiased gradient step on MSVE. Variance can be high for long horizons, so it suits short episodes.

Complexity Analysis

Let d be the number of features and n the number of samples (transitions). For linear function approximation with either supervised targets or Monte Carlo/TD targets, each prediction and gradient evaluation is O(d) time and O(d) memory (to store the weights). A single online update thus costs O(d); processing n samples is O(nd). If features are sparse with k nonzeros, operations reduce to O(k). Batch gradient descent with mini-batches of size B costs O(Bd) per step. For TD(0), each step computes a TD error and performs one weight update; this remains O(d) with negligible constant factors. In episodic tasks with average episode length L, a pass over M episodes is O(MLd). Memory footprint is dominated by the parameter vector (O(d)) plus temporary feature storage; streaming implementations require no additional buffers.

Least-Squares TD (LSTD) and related second-order methods improve sample efficiency but increase computation. Maintaining and inverting the A matrix naïvely is O(d^3) for a batch, or O(d^2) per step with recursive updates; memory becomes O(d^2). These methods are practical for modest d (hundreds) but not for very high-dimensional features.

For non-linear approximators (e.g., neural networks with P parameters), per-step cost is O(P) for forward and backward passes, with memory O(P). Stabilization tricks (target networks, replay) can add O(nd) storage to keep a buffer of transitions. Overall, linear semi-gradient TD achieves a favorable O(d) per-step cost, making it well-suited for real-time applications and embedded systems when d is moderate.

Code Examples

Linear value function with supervised (Monte Carlo) targets via SGD
#include <bits/stdc++.h>
using namespace std;

// Simple linear value function V(s; w) = w^T phi(s)
struct LinearValueFunction {
    vector<double> w; // parameters
    explicit LinearValueFunction(size_t d) : w(d, 0.0) {}

    // Dot product w^T x
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    double predict(const vector<double>& phi) const {
        return dot(w, phi);
    }

    // One SGD step on squared error with target y: w <- w + alpha * (y - w^T phi) * phi
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - predict(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

// Example feature map for a 1D state s in [0,1]: polynomial features with bias
static vector<double> features(double s) {
    return {1.0, s, s * s}; // bias + s + s^2
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We "pretend" Monte Carlo targets approximate V^pi(s) = sin(pi s).
    // In practice, these targets would come from episodic returns G_t.
    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    LinearValueFunction vf(3);
    double alpha = 0.1; // step size

    // Train with SGD on synthetic (s, target) pairs
    for (int it = 0; it < 20000; ++it) {
        double s = U(rng);
        double target = sin(M_PI * s); // stand-in for Monte Carlo return
        vector<double> phi = features(s);
        vf.update(phi, target, alpha);
    }

    // Evaluate the learned approximator
    for (double s : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        vector<double> phi = features(s);
        double pred = vf.predict(phi);
        cout << fixed << setprecision(4)
             << "s=" << s << " -> V_hat=" << pred
             << " (true ~ " << sin(M_PI * s) << ")\n";
    }

    return 0;
}

This C++ program implements a linear value function approximator and trains it with stochastic gradient descent on supervised targets that mimic Monte Carlo returns. The features include a bias term and polynomial terms of the state. After training, predictions are printed at several states to show generalization.

Time: O(n d) for n samples and d features
Space: O(d) for parameters plus O(d) temporary storage
Semi-gradient TD(0) policy evaluation on a Random Walk with linear features
#include <bits/stdc++.h>
using namespace std;

// Random Walk environment (7 states: 0 and 6 terminal; start at 3)
// Actions: -1 (left) or +1 (right); policy: choose uniformly at random.
struct RandomWalk {
    int state = 3; // start in the middle
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;

    void reset() { state = 3; }

    // Step with action a in {-1, +1}
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0; // reward only on right terminal
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w; // weights
    explicit LinearVF(size_t d) : w(d, 0.0) {}

    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    double V(const vector<double>& phi) const { return dot(w, phi); }

    // Semi-gradient TD(0): w <- w + alpha * delta * phi(s)
    void td0_update(const vector<double>& phi_s, double r, const vector<double>& phi_sp,
                    bool terminal, double gamma, double alpha) {
        double v_s = V(phi_s);
        double v_sp = terminal ? 0.0 : V(phi_sp);
        double delta = r + gamma * v_sp - v_s;
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * delta * phi_s[i];
    }
};

// Simple 2D feature map with bias for discrete states 0..6 (terminals included)
static vector<double> phi(int s) {
    // A coarse representation to force approximation:
    // phi(s) = [1, s/6], so only two parameters must fit all nonterminal states.
    return {1.0, static_cast<double>(s) / 6.0};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2); // two parameters: bias and slope

    mt19937 rng(123);
    uniform_int_distribution<int> coin(0, 1); // 0 -> left, 1 -> right

    double alpha = 0.05; // step size for TD
    double gamma = 1.0;  // undiscounted episodic task

    // Train for many episodes
    for (int episode = 0; episode < 5000; ++episode) {
        env.reset();
        bool done = false;
        while (!done) {
            int a = coin(rng) ? +1 : -1; // on-policy random
            int s = env.state;
            auto phis = phi(s);
            auto [sp, r, terminal] = env.step(a);
            auto phisp = phi(sp);
            vf.td0_update(phis, r, phisp, terminal, gamma, alpha);
            done = terminal;
        }
    }

    // Report predicted values for each state (nonterminals 1..5)
    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}

This program evaluates a fixed random policy on a 7-state Random Walk using semi-gradient TD(0) with a very low-dimensional feature map (bias and normalized index). Because the features cannot represent the true tabular values exactly, the algorithm finds the projected Bellman fixed point in the 2D function space.

Time: O(E L d), where E is episodes, L is average episode length, and d=2 features per step
Space: O(d) for parameters
Gradient Monte Carlo policy evaluation on the same Random Walk
#include <bits/stdc++.h>
using namespace std;

struct RandomWalk {
    int state = 3;
    static constexpr int LEFT_TERM = 0;
    static constexpr int RIGHT_TERM = 6;
    void reset() { state = 3; }
    tuple<int, double, bool> step(int a) {
        int next = state + a;
        bool done = (next == LEFT_TERM) || (next == RIGHT_TERM);
        double reward = (next == RIGHT_TERM) ? 1.0 : 0.0;
        state = next;
        return {state, reward, done};
    }
};

struct LinearVF {
    vector<double> w;
    explicit LinearVF(size_t d) : w(d, 0.0) {}
    static double dot(const vector<double>& a, const vector<double>& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
    double V(const vector<double>& phi) const { return dot(w, phi); }
    void update(const vector<double>& phi, double target, double alpha) {
        double err = target - V(phi);
        for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * err * phi[i];
    }
};

static vector<double> phi(int s) { return {1.0, static_cast<double>(s) / 6.0}; }

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    LinearVF vf(2);
    mt19937 rng(7);
    uniform_int_distribution<int> coin(0, 1);
    double alpha = 0.02;
    double gamma = 1.0;

    // Train with every-visit Gradient Monte Carlo
    for (int episode = 0; episode < 4000; ++episode) {
        env.reset();
        vector<int> states;
        vector<double> rewards;
        states.reserve(20);
        rewards.reserve(20);
        bool done = false;
        states.push_back(env.state);
        while (!done) {
            int a = coin(rng) ? +1 : -1; // random policy
            auto [sp, r, terminal] = env.step(a);
            rewards.push_back(r);
            states.push_back(sp);
            done = terminal;
        }
        // Compute returns G_t backward and update once per state visit
        double G = 0.0; // gamma=1 and the terminal reward is already included
        for (int t = (int)states.size() - 2; t >= 0; --t) { // last state is terminal, no update for it
            G = rewards[t] + gamma * G;
            vf.update(phi(states[t]), G, alpha);
        }
    }

    cout << fixed << setprecision(4);
    for (int s = 0; s <= 6; ++s) {
        cout << "s=" << s << ": V_hat=" << vf.V(phi(s)) << (s == 6 ? " (goal)" : "") << "\n";
    }
    return 0;
}

This variant uses Gradient Monte Carlo with full returns as targets, which is unbiased with respect to MSVE but has higher variance than TD. It is stable for on-policy evaluation and short episodes, providing a useful contrast to bootstrapped TD(0).

Time: O(E L d) overall; each state update is O(d)
Space: O(d) for parameters plus O(L) to store an episode
Tags: reinforcement learning · value function approximation · linear function approximator · temporal-difference learning · semi-gradient · MSVE · projected Bellman equation · policy evaluation · stochastic gradient descent · LSTD · eligibility traces · off-policy divergence · deadly triad · feature engineering · actor-critic