Temporal Difference Learning
Key Points
- Temporal Difference (TD) Learning updates value estimates by bootstrapping from the next state's current estimate, enabling fast, online learning.
- TD(0) uses the simple update V(s) ← V(s) + α[r + γV(s') − V(s)], driven by the TD error δ = r + γV(s') − V(s).
- It combines advantages of Dynamic Programming (bootstrapping) and Monte Carlo (learning from experience without a model).
- TD(0) learns after every single step, so it works well in streaming or continuing tasks where episodes can be long or unbounded.
- Tabular TD(0) has O(1) time per step and O(|S|) memory; with linear features it has O(d) time and O(d) memory, where d is the number of features.
- On-policy control with TD(0) is commonly implemented as SARSA(0), which learns Q(s,a) while following an ε-greedy policy.
- TD methods can diverge when combining function approximation, bootstrapping, and off-policy learning (the deadly triad); take care with such settings.
- Learning rates and discounts matter: choose α carefully and set V(terminal)=0 to avoid bias and instability.
Prerequisites
- Markov Decision Processes (MDPs) — TD(0) is defined on MDPs; you need states, actions, rewards, transitions, and discounting.
- Bellman Equations and Dynamic Programming — TD(0) is a stochastic approximation to the Bellman expectation equation.
- Monte Carlo Policy Evaluation — Understanding the contrast between full-return updates and TD bootstrapping clarifies TD’s bias-variance trade-off.
- Stochastic Approximation and Learning Rates — Convergence and stability of TD(0) depend on step-size schedules and noise properties.
- Linear Function Approximation — Extending TD(0) to large state spaces requires parameterized value functions and gradients.
Detailed Explanation
01 Overview
Temporal Difference (TD) Learning is a family of reinforcement learning methods that update value estimates from incomplete experience by bootstrapping—using existing estimates to refine themselves. The simplest form, TD(0), updates the value of the current state after one step by comparing the observed reward plus the discounted estimate of the next state to the current estimate. This comparison is the TD error. Unlike Dynamic Programming, TD does not need a model of the environment’s transitions, and unlike Monte Carlo methods, it does not need to wait until the end of an episode to learn. TD(0) is especially powerful in online, streaming scenarios where immediate, incremental updates are beneficial. It applies to both state values V(s) (policy evaluation) and action values Q(s,a) (control) with small modifications. TD methods are computationally lightweight per step and can be extended with function approximation to handle large or continuous state spaces. However, care is needed with learning rates, policy choices (on-policy vs off-policy), and stability when combining with function approximation. TD(0) serves as a foundational algorithm that leads to deeper methods such as n-step TD, eligibility traces (TD(λ)), SARSA, and Q-learning.
02 Intuition & Analogies
Imagine you’re hiking to a mountain cabin for the first time. You’re trying to guess how long it will take to reach the cabin from your current point on the trail. After walking five minutes, you look at your watch and your map, and revise your guess based on where you are now and how much you’ve already walked. You don’t need to finish the hike to update your estimate—you just use your current rough plan (map) and short-term experience to refine your guess. That’s the essence of TD learning: you constantly correct your expectations using partial progress and your current beliefs. Another analogy: navigating traffic with live GPS. The GPS predicts your arrival time. As you drive and observe actual speeds and new conditions, the GPS updates the ETA immediately using its current traffic model. It doesn’t wait until you arrive to learn. TD’s bootstrapping is like using the current model estimate (the next state’s value) to improve the present estimate after every observation. Contrast this with Monte Carlo, which is like waiting until the trip ends before you revise your estimate for the entire route—accurate but slow to react. Dynamic Programming is like having a perfect simulator of roads and traffic; you could compute ideal ETAs without actually driving, but only if you know the whole system. TD sits in the sweet spot: it learns from real experience without a model and reacts quickly, making it practical for real-time decision making.
03 Formal Definition
Consider an MDP with state set S, a fixed policy π, and discount factor γ ∈ [0, 1]. TD(0) estimates V^{π} from sampled transitions (s, r, s'): after each step it forms the TD target r + γV(s'), computes the TD error δ = r + γV(s') − V(s), and updates V(s) ← V(s) + αδ, where α ∈ (0, 1] is the step size and V(s') is taken to be 0 when s' is terminal. This is a stochastic-approximation step toward the fixed point of the Bellman expectation equation V^{π}(s) = E_{π}[r + γV^{π}(s') | s].
04 When to Use
Use TD(0) when you learn from real or simulated experience without a full model and want immediate, incremental improvements. It excels in:
- Online policy evaluation: estimating V^{π} while executing π in environments where episodes are long or continuing.
- Streaming and real-time control: robotics or games where updates must happen every time step.
- Sample-limited settings: when data is precious, TD’s bootstrapping often learns faster than Monte Carlo because it reuses estimates.
- Large state spaces: combine TD(0) with function approximation (e.g., linear features) to generalize across similar states.
- On-policy control (SARSA): learning Q(s,a) and improving the ε-greedy policy simultaneously for safer learning than off-policy methods in stochastic domains.

Avoid plain TD(0) when strong off-policy learning with function approximation is required, as it may diverge. In such cases, prefer algorithms with stability guarantees (e.g., gradient TD methods) or ensure sufficient on-policy sampling and well-behaved features.
⚠️ Common Mistakes
- Wrong terminal handling: forgetting to set V(terminal) = 0 (or the true terminal value) causes biased bootstrapping. Always special-case terminal transitions by using V(s') = 0.
- Mis-tuned learning rate α: too large leads to divergence or oscillation; too small slows learning. Consider decaying schedules and monitor the TD error’s variance.
- Mixing policies unintentionally: evaluating a target policy π while following a different behavior policy without correction can bias results. For plain TD(0), stay on-policy unless you use importance sampling or specialized off-policy TD.
- Feature scaling issues: with function approximation, unnormalized features cause unstable updates and slow convergence. Standardize or bound features, and consider a smaller α proportional to 1/‖x(s)‖^{2}.
- Ignoring exploration in control: in SARSA(0), failing to use ε-greedy or annealed exploration can trap learning in suboptimal behaviors.
- Not resetting episode state: carrying over transient variables or traces (in TD(λ)) across episodes incorrectly skews learning.
- Overinterpreting early values: TD updates are biased but low-variance; early estimates can be systematically off. Track performance trends, not single-episode spikes.
- Off-policy with function approximation: the deadly triad (bootstrapping + function approximation + off-policy) can cause divergence; use on-policy methods, gradient TD, or target networks to mitigate.
Key Formulas
State Value
V^{π}(s) = E_{π}[ Σ_{k=0}^{∞} γ^{k} r_{t+k+1} | s_t = s ]
Explanation: The value of a state under policy π is the expected discounted sum of future rewards starting from that state. The discount γ makes distant rewards count less.
TD Error
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
Explanation: The TD error measures how surprising the new observation is compared to the current estimate. It drives the direction and magnitude of the TD update.
TD(0) Update
V(s_t) ← V(s_t) + α δ_t = V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
Explanation: We adjust the current value estimate by a fraction α of the TD error. Larger α makes learning faster but can be unstable.
Bellman Expectation Equation
V^{π}(s) = Σ_{a} π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^{π}(s') ]
Explanation: This recursive equation characterizes the true value function under a fixed policy. TD(0) aims to solve it by stochastic approximation.
SARSA(0) Update
Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
Explanation: On-policy control version of TD(0) that updates action values using the next action actually taken under the behavior policy.
Linear TD(0)
V̂(s; w) = w^{T} x(s),  w ← w + α δ_t x(s_t)
Explanation: With linear features x(s), the value estimate is a dot product with weights w. The gradient step moves w in proportion to the TD error times the feature vector.
Robbins–Monro Conditions
Σ_{t} α_t = ∞ and Σ_{t} α_t^{2} < ∞
Explanation: For convergence of stochastic approximation, learning rates should decrease but not too fast. These conditions are common in proofs of TD(0) convergence.
n-step Return
G_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^{n} V(s_{t+n})
Explanation: The n-step target generalizes TD(0) (n=1) and Monte Carlo (n=episode length). It trades bias and variance.
MSE and Semi-Gradient
w ← w + α δ_t ∇_w V̂(s_t; w), with δ_t = r_{t+1} + γ V̂(s_{t+1}; w) − V̂(s_t; w)
Explanation: Linear TD(0) can be interpreted as a semi-gradient method minimizing mean squared value error under a projection. The update uses δ times features.
Complexity Analysis
Tabular TD(0) performs O(1) work per transition and stores O(|S|) values (O(|S|·|A|) for SARSA(0)). With linear function approximation over d features, each update costs O(d) time and the weight vector needs O(d) memory, independent of the size of the state space.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// 1D Random Walk: states 0..6, where 0 and 6 are terminal.
// Start at state 3 each episode. Policy: move left/right with 0.5 each.
// Reward: +1 only when reaching state 6 (right terminal), else 0.
// Goal: Estimate V(s) for non-terminal states using TD(0).

struct RandomWalk {
    static constexpr int N = 7; // states 0..6
    int start = 3;
    int s;
    mt19937 rng{12345};
    uniform_real_distribution<double> uni{0.0, 1.0};

    void reset() { s = start; }

    // Step under the fixed policy: left/right with 0.5 each.
    // Returns (next_state, reward, done).
    tuple<int,double,bool> step() {
        bool go_right = (uni(rng) < 0.5);
        int s_next = s + (go_right ? 1 : -1);
        double r = 0.0;
        bool done = false;
        if (s_next <= 0) { s_next = 0; done = true; r = 0.0; }
        if (s_next >= N-1) { s_next = N-1; done = true; r = 1.0; }
        s = s_next;
        return {s_next, r, done};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    vector<double> V(RandomWalk::N, 0.5); // uniform initial guess for non-terminals
    V[0] = 0.0; V[RandomWalk::N-1] = 0.0; // terminal values fixed to 0

    double alpha = 0.1;   // learning rate
    double gamma = 1.0;   // undiscounted episodic task
    int episodes = 5000;  // number of training episodes

    for (int ep = 0; ep < episodes; ++ep) {
        env.reset();
        bool done = false;
        while (!done) {
            int s = env.s; // current state
            auto [s_next, r, terminal] = env.step();

            double target = r + (terminal ? 0.0 : gamma * V[s_next]);
            double delta = target - V[s]; // TD error
            V[s] += alpha * delta;        // TD(0) update

            done = terminal;
        }
    }

    cout << fixed << setprecision(3);
    cout << "Estimated V(s):\n";
    for (int s = 0; s < RandomWalk::N; ++s) {
        cout << "s=" << s << " -> V=" << V[s] << "\n";
    }
    return 0;
}
```
This program evaluates a fixed symmetric policy on a 1D random walk using TD(0). After each transition, it computes the TD target r + γV(s') (with V=0 at terminals) and updates V(s) by a fraction α of the TD error. Because updates happen every step, learning is fast and online.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple 4x4 Gridworld, start at (0,0), goal at (3,3).
// Deterministic moves: up=0, right=1, down=2, left=3.
// Reward: -1 per step until reaching goal (reward 0 at goal). Episodic.
// Learn Q(s,a) using SARSA(0) with an epsilon-greedy policy.

struct Gridworld {
    int H = 4, W = 4;
    pair<int,int> start{0,0};
    pair<int,int> goal{3,3};
    pair<int,int> s;

    void reset() { s = start; }

    bool is_terminal(pair<int,int> p) const { return p == goal; }

    // Apply action; return (next_state_index, reward, done).
    tuple<int,double,bool> step(int a) {
        int r = s.first, c = s.second;
        if (a == 0) r = max(0, r-1);
        if (a == 1) c = min(W-1, c+1);
        if (a == 2) r = min(H-1, r+1);
        if (a == 3) c = max(0, c-1);
        s = {r,c};
        bool done = is_terminal(s);
        double rew = done ? 0.0 : -1.0; // encourage shortest path
        int idx = r*W + c;
        return {idx, rew, done};
    }

    int state_index() const { return s.first*W + s.second; }
    int n_states() const { return H*W; }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Gridworld env;
    int S = env.n_states();
    int A = 4;

    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    double alpha = 0.1;
    double gamma = 0.99;
    double eps = 0.1; // epsilon-greedy exploration
    int episodes = 5000;
    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);

    auto greedy_action = [&](int s)->int{
        // Break ties randomly among maximal actions.
        vector<int> best; double bestv = -1e18;
        for (int a = 0; a < A; ++a){
            if (Q[s][a] > bestv + 1e-12) { bestv = Q[s][a]; best = {a}; }
            else if (abs(Q[s][a]-bestv) <= 1e-12) best.push_back(a);
        }
        uniform_int_distribution<int> pick(0, (int)best.size()-1);
        return best[pick(rng)];
    };

    auto eps_greedy = [&](int s)->int{
        if (uni(rng) < eps) { uniform_int_distribution<int> pick(0, A-1); return pick(rng); }
        return greedy_action(s);
    };

    for (int ep = 0; ep < episodes; ++ep){
        env.reset();
        int s = env.state_index();
        int a = eps_greedy(s);
        bool done = false;
        while (!done){
            auto [s_next, r, terminal] = env.step(a);
            int a_next = terminal ? 0 : eps_greedy(s_next);

            double td_target = r + (terminal ? 0.0 : gamma * Q[s_next][a_next]);
            double td_error = td_target - Q[s][a];
            Q[s][a] += alpha * td_error; // SARSA(0) update

            s = s_next; a = a_next; done = terminal;
        }
        // Optionally decay epsilon over time:
        // eps = max(0.01, eps * 0.999);
    }

    // Derive the greedy policy and print the best action index per cell.
    cout << "Greedy action (0=U,1=R,2=D,3=L) per cell:\n";
    for (int r = 0; r < env.H; ++r){
        for (int c = 0; c < env.W; ++c){
            int s = r*env.W + c;
            int a = 0; double best = -1e18;
            for (int i = 0; i < A; ++i) if (Q[s][i] > best){ best = Q[s][i]; a = i; }
            cout << a << ' ';
        }
        cout << '\n';
    }
    return 0;
}
```
This example learns an action-value function with SARSA(0), an on-policy TD(0) control method. The ε-greedy policy selects actions, and updates use the next action actually taken. With a per-step penalty and zero terminal reward, the learned policy approximates the shortest path to the goal.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Evaluate a fixed policy on the 1D random walk using linear function approximation.
// Feature vector x(s) = [1, s_norm, s_norm^2] for non-terminals; x(terminal) = 0.

struct RandomWalkFA {
    static constexpr int N = 7; // states 0..6, terminals at 0 and 6
    int start = 3;
    int s;
    mt19937 rng{7};
    uniform_real_distribution<double> uni{0.0, 1.0};

    void reset(){ s = start; }
    tuple<int,double,bool> step(){
        bool right = (uni(rng) < 0.5);
        int s_next = s + (right ? 1 : -1);
        double r = 0.0; bool done = false;
        if (s_next <= 0){ s_next = 0; done = true; r = 0.0; }
        if (s_next >= N-1){ s_next = N-1; done = true; r = 1.0; }
        s = s_next; return {s_next, r, done};
    }
};

static inline vector<double> features(int s){
    // Map the state index to [0,1]; terminals get zero features.
    if (s == 0 || s == RandomWalkFA::N-1) return {0.0, 0.0, 0.0};
    double x = (double)s / (RandomWalkFA::N - 1); // in (0,1)
    return {1.0, x, x*x};
}

static inline double dot(const vector<double>& a, const vector<double>& b){
    double v = 0;
    for (size_t i = 0; i < a.size(); ++i) v += a[i]*b[i];
    return v;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkFA env;

    // Weights for features [1, x, x^2].
    vector<double> w = {0.0, 0.0, 0.0};

    double alpha = 0.05;  // smaller step size under function approximation
    double gamma = 1.0;   // undiscounted episodic task
    int episodes = 10000;

    for (int ep = 0; ep < episodes; ++ep){
        env.reset();
        bool done = false;
        while (!done){
            int s = env.s;
            auto x_s = features(s);
            double v_s = dot(w, x_s);

            auto [s_next, r, terminal] = env.step();
            auto x_sn = features(s_next);
            double v_sn = dot(w, x_sn); // zero if terminal, by construction of the features

            double delta = r + (terminal ? 0.0 : gamma * v_sn) - v_s;
            // Semi-gradient TD(0): w <- w + alpha * delta * x(s)
            for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * delta * x_s[i];

            done = terminal;
        }
    }

    cout.setf(std::ios::fixed); cout << setprecision(4);
    for (int s = 0; s < RandomWalkFA::N; ++s){
        double v = dot(w, features(s));
        cout << "s=" << s << " V_hat=" << v << '\n';
    }
    return 0;
}
```
This program replaces the tabular value function with a linear approximator using polynomial features of the state index. TD(0) updates the weight vector w via the semi-gradient rule w ← w + α δ x(s). Terminal features are zero so that V(terminal)=0 by construction.