⚙️ Algorithm · Intermediate

Temporal Difference Learning

Key Points

  • Temporal Difference (TD) Learning updates value estimates by bootstrapping from the next state's current estimate, enabling fast, online learning.
  • TD(0) uses the simple update V(s) ← V(s) + α[r + γV(s′) − V(s)], driven by the TD error δ = r + γV(s′) − V(s).
  • It combines advantages of Dynamic Programming (bootstrapping) and Monte Carlo (learning from experience without a model).
  • TD(0) learns after every single step, so it works well in streaming or continuing tasks where episodes can be long or unbounded.
  • Tabular TD(0) has O(1) time per step and O(|S|) memory; with linear features it has O(d) time and O(d) memory, where d is the number of features.
  • On-policy control with TD(0) is commonly implemented as SARSA(0), which learns Q(s,a) while following an ε-greedy policy.
  • TD methods can diverge when combining function approximation, bootstrapping, and off-policy learning (the deadly triad), so choose these settings with care.
  • Learning rates and discounts matter: choose α carefully and set V(terminal) = 0 to avoid bias and instability.

Prerequisites

  • Markov Decision Processes (MDPs) — TD(0) is defined on MDPs; you need states, actions, rewards, transitions, and discounting.
  • Bellman Equations and Dynamic Programming — TD(0) is a stochastic approximation to the Bellman expectation equation.
  • Monte Carlo Policy Evaluation — Understanding the contrast between full-return updates and TD bootstrapping clarifies TD’s bias-variance trade-off.
  • Stochastic Approximation and Learning Rates — Convergence and stability of TD(0) depend on step-size schedules and noise properties.
  • Linear Function Approximation — Extending TD(0) to large state spaces requires parameterized value functions and gradients.

Detailed Explanation


01 Overview

Temporal Difference (TD) Learning is a family of reinforcement learning methods that update value estimates from incomplete experience by bootstrapping—using existing estimates to refine themselves. The simplest form, TD(0), updates the value of the current state after one step by comparing the observed reward plus the discounted estimate of the next state to the current estimate. This comparison is the TD error. Unlike Dynamic Programming, TD does not need a model of the environment’s transitions, and unlike Monte Carlo methods, it does not need to wait until the end of an episode to learn. TD(0) is especially powerful in online, streaming scenarios where immediate, incremental updates are beneficial. It applies to both state values V(s) (policy evaluation) and action values Q(s,a) (control) with small modifications. TD methods are computationally lightweight per step and can be extended with function approximation to handle large or continuous state spaces. However, care is needed with learning rates, policy choices (on-policy vs off-policy), and stability when combining with function approximation. TD(0) serves as a foundational algorithm that leads to deeper methods such as n-step TD, eligibility traces (TD(λ)), SARSA, and Q-learning.

02 Intuition & Analogies

Imagine you’re hiking to a mountain cabin for the first time. You’re trying to guess how long it will take to reach the cabin from your current point on the trail. After walking five minutes, you look at your watch and your map, and revise your guess based on where you are now and how much you’ve already walked. You don’t need to finish the hike to update your estimate—you just use your current rough plan (map) and short-term experience to refine your guess. That’s the essence of TD learning: you constantly correct your expectations using partial progress and your current beliefs. Another analogy: navigating traffic with live GPS. The GPS predicts your arrival time. As you drive and observe actual speeds and new conditions, the GPS updates the ETA immediately using its current traffic model. It doesn’t wait until you arrive to learn. TD’s bootstrapping is like using the current model estimate (the next state’s value) to improve the present estimate after every observation. Contrast this with Monte Carlo, which is like waiting until the trip ends before you revise your estimate for the entire route—accurate but slow to react. Dynamic Programming is like having a perfect simulator of roads and traffic; you could compute ideal ETAs without actually driving, but only if you know the whole system. TD sits in the sweet spot: it learns from real experience without a model and reacts quickly, making it practical for real-time decision making.

03 Formal Definition

Consider a Markov Decision Process (MDP) with states S, actions A, transition dynamics P(s′|s,a), rewards R(s,a,s′), and discount factor γ ∈ [0,1). For a fixed policy π(a|s), the state-value function is Vπ(s) = Eπ[ ∑_{t=0}^{∞} γ^t r_t | s_0 = s ]. TD(0) estimates Vπ incrementally from sample transitions (s, r, s′). After observing a transition generated by π, TD(0) performs V(s) ← V(s) + αδ, where the TD error is δ = r + γV(s′) − V(s). This is a stochastic approximation of the Bellman expectation equation Vπ(s) = Eπ[ r + γVπ(s′) | s ]. For action values in on-policy control (SARSA(0)), the update becomes Q(s,a) ← Q(s,a) + α[ r + γQ(s′,a′) − Q(s,a) ], where a′ is drawn from π(·|s′). With linear function approximation V_w(s) = wᵀx(s), TD(0) updates the parameters as w ← w + αδx(s), using the same δ but with V_w(s′) in place of V(s′). Under suitable conditions (e.g., on-policy sampling, appropriate α schedules), tabular TD(0) converges to Vπ. With linear features, TD(0) converges to the fixed point of the Bellman operator composed with projection onto the span of the features under the sampling distribution.

04 When to Use

Use TD(0) when you learn from real or simulated experience without a full model and want immediate, incremental improvements. It excels in:

  • Online policy evaluation: estimating V^{π} while executing π in environments where episodes are long or continuing.
  • Streaming and real-time control: robotics or games where updates must happen every time step.
  • Sample-limited settings: when data is precious, TD’s bootstrapping often learns faster than Monte Carlo because it reuses estimates.
  • Large state spaces: combine TD(0) with function approximation (e.g., linear features) to generalize across similar states.
  • On-policy control (SARSA): learning Q(s,a) and improving the ε-greedy policy simultaneously, for safer learning than off-policy methods in stochastic domains.

Avoid plain TD(0) when strong off-policy learning with function approximation is required, as it may diverge. In such cases, prefer algorithms with stability guarantees (e.g., gradient TD methods) or ensure sufficient on-policy sampling and well-behaved features.

⚠️Common Mistakes

  • Wrong terminal handling: forgetting to set V(terminal) = 0 (or the true terminal value) causes biased bootstrapping. Always special-case terminal transitions by using V(s') = 0.
  • Mis-tuned learning rate α: too large leads to divergence or oscillation; too small slows learning. Consider decaying schedules and monitor the TD error’s variance.
  • Mixing policies unintentionally: evaluating a target policy π while following a different behavior policy without correction can bias results. For plain TD(0), stay on-policy unless you use importance sampling or specialized off-policy TD.
  • Feature scaling issues: with function approximation, unnormalized features cause unstable updates and slow convergence. Standardize or bound features, and consider smaller α proportional to 1/|x|^{2}.
  • Ignoring exploration in control: in SARSA(0), failing to use ε-greedy or annealed exploration can trap learning in suboptimal behaviors.
  • Not resetting episode state: carrying over transient variables or traces (in TD(λ)) across episodes incorrectly skews learning.
  • Overinterpreting early values: TD updates are biased but low-variance; early estimates can be systematically off. Track performance trends, not single-episode spikes.
  • Off-policy with function approximation: the deadly triad (bootstrapping + function approximation + off-policy) can cause divergence; use on-policy methods, gradient TD, or target networks to mitigate.

Key Formulas

State Value

Vπ(s) = Eπ[ ∑_{t=0}^{∞} γ^t r_t | s_0 = s ]

Explanation: The value of a state under policy π is the expected discounted sum of future rewards starting from that state. The discount γ makes distant rewards count less.

TD Error

δ_t = r_t + γV(s_{t+1}) − V(s_t)

Explanation: The TD error measures how surprising the new observation is compared to the current estimate. It drives the direction and magnitude of the TD update.

TD(0) Update

V(s_t) ← V(s_t) + αδ_t

Explanation: We adjust the current value estimate by a fraction α of the TD error. Larger α makes learning faster but can be unstable.

Bellman Expectation Equation

Vπ(s) = Eπ[ r_t + γVπ(s_{t+1}) | s_t = s ]

Explanation: This recursive equation characterizes the true value function under a fixed policy. TD(0) aims to solve it by stochastic approximation.

SARSA(0) Update

Q(s_t, a_t) ← Q(s_t, a_t) + α( r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )

Explanation: On-policy control version of TD(0) that updates action values using the next action actually taken under the behavior policy.

Linear TD(0)

V_w(s) = wᵀx(s),   w ← w + αδ_t x(s_t)

Explanation: With linear features x(s), the value estimate is a dot product with weights w. The gradient step moves w in proportion to the TD error times the feature vector.

Robbins–Monro Conditions

∑_{t=0}^{∞} α_t = ∞,   ∑_{t=0}^{∞} α_t² < ∞

Explanation: For convergence of stochastic approximation, learning rates should decrease but not too fast. These conditions are common in proofs of TD(0) convergence.

n-step Return

G_t^{(n)} = ∑_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n})

Explanation: The n-step target generalizes TD(0) (n=1) and Monte Carlo (n=episode length). It trades bias and variance.

MSE and Semi-Gradient

J(w) = E[(Vπ(s) − V_w(s))²],   ∇J(w) ≈ −2 E[δ_t x(s_t)]

Explanation: Linear TD(0) can be interpreted as a semi-gradient method minimizing mean squared value error under a projection. The update uses δ times features.

Complexity Analysis

For tabular TD(0) policy evaluation, each time step processes a single transition (s, r, s′). The update computes one TD error and adjusts one entry V(s), which is O(1) time. Memory stores one scalar per state, requiring O(|S|) space. Over an episode of length L, total time is O(L); across E episodes, O(∑L) = O(T), where T is the total number of steps observed.

For on-policy control with SARSA(0), each step requires selecting an action. If ε-greedy scans all actions to find an argmax, this adds O(|A|) per step, while the TD update remains O(1). Memory is O(|S||A|) for Q-values. Therefore, over T steps, time is O(T|A|) and space is O(|S||A|).

With linear function approximation V_w(s) = wᵀx(s) where x(s) ∈ R^d, computing V and updating w both cost O(d) per step. Space is O(d) for parameters plus any feature computation overhead. If features are sparse with k nonzeros, the practical cost becomes O(k).

TD(0) converges (tabular, on-policy) under suitable α schedules, but convergence speed depends on the MDP’s mixing time and the discount γ. High γ (close to 1) slows the effective contraction, increasing sample complexity. In function approximation, TD(0) converges to the projected fixed point under on-policy sampling and compatible assumptions; off-policy with bootstrapping may diverge (the deadly triad), which is a stability concern rather than a complexity one. Overall, TD methods are computationally lightweight per step, enabling real-time learning.

Code Examples

Tabular TD(0) Policy Evaluation on a 1D Random Walk
#include <bits/stdc++.h>
using namespace std;

// 1D Random Walk: states 0..6, where 0 and 6 are terminal.
// Start at state 3 each episode. Policy: move left/right with 0.5 each.
// Reward: +1 only when reaching state 6 (right terminal), else 0.
// Goal: Estimate V(s) for non-terminal states using TD(0).

struct RandomWalk {
    static constexpr int N = 7; // states 0..6
    int start = 3;
    int s;
    mt19937 rng{12345};
    uniform_real_distribution<double> uni{0.0, 1.0};

    void reset() { s = start; }

    // Step under the fixed policy: left/right with 0.5 each.
    // Returns (next_state, reward, done).
    tuple<int,double,bool> step() {
        bool go_right = (uni(rng) < 0.5);
        int s_next = s + (go_right ? 1 : -1);
        double r = 0.0;
        bool done = false;
        if (s_next <= 0)   { s_next = 0;   done = true; r = 0.0; }
        if (s_next >= N-1) { s_next = N-1; done = true; r = 1.0; }
        s = s_next;
        return {s_next, r, done};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalk env;
    vector<double> V(RandomWalk::N, 0.5);   // init non-terminals to the midpoint 0.5
    V[0] = 0.0; V[RandomWalk::N-1] = 0.0;   // terminal values fixed to 0

    double alpha = 0.1;    // learning rate
    double gamma = 1.0;    // undiscounted episodic task
    int episodes = 5000;   // number of training episodes

    for (int ep = 0; ep < episodes; ++ep) {
        env.reset();
        bool done = false;
        while (!done) {
            int s = env.s;  // current state
            auto [s_next, r, terminal] = env.step();

            double target = r + (terminal ? 0.0 : gamma * V[s_next]);
            double delta = target - V[s];   // TD error
            V[s] += alpha * delta;          // TD(0) update

            done = terminal;
        }
    }

    cout << fixed << setprecision(3);
    cout << "Estimated V(s):\n";
    for (int s = 0; s < RandomWalk::N; ++s) {
        cout << "s=" << s << " -> V=" << V[s] << "\n";
    }
    return 0;
}

This program evaluates a fixed symmetric policy on a 1D random walk using TD(0). After each transition, it computes the TD target r + γV(s') (with V=0 at terminals) and updates V(s) by a fraction α of the TD error. Because updates happen every step, learning is fast and online.

Time: O(E × L) overall for E episodes of average length L; O(1) per step. Space: O(|S|) for the value table.
On-Policy Control with SARSA(0) on a 4x4 Gridworld
#include <bits/stdc++.h>
using namespace std;

// Simple 4x4 Gridworld, start at (0,0), goal at (3,3).
// Deterministic moves: up=0, right=1, down=2, left=3.
// Reward: -1 per step until reaching goal (reward 0 at goal). Episodic.
// Learn Q(s,a) using SARSA(0) with an epsilon-greedy policy.

struct Gridworld {
    int H = 4, W = 4;
    pair<int,int> start{0,0};
    pair<int,int> goal{3,3};
    pair<int,int> s;

    void reset() { s = start; }

    bool is_terminal(pair<int,int> p) const { return p == goal; }

    // Apply action; return (next_state_index, reward, done).
    tuple<int,double,bool> step(int a) {
        int r = s.first, c = s.second;
        if (a == 0) r = max(0, r-1);
        if (a == 1) c = min(W-1, c+1);
        if (a == 2) r = min(H-1, r+1);
        if (a == 3) c = max(0, c-1);
        s = {r,c};
        bool done = is_terminal(s);
        double rew = done ? 0.0 : -1.0; // encourage the shortest path
        int idx = r*W + c;
        return {idx, rew, done};
    }

    int state_index() const { return s.first*W + s.second; }
    int n_states() const { return H*W; }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Gridworld env;
    int S = env.n_states();
    int A = 4;

    vector<vector<double>> Q(S, vector<double>(A, 0.0));

    double alpha = 0.1;
    double gamma = 0.99;
    double eps = 0.1;      // epsilon-greedy exploration
    int episodes = 5000;
    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0,1.0);

    auto greedy_action = [&](int s)->int{
        // Break ties randomly among the best actions.
        vector<int> best; double bestv = -1e18;
        for (int a = 0; a < A; ++a) {
            if (Q[s][a] > bestv + 1e-12) { bestv = Q[s][a]; best = {a}; }
            else if (abs(Q[s][a]-bestv) <= 1e-12) best.push_back(a);
        }
        uniform_int_distribution<int> pick(0, (int)best.size()-1);
        return best[pick(rng)];
    };

    auto eps_greedy = [&](int s)->int{
        if (uni(rng) < eps) { uniform_int_distribution<int> pick(0, A-1); return pick(rng); }
        return greedy_action(s);
    };

    for (int ep = 0; ep < episodes; ++ep){
        env.reset();
        int s = env.state_index();
        int a = eps_greedy(s);
        bool done = false;
        while (!done){
            auto [s_next, r, terminal] = env.step(a);
            int a_next = terminal ? 0 : eps_greedy(s_next);

            double td_target = r + (terminal ? 0.0 : gamma * Q[s_next][a_next]);
            double td_error = td_target - Q[s][a];
            Q[s][a] += alpha * td_error;  // SARSA(0) update

            s = s_next; a = a_next; done = terminal;
        }
        // Optionally decay epsilon over time:
        // eps = max(0.01, eps * 0.999);
    }

    // Derive the greedy policy and print the best action index per cell.
    cout << "Greedy action (0=U,1=R,2=D,3=L) per cell:\n";
    for (int r = 0; r < env.H; ++r){
        for (int c = 0; c < env.W; ++c){
            int s = r*env.W + c;
            int a = 0; double best = -1e18;
            for (int i = 0; i < A; ++i) if (Q[s][i] > best) { best = Q[s][i]; a = i; }
            cout << a << ' ';
        }
        cout << '\n';
    }
    return 0;
}

This example learns an action-value function with SARSA(0), an on-policy TD(0) control method. The ε-greedy policy selects actions, and updates use the next action actually taken. With a per-step penalty and zero terminal reward, the learned policy approximates the shortest path to the goal.

Time: O(T × |A|) over T steps (|A| for ε-greedy selection). Space: O(|S||A|) for the Q-table.
Linear TD(0) Policy Evaluation with Polynomial Features
#include <bits/stdc++.h>
using namespace std;

// Evaluate a fixed policy on the 1D random walk using linear function approximation.
// Feature vector x(s) = [1, s_norm, s_norm^2] for non-terminals; x(terminal) = 0.

struct RandomWalkFA {
    static constexpr int N = 7;  // states 0..6, terminals at 0 and 6
    int start = 3;
    int s;
    mt19937 rng{7};
    uniform_real_distribution<double> uni{0.0, 1.0};

    void reset(){ s = start; }

    tuple<int,double,bool> step(){
        bool right = (uni(rng) < 0.5);
        int s_next = s + (right ? 1 : -1);
        double r = 0.0; bool done = false;
        if (s_next <= 0)   { s_next = 0;   done = true; r = 0.0; }
        if (s_next >= N-1) { s_next = N-1; done = true; r = 1.0; }
        s = s_next;
        return {s_next, r, done};
    }
};

static inline vector<double> features(int s){
    // Map the state index to normalized [0,1]; terminals get zero features.
    if (s == 0 || s == RandomWalkFA::N-1) return {0.0, 0.0, 0.0};
    double x = (double)s / (RandomWalkFA::N - 1);  // in (0,1)
    return {1.0, x, x*x};
}

static inline double dot(const vector<double>& a, const vector<double>& b){
    double v = 0;
    for (size_t i = 0; i < a.size(); ++i) v += a[i]*b[i];
    return v;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RandomWalkFA env;

    // Weights for features [1, x, x^2].
    vector<double> w = {0.0, 0.0, 0.0};

    double alpha = 0.05;  // smaller step due to function approximation
    double gamma = 1.0;   // episodic, undiscounted
    int episodes = 10000;

    for (int ep = 0; ep < episodes; ++ep){
        env.reset();
        bool done = false;
        while (!done){
            int s = env.s;
            auto x_s = features(s);
            double v_s = dot(w, x_s);

            auto [s_next, r, terminal] = env.step();
            auto x_sn = features(s_next);
            double v_sn = dot(w, x_sn);  // zero if terminal, by construction

            double delta = r + (terminal ? 0.0 : gamma * v_sn) - v_s;
            // Semi-gradient TD(0): w <- w + alpha * delta * x(s)
            for (size_t i = 0; i < w.size(); ++i) w[i] += alpha * delta * x_s[i];

            done = terminal;
        }
    }

    cout.setf(std::ios::fixed); cout << setprecision(4);
    for (int s = 0; s < RandomWalkFA::N; ++s){
        double v = dot(w, features(s));
        cout << "s=" << s << " V_hat=" << v << '\n';
    }
    return 0;
}

This program replaces the tabular value function with a linear approximator using polynomial features of the state index. TD(0) updates the weight vector w via the semi-gradient rule w ← w + α δ x(s). Terminal features are zero so that V(terminal)=0 by construction.

Time: O(T × d) over T steps, where d is the number of features (here d = 3). Space: O(d) for weights plus transient O(d) for features.
Tags: temporal difference learning, TD(0), SARSA, reinforcement learning, policy evaluation, Q-learning contrast, bootstrapping, Bellman equation, function approximation, epsilon-greedy, stochastic approximation, value function, TD error, projected Bellman error, random walk