
Automatic Differentiation

Key Points

  • Automatic differentiation (AD) computes exact derivatives by systematically applying the chain rule to your program, not by symbolic algebra or numerical differences.
  • Forward mode uses dual numbers to propagate derivatives alongside values and is efficient for functions with few inputs and many outputs (it computes Jacobian–vector products).
  • Reverse mode builds a computation graph and backpropagates adjoints (sensitivities), making it ideal for scalar outputs with many inputs (it computes vector–Jacobian products).
  • Dual numbers treat a value as x + εẋ with ε² = 0, so derivatives ride along through arithmetic automatically.
  • Reverse mode stores a tape of operations; a single backward pass gives gradients of a scalar output with respect to all inputs.
  • For f: ℝⁿ → ℝᵐ, a full Jacobian costs about one program evaluation per input direction in forward mode (O(n) passes) and about one evaluation per output in reverse mode (O(m) passes).
  • AD is robust and accurate to machine precision, unlike finite differences, but requires careful handling of control flow and side effects.
  • In C++, operator overloading enables concise AD libraries in both forward and reverse modes with good performance.

Prerequisites

  • Calculus: derivatives and chain rule — AD systematically applies the chain rule to compound functions and relies on knowing basic derivatives of elementary functions.
  • Linear algebra: vectors, matrices, Jacobian — understanding JVPs (Jv) and VJPs (wᵀJ) requires familiarity with matrix–vector products and Jacobians.
  • C++ operator overloading and templates — AD in C++ is typically implemented by overloading arithmetic and math functions for custom number types.
  • Graphs and topological order — reverse mode uses a computation graph and a reverse topological traversal to propagate adjoints correctly.
  • Floating-point arithmetic — AD computes derivatives to machine precision; understanding floating-point limits helps diagnose numerical issues.

Detailed Explanation


01 Overview

Automatic Differentiation (AD) is a set of techniques to compute exact derivatives of functions implemented as programs. Unlike symbolic differentiation, it does not manipulate algebraic expressions; and unlike finite differences, it does not approximate derivatives with fragile subtractions. Instead, AD decomposes a program into a sequence of elementary operations with known derivatives and systematically applies the chain rule. The resulting derivatives are accurate to machine precision and have predictable computational cost.

There are two main modes of AD. Forward mode propagates derivatives from inputs to outputs during a single pass. It is best understood through dual numbers, which pack a value and its derivative into one object that travels through computations. Reverse mode (also called adjoint mode or backpropagation) runs the program forward to record a computation graph and then runs a backward pass that distributes sensitivities from the outputs to the inputs. Reverse mode is the engine behind most modern machine learning training procedures because it can compute gradients of scalar losses with respect to millions of parameters in time comparable to one program run.

In practice, C++ supports AD elegantly via operator overloading: you can define number-like types (dual numbers or nodes in a tape) that carry derivative information. Each arithmetic operation and math function updates both value and derivative consistently. The choice between forward and reverse mode depends on the shape of your function (number of inputs n versus outputs m) and which product with the Jacobian you need (JVP or VJP).

02 Intuition & Analogies

Imagine a factory assembly line that turns raw materials (inputs) into a finished product (output). Each station performs a simple task: cutting, drilling, painting, etc. If you want to know how much a tiny change in a specific raw material affects the final product, you could:

  • Forward mode (dual numbers): Attach a “sensitivity tag” to a chosen input when it enters the line. At every station, the tag is updated according to how that station’s operation influences the result. By the time the product leaves the line, the tag tells you the effect of that input on the output. If you want sensitivities for multiple inputs, you repeat the process with different tags.

  • Reverse mode (adjoints): After producing the item, place a price tag on the final product and walk backwards through the assembly line. At each station, you split that price tag among the inputs according to how each input contributes to the station’s output (local partial derivatives). When you reach the start, every raw material has a tag showing how it influences the final product’s value. Crucially, one backward walk gives you the sensitivities for all inputs at once—as long as there’s only one final product value you care about.

Dual numbers make forward mode mechanical: carry two numbers (value and slope). Operations like multiply and sin update both: for z = xy, value z is xy, and slope z' is x'y + xy'. Reverse mode needs memory (a log, or tape) so that on the way back, each station knows how to distribute the price tag. This is why reverse mode uses more memory but shines when there are many inputs and a single scalar output.

03 Formal Definition

Let f: ℝⁿ → ℝᵐ be a function implemented as a composition of elementary operations with known derivatives. Its Jacobian at x is J(x) ∈ ℝ^(m×n). AD computes the linear maps J(x)v (Jacobian–vector product, JVP) and wᵀJ(x) (vector–Jacobian product, VJP) efficiently without forming J explicitly.

Forward mode augments each scalar xᵢ with a dual number x̂ᵢ = xᵢ + εẋᵢ, where ε is nilpotent (ε² = 0). For each elementary operation g there is a lift ĝ such that ĝ(x + εẋ) = g(x) + εg′(x)ẋ. Seeding ẋ with a direction v computes the JVP: f(x + εv) = f(x) + ε(J(x)v).

Reverse mode executes the program to compute all intermediate values and then accumulates adjoints (sensitivities) ū = ∂L/∂u for each intermediate u, where L is a scalar output of interest. For each operation z = g(u₁, …, u_k), the local partials ∂z/∂uᵢ are known, and adjoints satisfy ūᵢ += z̄ · ∂z/∂uᵢ. Starting with L̄ = 1 and all other adjoints zero, a reverse topological traversal yields all input gradients in one pass, computing the VJP wᵀJ(x) with w typically a basis vector when L is a single scalar output.
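To make the forward-mode definition concrete, here is a small worked example (mine, not from the article): evaluate f(x, y) = xy + sin x at (x, y) = (2, 3) on dual numbers with seed v = (1, 0):

```latex
f(2 + \epsilon,\; 3) = (2 + \epsilon)\cdot 3 + \sin(2 + \epsilon)
= 6 + 3\epsilon + \sin 2 + \epsilon \cos 2
= (6 + \sin 2) + \epsilon\,(3 + \cos 2).
```

The ε-coefficient 3 + cos 2 is exactly ∂f/∂x = y + cos x at (2, 3), i.e., the JVP J(x)v with v = (1, 0).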

04 When to Use

  • Forward mode (dual numbers):

    • Best for functions f: ℝⁿ → ℝᵐ with small n and larger m. Each forward sweep gives derivatives with respect to one seed direction. Ideal for computing directional derivatives, tangents in ODE solvers, and sensitivity of a few parameters.
    • Useful for Jacobian–vector products J(x)v without building the full Jacobian, which appears in Krylov methods and Newton–Krylov solvers.
  • Reverse mode (adjoint/backprop):

    • Best for scalar-output functions f: ℝⁿ → ℝ, where n is large (e.g., neural network training, least-squares objectives). One backward pass gets the full gradient.
    • Useful for vector–Jacobian products wᵀJ(x). When you only need a weighted sum of outputs’ sensitivities, combine outputs as L = wᵀf(x) and backprop once.
  • Mixed strategies:

    • The full Jacobian of f: ℝⁿ → ℝᵐ can be computed by n forward sweeps or m reverse sweeps; choose the smaller.
    • Higher-order derivatives: nest modes (e.g., forward-over-reverse) to get Hessian–vector products efficiently.
  • Practical C++ considerations:

    • Choose forward mode for small parameter vectors, geometric libraries, and compile-time expression templates.
    • Choose reverse mode for ML models, optimization with many parameters, and any scalar loss over large input dimension.

⚠️Common Mistakes

  • Confusing AD with finite differences: Finite differences use f(x+h)-f(x) and suffer from truncation and cancellation errors; AD applies exact chain rule on the executed program. Use AD for high accuracy and stability.
  • Using forward mode for huge input dimension: Forward mode computes one column of the Jacobian per sweep. For f: ℝⁿ → ℝ, this means O(n) sweeps. Prefer reverse mode when n is large and the output is scalar.
  • Forgetting to seed derivatives in forward mode: If you don’t set ẋᵢ, you’ll get zero derivatives. Seed with basis vectors for gradients or with a custom direction for a JVP.
  • Ignoring control flow and side effects: Reverse mode needs a faithful tape. Mutating variables used in the forward pass, freeing memory early, or mixing threads without care can corrupt the graph and produce wrong gradients.
  • Not resetting adjoints before a new backward pass: In reverse mode, stale gradients accumulate if you reuse nodes. Always zero adjoints before backprop or rebuild the graph.
  • Differentiating through non-differentiable or discontinuous operations: Functions like max at a kink or integer/bitwise ops have undefined or zero derivatives. Use smooth approximations or subgradient-aware logic.
  • Incomplete math-function coverage when overloading: In C++, forgetting to overload sin, cos, exp, log, etc., for your AD type leads to calls resolving to the double versions and silently lost derivatives.
  • Memory blow-up in reverse mode: Storing every intermediate creates large tapes. Use checkpointing or recomputation to trade memory for time.

Key Formulas

Dual Number

x̂ = x + εẋ,   ε² = 0

Explanation: A dual number augments a real value with an infinitesimal part that carries the derivative. Arithmetic with dual numbers automatically applies the chain rule because ε² = 0 eliminates higher-order terms.
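Expanding a product of two dual numbers shows why the nilpotency condition does the bookkeeping: the ε² term vanishes, leaving exactly the product rule.

```latex
(x + \epsilon \dot{x})(y + \epsilon \dot{y})
= xy + \epsilon\,(\dot{x} y + x \dot{y}) + \epsilon^{2}\,\dot{x}\dot{y}
= xy + \epsilon\,(\dot{x} y + x \dot{y}).
```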

Lifted Elementary Operation

ĝ(x + εẋ) = g(x) + ε g′(x) ẋ

Explanation: Every differentiable elementary function lifts to dual numbers by updating the derivative via the usual single-variable chain rule. This is the core rule enabling forward-mode AD.

Jacobian Matrix

J(x)ᵢⱼ = ∂fᵢ/∂xⱼ,   i = 1, …, m,   j = 1, …, n

Explanation: The Jacobian collects all first derivatives of a vector-valued function f: ℝⁿ → ℝᵐ. Its shape determines which AD mode is more efficient.

Products with the Jacobian

JVP: J(x)v,   VJP: wᵀJ(x)

Explanation: Forward mode naturally computes Jv by seeding tangents with v; reverse mode naturally computes wᵀJ by backpropagating from a linear combination of outputs with weights w.

Reverse Accumulation Rule

ūᵢ += z̄ · ∂z/∂uᵢ

Explanation: In reverse mode, the adjoint of each input accumulates contributions from adjoints of its children scaled by local partial derivatives. This is the discrete chain rule on the computation graph.
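A two-node worked example (for illustration): with z = xy and L = z², the backward pass seeds z̄ from L and distributes it to x and y via the local partials.

```latex
\bar{z} = \frac{\partial L}{\partial z} = 2z, \qquad
\bar{x} \mathrel{+}= \bar{z}\,\frac{\partial z}{\partial x} = 2z\,y, \qquad
\bar{y} \mathrel{+}= \bar{z}\,\frac{\partial z}{\partial y} = 2z\,x.
```

Substituting z = xy recovers ∂L/∂x = 2xy² and ∂L/∂y = 2x²y, matching direct differentiation.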

Product Rule

d/dx (u·v) = u′v + uv′

Explanation: Dual numbers and reverse mode both rely on familiar calculus rules. The product rule is applied at each multiplication node.

Elementary Derivatives

d/dx sin(x) = cos(x),   d/dx exp(x) = exp(x),   d/dx log(x) = 1/x

Explanation: These standard derivatives define the local partials used by AD for trigonometric and exponential/logarithmic operations.

Mode Selection Heuristic

Cost_forward ≈ n·C_f,   Cost_reverse ≈ m·C_f

Explanation: For f: ℝⁿ → ℝᵐ with base evaluation cost C_f, computing the full Jacobian requires n forward sweeps or m reverse sweeps. Choose the mode matching the smaller of n and m.

Directional Derivative

∇f(x)ᵀv = (d/dt)|ₜ₌₀ f(x + tv)

Explanation: The directional derivative equals a JVP with direction v. Forward mode computes this in one sweep by seeding tangents with v.

Adjoint Recurrence

x̄ᵢ = Σ_{k ∈ children(i)} x̄ₖ · ∂xₖ/∂xᵢ

Explanation: This expresses each node’s adjoint as the sum over its outgoing edges of the child adjoints times local partials. It is the basis of reverse-mode backpropagation.

Complexity Analysis

Let C_f denote the cost (time) of running the original program once and M_f its peak memory footprint. For f: ℝⁿ → ℝᵐ, forward mode computes a Jacobian–vector product (JVP) in time about C_f with negligible extra asymptotic memory (typically a small constant factor). To obtain the full Jacobian explicitly, you need n forward sweeps—one per input basis vector—so time is about n·C_f and memory remains O(M_f). This is efficient when n is small relative to m or when only a few JVPs are required.

Reverse mode computes a vector–Jacobian product (VJP) per backward sweep in time about C_f, after a forward pass that records the computation graph. Thus a single gradient of a scalar output (m = 1) costs on the order of one to a few times C_f (constant factors depend on the operation mix). To obtain the full Jacobian, you need m reverse sweeps—one per output basis vector—so time is about m·C_f. The main overhead is memory: the tape must store intermediate values needed to compute local partials, giving peak space O(M_f + E), where E is proportional to the number of elementary operations (graph edges). In practice, reverse-mode memory can be several multiples of the program’s native memory use.

Higher-order derivatives can be computed by nesting modes. For example, a Hessian–vector product can be obtained in time about 2·C_f by running forward mode over a reverse-mode gradient (or vice versa); building the full Hessian explicitly requires O(n) such products. Control flow (branches, loops) does not change asymptotic costs but affects constants due to tape management and cache behavior. Checkpointing can reduce reverse-mode peak memory at the price of extra recomputation, trading time for space.

Code Examples

Forward-mode AD with Dual Numbers: scalar derivative, multivariate gradient, and a JVP
#include <iostream>
#include <cmath>
#include <vector>
#include <iomanip>

// Simple dual number type: value + epsilon * derivative, with epsilon^2 = 0
struct Dual {
    double val; // primal value
    double d;   // derivative (tangent) w.r.t. chosen seed direction

    Dual(double v = 0.0, double der = 0.0) : val(v), d(der) {}
};

// Arithmetic operators for Dual
inline Dual operator+(const Dual& a, const Dual& b) {
    return Dual(a.val + b.val, a.d + b.d);
}
inline Dual operator-(const Dual& a, const Dual& b) {
    return Dual(a.val - b.val, a.d - b.d);
}
inline Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: (ab)' = a'b + ab'
    return Dual(a.val * b.val, a.d * b.val + a.val * b.d);
}
inline Dual operator/(const Dual& a, const Dual& b) {
    // Quotient rule: (a/b)' = (a'b - ab') / b^2
    double denom = b.val * b.val;
    return Dual(a.val / b.val, (a.d * b.val - a.val * b.d) / denom);
}
// Overloads with double on right/left
inline Dual operator+(const Dual& a, double b) { return Dual(a.val + b, a.d); }
inline Dual operator+(double a, const Dual& b) { return Dual(a + b.val, b.d); }
inline Dual operator-(const Dual& a, double b) { return Dual(a.val - b, a.d); }
inline Dual operator-(double a, const Dual& b) { return Dual(a - b.val, -b.d); }
inline Dual operator*(const Dual& a, double b) { return Dual(a.val * b, a.d * b); }
inline Dual operator*(double a, const Dual& b) { return Dual(a * b.val, a * b.d); }
inline Dual operator/(const Dual& a, double b) { return Dual(a.val / b, a.d / b); }
inline Dual operator/(double a, const Dual& b) {
    // (a / b)' = -a * b' / b^2
    double denom = b.val * b.val;
    return Dual(a / b.val, -a * b.d / denom);
}

// Elementary functions for Dual
inline Dual sin(const Dual& x) { return Dual(std::sin(x.val), std::cos(x.val) * x.d); }
inline Dual cos(const Dual& x) { return Dual(std::cos(x.val), -std::sin(x.val) * x.d); }
inline Dual exp(const Dual& x) { double e = std::exp(x.val); return Dual(e, e * x.d); }
inline Dual log(const Dual& x) { return Dual(std::log(x.val), x.d / x.val); }
inline Dual pow(const Dual& x, double p) { // x^p
    double pv = std::pow(x.val, p);
    return Dual(pv, p * std::pow(x.val, p - 1.0) * x.d);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // 1) Single-variable derivative: f(x) = x^3 + sin(x) at x=2
    auto f1 = [](const Dual& x) { return pow(x, 3.0) + sin(x); };
    Dual x(2.0, 1.0); // seed derivative dx=1 to get df/dx
    Dual fx = f1(x);
    std::cout << "f(x)=x^3+sin(x) at x=2 => value=" << fx.val
              << ", derivative=" << fx.d << "\n";

    // 2) Multivariate gradient via multiple seeds: f(x,y) = x*y + sin(x)
    auto f2 = [](const Dual& X, const Dual& Y) { return X * Y + sin(X); };
    double x0 = 2.0, y0 = 3.0;
    // Partial w.r.t x: seed (dx, dy) = (1, 0)
    Dual Xx(x0, 1.0), Yx(y0, 0.0);
    Dual fxy_x = f2(Xx, Yx);
    // Partial w.r.t y: seed (dx, dy) = (0, 1)
    Dual Xy(x0, 0.0), Yy(y0, 1.0);
    Dual fxy_y = f2(Xy, Yy);
    std::cout << "f(x,y)=x*y+sin(x) at (2,3) => value=" << fxy_x.val
              << ", df/dx=" << fxy_x.d << ", df/dy=" << fxy_y.d << "\n";

    // 3) JVP (Jacobian–vector product) for a vector-valued function
    //    F(x,y) = [ x*y, sin(x) + y^2 ] at (x,y)=(2,3) in direction v=(1,2)
    auto F = [](const Dual& X, const Dual& Y) {
        std::vector<Dual> out;
        out.reserve(2);
        out.push_back(X * Y);                // F1 = x*y
        out.push_back(sin(X) + pow(Y, 2.0)); // F2 = sin(x) + y^2
        return out;
    };
    Dual Xv(x0, 1.0), Yv(y0, 2.0); // seed with v=(1,2)
    auto Fv = F(Xv, Yv);
    std::cout << "F(x,y) at (2,3) => values=[" << Fv[0].val << ", " << Fv[1].val
              << "], Jv=[" << Fv[0].d << ", " << Fv[1].d << "]\n";

    return 0;
}

This program defines a Dual type that carries both a value and a derivative (tangent). Overloaded arithmetic and math functions automatically propagate derivatives via the chain rule. We demonstrate: (1) a single-variable derivative by seeding dx=1; (2) a two-variable gradient by performing two seeds—one for each coordinate; and (3) a Jacobian–vector product (directional derivative) for a vector-valued function by seeding the tangent with the desired direction vector.

Time: O(K) per evaluation, where K is the number of elementary operations executed by the function(s); each operation does O(1) extra derivative work.
Space: O(1) extra over the original computation, since only the value and one tangent are stored per variable.
Reverse-mode AD (adjoint/backprop) via a minimal tape and topological backpropagation
#include <iostream>
#include <vector>
#include <cmath>
#include <memory>
#include <functional>
#include <unordered_set>
#include <iomanip>

namespace ad {

struct Var;
using VarPtr = std::shared_ptr<Var>;

struct Edge {
    VarPtr parent;                      // pointer to input node
    std::function<double()> local_grad; // returns ∂z/∂parent
};

struct Var {
    double val;                // primal value
    double grad = 0.0;         // adjoint (dL/dVar)
    std::vector<Edge> parents; // edges to inputs with local partials
    explicit Var(double v) : val(v) {}
};

inline VarPtr constant(double v) { return std::make_shared<Var>(v); }
inline VarPtr variable(double v) { return std::make_shared<Var>(v); }

// Primitive ops: +, -, *, /, unary functions
inline VarPtr add(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val + b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return 1.0; }});
    return z;
}
inline VarPtr sub(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val - b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return -1.0; }});
    return z;
}
inline VarPtr mul(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val * b->val);
    z->parents.push_back({a, [=](){ return b->val; }});
    z->parents.push_back({b, [=](){ return a->val; }});
    return z;
}
inline VarPtr divv(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val / b->val);
    z->parents.push_back({a, [=](){ return 1.0 / b->val; }});
    z->parents.push_back({b, [=](){ return -a->val / (b->val * b->val); }});
    return z;
}
inline VarPtr sinv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::sin(x->val));
    z->parents.push_back({x, [=](){ return std::cos(x->val); }});
    return z;
}
inline VarPtr cosv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::cos(x->val));
    z->parents.push_back({x, [=](){ return -std::sin(x->val); }});
    return z;
}
inline VarPtr expv(const VarPtr& x) {
    auto e = std::exp(x->val);
    auto z = std::make_shared<Var>(e);
    z->parents.push_back({x, [=](){ return e; }});
    return z;
}
inline VarPtr logv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::log(x->val));
    z->parents.push_back({x, [=](){ return 1.0 / x->val; }});
    return z;
}
inline VarPtr powv(const VarPtr& x, double p) {
    double pv = std::pow(x->val, p);
    auto z = std::make_shared<Var>(pv);
    z->parents.push_back({x, [=](){ return p * std::pow(x->val, p - 1.0); }});
    return z;
}

// Operator overloads for convenience
inline VarPtr operator+(const VarPtr& a, const VarPtr& b) { return add(a, b); }
inline VarPtr operator-(const VarPtr& a, const VarPtr& b) { return sub(a, b); }
inline VarPtr operator*(const VarPtr& a, const VarPtr& b) { return mul(a, b); }
inline VarPtr operator/(const VarPtr& a, const VarPtr& b) { return divv(a, b); }
inline VarPtr operator+(const VarPtr& a, double b) { return add(a, constant(b)); }
inline VarPtr operator+(double a, const VarPtr& b) { return add(constant(a), b); }
inline VarPtr operator-(const VarPtr& a, double b) { return sub(a, constant(b)); }
inline VarPtr operator-(double a, const VarPtr& b) { return sub(constant(a), b); }
inline VarPtr operator*(const VarPtr& a, double b) { return mul(a, constant(b)); }
inline VarPtr operator*(double a, const VarPtr& b) { return mul(constant(a), b); }
inline VarPtr operator/(const VarPtr& a, double b) { return divv(a, constant(b)); }
inline VarPtr operator/(double a, const VarPtr& b) { return divv(constant(a), b); }

// Build reverse topological order via DFS
void dfs(const VarPtr& v, std::vector<VarPtr>& order, std::unordered_set<Var*>& seen) {
    if (!v) return;
    if (seen.count(v.get())) return;
    seen.insert(v.get());
    for (const auto& e : v->parents) dfs(e.parent, order, seen);
    order.push_back(v); // push after parents: topological order
}

void backward(const VarPtr& out) {
    // 1) Collect nodes in topological order
    std::vector<VarPtr> order; order.reserve(1024);
    std::unordered_set<Var*> seen; seen.reserve(1024);
    dfs(out, order, seen);

    // 2) Reset adjoints so repeated backward passes don't accumulate stale gradients
    for (auto& v : order) v->grad = 0.0;

    // 3) Seed output gradient and traverse in reverse
    out->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        VarPtr node = *it;
        for (auto& e : node->parents) {
            e.parent->grad += node->grad * e.local_grad();
        }
    }
}

} // namespace ad

int main() {
    using namespace ad;
    std::cout << std::fixed << std::setprecision(6);

    // Example: f(x,y,z) = (x*y + sin(z))^2 at (2,3,0.5)
    VarPtr x = variable(2.0);
    VarPtr y = variable(3.0);
    VarPtr z = variable(0.5);

    VarPtr f = powv(x * y + sinv(z), 2.0);
    backward(f);

    std::cout << "f(x,y,z) = (x*y + sin(z))^2\n";
    std::cout << "value = " << f->val << "\n";
    std::cout << "df/dx = " << x->grad << ", df/dy = " << y->grad << ", df/dz = " << z->grad << "\n";

    // Check against analytical gradients:
    //   g = x*y + sin(z); f = g^2; df/dx = 2*g*y; df/dy = 2*g*x; df/dz = 2*g*cos(z)
    double g = x->val * y->val + std::sin(z->val);
    std::cout << "(analytic) df/dx = " << 2 * g * y->val
              << ", df/dy = " << 2 * g * x->val
              << ", df/dz = " << 2 * g * std::cos(z->val) << "\n";

    return 0;
}

This minimal reverse-mode engine builds a computation graph where each node stores its value and a list of parents with local partial derivatives. A depth-first search produces a topological order. The backward pass seeds the output adjoint with 1 and propagates gradients to parents using the chain rule. The example computes the gradient of a scalar function of three variables, illustrating that one backward pass yields all partial derivatives simultaneously.

Time: O(N) for one forward evaluation to build values plus O(N) for the backward pass, where N is the number of elementary operations (graph edges).
Space: O(N) to store the tape (nodes, parents, and necessary values) during backpropagation.
Tags: automatic differentiation · dual numbers · forward mode · reverse mode · adjoint · backpropagation · jacobian vector product · vector jacobian product · c++ operator overloading · computation graph · gradient · tape · sensitivity analysis · directional derivative · hessian