Automatic Differentiation
Key Points
- Automatic differentiation (AD) computes exact derivatives by systematically applying the chain rule to your program, not by symbolic algebra or numerical differences.
- Forward mode uses dual numbers to propagate derivatives alongside values and is efficient for functions with few inputs and many outputs (it computes Jacobian–vector products).
- Reverse mode builds a computation graph and backpropagates adjoints (sensitivities), making it ideal for scalar outputs with many inputs (it computes vector–Jacobian products).
- Dual numbers treat a value as x + \dot{x}\varepsilon with \varepsilon^{2} = 0, so derivatives ride along through arithmetic automatically.
- Reverse mode stores a tape of operations; a single backward pass gives gradients of a scalar output with respect to all inputs.
- For f: \mathbb{R}^{n} \to \mathbb{R}^{m}, forward mode costs about one evaluation per input direction (O(n) passes for the full Jacobian), while reverse mode costs about one evaluation per output (O(m) passes).
- AD is robust and accurate to machine precision, unlike finite differences, but requires careful handling of control flow and side effects.
- In C++, operator overloading enables concise AD libraries in both forward and reverse modes with good performance.
Prerequisites
- Calculus: derivatives and chain rule — AD systematically applies the chain rule to compound functions and relies on knowing basic derivatives of elementary functions.
- Linear algebra: vectors, matrices, Jacobian — Understanding JVPs (Jv) and VJPs (w^T J) requires familiarity with matrix–vector products and Jacobians.
- C++ operator overloading and templates — AD in C++ is typically implemented by overloading arithmetic and math functions for custom number types.
- Graphs and topological order — Reverse mode uses a computation graph and a reverse topological traversal to propagate adjoints correctly.
- Floating-point arithmetic — AD computes derivatives to machine precision; understanding floating-point limits helps diagnose numerical issues.
Detailed Explanation
01 Overview
Automatic Differentiation (AD) is a set of techniques to compute exact derivatives of functions implemented as programs. Unlike symbolic differentiation, it does not manipulate algebraic expressions; and unlike finite differences, it does not approximate derivatives with fragile subtractions. Instead, AD decomposes a program into a sequence of elementary operations with known derivatives and systematically applies the chain rule. The resulting derivatives are accurate to machine precision and have predictable computational cost.
There are two main modes of AD. Forward mode propagates derivatives from inputs to outputs during a single pass. It is best understood through dual numbers, which pack a value and its derivative into one object that travels through computations. Reverse mode (also called adjoint mode or backpropagation) runs the program forward to record a computation graph and then runs a backward pass that distributes sensitivities from the outputs to the inputs. Reverse mode is the engine behind most modern machine learning training procedures because it can compute gradients of scalar losses with respect to millions of parameters in time comparable to one program run.
In practice, C++ supports AD elegantly via operator overloading: you can define number-like types (dual numbers or nodes in a tape) that carry derivative information. Each arithmetic operation and math function updates both value and derivative consistently. The choice between forward and reverse mode depends on the shape of your function (number of inputs n versus outputs m) and which product with the Jacobian you need (JVP or VJP).
02 Intuition & Analogies
Imagine a factory assembly line that turns raw materials (inputs) into a finished product (output). Each station performs a simple task: cutting, drilling, painting, etc. If you want to know how much a tiny change in a specific raw material affects the final product, you could:
- Forward mode (dual numbers): Attach a “sensitivity tag” to a chosen input when it enters the line. At every station, the tag is updated according to how that station’s operation influences the result. By the time the product leaves the line, the tag tells you the effect of that input on the output. If you want sensitivities for multiple inputs, you repeat the process with different tags.
- Reverse mode (adjoints): After producing the item, place a price tag on the final product and walk backwards through the assembly line. At each station, you split that price tag among the inputs according to how each input contributes to the station’s output (local partial derivatives). When you reach the start, every raw material has a tag showing how it influences the final product’s value. Crucially, one backward walk gives you the sensitivities for all inputs at once—as long as there’s only one final product value you care about.
Dual numbers make forward mode mechanical: carry two numbers (value and slope). Operations like multiply and sin update both: for z = xy, value z is xy, and slope z' is x'y + xy'. Reverse mode needs memory (a log, or tape) so that on the way back, each station knows how to distribute the price tag. This is why reverse mode uses more memory but shines when there are many inputs and a single scalar output.
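As a tiny worked instance of that rule, take x = 2 carrying slope 1 (we are differentiating with respect to x) and y = 3 carrying slope 0. Multiplying the dual numbers gives
(2 + 1\varepsilon)(3 + 0\varepsilon) = 6 + (1 \cdot 3 + 2 \cdot 0)\,\varepsilon = 6 + 3\varepsilon,
so z = xy has value 6 and slope \partial z / \partial x = 3, exactly as the product rule predicts.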
03 Formal Definition
A differentiable program computes a function f: \mathbb{R}^{n} \to \mathbb{R}^{m} as a composition of elementary operations whose derivatives are known. Forward mode propagates tangents: given a seed direction \dot{x} \in \mathbb{R}^{n}, it evaluates \dot{y} = J(x)\,\dot{x} alongside y = f(x) in a single pass. Reverse mode propagates adjoints: given a seed \bar{y} \in \mathbb{R}^{m}, a backward pass over the recorded computation graph evaluates \bar{x} = J(x)^{\top}\,\bar{y}. Both are exact applications of the chain rule to the composition of elementary operations.
04 When to Use
- Forward mode (dual numbers):
- Best for functions f: \mathbb{R}^{n} \to \mathbb{R}^{m} with small n and larger m. Each forward sweep gives derivatives with respect to one seed direction. Ideal for computing directional derivatives, tangents in ODE solvers, and sensitivity of a few parameters.
- Useful for Jacobian–vector products J(x)v without building the full Jacobian, which appears in Krylov methods and Newton–Krylov solvers.
- Reverse mode (adjoint/backprop):
- Best for scalar-output functions f: \mathbb{R}^{n} \to \mathbb{R}, where n is large (e.g., neural network training, least-squares objectives). One backward pass gets the full gradient.
- Useful for vector–Jacobian products w^{\top} J(x). When you only need a weighted sum of outputs’ sensitivities, combine outputs as L = w^{\top} f(x) and backprop once.
- Mixed strategies:
- Full Jacobian of f: \mathbb{R}^{n} \to \mathbb{R}^{m} can be computed by n forward sweeps or m reverse sweeps; choose the smaller.
- Higher-order derivatives: nest modes (e.g., forward-over-reverse) to get Hessian–vector products efficiently; a minimal nested-dual sketch for second derivatives appears after this list.
- Practical C++ considerations:
- Choose forward mode for small parameter vectors, geometric libraries, and compile-time expression templates.
- Choose reverse mode for ML models, optimization with many parameters, and any scalar loss over large input dimension.
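Below is a minimal sketch of the nesting idea mentioned under mixed strategies. It uses forward-over-forward rather than forward-over-reverse, implements only + and * for brevity, and the names (DualT, D2) are illustrative rather than taken from any library; a Dual<Dual<double>>-style nesting carries the value, first derivative, and second derivative of a univariate function in one pass.

#include <iostream>

// A minimal templated dual number. Nesting DualT<DualT<double>> propagates
// second derivatives; only + and * are implemented here for brevity.
template <typename T>
struct DualT {
    T val; // value part
    T dot; // derivative part
    DualT(T v = T(0), T d = T(0)) : val(v), dot(d) {}
};

template <typename T>
DualT<T> operator+(const DualT<T>& a, const DualT<T>& b) {
    return DualT<T>(a.val + b.val, a.dot + b.dot);
}
template <typename T>
DualT<T> operator*(const DualT<T>& a, const DualT<T>& b) {
    // Product rule applied recursively at each nesting level
    return DualT<T>(a.val * b.val, a.dot * b.val + a.val * b.dot);
}

int main() {
    using D2 = DualT<DualT<double>>;
    // Seed x = 2 with unit tangents at both nesting levels.
    D2 x(DualT<double>(2.0, 1.0), DualT<double>(1.0, 0.0));
    D2 f = x * x * x + x * x; // f(x) = x^3 + x^2
    std::cout << "f(2)   = " << f.val.val << "\n"; // 12
    std::cout << "f'(2)  = " << f.val.dot << "\n"; // 16 = 3*4 + 2*2
    std::cout << "f''(2) = " << f.dot.dot << "\n"; // 14 = 6*2 + 2
    return 0;
}

The same nesting pattern, applied with a forward-mode type over reverse-mode nodes, is how Hessian–vector products are typically obtained.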
⚠️ Common Mistakes
- Confusing AD with finite differences: Finite differences use f(x+h)-f(x) and suffer from truncation and cancellation errors; AD applies exact chain rule on the executed program. Use AD for high accuracy and stability.
- Using forward mode for huge input dimension: Forward mode computes one column of the Jacobian per sweep. For f: \mathbb{R}^{n} \to \mathbb{R}, this means O(n) sweeps. Prefer reverse mode when n is large and the output is scalar.
- Forgetting to seed derivatives in forward mode: If you don’t set \dot{x}_{i}, you’ll get zero derivatives. Seed with basis vectors for gradients or with a custom direction for a JVP.
- Ignoring control flow and side effects: Reverse mode needs a faithful tape. Mutating variables used in the forward pass, freeing memory early, or mixing threads without care can corrupt the graph and produce wrong gradients.
- Not resetting adjoints before a new backward pass: In reverse mode, stale gradients accumulate if you reuse nodes. Always zero adjoints before backprop or rebuild the graph.
- Differentiating through non-differentiable or discontinuous operations: Functions like max at a kink or integer/bitwise ops have undefined or zero derivatives. Use smooth approximations or subgradient-aware logic.
- Overloading without constexpr/math coverage: In C++, forgetting to overload sin, cos, exp, log, etc., for your AD type leads to implicit conversions to double and lost derivatives.
- Memory blow-up in reverse mode: Storing every intermediate creates large tapes. Use checkpointing or recomputation to trade memory for time.
Key Formulas
Dual Number: x + \dot{x}\,\varepsilon, where \varepsilon^{2} = 0
Explanation: A dual number augments a real value with an infinitesimal part that carries the derivative. Arithmetic with dual numbers automatically applies the chain rule because \varepsilon^{2} = 0 eliminates higher-order terms.
Lifted Elementary Operation: f(x + \dot{x}\,\varepsilon) = f(x) + f'(x)\,\dot{x}\,\varepsilon
Explanation: Every differentiable elementary function lifts to dual numbers by updating the derivative via the usual single-variable chain rule. This is the core rule enabling forward-mode AD.
Jacobian Matrix: J_{ij}(x) = \partial f_{i}(x) / \partial x_{j}
Explanation: The Jacobian collects all first derivatives of a vector-valued function f: \mathbb{R}^{n} \to \mathbb{R}^{m}. Its shape determines which AD mode is more efficient.
Products with the Jacobian: JVP: J(x)v; VJP: w^{\top} J(x)
Explanation: Forward mode naturally computes Jv by seeding tangents with v; reverse mode naturally computes w^{\top} J by backpropagating from a linear combination of outputs with weights w.
Reverse Accumulation Rule: \bar{x}_{i} = \sum_{j} \bar{z}_{j}\,(\partial z_{j} / \partial x_{i})
Explanation: In reverse mode, the adjoint of each input accumulates contributions from adjoints of its children scaled by local partial derivatives. This is the discrete chain rule on the computation graph.
Product Rule: (uv)' = u'v + uv'
Explanation: Dual numbers and reverse mode both rely on familiar calculus rules. The product rule is applied at each multiplication node.
Elementary Derivatives: (\sin x)' = \cos x, \ (\cos x)' = -\sin x, \ (e^{x})' = e^{x}, \ (\ln x)' = 1/x
Explanation: These standard derivatives define the local partials used by AD for trigonometric and exponential/logarithmic operations.
Mode Selection Heuristic: \mathrm{cost}(J) \approx \min(n, m) \cdot \mathrm{cost}(f)
Explanation: For f: \mathbb{R}^{n} \to \mathbb{R}^{m} with base evaluation cost \mathrm{cost}(f), computing the full Jacobian requires n forward sweeps or m reverse sweeps. Choose the smaller of n and m.
Directional Derivative: D_{v} f(x) = J(x)\,v
Explanation: The directional derivative equals a JVP with direction v. Forward mode computes this in one sweep by seeding tangents with v.
Adjoint Recurrence: \bar{v}_{i} = \sum_{j \in \mathrm{children}(i)} \bar{v}_{j}\,(\partial v_{j} / \partial v_{i})
Explanation: This expresses each node’s adjoint as the sum over its outgoing edges of the child adjoints times local partials. It is the basis of reverse-mode backpropagation.
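To make the recurrence concrete, consider the same function used in the reverse-mode code example below: let g = xy + \sin z and f = g^{2}. Seeding \bar{f} = 1, the backward pass gives
\bar{g} = \bar{f} \cdot 2g = 2g, \quad \bar{x} = \bar{g} \cdot y = 2gy, \quad \bar{y} = \bar{g} \cdot x = 2gx, \quad \bar{z} = \bar{g} \cdot \cos z = 2g\cos z,
which matches the analytic gradients printed by that program.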
Complexity Analysis
For f: \mathbb{R}^{n} \to \mathbb{R}^{m}, one forward sweep costs a small constant multiple of a plain evaluation of f and yields one Jacobian column, so the full Jacobian takes roughly n sweeps; one reverse sweep yields one Jacobian row (the full gradient when m = 1) at a similar per-sweep cost, so the full Jacobian takes roughly m sweeps. Reverse mode additionally stores a tape of intermediate values, so its memory grows with the number of recorded operations unless checkpointing trades recomputation for memory.
Code Examples
#include <iostream>
#include <cmath>
#include <vector>
#include <iomanip>

// Simple dual number type: value + epsilon * derivative, with epsilon^2 = 0
struct Dual {
    double val; // primal value
    double d;   // derivative (tangent) w.r.t. chosen seed direction

    Dual(double v = 0.0, double der = 0.0) : val(v), d(der) {}
};

// Arithmetic operators for Dual
inline Dual operator+(const Dual& a, const Dual& b) {
    return Dual(a.val + b.val, a.d + b.d);
}
inline Dual operator-(const Dual& a, const Dual& b) {
    return Dual(a.val - b.val, a.d - b.d);
}
inline Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: (ab)' = a'b + ab'
    return Dual(a.val * b.val, a.d * b.val + a.val * b.d);
}
inline Dual operator/(const Dual& a, const Dual& b) {
    // Quotient rule: (a/b)' = (a'b - ab') / b^2
    double denom = b.val * b.val;
    return Dual(a.val / b.val, (a.d * b.val - a.val * b.d) / denom);
}
// Overloads with double on right/left
inline Dual operator+(const Dual& a, double b) { return Dual(a.val + b, a.d); }
inline Dual operator+(double a, const Dual& b) { return Dual(a + b.val, b.d); }
inline Dual operator-(const Dual& a, double b) { return Dual(a.val - b, a.d); }
inline Dual operator-(double a, const Dual& b) { return Dual(a - b.val, -b.d); }
inline Dual operator*(const Dual& a, double b) { return Dual(a.val * b, a.d * b); }
inline Dual operator*(double a, const Dual& b) { return Dual(a * b.val, a * b.d); }
inline Dual operator/(const Dual& a, double b) { return Dual(a.val / b, a.d / b); }
inline Dual operator/(double a, const Dual& b) {
    // (a / b)' = -a * b' / b^2
    double denom = b.val * b.val;
    return Dual(a / b.val, -a * b.d / denom);
}

// Elementary functions for Dual
inline Dual sin(const Dual& x) { return Dual(std::sin(x.val), std::cos(x.val) * x.d); }
inline Dual cos(const Dual& x) { return Dual(std::cos(x.val), -std::sin(x.val) * x.d); }
inline Dual exp(const Dual& x) { double e = std::exp(x.val); return Dual(e, e * x.d); }
inline Dual log(const Dual& x) { return Dual(std::log(x.val), x.d / x.val); }
inline Dual pow(const Dual& x, double p) { // x^p
    double pv = std::pow(x.val, p);
    return Dual(pv, p * std::pow(x.val, p - 1.0) * x.d);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // 1) Single-variable derivative: f(x) = x^3 + sin(x) at x=2
    auto f1 = [](const Dual& x) { return pow(x, 3.0) + sin(x); };
    Dual x(2.0, 1.0); // seed derivative dx=1 to get df/dx
    Dual fx = f1(x);
    std::cout << "f(x)=x^3+sin(x) at x=2 => value=" << fx.val
              << ", derivative=" << fx.d << "\n";

    // 2) Multivariate gradient via multiple seeds: f(x,y) = x*y + sin(x)
    auto f2 = [](const Dual& X, const Dual& Y) { return X * Y + sin(X); };
    double x0 = 2.0, y0 = 3.0;
    // Partial w.r.t x: seed (dx, dy) = (1, 0)
    Dual Xx(x0, 1.0), Yx(y0, 0.0);
    Dual fxy_x = f2(Xx, Yx);
    // Partial w.r.t y: seed (dx, dy) = (0, 1)
    Dual Xy(x0, 0.0), Yy(y0, 1.0);
    Dual fxy_y = f2(Xy, Yy);
    std::cout << "f(x,y)=x*y+sin(x) at (2,3) => value=" << fxy_x.val
              << ", df/dx=" << fxy_x.d << ", df/dy=" << fxy_y.d << "\n";

    // 3) JVP (Jacobian–vector product) for a vector-valued function
    //    F(x,y) = [ x*y, sin(x) + y^2 ] at (x,y)=(2,3) in direction v=(1,2)
    auto F = [](const Dual& X, const Dual& Y) {
        std::vector<Dual> out;
        out.reserve(2);
        out.push_back(X * Y);               // F1 = x*y
        out.push_back(sin(X) + pow(Y, 2));  // F2 = sin(x) + y^2
        return out;
    };
    Dual Xv(x0, 1.0), Yv(y0, 2.0); // seed with v=(1,2)
    auto Fv = F(Xv, Yv);
    std::cout << "F(x,y) at (2,3) => values=[" << Fv[0].val << ", " << Fv[1].val
              << "], Jv=[" << Fv[0].d << ", " << Fv[1].d << "]\n";

    return 0;
}
This program defines a Dual type that carries both a value and a derivative (tangent). Overloaded arithmetic and math functions automatically propagate derivatives via the chain rule. We demonstrate: (1) a single-variable derivative by seeding dx=1; (2) a two-variable gradient by performing two seeds—one for each coordinate; and (3) a Jacobian–vector product (directional derivative) for a vector-valued function by seeding the tangent with the desired direction vector.
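As a quick accuracy check (a sketch only, assuming the Dual type and its sin/exp overloads from the listing above are in the same translation unit; compare_with_finite_differences is an illustrative name), one can compare the AD derivative of f(x) = exp(sin(x)) against a central finite difference and the closed-form derivative cos(x)·exp(sin(x)):

// Assumes the Dual type and its overloads from the listing above are in scope.
void compare_with_finite_differences() {
    double x0 = 1.3;
    Dual x(x0, 1.0);               // seed dx = 1
    Dual fx = exp(sin(x));         // AD: value and derivative in one pass
    double h = 1e-5;
    double fd = (std::exp(std::sin(x0 + h)) - std::exp(std::sin(x0 - h))) / (2.0 * h);
    double exact = std::cos(x0) * std::exp(std::sin(x0));
    std::cout << "AD:    " << fx.d  << "\n"
              << "FD:    " << fd    << "\n"
              << "exact: " << exact << "\n";
}

The AD result agrees with the closed form to machine precision, while the central difference carries truncation error of order h^2 plus cancellation error from the subtraction.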
#include <iostream>
#include <vector>
#include <cmath>
#include <memory>
#include <functional>
#include <unordered_set>
#include <iomanip>

namespace ad {

struct Var;
using VarPtr = std::shared_ptr<Var>;

struct Edge {
    VarPtr parent;                       // pointer to input node
    std::function<double()> local_grad;  // returns \partial z / \partial parent
};

struct Var {
    double val;                 // primal value
    double grad = 0.0;          // adjoint (dL/dVar)
    std::vector<Edge> parents;  // edges to inputs with local partials
    explicit Var(double v) : val(v) {}
};

inline VarPtr constant(double v) { return std::make_shared<Var>(v); }
inline VarPtr variable(double v) { return std::make_shared<Var>(v); }

// Primitive ops: +, -, *, unary functions
inline VarPtr add(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val + b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return 1.0; }});
    return z;
}
inline VarPtr sub(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val - b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return -1.0; }});
    return z;
}
inline VarPtr mul(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val * b->val);
    z->parents.push_back({a, [=](){ return b->val; }});
    z->parents.push_back({b, [=](){ return a->val; }});
    return z;
}
inline VarPtr divv(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val / b->val);
    z->parents.push_back({a, [=](){ return 1.0 / b->val; }});
    z->parents.push_back({b, [=](){ return -a->val / (b->val * b->val); }});
    return z;
}
inline VarPtr sinv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::sin(x->val));
    z->parents.push_back({x, [=](){ return std::cos(x->val); }});
    return z;
}
inline VarPtr cosv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::cos(x->val));
    z->parents.push_back({x, [=](){ return -std::sin(x->val); }});
    return z;
}
inline VarPtr expv(const VarPtr& x) {
    auto e = std::exp(x->val);
    auto z = std::make_shared<Var>(e);
    z->parents.push_back({x, [=](){ return e; }});
    return z;
}
inline VarPtr logv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::log(x->val));
    z->parents.push_back({x, [=](){ return 1.0 / x->val; }});
    return z;
}
inline VarPtr powv(const VarPtr& x, double p) {
    double pv = std::pow(x->val, p);
    auto z = std::make_shared<Var>(pv);
    z->parents.push_back({x, [=](){ return p * std::pow(x->val, p - 1.0); }});
    return z;
}

// Operator overloads for convenience
inline VarPtr operator+(const VarPtr& a, const VarPtr& b) { return add(a,b); }
inline VarPtr operator-(const VarPtr& a, const VarPtr& b) { return sub(a,b); }
inline VarPtr operator*(const VarPtr& a, const VarPtr& b) { return mul(a,b); }
inline VarPtr operator/(const VarPtr& a, const VarPtr& b) { return divv(a,b); }
inline VarPtr operator+(const VarPtr& a, double b) { return add(a, constant(b)); }
inline VarPtr operator+(double a, const VarPtr& b) { return add(constant(a), b); }
inline VarPtr operator-(const VarPtr& a, double b) { return sub(a, constant(b)); }
inline VarPtr operator-(double a, const VarPtr& b) { return sub(constant(a), b); }
inline VarPtr operator*(const VarPtr& a, double b) { return mul(a, constant(b)); }
inline VarPtr operator*(double a, const VarPtr& b) { return mul(constant(a), b); }
inline VarPtr operator/(const VarPtr& a, double b) { return divv(a, constant(b)); }
inline VarPtr operator/(double a, const VarPtr& b) { return divv(constant(a), b); }

// Build reverse topological order via DFS
void dfs(const VarPtr& v, std::vector<VarPtr>& order, std::unordered_set<Var*>& seen) {
    if (!v) return;
    if (seen.count(v.get())) return;
    seen.insert(v.get());
    for (const auto& e : v->parents) dfs(e.parent, order, seen);
    order.push_back(v); // push after parents: topological order
}

void backward(const VarPtr& out) {
    // 1) Collect nodes in topological order
    std::vector<VarPtr> order; order.reserve(1024);
    std::unordered_set<Var*> seen; seen.reserve(1024);
    dfs(out, order, seen);

    // 2) Initialize adjoints to zero (already zero for fresh graph)
    for (auto& v : order) v->grad = 0.0;

    // 3) Seed output gradient and traverse in reverse
    out->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        VarPtr node = *it;
        for (auto& e : node->parents) {
            e.parent->grad += node->grad * e.local_grad();
        }
    }
}

} // namespace ad

int main() {
    using namespace ad;
    std::cout << std::fixed << std::setprecision(6);

    // Example: f(x,y,z) = (x*y + sin(z))^2 at (2,3,0.5)
    VarPtr x = variable(2.0);
    VarPtr y = variable(3.0);
    VarPtr z = variable(0.5);

    VarPtr f = powv(x * y + sinv(z), 2.0);
    backward(f);

    std::cout << "f(x,y,z) = (x*y + sin(z))^2\n";
    std::cout << "value = " << f->val << "\n";
    std::cout << "df/dx = " << x->grad << ", df/dy = " << y->grad << ", df/dz = " << z->grad << "\n";

    // Check against analytical gradients:
    // g = x*y + sin(z); f = g^2; df/dx = 2*g*y; df/dy = 2*g*x; df/dz = 2*g*cos(z)
    double g = x->val * y->val + std::sin(z->val);
    std::cout << "(analytic) df/dx = " << 2*g*y->val
              << ", df/dy = " << 2*g*x->val
              << ", df/dz = " << 2*g*std::cos(z->val) << "\n";

    return 0;
}
This minimal reverse-mode engine builds a computation graph where each node stores its value and a list of parents with local partial derivatives. A depth-first search produces a topological order. The backward pass seeds the output adjoint with 1 and propagates gradients to parents using the chain rule. The example computes the gradient of a scalar function of three variables, illustrating that one backward pass yields all partial derivatives simultaneously.
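Building on that engine, the weighted-output trick from the When to Use section can be sketched as follows (assuming the ad namespace from the listing above is in the same translation unit; vjp_demo is an illustrative name). It computes the vector–Jacobian product w^{\top} J for F(x,y) = [xy, sin(x) + y^{2}] by backpropagating once through the scalar L = w_{1} F_{1} + w_{2} F_{2}:

// Assumes the ad namespace from the listing above is available.
void vjp_demo() {
    using namespace ad;
    VarPtr x = variable(2.0);
    VarPtr y = variable(3.0);
    VarPtr F1 = x * y;                   // first output
    VarPtr F2 = sinv(x) + powv(y, 2.0);  // second output
    double w1 = 1.0, w2 = 2.0;           // VJP weights
    VarPtr L = w1 * F1 + w2 * F2;        // scalar combination of outputs
    backward(L);                         // one reverse sweep gives the whole VJP
    std::cout << "w^T J = [" << x->grad << ", " << y->grad << "]\n";
    // Analytically: [w1*y + w2*cos(x), w1*x + w2*2*y] = [3 + 2*cos(2), 14]
}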