Automatic Differentiation
Key Points
- Automatic differentiation (AD) computes exact derivatives by systematically applying the chain rule to your program, not by symbolic algebra or numerical differences.
- Forward mode uses dual numbers to propagate derivatives alongside values and is efficient for functions with few inputs and many outputs (it computes Jacobian–vector products).
- Reverse mode builds a computation graph and backpropagates adjoints (sensitivities), making it ideal for scalar outputs with many inputs (it computes vector–Jacobian products).
- Dual numbers treat a value as x + \dot{x}\varepsilon with \varepsilon^{2} = 0, so derivatives ride along through arithmetic automatically.
- Reverse mode stores a tape of operations; a single backward pass gives gradients of a scalar output with respect to all inputs.
- For f: \mathbb{R}^{n} \to \mathbb{R}^{m}, forward mode costs about one evaluation per input direction (O(n) passes for the full Jacobian), while reverse mode costs about one evaluation per output (O(m) passes).
- AD is robust and accurate to machine precision, unlike finite differences, but requires careful handling of control flow and side effects.
- In C++, operator overloading enables concise AD libraries in both forward and reverse modes with good performance.
Prerequisites
- Calculus: derivatives and chain rule — AD systematically applies the chain rule to compound functions and relies on knowing basic derivatives of elementary functions.
- Linear algebra: vectors, matrices, Jacobian — Understanding JVPs (Jv) and VJPs (w^T J) requires familiarity with matrix–vector products and Jacobians.
- C++ operator overloading and templates — AD in C++ is typically implemented by overloading arithmetic and math functions for custom number types.
- Graphs and topological order — Reverse mode uses a computation graph and a reverse topological traversal to propagate adjoints correctly.
- Floating-point arithmetic — AD computes derivatives to machine precision; understanding floating-point limits helps diagnose numerical issues.
Detailed Explanation
01 Overview
Automatic Differentiation (AD) is a set of techniques to compute exact derivatives of functions implemented as programs. Unlike symbolic differentiation, it does not manipulate algebraic expressions; and unlike finite differences, it does not approximate derivatives with fragile subtractions. Instead, AD decomposes a program into a sequence of elementary operations with known derivatives and systematically applies the chain rule. The resulting derivatives are accurate to machine precision and have predictable computational cost.
There are two main modes of AD. Forward mode propagates derivatives from inputs to outputs during a single pass. It is best understood through dual numbers, which pack a value and its derivative into one object that travels through computations. Reverse mode (also called adjoint mode or backpropagation) runs the program forward to record a computation graph and then runs a backward pass that distributes sensitivities from the outputs to the inputs. Reverse mode is the engine behind most modern machine learning training procedures because it can compute gradients of scalar losses with respect to millions of parameters in time comparable to one program run.
In practice, C++ supports AD elegantly via operator overloading: you can define number-like types (dual numbers or nodes in a tape) that carry derivative information. Each arithmetic operation and math function updates both value and derivative consistently. The choice between forward and reverse mode depends on the shape of your function (number of inputs n versus outputs m) and which product with the Jacobian you need (JVP or VJP).
02 Intuition & Analogies
Imagine a factory assembly line that turns raw materials (inputs) into a finished product (output). Each station performs a simple task: cutting, drilling, painting, etc. If you want to know how much a tiny change in a specific raw material affects the final product, you could:
- Forward mode (dual numbers): Attach a “sensitivity tag” to a chosen input when it enters the line. At every station, the tag is updated according to how that station’s operation influences the result. By the time the product leaves the line, the tag tells you the effect of that input on the output. If you want sensitivities for multiple inputs, you repeat the process with different tags.
- Reverse mode (adjoints): After producing the item, place a price tag on the final product and walk backwards through the assembly line. At each station, you split that price tag among the inputs according to how each input contributes to the station’s output (local partial derivatives). When you reach the start, every raw material has a tag showing how it influences the final product’s value. Crucially, one backward walk gives you the sensitivities for all inputs at once—as long as there’s only one final product value you care about.
Dual numbers make forward mode mechanical: carry two numbers (value and slope). Operations like multiply and sin update both: for z = xy, value z is xy, and slope z' is x'y + xy'. Reverse mode needs memory (a log, or tape) so that on the way back, each station knows how to distribute the price tag. This is why reverse mode uses more memory but shines when there are many inputs and a single scalar output.
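As a tiny worked instance of that rule, take x = 2 carrying slope 1 (we are differentiating with respect to x) and y = 3 carrying slope 0. Multiplying the dual numbers gives
(2 + 1\varepsilon)(3 + 0\varepsilon) = 6 + (1 \cdot 3 + 2 \cdot 0)\,\varepsilon = 6 + 3\varepsilon,
so z = xy has value 6 and slope \partial z / \partial x = 3, exactly as the product rule predicts.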
03 Formal Definition
A differentiable program computes a function f: \mathbb{R}^{n} \to \mathbb{R}^{m} as a composition of elementary operations whose derivatives are known. Forward mode propagates tangents: given a seed direction \dot{x} \in \mathbb{R}^{n}, it evaluates \dot{y} = J(x)\,\dot{x} alongside y = f(x) in a single pass. Reverse mode propagates adjoints: given a seed \bar{y} \in \mathbb{R}^{m}, a backward pass over the recorded computation graph evaluates \bar{x} = J(x)^{\top}\,\bar{y}. Both are exact applications of the chain rule to the composition of elementary operations.
04 When to Use
- Forward mode (dual numbers):
- Best for functions f: \mathbb{R}^{n} \to \mathbb{R}^{m} with small n and larger m. Each forward sweep gives derivatives with respect to one seed direction. Ideal for computing directional derivatives, tangents in ODE solvers, and sensitivity of a few parameters.
- Useful for Jacobian–vector products J(x)v without building the full Jacobian, which appears in Krylov methods and Newton–Krylov solvers.
- Reverse mode (adjoint/backprop):
- Best for scalar-output functions f: \mathbb{R}^{n} \to \mathbb{R}, where n is large (e.g., neural network training, least-squares objectives). One backward pass gets the full gradient.
- Useful for vector–Jacobian products w^{\top} J(x). When you only need a weighted sum of outputs’ sensitivities, combine outputs as L = w^{\top} f(x) and backprop once.
- Mixed strategies:
- Full Jacobian of f: \mathbb{R}^{n} \to \mathbb{R}^{m} can be computed by n forward sweeps or m reverse sweeps; choose the smaller.
- Higher-order derivatives: nest modes (e.g., forward-over-reverse) to get Hessian–vector products efficiently; a minimal nested-dual sketch for second derivatives appears after this list.
- Practical C++ considerations:
- Choose forward mode for small parameter vectors, geometric libraries, and compile-time expression templates.
- Choose reverse mode for ML models, optimization with many parameters, and any scalar loss over large input dimension.
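Below is a minimal sketch of the nesting idea mentioned under mixed strategies. It uses forward-over-forward rather than forward-over-reverse, implements only + and * for brevity, and the names (DualT, D2) are illustrative rather than taken from any library; a Dual<Dual<double>>-style nesting carries the value, first derivative, and second derivative of a univariate function in one pass.

#include <iostream>

// A minimal templated dual number. Nesting DualT<DualT<double>> propagates
// second derivatives; only + and * are implemented here for brevity.
template <typename T>
struct DualT {
    T val; // value part
    T dot; // derivative part
    DualT(T v = T(0), T d = T(0)) : val(v), dot(d) {}
};

template <typename T>
DualT<T> operator+(const DualT<T>& a, const DualT<T>& b) {
    return DualT<T>(a.val + b.val, a.dot + b.dot);
}
template <typename T>
DualT<T> operator*(const DualT<T>& a, const DualT<T>& b) {
    // Product rule applied recursively at each nesting level
    return DualT<T>(a.val * b.val, a.dot * b.val + a.val * b.dot);
}

int main() {
    using D2 = DualT<DualT<double>>;
    // Seed x = 2 with unit tangents at both nesting levels.
    D2 x(DualT<double>(2.0, 1.0), DualT<double>(1.0, 0.0));
    D2 f = x * x * x + x * x; // f(x) = x^3 + x^2
    std::cout << "f(2)   = " << f.val.val << "\n"; // 12
    std::cout << "f'(2)  = " << f.val.dot << "\n"; // 16 = 3*4 + 2*2
    std::cout << "f''(2) = " << f.dot.dot << "\n"; // 14 = 6*2 + 2
    return 0;
}

The same nesting pattern, applied with a forward-mode type over reverse-mode nodes, is how Hessian–vector products are typically obtained.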
⚠️ Common Mistakes
- Confusing AD with finite differences: Finite differences use f(x+h)-f(x) and suffer from truncation and cancellation errors; AD applies exact chain rule on the executed program. Use AD for high accuracy and stability.
- Using forward mode for huge input dimension: Forward mode computes one column of the Jacobian per sweep. For f: \mathbb{R}^{n} \to \mathbb{R}, this means O(n) sweeps. Prefer reverse mode when n is large and the output is scalar.
- Forgetting to seed derivatives in forward mode: If you don’t set \dot{x}_{i}, you’ll get zero derivatives. Seed with basis vectors for gradients or with a custom direction for a JVP.
- Ignoring control flow and side effects: Reverse mode needs a faithful tape. Mutating variables used in the forward pass, freeing memory early, or mixing threads without care can corrupt the graph and produce wrong gradients.
- Not resetting adjoints before a new backward pass: In reverse mode, stale gradients accumulate if you reuse nodes. Always zero adjoints before backprop or rebuild the graph.
- Differentiating through non-differentiable or discontinuous operations: Functions like max at a kink or integer/bitwise ops have undefined or zero derivatives. Use smooth approximations or subgradient-aware logic.
- Overloading without constexpr/math coverage: In C++, forgetting to overload sin, cos, exp, log, etc., for your AD type leads to implicit conversions to double and lost derivatives.
- Memory blow-up in reverse mode: Storing every intermediate creates large tapes. Use checkpointing or recomputation to trade memory for time.
Key Formulas
Dual Number: x + \dot{x}\,\varepsilon, where \varepsilon^{2} = 0
Explanation: A dual number augments a real value with an infinitesimal part that carries the derivative. Arithmetic with dual numbers automatically applies the chain rule because \varepsilon^{2} = 0 eliminates higher-order terms.
Lifted Elementary Operation: f(x + \dot{x}\,\varepsilon) = f(x) + f'(x)\,\dot{x}\,\varepsilon
Explanation: Every differentiable elementary function lifts to dual numbers by updating the derivative via the usual single-variable chain rule. This is the core rule enabling forward-mode AD.
Jacobian Matrix: J_{ij}(x) = \partial f_{i}(x) / \partial x_{j}
Explanation: The Jacobian collects all first derivatives of a vector-valued function f: \mathbb{R}^{n} \to \mathbb{R}^{m}. Its shape determines which AD mode is more efficient.
Products with the Jacobian: JVP: J(x)v; VJP: w^{\top} J(x)
Explanation: Forward mode naturally computes Jv by seeding tangents with v; reverse mode naturally computes w^{\top} J by backpropagating from a linear combination of outputs with weights w.
Reverse Accumulation Rule: \bar{x}_{i} = \sum_{j} \bar{z}_{j}\,(\partial z_{j} / \partial x_{i})
Explanation: In reverse mode, the adjoint of each input accumulates contributions from adjoints of its children scaled by local partial derivatives. This is the discrete chain rule on the computation graph.
Product Rule: (uv)' = u'v + uv'
Explanation: Dual numbers and reverse mode both rely on familiar calculus rules. The product rule is applied at each multiplication node.
Elementary Derivatives: (\sin x)' = \cos x, \ (\cos x)' = -\sin x, \ (e^{x})' = e^{x}, \ (\ln x)' = 1/x
Explanation: These standard derivatives define the local partials used by AD for trigonometric and exponential/logarithmic operations.
Mode Selection Heuristic: \mathrm{cost}(J) \approx \min(n, m) \cdot \mathrm{cost}(f)
Explanation: For f: \mathbb{R}^{n} \to \mathbb{R}^{m} with base evaluation cost \mathrm{cost}(f), computing the full Jacobian requires n forward sweeps or m reverse sweeps. Choose the smaller of n and m.
Directional Derivative: D_{v} f(x) = J(x)\,v
Explanation: The directional derivative equals a JVP with direction v. Forward mode computes this in one sweep by seeding tangents with v.
Adjoint Recurrence: \bar{v}_{i} = \sum_{j \in \mathrm{children}(i)} \bar{v}_{j}\,(\partial v_{j} / \partial v_{i})
Explanation: This expresses each node’s adjoint as the sum over its outgoing edges of the child adjoints times local partials. It is the basis of reverse-mode backpropagation.
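To make the recurrence concrete, consider the same function used in the reverse-mode code example below: let g = xy + \sin z and f = g^{2}. Seeding \bar{f} = 1, the backward pass gives
\bar{g} = \bar{f} \cdot 2g = 2g, \quad \bar{x} = \bar{g} \cdot y = 2gy, \quad \bar{y} = \bar{g} \cdot x = 2gx, \quad \bar{z} = \bar{g} \cdot \cos z = 2g\cos z,
which matches the analytic gradients printed by that program.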
Complexity Analysis
For f: \mathbb{R}^{n} \to \mathbb{R}^{m}, one forward sweep costs a small constant multiple of a plain evaluation of f and yields one Jacobian column, so the full Jacobian takes roughly n sweeps; one reverse sweep yields one Jacobian row (the full gradient when m = 1) at a similar per-sweep cost, so the full Jacobian takes roughly m sweeps. Reverse mode additionally stores a tape of intermediate values, so its memory grows with the number of recorded operations unless checkpointing trades recomputation for memory.
Code Examples
#include <iostream>
#include <cmath>
#include <vector>
#include <iomanip>

// Simple dual number type: value + epsilon * derivative, with epsilon^2 = 0
struct Dual {
    double val; // primal value
    double d;   // derivative (tangent) w.r.t. chosen seed direction

    Dual(double v = 0.0, double der = 0.0) : val(v), d(der) {}
};

// Arithmetic operators for Dual
inline Dual operator+(const Dual& a, const Dual& b) {
    return Dual(a.val + b.val, a.d + b.d);
}
inline Dual operator-(const Dual& a, const Dual& b) {
    return Dual(a.val - b.val, a.d - b.d);
}
inline Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: (ab)' = a'b + ab'
    return Dual(a.val * b.val, a.d * b.val + a.val * b.d);
}
inline Dual operator/(const Dual& a, const Dual& b) {
    // Quotient rule: (a/b)' = (a'b - ab') / b^2
    double denom = b.val * b.val;
    return Dual(a.val / b.val, (a.d * b.val - a.val * b.d) / denom);
}
// Overloads with double on right/left
inline Dual operator+(const Dual& a, double b) { return Dual(a.val + b, a.d); }
inline Dual operator+(double a, const Dual& b) { return Dual(a + b.val, b.d); }
inline Dual operator-(const Dual& a, double b) { return Dual(a.val - b, a.d); }
inline Dual operator-(double a, const Dual& b) { return Dual(a - b.val, -b.d); }
inline Dual operator*(const Dual& a, double b) { return Dual(a.val * b, a.d * b); }
inline Dual operator*(double a, const Dual& b) { return Dual(a * b.val, a * b.d); }
inline Dual operator/(const Dual& a, double b) { return Dual(a.val / b, a.d / b); }
inline Dual operator/(double a, const Dual& b) {
    // (a / b)' = -a * b' / b^2
    double denom = b.val * b.val;
    return Dual(a / b.val, -a * b.d / denom);
}

// Elementary functions for Dual
inline Dual sin(const Dual& x) { return Dual(std::sin(x.val), std::cos(x.val) * x.d); }
inline Dual cos(const Dual& x) { return Dual(std::cos(x.val), -std::sin(x.val) * x.d); }
inline Dual exp(const Dual& x) { double e = std::exp(x.val); return Dual(e, e * x.d); }
inline Dual log(const Dual& x) { return Dual(std::log(x.val), x.d / x.val); }
inline Dual pow(const Dual& x, double p) { // x^p
    double pv = std::pow(x.val, p);
    return Dual(pv, p * std::pow(x.val, p - 1.0) * x.d);
}

int main() {
    std::cout << std::fixed << std::setprecision(6);

    // 1) Single-variable derivative: f(x) = x^3 + sin(x) at x=2
    auto f1 = [](const Dual& x) { return pow(x, 3.0) + sin(x); };
    Dual x(2.0, 1.0); // seed derivative dx=1 to get df/dx
    Dual fx = f1(x);
    std::cout << "f(x)=x^3+sin(x) at x=2 => value=" << fx.val
              << ", derivative=" << fx.d << "\n";

    // 2) Multivariate gradient via multiple seeds: f(x,y) = x*y + sin(x)
    auto f2 = [](const Dual& X, const Dual& Y) { return X * Y + sin(X); };
    double x0 = 2.0, y0 = 3.0;
    // Partial w.r.t x: seed (dx, dy) = (1, 0)
    Dual Xx(x0, 1.0), Yx(y0, 0.0);
    Dual fxy_x = f2(Xx, Yx);
    // Partial w.r.t y: seed (dx, dy) = (0, 1)
    Dual Xy(x0, 0.0), Yy(y0, 1.0);
    Dual fxy_y = f2(Xy, Yy);
    std::cout << "f(x,y)=x*y+sin(x) at (2,3) => value=" << fxy_x.val
              << ", df/dx=" << fxy_x.d << ", df/dy=" << fxy_y.d << "\n";

    // 3) JVP (Jacobian–vector product) for a vector-valued function
    //    F(x,y) = [ x*y, sin(x) + y^2 ] at (x,y)=(2,3) in direction v=(1,2)
    auto F = [](const Dual& X, const Dual& Y) {
        std::vector<Dual> out;
        out.reserve(2);
        out.push_back(X * Y);               // F1 = x*y
        out.push_back(sin(X) + pow(Y, 2));  // F2 = sin(x) + y^2
        return out;
    };
    Dual Xv(x0, 1.0), Yv(y0, 2.0); // seed with v=(1,2)
    auto Fv = F(Xv, Yv);
    std::cout << "F(x,y) at (2,3) => values=[" << Fv[0].val << ", " << Fv[1].val
              << "], Jv=[" << Fv[0].d << ", " << Fv[1].d << "]\n";

    return 0;
}
This program defines a Dual type that carries both a value and a derivative (tangent). Overloaded arithmetic and math functions automatically propagate derivatives via the chain rule. We demonstrate: (1) a single-variable derivative by seeding dx=1; (2) a two-variable gradient by performing two seeds—one for each coordinate; and (3) a Jacobian–vector product (directional derivative) for a vector-valued function by seeding the tangent with the desired direction vector.
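As a quick accuracy check (a sketch only, assuming the Dual type and its sin/exp overloads from the listing above are in the same translation unit; compare_with_finite_differences is an illustrative name), one can compare the AD derivative of f(x) = exp(sin(x)) against a central finite difference and the closed-form derivative cos(x)·exp(sin(x)):

// Assumes the Dual type and its overloads from the listing above are in scope.
void compare_with_finite_differences() {
    double x0 = 1.3;
    Dual x(x0, 1.0);               // seed dx = 1
    Dual fx = exp(sin(x));         // AD: value and derivative in one pass
    double h = 1e-5;
    double fd = (std::exp(std::sin(x0 + h)) - std::exp(std::sin(x0 - h))) / (2.0 * h);
    double exact = std::cos(x0) * std::exp(std::sin(x0));
    std::cout << "AD:    " << fx.d  << "\n"
              << "FD:    " << fd    << "\n"
              << "exact: " << exact << "\n";
}

The AD result agrees with the closed form to machine precision, while the central difference carries truncation error of order h^2 plus cancellation error from the subtraction.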
#include <iostream>
#include <vector>
#include <cmath>
#include <memory>
#include <functional>
#include <unordered_set>
#include <iomanip>

namespace ad {

struct Var;
using VarPtr = std::shared_ptr<Var>;

struct Edge {
    VarPtr parent;                       // pointer to input node
    std::function<double()> local_grad;  // returns \partial z / \partial parent
};

struct Var {
    double val;                 // primal value
    double grad = 0.0;          // adjoint (dL/dVar)
    std::vector<Edge> parents;  // edges to inputs with local partials
    explicit Var(double v) : val(v) {}
};

inline VarPtr constant(double v) { return std::make_shared<Var>(v); }
inline VarPtr variable(double v) { return std::make_shared<Var>(v); }

// Primitive ops: +, -, *, unary functions
inline VarPtr add(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val + b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return 1.0; }});
    return z;
}
inline VarPtr sub(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val - b->val);
    z->parents.push_back({a, [=](){ return 1.0; }});
    z->parents.push_back({b, [=](){ return -1.0; }});
    return z;
}
inline VarPtr mul(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val * b->val);
    z->parents.push_back({a, [=](){ return b->val; }});
    z->parents.push_back({b, [=](){ return a->val; }});
    return z;
}
inline VarPtr divv(const VarPtr& a, const VarPtr& b) {
    auto z = std::make_shared<Var>(a->val / b->val);
    z->parents.push_back({a, [=](){ return 1.0 / b->val; }});
    z->parents.push_back({b, [=](){ return -a->val / (b->val * b->val); }});
    return z;
}
inline VarPtr sinv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::sin(x->val));
    z->parents.push_back({x, [=](){ return std::cos(x->val); }});
    return z;
}
inline VarPtr cosv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::cos(x->val));
    z->parents.push_back({x, [=](){ return -std::sin(x->val); }});
    return z;
}
inline VarPtr expv(const VarPtr& x) {
    auto e = std::exp(x->val);
    auto z = std::make_shared<Var>(e);
    z->parents.push_back({x, [=](){ return e; }});
    return z;
}
inline VarPtr logv(const VarPtr& x) {
    auto z = std::make_shared<Var>(std::log(x->val));
    z->parents.push_back({x, [=](){ return 1.0 / x->val; }});
    return z;
}
inline VarPtr powv(const VarPtr& x, double p) {
    double pv = std::pow(x->val, p);
    auto z = std::make_shared<Var>(pv);
    z->parents.push_back({x, [=](){ return p * std::pow(x->val, p - 1.0); }});
    return z;
}

// Operator overloads for convenience
inline VarPtr operator+(const VarPtr& a, const VarPtr& b) { return add(a,b); }
inline VarPtr operator-(const VarPtr& a, const VarPtr& b) { return sub(a,b); }
inline VarPtr operator*(const VarPtr& a, const VarPtr& b) { return mul(a,b); }
inline VarPtr operator/(const VarPtr& a, const VarPtr& b) { return divv(a,b); }
inline VarPtr operator+(const VarPtr& a, double b) { return add(a, constant(b)); }
inline VarPtr operator+(double a, const VarPtr& b) { return add(constant(a), b); }
inline VarPtr operator-(const VarPtr& a, double b) { return sub(a, constant(b)); }
inline VarPtr operator-(double a, const VarPtr& b) { return sub(constant(a), b); }
inline VarPtr operator*(const VarPtr& a, double b) { return mul(a, constant(b)); }
inline VarPtr operator*(double a, const VarPtr& b) { return mul(constant(a), b); }
inline VarPtr operator/(const VarPtr& a, double b) { return divv(a, constant(b)); }
inline VarPtr operator/(double a, const VarPtr& b) { return divv(constant(a), b); }

// Build reverse topological order via DFS
void dfs(const VarPtr& v, std::vector<VarPtr>& order, std::unordered_set<Var*>& seen) {
    if (!v) return;
    if (seen.count(v.get())) return;
    seen.insert(v.get());
    for (const auto& e : v->parents) dfs(e.parent, order, seen);
    order.push_back(v); // push after parents: topological order
}

void backward(const VarPtr& out) {
    // 1) Collect nodes in topological order
    std::vector<VarPtr> order; order.reserve(1024);
    std::unordered_set<Var*> seen; seen.reserve(1024);
    dfs(out, order, seen);

    // 2) Initialize adjoints to zero (already zero for fresh graph)
    for (auto& v : order) v->grad = 0.0;

    // 3) Seed output gradient and traverse in reverse
    out->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        VarPtr node = *it;
        for (auto& e : node->parents) {
            e.parent->grad += node->grad * e.local_grad();
        }
    }
}

} // namespace ad

int main() {
    using namespace ad;
    std::cout << std::fixed << std::setprecision(6);

    // Example: f(x,y,z) = (x*y + sin(z))^2 at (2,3,0.5)
    VarPtr x = variable(2.0);
    VarPtr y = variable(3.0);
    VarPtr z = variable(0.5);

    VarPtr f = powv(x * y + sinv(z), 2.0);
    backward(f);

    std::cout << "f(x,y,z) = (x*y + sin(z))^2\n";
    std::cout << "value = " << f->val << "\n";
    std::cout << "df/dx = " << x->grad << ", df/dy = " << y->grad << ", df/dz = " << z->grad << "\n";

    // Check against analytical gradients:
    // g = x*y + sin(z); f = g^2; df/dx = 2*g*y; df/dy = 2*g*x; df/dz = 2*g*cos(z)
    double g = x->val * y->val + std::sin(z->val);
    std::cout << "(analytic) df/dx = " << 2*g*y->val
              << ", df/dy = " << 2*g*x->val
              << ", df/dz = " << 2*g*std::cos(z->val) << "\n";

    return 0;
}
This minimal reverse-mode engine builds a computation graph where each node stores its value and a list of parents with local partial derivatives. A depth-first search produces a topological order. The backward pass seeds the output adjoint with 1 and propagates gradients to parents using the chain rule. The example computes the gradient of a scalar function of three variables, illustrating that one backward pass yields all partial derivatives simultaneously.
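Building on that engine, the weighted-output trick from the When to Use section can be sketched as follows (assuming the ad namespace from the listing above is in the same translation unit; vjp_demo is an illustrative name). It computes the vector–Jacobian product w^{\top} J for F(x,y) = [xy, sin(x) + y^{2}] by backpropagating once through the scalar L = w_{1} F_{1} + w_{2} F_{2}:

// Assumes the ad namespace from the listing above is available.
void vjp_demo() {
    using namespace ad;
    VarPtr x = variable(2.0);
    VarPtr y = variable(3.0);
    VarPtr F1 = x * y;                   // first output
    VarPtr F2 = sinv(x) + powv(y, 2.0);  // second output
    double w1 = 1.0, w2 = 2.0;           // VJP weights
    VarPtr L = w1 * F1 + w2 * F2;        // scalar combination of outputs
    backward(L);                         // one reverse sweep gives the whole VJP
    std::cout << "w^T J = [" << x->grad << ", " << y->grad << "]\n";
    // Analytically: [w1*y + w2*cos(x), w1*x + w2*2*y] = [3 + 2*cos(2), 14]
}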