Multivariable Chain Rule
Key Points
- The multivariable chain rule explains how rates of change pass through a pipeline of functions by multiplying the right derivatives (Jacobians) in the right order.
- For vector-valued functions, the derivative is a Jacobian matrix, and the chain rule says J_{f∘g}(x) = J_f(g(x)) J_g(x).
- When the outer function is scalar, the gradient version is ∇(f∘g)(x) = J_g(x)^T ∇f(g(x)), which is the basis of backpropagation in machine learning.
- You must always evaluate outer derivatives at the inner function’s current output; forgetting this is a very common mistake.
- Dimensions must match: if g: R^n → R^m and f: R^m → R^p, then J_f(g(x)) is p×m and J_g(x) is m×n, so their product is p×n.
- Numerically, you can verify the chain rule by computing Jacobians with finite differences and checking that the product matches the Jacobian of the composition.
- Automatic differentiation (AD) implements the chain rule mechanically; forward mode propagates derivative columns, while reverse mode propagates gradients backward.
- Computational cost depends on method: central differences need 2n function calls per output, forward-mode AD scales with input dimension, and reverse-mode AD scales with output dimension.
Prerequisites
- Single-variable chain rule — Provides the core idea that derivatives of compositions multiply; the multivariable version generalizes this.
- Partial derivatives — Chain rule entries are sums/products of partial derivatives; you must know how to compute them.
- Gradient and Jacobian — The multivariable chain rule is stated succinctly in terms of gradients and Jacobians.
- Matrix multiplication — Combining sensitivities uses matrix products of Jacobians; dimension checks rely on linear algebra.
- Linear approximation and differentials — The chain rule composes local linear maps; understanding differentials clarifies why Jacobian products arise.
- Vector calculus notation — Comfort with ∇, dot products, and indexing helps read and manipulate formulas correctly.
- Numerical differentiation — Useful for verifying analytic derivatives and understanding practical errors and step-size tradeoffs.
- Automatic differentiation basics — Shows how the chain rule is implemented in software for exact derivatives.
Detailed Explanation
01 Overview
The multivariable chain rule is the rule for differentiating compositions of functions when there are many inputs and outputs. In single-variable calculus, it says how the slope changes when one variable depends on another. In multiple variables, each function can map between spaces of different dimensions, so the derivative is best captured by a Jacobian matrix (a grid of partial derivatives). If you have g mapping from R^n to R^m and f mapping from R^m to R^p, then their composition h = f ∘ g maps R^n to R^p. The chain rule states that the Jacobian of h at x equals the matrix product of the Jacobians of f (evaluated at g(x)) and g (evaluated at x). This is a compact and powerful way to describe how small changes in inputs flow through multiple stages.
This idea powers many practical fields: in optimization, it explains how gradients change when you reparameterize variables; in physics, it connects rates like velocity and temperature that depend on moving positions; in computer graphics and robotics, it ties together coordinate transforms; and in machine learning, it is the mathematical heart of backpropagation. The chain rule also provides a way to organize calculations efficiently: for scalar outputs, reverse-mode automatic differentiation is preferred; for few inputs, forward-mode works well. Computationally, you can check or approximate the pieces using finite differences and then assemble them using the matrix product implied by the chain rule.
02 Intuition & Analogies
Imagine an assembly line: raw parts (inputs) go through station g first, getting reshaped into intermediate parts, and then through station f to become final products. If you nudge a raw part slightly, the final product changes for two reasons: (1) the intermediate part changes a bit (how sensitive g is), and (2) the final step responds to that change (how sensitive f is to its input). The total effect equals “sensitivity of f to its input” times “sensitivity of g to the original input.” That’s exactly the chain rule.
Another analogy is navigation with a GPS. Suppose your current location depends on joystick inputs (x and y), and your displayed map coordinates (U, V, W) depend on your location through a projection. A tiny joystick push alters location a little, and that shifted location alters the displayed coordinates. The total change in displayed coordinates is the product of two maps: the local linear effect of joystick on location, and the local linear effect of location on the projection. Multiplying these small linear effects (matrices) gives the overall small change.
For a scalar readout like temperature at a moving point, the chain rule is like following ripples: the temperature changes as you move in space (its gradient), and you’re moving with some velocity. The rate of temperature change you feel is the dot product of “which direction is hotter” (gradient) and “which way you’re moving” (velocity). When you stack many stages (as in neural networks), you just multiply more of these small linear effects in order from last to first, like passing a message backward through layers.
03 Formal Definition
Let g: R^n → R^m be differentiable at x, and let f: R^m → R^p be differentiable at g(x). Then the composition h = f ∘ g is differentiable at x, and
J_h(x) = J_f(g(x)) J_g(x),
where J_g(x) is the m×n Jacobian of g at x and J_f(g(x)) is the p×m Jacobian of f evaluated at the inner output g(x). Component-wise, with intermediate variables u = g(x),
∂h_i/∂x_j = Σ_{k=1}^{m} (∂f_i/∂u_k)(g(x)) · (∂g_k/∂x_j)(x).
04 When to Use
- Reparameterizations: If you express variables through others (like polar to Cartesian coordinates), use the chain rule to get derivatives in the new variables.
- Neural networks and backpropagation: Gradients of loss with respect to weights are computed by repeatedly applying the chain rule layer by layer.
- Physics along trajectories: Quantities like temperature, density, or potential that depend on position and time via a moving particle require the path version of the chain rule.
- Computer graphics and robotics: Stacking rotations, scalings, and translations amounts to composing maps; the chain rule tells you how sensitivities combine via Jacobian products.
- Optimization with constraints or substitutions: When an objective depends on an intermediate function (e.g., z = g(x) and objective L(z)), the gradient with respect to x is J_g(x)^T ∇L.
- Probabilistic modeling and change of variables: Jacobian determinants appear in density transformations; computing or estimating Jacobians relies on the chain rule.
- Numerical verification and debugging: To check a hand-derived gradient, compute numerical Jacobians by finite differences and compare them with Jacobian products implied by the chain rule.
⚠️ Common Mistakes
- Dimension mismatch: Forgetting that J_f(g(x)) must be p×m and J_g(x) must be m×n. Always write down shapes to confirm the product is defined.
- Evaluating at the wrong point: The outer Jacobian must be evaluated at g(x), not at x. Omitting this is a frequent source of incorrect results.
- Gradient vs. Jacobian conventions: Mixing row and column gradient conventions can introduce missing transposes. Pick a convention and stick to it.
- Ignoring nondifferentiability: If either function is not differentiable at the evaluation point, the chain rule may fail. Check continuity and differentiability.
- Numerical pitfalls: Using too large a step size h in finite differences gives truncation error; too small gives round-off error. Central differences usually balance accuracy and stability.
- Confusing elementwise and matrix products: The chain rule uses matrix multiplication of Jacobians, not elementwise multiplication.
- Overlooking multiple inner functions: When the outer function depends on several inner outputs, the correct sum over those channels appears via matrix multiplication; do not drop terms.
- AD mode choice: Using forward mode for very high-dimensional inputs can be slow; for scalar outputs with many inputs, reverse mode (backprop) is usually more efficient.
Key Formulas
Jacobian Chain Rule
J_{f∘g}(x) = J_f(g(x)) J_g(x)
Explanation: The Jacobian of a composition equals the product of the outer Jacobian (evaluated at the inner output) with the inner Jacobian. Use this to combine sensitivities through stages.
Component-wise Chain Rule
∂(f∘g)_i/∂x_j = Σ_{k=1}^{m} (∂f_i/∂u_k)(g(x)) · (∂g_k/∂x_j)(x)
Explanation: Each output component’s derivative with respect to each input equals a sum over all intermediate channels. This mirrors matrix multiplication entries.
Gradient Chain Rule
∇(f∘g)(x) = J_g(x)^T ∇f(g(x))
Explanation: For a scalar outer function, the gradient with respect to inputs is the transpose of the inner Jacobian times the outer gradient. This is the basis of backpropagation.
Path (Single-Parameter) Chain Rule
d/dt f(r(t)) = ∇f(r(t)) · r′(t)
Explanation: When inputs move along a path r(t), the rate of change of f equals the dot product of the gradient of f with the velocity along the path.
Multi-layer Chain Rule
J_{f_L ∘ ⋯ ∘ f_1}(x) = J_{f_L}(a_{L−1}) ⋯ J_{f_2}(a_1) J_{f_1}(x), where a_k = f_k(a_{k−1}) and a_0 = x
Explanation: For many composed layers, the total Jacobian is the ordered product of all local Jacobians evaluated at the correct intermediate points.
Second-Order (Scalar Outer) Chain Rule
∇²(f∘g)(x) = J_g(x)^T ∇²f(g(x)) J_g(x) + Σ_{k=1}^{m} (∂f/∂u_k)(g(x)) ∇²g_k(x)
Explanation: The Hessian of a scalar composition has a quadratic term from curvature of the outer function and additional terms from curvature of the inner functions.
Central Difference Approximation
∂f_i/∂x_j ≈ [f_i(x + h·e_j) − f_i(x − h·e_j)] / (2h)
Explanation: This estimates Jacobian entries numerically using symmetric perturbations. It has error on the order of h² for smooth functions.
Operator Norm Bound
‖J_{f∘g}(x)‖ ≤ ‖J_f(g(x))‖ · ‖J_g(x)‖
Explanation: Local stretching of a composition does not exceed the product of the stretchings of the parts. Useful for stability and Lipschitz estimates.
Complexity Analysis
For h = f ∘ g with g: R^n → R^m and f: R^m → R^p, forming the Jacobian product J_f(g(x)) J_g(x) costs O(p·m·n) arithmetic once the factors are known. Central differences require 2 evaluations per input dimension (2n calls of the vector function for a full Jacobian). Forward-mode AD costs roughly one derivative pass per input (scales with n), while reverse-mode AD costs roughly one pass per output (scales with p), which is why backpropagation is the method of choice for scalar losses with many parameters.
Code Examples
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Pretty-print helpers
void printVec(const string& name, const Vec& v) {
    cout << name << " = [";
    for (size_t i = 0; i < v.size(); ++i) {
        cout << fixed << setprecision(6) << v[i] << (i + 1 == v.size() ? "" : ", ");
    }
    cout << "]\n";
}

void printMat(const string& name, const Mat& A) {
    cout << name << " (" << A.size() << "x" << (A.empty() ? 0 : A[0].size()) << "):\n";
    for (const auto& row : A) {
        cout << "  ";
        for (size_t j = 0; j < row.size(); ++j) {
            cout << setw(12) << fixed << setprecision(6) << row[j] << (j + 1 == row.size() ? "" : " ");
        }
        cout << "\n";
    }
}

// Define g: R^2 -> R^3
auto g = [](const Vec& x) -> Vec {
    // x = [x0, x1]
    double x0 = x[0], x1 = x[1];
    Vec y(3);
    y[0] = x0 * x1;        // u1 = x*y
    y[1] = x0 * x0 + x1;   // u2 = x^2 + y
    y[2] = exp(x0 - x1);   // u3 = e^{x - y}
    return y;
};

// Define f: R^3 -> R^2
auto f = [](const Vec& u) -> Vec {
    // u = [U, V, W]
    double U = u[0], V = u[1], W = u[2];
    Vec z(2);
    z[0] = U + sin(V);     // z1 = U + sin(V)
    z[1] = V * W;          // z2 = V * W
    return z;
};

// Numerical Jacobian via central differences (O(m*n) function evals)
Mat jacobian(function<Vec(const Vec&)> func, const Vec& x, double h = 1e-6) {
    Vec f0 = func(x);
    size_t m = f0.size();
    size_t n = x.size();
    Mat J(m, Vec(n, 0.0));
    for (size_t k = 0; k < n; ++k) {
        Vec xp = x, xm = x;
        xp[k] += h;
        xm[k] -= h;
        Vec fp = func(xp);
        Vec fm = func(xm);
        for (size_t i = 0; i < m; ++i) {
            J[i][k] = (fp[i] - fm[i]) / (2.0 * h);
        }
    }
    return J;
}

// Matrix multiplication C = A * B
Mat matmul(const Mat& A, const Mat& B) {
    size_t p = A.size();
    size_t m = A.empty() ? 0 : A[0].size();
    size_t n = B.empty() ? 0 : B[0].size();
    if (!B.empty() && B.size() != m) throw runtime_error("Dimension mismatch in matmul");
    Mat C(p, Vec(n, 0.0));
    for (size_t i = 0; i < p; ++i) {
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < n; ++j) {
                C[i][j] += aik * B[k][j];
            }
        }
    }
    return C;
}

int main() {
    cout << setprecision(6) << fixed;
    // Point of evaluation in R^2
    Vec x = {1.2, -0.7};

    // Compose h = f(g(x))
    auto h = [](const Vec& x) -> Vec { return f(g(x)); };

    // Evaluate y = g(x)
    Vec y = g(x);
    printVec("x", x);
    printVec("y = g(x)", y);

    // Compute Jacobians: J_g(x), J_f(y), and their product
    Mat Jg = jacobian(function<Vec(const Vec&)>(g), x);
    Mat Jf = jacobian(function<Vec(const Vec&)>(f), y);
    Mat Jh_chain = matmul(Jf, Jg);

    // Direct numerical Jacobian of the composition h at x
    Mat Jh_direct = jacobian(function<Vec(const Vec&)>(h), x);

    printMat("J_g(x)", Jg);
    printMat("J_f(g(x))", Jf);
    printMat("J_{f∘g}(x) via chain = J_f(g(x)) * J_g(x)", Jh_chain);
    printMat("J_{f∘g}(x) via direct finite diff", Jh_direct);

    // Show the difference between the two Jacobians
    Mat Diff = Jh_chain;
    for (size_t i = 0; i < Diff.size(); ++i)
        for (size_t j = 0; j < Diff[0].size(); ++j)
            Diff[i][j] -= Jh_direct[i][j];
    printMat("Difference (chain - direct)", Diff);

    return 0;
}
We define g: R^2→R^3 and f: R^3→R^2, then compose h=f∘g. Using central differences, we compute numerical Jacobians J_g(x) and J_f(g(x)). The chain rule predicts J_{f∘g}(x) = J_f(g(x)) J_g(x). We verify this by also computing the Jacobian of the composition directly and comparing the two. Up to finite-difference error, the matrices match, confirming the chain rule in Jacobian form.
#include <bits/stdc++.h>
using namespace std;

// Dual number carrying value and gradient (size n)
struct Dual {
    double val;        // primal value
    vector<double> d;  // derivatives wrt each input variable

    Dual(double v = 0.0, int n = 0) : val(v), d(n, 0.0) {}
    static Dual variable(double v, int n, int idx) {
        Dual x(v, n);
        x.d[idx] = 1.0;  // seed derivative for this variable
        return x;
    }
};

// Elementwise operations (chain rule happens in these overloads)
Dual operator+(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val + b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] + b.d[i];
    return r;
}
Dual operator-(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val - b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] - b.d[i];
    return r;
}
Dual operator*(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val * b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = a.d[i] * b.val + a.val * b.d[i];  // product rule
    return r;
}
Dual operator/(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val / b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = (a.d[i] * b.val - a.val * b.d[i]) / (b.val * b.val);
    return r;
}

// Mixed with doubles
Dual operator+(const Dual& a, double c) { Dual r = a; r.val += c; return r; }
Dual operator+(double c, const Dual& a) { return a + c; }
Dual operator-(const Dual& a, double c) { Dual r = a; r.val -= c; return r; }
Dual operator-(double c, const Dual& a) {
    Dual r; r.val = c - a.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = -a.d[i];
    return r;
}
Dual operator*(const Dual& a, double c) { Dual r = a; r.val *= c; for (double& di : r.d) di *= c; return r; }
Dual operator*(double c, const Dual& a) { return a * c; }
Dual operator/(const Dual& a, double c) { Dual r = a; r.val /= c; for (double& di : r.d) di /= c; return r; }

// Elementary functions
Dual sin(const Dual& a) {
    Dual r; r.val = std::sin(a.val); r.d.resize(a.d.size());
    double c = std::cos(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = c * a.d[i];
    return r;
}
Dual cos(const Dual& a) {
    Dual r; r.val = std::cos(a.val); r.d.resize(a.d.size());
    double s = -std::sin(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = s * a.d[i];
    return r;
}
Dual exp(const Dual& a) {
    Dual r; r.val = std::exp(a.val); r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = r.val * a.d[i];
    return r;
}
Dual pow(const Dual& a, double p) {  // a^p, p constant
    Dual r; r.val = std::pow(a.val, p); r.d.resize(a.d.size());
    double coeff = (p == 0.0 ? 0.0 : p * std::pow(a.val, p - 1.0));
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = coeff * a.d[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We compute gradient of F(x,y) = (x*y)^2 + exp(x^2 + sin(y)) at a point.
    // Inner functions: g1 = x*y, g2 = x^2 + sin(y); outer phi(u,v) = u^2 + exp(v).
    const int n = 2;  // number of inputs (x, y)
    double x0 = 1.2, y0 = -0.7;

    Dual X = Dual::variable(x0, n, 0);  // dX/dx = 1, dX/dy = 0
    Dual Y = Dual::variable(y0, n, 1);  // dY/dx = 0, dY/dy = 1

    Dual g1 = X * Y;                  // u = x*y
    Dual g2 = pow(X, 2.0) + sin(Y);   // v = x^2 + sin(y)
    Dual F = pow(g1, 2.0) + exp(g2);  // F = u^2 + exp(v)

    cout.setf(std::ios::fixed); cout << setprecision(8);
    cout << "F(x,y) = (x*y)^2 + exp(x^2 + sin(y))\n";
    cout << "At (x,y) = (" << x0 << ", " << y0 << ")\n";
    cout << "Value: " << F.val << "\n";
    cout << "Gradient: [dF/dx, dF/dy] = [" << F.d[0] << ", " << F.d[1] << "]\n";

    // Optional: compare with numerical central differences
    auto f_scalar = [](double x, double y) {
        return (x * y) * (x * y) + exp(x * x + sin(y));
    };
    double h = 1e-6;
    double fxp = f_scalar(x0 + h, y0), fxm = f_scalar(x0 - h, y0);
    double fyp = f_scalar(x0, y0 + h), fym = f_scalar(x0, y0 - h);
    double dfdx_num = (fxp - fxm) / (2 * h);
    double dfdy_num = (fyp - fym) / (2 * h);
    cout << "Finite-difference gradient ≈ [" << dfdx_num << ", " << dfdy_num << "]\n";

    return 0;
}
This program implements forward-mode automatic differentiation with dual numbers that carry a vector of partial derivatives. Operator overloads apply the chain rule locally for +, −, ×, ÷, sin, exp, and pow. We define a composite scalar function F(x,y) = (xy)^2 + exp(x^2 + sin(y)), evaluate it at a point, and read off the exact gradient from the dual parts. A finite-difference check confirms the result numerically. The chain rule is executed automatically at each operation.