How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Multivariable Chain Rule

Key Points

  • The multivariable chain rule explains how rates of change pass through a pipeline of functions by multiplying the right derivatives (Jacobians) in the right order.
  • For vector-valued functions, the derivative is a Jacobian matrix, and the chain rule says J_{f∘g}(x) = J_f(g(x)) J_g(x).
  • When the outer function is scalar, the gradient version is ∇(φ∘g)(x) = J_g(x)^T ∇φ(g(x)), which is the basis of backpropagation in machine learning.
  • You must always evaluate outer derivatives at the inner function’s current output; forgetting this is a very common mistake.
  • Dimensions must match: if g: R^n → R^m and f: R^m → R^p, then J_f(g(x)) is p×m and J_g(x) is m×n, so their product is p×n.
  • Numerically, you can verify the chain rule by computing Jacobians with finite differences and checking that the product matches the Jacobian of the composition.
  • Automatic differentiation (AD) implements the chain rule mechanically; forward mode propagates derivative columns, while reverse mode propagates gradients backward.
  • Computational cost depends on the method: central differences need 2n function calls, forward-mode AD scales with input dimension, and reverse-mode AD scales with output dimension.

Prerequisites

  • Single-variable chain rule — Provides the core idea that derivatives of compositions multiply; the multivariable version generalizes this.
  • Partial derivatives — Chain-rule entries are sums of products of partial derivatives; you must know how to compute them.
  • Gradient and Jacobian — The multivariable chain rule is stated succinctly in terms of gradients and Jacobians.
  • Matrix multiplication — Combining sensitivities uses matrix products of Jacobians; dimension checks rely on linear algebra.
  • Linear approximation and differentials — The chain rule composes local linear maps; understanding differentials clarifies why Jacobian products arise.
  • Vector calculus notation — Comfort with ∇, dot products, and indexing helps read and manipulate formulas correctly.
  • Numerical differentiation — Useful for verifying analytic derivatives and understanding practical errors and step-size tradeoffs.
  • Automatic differentiation basics — Shows how the chain rule is implemented in software for exact derivatives.

Detailed Explanation


01 Overview

The multivariable chain rule is the rule for differentiating compositions of functions when there are many inputs and outputs. In single-variable calculus, it says how the slope changes when one variable depends on another. In multiple variables, each function can map between spaces of different dimensions, so the derivative is best captured by a Jacobian matrix (a grid of partial derivatives). If you have g mapping from R^n to R^m and f mapping from R^m to R^p, then their composition h = f ∘ g maps R^n to R^p. The chain rule states that the Jacobian of h at x equals the matrix product of the Jacobians of f (evaluated at g(x)) and g (evaluated at x). This is a compact and powerful way to describe how small changes in inputs flow through multiple stages.

This idea powers many practical fields: in optimization, it explains how gradients change when you reparameterize variables; in physics, it connects rates like velocity and temperature that depend on moving positions; in computer graphics and robotics, it ties together coordinate transforms; and in machine learning, it is the mathematical heart of backpropagation. The chain rule also provides a way to organize calculations efficiently: for scalar outputs, reverse-mode automatic differentiation is preferred; for few inputs, forward-mode works well. Computationally, you can check or approximate the pieces using finite differences and then assemble them using the matrix product implied by the chain rule.

02 Intuition & Analogies

Imagine an assembly line: raw parts (inputs) go through station g first, getting reshaped into intermediate parts, and then through station f to become final products. If you nudge a raw part slightly, the final product changes for two reasons: (1) the intermediate part changes a bit (how sensitive g is), and (2) the final step responds to that change (how sensitive f is to its input). The total effect equals “sensitivity of f to its input” times “sensitivity of g to the original input.” That’s exactly the chain rule.

Another analogy is navigation with a GPS. Suppose your current location depends on joystick inputs (x and y), and your displayed map coordinates (U, V, W) depend on your location through a projection. A tiny joystick push alters location a little, and that shifted location alters the displayed coordinates. The total change in displayed coordinates is the product of two maps: the local linear effect of joystick on location, and the local linear effect of location on the projection. Multiplying these small linear effects (matrices) gives the overall small change.

For a scalar readout like temperature at a moving point, the chain rule is like following ripples: the temperature changes as you move in space (its gradient), and you’re moving with some velocity. The rate of temperature change you feel is the dot product of “which direction is hotter” (gradient) and “which way you’re moving” (velocity). When you stack many stages (as in neural networks), you just multiply more of these small linear effects in order from last to first, like passing a message backward through layers.

03 Formal Definition

Let g: R^n → R^m and f: R^m → R^p be differentiable at x ∈ R^n and at y = g(x) ∈ R^m, respectively. The derivative (Jacobian) of a vector-valued function h: R^n → R^p at x is the p × n matrix J_h(x) whose (i, k) entry is ∂h_i/∂x_k (x). The multivariable chain rule states that for the composition h = f ∘ g, the Jacobian at x is given by

J_h(x) = J_f(g(x)) J_g(x),

where J_f(g(x)) is p × m and J_g(x) is m × n. This is equivalent to the component-wise identity

∂(f_i ∘ g)/∂x_k (x) = Σ_{j=1}^{m} ∂f_i/∂u_j (g(x)) · ∂g_j/∂x_k (x).

If the outer function is scalar, φ: R^m → R, then its gradient at y can be written as a 1 × m row (or an m × 1 column, depending on convention), and the chain rule takes the gradient form

∇(φ ∘ g)(x) = J_g(x)^T ∇φ(g(x)).

For a scalar function f: R^n → R along a path r: R → R^n, the chain rule reduces to d/dt f(r(t)) = ∇f(r(t)) · r′(t).

04 When to Use

  • Reparameterizations: If you express variables through others (like polar to Cartesian coordinates), use the chain rule to get derivatives in the new variables.
  • Neural networks and backpropagation: Gradients of loss with respect to weights are computed by repeatedly applying the chain rule layer by layer.
  • Physics along trajectories: Quantities like temperature, density, or potential that depend on position and time via a moving particle require the path version of the chain rule.
  • Computer graphics and robotics: Stacking rotations, scalings, and translations amounts to composing maps; the chain rule tells you how sensitivities combine via Jacobian products.
  • Optimization with constraints or substitutions: When an objective depends on an intermediate function (e.g., z = g(x) and objective L(z)), the gradient with respect to x is J_g(x)^T ∇L.
  • Probabilistic modeling and change of variables: Jacobian determinants appear in density transformations; computing or estimating Jacobians relies on the chain rule.
  • Numerical verification and debugging: To check a hand-derived gradient, compute numerical Jacobians by finite differences and compare them with Jacobian products implied by the chain rule.

⚠️Common Mistakes

  • Dimension mismatch: Forgetting that J_f(g(x)) must be p×m and J_g(x) must be m×n. Always write down shapes to confirm the product is defined.
  • Evaluating at the wrong point: The outer Jacobian must be evaluated at g(x), not at x. Omitting this is a frequent source of incorrect results.
  • Gradient vs. Jacobian conventions: Mixing row and column gradient conventions can introduce missing transposes. Pick a convention and stick to it.
  • Ignoring nondifferentiability: If either function is not differentiable at the evaluation point, the chain rule may fail. Check continuity and differentiability.
  • Numerical pitfalls: Using too large a step size h in finite differences gives truncation error; too small gives round-off error. Central differences usually balance accuracy and stability.
  • Confusing elementwise and matrix products: The chain rule uses matrix multiplication of Jacobians, not elementwise multiplication.
  • Overlooking multiple inner functions: When the outer function depends on several inner outputs, the correct sum over those channels appears via matrix multiplication; do not drop terms.
  • AD mode choice: Using forward mode for very high-dimensional inputs can be slow; for scalar outputs with many inputs, reverse mode (backprop) is usually more efficient.

Key Formulas

Jacobian Chain Rule

J_{f∘g}(x) = J_f(g(x)) J_g(x)

Explanation: The Jacobian of a composition equals the product of the outer Jacobian (evaluated at the inner output) with the inner Jacobian. Use this to combine sensitivities through stages.

Component-wise Chain Rule

∂(f_i ∘ g)/∂x_k (x) = Σ_{j=1}^{m} ∂f_i/∂u_j (g(x)) · ∂g_j/∂x_k (x)

Explanation: Each output component’s derivative with respect to each input equals a sum over all intermediate channels. This mirrors matrix multiplication entries.

Gradient Chain Rule

∇(φ ∘ g)(x) = J_g(x)^T ∇φ(g(x))

Explanation: For a scalar outer function, the gradient with respect to inputs is the transpose of the inner Jacobian times the outer gradient. This is the basis of backpropagation.

Path (Single-Parameter) Chain Rule

d/dt f(r(t)) = ∇f(r(t)) · r′(t)

Explanation: When inputs move along a path r(t), the rate of change of f equals the dot product of the gradient of f with the velocity along the path.

Multi-layer Chain Rule

J_{f_L ∘ f_{L−1} ∘ ⋯ ∘ f_1}(x) = J_{f_L}(z_{L−1}) ⋯ J_{f_2}(z_1) J_{f_1}(x),  where z_k = f_k(z_{k−1}) and z_0 = x

Explanation: For many composed layers, the total Jacobian is the ordered product of all local Jacobians evaluated at the correct intermediate points.

Second-Order (Scalar Outer) Chain Rule

H_{φ∘g}(x) = J_g(x)^T H_φ(g(x)) J_g(x) + Σ_{j=1}^{m} ∂φ/∂u_j (g(x)) · H_{g_j}(x)

Explanation: The Hessian of a scalar composition has a quadratic term from curvature of the outer function and additional terms from curvature of the inner functions.

Central Difference Approximation

∂f_i/∂x_k (x) ≈ [f_i(x + h e_k) − f_i(x − h e_k)] / (2h)

Explanation: This estimates Jacobian entries numerically using symmetric perturbations. It has error on the order of h² for smooth functions.

Operator Norm Bound

‖J_{f∘g}(x)‖ ≤ ‖J_f(g(x))‖ ‖J_g(x)‖

Explanation: Local stretching of a composition does not exceed the product of the stretchings of the parts. Useful for stability and Lipschitz estimates.

Complexity Analysis

Computing Jacobians and applying the chain rule has distinct computational profiles depending on the method.

If you use central finite differences to estimate the Jacobian J_g of g: R^n → R^m, you need 2 evaluations per input dimension, so roughly 2n vector-valued function evaluations for the full matrix: each pair of evaluations fills one column of m entries, for O(mn) derivative entries in total. For a composition h = f ∘ g with f: R^m → R^p, you additionally compute J_f at y = g(x), which costs roughly 2m further evaluations. Forming the product J_f(g(x)) J_g(x) then costs O(pmn) arithmetic operations for dense matrices, which is usually small compared to the cost of the function evaluations themselves when those are expensive (e.g., simulations).

Forward-mode automatic differentiation propagates one “seed” direction through the computation and effectively computes a Jacobian-vector product (JVP) in time proportional to a small constant multiple of one function evaluation. To obtain the full Jacobian, you need n passes (one per input basis vector), so the total cost is about n times the primal evaluation cost; memory overhead is modest, O(1) per active variable for a single seed, or O(n) when all n partials are carried together as in the dual-number example below.

Reverse-mode AD (backpropagation) computes vector-Jacobian products (VJPs) and is most efficient for scalar outputs (p = 1). It yields the full gradient in time on the order of a small constant multiple of a single function evaluation, with memory overhead for storing the computation graph (often the dominant cost).

For dense problems, matrix multiplications among explicit Jacobians cost O(pmn) time and O(pm + mn) space, but in AD you avoid explicit Jacobians by working with JVPs or VJPs, improving both time and space usage for large-scale applications. Numerical differentiation introduces truncation and round-off errors; choosing a step size h ≈ √ε · max(1, ‖x‖) / ‖direction‖ balances accuracy and stability.

Code Examples

Numerical verification of the multivariable chain rule with Jacobians
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Pretty-print helpers
void printVec(const string& name, const Vec& v) {
    cout << name << " = [";
    for (size_t i = 0; i < v.size(); ++i) {
        cout << fixed << setprecision(6) << v[i] << (i + 1 == v.size() ? "" : ", ");
    }
    cout << "]\n";
}

void printMat(const string& name, const Mat& A) {
    cout << name << " (" << A.size() << "x" << (A.empty() ? 0 : A[0].size()) << "):\n";
    for (const auto& row : A) {
        cout << "  ";
        for (size_t j = 0; j < row.size(); ++j) {
            cout << setw(12) << fixed << setprecision(6) << row[j] << (j + 1 == row.size() ? "" : " ");
        }
        cout << "\n";
    }
}

// Define g: R^2 -> R^3
auto g = [](const Vec& x) -> Vec {
    // x = [x0, x1]
    double x0 = x[0], x1 = x[1];
    Vec y(3);
    y[0] = x0 * x1;       // u1 = x*y
    y[1] = x0 * x0 + x1;  // u2 = x^2 + y
    y[2] = exp(x0 - x1);  // u3 = e^{x - y}
    return y;
};

// Define f: R^3 -> R^2
auto f = [](const Vec& u) -> Vec {
    // u = [U, V, W]
    double U = u[0], V = u[1], W = u[2];
    Vec z(2);
    z[0] = U + sin(V); // z1 = U + sin(V)
    z[1] = V * W;      // z2 = V * W
    return z;
};

// Numerical Jacobian via central differences (2n vector function evaluations)
Mat jacobian(function<Vec(const Vec&)> func, const Vec& x, double h = 1e-6) {
    Vec f0 = func(x);
    size_t m = f0.size();
    size_t n = x.size();
    Mat J(m, Vec(n, 0.0));
    for (size_t k = 0; k < n; ++k) {
        Vec xp = x, xm = x;
        xp[k] += h;
        xm[k] -= h;
        Vec fp = func(xp);
        Vec fm = func(xm);
        for (size_t i = 0; i < m; ++i) {
            J[i][k] = (fp[i] - fm[i]) / (2.0 * h);
        }
    }
    return J;
}

// Matrix multiplication C = A * B
Mat matmul(const Mat& A, const Mat& B) {
    size_t p = A.size();
    size_t m = A.empty() ? 0 : A[0].size();
    size_t n = B.empty() ? 0 : B[0].size();
    if (B.size() != m) throw runtime_error("Dimension mismatch in matmul");
    Mat C(p, Vec(n, 0.0));
    for (size_t i = 0; i < p; ++i) {
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < n; ++j) {
                C[i][j] += aik * B[k][j];
            }
        }
    }
    return C;
}

int main() {
    cout << setprecision(6) << fixed;
    // Point of evaluation in R^2
    Vec x = {1.2, -0.7};

    // Compose h = f(g(x))
    auto h = [](const Vec& x) -> Vec {
        return f(g(x));
    };

    // Evaluate y = g(x)
    Vec y = g(x);
    printVec("x", x);
    printVec("y = g(x)", y);

    // Compute Jacobians: J_g(x), J_f(y), and their product
    Mat Jg = jacobian(function<Vec(const Vec&)>(g), x);
    Mat Jf = jacobian(function<Vec(const Vec&)>(f), y);
    Mat Jh_chain = matmul(Jf, Jg);

    // Direct numerical Jacobian of the composition h at x
    Mat Jh_direct = jacobian(function<Vec(const Vec&)>(h), x);

    printMat("J_g(x)", Jg);
    printMat("J_f(g(x))", Jf);
    printMat("J_{f∘g}(x) via chain = J_f(g(x)) * J_g(x)", Jh_chain);
    printMat("J_{f∘g}(x) via direct finite diff", Jh_direct);

    // Show the difference between the two Jacobians
    Mat Diff = Jh_chain;
    for (size_t i = 0; i < Diff.size(); ++i)
        for (size_t j = 0; j < Diff[0].size(); ++j)
            Diff[i][j] -= Jh_direct[i][j];
    printMat("Difference (chain - direct)", Diff);

    return 0;
}

We define g: R^2→R^3 and f: R^3→R^2, then compose h=f∘g. Using central differences, we compute numerical Jacobians J_g(x) and J_f(g(x)). The chain rule predicts J_{f∘g}(x) = J_f(g(x)) J_g(x). We verify this by also computing the Jacobian of the composition directly and comparing the two. Up to finite-difference error, the matrices match, confirming the chain rule in Jacobian form.

Time: O((n + m) · C_eval) for forming J_g and J_f via central differences (2n evaluations of g and 2m of f), plus O(pmn) for the matrix product, where C_eval is the cost of one function evaluation.
Space: O(mn + pm + pn) to store the Jacobians and intermediates; dominated by O(pn) for the final Jacobian of the composition.
Forward-mode automatic differentiation with dual numbers (gradient via chain rule)
#include <bits/stdc++.h>
using namespace std;

// Dual number carrying value and gradient (size n)
struct Dual {
    double val;       // primal value
    vector<double> d; // derivatives wrt each input variable

    Dual(double v = 0.0, int n = 0) : val(v), d(n, 0.0) {}
    static Dual variable(double v, int n, int idx) {
        Dual x(v, n);
        x.d[idx] = 1.0; // seed derivative for this variable
        return x;
    }
};

// Elementwise operations (the chain rule happens in these overloads)
Dual operator+(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val + b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] + b.d[i];
    return r;
}
Dual operator-(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val - b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] - b.d[i];
    return r;
}
Dual operator*(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val * b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = a.d[i] * b.val + a.val * b.d[i]; // product rule
    return r;
}
Dual operator/(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val / b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = (a.d[i] * b.val - a.val * b.d[i]) / (b.val * b.val);
    return r;
}

// Mixed with doubles
Dual operator+(const Dual& a, double c) { Dual r = a; r.val += c; return r; }
Dual operator+(double c, const Dual& a) { return a + c; }
Dual operator-(const Dual& a, double c) { Dual r = a; r.val -= c; return r; }
Dual operator-(double c, const Dual& a) {
    Dual r; r.val = c - a.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = -a.d[i];
    return r;
}
Dual operator*(const Dual& a, double c) { Dual r = a; r.val *= c; for (double& di : r.d) di *= c; return r; }
Dual operator*(double c, const Dual& a) { return a * c; }
Dual operator/(const Dual& a, double c) { Dual r = a; r.val /= c; for (double& di : r.d) di /= c; return r; }

// Elementary functions
Dual sin(const Dual& a) {
    Dual r; r.val = std::sin(a.val); r.d.resize(a.d.size());
    double c = std::cos(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = c * a.d[i];
    return r;
}
Dual cos(const Dual& a) {
    Dual r; r.val = std::cos(a.val); r.d.resize(a.d.size());
    double s = -std::sin(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = s * a.d[i];
    return r;
}
Dual exp(const Dual& a) {
    Dual r; r.val = std::exp(a.val); r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = r.val * a.d[i];
    return r;
}
Dual pow(const Dual& a, double p) { // a^p, p constant
    Dual r; r.val = std::pow(a.val, p); r.d.resize(a.d.size());
    double coeff = (p == 0.0 ? 0.0 : p * std::pow(a.val, p - 1.0));
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = coeff * a.d[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We compute the gradient of F(x,y) = (x*y)^2 + exp(x^2 + sin(y)) at a point.
    // This composition uses inner functions g1 = x*y, g2 = x^2 + sin(y)
    // and outer phi(u, v) = u^2 + exp(v).
    const int n = 2; // number of inputs (x, y)
    double x0 = 1.2, y0 = -0.7;

    Dual X = Dual::variable(x0, n, 0); // dX/dx = 1, dX/dy = 0
    Dual Y = Dual::variable(y0, n, 1); // dY/dx = 0, dY/dy = 1

    Dual g1 = X * Y;                 // u = x*y
    Dual g2 = pow(X, 2.0) + sin(Y);  // v = x^2 + sin(y)
    Dual F = pow(g1, 2.0) + exp(g2); // F = u^2 + exp(v)

    cout.setf(std::ios::fixed); cout << setprecision(8);
    cout << "F(x,y) = (x*y)^2 + exp(x^2 + sin(y))\n";
    cout << "At (x,y) = (" << x0 << ", " << y0 << ")\n";
    cout << "Value: " << F.val << "\n";
    cout << "Gradient: [dF/dx, dF/dy] = [" << F.d[0] << ", " << F.d[1] << "]\n";

    // Optional: compare with numerical central differences
    auto f_scalar = [](double x, double y) {
        return (x * y) * (x * y) + exp(x * x + sin(y));
    };
    double h = 1e-6;
    double fxp = f_scalar(x0 + h, y0), fxm = f_scalar(x0 - h, y0);
    double fyp = f_scalar(x0, y0 + h), fym = f_scalar(x0, y0 - h);
    double dfdx_num = (fxp - fxm) / (2 * h);
    double dfdy_num = (fyp - fym) / (2 * h);
    cout << "Finite-difference gradient ≈ [" << dfdx_num << ", " << dfdy_num << "]\n";

    return 0;
}

This program implements forward-mode automatic differentiation with dual numbers that carry a vector of partial derivatives. Operator overloads apply the chain rule locally for +, −, ×, ÷, sin, exp, and pow. We define a composite scalar function F(x,y) = (xy)^2 + exp(x^2 + sin(y)), evaluate it at a point, and read off the exact gradient from the dual parts. A finite-difference check confirms the result numerically. The chain rule is executed automatically at each operation.

Time: O(n · N_ops), where n is the number of inputs and N_ops is the number of primitive operations in evaluating F.
Space: O(n) per active variable to store derivative vectors; overall O(n · V_active), where V_active is the number of simultaneously live dual variables.