How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Multivariable Chain Rule

Key Points

  • The multivariable chain rule explains how rates of change pass through a pipeline of functions by multiplying the right derivatives (Jacobians) in the right order.
  • For vector-valued functions, the derivative is a Jacobian matrix, and the chain rule says J_{f∘g}(x) = J_f(g(x)) J_g(x).
  • When the outer function is scalar, the gradient version is ∇(φ∘g)(x) = J_g(x)^T ∇φ(g(x)), which is the basis of backpropagation in machine learning.
  • You must always evaluate outer derivatives at the inner function’s current output; forgetting this is a very common mistake.
  • Dimensions must match: if g: R^n → R^m and f: R^m → R^p, then J_f(g(x)) is p×m and J_g(x) is m×n, so their product is p×n.
  • Numerically, you can verify the chain rule by computing Jacobians with finite differences and checking that the product matches the Jacobian of the composition.
  • Automatic differentiation (AD) implements the chain rule mechanically; forward mode propagates derivative columns, while reverse mode propagates gradients backward.
  • Computational cost depends on the method: central differences need 2n function calls, forward-mode AD scales with input dimension, and reverse-mode AD scales with output dimension.

Prerequisites

  • Single-variable chain rule — Provides the core idea that derivatives of compositions multiply; the multivariable version generalizes this.
  • Partial derivatives — Chain-rule entries are sums of products of partial derivatives; you must know how to compute them.
  • Gradient and Jacobian — The multivariable chain rule is stated succinctly in terms of gradients and Jacobians.
  • Matrix multiplication — Combining sensitivities uses matrix products of Jacobians; dimension checks rely on linear algebra.
  • Linear approximation and differentials — The chain rule composes local linear maps; understanding differentials clarifies why Jacobian products arise.
  • Vector calculus notation — Comfort with ∇, dot products, and indexing helps read and manipulate formulas correctly.
  • Numerical differentiation — Useful for verifying analytic derivatives and understanding practical errors and step-size tradeoffs.
  • Automatic differentiation basics — Shows how the chain rule is implemented in software for exact derivatives.

Detailed Explanation


01 Overview

The multivariable chain rule is the rule for differentiating compositions of functions when there are many inputs and outputs. In single-variable calculus, it says how the slope changes when one variable depends on another. In multiple variables, each function can map between spaces of different dimensions, so the derivative is best captured by a Jacobian matrix (a grid of partial derivatives). If you have g mapping from R^n to R^m and f mapping from R^m to R^p, then their composition h = f ∘ g maps R^n to R^p. The chain rule states that the Jacobian of h at x equals the matrix product of the Jacobians of f (evaluated at g(x)) and g (evaluated at x). This is a compact and powerful way to describe how small changes in inputs flow through multiple stages.

This idea powers many practical fields: in optimization, it explains how gradients change when you reparameterize variables; in physics, it connects rates like velocity and temperature that depend on moving positions; in computer graphics and robotics, it ties together coordinate transforms; and in machine learning, it is the mathematical heart of backpropagation. The chain rule also provides a way to organize calculations efficiently: for scalar outputs, reverse-mode automatic differentiation is preferred; for few inputs, forward-mode works well. Computationally, you can check or approximate the pieces using finite differences and then assemble them using the matrix product implied by the chain rule.

02 Intuition & Analogies

Imagine an assembly line: raw parts (inputs) go through station g first, getting reshaped into intermediate parts, and then through station f to become final products. If you nudge a raw part slightly, the final product changes for two reasons: (1) the intermediate part changes a bit (how sensitive g is), and (2) the final step responds to that change (how sensitive f is to its input). The total effect equals “sensitivity of f to its input” times “sensitivity of g to the original input.” That’s exactly the chain rule.

Another analogy is navigation with a GPS. Suppose your current location depends on joystick inputs (x and y), and your displayed map coordinates (U, V, W) depend on your location through a projection. A tiny joystick push alters location a little, and that shifted location alters the displayed coordinates. The total change in displayed coordinates is the product of two maps: the local linear effect of joystick on location, and the local linear effect of location on the projection. Multiplying these small linear effects (matrices) gives the overall small change.

For a scalar readout like temperature at a moving point, the chain rule is like following ripples: the temperature changes as you move in space (its gradient), and you’re moving with some velocity. The rate of temperature change you feel is the dot product of “which direction is hotter” (gradient) and “which way you’re moving” (velocity). When you stack many stages (as in neural networks), you just multiply more of these small linear effects in order from last to first, like passing a message backward through layers.

03 Formal Definition

Let g: R^n → R^m and f: R^m → R^p be differentiable at x ∈ R^n and at y = g(x) ∈ R^m, respectively. The derivative (Jacobian) of a vector-valued function h: R^n → R^p at x is the p × n matrix J_h(x) whose (i, k) entry is ∂h_i/∂x_k (x). The multivariable chain rule states that for the composition h = f ∘ g, the Jacobian at x is given by

J_h(x) = J_f(g(x)) J_g(x),

where J_f(g(x)) is p × m and J_g(x) is m × n. This is equivalent to the component-wise identity

∂(f_i ∘ g)/∂x_k (x) = Σ_{j=1}^{m} ∂f_i/∂u_j (g(x)) · ∂g_j/∂x_k (x).

If the outer function is scalar, φ: R^m → R, then its gradient at y can be written as a 1 × m row (or an m × 1 column, depending on convention), and the chain rule takes the gradient form

∇(φ ∘ g)(x) = J_g(x)^T ∇φ(g(x)).

For a scalar function f: R^n → R along a path r: R → R^n, the chain rule reduces to d/dt f(r(t)) = ∇f(r(t)) · r′(t).

04 When to Use

  • Reparameterizations: If you express variables through others (like polar to Cartesian coordinates), use the chain rule to get derivatives in the new variables.
  • Neural networks and backpropagation: Gradients of loss with respect to weights are computed by repeatedly applying the chain rule layer by layer.
  • Physics along trajectories: Quantities like temperature, density, or potential that depend on position and time via a moving particle require the path version of the chain rule.
  • Computer graphics and robotics: Stacking rotations, scalings, and translations amounts to composing maps; the chain rule tells you how sensitivities combine via Jacobian products.
  • Optimization with constraints or substitutions: When an objective depends on an intermediate function (e.g., z = g(x) and objective L(z)), the gradient with respect to x is J_g(x)^T ∇L.
  • Probabilistic modeling and change of variables: Jacobian determinants appear in density transformations; computing or estimating Jacobians relies on the chain rule.
  • Numerical verification and debugging: To check a hand-derived gradient, compute numerical Jacobians by finite differences and compare them with Jacobian products implied by the chain rule.

⚠️Common Mistakes

  • Dimension mismatch: Forgetting that J_f(g(x)) must be p×m and J_g(x) must be m×n. Always write down shapes to confirm the product is defined.
  • Evaluating at the wrong point: The outer Jacobian must be evaluated at g(x), not at x. Omitting this is a frequent source of incorrect results.
  • Gradient vs. Jacobian conventions: Mixing row and column gradient conventions can introduce missing transposes. Pick a convention and stick to it.
  • Ignoring nondifferentiability: If either function is not differentiable at the evaluation point, the chain rule may fail. Check continuity and differentiability.
  • Numerical pitfalls: Using too large a step size h in finite differences gives truncation error; too small gives round-off error. Central differences usually balance accuracy and stability.
  • Confusing elementwise and matrix products: The chain rule uses matrix multiplication of Jacobians, not elementwise multiplication.
  • Overlooking multiple inner functions: When the outer function depends on several inner outputs, the correct sum over those channels appears via matrix multiplication; do not drop terms.
  • AD mode choice: Using forward mode for very high-dimensional inputs can be slow; for scalar outputs with many inputs, reverse mode (backprop) is usually more efficient.

Key Formulas

Jacobian Chain Rule

J_{f∘g}(x) = J_f(g(x)) J_g(x)

Explanation: The Jacobian of a composition equals the product of the outer Jacobian (evaluated at the inner output) with the inner Jacobian. Use this to combine sensitivities through stages.

Component-wise Chain Rule

∂(f_i ∘ g)/∂x_k (x) = Σ_{j=1}^{m} ∂f_i/∂u_j (g(x)) · ∂g_j/∂x_k (x)

Explanation: Each output component’s derivative with respect to each input equals a sum over all intermediate channels. This mirrors matrix multiplication entries.

Gradient Chain Rule

∇(φ ∘ g)(x) = J_g(x)^T ∇φ(g(x))

Explanation: For a scalar outer function, the gradient with respect to inputs is the transpose of the inner Jacobian times the outer gradient. This is the basis of backpropagation.

Path (Single-Parameter) Chain Rule

d/dt f(r(t)) = ∇f(r(t)) · r′(t)

Explanation: When inputs move along a path r(t), the rate of change of f equals the dot product of the gradient of f with the velocity along the path.

Multi-layer Chain Rule

J_{f_L ∘ f_{L−1} ∘ ⋯ ∘ f_1}(x) = J_{f_L}(z_{L−1}) ⋯ J_{f_2}(z_1) J_{f_1}(x),  where z_k = f_k(z_{k−1}) and z_0 = x

Explanation: For many composed layers, the total Jacobian is the ordered product of all local Jacobians evaluated at the correct intermediate points.

Second-Order (Scalar Outer) Chain Rule

H_{φ∘g}(x) = J_g(x)^T H_φ(g(x)) J_g(x) + Σ_{j=1}^{m} ∂φ/∂u_j (g(x)) · H_{g_j}(x)

Explanation: The Hessian of a scalar composition has a quadratic term from curvature of the outer function and additional terms from curvature of the inner functions.

Central Difference Approximation

∂f_i/∂x_k (x) ≈ [f_i(x + h e_k) − f_i(x − h e_k)] / (2h)

Explanation: This estimates Jacobian entries numerically using symmetric perturbations. It has error on the order of h² for smooth functions.

Operator Norm Bound

‖J_{f∘g}(x)‖ ≤ ‖J_f(g(x))‖ ‖J_g(x)‖

Explanation: Local stretching of a composition does not exceed the product of the stretchings of the parts. Useful for stability and Lipschitz estimates.

Complexity Analysis

Computing Jacobians and applying the chain rule has distinct computational profiles depending on the method.

If you use central finite differences to estimate the Jacobian J_g of g: R^n → R^m, you need 2 evaluations per input dimension, so roughly 2n vector-valued function evaluations for the full matrix: each pair of evaluations fills one column of m entries, for O(mn) derivative entries in total. For a composition h = f ∘ g with f: R^m → R^p, you additionally compute J_f at y = g(x), which costs roughly 2m further evaluations. Forming the product J_f(g(x)) J_g(x) then costs O(pmn) arithmetic operations for dense matrices, which is usually small compared to the cost of the function evaluations themselves when those are expensive (e.g., simulations).

Forward-mode automatic differentiation propagates one “seed” direction through the computation and effectively computes a Jacobian-vector product (JVP) in time proportional to a small constant multiple of one function evaluation. To obtain the full Jacobian, you need n passes (one per input basis vector), so the total cost is about n times the primal evaluation cost; memory overhead is modest, O(1) per active variable for a single seed, or O(n) when all n partials are carried together as in the dual-number example below.

Reverse-mode AD (backpropagation) computes vector-Jacobian products (VJPs) and is most efficient for scalar outputs (p = 1). It yields the full gradient in time on the order of a small constant multiple of a single function evaluation, with memory overhead for storing the computation graph (often the dominant cost).

For dense problems, matrix multiplications among explicit Jacobians cost O(pmn) time and O(pm + mn) space, but in AD you avoid explicit Jacobians by working with JVPs or VJPs, improving both time and space usage for large-scale applications. Numerical differentiation introduces truncation and round-off errors; choosing a step size h ≈ √ε · max(1, ‖x‖) / ‖direction‖ balances accuracy and stability.

Code Examples

Numerical verification of the multivariable chain rule with Jacobians
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Pretty-print helpers
void printVec(const string& name, const Vec& v) {
    cout << name << " = [";
    for (size_t i = 0; i < v.size(); ++i) {
        cout << fixed << setprecision(6) << v[i] << (i + 1 == v.size() ? "" : ", ");
    }
    cout << "]\n";
}

void printMat(const string& name, const Mat& A) {
    cout << name << " (" << A.size() << "x" << (A.empty() ? 0 : A[0].size()) << "):\n";
    for (const auto& row : A) {
        cout << "  ";
        for (size_t j = 0; j < row.size(); ++j) {
            cout << setw(12) << fixed << setprecision(6) << row[j] << (j + 1 == row.size() ? "" : " ");
        }
        cout << "\n";
    }
}

// Define g: R^2 -> R^3
auto g = [](const Vec& x) -> Vec {
    // x = [x0, x1]
    double x0 = x[0], x1 = x[1];
    Vec y(3);
    y[0] = x0 * x1;       // u1 = x*y
    y[1] = x0 * x0 + x1;  // u2 = x^2 + y
    y[2] = exp(x0 - x1);  // u3 = e^{x - y}
    return y;
};

// Define f: R^3 -> R^2
auto f = [](const Vec& u) -> Vec {
    // u = [U, V, W]
    double U = u[0], V = u[1], W = u[2];
    Vec z(2);
    z[0] = U + sin(V); // z1 = U + sin(V)
    z[1] = V * W;      // z2 = V * W
    return z;
};

// Numerical Jacobian via central differences (2n vector function evaluations)
Mat jacobian(function<Vec(const Vec&)> func, const Vec& x, double h = 1e-6) {
    Vec f0 = func(x);
    size_t m = f0.size();
    size_t n = x.size();
    Mat J(m, Vec(n, 0.0));
    for (size_t k = 0; k < n; ++k) {
        Vec xp = x, xm = x;
        xp[k] += h;
        xm[k] -= h;
        Vec fp = func(xp);
        Vec fm = func(xm);
        for (size_t i = 0; i < m; ++i) {
            J[i][k] = (fp[i] - fm[i]) / (2.0 * h);
        }
    }
    return J;
}

// Matrix multiplication C = A * B
Mat matmul(const Mat& A, const Mat& B) {
    size_t p = A.size();
    size_t m = A.empty() ? 0 : A[0].size();
    size_t n = B.empty() ? 0 : B[0].size();
    if (B.size() != m) throw runtime_error("Dimension mismatch in matmul");
    Mat C(p, Vec(n, 0.0));
    for (size_t i = 0; i < p; ++i) {
        for (size_t k = 0; k < m; ++k) {
            double aik = A[i][k];
            for (size_t j = 0; j < n; ++j) {
                C[i][j] += aik * B[k][j];
            }
        }
    }
    return C;
}

int main() {
    cout << setprecision(6) << fixed;
    // Point of evaluation in R^2
    Vec x = {1.2, -0.7};

    // Compose h = f(g(x))
    auto h = [](const Vec& x) -> Vec {
        return f(g(x));
    };

    // Evaluate y = g(x)
    Vec y = g(x);
    printVec("x", x);
    printVec("y = g(x)", y);

    // Compute Jacobians: J_g(x), J_f(y), and their product
    Mat Jg = jacobian(function<Vec(const Vec&)>(g), x);
    Mat Jf = jacobian(function<Vec(const Vec&)>(f), y);
    Mat Jh_chain = matmul(Jf, Jg);

    // Direct numerical Jacobian of the composition h at x
    Mat Jh_direct = jacobian(function<Vec(const Vec&)>(h), x);

    printMat("J_g(x)", Jg);
    printMat("J_f(g(x))", Jf);
    printMat("J_{f∘g}(x) via chain = J_f(g(x)) * J_g(x)", Jh_chain);
    printMat("J_{f∘g}(x) via direct finite diff", Jh_direct);

    // Show the difference between the two Jacobians
    Mat Diff = Jh_chain;
    for (size_t i = 0; i < Diff.size(); ++i)
        for (size_t j = 0; j < Diff[0].size(); ++j)
            Diff[i][j] -= Jh_direct[i][j];
    printMat("Difference (chain - direct)", Diff);

    return 0;
}

We define g: R^2→R^3 and f: R^3→R^2, then compose h=f∘g. Using central differences, we compute numerical Jacobians J_g(x) and J_f(g(x)). The chain rule predicts J_{f∘g}(x) = J_f(g(x)) J_g(x). We verify this by also computing the Jacobian of the composition directly and comparing the two. Up to finite-difference error, the matrices match, confirming the chain rule in Jacobian form.

Time: O((n + m) · C_eval) for forming J_g and J_f via central differences (2n evaluations of g and 2m of f), plus O(pmn) for the matrix product, where C_eval is the cost of one function evaluation.
Space: O(mn + pm + pn) to store the Jacobians and intermediates; dominated by O(pn) for the final Jacobian of the composition.
Forward-mode automatic differentiation with dual numbers (gradient via chain rule)
#include <bits/stdc++.h>
using namespace std;

// Dual number carrying value and gradient (size n)
struct Dual {
    double val;       // primal value
    vector<double> d; // derivatives wrt each input variable

    Dual(double v = 0.0, int n = 0) : val(v), d(n, 0.0) {}
    static Dual variable(double v, int n, int idx) {
        Dual x(v, n);
        x.d[idx] = 1.0; // seed derivative for this variable
        return x;
    }
};

// Elementwise operations (the chain rule happens in these overloads)
Dual operator+(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val + b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] + b.d[i];
    return r;
}
Dual operator-(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val - b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = a.d[i] - b.d[i];
    return r;
}
Dual operator*(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val * b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = a.d[i] * b.val + a.val * b.d[i]; // product rule
    return r;
}
Dual operator/(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val / b.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i)
        r.d[i] = (a.d[i] * b.val - a.val * b.d[i]) / (b.val * b.val);
    return r;
}

// Mixed with doubles
Dual operator+(const Dual& a, double c) { Dual r = a; r.val += c; return r; }
Dual operator+(double c, const Dual& a) { return a + c; }
Dual operator-(const Dual& a, double c) { Dual r = a; r.val -= c; return r; }
Dual operator-(double c, const Dual& a) {
    Dual r; r.val = c - a.val; r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = -a.d[i];
    return r;
}
Dual operator*(const Dual& a, double c) { Dual r = a; r.val *= c; for (double& di : r.d) di *= c; return r; }
Dual operator*(double c, const Dual& a) { return a * c; }
Dual operator/(const Dual& a, double c) { Dual r = a; r.val /= c; for (double& di : r.d) di /= c; return r; }

// Elementary functions
Dual sin(const Dual& a) {
    Dual r; r.val = std::sin(a.val); r.d.resize(a.d.size());
    double c = std::cos(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = c * a.d[i];
    return r;
}
Dual cos(const Dual& a) {
    Dual r; r.val = std::cos(a.val); r.d.resize(a.d.size());
    double s = -std::sin(a.val);
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = s * a.d[i];
    return r;
}
Dual exp(const Dual& a) {
    Dual r; r.val = std::exp(a.val); r.d.resize(a.d.size());
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = r.val * a.d[i];
    return r;
}
Dual pow(const Dual& a, double p) { // a^p, p constant
    Dual r; r.val = std::pow(a.val, p); r.d.resize(a.d.size());
    double coeff = (p == 0.0 ? 0.0 : p * std::pow(a.val, p - 1.0));
    for (size_t i = 0; i < r.d.size(); ++i) r.d[i] = coeff * a.d[i];
    return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // We compute the gradient of F(x,y) = (x*y)^2 + exp(x^2 + sin(y)) at a point.
    // This composition uses inner functions g1 = x*y, g2 = x^2 + sin(y)
    // and outer phi(u, v) = u^2 + exp(v).
    const int n = 2; // number of inputs (x, y)
    double x0 = 1.2, y0 = -0.7;

    Dual X = Dual::variable(x0, n, 0); // dX/dx = 1, dX/dy = 0
    Dual Y = Dual::variable(y0, n, 1); // dY/dx = 0, dY/dy = 1

    Dual g1 = X * Y;                 // u = x*y
    Dual g2 = pow(X, 2.0) + sin(Y);  // v = x^2 + sin(y)
    Dual F = pow(g1, 2.0) + exp(g2); // F = u^2 + exp(v)

    cout.setf(std::ios::fixed); cout << setprecision(8);
    cout << "F(x,y) = (x*y)^2 + exp(x^2 + sin(y))\n";
    cout << "At (x,y) = (" << x0 << ", " << y0 << ")\n";
    cout << "Value: " << F.val << "\n";
    cout << "Gradient: [dF/dx, dF/dy] = [" << F.d[0] << ", " << F.d[1] << "]\n";

    // Optional: compare with numerical central differences
    auto f_scalar = [](double x, double y) {
        return (x * y) * (x * y) + exp(x * x + sin(y));
    };
    double h = 1e-6;
    double fxp = f_scalar(x0 + h, y0), fxm = f_scalar(x0 - h, y0);
    double fyp = f_scalar(x0, y0 + h), fym = f_scalar(x0, y0 - h);
    double dfdx_num = (fxp - fxm) / (2 * h);
    double dfdy_num = (fyp - fym) / (2 * h);
    cout << "Finite-difference gradient ≈ [" << dfdx_num << ", " << dfdy_num << "]\n";

    return 0;
}

This program implements forward-mode automatic differentiation with dual numbers that carry a vector of partial derivatives. Operator overloads apply the chain rule locally for +, −, ×, ÷, sin, exp, and pow. We define a composite scalar function F(x,y) = (xy)^2 + exp(x^2 + sin(y)), evaluate it at a point, and read off the exact gradient from the dual parts. A finite-difference check confirms the result numerically. The chain rule is executed automatically at each operation.

Time: O(n · N_ops), where n is the number of inputs and N_ops is the number of primitive operations in evaluating F.
Space: O(n) per active variable to store derivative vectors; overall O(n · V_active), where V_active is the number of simultaneously live dual variables.