Partial Derivatives
Key Points
- Partial derivatives measure how a multivariable function changes when you wiggle just one input while keeping the others fixed.
- They are the building blocks for gradients, Jacobians, and Hessians, which power optimization, machine learning, and physics.
- Formally, a partial derivative is a single-variable derivative taken along one coordinate axis, defined by a limit.
- Central-difference finite differences estimate partial derivatives accurately using two function evaluations per coordinate.
- Automatic differentiation can compute exact derivatives up to floating-point error with modest overhead compared to finite differences.
- Choosing a good step size h is crucial in numerical differentiation to balance truncation and rounding errors.
- Mixed partials are usually interchangeable under smoothness conditions (Clairaut's theorem).
- Use gradients for steepest-descent directions and linear approximations via the total differential.
Prerequisites
- Single-variable differentiation — Partial derivatives generalize the limit-based definition and rules of derivatives from one variable to many.
- Limits and continuity — The definition of partial derivatives relies on limits and continuity to ensure existence and properties.
- Vectors and matrices — Gradients, Jacobians, and Hessians are vector and matrix objects; understanding linear algebra notation is essential.
- Taylor series (basic) — Linearization and error analysis of finite differences use Taylor expansions.
- Floating-point arithmetic — Choosing stable step sizes and interpreting numerical errors requires knowledge of rounding and machine epsilon.
- Programming with functions — Implementing numerical differentiation treats the function as a black box requiring careful evaluation handling.
- Optimization basics — Gradients are used to find minima and maxima; understanding their role clarifies why partials matter.
Detailed Explanation
01 Overview
A partial derivative describes how a function of several variables changes when only one of those variables changes, and the rest are held constant. If you think of a function f(x, y, z) as a landscape in 3D space, then the partial derivative with respect to x at a point tells you the slope of the landscape when you walk only along the x-axis direction, freezing y and z. Collecting all first-order partial derivatives forms the gradient, a vector pointing in the direction of steepest increase. For vector-valued functions, arranging first-order partial derivatives yields the Jacobian matrix, and second-order partials arranged in a square matrix give the Hessian. These objects are central in optimization (to find minima or maxima), machine learning (to update weights), physics and engineering (to describe spatial changes of fields), and numerical analysis (to linearize and approximate complex models).
02 Intuition & Analogies
Imagine a hilly terrain described by altitude f(x, y) at horizontal coordinates (x, y). If you face east-west (x-direction) and take a tiny step east while keeping your north-south position fixed, the rate your altitude changes per unit step is the partial derivative with respect to x, written ∂f/∂x. If instead you only move north-south (y-direction), the analogous rate is ∂f/∂y. Slicing the terrain with vertical planes aligned with the x- or y-axis produces curves; partial derivatives are just the slopes of these 1D curves at the point where the slice cuts through the surface. Extend this idea to any function f(x1, x2, ..., xn): each partial derivative asks, “What is the instantaneous slope if I nudge xi and freeze everything else?” The gradient then bundles these slopes into a single vector arrow that points toward the direction of steepest ascent; its length reflects how steep the hill is overall. In real applications, such as temperature in a room T(x, y, z) or profit depending on price and advertising P(p, a), partial derivatives quantify sensitivities: how much temperature changes per centimeter in x, or how profit changes per unit of advertising spend when price stays fixed. Numerically, we can approximate these sensitivities by sampling the function at nearby points (finite differences) or compute them robustly using automatic differentiation, which applies the chain rule mechanically through the computation steps.
03 Formal Definition
Let f: \mathbb{R}^n \to \mathbb{R} and let e_i denote the i-th standard basis vector. The partial derivative of f with respect to x_i at a point a is defined by the limit
\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a + h e_i) - f(a)}{h},
provided the limit exists. Equivalently, it is the ordinary derivative at t = a_i of the single-variable function obtained by freezing every coordinate of a except the i-th.
04When to Use
Use partial derivatives whenever you need to understand or control how a multivariable output responds to changes in individual inputs. In optimization, they drive gradient-based methods (gradient descent, Newton's method) to locate minima or maxima. In machine learning, gradients of loss functions with respect to parameters are used to update weights. In physics and engineering, spatial derivatives describe diffusion, wave propagation, and flux; partial differential equations (PDEs) are built from them. In sensitivity analysis and uncertainty quantification, partial derivatives indicate which inputs most influence outputs near an operating point.
Computationally:
- Use finite differences for quick, black-box gradient checks or when source code is unavailable.
- Use automatic differentiation for accurate, efficient derivatives when you can run code with AD types.
- Use symbolic differentiation for exact expressions in small, algebraic problems.
- Choose central differences for better accuracy (error O(h^2)) if you can afford extra function calls, and forward differences for speed when accuracy demands are modest.
- For ill-scaled variables, rescale or choose step sizes relative to variable magnitudes.
⚠️Common Mistakes
Common pitfalls include:
1. Forgetting to hold other variables constant when taking a partial derivative; always treat non-target variables as constants.
2. Confusing partial derivatives with total derivatives; the total derivative accounts for how all inputs change together, sometimes through dependencies x_i(t).
3. Using a finite-difference step size h that is too large (high truncation error) or too small (catastrophic cancellation and floating-point noise). A practical rule is h \approx \sqrt{\epsilon} \, \max(1, |x_i|), where \epsilon is machine epsilon.
4. Assuming differentiability at points with kinks or discontinuities; finite differences can be misleading near non-smooth regions.
5. Mixing units or angles (degrees vs. radians) inside trigonometric functions, which invalidates derivative values.
6. Misinterpreting the gradient's direction; it points toward steepest ascent, so for minimization you move in the negative gradient direction.
7. Expecting mixed partials to commute without smoothness; Clairaut's theorem requires continuity of the second partials.
8. Ignoring variable scaling; poorly scaled inputs make gradients tiny or huge, complicating numerical estimation and optimization.
Key Formulas
Partial derivative (limit definition)
\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a + h e_i) - f(a)}{h}
Explanation: The partial derivative is the slope of f in the i-th coordinate direction, computed as a limit of difference quotients. It mirrors the single-variable derivative but keeps the other variables fixed.
Gradient
\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)
Explanation: The gradient collects all first-order partial derivatives into a vector. It points in the direction of steepest ascent of the scalar function f.
Directional derivative
D_u f(x) = \nabla f(x) \cdot u, \quad \|u\| = 1
Explanation: The rate of change in direction u equals the dot product of the gradient with a unit vector u. This connects geometry (direction) with calculus (rate of change).
Jacobian matrix
J_F(a)_{ij} = \frac{\partial F_i}{\partial x_j}(a)
Explanation: For vector-valued F, the Jacobian arranges partial derivatives into an m-by-n matrix. It linearly approximates F near a through F(a + h) \approx F(a) + J_F(a) h.
Hessian matrix
H_f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)
Explanation: The Hessian stacks second-order partials and encodes curvature. It is symmetric when f has continuous second derivatives.
Multivariable Taylor expansion (to second order)
f(a + h) \approx f(a) + \nabla f(a)^{\top} h + \frac{1}{2} h^{\top} H_f(a) h
Explanation: The Taylor expansion approximates f near a using the gradient (slope) and Hessian (curvature). Higher-order terms vanish faster for small h.
Clairaut's (Schwarz) theorem
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}
Explanation: When second partial derivatives are continuous, mixed partials commute. This ensures the Hessian is symmetric.
Central difference approximation
\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}
Explanation: Approximates the i-th partial derivative using symmetric sample points. It has error proportional to h^2, often more accurate than forward differences.
Forward difference approximation
\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x)}{h}
Explanation: Uses a one-sided step to estimate the derivative. It is simpler but has error proportional to h, typically less accurate than central differences.
Heuristic optimal step size
h \approx \sqrt{\epsilon} \, \max(1, |x_i|)
Explanation: Balances truncation and rounding errors in finite differences. Here \epsilon is machine epsilon for the floating-point type.
Total differential
df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \, dx_i
Explanation: Expresses the first-order change in f as a weighted sum of input changes. The weights are the partial derivatives.
Chain rule (scalar composition)
\frac{\partial}{\partial x_i} g(f(x)) = g'(f(x)) \, \frac{\partial f}{\partial x_i}(x)
Explanation: Differentiating a composition multiplies the outer derivative evaluated at the inner function by the partial of the inner. This is the basis of automatic differentiation.
Complexity Analysis
Estimating a full gradient with central differences costs 2n function evaluations for n variables (forward differences need n + 1, since f(x) can be reused). Forward-mode automatic differentiation carrying an n-component derivative vector performs O(n) extra arithmetic per elementary operation but delivers the exact gradient (up to floating-point error) in a single evaluation pass, with no step-size tuning.
Code Examples
```cpp
#include <cmath>
#include <functional>
#include <iostream>
#include <limits>
#include <vector>

// Compute the i-th partial derivative of f at point x using central differences.
double partial_derivative(const std::function<double(const std::vector<double>&)>& f,
                          std::vector<double> x,
                          std::size_t i) {
    const double xi = x[i];
    const double eps = std::numeric_limits<double>::epsilon();
    // Heuristic step size: h ~ sqrt(eps)*max(1, |xi|)
    const double h = std::sqrt(eps) * std::max(1.0, std::abs(xi));

    x[i] = xi + h;
    const double fp = f(x);
    x[i] = xi - h;
    const double fm = f(x);

    // Restore (optional)
    x[i] = xi;

    return (fp - fm) / (2.0 * h);
}

int main() {
    // Example scalar function: f(x, y, z) = x^2*y + sin(y) + x*exp(z)
    auto f = [](const std::vector<double>& v) -> double {
        double x = v[0], y = v[1], z = v[2];
        return x*x*y + std::sin(y) + x*std::exp(z);
    };

    std::vector<double> x = {1.2, -0.7, 0.3};

    for (std::size_t i = 0; i < x.size(); ++i) {
        double dfi = partial_derivative(f, x, i);
        std::cout << "Partial derivative wrt x[" << i << "]: " << dfi << "\n";
    }

    return 0;
}
```
This program approximates the i-th partial derivative of a scalar function with respect to the i-th variable using a central finite difference. The step size is chosen relative to the variable magnitude to balance truncation and rounding errors. It demonstrates a black-box approach: only function evaluations are needed.
```cpp
#include <cmath>
#include <functional>
#include <iostream>
#include <limits>
#include <vector>

std::vector<double> gradient(const std::function<double(const std::vector<double>&)>& f,
                             const std::vector<double>& x) {
    std::size_t n = x.size();
    std::vector<double> g(n);
    std::vector<double> xt = x;
    const double eps = std::numeric_limits<double>::epsilon();

    for (std::size_t i = 0; i < n; ++i) {
        double xi = x[i];
        double h = std::sqrt(eps) * std::max(1.0, std::abs(xi));
        xt[i] = xi + h; double fp = f(xt);
        xt[i] = xi - h; double fm = f(xt);
        g[i] = (fp - fm) / (2.0 * h);
        xt[i] = xi;  // restore
    }
    return g;
}

int main() {
    // Same test function as before for consistency.
    auto f = [](const std::vector<double>& v) -> double {
        double x = v[0], y = v[1], z = v[2];
        return x*x*y + std::sin(y) + x*std::exp(z);
    };

    std::vector<double> x = {1.2, -0.7, 0.3};

    std::vector<double> g = gradient(f, x);

    std::cout << "f(x) = " << f(x) << "\n";
    std::cout << "Gradient: [" << g[0] << ", " << g[1] << ", " << g[2] << "]\n";

    return 0;
}
```
This code computes the gradient by looping over coordinates and applying the central-difference formula for each partial derivative. It reuses a working copy of x to avoid reallocations. Central differences give O(h^2) truncation error and require two function evaluations per component.
```cpp
#include <cmath>
#include <iostream>
#include <vector>

struct Dual {
    double val;               // primal value
    std::vector<double> der;  // derivative vector (size n)

    Dual() : val(0.0) {}
    Dual(double v, std::size_t n) : val(v), der(n, 0.0) {}

    static Dual variable(double v, std::size_t n, std::size_t idx) {
        Dual d(v, n);
        d.der[idx] = 1.0;  // seed direction e_idx
        return d;
    }
};

// Helper: elementwise operations on derivative vectors
static inline std::vector<double> add(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}
static inline std::vector<double> sub(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}
static inline std::vector<double> scal(const std::vector<double>& a, double s) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] * s;
    return r;
}

// Operators for Dual
Dual operator+(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val + b.val; r.der = add(a.der, b.der); return r;
}
Dual operator-(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val - b.val; r.der = sub(a.der, b.der); return r;
}
Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: d(a*b) = a'*b + a*b'
    Dual r; r.val = a.val * b.val;
    std::size_t n = a.der.size();
    r.der.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) r.der[i] = a.der[i] * b.val + b.der[i] * a.val;
    return r;
}

// Dual with double interactions (minimal set used below)
Dual operator+(const Dual& a, double b) { Dual r; r.val = a.val + b; r.der = a.der; return r; }
Dual operator+(double a, const Dual& b) { return b + a; }
Dual operator*(const Dual& a, double b) { Dual r; r.val = a.val * b; r.der = scal(a.der, b); return r; }
Dual operator*(double a, const Dual& b) { return b * a; }

// Elementary functions (extend as needed)
Dual sin(const Dual& a) {
    Dual r; r.val = std::sin(a.val); r.der = scal(a.der, std::cos(a.val)); return r;
}
Dual exp(const Dual& a) {
    double ev = std::exp(a.val);
    Dual r; r.val = ev; r.der = scal(a.der, ev); return r;
}

// Example function f(x,y,z) = x^2*y + sin(y) + x*exp(z)
Dual f(const std::vector<Dual>& X) {
    const Dual& x = X[0];
    const Dual& y = X[1];
    const Dual& z = X[2];
    return (x * x) * y + sin(y) + x * exp(z);
}

int main() {
    std::size_t n = 3;  // number of variables
    std::vector<double> point = {1.2, -0.7, 0.3};

    // Lift to Dual variables with appropriate seeds
    std::vector<Dual> X(n);
    for (std::size_t i = 0; i < n; ++i) X[i] = Dual::variable(point[i], n, i);

    Dual y = f(X);  // one pass computes value and gradient

    std::cout << "f(x) = " << y.val << "\n";
    std::cout << "grad f = [";
    for (std::size_t i = 0; i < n; ++i) {
        std::cout << y.der[i] << (i + 1 < n ? ", " : "]\n");
    }
    return 0;
}
```
This forward-mode automatic differentiation (AD) example implements a minimal dual-number type that carries a derivative vector. Overloaded operators apply the chain rule automatically, producing the full gradient in one evaluation of f. It is accurate up to floating-point error and avoids step-size tuning.