How I Study AI — Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Partial Derivatives

Key Points

  • Partial derivatives measure how a multivariable function changes when you wiggle just one input while keeping the others fixed.
  • They are the building blocks for gradients, Jacobians, and Hessians, which power optimization, machine learning, and physics.
  • Formally, a partial derivative is a single-variable derivative taken along one coordinate axis, defined by a limit.
  • Central-difference finite differences estimate partial derivatives accurately using two function evaluations per coordinate.
  • Automatic differentiation can compute exact derivatives up to floating-point error with modest overhead compared to finite differences.
  • Choosing a good step size h is crucial in numerical differentiation to balance truncation and rounding errors.
  • Mixed partials are usually interchangeable under smoothness conditions (Clairaut’s theorem).
  • Use gradients for steepest-descent directions and linear approximations via the total differential.

Prerequisites

  • Single-variable differentiation — Partial derivatives generalize the limit-based definition and rules of derivatives from one variable to many.
  • Limits and continuity — The definition of partial derivatives relies on limits and continuity to ensure existence and properties.
  • Vectors and matrices — Gradients, Jacobians, and Hessians are vector and matrix objects; understanding linear algebra notation is essential.
  • Taylor series (basic) — Linearization and error analysis of finite differences use Taylor expansions.
  • Floating-point arithmetic — Choosing stable step sizes and interpreting numerical errors requires knowledge of rounding and machine epsilon.
  • Programming with functions — Implementing numerical differentiation treats the function as a black box requiring careful evaluation handling.
  • Optimization basics — Gradients are used to find minima and maxima; understanding their role clarifies why partials matter.

Detailed Explanation


01 Overview

A partial derivative describes how a function of several variables changes when only one of those variables changes, and the rest are held constant. If you think of a function f(x, y, z) as a landscape in 3D space, then the partial derivative with respect to x at a point tells you the slope of the landscape when you walk only along the x-axis direction, freezing y and z. Collecting all first-order partial derivatives forms the gradient, a vector pointing in the direction of steepest increase. For vector-valued functions, arranging first-order partial derivatives yields the Jacobian matrix, and second-order partials arranged in a square matrix give the Hessian. These objects are central in optimization (to find minima or maxima), machine learning (to update weights), physics and engineering (to describe spatial changes of fields), and numerical analysis (to linearize and approximate complex models).

02 Intuition & Analogies

Imagine a hilly terrain described by altitude f(x, y) at horizontal coordinates (x, y). If you face east-west (x-direction) and take a tiny step east while keeping your north-south position fixed, the rate your altitude changes per unit step is the partial derivative with respect to x, written ∂f/∂x. If instead you only move north-south (y-direction), the analogous rate is ∂f/∂y. Slicing the terrain with vertical planes aligned with the x- or y-axis produces curves; partial derivatives are just the slopes of these 1D curves at the point where the slice cuts through the surface. Extend this idea to any function f(x1, x2, ..., xn): each partial derivative asks, “What is the instantaneous slope if I nudge xi and freeze everything else?” The gradient then bundles these slopes into a single vector arrow that points toward the direction of steepest ascent; its length reflects how steep the hill is overall. In real applications, such as temperature in a room T(x, y, z) or profit depending on price and advertising P(p, a), partial derivatives quantify sensitivities: how much temperature changes per centimeter in x, or how profit changes per unit of advertising spend when price stays fixed. Numerically, we can approximate these sensitivities by sampling the function at nearby points (finite differences) or compute them robustly using automatic differentiation, which applies the chain rule mechanically through the computation steps.

03 Formal Definition

Let f: ℝⁿ → ℝ and a = (a₁, …, aₙ) ∈ ℝⁿ. The partial derivative of f with respect to the i-th coordinate at a is defined by the limit \[ \frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{f(a_1, \dots, a_{i-1}, a_i + h, a_{i+1}, \dots, a_n) - f(a)}{h}, \] provided the limit exists. The gradient is the vector of first-order partials, ∇f(a) = (∂f/∂x₁(a), …, ∂f/∂xₙ(a)). For a function F: ℝⁿ → ℝᵐ, the Jacobian J_F(a) is the m × n matrix with entries (J_F)_{j,i} = ∂F_j/∂x_i(a). Second-order partial derivatives produce the Hessian H_f(a), an n × n matrix with entries (H_f)_{i,j} = ∂²f/(∂x_i ∂x_j)(a). Under mild smoothness (e.g., f ∈ C²), mixed partials commute: ∂²f/(∂x_i ∂x_j) = ∂²f/(∂x_j ∂x_i) (Clairaut's theorem). Differentiability of f at a implies a first-order linear approximation: f(a + h) ≈ f(a) + ∇f(a)^⊤ h for small h, with an error that vanishes faster than ∥h∥.

04 When to Use

Use partial derivatives whenever you need to understand or control how a multivariable output responds to changes in individual inputs. In optimization, they drive gradient-based methods (gradient descent, Newton's method) to locate minima or maxima. In machine learning, gradients of loss functions with respect to parameters are used to update weights. In physics and engineering, spatial derivatives describe diffusion, wave propagation, and flux; partial differential equations (PDEs) are built from them. In sensitivity analysis and uncertainty quantification, partial derivatives indicate which inputs most influence outputs near an operating point. Computationally: use finite differences for quick, black-box gradient checks or when source code is unavailable; use automatic differentiation for accurate, efficient derivatives when you can run code with AD types; use symbolic differentiation for exact expressions in small, algebraic problems. Choose central differences for better accuracy (O(h^{2})) if you can afford extra function calls, and forward differences for speed when accuracy demands are modest. For ill-scaled variables, rescale or choose step sizes relative to variable magnitudes.

⚠️Common Mistakes

Common pitfalls include: (1) Forgetting to hold other variables constant when taking a partial derivative; always treat non-target variables as constants. (2) Confusing partial derivatives with total derivatives; the total derivative accounts for how all inputs change together, sometimes through dependencies x_i(t). (3) Using a finite-difference step size h that is too large (high truncation error) or too small (catastrophic cancellation and floating-point noise). A practical rule is h ≈ √ε · max(1, |x_i|), where ε is machine epsilon. (4) Assuming differentiability at points with kinks or discontinuities; finite differences can be misleading near non-smooth regions. (5) Mixing units or angles (degrees vs. radians) inside trigonometric functions, which invalidates derivative values. (6) Interpreting the gradient’s direction incorrectly; it points to steepest ascent, so for minimization you move in the negative gradient direction. (7) Expecting mixed partials to commute without smoothness; Clairaut’s theorem requires continuity of second partials. (8) Ignoring variable scaling; poorly scaled inputs make gradients tiny or huge, complicating numerical estimation and optimization.

Key Formulas

Partial derivative (limit definition)

∂f/∂x_i (a) = lim_{h→0} [f(a_1, …, a_i + h, …, a_n) − f(a)] / h

Explanation: The partial derivative is the slope of f in the i-th coordinate direction, computed as a limit of difference quotients. It mirrors the single-variable derivative but keeps other variables fixed.

Gradient

∇f(a) = [ ∂f/∂x_1 (a) ⋯ ∂f/∂x_n (a) ]^⊤

Explanation: The gradient collects all first-order partial derivatives into a vector. It points in the direction of steepest ascent of the scalar function f.

Directional derivative

D_u f(a) = ∇f(a)^⊤ u

Explanation: The rate of change in direction u equals the dot product of the gradient with a unit vector u. This connects geometry (direction) with calculus (rate of change).

Jacobian matrix

J_F(a) = [ ∂F_j/∂x_i (a) ]_{j=1..m, i=1..n}

Explanation: For vector-valued F, the Jacobian arranges partial derivatives into an m-by-n matrix. It linearly approximates F near a through J_F(a) h.

Hessian matrix

H_f(a) = [ ∂²f/(∂x_i ∂x_j) (a) ]_{i,j=1..n}

Explanation: The Hessian stacks second-order partials and encodes curvature. It is symmetric when f has continuous second derivatives.

Multivariable Taylor expansion (to second order)

f(a + h) = f(a) + ∇f(a)^⊤ h + ½ h^⊤ H_f(a) h + O(∥h∥³)

Explanation: The Taylor expansion approximates f near a using gradient (slope) and Hessian (curvature). Higher-order terms vanish faster for small h.

Clairaut's (Schwarz) theorem

∂²f/(∂x_i ∂x_j) = ∂²f/(∂x_j ∂x_i)

Explanation: When second partial derivatives are continuous, mixed partials commute. This ensures the Hessian is symmetric.

Central difference approximation

∂f/∂x_i (a) ≈ [f(a + h e_i) − f(a − h e_i)] / (2h)

Explanation: Approximates the i-th partial derivative using symmetric sample points. It has error proportional to h², often more accurate than forward differences.

Forward difference approximation

∂f/∂x_i (a) ≈ [f(a + h e_i) − f(a)] / h

Explanation: Uses a one-sided step to estimate the derivative. It is simpler but has error proportional to h, typically less accurate than central differences.

Heuristic optimal step size

h* ≈ √ε · max(1, |a_i|)

Explanation: Balances truncation and rounding errors in finite differences. Here ϵ is machine epsilon for the floating-point type.

Total differential

df = Σ_{i=1}^{n} (∂f/∂x_i) dx_i

Explanation: Expresses the first-order change in f as a weighted sum of input changes. The weights are the partial derivatives.

Chain rule (scalar composition)

∂/∂x_i g(f(x)) = g′(f(x)) · ∂f/∂x_i (x)

Explanation: Differentiating a composition multiplies the outer derivative evaluated at the inner function by the partial of the inner. This is the basis of automatic differentiation.

Complexity Analysis

Assume a black-box scalar function f: ℝⁿ → ℝ with evaluation cost C_f at a point. Computing a single partial derivative via central differences requires two evaluations (f(x + h e_i), f(x − h e_i)), giving time O(C_f) for one coordinate and O(n C_f) for the full gradient. Forward differences halve the calls but reduce accuracy to O(h), so they are O(n C_f) as well but with a smaller constant. Space cost is O(n) to store x and the gradient, plus O(1) temporary storage per evaluation. Using forward-mode automatic differentiation (AD) with dual numbers that carry a length-n derivative vector computes the full gradient in a single pass through f, but each primitive arithmetic operation now propagates an O(n)-length derivative. If f requires K primitive operations, time is O(K n) and space is O(n) for the active derivative vector, often competitive with central differences when C_f is dominated by arithmetic and memory traffic is manageable. Alternatively, seeding one dual direction at a time reduces per-operation cost to O(1) but requires n passes, totaling O(n K) time; this is similar to finite differences in call count but with machine-precision accuracy and no step-size tuning. For vector-valued F: ℝⁿ → ℝᵐ, building the full Jacobian by central differences costs two evaluations of F per column, O(n C_F) overall (or O(m n C_f) if each of the m components is evaluated separately at cost C_f). Reverse-mode AD computes gradients of scalar outputs in time roughly a small constant times C_f, independent of n, at the expense of O(K) auxiliary storage for the computation trace. Hessian computation by naive differencing is O(n² C_f); practical large-scale methods use Hessian-vector products to achieve O(n C_f) per product without forming H explicitly.

Code Examples

Central-difference partial derivative for a black-box function
#include <cmath>
#include <functional>
#include <iostream>
#include <limits>
#include <vector>

// Compute the i-th partial derivative of f at point x using central differences.
double partial_derivative(const std::function<double(const std::vector<double>&)>& f,
                          std::vector<double> x,
                          std::size_t i) {
    const double xi = x[i];
    const double eps = std::numeric_limits<double>::epsilon();
    // Heuristic step size: h ~ sqrt(eps)*max(1, |xi|)
    const double h = std::sqrt(eps) * std::max(1.0, std::abs(xi));

    x[i] = xi + h;
    const double fp = f(x);
    x[i] = xi - h;
    const double fm = f(x);

    // Restore (optional; x is a local copy)
    x[i] = xi;

    return (fp - fm) / (2.0 * h);
}

int main() {
    // Example scalar function: f(x, y, z) = x^2*y + sin(y) + x*exp(z)
    auto f = [](const std::vector<double>& v) -> double {
        double x = v[0], y = v[1], z = v[2];
        return x*x*y + std::sin(y) + x*std::exp(z);
    };

    std::vector<double> x = {1.2, -0.7, 0.3};

    for (std::size_t i = 0; i < x.size(); ++i) {
        double dfi = partial_derivative(f, x, i);
        std::cout << "Partial derivative wrt x[" << i << "]: " << dfi << "\n";
    }

    return 0;
}

This program approximates the i-th partial derivative of a scalar function with respect to the i-th variable using a central finite difference. The step size is chosen relative to the variable magnitude to balance truncation and rounding errors. It demonstrates a black-box approach: only function evaluations are needed.

Time: O(C_f) per partial; O(n C_f) for all partials. Space: O(n)
Compute a full gradient vector using central differences
#include <cmath>
#include <functional>
#include <iostream>
#include <limits>
#include <vector>

std::vector<double> gradient(const std::function<double(const std::vector<double>&)>& f,
                             const std::vector<double>& x) {
    std::size_t n = x.size();
    std::vector<double> g(n);
    std::vector<double> xt = x;
    const double eps = std::numeric_limits<double>::epsilon();

    for (std::size_t i = 0; i < n; ++i) {
        double xi = x[i];
        double h = std::sqrt(eps) * std::max(1.0, std::abs(xi));
        xt[i] = xi + h; double fp = f(xt);
        xt[i] = xi - h; double fm = f(xt);
        g[i] = (fp - fm) / (2.0 * h);
        xt[i] = xi; // restore
    }
    return g;
}

int main() {
    // Same test function as before for consistency.
    auto f = [](const std::vector<double>& v) -> double {
        double x = v[0], y = v[1], z = v[2];
        return x*x*y + std::sin(y) + x*std::exp(z);
    };

    std::vector<double> x = {1.2, -0.7, 0.3};

    std::vector<double> g = gradient(f, x);

    std::cout << "f(x) = " << f(x) << "\n";
    std::cout << "Gradient: [" << g[0] << ", " << g[1] << ", " << g[2] << "]\n";

    return 0;
}

This code computes the gradient by looping over coordinates and applying the central-difference formula for each partial derivative. It reuses a working copy of x to avoid reallocations. Central differences give O(h^2) truncation error and require two function evaluations per component.

Time: O(n C_f). Space: O(n)
Forward-mode automatic differentiation with dual numbers (full gradient in one pass)
#include <cmath>
#include <iostream>
#include <vector>

struct Dual {
    double val;               // primal value
    std::vector<double> der;  // derivative vector (size n)

    Dual() : val(0.0) {}
    Dual(double v, std::size_t n) : val(v), der(n, 0.0) {}

    static Dual variable(double v, std::size_t n, std::size_t idx) {
        Dual d(v, n);
        d.der[idx] = 1.0;  // seed direction e_idx
        return d;
    }
};

// Helpers: elementwise operations on derivative vectors
static inline std::vector<double> add(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
    return r;
}
static inline std::vector<double> sub(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] - b[i];
    return r;
}
static inline std::vector<double> scal(const std::vector<double>& a, double s) {
    std::vector<double> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] * s;
    return r;
}

// Operators for Dual
Dual operator+(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val + b.val; r.der = add(a.der, b.der); return r;
}
Dual operator-(const Dual& a, const Dual& b) {
    Dual r; r.val = a.val - b.val; r.der = sub(a.der, b.der); return r;
}
Dual operator*(const Dual& a, const Dual& b) {
    // Product rule: d(a*b) = a'*b + a*b'
    Dual r; r.val = a.val * b.val;
    std::size_t n = a.der.size();
    r.der.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) r.der[i] = a.der[i] * b.val + b.der[i] * a.val;
    return r;
}

// Dual with double interactions (minimal set used below)
Dual operator+(const Dual& a, double b) { Dual r; r.val = a.val + b; r.der = a.der; return r; }
Dual operator+(double a, const Dual& b) { return b + a; }
Dual operator*(const Dual& a, double b) { Dual r; r.val = a.val * b; r.der = scal(a.der, b); return r; }
Dual operator*(double a, const Dual& b) { return b * a; }

// Elementary functions (extend as needed)
Dual sin(const Dual& a) {
    Dual r; r.val = std::sin(a.val); r.der = scal(a.der, std::cos(a.val)); return r;
}
Dual exp(const Dual& a) {
    double ev = std::exp(a.val);
    Dual r; r.val = ev; r.der = scal(a.der, ev); return r;
}

// Example function f(x,y,z) = x^2*y + sin(y) + x*exp(z)
Dual f(const std::vector<Dual>& X) {
    const Dual& x = X[0];
    const Dual& y = X[1];
    const Dual& z = X[2];
    return (x * x) * y + sin(y) + x * exp(z);
}

int main() {
    std::size_t n = 3;  // number of variables
    std::vector<double> point = {1.2, -0.7, 0.3};

    // Lift to Dual variables with appropriate seeds
    std::vector<Dual> X(n);
    for (std::size_t i = 0; i < n; ++i) X[i] = Dual::variable(point[i], n, i);

    Dual y = f(X);  // one pass computes value and gradient

    std::cout << "f(x) = " << y.val << "\n";
    std::cout << "grad f = [";
    for (std::size_t i = 0; i < n; ++i) {
        std::cout << y.der[i] << (i + 1 < n ? ", " : "]\n");
    }
    return 0;
}

This forward-mode automatic differentiation (AD) example implements a minimal dual-number type that carries a derivative vector. Overloaded operators apply the chain rule automatically, producing the full gradient in one evaluation of f. It is accurate up to floating-point error and avoids step-size tuning.

Time: O(K n), where K is the number of primitive arithmetic operations in f. Space: O(n)
Tags: partial derivatives, gradient, Jacobian, Hessian, finite differences, central difference, automatic differentiation, dual numbers, directional derivative, Taylor expansion, machine epsilon, sensitivity analysis, optimization, chain rule, numerical differentiation