How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Gradient & Directional Derivatives

Key Points

  • The gradient \(\nabla f\) points in the direction of steepest increase of a scalar field, and its length equals the maximum rate of increase.
  • A directional derivative \(D_{u} f(x)\) measures how fast \(f\) changes at \(x\) when you move in direction \(u\).
  • Formally, \(D_{u} f(x) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h} = \nabla f(x) \cdot u\) for unit \(u\).
  • Numerically, you can approximate \(\nabla f\) using central differences with only function evaluations—no symbolic calculus required.
  • Normalizing the direction vector is essential; otherwise the directional derivative scales incorrectly by the vector's length.
  • In optimization, moving with \(+\nabla f\) performs steepest ascent, while moving with \(-\nabla f\) performs steepest descent.
  • Central differences have error \(O(h^2)\) and require two evaluations per coordinate; forward differences are cheaper but less accurate.
  • Automatic differentiation (dual numbers) can compute a directional derivative in one pass through the function with high accuracy.

Prerequisites

  • →Limits and continuity — The definition of directional derivatives uses a limit; differentiability implies existence of all directional derivatives.
  • →Vectors and norms — Gradients and directions are vectors; dot products and norms are essential to interpret rates and angles.
  • →Partial derivatives — The gradient is composed of partial derivatives with respect to each coordinate.
  • →Linear algebra (dot product, orthogonality) — Directional derivatives are projections of the gradient; level sets are orthogonal to the gradient.
  • →Taylor series (first order) — Explains the linear approximation of f using the gradient and justifies optimization steps.
  • →Floating-point arithmetic — Choosing finite-difference step sizes requires understanding rounding and cancellation error.
  • →Basic optimization concepts — Using gradients in descent/ascent methods requires understanding steps, convergence, and stopping criteria.

Detailed Explanation


01 Overview

Hook: Imagine hiking on a hill with a foggy map. You stand at a point and want to know which way to step to climb fastest. The arrow that points in the direction of the steepest climb is the gradient, and the speed of climbing you would experience in any chosen direction is a directional derivative.

Concept: For a scalar function f that assigns a height (or temperature, cost, etc.) to each point in space, the gradient ∇f is a vector that summarizes how f changes with respect to each coordinate. The directional derivative D_u f(x) tells you the instantaneous rate of change of f at x along a specified direction u.

Example: If f(x, y) is the elevation of a landscape at coordinates (x, y), then ∇f(x, y) points uphill and has magnitude equal to how steep the hill is there. If you walk northeast (a specific direction), the directional derivative gives how quickly your altitude changes per meter walked northeast.

These ideas unify geometry, physics, and optimization: they describe the local linear behavior of functions, drive steepest-ascent/descent methods, and relate to fluxes in vector calculus.

02 Intuition & Analogies

Hook: Picture contour lines on a topographic map. Where the lines are packed tightly, the terrain is steep; where they spread out, it is gentle. You stand on this map and hold a compass. Which way is the steepest uphill? That arrow is the gradient.

Concept: The gradient acts like a smart arrow: it points toward the greatest increase and its length is the steepness. If you choose any other direction with your compass, the rate you climb is the projection of the gradient onto your chosen direction—that projection is exactly the directional derivative.

Everyday analogies:
  1. Heating: In a kitchen, temperature varies from the hot oven to the cold window. A fly feels the most rapid warming by flying along the gradient of temperature.
  2. Costs in shopping: Suppose f(p1, p2, …) is total cost depending on item prices. The gradient tells you how sensitive the total is to each item's price. If the market shifts prices in a particular pattern (a direction u), the directional derivative predicts how your total cost will change to first order.

Example: Take f(x, y) = x^2 + 3xy + cos y. At (1, 0), ∇f = (2x + 3y, 3x − sin y) = (2, 3). If you walk in direction u = (3, 4)/5, the directional derivative equals ∇f · u = (2, 3) · (3/5, 4/5) = (6 + 12)/5 = 18/5. That is your instantaneous rate of increase per unit step along u.

03 Formal Definition

Let \(f: \mathbb{R}^n \to \mathbb{R}\) be differentiable at \(x \in \mathbb{R}^n\). The gradient of f at x is the vector of partial derivatives: \(\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x) \right)^\top\). The directional derivative of f at x along a unit vector \(u \in \mathbb{R}^n\) is defined by the limit \(D_{u} f(x) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}\), provided the limit exists. If f is differentiable at x, then the directional derivative exists in every direction and satisfies the key identity \(D_{u} f(x) = \nabla f(x) \cdot u\). Furthermore, the gradient is the unique vector that linearly approximates changes in f via the first-order Taylor expansion: \(f(x + \Delta x) \approx f(x) + \nabla f(x) \cdot \Delta x\) for small \(\|\Delta x\|\). The magnitude of the gradient equals the maximum directional derivative over all unit vectors: \(\max_{\|u\|=1} D_{u} f(x) = \|\nabla f(x)\|\), attained at \(u = \nabla f(x)/\|\nabla f(x)\|\) when \(\nabla f(x) \neq 0\).

04 When to Use

Use gradients whenever you need the best linear local approximation of a scalar field, or to know the most sensitive direction of change. Typical use cases include:

  1. Optimization: Gradient descent (minimization) and ascent (maximization) update parameters using \(\pm \nabla f\). Machine learning training, curve fitting, and control all rely on this principle.
  2. Physics and engineering: Heat flows from hot to cold, opposite the temperature gradient; electric fields are gradients of potentials (up to sign).
  3. Image processing: The gradient magnitude highlights edges where pixel intensities change rapidly.
  4. Sensitivity analysis: In finance or epidemiology models, gradients quantify how outputs respond to parameter changes.
  5. Numerical modeling: Finite element and finite difference methods approximate derivatives to solve PDEs.
  6. Navigation on scalar fields: In robotics, potential fields guide motion using gradients.

Directional derivatives are handy when you care about change along a specific intended motion or constraint direction, for example, checking improvement along a search direction in line-search methods, or computing the instantaneous change of an objective along a trajectory.

⚠️Common Mistakes

  1. Forgetting to normalize the direction vector: The definition of \(D_{u} f\) assumes \(\|u\| = 1\). If you use an unnormalized vector v, you actually compute \(\nabla f \cdot v = \|v\| \, D_{v/\|v\|} f\), which misstates the rate per unit length.
  2. Confusing sign and direction: Moving along \(+\nabla f\) increases f fastest, while \(-\nabla f\) decreases f fastest. Mixing them up in algorithms flips ascent to descent.
  3. Using too large or too small a finite-difference step h: If h is too large, truncation error dominates; if too small, floating-point cancellation dominates. Central differences with a moderate h (e.g., around \(\sqrt{\epsilon}\) times a characteristic scale) usually balance the two errors.
  4. Ignoring coordinate scaling: If variables have different units or scales, the gradient's components reflect that. Pre-scaling or preconditioning can improve optimization.
  5. Assuming differentiability where it fails: At kinks or non-smooth points, the gradient may not exist; use subgradients or directional derivatives carefully.
  6. Evaluating directional derivatives with a non-unit direction but interpreting them as unit rates: always report whether the direction was normalized.
  7. Numerical gradients at boundaries: When you can't take symmetric steps (e.g., under constraints), switch to one-sided differences to avoid domain violations.

Key Formulas

Gradient Vector

\(\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_n}(x) \right)^\top\)

Explanation: The gradient collects all first partial derivatives into one vector. It points toward the direction of steepest increase of the function.

Directional Derivative (Limit Definition)

\(D_u f(x) = \lim_{h \to 0} \frac{f(x + hu) - f(x)}{h}\)

Explanation: This limit measures the instantaneous rate of change of f as we move in the direction u. It assumes u is a unit vector.

Directional Derivative via Gradient

\(D_u f(x) = \nabla f(x) \cdot u\)

Explanation: For differentiable functions, the directional derivative equals the dot product of the gradient and the direction. This provides a fast way to compute directional change once the gradient is known.

First-Order Taylor Approximation

\(f(x + \Delta x) = f(x) + \nabla f(x) \cdot \Delta x + O(\|\Delta x\|^2)\)

Explanation: Near a point, the function behaves like a linear function with slope given by the gradient. The quadratic error term shrinks as your step becomes small.

Steepest Ascent Magnitude

\(\max_{\|u\|=1} D_u f(x) = \|\nabla f(x)\|\)

Explanation: Among all unit directions, the largest directional derivative equals the gradient's norm. The maximizing direction is along the gradient.

Central Difference for Partials

\(\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}\)

Explanation: This numerically approximates each component of the gradient using two function evaluations per dimension. It has second-order accuracy in h.

Chain Rule for Directional Derivatives

\(D_u (g \circ f)(x) = g'(f(x)) \, D_u f(x)\)

Explanation: For a composition g(f(x)), the rate of change along u multiplies the inner rate by the outer derivative. Useful for nested scalar functions.

Gradient Descent Update

\(x_{k+1} = x_k - \eta \nabla f(x_k)\)

Explanation: Iteratively step opposite the gradient to decrease f. The step size η controls how far you move each iteration.

Gradient Ascent Update

\(x_{k+1} = x_k + \eta \nabla f(x_k)\)

Explanation: Iteratively step along the gradient to increase f. Common in maximizing likelihoods or rewards.

Central Difference Error Order

\(\mathrm{error}_{\text{central}} = O(h^2)\)

Explanation: The truncation error of central differences decreases quadratically with the step size h, balancing accuracy and numerical stability.

Complexity Analysis

Let n be the dimension of the input and let T_f denote the cost of one evaluation of the scalar function f. A central-difference gradient requires two evaluations per coordinate: for each i, evaluate f(x + h e_i) and f(x − h e_i). Thus, computing the full gradient has time complexity O(n · T_f). If f itself requires O(n) arithmetic to evaluate (e.g., a sum over coordinates), then a full gradient via central differences is O(n^2) in basic operations. The space cost is O(n) to store the gradient vector and temporary perturbed points.

If you instead compute a single directional derivative by central difference along a unit direction u, you need only two function evaluations total, so the time complexity is O(T_f) and space is O(1) beyond storing the point and direction. Automatic differentiation (forward mode with dual numbers) can compute an exact directional derivative (up to floating-point roundoff) in a single pass with time O(T_f · c), where c is a small constant overhead from carrying dual parts; space is O(1) beyond the function's working set. To recover the entire gradient with forward-mode AD, you would repeat with n seed directions, giving O(n · T_f).

In iterative optimization using gradient ascent/descent with K iterations, each step dominated by a gradient computation, the overall time is O(K · n · T_f) (central differences) and space remains O(n). In practice, constants matter: central differences double the number of function calls per coordinate; forward differences cut that in half but reduce accuracy; AD often outperforms finite differences in accuracy and sometimes in speed when analytic derivatives are hard to code.

Code Examples

Numerical Gradient via Central Differences for f: R^n -> R
#include <bits/stdc++.h>
using namespace std;

// Compute gradient using central differences.
// f: function mapping vector<double> to double
// x: point at which to compute the gradient
// h: step size for finite differences (default ~1e-6)
vector<double> gradientCentralDifference(function<double(const vector<double>&)> f,
                                         const vector<double>& x,
                                         double h = 1e-6) {
    size_t n = x.size();
    vector<double> grad(n, 0.0);
    vector<double> xp = x, xm = x; // perturbed points

    for (size_t i = 0; i < n; ++i) {
        xp[i] += h;
        xm[i] -= h;
        double fp = f(xp);
        double fm = f(xm);
        grad[i] = (fp - fm) / (2.0 * h);
        xp[i] = x[i]; // reset
        xm[i] = x[i]; // reset
    }
    return grad;
}

// Helper: Euclidean norm
double norm2(const vector<double>& v) {
    double s = 0.0;
    for (double vi : v) s += vi * vi;
    return sqrt(s);
}

int main() {
    // Example function: f(x, y) = x^2 + 3xy + cos(y)
    auto f = [](const vector<double>& v) -> double {
        double x = v[0], y = v[1];
        return x*x + 3.0*x*y + cos(y);
    };

    vector<double> x = {1.0, 0.0};
    vector<double> g = gradientCentralDifference(f, x, 1e-6);

    cout << fixed << setprecision(6);
    cout << "Gradient at (1,0): [" << g[0] << ", " << g[1] << "]\n";
    cout << "Gradient norm: " << norm2(g) << "\n";

    // Expected analytical gradient: (2x + 3y, 3x - sin y) -> (2, 3)
    return 0;
}

This program computes the gradient of a scalar function at a given point using central differences. Each component uses two function evaluations at x ± h e_i to achieve second-order accuracy in h. The example verifies the gradient for f(x, y) = x^2 + 3xy + cos y at (1, 0).

Time: O(n · T_f), where T_f is the time to evaluate f once (two evaluations per component).
Space: O(n) to store the gradient and temporary vectors.
Directional Derivative via Forward-Mode Automatic Differentiation (Dual Numbers)
#include <bits/stdc++.h>
using namespace std;

// Minimal dual number type: a + b*eps with eps^2 = 0
struct Dual {
    double val; // real part
    double der; // derivative part (directional component)
    Dual(double v = 0.0, double d = 0.0) : val(v), der(d) {}
};

// Basic operators for Dual
Dual operator+(const Dual& a, const Dual& b) { return Dual(a.val + b.val, a.der + b.der); }
Dual operator-(const Dual& a, const Dual& b) { return Dual(a.val - b.val, a.der - b.der); }
Dual operator*(const Dual& a, const Dual& b) { return Dual(a.val * b.val, a.val * b.der + a.der * b.val); }
Dual operator/(const Dual& a, const Dual& b) {
    // (a/b)' = (a' b - a b') / b^2
    return Dual(a.val / b.val, (a.der * b.val - a.val * b.der) / (b.val * b.val));
}

// Interactions with double
Dual operator+(const Dual& a, double c) { return Dual(a.val + c, a.der); }
Dual operator+(double c, const Dual& a) { return a + c; }
Dual operator-(const Dual& a, double c) { return Dual(a.val - c, a.der); }
Dual operator-(double c, const Dual& a) { return Dual(c - a.val, -a.der); }
Dual operator*(const Dual& a, double c) { return Dual(a.val * c, a.der * c); }
Dual operator*(double c, const Dual& a) { return a * c; }
Dual operator/(const Dual& a, double c) { return Dual(a.val / c, a.der / c); }

// Elementary functions
Dual sin(const Dual& a) { return Dual(::sin(a.val), ::cos(a.val) * a.der); }
Dual cos(const Dual& a) { return Dual(::cos(a.val), -::sin(a.val) * a.der); }
Dual exp(const Dual& a) { return Dual(::exp(a.val), ::exp(a.val) * a.der); }
Dual log(const Dual& a) { return Dual(::log(a.val), a.der / a.val); }

// Compute the directional derivative D_u f(x) by seeding dual numbers with the direction u.
// fDual: function mapping vector<Dual> to Dual (scalar output)
// x: base point; u: direction (need not be unit; it is normalized here)
double directionalDerivativeAD(function<Dual(const vector<Dual>&)> fDual,
                               const vector<double>& x,
                               const vector<double>& u) {
    size_t n = x.size();
    // Normalize the direction to report a rate per unit length
    double norm = 0.0;
    for (double ui : u) norm += ui * ui;
    norm = sqrt(norm);
    if (norm == 0.0) throw runtime_error("Direction vector must be non-zero");

    vector<Dual> xd(n);
    for (size_t i = 0; i < n; ++i) {
        xd[i] = Dual(x[i], u[i] / norm); // seed derivative component with the unit direction
    }
    Dual y = fDual(xd);
    return y.der; // directional derivative along unit u
}

int main() {
    // Example: f(x, y, z) = x*y + sin(z)
    auto fDual = [](const vector<Dual>& v) -> Dual {
        Dual x = v[0], y = v[1], z = v[2];
        return x * y + sin(z);
    };

    vector<double> x = {2.0, 3.0, 0.0};
    vector<double> u = {1.0, -1.0, 2.0};

    double Du = directionalDerivativeAD(fDual, x, u);
    cout << fixed << setprecision(6);
    cout << "Directional derivative along u at x: " << Du << "\n";

    // Check against the analytical formula: grad f = (y, x, cos z) = (3, 2, 1)
    // u_unit = u / ||u||, so D_u f = (3, 2, 1) · u_unit
    double normu = sqrt(1*1 + (-1)*(-1) + 2*2);
    double Du_check = (3.0*1.0 + 2.0*(-1.0) + 1.0*2.0) / normu; // (3 - 2 + 2)/||u|| = 3/||u||
    cout << "Analytical check: " << Du_check << "\n";
    return 0;
}

This program implements forward-mode automatic differentiation using dual numbers to compute the directional derivative in a single pass. By seeding the dual part with the components of the unit direction, the derivative part of the result equals D_u f(x). The example compares the AD result to the analytic dot-product formula.

Time: O(T_f) with a small constant overhead from dual arithmetic.
Space: O(1) additional space beyond the function's working memory.
Gradient Ascent Using Numerical Gradients
#include <bits/stdc++.h>
using namespace std;

vector<double> gradientCentralDifference(function<double(const vector<double>&)> f,
                                         const vector<double>& x,
                                         double h = 1e-6) {
    size_t n = x.size();
    vector<double> grad(n, 0.0), xp = x, xm = x;
    for (size_t i = 0; i < n; ++i) {
        xp[i] += h; xm[i] -= h;
        double fp = f(xp), fm = f(xm);
        grad[i] = (fp - fm) / (2.0 * h);
        xp[i] = x[i]; xm[i] = x[i];
    }
    return grad;
}

double norm2(const vector<double>& v) {
    double s = 0;
    for (double vi : v) s += vi * vi;
    return sqrt(s);
}

vector<double> gradientAscent(function<double(const vector<double>&)> f,
                              vector<double> x0,
                              double step = 0.1,
                              int maxIters = 200,
                              double tol = 1e-6) {
    vector<double> x = move(x0);
    for (int k = 0; k < maxIters; ++k) {
        vector<double> g = gradientCentralDifference(f, x);
        if (norm2(g) < tol) break; // stationarity
        for (size_t i = 0; i < x.size(); ++i) x[i] += step * g[i];
    }
    return x;
}

int main() {
    // Concave quadratic: f(x, y) = -(x-1)^2 - 2(y+2)^2 + 10 (maximum at (1, -2))
    auto f = [](const vector<double>& v) -> double {
        double x = v[0], y = v[1];
        return -(x-1)*(x-1) - 2.0*(y+2)*(y+2) + 10.0;
    };

    vector<double> start = {5.0, 5.0};
    vector<double> xmax = gradientAscent(f, start, 0.1, 200, 1e-8);

    cout << fixed << setprecision(6);
    cout << "Estimated maximizer: (" << xmax[0] << ", " << xmax[1] << ")\n";
    cout << "f at estimate: " << f(xmax) << "\n";
    return 0;
}

This example performs gradient ascent using numerical gradients from central differences to maximize a concave quadratic, whose true maximizer is at (1, −2). It iteratively updates x ← x + η ∇f(x) until the gradient norm is small or a maximum number of iterations is reached.

Time: O(K · n · T_f), where K is the number of iterations and T_f the time per function evaluation.
Space: O(n) to store the current iterate and gradient.
Tags: gradient, directional derivative, partial derivative, dot product, steepest descent, steepest ascent, numerical differentiation, central difference, automatic differentiation, finite difference, Taylor expansion, optimization, scalar field, level set, Hessian