Momentum Methods
Key Points
- Momentum methods add an exponentially weighted memory of past gradients to make descent steps smoother and faster, especially in ravines and ill-conditioned problems.
- Classical (Polyak/heavy-ball) momentum keeps a velocity v_t = β v_{t-1} + ∇L(θ_t) and updates parameters by θ_{t+1} = θ_t − η v_t.
- Nesterov Accelerated Gradient (NAG) computes the gradient at a look-ahead point, often giving better theoretical and practical convergence on smooth convex problems.
- The momentum coefficient β ∈ [0, 1) controls how much past gradient information persists; typical values are 0.8–0.99.
- Per-iteration cost stays O(d), where d is the number of parameters, with only an extra velocity vector of size d in memory.
- Momentum reduces oscillations across steep directions and increases effective step length along shallow directions.
- Choosing a learning rate η that is too large together with a high β can cause divergence; stable choices follow known bounds for smooth, strongly convex losses.
- Momentum is widely used in deep learning and classic optimization to speed up SGD and mitigate noisy gradients.
Prerequisites
- Gradient Descent – Momentum builds directly on gradient descent updates by modifying how gradients accumulate.
- Vector and Matrix Calculus – Computing gradients and understanding updates in high dimensions require basic calculus on vectors and matrices.
- Convexity and Strong Convexity – Convergence guarantees and parameter choices for momentum depend on these properties.
- Lipschitz Smoothness – Smoothness constants bound stable learning rates and appear in optimal momentum settings.
- Linear Algebra (Eigenvalues, Condition Number) – Ill-conditioning explains why momentum helps and provides intuition for acceleration.
- Stochastic Optimization Basics – Momentum is commonly used with mini-batch SGD, which introduces gradient noise.
- Numerical Stability and Scaling – Feature scaling and step-size selection affect the stability of momentum methods.
- Regularization (Weight Decay) – Combining momentum with regularization correctly matters when implementing training.
Detailed Explanation
01 Overview
Momentum methods are optimization techniques that augment gradient descent with a memory of past updates. Instead of moving only in the direction of the current gradient, they maintain a running "velocity" vector that blends recent gradients using exponential smoothing. This idea, introduced in the heavy-ball method by Polyak, helps reduce zig-zagging in narrow valleys of the loss landscape and accelerates motion along gentle slopes.

In practice, momentum improves both deterministic gradient descent and stochastic gradient descent (SGD), which samples gradients on mini-batches of data. Two popular variants are classical (Polyak) momentum and Nesterov Accelerated Gradient (NAG). Classical momentum forms a velocity as a moving average of gradients and then steps opposite to it, scaled by the learning rate. NAG takes a look-ahead step using the previous velocity before computing the gradient, often providing tighter theoretical guarantees for smooth convex losses.

These methods incur minimal overhead: they add one extra vector the size of the parameters and a couple of vector operations per iteration, so they are easy to implement and scale. Momentum is now a standard component of modern deep learning optimizers and a useful tool for speeding up convergence in a wide range of optimization problems.
02 Intuition & Analogies
Imagine pushing a heavy shopping cart down a bumpy aisle. If you push a little every second, the cart does not instantly stop when you stop pushing; it keeps rolling due to momentum. Similarly, in optimization, each gradient "push" does not immediately reset direction; previous pushes accumulate, giving inertia to motion. This inertia helps on terrains with ridges: if the aisle slopes slightly forward (a shallow direction) but has rumble strips across it (steep directions), naive steps will waste effort bouncing side-to-side. Momentum damps these sideways oscillations while letting the cart build speed forward, reaching the destination faster.

A second analogy is smoothing noisy signals. If you measure the wind direction every second, readings fluctuate. Taking an exponentially weighted moving average dampens noise but still follows the trend. Momentum does the same to gradients: it smooths noisy mini-batch directions so updates are more stable and decisive.

Nesterov momentum adds a predictive twist: before measuring the next gradient (checking the wind), you move a little in the direction you expect to go (a look-ahead with your current speed), then adjust based on what you find there. This predictive correction often avoids over-shooting and gives a more accurate course. In both analogies, a friction parameter (the momentum coefficient \beta) controls how much past motion matters: high \beta means strong memory (more inertia), low \beta means updates respond quickly to new information but may wobble more.
03 Formal Definition
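Both updates can be stated compactly; the following uses the same conventions as the Key Formulas and code sections (loss L, learning rate η, momentum coefficient β ∈ [0, 1), velocity initialized as v_0 = 0):

```latex
\textbf{Classical (Polyak / heavy-ball) momentum:}
\begin{aligned}
v_t &= \beta v_{t-1} + \nabla L(\theta_t), \\
\theta_{t+1} &= \theta_t - \eta v_t.
\end{aligned}

\textbf{Nesterov Accelerated Gradient (NAG):}
\begin{aligned}
\tilde{\theta}_t &= \theta_t - \eta \beta v_{t-1} \quad \text{(look-ahead point)}, \\
v_t &= \beta v_{t-1} + \nabla L(\tilde{\theta}_t), \\
\theta_{t+1} &= \theta_t - \eta v_t.
\end{aligned}
```

Note the convention: the learning rate multiplies the velocity in the parameter update rather than the gradient inside v_t; other texts fold η into the velocity, which changes the effective step size.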
04 When to Use
Use momentum when gradients oscillate or when curvature is highly anisotropic (some directions steep, others flat). Typical scenarios include training deep neural networks, where narrow ravines arise from layer-wise scaling, and convex optimization with ill-conditioned Hessians. In stochastic settings (mini-batch SGD), momentum reduces variance from noisy gradients, helping both speed and stability. Choose classical momentum for simplicity and robustness; prefer Nesterov momentum if you want better performance on smooth convex problems or empirically faster progress in some deep-learning tasks.

Momentum is helpful when you need inexpensive acceleration: it adds only O(d) memory and arithmetic, unlike second-order methods that require Hessians or large matrices. It is also useful when you cannot or do not want to tune learning rates frequently, because momentum can "rescue" slightly aggressive step sizes by damping oscillations. However, if gradients are already well-behaved (well-conditioned problems, or when using adaptive methods like Adam), the incremental benefit may be smaller. Use momentum together with learning-rate schedules (e.g., decay or warmup) for best results.
⚠️ Common Mistakes
- Setting \beta too high with a large \eta. High inertia plus big steps often overshoots and diverges; start with \beta in [0.8, 0.95] and increase cautiously.
- Forgetting to scale updates consistently. The learning rate \eta multiplies the velocity in the parameter update, not the gradient term inside v_t for classical momentum; mixing these conventions changes effective step sizes.
- Mis-implementing Nesterov. The gradient must be evaluated at the look-ahead point \theta_t - \eta \beta v_{t-1}; computing it at \theta_t defeats the purpose.
- Not initializing v_0 to zero (or appropriately). Starting with stale or mismatched velocity can cause a big first step.
- Confusing weight decay (L2 regularization) with momentum. Weight decay should be applied to parameters (or gradients) consistently; momentum applies to the running average of gradients.
- Ignoring data scaling. Poorly scaled features cause ill-conditioning that momentum can mitigate but not fully fix; standardize inputs for better stability.
- Using the same hyperparameters across very different batch sizes. Larger batches reduce gradient noise and may allow slightly higher \eta or lower \beta; small batches may benefit from higher \beta.
- Expecting momentum to fix non-differentiable objectives without smoothing. Momentum assumes gradients (or subgradients) are available and informative.
Key Formulas
Classical Momentum (Heavy-ball)
v_t = β v_{t-1} + ∇L(θ_t),   θ_{t+1} = θ_t − η v_t
Explanation: Velocity is an exponential moving average of past gradients, and parameters move opposite to the velocity scaled by the learning rate. This reduces zig-zags and accelerates along consistent directions.
Nesterov Accelerated Gradient (NAG)
θ̃_t = θ_t − η β v_{t-1},   v_t = β v_{t-1} + ∇L(θ̃_t),   θ_{t+1} = θ_t − η v_t
Explanation: Compute a look-ahead point using the previous velocity, then evaluate the gradient there for a corrective update. This often yields better theoretical rates on smooth convex functions.
EMA Expansion
v_t = Σ_{k=0}^{t-1} β^k ∇L(θ_{t−k})   (with v_0 = 0)
Explanation: Unrolling the recursion shows momentum is an exponentially weighted moving average of past gradients. Recent gradients get higher weight; older ones decay by powers of β.
Scalar Quadratic Error Recurrence
For L(θ) = (a/2)(θ − θ*)^2, the error e_t = θ_t − θ* satisfies e_{t+1} = (1 + β − ηa) e_t − β e_{t−1}.
Explanation: The error follows a second-order linear recurrence. Stability requires the roots of r^2 − (1 − ηa + β) r + β = 0 to lie inside the unit circle.
Heavy-ball Optimal Parameters (Strongly Convex)
η* = 4 / (√L + √μ)^2,   β* = ((√L − √μ) / (√L + √μ))^2
Explanation: For μ-strongly convex and L-smooth functions, these parameters minimize the spectral radius of the iteration in the worst case. They provide accelerated linear convergence at rate (√κ − 1)/(√κ + 1), where κ = L/μ.
Condition Number
κ = L / μ
Explanation: The ratio of largest to smallest curvature controls convergence rates. Larger κ means ill-conditioning and slower progress for plain gradient descent; momentum mitigates this effect.
Lipschitz Gradient
‖∇L(x) − ∇L(y)‖ ≤ L ‖x − y‖ for all x, y
Explanation: This smoothness condition bounds how quickly the gradient can change, which is key to deriving stable step sizes and convergence guarantees.
Arithmetic Series (reference)
Σ_{k=1}^{n} k = n(n+1)/2
Explanation: Useful when summing iteration-dependent terms in analyses. The closed form simplifies complexity and convergence proofs involving cumulative steps.
Per-epoch Complexity
O(nd) time per epoch, O(d) extra memory
Explanation: Optimizing a model with d parameters over n examples costs O(nd) to compute one pass of gradients. Momentum adds only one extra d-dimensional vector in memory.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Minimize L(theta) = 0.5 * theta^T H theta, with H = diag([1000, 1])
// Gradient: g(theta) = H * theta

struct Momentum {
    double eta;        // learning rate
    double beta;       // momentum coefficient
    vector<double> v;  // velocity
    Momentum(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0) {}
    void step(vector<double>& theta, const vector<double>& grad) {
        // v_t = beta * v_{t-1} + g_t
        for (size_t i = 0; i < theta.size(); ++i) {
            v[i] = beta * v[i] + grad[i];
        }
        // theta_{t+1} = theta_t - eta * v_t
        for (size_t i = 0; i < theta.size(); ++i) {
            theta[i] -= eta * v[i];
        }
    }
};

int main() {
    // Problem setup
    vector<double> theta = {5.0, 5.0};     // initial point
    vector<double> diagH = {1000.0, 1.0};  // Hessian diagonal

    // Hyperparameters: stability requires eta * lambda_max < 2 * (1 + beta)
    // for the steepest direction (lambda_max = 1000), so keep eta small.
    double eta = 0.001;  // learning rate (eta * 1000 = 1, well inside the bound)
    double beta = 0.9;   // momentum
    Momentum opt(theta.size(), eta, beta);

    auto grad = [&](const vector<double>& x){
        vector<double> g(2);
        g[0] = diagH[0] * x[0];
        g[1] = diagH[1] * x[1];
        return g;
    };

    auto loss = [&](const vector<double>& x){
        return 0.5 * (diagH[0]*x[0]*x[0] + diagH[1]*x[1]*x[1]);
    };

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int t = 0; t < 200; ++t) {
        vector<double> g = grad(theta);
        opt.step(theta, g);
        if (t % 20 == 0) {
            cout << "iter " << t << ": theta=[" << theta[0] << ", " << theta[1]
                 << "] L=" << loss(theta) << "\n";
        }
    }

    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";
    return 0;
}
This program minimizes an ill-conditioned quadratic bowl where one direction is 1000 times steeper than the other. Plain gradient descent would zig-zag heavily; classical momentum uses a velocity v_t = \beta v_{t-1} + g_t and takes updates \theta_{t+1} = \theta_t - \eta v_t to damp oscillations and accelerate along the shallow direction. You can experiment by setting beta=0 to see slower convergence and more zig-zag.
#include <bits/stdc++.h>
using namespace std;

struct Nesterov {
    double eta;        // learning rate
    double beta;       // momentum coefficient
    vector<double> v;  // velocity
    Nesterov(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0) {}
    // Step requires gradient at look-ahead point: theta_look = theta - eta * beta * v
    template <class GradFunc>
    void step(vector<double>& theta, GradFunc grad_func) {
        vector<double> theta_look(theta.size());
        for (size_t i = 0; i < theta.size(); ++i) theta_look[i] = theta[i] - eta * beta * v[i];
        vector<double> g = grad_func(theta_look);  // g_t = grad at look-ahead
        for (size_t i = 0; i < theta.size(); ++i) {
            v[i] = beta * v[i] + g[i];  // update velocity
            theta[i] -= eta * v[i];     // update parameters
        }
    }
};

int main() {
    // H = diag([1000, 1])
    vector<double> diagH = {1000.0, 1.0};
    auto grad = [&](const vector<double>& x){
        vector<double> g(2);
        g[0] = diagH[0] * x[0];
        g[1] = diagH[1] * x[1];
        return g;
    };
    auto loss = [&](const vector<double>& x){
        return 0.5 * (diagH[0]*x[0]*x[0] + diagH[1]*x[1]*x[1]);
    };

    vector<double> theta = {5.0, 5.0};
    // Keep eta * lambda_max below 2 * (1 + beta) for the steep direction.
    Nesterov opt(theta.size(), 0.001, 0.9);

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int t = 0; t < 200; ++t) {
        opt.step(theta, grad);
        if (t % 20 == 0) {
            cout << "iter " << t << ": theta=[" << theta[0] << ", " << theta[1]
                 << "] L=" << loss(theta) << "\n";
        }
    }

    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";
    return 0;
}
This implementation computes the gradient at a look-ahead point \theta - \eta\beta v, then updates velocity and parameters. Compared to classical momentum on the same problem, Nesterov often reduces overshoot in the shallow direction while keeping oscillations small in the steep direction.
#include <bits/stdc++.h>
using namespace std;

// Binary logistic regression with L2 regularization trained by SGD + momentum.
// Model: p(y=1|x) = sigmoid(w^T x + b)

struct Momentum {
    double eta, beta;
    vector<double> v;  // velocity for weights
    double vb;         // velocity for bias
    Momentum(size_t d, double eta_, double beta_) : eta(eta_), beta(beta_), v(d, 0.0), vb(0.0) {}
    void step(vector<double>& w, double& b, const vector<double>& gw, double gb) {
        for (size_t i = 0; i < w.size(); ++i) v[i] = beta * v[i] + gw[i];
        vb = beta * vb + gb;
        for (size_t i = 0; i < w.size(); ++i) w[i] -= eta * v[i];
        b -= eta * vb;
    }
};

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

int main() {
    // Generate a simple 2D synthetic dataset
    std::mt19937 rng(42);
    normal_distribution<double> n01(0.0, 1.0);
    int n = 2000; int d = 2;
    vector<vector<double>> X(n, vector<double>(d));
    vector<int> y(n);
    // True weights
    vector<double> w_true = {2.0, -3.0}; double b_true = 0.5;
    for (int i = 0; i < n; ++i) {
        double x0 = n01(rng), x1 = n01(rng);
        X[i][0] = x0; X[i][1] = x1;
        double logit = w_true[0]*x0 + w_true[1]*x1 + b_true + 0.2*n01(rng);
        double p = sigmoid(logit);
        y[i] = (uniform_real_distribution<double>(0.0, 1.0)(rng) < p) ? 1 : 0;
    }

    // Initialize parameters
    vector<double> w(d, 0.0); double b = 0.0;
    double lambda = 1e-3;  // L2 regularization
    double eta = 0.1; double beta = 0.9; int batch = 64; int epochs = 10;
    Momentum opt(d, eta, beta);

    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);

    auto loss_and_grad = [&](const vector<int>& batch_idx){
        vector<double> gw(d, 0.0); double gb = 0.0; double loss = 0.0;
        for (int id : batch_idx) {
            const auto& xi = X[id]; int yi = y[id];
            double z = inner_product(xi.begin(), xi.end(), w.begin(), 0.0) + b;
            double p = sigmoid(z);
            // Cross-entropy loss contribution (clamped to avoid log(0))
            loss += -(yi ? log(max(p, 1e-12)) : log(max(1 - p, 1e-12)));
            // Gradient wrt w and b
            double err = (p - yi);
            for (int j = 0; j < d; ++j) gw[j] += err * xi[j];
            gb += err;
        }
        // Average over batch
        double m = (double)batch_idx.size();
        for (int j = 0; j < d; ++j) gw[j] /= m;
        gb /= m; loss /= m;
        // Add L2 regularization: 0.5*lambda*||w||^2
        for (int j = 0; j < d; ++j) { loss += 0.5 * lambda * w[j]*w[j]; gw[j] += lambda * w[j]; }
        return tuple<double, vector<double>, double>(loss, gw, gb);
    };

    cout.setf(ios::fixed); cout << setprecision(6);
    for (int e = 0; e < epochs; ++e) {
        shuffle(idx.begin(), idx.end(), rng);
        double epoch_loss = 0.0; int steps = 0;
        for (int i = 0; i < n; i += batch) {
            vector<int> bidx; bidx.reserve(batch);
            for (int k = i; k < min(i + batch, n); ++k) bidx.push_back(idx[k]);
            auto [L, gw, gb] = loss_and_grad(bidx);
            epoch_loss += L; ++steps;
            opt.step(w, b, gw, gb);
        }
        epoch_loss /= steps;
        // Compute accuracy on the whole dataset (for demonstration)
        int correct = 0;
        for (int i = 0; i < n; ++i) {
            double p = sigmoid(inner_product(X[i].begin(), X[i].end(), w.begin(), 0.0) + b);
            int pred = (p >= 0.5) ? 1 : 0; correct += (pred == y[i]);
        }
        cout << "epoch " << e << ": loss=" << epoch_loss << ", acc=" << (double)correct/n << "\n";
    }

    cout << "Final w: [" << w[0] << ", " << w[1] << "] b=" << b << "\n";
    return 0;
}
This program trains a logistic regression classifier with mini-batch SGD and classical momentum. The velocity smooths noisy gradient estimates, typically improving stability and convergence speed compared to plain SGD. It also demonstrates how to integrate L2 regularization and compute basic metrics.