Gradient Descent
Key Points
- Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.
- Batch gradient descent uses the full dataset to compute the gradient each step, making it stable but potentially slower per iteration.
- Choosing a good learning rate (step size) is crucial; too large diverges, too small converges very slowly.
- For convex, smooth losses like linear regression, batch gradient descent reliably converges to the global minimum.
- Per-iteration cost is O(n d), where n is the number of examples and d is the number of features.
- Feature scaling (standardization) often dramatically improves convergence speed and stability.
- Stopping criteria typically monitor the gradient norm, the loss improvement, or a maximum number of iterations.
- C++ implementations should pay attention to numerical stability (overflow in exp, underflow in probabilities) and careful vectorized loops.
Prerequisites
- Multivariable calculus (gradients and partial derivatives) — Gradient descent requires computing and understanding ∇L(θ).
- Linear algebra (vectors, matrices, matrix-vector products) — Batch gradients for linear/logistic regression use X^T(Xθ−y) and X^T(p−y).
- Convexity basics — Convergence guarantees and step-size selection rely on convexity and smoothness assumptions.
- Floating-point arithmetic and numerical stability — C++ implementations must avoid overflow/underflow, especially for exp/log in logistic regression.
Detailed Explanation
01 Overview
Hook: Imagine hiking down a hill in the fog with only a local slope meter—you can’t see the valley, but you can feel which way is downhill. Gradient descent is that hiker. It repeatedly measures the slope (the gradient) at the current position and takes a step downhill to reduce the loss (error).

Concept: In machine learning and optimization, we often want to find parameters θ that minimize a loss function L(θ). The gradient ∇L(θ) points in the direction of steepest ascent; stepping in the opposite direction decreases the loss. Batch gradient descent computes this gradient using the entire dataset each iteration, making each step exact (for that dataset) but computationally heavier than stochastic variants.

Example: In linear regression, L(θ) is the mean squared error between predictions and targets. By computing the gradient over all samples, we update θ so that predictions get closer to targets overall, iteration by iteration. Over time—and with a suitable learning rate—θ converges toward parameters that minimize the error. The algorithm is simple, applies to any differentiable loss, and forms the backbone of many learning systems.
02 Intuition & Analogies
Hook: Think of trying to flatten a wrinkled bedsheet. You press where the bump is highest and move your hands outward, repeatedly smoothing until it’s flat. Gradient descent “presses” where the error is greatest and moves parameters in the direction that flattens the error surface.

Concept: The gradient is like a compass pointing toward the steepest increase of error. If we walk the other way by some small step, we reduce the error. The learning rate is how big that step is—like how far you move your hands each time you press the sheet. Too big and you overshoot and create new wrinkles; too small and it takes forever. Batch gradient descent uses all the evidence (every data point) to choose the movement direction, like polling everyone in a crowd before deciding.

Example: For a simple parabola L(θ) = (θ − 3)^2, the slope at θ tells you whether 3 is to the left or right and how far. If θ = 0, the slope is −6, so stepping right (opposite of the negative slope) moves you toward θ = 3. For linear regression with many features, each feature’s contribution to the error becomes a component in the gradient vector. Summing over the whole dataset (batch) ensures our step points toward reducing the average error, not just the error of one sample.
03 Formal Definition
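A minimal formal statement of the method, using the same notation as the Key Formulas section below:

```latex
\textbf{Batch gradient descent.}\quad
\text{Given a differentiable loss } L : \mathbb{R}^d \to \mathbb{R},
\text{ a starting point } \theta_0 \in \mathbb{R}^d,
\text{ and a learning rate } \eta > 0, \text{ iterate}
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta \, \nabla L(\theta_t),
  \qquad
  \nabla L(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta\, \ell(\theta;\, x_i, y_i),
\]
\text{where the gradient is computed over the full dataset of } n \text{ samples,}
\text{ until a stopping criterion such as } \lVert \nabla L(\theta_t) \rVert \le \epsilon
\text{ or } t \ge T_{\max} \text{ is met.}
```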
04 When to Use
Hook: Use the right hammer for the right nail. Batch gradient descent is the hammer for smooth, differentiable losses when you can afford full passes over the data.

Concept: Apply batch gradient descent when the dataset fits in memory or streaming a full pass is acceptable, and when you want stable, low-variance gradient estimates per step. It shines on convex problems like linear regression and logistic regression, where a global minimum exists and each step makes steady progress. It also serves as a strong baseline and sanity check for more complex optimizers.

Example use cases: (1) Fitting linear or logistic regression on medium-sized datasets (n·d manageable). (2) Training small neural networks for didactic purposes or when determinism and stability matter. (3) Computing precise gradients for ill-conditioned problems before switching to quasi-Newton methods.

Consider alternatives when: the data is huge (prefer mini-batch/SGD), the loss is highly non-convex and noisy (adaptive optimizers may help), or second-order curvature information (Newton/LM/L-BFGS) yields faster convergence per iteration despite higher cost.
⚠️ Common Mistakes
Hook: Most failures come from two knobs: step size and scaling.

Concept: Common pitfalls include (1) a learning rate too large, causing divergence or oscillation; (2) a learning rate too small, leading to painfully slow progress; (3) unscaled features, which distort the loss surface and make a single learning rate ineffective; (4) forgetting the bias term or mishandling it during scaling; (5) using integer arithmetic or low precision, causing rounding errors; (6) incorrect gradient signs or averaging factors (a missing 1/n); (7) poor stopping rules—either stopping too early due to noisy loss or too late due to tiny improvement; (8) ignoring numerical stability in functions like exp for logistic regression.

Example fixes: Standardize features (zero mean, unit variance) and keep the bias unscaled or handled separately. Start with a conservative η (e.g., a 1/L estimate, or a small value like 1e-2) and consider decay schedules. Verify gradients with finite-difference checks on small problems. Monitor both the loss and the gradient norm. Use double precision in C++ and clamp logits in the sigmoid to avoid overflow. Log progress and experiment with different η values to find a stable, fast regime.
Key Formulas
Gradient Descent Update
θ_{t+1} = θ_t − η ∇L(θ_t)
Explanation: Update parameters by stepping opposite the gradient, scaled by the learning rate η. This is the core rule for batch gradient descent.
Empirical Risk
L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i)
Explanation: The training loss is the average of per-sample losses. Batch gradient descent differentiates this average to form the update.
MSE Loss (Linear Regression)
L(θ) = (1/(2n)) ‖Xθ − y‖²
Explanation: This measures the average squared error between predictions and targets. The 1/2 factor cancels when differentiating, simplifying the gradient expression.
Linear Regression Gradient
∇L(θ) = (1/n) X^T(Xθ − y)
Explanation: For mean squared error, the gradient equals the data matrix transpose times the residual vector, averaged over n samples.
Sigmoid
σ(z) = 1 / (1 + e^{−z})
Explanation: Maps any real input to a probability in (0, 1). Used to convert linear scores to probabilities in logistic regression.
Logistic Regression Loss
L(θ) = −(1/n) Σ_{i=1}^{n} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],  p_i = σ(x_i^T θ)
Explanation: Cross-entropy penalizes confident wrong predictions heavily. It is convex in θ for binary logistic regression.
Logistic Regression Gradient
∇L(θ) = (1/n) X^T(p − y)
Explanation: The gradient is the data matrix transpose times the probability-error vector, averaged over all samples.
Sublinear Convergence (Convex, L-smooth)
f(θ_t) − f(θ*) ≤ L ‖θ_0 − θ*‖² / (2t)
Explanation: With step size 1/L on an L-smooth convex function, gradient descent’s suboptimality shrinks as O(1/t). This gives a rate guarantee without strong convexity.
Linear Convergence (Strongly Convex)
‖θ_t − θ*‖² ≤ (1 − μ/L)^t ‖θ_0 − θ*‖²
Explanation: If f is μ-strongly convex and L-smooth, gradient descent with a suitable step size converges geometrically. The ratio μ/L controls the speed.
Learning Rate Decay
η_t = η_0 / (1 + k t)
Explanation: A decaying step size can stabilize and improve convergence when curvature varies. k controls how quickly the learning rate shrinks.
Gradient-Norm Stopping
Stop when ‖∇L(θ_t)‖ ≤ ε
Explanation: Stop when the gradient is small, indicating a (near) stationary point. This is robust for smooth problems.
Relative Loss Decrease Stopping
Stop when |L_{t−1} − L_t| / max(|L_{t−1}|, ε) ≤ tol
Explanation: Stop when successive loss improvements become negligible. The max in the denominator avoids division by very small numbers.
Complexity Analysis
Each iteration of batch gradient descent costs O(n d) time: computing the predictions Xθ is O(n d), forming the residuals is O(n), and accumulating the gradient X^T r is another O(n d). Memory is O(n d) for the data plus O(d) for the parameters and the gradient. With T iterations, the total cost is O(T n d).
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Minimize L(theta) = (theta - 3)^2 using batch gradient descent
// Analytical gradient: dL/dtheta = 2*(theta - 3)

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto L = [](double th) { return (th - 3.0) * (th - 3.0); };
    auto grad = [](double th) { return 2.0 * (th - 3.0); };

    double theta = 0.0;  // initial guess
    double eta = 0.2;    // learning rate
    int max_iters = 50;
    double tol = 1e-9;   // gradient-norm stopping

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        double g = grad(theta);
        cout << "iter=" << t
             << " theta=" << theta
             << " L=" << L(theta)
             << " |grad|=" << fabs(g) << "\n";
        if (fabs(g) <= tol) break;
        theta -= eta * g;  // theta_{t+1} = theta_t - eta * grad
    }

    cout << "Final theta ~ " << theta << " (true minimizer is 3.0)\n";
    return 0;
}
This program demonstrates the basic gradient descent update on a simple convex function. The gradient is computed exactly, and the parameter is updated by stepping opposite the gradient. You can experiment with the learning rate: too large can cause divergence or oscillation; moderate values converge quickly to 3.0.
#include <bits/stdc++.h>
using namespace std;

// Linear regression with batch gradient descent
// Minimize L(theta) = (1/(2n)) * sum_i (x_i^T theta - y_i)^2
// Gradient: (1/n) * X^T (X theta - y)

struct Dataset {
    vector<vector<double>> X; // n x d (will include bias as last column)
    vector<double> y;         // n
};

// Generate synthetic linear data with bias and Gaussian noise
Dataset make_data(int n, int d_no_bias, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.3);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1); // include bias at end
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0.0);

    uniform_real_distribution<double> xdist(-2.0, 2.0);
    for (int i = 0; i < n; ++i) {
        double yclean = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            yclean += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias feature
        yclean += true_w[d_no_bias];
        ds.y[i] = yclean + noise(rng);
    }
    return ds;
}

// Standardize each non-bias feature: zero mean, unit variance
void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    // Exclude last column (bias) from scaling
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12; // avoid divide-by-zero
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500;       // samples
    int d_no_bias = 5; // real features (we will append bias)
    Dataset ds = make_data(n, d_no_bias, 123);

    // Scale features (not bias)
    standardize_features(ds.X);

    int d = d_no_bias + 1;        // include bias feature
    vector<double> theta(d, 0.0); // initialize to zeros

    double eta = 0.1;     // learning rate
    int max_iters = 1000;
    double tol = 1e-8;    // gradient-norm stopping

    vector<double> yhat(n), residual(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double sse = 0.0;
        for (int i = 0; i < n; ++i) {
            double pred = 0.0;
            for (int j = 0; j < d; ++j) pred += ds.X[i][j] * th[j];
            double e = pred - ds.y[i];
            sse += e * e;
        }
        return 0.5 * sse / n; // (1/2n) * ||X theta - y||^2
    };

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // yhat = X theta
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < d; ++j) s += ds.X[i][j] * theta[j];
            yhat[i] = s;
        }
        // residual = yhat - y
        for (int i = 0; i < n; ++i) residual[i] = yhat[i] - ds.y[i];
        // grad = (1/n) * X^T residual
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * residual[i];
            grad[j] = gj / n;
        }
        // Check the stopping rule every iteration, not only on report steps
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        // Progress report
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        // Update
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (including bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';
    return 0;
}
This program fits a linear regression model with batch gradient descent. It standardizes features for better conditioning, appends a bias feature, and iteratively updates parameters using the full-dataset gradient. The loss decreases steadily when the learning rate is reasonable, illustrating stable batch updates.
#include <bits/stdc++.h>
using namespace std;

// Logistic regression with batch gradient descent
// Loss: L(theta) = (1/n) * sum_i [ -y_i log p_i - (1-y_i) log(1-p_i) ], p_i = sigmoid(x_i^T theta)
// Gradient: (1/n) * X^T (p - y)

struct Dataset {
    vector<vector<double>> X; // n x d (includes bias column)
    vector<int> y;            // n labels in {0,1}
};

Dataset make_data(int n, int d_no_bias, unsigned seed = 7) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.5);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1);
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0);
    normal_distribution<double> xdist(0.0, 1.0);
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            z += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias
        z += true_w[d_no_bias];
        z += noise(rng);
        // Probability via sigmoid, then sample label
        double p = 1.0 / (1.0 + exp(-z));
        bernoulli_distribution bd(p);
        ds.y[i] = (int)bd(rng);
    }
    return ds;
}

void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12;
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

// Numerically stable sigmoid
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 1000;
    int d_no_bias = 4;
    Dataset ds = make_data(n, d_no_bias, 2024);
    standardize_features(ds.X);

    int d = d_no_bias + 1; // include bias
    vector<double> theta(d, 0.0);
    vector<double> p(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double L = 0.0;
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * th[j];
            double pi = sigmoid(z);
            // Clamp to avoid log(0)
            double eps = 1e-12;
            pi = min(max(pi, eps), 1.0 - eps);
            L += -ds.y[i] * log(pi) - (1 - ds.y[i]) * log(1 - pi);
        }
        return L / n;
    };

    double eta = 0.2;
    int max_iters = 1000;
    double tol = 1e-6;

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // Compute probabilities p = sigmoid(X theta)
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
            p[i] = sigmoid(z);
        }
        // Gradient: (1/n) * X^T (p - y)
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * (p[i] - ds.y[i]);
            grad[j] = gj / n;
        }
        // Check the stopping rule every iteration, not only on report steps
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';

    // Quick training accuracy check
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == ds.y[i]);
    }
    cout << "Training accuracy ~ " << (100.0 * correct / n) << "%\n";
    return 0;
}
This example trains a binary logistic regression model using batch gradient descent. It computes cross-entropy loss, uses a numerically stable sigmoid, standardizes features, and updates parameters using the full gradient each iteration. The model’s loss decreases and accuracy improves as training proceeds.