Stochastic Gradient Descent (SGD)
Key Points
- Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.
- Mini-batch sampling provides an unbiased estimate of the true gradient while reducing variance compared to single-sample updates.
- Random shuffling each epoch is crucial to avoid bias and to ensure good mixing of examples during training.
- Learning rate schedules and momentum are practical tools that stabilize SGD and speed up convergence.
- For convex problems like linear or logistic regression, SGD converges under mild step-size conditions; for deep nets it often finds useful minima in practice.
- Per-update cost scales with the batch size and number of features, enabling training on very large datasets.
- Normalization and proper regularization (e.g., L2) significantly improve stability and generalization in SGD.
- Common pitfalls include too-large learning rates, forgetting to average gradients by batch size, and not decaying the learning rate.
Prerequisites
- Vectors and dot products — Gradient computations in linear and logistic models are built from dot products and linear algebra.
- Derivatives and chain rule — SGD updates use gradients of the loss with respect to parameters.
- Probability and expectation — Understanding unbiased gradient estimators and variance requires basic probability.
- Convex vs. non-convex optimization — Convergence guarantees differ between convex problems and deep networks.
- Overfitting and regularization — SGD benefits from regularization like L2 to improve generalization.
Detailed Explanation
01 Overview
Stochastic Gradient Descent (SGD) is a workhorse algorithm for training machine learning models by iteratively adjusting parameters to minimize a loss function. Instead of computing the gradient on the entire dataset (which can be very large and slow), SGD uses a small, randomly selected subset called a mini-batch to estimate the gradient and take a step. This approach drastically reduces computation per update and introduces noise that can help escape shallow local minima and flat plateaus. Mini-batch SGD with random sampling is the most common variant used in practice: during each epoch, we randomly shuffle the data and repeatedly draw mini-batches of size b to perform updates. The process repeats for multiple epochs until convergence criteria are met (e.g., training loss stabilizes or a maximum number of epochs is reached). The algorithm is simple, scalable, and surprisingly powerful. Think of tuning a radio in a noisy environment: you nudge the dial based on partial, noisy feedback, yet eventually lock onto the station. SGD likewise nudges parameters based on noisy gradient estimates. A concrete example is fitting a line to data (linear regression) by repeatedly updating the slope and intercept using small random groups of points.
02 Intuition & Analogies
Imagine hiking down a foggy mountain. Full-batch gradient descent waits for the fog to clear entirely before choosing the perfect downhill direction—accurate, but it takes a long time to see far enough. SGD instead feels the slope under your feet using a handful of pebbles nearby (a mini-batch). Each step may not be perfect, but it is quick, and over many steps you still move downhill effectively. Randomly changing which pebbles you feel (random shuffling) prevents you from being tricked by a peculiar local arrangement of rocks and helps you explore more directions. Batch size is like how many pebbles you sample at once: one pebble (pure SGD) is noisy and jittery; a large handful (large batch) is stable but slower per update. Momentum is like adding a bit of inertia—you don’t immediately react to every jitter; instead, you keep moving in a smoothed, averaged downhill direction. Learning-rate decay is like taking shorter steps as you approach the valley to avoid overshooting. Regularization is a gentle elastic band pulling you toward simpler solutions to avoid overfitting the noisy terrain. In practice, this combination—mini-batches, shuffling, momentum, and decaying step sizes—makes SGD both fast and robust on massive datasets, from simple linear models to deep neural networks.
03 Formal Definition
Given an objective F(w) = (1/n) Σᵢ ℓ(w; xᵢ, yᵢ) averaged over n examples, mini-batch SGD draws a random batch Bₜ of size b at step t and updates wₜ₊₁ = wₜ − ηₜ · (1/b) Σ_{i∈Bₜ} ∇ℓ(wₜ; xᵢ, yᵢ), where ηₜ is the learning rate. The formulas below make each ingredient precise.
04 When to Use
Use mini-batch SGD when datasets are large enough that full-batch gradients are expensive or cannot fit in memory, or when you want fast, incremental updates (e.g., online or streaming scenarios). It shines for models where computing per-example gradients is cheap and additive across samples: linear regression, logistic regression, matrix factorization, embeddings, and deep neural networks. Choose mini-batches to balance noise and throughput: small batches provide more frequent updates and better generalization in some cases; moderate to large batches can exploit parallel hardware. Random shuffling each epoch avoids cyclic patterns and ensures unbiased coverage of the dataset. Momentum is particularly helpful when the loss landscape has ravines—steep in one direction and flat in another—common in deep nets; it accelerates progress along consistent directions and damps oscillations. Learning-rate schedules (step decay, cosine, or 1/t) help you start with bold steps and then fine-tune. Opt for L2 regularization when you suspect overfitting, and normalize inputs to stabilize gradients. If your data arrives over time, SGD naturally adapts by updating with each new mini-batch without retraining from scratch.
⚠️ Common Mistakes
- Using a learning rate that is too large causes divergence or wildly oscillating loss; too small makes training painfully slow. Start with a reasonable default (e.g., 0.1 for normalized linear/logistic models or 1e-3 for deep nets) and tune.
- Forgetting to average by batch size b makes gradients b times larger than intended, effectively increasing the learning rate and destabilizing training.
- Not shuffling data each epoch can introduce bias (e.g., class-ordered data) and lead to poor minima or zig-zagging behavior.
- Skipping feature scaling leads to gradients of very different magnitudes per feature, causing slow or unstable convergence; standardize or normalize inputs.
- Ignoring regularization in high-dimensional problems often overfits; add L2 (weight decay) or early stopping.
- Using momentum incorrectly (e.g., applying the learning rate both inside the velocity update and again on the parameter update, or confusing different momentum formulations) changes the effective step and can hinder convergence.
- Evaluating progress only on the noisy mini-batch loss is misleading; track a smoothed training loss or an epoch-level metric and monitor validation performance.
- Changing batch size without re-tuning the learning rate alters the scale and noise of gradients; consider linear scaling of the learning rate with batch size as a heuristic.
Key Formulas
Empirical Risk
F(w) = (1/n) Σᵢ₌₁ⁿ ℓ(w; xᵢ, yᵢ)
Explanation: This is the average loss over the dataset. SGD aims to minimize this objective by sampling small subsets of terms.
Mini-batch Gradient Estimator
gₜ = (1/b) Σ_{i∈Bₜ} ∇ℓ(wₜ; xᵢ, yᵢ)
Explanation: The gradient used at step t is the average of per-example gradients over the mini-batch Bₜ. It is cheaper than the full gradient and still points approximately downhill.
SGD Update Rule
wₜ₊₁ = wₜ − ηₜ gₜ
Explanation: Parameters move opposite the gradient by a step size ηₜ. This is the core iterative update of SGD.
Unbiasedness
E[gₜ] = ∇F(wₜ)
Explanation: Under uniform random sampling, the expected mini-batch gradient equals the true gradient. This property underpins SGD’s convergence behavior.
Variance vs. Batch Size
Var(gₜ) ≈ σ² / b
Explanation: The variance of the mini-batch gradient decreases roughly inversely with batch size b. Larger batches yield more stable updates but cost more computation.
Momentum (Exponential Moving Average)
vₜ₊₁ = β vₜ + (1 − β) gₜ,  wₜ₊₁ = wₜ − ηₜ vₜ₊₁
Explanation: Momentum keeps a moving average of gradients to smooth noise and accelerate along consistent directions. The parameter β controls how much past gradients matter.
Inverse Time Decay
ηₜ = η₀ / (1 + k·t)
Explanation: A simple learning-rate schedule that gradually decreases the step size with decay constant k. It helps with convergence once near a minimum.
MSE Loss
L(w) = (1/2n) ‖Xw − y‖²
Explanation: Mean squared error for linear regression. The factor 1/2 simplifies derivatives; the gradient is Xᵀ(Xw − y)/n.
Logistic Loss
L(w) = −(1/n) Σᵢ [yᵢ log σ(wᵀxᵢ) + (1 − yᵢ) log(1 − σ(wᵀxᵢ))]
Explanation: Binary cross-entropy loss for logistic regression, where σ is the sigmoid. Its gradient equals the average of (σ(wᵀxᵢ) − yᵢ)xᵢ.
L2 Regularization
L_reg(w) = L(w) + (λ/2) ‖w‖²
Explanation: Adds a penalty proportional to the squared norm of parameters. It discourages large weights and can improve generalization.
Robbins–Monro Conditions
Σₜ ηₜ = ∞  and  Σₜ ηₜ² < ∞
Explanation: Classical requirements for diminishing step sizes to ensure convergence of stochastic approximation methods under convexity and smoothness.
Complexity Analysis
Each mini-batch update touches b examples with d features, so it costs O(b·d) time; one epoch over n examples therefore costs O(n·d) regardless of batch size. Memory is O(d) for the parameters plus O(b·d) for the active mini-batch, which is what lets SGD scale to datasets that do not fit in memory at once.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cassert>

// Utility: dot product between w and x
double dot(const std::vector<double>& w, const std::vector<double>& x) {
    assert(w.size() == x.size());
    double s = 0.0;
    for (size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
    return s;
}

// Generate synthetic linear data: y = w_true^T x + noise
void make_dataset(size_t n, size_t d, std::vector<std::vector<double>>& X, std::vector<double>& y,
                  std::mt19937& rng) {
    std::normal_distribution<double> nd(0.0, 1.0);
    std::vector<double> w_true(d);
    for (size_t j = 0; j < d; ++j) w_true[j] = nd(rng);
    std::normal_distribution<double> noise(0.0, 0.1);

    X.assign(n, std::vector<double>(d));
    y.assign(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) {
            X[i][j] = nd(rng); // standard normal features
        }
        y[i] = dot(w_true, X[i]) + noise(rng);
    }
}

// Compute mean squared error over dataset
double mse(const std::vector<std::vector<double>>& X, const std::vector<double>& y, const std::vector<double>& w) {
    size_t n = X.size();
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double pred = dot(w, X[i]);
        double e = pred - y[i];
        s += e * e;
    }
    return 0.5 * s / static_cast<double>(n);
}

// Create mini-batches by shuffling indices each epoch
std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t batch_size, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng); // random shuffling each epoch

    std::vector<std::vector<size_t>> batches;
    for (size_t start = 0; start < n; start += batch_size) {
        size_t end = std::min(n, start + batch_size);
        batches.emplace_back(idx.begin() + start, idx.begin() + end);
    }
    return batches;
}

int main() {
    // Reproducible RNG
    std::mt19937 rng(42);

    // Hyperparameters
    size_t n = 5000;      // samples
    size_t d = 20;        // features
    size_t epochs = 10;   // number of passes over data
    size_t batch = 64;    // mini-batch size
    double lr = 0.1;      // learning rate

    // Data
    std::vector<std::vector<double>> X;
    std::vector<double> y;
    make_dataset(n, d, X, y, rng);

    // Parameters initialized to zeros
    std::vector<double> w(d, 0.0);

    // Training loop
    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            // Compute mini-batch gradient: grad = (1/b) * X_B^T (X_B w - y_B)
            std::vector<double> grad(d, 0.0);
            for (size_t idx : B) {
                double err = dot(w, X[idx]) - y[idx];
                // accumulate gradient per feature
                for (size_t j = 0; j < d; ++j) grad[j] += err * X[idx][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) grad[j] *= invb;
            // Parameter update: w <- w - lr * grad
            for (size_t j = 0; j < d; ++j) w[j] -= lr * grad[j];
        }
        double train_mse = mse(X, y, w);
        std::cout << "Epoch " << e+1 << "/" << epochs << ": MSE = " << train_mse << "\n";
    }

    // Report final first few weights
    std::cout << "First 5 weights: ";
    for (size_t j = 0; j < std::min<size_t>(5, w.size()); ++j) std::cout << w[j] << ' ';
    std::cout << "\n";
    return 0;
}
```
This program trains a linear regression model using mini-batch SGD on synthetic data. Each epoch shuffles sample indices, splits them into mini-batches, computes the average gradient for each batch, and updates parameters using the learning rate. We track mean squared error to observe convergence. The code demonstrates core SGD mechanics: random shuffling, mini-batch averaging, and per-batch updates.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>

// Sigmoid function with numerical stability for large magnitude inputs
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = std::exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = std::exp(z);
        return ez / (1.0 + ez);
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Make a simple 2D binary classification dataset: two Gaussians
void make_dataset(size_t n_per_class, std::vector<std::vector<double>>& X, std::vector<int>& y, std::mt19937& rng) {
    std::normal_distribution<double> c1x(-2.0, 1.0), c1y(0.0, 1.0);
    std::normal_distribution<double> c2x(2.0, 1.0), c2y(0.5, 1.0);
    X.clear(); y.clear();
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c1x(rng), c1y(rng), 1.0}); // + bias feature
        y.push_back(0);
    }
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c2x(rng), c2y(rng), 1.0});
        y.push_back(1);
    }
}

std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t b, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    std::vector<std::vector<size_t>> batches;
    for (size_t s = 0; s < n; s += b) batches.emplace_back(idx.begin()+s, idx.begin()+std::min(n, s+b));
    return batches;
}

// Compute binary cross-entropy loss averaged over dataset
double bce_loss(const std::vector<std::vector<double>>& X, const std::vector<int>& y, const std::vector<double>& w, double lambda) {
    const double eps = 1e-12;
    double s = 0.0;
    for (size_t i = 0; i < X.size(); ++i) {
        double p = sigmoid(dot(w, X[i]));
        p = std::min(std::max(p, eps), 1.0 - eps);
        s += - (y[i] * std::log(p) + (1 - y[i]) * std::log(1 - p));
    }
    // L2 penalty (lambda/2)||w||^2, excluding bias term (last weight)
    double reg = 0.0;
    for (size_t j = 0; j + 1 < w.size(); ++j) reg += w[j] * w[j];
    return (s / static_cast<double>(X.size())) + 0.5 * lambda * reg;
}

int main() {
    std::mt19937 rng(123);

    // Hyperparameters
    size_t n_per_class = 2000; // total n = 4000
    size_t batch = 64;
    size_t epochs = 15;
    double lr0 = 0.5;          // initial learning rate (features are small-scale)
    double decay = 1e-3;       // inverse time decay: lr_t = lr0 / (1 + decay * t)
    double beta = 0.9;         // momentum coefficient
    double lambda = 1e-3;      // L2 regularization strength

    // Data (with bias feature)
    std::vector<std::vector<double>> X;
    std::vector<int> y;
    make_dataset(n_per_class, X, y, rng);
    const size_t n = X.size();
    const size_t d = X[0].size(); // includes bias term

    // Parameters and momentum buffer
    std::vector<double> w(d, 0.0);
    std::vector<double> v(d, 0.0);

    size_t t = 0; // global step counter for learning-rate decay

    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            ++t;
            double lr = lr0 / (1.0 + decay * static_cast<double>(t));
            // Compute mini-batch gradient of logistic loss + L2 (excluding bias)
            std::vector<double> g(d, 0.0);
            for (size_t i : B) {
                double p = sigmoid(dot(w, X[i]));
                double diff = (p - static_cast<double>(y[i]));
                for (size_t j = 0; j < d; ++j) g[j] += diff * X[i][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) g[j] *= invb;
            // Add L2 gradient lambda * w (exclude bias term)
            for (size_t j = 0; j + 1 < d; ++j) g[j] += lambda * w[j];

            // Momentum update: v = beta * v + (1 - beta) * g; w -= lr * v
            for (size_t j = 0; j < d; ++j) {
                v[j] = beta * v[j] + (1.0 - beta) * g[j];
                w[j] -= lr * v[j];
            }
        }
        // Evaluate average loss and accuracy at epoch end
        double loss = bce_loss(X, y, w, lambda);
        size_t correct = 0;
        for (size_t i = 0; i < n; ++i) {
            double p = sigmoid(dot(w, X[i]));
            int pred = (p >= 0.5) ? 1 : 0;
            if (pred == y[i]) ++correct;
        }
        double acc = static_cast<double>(correct) / static_cast<double>(n);
        std::cout << "Epoch " << (e+1) << "/" << epochs << ": loss=" << loss << ", acc=" << acc << "\n";
    }

    std::cout << "Weights: ";
    for (double wi : w) std::cout << wi << ' ';
    std::cout << "\n";
    return 0;
}
```
This program trains a binary logistic regression classifier using mini-batch SGD with momentum, L2 regularization, and an inverse-time learning-rate decay. A simple synthetic 2D dataset is generated from two Gaussians; a bias feature is appended. Each update computes the average mini-batch gradient, adds the L2 term (excluding bias), applies momentum, and uses a decayed learning rate. The epoch-end loss and accuracy show convergence and classification performance.