Algorithm · Intermediate

Stochastic Gradient Descent (SGD)

Key Points

  • Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.
  • Mini-batch sampling provides an unbiased estimate of the true gradient while reducing variance compared to single-sample updates.
  • Random shuffling each epoch is crucial to avoid bias and to ensure good mixing of examples during training.
  • Learning rate schedules and momentum are practical tools that stabilize SGD and speed up convergence.
  • For convex problems like linear or logistic regression, SGD converges under mild step-size conditions; for deep nets it often finds useful minima in practice.
  • Per-update cost scales with the batch size and number of features, enabling training on very large datasets.
  • Normalization and proper regularization (e.g., L2) significantly improve stability and generalization in SGD.
  • Common pitfalls include too-large learning rates, forgetting to average gradients by batch size, and not decaying the learning rate.

Prerequisites

  • Vectors and dot products — Gradient computations in linear and logistic models are built from dot products and linear algebra.
  • Derivatives and chain rule — SGD updates use gradients of the loss with respect to parameters.
  • Probability and expectation — Understanding unbiased gradient estimators and variance requires basic probability.
  • Convex vs. non-convex optimization — Convergence guarantees differ between convex problems and deep networks.
  • Overfitting and regularization — SGD benefits from regularization like L2 to improve generalization.

Detailed Explanation


01 Overview

Stochastic Gradient Descent (SGD) is a workhorse algorithm for training machine learning models by iteratively adjusting parameters to minimize a loss function. Instead of computing the gradient on the entire dataset (which can be very large and slow), SGD uses a small, randomly selected subset called a mini-batch to estimate the gradient and take a step. This approach drastically reduces computation per update and introduces noise that can help escape shallow local minima and flat plateaus.

Mini-batch SGD with random sampling is the most common variant used in practice: during each epoch, we randomly shuffle the data and repeatedly draw mini-batches of size b to perform updates. The process repeats for multiple epochs until convergence criteria are met (e.g., training loss stabilizes or a maximum number of epochs is reached). The algorithm is simple, scalable, and surprisingly powerful.

Think of tuning a radio in a noisy environment: you nudge the dial based on partial, noisy feedback, yet eventually lock onto the station. SGD likewise nudges parameters based on noisy gradient estimates. A concrete example is fitting a line to data (linear regression) by repeatedly updating the slope and intercept using small random groups of points.

02 Intuition & Analogies

Imagine hiking down a foggy mountain. Full-batch gradient descent waits for the fog to clear entirely before choosing the perfect downhill direction—accurate, but it takes a long time to see far enough. SGD instead feels the slope under your feet using a handful of pebbles nearby (a mini-batch). Each step may not be perfect, but it is quick, and over many steps you still move downhill effectively. Randomly changing which pebbles you feel (random shuffling) prevents you from being tricked by a peculiar local arrangement of rocks and helps you explore more directions. Batch size is like how many pebbles you sample at once: one pebble (pure SGD) is noisy and jittery; a large handful (large batch) is stable but slower per update. Momentum is like adding a bit of inertia—you don’t immediately react to every jitter; instead, you keep moving in a smoothed, averaged downhill direction. Learning-rate decay is like taking shorter steps as you approach the valley to avoid overshooting. Regularization is a gentle elastic band pulling you toward simpler solutions to avoid overfitting the noisy terrain. In practice, this combination—mini-batches, shuffling, momentum, and decaying step sizes—makes SGD both fast and robust on massive datasets, from simple linear models to deep neural networks.

03 Formal Definition

Given training data D = {(x_i, y_i)}_{i=1}^{n} and a parametric model with parameters θ, define the empirical risk L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i), where ℓ is a per-example loss. Mini-batch SGD chooses at iteration t a random subset B_t ⊂ {1, …, n} of size b (often uniformly without replacement per epoch), computes the stochastic gradient estimate g_t = (1/b) Σ_{i∈B_t} ∇_θ ℓ(θ_t; x_i, y_i), and updates parameters by θ_{t+1} = θ_t − η_t g_t, where η_t > 0 is the learning rate. Under uniform random sampling, g_t is an unbiased estimator of the full gradient: E[g_t | θ_t] = ∇L(θ_t). Its covariance decreases approximately inversely with batch size, improving stability as b grows. With momentum, one maintains a velocity v_{t+1} = β v_t + (1 − β) g_t and applies θ_{t+1} = θ_t − η_t v_{t+1}. For convex, Lipschitz-smooth losses, convergence to the global optimum is guaranteed under the Robbins–Monro step-size conditions: Σ_t η_t = ∞ and Σ_t η_t² < ∞. For non-convex losses, SGD converges to stationary points in expectation under similar smoothness assumptions and is widely effective in practice.

04 When to Use

Use mini-batch SGD when datasets are large enough that full-batch gradients are expensive or cannot fit in memory, or when you want fast, incremental updates (e.g., online or streaming scenarios). It shines for models where computing per-example gradients is cheap and additive across samples: linear regression, logistic regression, matrix factorization, embeddings, and deep neural networks. Choose mini-batches to balance noise and throughput: small batches provide more frequent updates and better generalization in some cases; moderate to large batches can exploit parallel hardware. Random shuffling each epoch avoids cyclic patterns and ensures unbiased coverage of the dataset. Momentum is particularly helpful when the loss landscape has ravines—steep in one direction and flat in another—common in deep nets; it accelerates progress along consistent directions and damps oscillations. Learning-rate schedules (step decay, cosine, or 1/t) help you start with bold steps and then fine-tune. Opt for L2 regularization when you suspect overfitting, and normalize inputs to stabilize gradients. If your data arrives over time, SGD naturally adapts by updating with each new mini-batch without retraining from scratch.

⚠️Common Mistakes

• Using a learning rate that is too large causes divergence or wildly oscillating loss; too small makes training painfully slow. Start with a reasonable default (e.g., 0.1 for normalized linear/logistic models or 1e-3 for deep nets) and tune.
• Forgetting to average by batch size b makes gradients b times larger than intended, effectively increasing the learning rate and destabilizing training.
• Not shuffling data each epoch can introduce bias (e.g., class-ordered data) and lead to poor minima or zig-zagging behavior.
• Skipping feature scaling leads to gradients of very different magnitudes per feature, causing slow or unstable convergence; standardize or normalize inputs.
• Ignoring regularization in high-dimensional problems often overfits; add L2 (weight decay) or early stopping.
• Using momentum incorrectly (e.g., applying the learning rate both inside the velocity and again on the update, or confusing different momentum formulations) changes the effective step and can hinder convergence.
• Evaluating progress only on the noisy mini-batch loss is misleading; track a smoothed training loss or an epoch-level metric and monitor validation performance.
• Changing batch size without re-tuning the learning rate alters the scale and noise of gradients; consider linear scaling of the learning rate with batch size as a heuristic.

Key Formulas

Empirical Risk

L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i)

Explanation: This is the average loss over the dataset. SGD aims to minimize this objective by sampling small subsets of terms.

Mini-batch Gradient Estimator

g_t = (1/b) Σ_{i∈B_t} ∇_θ ℓ(θ_t; x_i, y_i)

Explanation: The gradient used at step t is the average of per-example gradients over the mini-batch. It is cheaper than the full gradient and still points approximately downhill.

SGD Update Rule

θ_{t+1} = θ_t − η_t g_t

Explanation: Parameters move opposite the gradient by a step size ηt​. This is the core iterative update of SGD.

Unbiasedness

E[g_t | θ_t] = ∇L(θ_t)

Explanation: Under uniform random sampling, the expected mini-batch gradient equals the true gradient. This property underpins SGD’s convergence behavior.

Variance vs. Batch Size

Var(g_t) ≈ (1/b) Σ

Explanation: The variance of the mini-batch gradient decreases roughly inversely with batch size b. Larger batches yield more stable updates but cost more computation.

Momentum (Exponential Moving Average)

v_{t+1} = β v_t + (1 − β) g_t,  θ_{t+1} = θ_t − η_t v_{t+1}

Explanation: Momentum keeps a moving average of gradients to smooth noise and accelerate along consistent directions. The parameter β controls how much past gradients matter.

Inverse Time Decay

η_t = η_0 / (1 + λt)

Explanation: A simple learning-rate schedule that gradually decreases the step size. It helps with convergence once near a minimum.

MSE Loss

L_MSE(θ) = (1/(2n)) Σ_{i=1}^{n} (θᵀx_i − y_i)²

Explanation: Mean squared error for linear regression. The factor 1/2 simplifies derivatives; the gradient is Xᵀ(Xθ − y)/n.

Logistic Loss

L_log(θ) = −(1/n) Σ_{i=1}^{n} [ y_i log σ(θᵀx_i) + (1 − y_i) log(1 − σ(θᵀx_i)) ]

Explanation: Binary cross-entropy loss for logistic regression, where σ is the sigmoid. Its gradient equals the average of (ŷ − y)x over the data.

L2 Regularization

L_reg(θ) = L(θ) + (λ/2) ‖θ‖₂²

Explanation: Adds a penalty proportional to the squared norm of parameters. It discourages large weights and can improve generalization.

Robbins–Monro Conditions

Σ_{t=1}^{∞} η_t = ∞,  Σ_{t=1}^{∞} η_t² < ∞

Explanation: Classical requirements for diminishing step sizes to ensure convergence of stochastic approximation methods under convexity and smoothness.

Complexity Analysis

Let n be the number of examples, d the number of features, b the mini-batch size, and E the number of epochs. Computing a mini-batch gradient for linear or logistic models costs O(b d): for each of b examples, we compute a dot product with d features and accumulate gradients. An epoch contains approximately n / b updates, so the total time per epoch is O((n / b) · b d) = O(n d). Thus, for E epochs, the overall time is O(E n d). This matches full-batch gradient descent per epoch but allows much earlier, incremental parameter updates and better use of streaming and parallel hardware. For deep networks, replace d by the number of parameters P; the per-update cost becomes O(b · C(P)), where C(P) is the cost of one forward/backward pass, and still scales linearly with batch size.

Space complexity depends on data storage and model size. Storing the full dataset requires O(n d) memory, though data can instead be streamed from disk. The model parameters and their gradients typically require O(d) (or O(P)) memory. With momentum or Adam-like methods, you store additional vectors of size O(d), giving O(d) auxiliary memory. The mini-batch itself needs O(b d) transient space if materialized; in streaming implementations, you can bound peak memory by O(b d + d). Random shuffling per epoch can be implemented by permuting an integer index array of size n in O(n) time and O(n) space.

Overall, mini-batch SGD's per-step computation and memory footprint are modest, making it suitable for large-scale problems. Its wall-clock efficiency often outperforms full-batch methods by delivering informative updates earlier and by exploiting parallel compute over mini-batches.

Code Examples

Mini-batch SGD for Linear Regression (MSE) with Random Shuffling
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cassert>

// Utility: dot product between w and x
double dot(const std::vector<double>& w, const std::vector<double>& x) {
    assert(w.size() == x.size());
    double s = 0.0;
    for (size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
    return s;
}

// Generate synthetic linear data: y = w_true^T x + noise
void make_dataset(size_t n, size_t d, std::vector<std::vector<double>>& X, std::vector<double>& y,
                  std::mt19937& rng) {
    std::normal_distribution<double> nd(0.0, 1.0);
    std::vector<double> w_true(d);
    for (size_t j = 0; j < d; ++j) w_true[j] = nd(rng);
    std::normal_distribution<double> noise(0.0, 0.1);

    X.assign(n, std::vector<double>(d));
    y.assign(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) {
            X[i][j] = nd(rng); // standard normal features
        }
        y[i] = dot(w_true, X[i]) + noise(rng);
    }
}

// Compute mean squared error over dataset
double mse(const std::vector<std::vector<double>>& X, const std::vector<double>& y, const std::vector<double>& w) {
    size_t n = X.size();
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double pred = dot(w, X[i]);
        double e = pred - y[i];
        s += e * e;
    }
    return 0.5 * s / static_cast<double>(n);
}

// Create mini-batches by shuffling indices each epoch
std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t batch_size, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng); // random shuffling each epoch

    std::vector<std::vector<size_t>> batches;
    for (size_t start = 0; start < n; start += batch_size) {
        size_t end = std::min(n, start + batch_size);
        batches.emplace_back(idx.begin() + start, idx.begin() + end);
    }
    return batches;
}

int main() {
    // Reproducible RNG
    std::mt19937 rng(42);

    // Hyperparameters
    size_t n = 5000;     // samples
    size_t d = 20;       // features
    size_t epochs = 10;  // number of passes over data
    size_t batch = 64;   // mini-batch size
    double lr = 0.1;     // learning rate

    // Data
    std::vector<std::vector<double>> X;
    std::vector<double> y;
    make_dataset(n, d, X, y, rng);

    // Parameters initialized to zeros
    std::vector<double> w(d, 0.0);

    // Training loop
    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            // Compute mini-batch gradient: grad = (1/b) * X_B^T (X_B w - y_B)
            std::vector<double> grad(d, 0.0);
            for (size_t idx : B) {
                double err = dot(w, X[idx]) - y[idx];
                // accumulate gradient per feature
                for (size_t j = 0; j < d; ++j) grad[j] += err * X[idx][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) grad[j] *= invb;
            // Parameter update: w <- w - lr * grad
            for (size_t j = 0; j < d; ++j) w[j] -= lr * grad[j];
        }
        double train_mse = mse(X, y, w);
        std::cout << "Epoch " << e + 1 << "/" << epochs << ": MSE = " << train_mse << "\n";
    }

    // Report final first few weights
    std::cout << "First 5 weights: ";
    for (size_t j = 0; j < std::min<size_t>(5, w.size()); ++j) std::cout << w[j] << ' ';
    std::cout << "\n";
    return 0;
}

This program trains a linear regression model using mini-batch SGD on synthetic data. Each epoch shuffles sample indices, splits them into mini-batches, computes the average gradient for each batch, and updates parameters using the learning rate. We track mean squared error to observe convergence. The code demonstrates core SGD mechanics: random shuffling, mini-batch averaging, and per-batch updates.

Time: O(E * n * d). Space: O(n * d + d).
Mini-batch SGD for Logistic Regression with Momentum, L2 Regularization, and Learning-Rate Decay
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>

// Sigmoid function with numerical stability for large magnitude inputs
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = std::exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = std::exp(z);
        return ez / (1.0 + ez);
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Make a simple 2D binary classification dataset: two Gaussians
void make_dataset(size_t n_per_class, std::vector<std::vector<double>>& X, std::vector<int>& y, std::mt19937& rng) {
    std::normal_distribution<double> c1x(-2.0, 1.0), c1y(0.0, 1.0);
    std::normal_distribution<double> c2x(2.0, 1.0), c2y(0.5, 1.0);
    X.clear(); y.clear();
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c1x(rng), c1y(rng), 1.0}); // + bias feature
        y.push_back(0);
    }
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c2x(rng), c2y(rng), 1.0});
        y.push_back(1);
    }
}

std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t b, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    std::vector<std::vector<size_t>> batches;
    for (size_t s = 0; s < n; s += b) batches.emplace_back(idx.begin() + s, idx.begin() + std::min(n, s + b));
    return batches;
}

// Compute binary cross-entropy loss averaged over dataset
double bce_loss(const std::vector<std::vector<double>>& X, const std::vector<int>& y, const std::vector<double>& w, double lambda) {
    const double eps = 1e-12;
    double s = 0.0;
    for (size_t i = 0; i < X.size(); ++i) {
        double p = sigmoid(dot(w, X[i]));
        p = std::min(std::max(p, eps), 1.0 - eps);
        s += -(y[i] * std::log(p) + (1 - y[i]) * std::log(1 - p));
    }
    // L2 penalty (lambda/2)||w||^2, excluding bias term (last weight)
    double reg = 0.0;
    for (size_t j = 0; j + 1 < w.size(); ++j) reg += w[j] * w[j];
    return (s / static_cast<double>(X.size())) + 0.5 * lambda * reg;
}

int main() {
    std::mt19937 rng(123);

    // Hyperparameters
    size_t n_per_class = 2000; // total n = 4000
    size_t batch = 64;
    size_t epochs = 15;
    double lr0 = 0.5;      // initial learning rate (features are small-scale)
    double decay = 1e-3;   // inverse time decay: lr_t = lr0 / (1 + decay * t)
    double beta = 0.9;     // momentum coefficient
    double lambda = 1e-3;  // L2 regularization strength

    // Data (with bias feature)
    std::vector<std::vector<double>> X;
    std::vector<int> y;
    make_dataset(n_per_class, X, y, rng);
    const size_t n = X.size();
    const size_t d = X[0].size(); // includes bias term

    // Parameters and momentum buffer
    std::vector<double> w(d, 0.0);
    std::vector<double> v(d, 0.0);

    size_t t = 0; // global step counter for learning-rate decay

    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            ++t;
            double lr = lr0 / (1.0 + decay * static_cast<double>(t));
            // Compute mini-batch gradient of logistic loss + L2 (excluding bias)
            std::vector<double> g(d, 0.0);
            for (size_t i : B) {
                double p = sigmoid(dot(w, X[i]));
                double diff = (p - static_cast<double>(y[i]));
                for (size_t j = 0; j < d; ++j) g[j] += diff * X[i][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) g[j] *= invb;
            // Add L2 gradient lambda * w (exclude bias term)
            for (size_t j = 0; j + 1 < d; ++j) g[j] += lambda * w[j];

            // Momentum update: v = beta * v + (1 - beta) * g; w -= lr * v
            for (size_t j = 0; j < d; ++j) {
                v[j] = beta * v[j] + (1.0 - beta) * g[j];
                w[j] -= lr * v[j];
            }
        }
        // Evaluate average loss and accuracy at epoch end
        double loss = bce_loss(X, y, w, lambda);
        size_t correct = 0;
        for (size_t i = 0; i < n; ++i) {
            double p = sigmoid(dot(w, X[i]));
            int pred = (p >= 0.5) ? 1 : 0;
            if (pred == y[i]) ++correct;
        }
        double acc = static_cast<double>(correct) / static_cast<double>(n);
        std::cout << "Epoch " << (e + 1) << "/" << epochs << ": loss=" << loss << ", acc=" << acc << "\n";
    }

    std::cout << "Weights: ";
    for (double wi : w) std::cout << wi << ' ';
    std::cout << "\n";
    return 0;
}

This program trains a binary logistic regression classifier using mini-batch SGD with momentum, L2 regularization, and an inverse-time learning-rate decay. A simple synthetic 2D dataset is generated from two Gaussians; a bias feature is appended. Each update computes the average mini-batch gradient, adds the L2 term (excluding bias), applies momentum, and uses a decayed learning rate. The epoch-end loss and accuracy show convergence and classification performance.

Time: O(E * n * d). Space: O(n * d + d).
Tags: stochastic gradient descent, mini-batch, random shuffling, momentum, learning rate schedule, L2 regularization, logistic regression, linear regression, empirical risk minimization, convex optimization, gradient estimation, variance reduction, epoch, batch size, convergence