Stochastic Gradient Descent (SGD)
Key Points
- Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of data, making learning faster and more memory-efficient.
- Mini-batch sampling provides an unbiased estimate of the true gradient while reducing variance compared to single-sample updates.
- Random shuffling each epoch is crucial to avoid bias and to ensure good mixing of examples during training.
- Learning rate schedules and momentum are practical tools that stabilize SGD and speed up convergence.
- For convex problems like linear or logistic regression, SGD converges under mild step-size conditions; for deep nets it often finds useful minima in practice.
- Per-update cost scales with the batch size and number of features, enabling training on very large datasets.
- Normalization and proper regularization (e.g., L2) significantly improve stability and generalization in SGD.
- Common pitfalls include too-large learning rates, forgetting to average gradients by batch size, and not decaying the learning rate.
Prerequisites
- Vectors and dot products — Gradient computations in linear and logistic models are built from dot products and linear algebra.
- Derivatives and chain rule — SGD updates use gradients of the loss with respect to parameters.
- Probability and expectation — Understanding unbiased gradient estimators and variance requires basic probability.
- Convex vs. non-convex optimization — Convergence guarantees differ between convex problems and deep networks.
- Overfitting and regularization — SGD benefits from regularization like L2 to improve generalization.
Detailed Explanation
01 Overview
Stochastic Gradient Descent (SGD) is a workhorse algorithm for training machine learning models by iteratively adjusting parameters to minimize a loss function. Instead of computing the gradient on the entire dataset (which can be very large and slow), SGD uses a small, randomly selected subset called a mini-batch to estimate the gradient and take a step. This approach drastically reduces computation per update and introduces noise that can help escape shallow local minima and flat plateaus. Mini-batch SGD with random sampling is the most common variant used in practice: during each epoch, we randomly shuffle the data and repeatedly draw mini-batches of size b to perform updates. The process repeats for multiple epochs until convergence criteria are met (e.g., training loss stabilizes or a maximum number of epochs is reached). The algorithm is simple, scalable, and surprisingly powerful. Think of tuning a radio in a noisy environment: you nudge the dial based on partial, noisy feedback, yet eventually lock onto the station. SGD likewise nudges parameters based on noisy gradient estimates. A concrete example is fitting a line to data (linear regression) by repeatedly updating the slope and intercept using small random groups of points.
02 Intuition & Analogies
Imagine hiking down a foggy mountain. Full-batch gradient descent waits for the fog to clear entirely before choosing the perfect downhill direction—accurate, but it takes a long time to see far enough. SGD instead feels the slope under your feet using a handful of pebbles nearby (a mini-batch). Each step may not be perfect, but it is quick, and over many steps you still move downhill effectively. Randomly changing which pebbles you feel (random shuffling) prevents you from being tricked by a peculiar local arrangement of rocks and helps you explore more directions. Batch size is like how many pebbles you sample at once: one pebble (pure SGD) is noisy and jittery; a large handful (large batch) is stable but slower per update. Momentum is like adding a bit of inertia—you don’t immediately react to every jitter; instead, you keep moving in a smoothed, averaged downhill direction. Learning-rate decay is like taking shorter steps as you approach the valley to avoid overshooting. Regularization is a gentle elastic band pulling you toward simpler solutions to avoid overfitting the noisy terrain. In practice, this combination—mini-batches, shuffling, momentum, and decaying step sizes—makes SGD both fast and robust on massive datasets, from simple linear models to deep neural networks.
03 Formal Definition
Given an objective F(w) = (1/n) Σᵢ ℓ(w; xᵢ, yᵢ) averaged over n examples, mini-batch SGD draws a random batch Bₜ of size b at step t and updates wₜ₊₁ = wₜ − ηₜ · (1/b) Σ_{i∈Bₜ} ∇ℓ(wₜ; xᵢ, yᵢ), where ηₜ is the learning rate. The formulas below make each ingredient precise.
04 When to Use
Use mini-batch SGD when datasets are large enough that full-batch gradients are expensive or cannot fit in memory, or when you want fast, incremental updates (e.g., online or streaming scenarios). It shines for models where computing per-example gradients is cheap and additive across samples: linear regression, logistic regression, matrix factorization, embeddings, and deep neural networks. Choose mini-batches to balance noise and throughput: small batches provide more frequent updates and better generalization in some cases; moderate to large batches can exploit parallel hardware. Random shuffling each epoch avoids cyclic patterns and ensures unbiased coverage of the dataset. Momentum is particularly helpful when the loss landscape has ravines—steep in one direction and flat in another—common in deep nets; it accelerates progress along consistent directions and damps oscillations. Learning-rate schedules (step decay, cosine, or 1/t) help you start with bold steps and then fine-tune. Opt for L2 regularization when you suspect overfitting, and normalize inputs to stabilize gradients. If your data arrives over time, SGD naturally adapts by updating with each new mini-batch without retraining from scratch.
⚠️ Common Mistakes
- Using a learning rate that is too large causes divergence or wildly oscillating loss; too small makes training painfully slow. Start with a reasonable default (e.g., 0.1 for normalized linear/logistic models or 1e-3 for deep nets) and tune.
- Forgetting to average by batch size b makes gradients b times larger than intended, effectively increasing the learning rate and destabilizing training.
- Not shuffling data each epoch can introduce bias (e.g., class-ordered data) and lead to poor minima or zig-zagging behavior.
- Skipping feature scaling leads to gradients of very different magnitudes per feature, causing slow or unstable convergence; standardize or normalize inputs.
- Ignoring regularization in high-dimensional problems often overfits; add L2 (weight decay) or early stopping.
- Using momentum incorrectly (e.g., applying the learning rate both inside the velocity update and again on the parameter update, or confusing different momentum formulations) changes the effective step and can hinder convergence.
- Evaluating progress only on the noisy mini-batch loss is misleading; track a smoothed training loss or an epoch-level metric and monitor validation performance.
- Changing batch size without re-tuning the learning rate alters the scale and noise of gradients; consider linear scaling of the learning rate with batch size as a heuristic.
Key Formulas
Empirical Risk
F(w) = (1/n) Σᵢ₌₁ⁿ ℓ(w; xᵢ, yᵢ)
Explanation: This is the average loss over the dataset. SGD aims to minimize this objective by sampling small subsets of terms.
Mini-batch Gradient Estimator
gₜ = (1/b) Σ_{i∈Bₜ} ∇ℓ(wₜ; xᵢ, yᵢ)
Explanation: The gradient used at step t is the average of per-example gradients over the mini-batch Bₜ. It is cheaper than the full gradient and still points approximately downhill.
SGD Update Rule
wₜ₊₁ = wₜ − ηₜ gₜ
Explanation: Parameters move opposite the gradient by a step size ηₜ. This is the core iterative update of SGD.
Unbiasedness
E[gₜ] = ∇F(wₜ)
Explanation: Under uniform random sampling, the expected mini-batch gradient equals the true gradient. This property underpins SGD’s convergence behavior.
Variance vs. Batch Size
Var(gₜ) ≈ σ² / b
Explanation: The variance of the mini-batch gradient decreases roughly inversely with batch size b. Larger batches yield more stable updates but cost more computation.
Momentum (Exponential Moving Average)
vₜ₊₁ = β vₜ + (1 − β) gₜ,  wₜ₊₁ = wₜ − ηₜ vₜ₊₁
Explanation: Momentum keeps a moving average of gradients to smooth noise and accelerate along consistent directions. The parameter β controls how much past gradients matter.
Inverse Time Decay
ηₜ = η₀ / (1 + k·t)
Explanation: A simple learning-rate schedule that gradually decreases the step size with decay constant k. It helps with convergence once near a minimum.
MSE Loss
L(w) = (1/2n) ‖Xw − y‖²
Explanation: Mean squared error for linear regression. The factor 1/2 simplifies derivatives; the gradient is Xᵀ(Xw − y)/n.
Logistic Loss
L(w) = −(1/n) Σᵢ [yᵢ log σ(wᵀxᵢ) + (1 − yᵢ) log(1 − σ(wᵀxᵢ))]
Explanation: Binary cross-entropy loss for logistic regression, where σ is the sigmoid. Its gradient equals the average of (σ(wᵀxᵢ) − yᵢ)xᵢ.
L2 Regularization
L_reg(w) = L(w) + (λ/2) ‖w‖²
Explanation: Adds a penalty proportional to the squared norm of parameters. It discourages large weights and can improve generalization.
Robbins–Monro Conditions
Σₜ ηₜ = ∞  and  Σₜ ηₜ² < ∞
Explanation: Classical requirements for diminishing step sizes to ensure convergence of stochastic approximation methods under convexity and smoothness.
Complexity Analysis
Each mini-batch update touches b examples with d features, so it costs O(b·d) time; one epoch over n examples therefore costs O(n·d) regardless of batch size. Memory is O(d) for the parameters plus O(b·d) for the active mini-batch, which is what lets SGD scale to datasets that do not fit in memory at once.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>
#include <cassert>

// Utility: dot product between w and x
double dot(const std::vector<double>& w, const std::vector<double>& x) {
    assert(w.size() == x.size());
    double s = 0.0;
    for (size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
    return s;
}

// Generate synthetic linear data: y = w_true^T x + noise
void make_dataset(size_t n, size_t d, std::vector<std::vector<double>>& X, std::vector<double>& y,
                  std::mt19937& rng) {
    std::normal_distribution<double> nd(0.0, 1.0);
    std::vector<double> w_true(d);
    for (size_t j = 0; j < d; ++j) w_true[j] = nd(rng);
    std::normal_distribution<double> noise(0.0, 0.1);

    X.assign(n, std::vector<double>(d));
    y.assign(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) {
            X[i][j] = nd(rng); // standard normal features
        }
        y[i] = dot(w_true, X[i]) + noise(rng);
    }
}

// Compute mean squared error over dataset
double mse(const std::vector<std::vector<double>>& X, const std::vector<double>& y, const std::vector<double>& w) {
    size_t n = X.size();
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double pred = dot(w, X[i]);
        double e = pred - y[i];
        s += e * e;
    }
    return 0.5 * s / static_cast<double>(n);
}

// Create mini-batches by shuffling indices each epoch
std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t batch_size, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng); // random shuffling each epoch

    std::vector<std::vector<size_t>> batches;
    for (size_t start = 0; start < n; start += batch_size) {
        size_t end = std::min(n, start + batch_size);
        batches.emplace_back(idx.begin() + start, idx.begin() + end);
    }
    return batches;
}

int main() {
    // Reproducible RNG
    std::mt19937 rng(42);

    // Hyperparameters
    size_t n = 5000;      // samples
    size_t d = 20;        // features
    size_t epochs = 10;   // number of passes over data
    size_t batch = 64;    // mini-batch size
    double lr = 0.1;      // learning rate

    // Data
    std::vector<std::vector<double>> X;
    std::vector<double> y;
    make_dataset(n, d, X, y, rng);

    // Parameters initialized to zeros
    std::vector<double> w(d, 0.0);

    // Training loop
    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            // Compute mini-batch gradient: grad = (1/b) * X_B^T (X_B w - y_B)
            std::vector<double> grad(d, 0.0);
            for (size_t idx : B) {
                double err = dot(w, X[idx]) - y[idx];
                // accumulate gradient per feature
                for (size_t j = 0; j < d; ++j) grad[j] += err * X[idx][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) grad[j] *= invb;
            // Parameter update: w <- w - lr * grad
            for (size_t j = 0; j < d; ++j) w[j] -= lr * grad[j];
        }
        double train_mse = mse(X, y, w);
        std::cout << "Epoch " << e+1 << "/" << epochs << ": MSE = " << train_mse << "\n";
    }

    // Report final first few weights
    std::cout << "First 5 weights: ";
    for (size_t j = 0; j < std::min<size_t>(5, w.size()); ++j) std::cout << w[j] << ' ';
    std::cout << "\n";
    return 0;
}
```
This program trains a linear regression model using mini-batch SGD on synthetic data. Each epoch shuffles sample indices, splits them into mini-batches, computes the average gradient for each batch, and updates parameters using the learning rate. We track mean squared error to observe convergence. The code demonstrates core SGD mechanics: random shuffling, mini-batch averaging, and per-batch updates.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <numeric>
#include <cmath>

// Sigmoid function with numerical stability for large magnitude inputs
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = std::exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = std::exp(z);
        return ez / (1.0 + ez);
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Make a simple 2D binary classification dataset: two Gaussians
void make_dataset(size_t n_per_class, std::vector<std::vector<double>>& X, std::vector<int>& y, std::mt19937& rng) {
    std::normal_distribution<double> c1x(-2.0, 1.0), c1y(0.0, 1.0);
    std::normal_distribution<double> c2x(2.0, 1.0), c2y(0.5, 1.0);
    X.clear(); y.clear();
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c1x(rng), c1y(rng), 1.0}); // + bias feature
        y.push_back(0);
    }
    for (size_t i = 0; i < n_per_class; ++i) {
        X.push_back({c2x(rng), c2y(rng), 1.0});
        y.push_back(1);
    }
}

std::vector<std::vector<size_t>> make_minibatches(size_t n, size_t b, std::mt19937& rng) {
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);
    std::vector<std::vector<size_t>> batches;
    for (size_t s = 0; s < n; s += b) batches.emplace_back(idx.begin()+s, idx.begin()+std::min(n, s+b));
    return batches;
}

// Compute binary cross-entropy loss averaged over dataset
double bce_loss(const std::vector<std::vector<double>>& X, const std::vector<int>& y, const std::vector<double>& w, double lambda) {
    const double eps = 1e-12;
    double s = 0.0;
    for (size_t i = 0; i < X.size(); ++i) {
        double p = sigmoid(dot(w, X[i]));
        p = std::min(std::max(p, eps), 1.0 - eps);
        s += - (y[i] * std::log(p) + (1 - y[i]) * std::log(1 - p));
    }
    // L2 penalty (lambda/2)||w||^2, excluding bias term (last weight)
    double reg = 0.0;
    for (size_t j = 0; j + 1 < w.size(); ++j) reg += w[j] * w[j];
    return (s / static_cast<double>(X.size())) + 0.5 * lambda * reg;
}

int main() {
    std::mt19937 rng(123);

    // Hyperparameters
    size_t n_per_class = 2000; // total n = 4000
    size_t batch = 64;
    size_t epochs = 15;
    double lr0 = 0.5;          // initial learning rate (features are small-scale)
    double decay = 1e-3;       // inverse time decay: lr_t = lr0 / (1 + decay * t)
    double beta = 0.9;         // momentum coefficient
    double lambda = 1e-3;      // L2 regularization strength

    // Data (with bias feature)
    std::vector<std::vector<double>> X;
    std::vector<int> y;
    make_dataset(n_per_class, X, y, rng);
    const size_t n = X.size();
    const size_t d = X[0].size(); // includes bias term

    // Parameters and momentum buffer
    std::vector<double> w(d, 0.0);
    std::vector<double> v(d, 0.0);

    size_t t = 0; // global step counter for learning-rate decay

    for (size_t e = 0; e < epochs; ++e) {
        auto batches = make_minibatches(n, batch, rng);
        for (const auto& B : batches) {
            ++t;
            double lr = lr0 / (1.0 + decay * static_cast<double>(t));
            // Compute mini-batch gradient of logistic loss + L2 (excluding bias)
            std::vector<double> g(d, 0.0);
            for (size_t i : B) {
                double p = sigmoid(dot(w, X[i]));
                double diff = (p - static_cast<double>(y[i]));
                for (size_t j = 0; j < d; ++j) g[j] += diff * X[i][j];
            }
            double invb = 1.0 / static_cast<double>(B.size());
            for (size_t j = 0; j < d; ++j) g[j] *= invb;
            // Add L2 gradient lambda * w (exclude bias term)
            for (size_t j = 0; j + 1 < d; ++j) g[j] += lambda * w[j];

            // Momentum update: v = beta * v + (1 - beta) * g; w -= lr * v
            for (size_t j = 0; j < d; ++j) {
                v[j] = beta * v[j] + (1.0 - beta) * g[j];
                w[j] -= lr * v[j];
            }
        }
        // Evaluate average loss and accuracy at epoch end
        double loss = bce_loss(X, y, w, lambda);
        size_t correct = 0;
        for (size_t i = 0; i < n; ++i) {
            double p = sigmoid(dot(w, X[i]));
            int pred = (p >= 0.5) ? 1 : 0;
            if (pred == y[i]) ++correct;
        }
        double acc = static_cast<double>(correct) / static_cast<double>(n);
        std::cout << "Epoch " << (e+1) << "/" << epochs << ": loss=" << loss << ", acc=" << acc << "\n";
    }

    std::cout << "Weights: ";
    for (double wi : w) std::cout << wi << ' ';
    std::cout << "\n";
    return 0;
}
```
This program trains a binary logistic regression classifier using mini-batch SGD with momentum, L2 regularization, and an inverse-time learning-rate decay. A simple synthetic 2D dataset is generated from two Gaussians; a bias feature is appended. Each update computes the average mini-batch gradient, adds the L2 term (excluding bias), applies momentum, and uses a decayed learning rate. The epoch-end loss and accuracy show convergence and classification performance.