Gradient Descent
Key Points
- Gradient descent is a simple, repeatable way to move downhill on a loss surface by stepping in the opposite direction of the gradient.
- Batch gradient descent uses the full dataset to compute the gradient each step, making it stable but potentially slower per iteration.
- Choosing a good learning rate (step size) is crucial; too large diverges, too small converges very slowly.
- For convex, smooth losses like linear regression, batch gradient descent reliably converges to the global minimum.
- Per-iteration cost is O(n d), where n is the number of examples and d is the number of features.
- Feature scaling (standardization) often dramatically improves convergence speed and stability.
- Stopping criteria typically monitor the gradient norm, the loss improvement, or a maximum number of iterations.
- C++ implementations should pay attention to numerical stability (overflow in exp, underflow in probabilities) and careful vectorized loops.
Prerequisites
- Multivariable calculus (gradients and partial derivatives) — Gradient descent requires computing and understanding ∇L(θ).
- Linear algebra (vectors, matrices, matrix-vector products) — Batch gradients for linear/logistic regression use X^T(Xθ−y) and X^T(p−y).
- Convexity basics — Convergence guarantees and step-size selection rely on convexity and smoothness assumptions.
- Floating-point arithmetic and numerical stability — C++ implementations must avoid overflow/underflow, especially for exp/log in logistic regression.
Detailed Explanation
01 Overview
Hook: Imagine hiking down a hill in the fog with only a local slope meter—you can’t see the valley, but you can feel which way is downhill. Gradient descent is that hiker. It repeatedly measures the slope (the gradient) at the current position and takes a step downhill to reduce the loss (error).

Concept: In machine learning and optimization, we often want to find parameters θ that minimize a loss function L(θ). The gradient ∇L(θ) points in the direction of steepest ascent; stepping in the opposite direction decreases the loss. Batch gradient descent computes this gradient using the entire dataset each iteration, making each step exact (for that dataset) but computationally heavier than stochastic variants.

Example: In linear regression, L(θ) is the mean squared error between predictions and targets. By computing the gradient over all samples, we update θ so that predictions get closer to targets overall, iteration by iteration. Over time—and with a suitable learning rate—θ converges toward parameters that minimize the error. The algorithm is simple, applies to any differentiable loss, and forms the backbone of many learning systems.
02 Intuition & Analogies
Hook: Think of trying to flatten a wrinkled bedsheet. You press where the bump is highest and move your hands outward, repeatedly smoothing until it’s flat. Gradient descent “presses” where the error is greatest and moves parameters in the direction that flattens the error surface.

Concept: The gradient is like a compass pointing toward the steepest increase of error. If we walk the other way by some small step, we reduce the error. The learning rate is how big that step is—like how far you move your hands each time you press the sheet. Too big and you overshoot and create new wrinkles; too small and it takes forever. Batch gradient descent uses all the evidence (every data point) to choose the movement direction, like polling everyone in a crowd before deciding.

Example: For a simple parabola L(θ) = (θ − 3)^2, the slope at θ tells you whether 3 is to the left or right and how far. If θ = 0, the slope is −6, so stepping right (opposite of the negative slope) moves you toward θ = 3. For linear regression with many features, each feature’s contribution to the error becomes a component in the gradient vector. Summing over the whole dataset (batch) ensures our step points toward reducing the average error, not just the error of one sample.
03 Formal Definition
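A minimal formal statement of the method, using the same notation as the Key Formulas section below:

```latex
\textbf{Batch gradient descent.}\quad
\text{Given a differentiable loss } L : \mathbb{R}^d \to \mathbb{R},
\text{ a starting point } \theta_0 \in \mathbb{R}^d,
\text{ and a learning rate } \eta > 0, \text{ iterate}
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta \, \nabla L(\theta_t),
  \qquad
  \nabla L(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta\, \ell(\theta;\, x_i, y_i),
\]
\text{where the gradient is computed over the full dataset of } n \text{ samples,}
\text{ until a stopping criterion such as } \lVert \nabla L(\theta_t) \rVert \le \epsilon
\text{ or } t \ge T_{\max} \text{ is met.}
```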
04 When to Use
Hook: Use the right hammer for the right nail. Batch gradient descent is the hammer for smooth, differentiable losses when you can afford full passes over the data.

Concept: Apply batch gradient descent when the dataset fits in memory or streaming a full pass is acceptable, and when you want stable, low-variance gradient estimates per step. It shines on convex problems like linear regression and logistic regression, where a global minimum exists and each step makes steady progress. It also serves as a strong baseline and sanity check for more complex optimizers.

Example use cases: (1) Fitting linear or logistic regression on medium-sized datasets (n·d manageable). (2) Training small neural networks for didactic purposes or when determinism and stability matter. (3) Computing precise gradients for ill-conditioned problems before switching to quasi-Newton methods.

Consider alternatives when: the data is huge (prefer mini-batch/SGD), the loss is highly non-convex and noisy (adaptive optimizers may help), or second-order curvature information (Newton/LM/L-BFGS) yields faster convergence per iteration despite higher cost.
⚠️ Common Mistakes
Hook: Most failures come from two knobs: step size and scaling.

Concept: Common pitfalls include (1) a learning rate too large, causing divergence or oscillation; (2) a learning rate too small, leading to painfully slow progress; (3) unscaled features, which distort the loss surface and make a single learning rate ineffective; (4) forgetting the bias term or mishandling it during scaling; (5) using integer arithmetic or low precision, causing rounding errors; (6) incorrect gradient signs or averaging factors (a missing 1/n); (7) poor stopping rules—either stopping too early due to noisy loss or too late due to tiny improvement; (8) ignoring numerical stability in functions like exp for logistic regression.

Example fixes: Standardize features (zero mean, unit variance) and keep the bias unscaled or handled separately. Start with a conservative η (e.g., a 1/L estimate, or a small value like 1e-2) and consider decay schedules. Verify gradients with finite-difference checks on small problems. Monitor both the loss and the gradient norm. Use double precision in C++ and clamp logits in the sigmoid to avoid overflow. Log progress and experiment with different η values to find a stable, fast regime.
Key Formulas
Gradient Descent Update
θ_{t+1} = θ_t − η ∇L(θ_t)
Explanation: Update parameters by stepping opposite the gradient, scaled by the learning rate η. This is the core rule for batch gradient descent.
Empirical Risk
L(θ) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i, y_i)
Explanation: The training loss is the average of per-sample losses. Batch gradient descent differentiates this average to form the update.
MSE Loss (Linear Regression)
L(θ) = (1/(2n)) ‖Xθ − y‖²
Explanation: This measures the average squared error between predictions and targets. The 1/2 factor cancels when differentiating, simplifying the gradient expression.
Linear Regression Gradient
∇L(θ) = (1/n) X^T(Xθ − y)
Explanation: For mean squared error, the gradient equals the data matrix transpose times the residual vector, averaged over n samples.
Sigmoid
σ(z) = 1 / (1 + e^{−z})
Explanation: Maps any real input to a probability in (0, 1). Used to convert linear scores to probabilities in logistic regression.
Logistic Regression Loss
L(θ) = −(1/n) Σ_{i=1}^{n} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],  p_i = σ(x_i^T θ)
Explanation: Cross-entropy penalizes confident wrong predictions heavily. It is convex in θ for binary logistic regression.
Logistic Regression Gradient
∇L(θ) = (1/n) X^T(p − y)
Explanation: The gradient is the data matrix transpose times the probability-error vector, averaged over all samples.
Sublinear Convergence (Convex, L-smooth)
f(θ_t) − f(θ*) ≤ L ‖θ_0 − θ*‖² / (2t)
Explanation: With step size 1/L on an L-smooth convex function, gradient descent’s suboptimality shrinks as O(1/t). This gives a rate guarantee without strong convexity.
Linear Convergence (Strongly Convex)
‖θ_t − θ*‖² ≤ (1 − μ/L)^t ‖θ_0 − θ*‖²
Explanation: If f is μ-strongly convex and L-smooth, gradient descent with a suitable step size converges geometrically. The ratio μ/L controls the speed.
Learning Rate Decay
η_t = η_0 / (1 + k t)
Explanation: A decaying step size can stabilize and improve convergence when curvature varies. k controls how quickly the learning rate shrinks.
Gradient-Norm Stopping
Stop when ‖∇L(θ_t)‖ ≤ ε
Explanation: Stop when the gradient is small, indicating a (near) stationary point. This is robust for smooth problems.
Relative Loss Decrease Stopping
Stop when |L_{t−1} − L_t| / max(|L_{t−1}|, ε) ≤ tol
Explanation: Stop when successive loss improvements become negligible. The max in the denominator avoids division by very small numbers.
Complexity Analysis
Each iteration of batch gradient descent costs O(n d) time: computing the predictions Xθ is O(n d), forming the residuals is O(n), and accumulating the gradient X^T r is another O(n d). Memory is O(n d) for the data plus O(d) for the parameters and the gradient. With T iterations, the total cost is O(T n d).
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Minimize L(theta) = (theta - 3)^2 using batch gradient descent
// Analytical gradient: dL/dtheta = 2*(theta - 3)

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    auto L = [](double th) { return (th - 3.0) * (th - 3.0); };
    auto grad = [](double th) { return 2.0 * (th - 3.0); };

    double theta = 0.0;  // initial guess
    double eta = 0.2;    // learning rate
    int max_iters = 50;
    double tol = 1e-9;   // gradient-norm stopping

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        double g = grad(theta);
        cout << "iter=" << t
             << " theta=" << theta
             << " L=" << L(theta)
             << " |grad|=" << fabs(g) << "\n";
        if (fabs(g) <= tol) break;
        theta -= eta * g;  // theta_{t+1} = theta_t - eta * grad
    }

    cout << "Final theta ~ " << theta << " (true minimizer is 3.0)\n";
    return 0;
}
This program demonstrates the basic gradient descent update on a simple convex function. The gradient is computed exactly, and the parameter is updated by stepping opposite the gradient. You can experiment with the learning rate: too large can cause divergence or oscillation; moderate values converge quickly to 3.0.
#include <bits/stdc++.h>
using namespace std;

// Linear regression with batch gradient descent
// Minimize L(theta) = (1/(2n)) * sum_i (x_i^T theta - y_i)^2
// Gradient: (1/n) * X^T (X theta - y)

struct Dataset {
    vector<vector<double>> X; // n x d (will include bias as last column)
    vector<double> y;         // n
};

// Generate synthetic linear data with bias and Gaussian noise
Dataset make_data(int n, int d_no_bias, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.3);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1); // include bias at end
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0.0);

    uniform_real_distribution<double> xdist(-2.0, 2.0);
    for (int i = 0; i < n; ++i) {
        double yclean = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            yclean += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias feature
        yclean += true_w[d_no_bias];
        ds.y[i] = yclean + noise(rng);
    }
    return ds;
}

// Standardize each non-bias feature: zero mean, unit variance
void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    // Exclude last column (bias) from scaling
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12; // avoid divide-by-zero
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500;       // samples
    int d_no_bias = 5; // real features (we will append bias)
    Dataset ds = make_data(n, d_no_bias, 123);

    // Scale features (not bias)
    standardize_features(ds.X);

    int d = d_no_bias + 1;        // include bias feature
    vector<double> theta(d, 0.0); // initialize to zeros

    double eta = 0.1;     // learning rate
    int max_iters = 1000;
    double tol = 1e-8;    // gradient-norm stopping

    vector<double> yhat(n), residual(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double sse = 0.0;
        for (int i = 0; i < n; ++i) {
            double pred = 0.0;
            for (int j = 0; j < d; ++j) pred += ds.X[i][j] * th[j];
            double e = pred - ds.y[i];
            sse += e * e;
        }
        return 0.5 * sse / n; // (1/2n) * ||X theta - y||^2
    };

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // yhat = X theta
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < d; ++j) s += ds.X[i][j] * theta[j];
            yhat[i] = s;
        }
        // residual = yhat - y
        for (int i = 0; i < n; ++i) residual[i] = yhat[i] - ds.y[i];
        // grad = (1/n) * X^T residual
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * residual[i];
            grad[j] = gj / n;
        }
        // Check the stopping rule every iteration, not only on report steps
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        // Progress report
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        // Update
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (including bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';
    return 0;
}
This program fits a linear regression model with batch gradient descent. It standardizes features for better conditioning, appends a bias feature, and iteratively updates parameters using the full-dataset gradient. The loss decreases steadily when the learning rate is reasonable, illustrating stable batch updates.
#include <bits/stdc++.h>
using namespace std;

// Logistic regression with batch gradient descent
// Loss: L(theta) = (1/n) * sum_i [ -y_i log p_i - (1-y_i) log(1-p_i) ], p_i = sigmoid(x_i^T theta)
// Gradient: (1/n) * X^T (p - y)

struct Dataset {
    vector<vector<double>> X; // n x d (includes bias column)
    vector<int> y;            // n labels in {0,1}
};

Dataset make_data(int n, int d_no_bias, unsigned seed = 7) {
    mt19937 rng(seed);
    normal_distribution<double> noise(0.0, 0.5);
    uniform_real_distribution<double> wdist(-2.0, 2.0);
    vector<double> true_w(d_no_bias + 1);
    for (double &w : true_w) w = wdist(rng);

    Dataset ds;
    ds.X.assign(n, vector<double>(d_no_bias + 1));
    ds.y.assign(n, 0);
    normal_distribution<double> xdist(0.0, 1.0);
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d_no_bias; ++j) {
            double xj = xdist(rng);
            ds.X[i][j] = xj;
            z += true_w[j] * xj;
        }
        ds.X[i][d_no_bias] = 1.0; // bias
        z += true_w[d_no_bias];
        z += noise(rng);
        // Probability via sigmoid, then sample label
        double p = 1.0 / (1.0 + exp(-z));
        bernoulli_distribution bd(p);
        ds.y[i] = (int)bd(rng);
    }
    return ds;
}

void standardize_features(vector<vector<double>> &X) {
    int n = (int)X.size();
    int d = (int)X[0].size();
    for (int j = 0; j < d - 1; ++j) {
        double mean = 0.0;
        for (int i = 0; i < n; ++i) mean += X[i][j];
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; ++i) var += (X[i][j] - mean) * (X[i][j] - mean);
        var /= max(1, n - 1);
        double stdv = sqrt(var) + 1e-12;
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }
}

// Numerically stable sigmoid
inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 1000;
    int d_no_bias = 4;
    Dataset ds = make_data(n, d_no_bias, 2024);
    standardize_features(ds.X);

    int d = d_no_bias + 1; // include bias
    vector<double> theta(d, 0.0);
    vector<double> p(n), grad(d);

    auto loss = [&](const vector<double> &th) {
        double L = 0.0;
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * th[j];
            double pi = sigmoid(z);
            // Clamp to avoid log(0)
            double eps = 1e-12;
            pi = min(max(pi, eps), 1.0 - eps);
            L += -ds.y[i] * log(pi) - (1 - ds.y[i]) * log(1 - pi);
        }
        return L / n;
    };

    double eta = 0.2;
    int max_iters = 1000;
    double tol = 1e-6;

    cout << fixed << setprecision(6);
    for (int t = 0; t < max_iters; ++t) {
        // Compute probabilities p = sigmoid(X theta)
        for (int i = 0; i < n; ++i) {
            double z = 0.0;
            for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
            p[i] = sigmoid(z);
        }
        // Gradient: (1/n) * X^T (p - y)
        fill(grad.begin(), grad.end(), 0.0);
        for (int j = 0; j < d; ++j) {
            double gj = 0.0;
            for (int i = 0; i < n; ++i) gj += ds.X[i][j] * (p[i] - ds.y[i]);
            grad[j] = gj / n;
        }
        // Check the stopping rule every iteration, not only on report steps
        double gnorm = 0.0;
        for (double v : grad) gnorm += v * v;
        gnorm = sqrt(gnorm);
        if (t % 100 == 0) {
            cout << "iter=" << t << " loss=" << loss(theta) << " |grad|=" << gnorm << "\n";
        }
        if (gnorm <= tol) break;
        for (int j = 0; j < d; ++j) theta[j] -= eta * grad[j];
    }

    cout << "Final loss=" << loss(theta) << "\nParameters (bias last):\n";
    for (int j = 0; j < d; ++j) cout << "theta[" << j << "]=" << theta[j] << '\n';

    // Quick training accuracy check
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += ds.X[i][j] * theta[j];
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (pred == ds.y[i]);
    }
    cout << "Training accuracy ~ " << (100.0 * correct / n) << "%\n";
    return 0;
}
This example trains a binary logistic regression model using batch gradient descent. It computes cross-entropy loss, uses a numerically stable sigmoid, standardizes features, and updates parameters using the full gradient each iteration. The model’s loss decreases and accuracy improves as training proceeds.