How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

L2 Regularization (Ridge/Weight Decay)

Key Points

  • L2 regularization (also called ridge or weight decay) adds a penalty proportional to the sum of squared weights to discourage large parameters.
  • It acts like a soft pull (a spring) that shrinks coefficients toward zero without setting many exactly to zero, improving generalization.
  • In linear regression, ridge has a closed-form solution via the normal equations with a λI term, which also stabilizes an ill-conditioned XᵀX.
  • The gradient of the L2 penalty is simple: λw, so SGD updates become weight decay steps: w ← (1 − ηλ)w − η∇L_data.
  • Bias terms (intercepts) are usually not regularized; standardizing features before using L2 is important to balance the penalty.
  • L2 has a Bayesian view: it is equivalent to a Gaussian prior on weights and yields a MAP estimate.
  • Use L2 when you want smooth shrinkage, to handle multicollinearity, reduce variance, and improve numerical stability.
  • Choose λ via cross-validation; too large over-smooths (underfits), too small barely helps (overfits).

Prerequisites

  • Linear algebra (vectors, matrices, matrix multiplication) — ridge uses XᵀX, linear systems, and norms, which require matrix operations.
  • Differential calculus and gradients — L2 changes gradients by adding λw; understanding the updates requires derivatives.
  • Least squares and ordinary linear regression — ridge is a regularized variant of least squares with a modified normal equation.
  • Optimization algorithms (SGD, gradient descent) — in practice, L2 is implemented as weight decay in first-order methods.
  • Feature scaling/standardization — L2 penalizes coefficients, so scaling ensures fair and stable shrinkage across features.
  • Model selection and cross-validation — choosing λ requires validation to balance bias and variance.
  • Probability and MAP estimation — L2 corresponds to a Gaussian prior on parameters; MAP connects the regularization and Bayesian views.

Detailed Explanation


01 Overview

Hook: Imagine training a model that fits your training data perfectly but fails miserably on new data—the classic overfitting problem. What if you could add a gentle brake that discourages wild parameter values without forcing them to be exactly zero?

Concept: L2 regularization adds a penalty to the objective that grows with the square of each weight, nudging the model toward smaller, more conservative parameters. This reduces variance, improves generalization, and makes solutions more stable when features are correlated. In linear regression, this is called ridge regression; in gradient-based training (including deep learning), the equivalent update form is weight decay.

Example: Suppose two features are nearly duplicates. Ordinary least squares can swing coefficients to large, opposite values to fit noise. Ridge balances them by shrinking both moderately, producing a more reliable predictor with smaller coefficients and better test performance.

02 Intuition & Analogies

Hook: Think of your model’s weights like knobs on a stereo system. If you let every knob crank up freely, you might amplify both music and static. Adding L2 regularization is like attaching each knob to a soft spring pulling it back toward zero: you still get volume where needed, but the system resists extremes.

Concept: Squaring the weights makes the penalty grow faster for large values, so very large weights are strongly discouraged, while small weights are only lightly touched. Unlike L1 (which is like a budget that can force exact zeros), L2 acts smoothly and continuously, spreading the shrinkage across all weights.

Example: Picture fitting a line to noisy data. Without regularization, the slope might tilt sharply to chase noise points. With L2, the slope and intercept are tugged back a bit (usually leaving the intercept unpenalized), giving a line that may fit slightly worse on training data but will be more robust on new data. Another analogy: friction in a physical system—L2 adds a uniform drag that prevents parameters from accelerating to large magnitudes; in optimization steps, this appears as weight decay, gradually fading weights unless there is consistent gradient signal to sustain them.

03 Formal Definition

Hook: We tame models by penalizing complexity directly in the objective.

Concept: Given parameters w ∈ ℝ^d and a data loss L_data(w) (e.g., squared error, logistic loss), L2 regularization augments the objective with a quadratic penalty on w. The standard form is J(w) = L_data(w) + (λ/2)∥w∥₂², where ∥w∥₂² = ∑ᵢ wᵢ² and λ ≥ 0 controls the strength. The factor 1/2 is conventional and simplifies gradients. Often, the bias term b is excluded from the penalty. In linear regression with X ∈ ℝ^(n×d) and labels y ∈ ℝ^n, ridge minimizes (1/(2n))∥Xw − y∥₂² + (λ/2)∥w∥₂², giving the normal equations (XᵀX + nλI)w = Xᵀy (the n factor depends on your scaling convention). Its closed-form solution is w = (XᵀX + λI)⁻¹Xᵀy (assuming compatible scaling). The gradient of the full objective is ∇J(w) = ∇L_data(w) + λw, so first-order methods implement weight decay naturally.

Example: In logistic regression, L_data is the average cross-entropy, and the same (λ/2)∥w∥₂² term is added, yielding smoother decision boundaries and better calibration.

04 When to Use

Hook: If your model performs great on training data but degrades on validation, ask whether you need a gentle constraint.

Concept: Use L2 regularization when you want to reduce variance without inducing sparsity, especially with correlated features or high-dimensional inputs. It is also effective when XᵀX is ill-conditioned: adding λI improves conditioning and numerical stability.

Use cases: (1) Ridge regression for tabular data with multicollinearity, (2) text or polynomial features where many small coefficients are preferable to a few large ones, (3) deep learning as weight decay to curb exploding weights and improve generalization, (4) online learning with SGD where L2 provides continuous shrinkage during updates.

Example: In predicting house prices with many overlapping features (square footage, total rooms, bedrooms), ridge keeps coefficients modest and reduces sensitivity to noise; in a neural network, adding weight decay to linear layers prevents weight magnitudes from drifting upward over long training.

⚠️ Common Mistakes

Hook: L2 is simple to add but surprisingly easy to misuse.

Concept: Common pitfalls include (1) not standardizing features—L2 penalizes coefficients, so unscaled features skew shrinkage; (2) regularizing the bias—penalizing the intercept often hurts performance; (3) picking λ by guesswork—use cross-validation; (4) mixing conventions—forgetting whether λ couples with a 1/n factor in your loss, leading to mismatched strengths; (5) confusing λw vs. 2λw gradients—using (λ/2)∥w∥₂² gives gradient λw; (6) assuming L2 gives sparsity—it rarely sets weights exactly to zero; (7) with adaptive optimizers, equating L2 with weight decay—AdamW uses decoupled weight decay, which behaves better than naive L2.

Example: A practitioner feeds raw dollar amounts and counts into a ridge model without scaling; the small-scale count features need large coefficients and get over-penalized toward near-zero influence, while the large-scale dollar feature, whose coefficient is tiny, is barely shrunk at all. Another example: enabling regularization on the bias collapses predictions toward zero mean, degrading fit.

Key Formulas

Regularized Objective

J(w) = L_data(w) + (λ/2)∥w∥₂²

Explanation: The total objective equals the data loss plus half lambda times the squared L2 norm of the weights. The factor 1/2 makes the gradient of the penalty simply λw.

L2 Norm

∥w∥₂ = √(∑ᵢ₌₁ᵈ wᵢ²),  ∥w∥₂² = ∑ᵢ₌₁ᵈ wᵢ²

Explanation: The L2 norm measures the Euclidean length of the vector. The squared L2 norm is used in L2 regularization to penalize large coefficients.

Gradient with L2

∇J(w) = ∇L_data(w) + λw

Explanation: The gradient of the regularized objective is the sum of the data-loss gradient and λ times the weight vector. This is what drives weight decay in first-order methods.

Weight Decay Update

w_{t+1} = (1 − ηλ)w_t − η∇L_data(w_t)

Explanation: In SGD with step size η, L2 adds a multiplicative shrinkage (1−ηλ) to weights each step, plus the usual data-gradient step. This is the computational view of L2 in iterative training.

Ridge Objective (Scaled)

min_w (1/(2n))∥Xw − y∥₂² + (λ/2)∥w∥₂²

Explanation: Ridge regression minimizes mean squared error with an L2 penalty. Different libraries absorb factors of n into λ; be consistent when comparing values.

Ridge Normal Equations

(XᵀX + λI)w = Xᵀy,  w = (XᵀX + λI)⁻¹Xᵀy

Explanation: Adding λI to XᵀX yields a well-conditioned linear system. Solving it gives the ridge coefficients in closed form (up to scaling conventions).

Hat Matrix and Degrees of Freedom

S_λ = X(XᵀX + λI)⁻¹Xᵀ,  df(λ) = trace(S_λ)

Explanation: The hat matrix maps the targets to fitted values. Its trace defines the effective degrees of freedom, which decrease as λ increases.

Logistic Regression with L2

min_w (1/n)∑ᵢ₌₁ⁿ log(1 + e^(−yᵢ wᵀxᵢ)) + (λ/2)∥w∥₂²

Explanation: For binary labels yᵢ ∈ {−1, +1}, the logistic loss plus an L2 penalty yields better generalization and smoother decision boundaries.

Gaussian Prior Equivalence

w ∼ N(0, σ_w² I) ⇒ −log p(w) = (1/(2σ_w²))∥w∥₂² + C

Explanation: Placing a zero-mean isotropic Gaussian prior on the weights leads to an L2 penalty in the MAP objective, with λ = 1/σ_w² (up to scaling of the data term).
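The equivalence can be spelled out as a short derivation. Assuming Gaussian observation noise with variance σ² (so the negative log-likelihood is squared error, up to constants):

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
&= \arg\max_w \; p(w \mid \mathcal{D})
 = \arg\max_w \; p(\mathcal{D} \mid w)\, p(w) \\
&= \arg\min_w \; \underbrace{-\log p(\mathcal{D} \mid w)}_{\frac{1}{2\sigma^2}\lVert Xw - y\rVert_2^2 + \text{const}}
 \; + \; \underbrace{-\log p(w)}_{\frac{1}{2\sigma_w^2}\lVert w\rVert_2^2 + \text{const}} \\
&= \arg\min_w \; \frac{1}{2}\lVert Xw - y\rVert_2^2 + \frac{\lambda}{2}\lVert w\rVert_2^2,
\qquad \lambda = \frac{\sigma^2}{\sigma_w^2}.
\end{aligned}
```

So a tighter prior (smaller σ_w²) or noisier data (larger σ²) both correspond to a larger λ, matching the λ = 1/σ_w² statement once the data term's scaling is fixed.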

Tikhonov Regularization

w = (XᵀX + λΓᵀΓ)⁻¹Xᵀy

Explanation: Generalizes ridge by penalizing ∥Γw∥₂², allowing you to shape which directions in parameter space are shrunk more.

Complexity Analysis

There are two common computational regimes for L2 regularization.

For ridge regression with dense data, a closed-form solve is efficient when the number of features d is modest. Computing XᵀX takes O(nd²) time and O(d²) space, and Xᵀy costs O(nd). Solving the d×d linear system (XᵀX + λI)w = Xᵀy via Cholesky decomposition requires O(d³) time and O(d²) space. The total complexity is O(nd² + d³) time and O(nd + d²) space if you store X and the Gram matrix; this is practical for d up to a few thousand on a single machine. The resulting system is symmetric positive definite when λ > 0, which guarantees a stable Cholesky factorization.

For very high-dimensional or sparse problems, iterative methods are preferable. Using (stochastic) gradient descent, each pass over n examples takes O(nd) time if you compute gradients exactly, or O(bd) per minibatch of size b, with O(d) space for parameters. With L2, the per-step update adds only a simple weight decay factor, keeping asymptotic costs unchanged. Convergence rates depend on conditioning: adding λ improves the strong convexity of the objective, often yielding faster and more stable convergence. For sparse X, sparse matrix multiplications reduce costs proportionally to the number of nonzeros nnz, e.g., O(nnz + d³) for forming and solving the normal equations when d is small, or O(nnz) per epoch for SGD. In neural networks, L2 adds negligible overhead (just a vector scaling) while providing regularization uniformly across layers, typically excluding biases and normalization parameters.

Code Examples

Ridge Regression (Closed-Form) via Cholesky with Optional Unpenalized Bias
#include <bits/stdc++.h>
using namespace std;

// Compute X^T X (d x d) and X^T y (d)
static void computeXtX_Xty(const vector<vector<double>>& X, const vector<double>& y,
                           vector<vector<double>>& XtX, vector<double>& Xty) {
    size_t n = X.size();
    size_t d = X[0].size();
    XtX.assign(d, vector<double>(d, 0.0));
    Xty.assign(d, 0.0);
    for (size_t i = 0; i < n; ++i) {
        const vector<double>& xi = X[i];
        double yi = y[i];
        for (size_t a = 0; a < d; ++a) {
            Xty[a] += xi[a] * yi;
            double xia = xi[a];
            for (size_t b = 0; b < d; ++b) {
                XtX[a][b] += xia * xi[b];
            }
        }
    }
}

// Cholesky decomposition for SPD matrix A: A = L L^T (lower triangular L)
static bool choleskyDecompose(const vector<vector<double>>& A, vector<vector<double>>& L) {
    size_t n = A.size();
    L.assign(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve A x = b given Cholesky L such that A = L L^T
static vector<double> choleskySolve(const vector<vector<double>>& L, const vector<double>& b) {
    size_t n = L.size();
    vector<double> y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = (int)n - 1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Fit ridge regression: minimize 1/2 ||X w - y||^2 + (lambda/2) ||w||^2
// Optionally exclude an index (bias_index) from regularization by not adding lambda to its diagonal.
static vector<double> ridge_fit(const vector<vector<double>>& X, const vector<double>& y,
                                double lambda, int bias_index = -1) {
    vector<vector<double>> XtX; vector<double> Xty;
    computeXtX_Xty(X, y, XtX, Xty);

    size_t d = XtX.size();
    for (size_t i = 0; i < d; ++i) {
        if ((int)i == bias_index) continue; // do not penalize bias
        XtX[i][i] += lambda;
    }

    vector<vector<double>> L;
    if (!choleskyDecompose(XtX, L)) {
        throw runtime_error("Matrix not SPD; increase lambda or check data.");
    }
    return choleskySolve(L, Xty);
}

// Predict y = X w
static vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    vector<double> yhat(X.size(), 0.0);
    for (size_t i = 0; i < X.size(); ++i) {
        double s = 0.0;
        for (size_t j = 0; j < w.size(); ++j) s += X[i][j] * w[j];
        yhat[i] = s;
    }
    return yhat;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a toy dataset with a bias column (first column of ones)
    // True model: y = 3 + 2*x1 - x2 + noise
    int n = 200, d = 3; // [1, x1, x2]
    vector<vector<double>> X(n, vector<double>(d, 1.0));
    vector<double> y(n);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 1.0);
    uniform_real_distribution<double> unif(-3.0, 3.0);
    for (int i = 0; i < n; ++i) {
        double x1 = unif(rng);
        double x2 = unif(rng) + 0.5 * x1; // introduce correlation
        X[i][1] = x1;
        X[i][2] = x2;
        y[i] = 3.0 + 2.0 * x1 - 1.0 * x2 + noise(rng);
    }

    double lambda = 10.0; // regularization strength
    // bias is column 0; do not penalize it
    vector<double> w = ridge_fit(X, y, lambda, /*bias_index=*/0);

    // Report weights
    cout << fixed << setprecision(4);
    cout << "Ridge weights (bias, w1, w2): ";
    for (double wi : w) cout << wi << ' ';
    cout << "\n";

    // Evaluate training RMSE
    vector<double> yhat = predict(X, w);
    double se = 0.0;
    for (int i = 0; i < n; ++i) se += (yhat[i] - y[i]) * (yhat[i] - y[i]);
    cout << "Train RMSE: " << sqrt(se / n) << "\n";
}

This program fits ridge regression using the closed-form normal equations with a Cholesky factorization. We build X^T X and X^T y, add λ to the diagonal (skipping the bias index to avoid penalizing the intercept), and solve (X^T X + λI)w = X^T y. The toy data introduces correlated features to highlight ridge’s stabilizing effect.

Time: O(nd² + d³) · Space: O(nd + d²)
Binary Logistic Regression with L2 via SGD (Weight Decay)
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Train logistic regression with L2 using SGD/minibatch weight decay
struct LogisticL2 {
    vector<double> w; // weights for features (excluding bias)
    double b = 0.0;   // bias (unpenalized)

    // Initialize weights to zeros
    explicit LogisticL2(size_t d) : w(d, 0.0), b(0.0) {}

    // One training epoch over data (minibatch SGD)
    void train_epoch(const vector<vector<double>>& X, const vector<int>& y,
                     double lr, double lambda, size_t batch_size = 32, mt19937* rng = nullptr) {
        size_t n = X.size();
        vector<size_t> idx(n);
        iota(idx.begin(), idx.end(), 0);
        if (rng) shuffle(idx.begin(), idx.end(), *rng);

        for (size_t start = 0; start < n; start += batch_size) {
            size_t end = min(n, start + batch_size);
            // Accumulate gradients on minibatch
            vector<double> gw(w.size(), 0.0);
            double gb = 0.0;
            for (size_t ii = start; ii < end; ++ii) {
                size_t i = idx[ii];
                // Model prediction: p = sigma(w^T x + b)
                double z = b;
                for (size_t j = 0; j < w.size(); ++j) z += w[j] * X[i][j];
                double p = sigmoid(z);
                // Labels y are in {0,1}; gradient of average cross-entropy: (p - y)
                double diff = p - static_cast<double>(y[i]);
                for (size_t j = 0; j < w.size(); ++j) gw[j] += diff * X[i][j];
                gb += diff;
            }
            double m = static_cast<double>(end - start);
            for (size_t j = 0; j < w.size(); ++j) gw[j] /= m;
            gb /= m;

            // Decoupled weight decay (equivalent to L2 for SGD): shrink weights, not bias
            for (size_t j = 0; j < w.size(); ++j) w[j] *= (1.0 - lr * lambda);

            // Gradient step
            for (size_t j = 0; j < w.size(); ++j) w[j] -= lr * gw[j];
            b -= lr * gb; // do not decay bias
        }
    }

    // Predict probability p(y=1|x)
    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict_label(const vector<double>& x, double thresh = 0.5) const {
        return predict_proba(x) >= thresh ? 1 : 0;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate a toy binary classification dataset in 2D
    int n = 1000; int d = 2;
    vector<vector<double>> X(n, vector<double>(d));
    vector<int> y(n);

    mt19937 rng(123);
    normal_distribution<double> ga(0.0, 1.0), gb(0.0, 1.0), noise(0.0, 0.5);
    // Two clusters separated roughly by a line
    for (int i = 0; i < n; ++i) {
        if (i < n/2) {
            X[i][0] = ga(rng) - 2.0; X[i][1] = ga(rng);
            y[i] = 0;
        } else {
            X[i][0] = gb(rng) + 2.0; X[i][1] = gb(rng);
            y[i] = 1;
        }
        // Add some noise to make it non-trivial
        X[i][0] += noise(rng); X[i][1] += noise(rng);
    }

    // Standardize features for fair L2 penalization
    for (int j = 0; j < d; ++j) {
        double mean = 0.0; for (int i = 0; i < n; ++i) mean += X[i][j]; mean /= n;
        double var = 0.0; for (int i = 0; i < n; ++i) { double t = X[i][j] - mean; var += t * t; }
        double stdv = sqrt(var / n + 1e-12);
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }

    LogisticL2 clf(d);
    double lr = 0.1, lambda = 0.01;
    for (int epoch = 0; epoch < 30; ++epoch) {
        clf.train_epoch(X, y, lr, lambda, 64, &rng);
    }

    // Evaluate accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) correct += (clf.predict_label(X[i]) == y[i]);
    cout << fixed << setprecision(4);
    cout << "Train accuracy: " << (100.0 * correct / n) << "%\n";
    cout << "Weights: "; for (double wi : clf.w) cout << wi << ' '; cout << "| bias: " << clf.b << "\n";
}

This example trains a logistic regression classifier using minibatch SGD with L2 regularization implemented as weight decay. We standardize features, shrink weights each step by (1−ηλ), then apply the gradient of the cross-entropy loss. The bias term is not penalized. The code demonstrates how L2 appears as a simple multiplicative decay in iterative optimization.

Time: O(E · n · d) for E epochs (each minibatch update costs O(b · d) for batch size b) · Space: O(nd + d)
Tags: l2 regularization · ridge regression · weight decay · tikhonov regularization · gaussian prior · bias variance tradeoff · cholesky decomposition · logistic regression · cross validation · feature scaling · overfitting · multicollinearity · normal equations · sgd · map estimation