How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

L2 Regularization (Ridge/Weight Decay)

Key Points

  • L2 regularization (also called ridge or weight decay) adds a penalty proportional to the sum of squared weights to discourage large parameters.
  • It acts like a soft pull (a spring) that shrinks coefficients toward zero without setting many exactly to zero, improving generalization.
  • In linear regression, ridge has a closed-form solution via the normal equations with a λI term, which also stabilizes an ill-conditioned XᵀX.
  • The gradient of the L2 penalty is simple: λw, so SGD updates become weight decay steps: w ← (1 − ηλ)w − η∇L_data.
  • Bias terms (intercepts) are usually not regularized; standardizing features before using L2 is important to balance the penalty.
  • L2 has a Bayesian view: it is equivalent to a Gaussian prior on weights and yields a MAP estimate.
  • Use L2 when you want smooth shrinkage, to handle multicollinearity, reduce variance, and improve numerical stability.
  • Choose λ via cross-validation; too large over-smooths (underfits), too small barely helps (overfits).

Prerequisites

  • Linear algebra (vectors, matrices, matrix multiplication) — ridge uses XᵀX, linear systems, and norms, which require matrix operations.
  • Differential calculus and gradients — L2 changes gradients by adding λw; understanding the updates requires derivatives.
  • Least squares and ordinary linear regression — ridge is a regularized variant of least squares with a modified normal equation.
  • Optimization algorithms (SGD, gradient descent) — in practice, L2 is implemented as weight decay in first-order methods.
  • Feature scaling/standardization — L2 penalizes coefficients, so scaling ensures fair and stable shrinkage across features.
  • Model selection and cross-validation — choosing λ requires validation to balance bias and variance.
  • Probability and MAP estimation — L2 corresponds to a Gaussian prior on parameters; MAP connects the regularization and Bayesian views.

Detailed Explanation


01 Overview

Hook: Imagine training a model that fits your training data perfectly but fails miserably on new data—the classic overfitting problem. What if you could add a gentle brake that discourages wild parameter values without forcing them to be exactly zero?

Concept: L2 regularization adds a penalty to the objective that grows with the square of each weight, nudging the model toward smaller, more conservative parameters. This reduces variance, improves generalization, and makes solutions more stable when features are correlated. In linear regression, this is called ridge regression; in gradient-based training (including deep learning), the equivalent update form is weight decay.

Example: Suppose two features are nearly duplicates. Ordinary least squares can swing coefficients to large, opposite values to fit noise. Ridge balances them by shrinking both moderately, producing a more reliable predictor with smaller coefficients and better test performance.

02 Intuition & Analogies

Hook: Think of your model’s weights like knobs on a stereo system. If you let every knob crank up freely, you might amplify both music and static. Adding L2 regularization is like attaching each knob to a soft spring pulling it back toward zero: you still get volume where needed, but the system resists extremes.

Concept: Squaring the weights makes the penalty grow faster for large values, so very large weights are strongly discouraged, while small weights are only lightly touched. Unlike L1 (which is like a budget that can force exact zeros), L2 acts smoothly and continuously, spreading the shrinkage across all weights.

Example: Picture fitting a line to noisy data. Without regularization, the slope might tilt sharply to chase noise points. With L2, the slope and intercept are tugged back a bit (usually leaving the intercept unpenalized), giving a line that may fit slightly worse on training data but will be more robust on new data. Another analogy: friction in a physical system—L2 adds a uniform drag that prevents parameters from accelerating to large magnitudes; in optimization steps, this appears as weight decay, gradually fading weights unless there is consistent gradient signal to sustain them.

03 Formal Definition

Hook: We tame models by penalizing complexity directly in the objective.

Concept: Given parameters w ∈ ℝ^d and a data loss L_data(w) (e.g., squared error, logistic loss), L2 regularization augments the objective with a quadratic penalty on w. The standard form is J(w) = L_data(w) + (λ/2)∥w∥₂², where ∥w∥₂² = ∑ᵢ wᵢ² and λ ≥ 0 controls the strength. The factor 1/2 is conventional and simplifies gradients. Often, the bias term b is excluded from the penalty. In linear regression with X ∈ ℝ^(n×d) and labels y ∈ ℝ^n, ridge minimizes (1/(2n))∥Xw − y∥₂² + (λ/2)∥w∥₂², giving the normal equations (XᵀX + nλI)w = Xᵀy (the n factor depends on your scaling convention). Its closed-form solution is w = (XᵀX + λI)⁻¹Xᵀy (assuming compatible scaling). The gradient of the full objective is ∇J(w) = ∇L_data(w) + λw, so first-order methods implement weight decay naturally.

Example: In logistic regression, L_data is the average cross-entropy, and the same (λ/2)∥w∥₂² term is added, yielding smoother decision boundaries and better calibration.

04 When to Use

Hook: If your model performs great on training data but degrades on validation, ask whether you need a gentle constraint.

Concept: Use L2 regularization when you want to reduce variance without inducing sparsity, especially with correlated features or high-dimensional inputs. It is also effective when XᵀX is ill-conditioned: adding λI improves conditioning and numerical stability.

Use cases: (1) Ridge regression for tabular data with multicollinearity, (2) text or polynomial features where many small coefficients are preferable to a few large ones, (3) deep learning as weight decay to curb exploding weights and improve generalization, (4) online learning with SGD where L2 provides continuous shrinkage during updates.

Example: In predicting house prices with many overlapping features (square footage, total rooms, bedrooms), ridge keeps coefficients modest and reduces sensitivity to noise; in a neural network, adding weight decay to linear layers prevents weight magnitudes from drifting upward over long training.

⚠️ Common Mistakes

Hook: L2 is simple to add but surprisingly easy to misuse.

Concept: Common pitfalls include (1) not standardizing features—L2 penalizes coefficients, so unscaled features skew shrinkage; (2) regularizing the bias—penalizing the intercept often hurts performance; (3) picking λ by guesswork—use cross-validation; (4) mixing conventions—forgetting whether λ couples with a 1/n factor in your loss, leading to mismatched strengths; (5) confusing λw vs. 2λw gradients—using (λ/2)∥w∥₂² gives gradient λw; (6) assuming L2 gives sparsity—it rarely sets weights exactly to zero; (7) with adaptive optimizers, equating L2 with weight decay—AdamW uses decoupled weight decay, which behaves better than naive L2.

Example: A practitioner feeds raw dollar amounts and counts into a ridge model without scaling; the small-scale count features need large coefficients and get over-penalized toward near-zero influence, while the large-scale dollar feature, whose coefficient is tiny, is barely shrunk at all. Another example: enabling regularization on the bias collapses predictions toward zero mean, degrading fit.

Key Formulas

Regularized Objective

J(w) = L_data(w) + (λ/2)∥w∥₂²

Explanation: The total objective equals the data loss plus half lambda times the squared L2 norm of the weights. The factor 1/2 makes the gradient of the penalty simply λw.

L2 Norm

∥w∥₂ = √(∑ᵢ₌₁ᵈ wᵢ²),  ∥w∥₂² = ∑ᵢ₌₁ᵈ wᵢ²

Explanation: The L2 norm measures the Euclidean length of the vector. The squared L2 norm is used in L2 regularization to penalize large coefficients.

Gradient with L2

∇J(w) = ∇L_data(w) + λw

Explanation: The gradient of the regularized objective is the sum of the data-loss gradient and λ times the weight vector. This is what drives weight decay in first-order methods.

Weight Decay Update

w_{t+1} = (1 − ηλ)w_t − η∇L_data(w_t)

Explanation: In SGD with step size η, L2 adds a multiplicative shrinkage (1−ηλ) to weights each step, plus the usual data-gradient step. This is the computational view of L2 in iterative training.

Ridge Objective (Scaled)

min_w (1/(2n))∥Xw − y∥₂² + (λ/2)∥w∥₂²

Explanation: Ridge regression minimizes mean squared error with an L2 penalty. Different libraries absorb factors of n into λ; be consistent when comparing values.

Ridge Normal Equations

(XᵀX + λI)w = Xᵀy,  w = (XᵀX + λI)⁻¹Xᵀy

Explanation: Adding λI to XᵀX yields a well-conditioned linear system. Solving it gives the ridge coefficients in closed form (up to scaling conventions).

Hat Matrix and Degrees of Freedom

S_λ = X(XᵀX + λI)⁻¹Xᵀ,  df(λ) = trace(S_λ)

Explanation: The hat matrix maps the targets to fitted values. Its trace defines the effective degrees of freedom, which decrease as λ increases.

Logistic Regression with L2

min_w (1/n)∑ᵢ₌₁ⁿ log(1 + e^(−yᵢ wᵀxᵢ)) + (λ/2)∥w∥₂²

Explanation: For binary labels yᵢ ∈ {−1, +1}, the logistic loss plus an L2 penalty yields better generalization and smoother decision boundaries.

Gaussian Prior Equivalence

w ∼ N(0, σ_w² I) ⇒ −log p(w) = (1/(2σ_w²))∥w∥₂² + C

Explanation: Placing a zero-mean isotropic Gaussian prior on the weights leads to an L2 penalty in the MAP objective, with λ = 1/σ_w² (up to scaling of the data term).
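The equivalence can be spelled out as a short derivation. Assuming Gaussian observation noise with variance σ² (so the negative log-likelihood is squared error, up to constants):

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
&= \arg\max_w \; p(w \mid \mathcal{D})
 = \arg\max_w \; p(\mathcal{D} \mid w)\, p(w) \\
&= \arg\min_w \; \underbrace{-\log p(\mathcal{D} \mid w)}_{\frac{1}{2\sigma^2}\lVert Xw - y\rVert_2^2 + \text{const}}
 \; + \; \underbrace{-\log p(w)}_{\frac{1}{2\sigma_w^2}\lVert w\rVert_2^2 + \text{const}} \\
&= \arg\min_w \; \frac{1}{2}\lVert Xw - y\rVert_2^2 + \frac{\lambda}{2}\lVert w\rVert_2^2,
\qquad \lambda = \frac{\sigma^2}{\sigma_w^2}.
\end{aligned}
```

So a tighter prior (smaller σ_w²) or noisier data (larger σ²) both correspond to a larger λ, matching the λ = 1/σ_w² statement once the data term's scaling is fixed.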

Tikhonov Regularization

w = (XᵀX + λΓᵀΓ)⁻¹Xᵀy

Explanation: Generalizes ridge by penalizing ∥Γw∥₂², allowing you to shape which directions in parameter space are shrunk more.

Complexity Analysis

There are two common computational regimes for L2 regularization.

For ridge regression with dense data, a closed-form solve is efficient when the number of features d is modest. Computing XᵀX takes O(nd²) time and O(d²) space, and Xᵀy costs O(nd). Solving the d×d linear system (XᵀX + λI)w = Xᵀy via Cholesky decomposition requires O(d³) time and O(d²) space. The total complexity is O(nd² + d³) time and O(nd + d²) space if you store X and the Gram matrix; this is practical for d up to a few thousand on a single machine. The resulting system is symmetric positive definite when λ > 0, which guarantees a stable Cholesky factorization.

For very high-dimensional or sparse problems, iterative methods are preferable. Using (stochastic) gradient descent, each pass over n examples takes O(nd) time if you compute gradients exactly, or O(bd) per minibatch of size b, with O(d) space for parameters. With L2, the per-step update adds only a simple weight decay factor, keeping asymptotic costs unchanged. Convergence rates depend on conditioning: adding λ improves the strong convexity of the objective, often yielding faster and more stable convergence. For sparse X, sparse matrix multiplications reduce costs proportionally to the number of nonzeros nnz, e.g., O(nnz + d³) for forming and solving the normal equations when d is small, or O(nnz) per epoch for SGD. In neural networks, L2 adds negligible overhead (just a vector scaling) while providing regularization uniformly across layers, typically excluding biases and normalization parameters.

Code Examples

Ridge Regression (Closed-Form) via Cholesky with Optional Unpenalized Bias
#include <bits/stdc++.h>
using namespace std;

// Compute X^T X (d x d) and X^T y (d)
static void computeXtX_Xty(const vector<vector<double>>& X, const vector<double>& y,
                           vector<vector<double>>& XtX, vector<double>& Xty) {
    size_t n = X.size();
    size_t d = X[0].size();
    XtX.assign(d, vector<double>(d, 0.0));
    Xty.assign(d, 0.0);
    for (size_t i = 0; i < n; ++i) {
        const vector<double>& xi = X[i];
        double yi = y[i];
        for (size_t a = 0; a < d; ++a) {
            Xty[a] += xi[a] * yi;
            double xia = xi[a];
            for (size_t b = 0; b < d; ++b) {
                XtX[a][b] += xia * xi[b];
            }
        }
    }
}

// Cholesky decomposition for SPD matrix A: A = L L^T (lower triangular L)
static bool choleskyDecompose(const vector<vector<double>>& A, vector<vector<double>>& L) {
    size_t n = A.size();
    L.assign(n, vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve A x = b given Cholesky L such that A = L L^T
static vector<double> choleskySolve(const vector<vector<double>>& L, const vector<double>& b) {
    size_t n = L.size();
    vector<double> y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = (int)n - 1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Fit ridge regression: minimize 1/2 ||X w - y||^2 + (lambda/2) ||w||^2
// Optionally exclude an index (bias_index) from regularization by not adding lambda to its diagonal.
static vector<double> ridge_fit(const vector<vector<double>>& X, const vector<double>& y,
                                double lambda, int bias_index = -1) {
    vector<vector<double>> XtX; vector<double> Xty;
    computeXtX_Xty(X, y, XtX, Xty);

    size_t d = XtX.size();
    for (size_t i = 0; i < d; ++i) {
        if ((int)i == bias_index) continue; // do not penalize bias
        XtX[i][i] += lambda;
    }

    vector<vector<double>> L;
    if (!choleskyDecompose(XtX, L)) {
        throw runtime_error("Matrix not SPD; increase lambda or check data.");
    }
    return choleskySolve(L, Xty);
}

// Predict y = X w
static vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    vector<double> yhat(X.size(), 0.0);
    for (size_t i = 0; i < X.size(); ++i) {
        double s = 0.0;
        for (size_t j = 0; j < w.size(); ++j) s += X[i][j] * w[j];
        yhat[i] = s;
    }
    return yhat;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a toy dataset with a bias column (first column of ones)
    // True model: y = 3 + 2*x1 - x2 + noise
    int n = 200, d = 3; // [1, x1, x2]
    vector<vector<double>> X(n, vector<double>(d, 1.0));
    vector<double> y(n);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 1.0);
    uniform_real_distribution<double> unif(-3.0, 3.0);
    for (int i = 0; i < n; ++i) {
        double x1 = unif(rng);
        double x2 = unif(rng) + 0.5 * x1; // introduce correlation
        X[i][1] = x1;
        X[i][2] = x2;
        y[i] = 3.0 + 2.0 * x1 - 1.0 * x2 + noise(rng);
    }

    double lambda = 10.0; // regularization strength
    // bias is column 0; do not penalize it
    vector<double> w = ridge_fit(X, y, lambda, /*bias_index=*/0);

    // Report weights
    cout << fixed << setprecision(4);
    cout << "Ridge weights (bias, w1, w2): ";
    for (double wi : w) cout << wi << ' ';
    cout << "\n";

    // Evaluate training RMSE
    vector<double> yhat = predict(X, w);
    double se = 0.0;
    for (int i = 0; i < n; ++i) se += (yhat[i] - y[i]) * (yhat[i] - y[i]);
    cout << "Train RMSE: " << sqrt(se / n) << "\n";
}

This program fits ridge regression using the closed-form normal equations with a Cholesky factorization. We build X^T X and X^T y, add λ to the diagonal (skipping the bias index to avoid penalizing the intercept), and solve (X^T X + λI)w = X^T y. The toy data introduces correlated features to highlight ridge’s stabilizing effect.

Time: O(nd² + d³) · Space: O(nd + d²)
Binary Logistic Regression with L2 via SGD (Weight Decay)
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Train logistic regression with L2 using SGD/minibatch weight decay
struct LogisticL2 {
    vector<double> w; // weights for features (excluding bias)
    double b = 0.0;   // bias (unpenalized)

    // Initialize weights to zeros
    explicit LogisticL2(size_t d) : w(d, 0.0), b(0.0) {}

    // One training epoch over data (minibatch SGD)
    void train_epoch(const vector<vector<double>>& X, const vector<int>& y,
                     double lr, double lambda, size_t batch_size = 32, mt19937* rng = nullptr) {
        size_t n = X.size();
        vector<size_t> idx(n);
        iota(idx.begin(), idx.end(), 0);
        if (rng) shuffle(idx.begin(), idx.end(), *rng);

        for (size_t start = 0; start < n; start += batch_size) {
            size_t end = min(n, start + batch_size);
            // Accumulate gradients on minibatch
            vector<double> gw(w.size(), 0.0);
            double gb = 0.0;
            for (size_t ii = start; ii < end; ++ii) {
                size_t i = idx[ii];
                // Model prediction: p = sigma(w^T x + b)
                double z = b;
                for (size_t j = 0; j < w.size(); ++j) z += w[j] * X[i][j];
                double p = sigmoid(z);
                // Labels y are in {0,1}; gradient of average cross-entropy: (p - y)
                double diff = p - static_cast<double>(y[i]);
                for (size_t j = 0; j < w.size(); ++j) gw[j] += diff * X[i][j];
                gb += diff;
            }
            double m = static_cast<double>(end - start);
            for (size_t j = 0; j < w.size(); ++j) gw[j] /= m;
            gb /= m;

            // Decoupled weight decay (equivalent to L2 for SGD): shrink weights, not bias
            for (size_t j = 0; j < w.size(); ++j) w[j] *= (1.0 - lr * lambda);

            // Gradient step
            for (size_t j = 0; j < w.size(); ++j) w[j] -= lr * gw[j];
            b -= lr * gb; // do not decay bias
        }
    }

    // Predict probability p(y=1|x)
    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict_label(const vector<double>& x, double thresh = 0.5) const {
        return predict_proba(x) >= thresh ? 1 : 0;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate a toy binary classification dataset in 2D
    int n = 1000; int d = 2;
    vector<vector<double>> X(n, vector<double>(d));
    vector<int> y(n);

    mt19937 rng(123);
    normal_distribution<double> ga(0.0, 1.0), gb(0.0, 1.0), noise(0.0, 0.5);
    // Two clusters separated roughly by a line
    for (int i = 0; i < n; ++i) {
        if (i < n/2) {
            X[i][0] = ga(rng) - 2.0; X[i][1] = ga(rng);
            y[i] = 0;
        } else {
            X[i][0] = gb(rng) + 2.0; X[i][1] = gb(rng);
            y[i] = 1;
        }
        // Add some noise to make it non-trivial
        X[i][0] += noise(rng); X[i][1] += noise(rng);
    }

    // Standardize features for fair L2 penalization
    for (int j = 0; j < d; ++j) {
        double mean = 0.0; for (int i = 0; i < n; ++i) mean += X[i][j]; mean /= n;
        double var = 0.0; for (int i = 0; i < n; ++i) { double t = X[i][j] - mean; var += t * t; }
        double stdv = sqrt(var / n + 1e-12);
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean) / stdv;
    }

    LogisticL2 clf(d);
    double lr = 0.1, lambda = 0.01;
    for (int epoch = 0; epoch < 30; ++epoch) {
        clf.train_epoch(X, y, lr, lambda, 64, &rng);
    }

    // Evaluate accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) correct += (clf.predict_label(X[i]) == y[i]);
    cout << fixed << setprecision(4);
    cout << "Train accuracy: " << (100.0 * correct / n) << "%\n";
    cout << "Weights: "; for (double wi : clf.w) cout << wi << ' '; cout << "| bias: " << clf.b << "\n";
}

This example trains a logistic regression classifier using minibatch SGD with L2 regularization implemented as weight decay. We standardize features, shrink weights each step by (1−ηλ), then apply the gradient of the cross-entropy loss. The bias term is not penalized. The code demonstrates how L2 appears as a simple multiplicative decay in iterative optimization.

Time: O(E · n · d) for E epochs (each minibatch update costs O(b · d) for batch size b) · Space: O(nd + d)
Tags: l2 regularization · ridge regression · weight decay · tikhonov regularization · gaussian prior · bias variance tradeoff · cholesky decomposition · logistic regression · cross validation · feature scaling · overfitting · multicollinearity · normal equations · sgd · map estimation