L1 Regularization (Lasso)
Key Points
- L1 regularization (Lasso) adds a penalty \(\lambda \lVert w \rVert_1\) to the loss, which pushes many coefficients exactly to zero and performs feature selection.
- The key computational tool behind Lasso is the soft-thresholding (shrinkage) operator, which zeroes small coefficients and reduces large ones by a constant amount.
- For squared-error regression, efficient solvers include coordinate descent and proximal gradient (ISTA), both of which have per-iteration cost roughly O(np) for dense data.
- Standardizing features and not penalizing the intercept are crucial practical steps to make Lasso behave well and to choose \(\lambda\) meaningfully.
- L1 is convex but not differentiable at zero; we use subgradients or proximal operators to optimize it reliably.
- Lasso works best when you expect only a small subset of features to be relevant, especially in high-dimensional settings.
- Choosing \(\lambda\) is typically done via cross-validation or information criteria; larger \(\lambda\) yields sparser models but more bias.
- If features are highly correlated, Lasso may arbitrarily pick one; elastic net (L1+L2) is more stable in such cases.
Prerequisites
- Linear algebra (vectors, matrices, norms) — Understanding matrix–vector products, norms, and eigenvalues is essential for expressing the Lasso objective and gradients.
- Convex optimization basics — Lasso is a convex but non-smooth problem; subgradients, proximal operators, and KKT conditions underpin the algorithms.
- Calculus and gradients — Computing and interpreting gradients of the squared loss is necessary for ISTA and analysis.
- Probability and statistics — Noise models, mean squared error, and cross-validation are important for modeling and hyperparameter selection.
- C++ programming (vectors, loops, numerics) — Implementing iterative solvers efficiently and correctly requires familiarity with arrays, loops, and numeric stability.
- Data preprocessing (centering, scaling) — Feature standardization strongly affects L1 behavior and the practical choice of \(\lambda\).
Detailed Explanation
01 Overview
L1 regularization, widely known as Lasso (Least Absolute Shrinkage and Selection Operator), augments a predictive model’s loss function with the sum of absolute values of the weights. For linear regression with parameters w and intercept b, Lasso solves a convex optimization problem that trades off goodness of fit with model simplicity. The penalty term \(\lambda \lVert w \rVert_1 = \lambda \sum_i |w_i|\) encourages sparsity: many coefficients become exactly zero. This property makes Lasso a powerful, built-in feature selection method, particularly useful when the number of features p is large compared to the number of samples n, or when interpretability matters.
Unlike L2 (ridge) regularization, which spreads shrinkage smoothly across all coefficients, L1 has a kink at zero, creating non-differentiability and enabling coefficients to be set exactly to zero. Optimization leverages subgradient methods or proximal operators, most notably the soft-thresholding function. Practical algorithms include coordinate descent, which updates one coefficient at a time via closed-form soft-thresholding, and proximal gradient (ISTA), which takes gradient steps on the smooth part followed by soft-thresholding. Careful preprocessing—centering/standardizing features and leaving the intercept unpenalized—makes solutions numerically stable and makes \(\lambda\) meaningfully comparable across problems.
02 Intuition & Analogies
Imagine you’re packing a suitcase with a strict baggage fee that charges a fixed amount for every distinct item, regardless of its weight. If two small items add little value but each incurs a fee, you’ll drop many of them and keep only the most useful ones. L1 regularization acts similarly on model coefficients: it adds a fixed “tax” proportional to the absolute size of each coefficient. If a coefficient’s benefit to prediction is small, paying its tax is not worth it, and the optimizer sets it to zero—leaving only a handful of valuable features in the model.
Another analogy is cleaning background noise from an audio track using a gate. The gate mutes any signal below a threshold, but lets louder signals through (with a slight volume reduction). Soft-thresholding, the proximal operator of L1, does exactly this to coefficients: numbers with magnitude below a threshold become zero; larger ones are reduced by a constant amount. This is different from L2 (ridge), which is more like turning down the overall volume smoothly—everything gets a bit quieter but nothing is fully muted.
From a geometric point of view, the L1 penalty forms a diamond-shaped constraint region (an \(\ell_1\) ball). When this region touches the elliptical contours of the loss, the corners of the diamond often line up with axes, making it likely that some coordinates are exactly zero. That geometric “corner effect” explains why L1 favors sparse solutions while remaining convex and computationally tractable.
03 Formal Definition
Given a design matrix \(X \in \mathbb{R}^{n \times p}\) with columns \(X_j\), a target vector \(y \in \mathbb{R}^n\), weights \(w \in \mathbb{R}^p\), and an unpenalized intercept \(b\), the Lasso solves
\[ \min_{w,\, b} \;\; \frac{1}{2n} \lVert Xw + b\mathbf{1} - y \rVert_2^2 \;+\; \lambda \lVert w \rVert_1, \qquad \lambda > 0. \]
The first term measures fit; the penalty \(\lambda \lVert w \rVert_1 = \lambda \sum_j |w_j|\) induces sparsity. The objective is convex (a convex quadratic plus a convex norm) but non-differentiable wherever some \(w_j = 0\), which is exactly why subgradients and proximal operators are needed.
04 When to Use
Use Lasso when you expect sparsity—only a small subset of features truly matters—or when interpretability and automatic feature selection are important. It is effective in high-dimensional settings (p comparable to or greater than n), such as genomics, text with bag-of-words features, or sensor selection. Lasso can reduce variance and improve generalization when unregularized models overfit.
Prefer Lasso over ridge when your goal is to zero out irrelevant features; prefer ridge when all features contribute a bit and multicollinearity is high. If features are strongly correlated and you want grouped behavior, consider elastic net (a mixture of L1 and L2) to avoid Lasso’s tendency to select a single representative feature from a correlated cluster. Lasso also integrates well with pipelines that require model compression or fast inference, since many weights become zero and can be pruned.
Choose \(\lambda\) by cross-validation or criteria like AIC/BIC; larger \(\lambda\) increases sparsity but also bias. Always standardize/center features and avoid penalizing the intercept. For non-squared losses (e.g., logistic), L1 still works via proximal gradient; coordinate descent remains effective when individual coordinate updates have closed forms or cheap 1D solves.
⚠️ Common Mistakes
- Not standardizing features: Without centering and scaling, features with large units dominate the penalty, leading to arbitrary selections. Always center columns and scale to unit variance (or equal \(\ell_2\) norm).
- Penalizing the intercept: The bias term should typically be unpenalized; penalizing it can shift predictions and degrade fit.
- Mis-tuning \(\lambda\): Too small gives overfitting with little sparsity; too large collapses the model to zeros. Use cross-validation and, if necessary, warm starts along a \(\lambda\)-path.
- Using plain gradient descent: L1 is non-differentiable; use subgradient, coordinate descent, or proximal gradient (ISTA/FISTA). Forgetting the proximal step results in incorrect updates.
- Ignoring correlated features: Lasso may pick one feature from a correlated group unpredictably. If stability is desired, use elastic net or preprocessing (e.g., PCA or grouping).
- Early stopping or loose tolerances: Insufficient convergence checks yield biased estimates. Track objective decrease or parameter change norms.
- Numerical drift in coordinate descent: If you don’t maintain and update the residual efficiently, recomputing from scratch each step is slow and can accumulate floating-point error. Keep a residual vector and update it incrementally.
- Comparing coefficients across unscaled problems: The magnitude of \(\lambda\) is not directly comparable across datasets unless features are standardized and the loss is normalized by n.
Key Formulas
Lasso objective (squared loss)
\[ \min_{w,\, b} \;\; \frac{1}{2n} \lVert Xw + b\mathbf{1} - y \rVert_2^2 + \lambda \lVert w \rVert_1 \]
Explanation: We minimize prediction error plus an L1 penalty on coefficients. The parameter \(\lambda \ge 0\) controls the trade-off between fit and sparsity.
L1 norm
\[ \lVert w \rVert_1 = \sum_{j=1}^{p} |w_j| \]
Explanation: The L1 norm sums the absolute values of all coefficients. It encourages many coefficients to be exactly zero when used as a penalty.
Soft-thresholding (prox of L1)
\[ S_\tau(z) = \operatorname{sign}(z)\, \max(|z| - \tau,\; 0) \]
Explanation: This operator shrinks each component toward zero by \(\tau\), setting it to zero if its magnitude is below \(\tau\). It is the key building block of L1 optimization.
Subgradient of absolute value
\[ \partial |w| = \begin{cases} \{\operatorname{sign}(w)\} & w \neq 0 \\ [-1,\, 1] & w = 0 \end{cases} \]
Explanation: Because \(|w|\) is not differentiable at 0, we use its subgradient. Any value between −1 and 1 is valid at zero; elsewhere, the subgradient is the sign.
Stationarity (KKT) for Lasso
\[ \frac{1}{n} X_j^\top (y - Xw - b\mathbf{1}) = \lambda\, s_j, \qquad s_j \in \partial |w_j| \]
Explanation: At the optimum, the negative gradient of the loss must be balanced by a subgradient of the L1 norm scaled by \(\lambda\): \(s_j = \operatorname{sign}(w_j)\) for nonzero coordinates, while zero coordinates only require the correlation magnitude to be at most \(\lambda\). This characterizes which coordinates are zero or nonzero.
Coordinate descent update
\[ w_j \leftarrow \frac{S_{\lambda}\!\left( X_j^\top r + \lVert X_j \rVert_2^2\, w_j \right)}{\lVert X_j \rVert_2^2}, \qquad r = y - Xw - b\mathbf{1} \]
Explanation: Holding other coordinates fixed, the one-dimensional subproblem has a closed-form solution via soft-thresholding (shown here for the unnormalized loss \(\tfrac12 \lVert y - Xw - b\mathbf{1} \rVert_2^2\), as in the first code example). The denominator rescales by the column’s squared norm.
ISTA update (no intercept shown)
\[ w^{(k+1)} = S_{t\lambda}\!\left( w^{(k)} - t\, \nabla f(w^{(k)}) \right), \qquad \nabla f(w) = \frac{1}{n} X^\top (Xw - y) \]
Explanation: Each iteration takes a gradient step on the smooth squared loss and then applies soft-thresholding with threshold \(t\lambda\). The step size \(t\) must be at most the inverse Lipschitz constant of the gradient, \(t \le 1/L\).
Lipschitz constant for squared loss
\[ L = \lambda_{\max}\!\left( \tfrac{1}{n} X^\top X \right) \]
Explanation: The gradient of the squared loss \(f(w) = \tfrac{1}{2n} \lVert Xw - y \rVert_2^2\) has Lipschitz constant equal to the largest eigenvalue of \(\tfrac{1}{n} X^\top X\). Using \(t \le 1/L\) guarantees convergence of ISTA.
Complexity Analysis
For dense data, one full coordinate-descent sweep costs O(np): each of the p coordinate updates touches the n-length residual once. One ISTA iteration also costs O(np), dominated by the two matrix–vector products \(Xw\) and \(X^\top e\). Memory is dominated by storing \(X\) at O(np); the solvers themselves need only O(n + p) extra space.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding operator S_tau(z) = sign(z) * max(|z|-tau, 0)
double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic dense dataset: n samples, p features
    int n = 200, p = 50, k_true = 5; // k_true nonzero weights in ground truth
    unsigned seed = 42;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    vector<double> w_true(p, 0.0), y(n, 0.0);

    // Create sparse true weights
    for (int i = 0; i < k_true; ++i) w_true[i] = (i % 2 ? -1.5 : 2.0);

    // Generate features ~ N(0,1)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) X[i][j] = nd(rng);
    }

    // Standardize each column: zero mean, unit variance
    vector<double> meanX(p, 0.0), stdX(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) meanX[j] += X[i][j];
        meanX[j] /= n;
        for (int i = 0; i < n; ++i) stdX[j] += (X[i][j] - meanX[j]) * (X[i][j] - meanX[j]);
        stdX[j] = sqrt(stdX[j] / max(1, n - 1));
        if (stdX[j] == 0) stdX[j] = 1; // guard against zero variance
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - meanX[j]) / stdX[j];
    }

    // Build noisy response: y = X w_true + b_true + noise
    double b_true = 0.5;
    normal_distribution<double> noise(0.0, 0.5);
    for (int i = 0; i < n; ++i) {
        double yi = b_true;
        for (int j = 0; j < p; ++j) yi += X[i][j] * w_true[j];
        y[i] = yi + noise(rng);
    }

    // Lasso hyperparameters
    double lambda = 0.2;   // regularization strength
    int max_sweeps = 200;  // max coordinate descent sweeps
    double tol = 1e-6;     // stopping tolerance on max |delta w|

    // Initialize weights and intercept (unpenalized)
    vector<double> w(p, 0.0);
    double b = 0.0;

    // Residual r = y - (X w + b)
    vector<double> r = y; // since w=0, b=0 initially

    // Precompute column squared norms ||X_j||^2
    vector<double> colNorm2(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) colNorm2[j] += X[i][j] * X[i][j];
        if (colNorm2[j] == 0) colNorm2[j] = 1; // guard
    }

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        double max_change = 0.0;

        // Update intercept (unpenalized): b <- b + mean(r), r <- r - mean(r)
        double mean_r = accumulate(r.begin(), r.end(), 0.0) / n;
        b += mean_r;
        for (int i = 0; i < n; ++i) r[i] -= mean_r;

        // Update each coordinate with soft-thresholding
        for (int j = 0; j < p; ++j) {
            // rho_j = X_j^T r + ||X_j||^2 * w_j (partial residual trick)
            double rho = 0.0;
            for (int i = 0; i < n; ++i) rho += X[i][j] * r[i];
            rho += colNorm2[j] * w[j];

            double w_new = soft_threshold(rho, lambda) / colNorm2[j];
            double delta = w_new - w[j];
            if (delta != 0.0) {
                // Update residual: r <- r - X_j * delta
                for (int i = 0; i < n; ++i) r[i] -= X[i][j] * delta;
                w[j] = w_new;
                max_change = max(max_change, fabs(delta));
            }
        }

        if (max_change < tol) break; // converged
    }

    // Evaluate training MSE and sparsity
    double mse = 0.0; int nnz = 0;
    for (int i = 0; i < n; ++i) {
        double pred = b;
        for (int j = 0; j < p; ++j) pred += X[i][j] * w[j];
        double err = pred - y[i];
        mse += err * err;
    }
    mse /= n;
    for (int j = 0; j < p; ++j) if (fabs(w[j]) > 1e-8) ++nnz;

    cout << fixed << setprecision(6);
    cout << "Coordinate Descent Lasso results\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    double l1 = 0.0; for (double wi : w) l1 += fabs(wi);
    cout << "L1 norm of w: " << l1 << "\n";

    return 0;
}
```
This program generates a synthetic regression dataset, standardizes features, and solves the Lasso using coordinate descent. It maintains the residual vector to make each coordinate update O(n), uses the closed-form soft-thresholding update for each weight, and separately updates the intercept without penalty. The loop stops when parameter changes are sufficiently small. Finally, it reports sparsity, intercept, mean squared error, and the L1 norm of the solution.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding applied elementwise to a vector
void soft_threshold_vec(vector<double>& v, double tau) {
    for (double &z : v) {
        if (z > tau) z -= tau;
        else if (z < -tau) z += tau;
        else z = 0.0;
    }
}

// Multiply y = X * w (X: n x p, w: p)
void matvec_Xw(const vector<vector<double>>& X, const vector<double>& w, vector<double>& y) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < p; ++j) s += X[i][j] * w[j];
        y[i] = s;
    }
}

// Multiply g = (1/n) X^T * e
void grad_XT_e_over_n(const vector<vector<double>>& X, const vector<double>& e, vector<double>& g) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    g.assign(p, 0.0);
    for (int j = 0; j < p; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += X[i][j] * e[i];
        g[j] = s / n;
    }
}

// Power iteration to estimate largest eigenvalue of (1/n) X^T X
double estimate_L(const vector<vector<double>>& X, int iters = 50) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    vector<double> v(p, 0.0), Xv(n, 0.0), w(p, 0.0);
    // Random init
    mt19937 rng(123);
    normal_distribution<double> nd(0.0, 1.0);
    for (int j = 0; j < p; ++j) v[j] = nd(rng);
    // Normalize v
    auto norm = [&](const vector<double>& a){ double s = 0; for (double x : a) s += x * x; return sqrt(s); };
    double nv = norm(v); if (nv == 0) nv = 1;
    for (double &x : v) x /= nv;

    double lambda = 0.0;
    for (int t = 0; t < iters; ++t) {
        // w = (1/n) X^T (X v)
        matvec_Xw(X, v, Xv);
        grad_XT_e_over_n(X, Xv, w);
        // Rayleigh quotient approximation
        lambda = 0.0;
        for (int j = 0; j < p; ++j) lambda += v[j] * w[j];
        // v <- w / ||w||
        nv = norm(w); if (nv == 0) break;
        for (int j = 0; j < p; ++j) v[j] = w[j] / nv;
    }
    // lambda approximates the largest eigenvalue of (1/n) X^T X
    return max(lambda, 1e-12); // guard against zero
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data
    int n = 300, p = 80, k_true = 7;
    mt19937 rng(7);
    normal_distribution<double> nd(0.0, 1.0);
    vector<vector<double>> X(n, vector<double>(p));
    vector<double> y(n), w_true(p, 0.0);

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // Standardize columns
    for (int j = 0; j < p; ++j) {
        double mu = 0, ss = 0;
        for (int i = 0; i < n; ++i) mu += X[i][j];
        mu /= n;
        for (int i = 0; i < n; ++i) { X[i][j] -= mu; ss += X[i][j] * X[i][j]; }
        double st = sqrt(ss / max(1, n - 1)); if (st == 0) st = 1;
        for (int i = 0; i < n; ++i) X[i][j] /= st;
    }

    // True sparse weights and noisy target
    for (int j = 0; j < k_true; ++j) w_true[j] = (j % 2 ? -2.0 : 2.5);
    double b_true = -0.3;
    normal_distribution<double> noise(0.0, 0.6);
    for (int i = 0; i < n; ++i) {
        double s = b_true;
        for (int j = 0; j < p; ++j) s += X[i][j] * w_true[j];
        y[i] = s + noise(rng);
    }

    // ISTA hyperparameters
    double lambda = 0.25;
    int max_iters = 300;
    double tol = 1e-6;

    // Estimate Lipschitz constant L for f(w) = (1/2n)||Xw - y||^2
    double L = estimate_L(X, 40);
    double t = 1.0 / L; // safe fixed step size

    vector<double> w(p, 0.0);
    double b = 0.0;

    vector<double> Xw(n, 0.0), e(n, 0.0), g(p, 0.0), w_new(p, 0.0);

    auto l1_norm = [&](const vector<double>& a){ double s = 0; for (double z : a) s += fabs(z); return s; };

    double prev_obj = numeric_limits<double>::infinity();
    for (int it = 0; it < max_iters; ++it) {
        // Residual e = Xw + b - y
        matvec_Xw(X, w, Xw);
        for (int i = 0; i < n; ++i) e[i] = Xw[i] + b - y[i];
        // Gradient g = (1/n) X^T e
        grad_XT_e_over_n(X, e, g);
        // Gradient step then prox (soft-thresholding)
        for (int j = 0; j < p; ++j) w_new[j] = w[j] - t * g[j];
        soft_threshold_vec(w_new, t * lambda);

        // Intercept update (unpenalized): b <- b - t * mean(e)
        double mean_e = accumulate(e.begin(), e.end(), 0.0) / n;
        double b_new = b - t * mean_e;

        // Check convergence via relative parameter change
        double diff = 0.0, normw = 0.0;
        for (int j = 0; j < p; ++j) { double d = w_new[j] - w[j]; diff += d * d; normw += w[j] * w[j]; }
        diff = sqrt(diff);
        double denom = max(1e-12, sqrt(normw));
        if (diff / denom < tol && fabs(b_new - b) < 1e-8) { w = w_new; b = b_new; break; }

        // Optional: objective monitoring
        matvec_Xw(X, w_new, Xw);
        double mse = 0.0;
        for (int i = 0; i < n; ++i) { double r = Xw[i] + b_new - y[i]; mse += r * r; }
        double obj = 0.5 * mse / n + lambda * l1_norm(w_new);
        if (prev_obj - obj < 1e-10) {
            // Tiny improvement; could also stop early here
        }
        prev_obj = obj;

        // Commit updates
        w.swap(w_new);
        b = b_new;
    }

    // Report results
    int nnz = 0; for (double z : w) if (fabs(z) > 1e-8) ++nnz;
    double mse = 0.0;
    matvec_Xw(X, w, Xw);
    for (int i = 0; i < n; ++i) { double r = Xw[i] + b - y[i]; mse += r * r; }
    mse /= n;

    cout << fixed << setprecision(6);
    cout << "ISTA Lasso results\n";
    cout << "Step size t: " << t << ", estimated L: " << L << "\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    cout << "L1 norm of w: " << l1_norm(w) << "\n";

    return 0;
}
```
This implementation solves the Lasso with proximal gradient descent (ISTA). It estimates a safe step size using power iteration to approximate the Lipschitz constant L of the squared-loss gradient, applies a gradient step, and then performs soft-thresholding. The intercept is updated via a standard gradient step and is not penalized. The algorithm stops when the parameter change is small or the objective stops improving.