L1 Regularization (Lasso)
Key Points
- L1 regularization (Lasso) adds a penalty \(\lambda \lVert w \rVert_1\) to the loss, which pushes many coefficients exactly to zero and performs feature selection.
- The key computational tool behind Lasso is the soft-thresholding (shrinkage) operator, which zeroes small coefficients and reduces large ones by a constant amount.
- For squared-error regression, efficient solvers include coordinate descent and proximal gradient (ISTA), both of which have per-iteration cost roughly O(np) for dense data.
- Standardizing features and not penalizing the intercept are crucial practical steps to make Lasso behave well and to choose \(\lambda\) meaningfully.
- L1 is convex but not differentiable at zero; we use subgradients or proximal operators to optimize it reliably.
- Lasso works best when you expect only a small subset of features to be relevant, especially in high-dimensional settings.
- Choosing \(\lambda\) is typically done via cross-validation or information criteria; larger \(\lambda\) yields sparser models but more bias.
- If features are highly correlated, Lasso may arbitrarily pick one; elastic net (L1+L2) is more stable in such cases.
Prerequisites
- Linear algebra (vectors, matrices, norms) — Understanding matrix–vector products, norms, and eigenvalues is essential for expressing the Lasso objective and gradients.
- Convex optimization basics — Lasso is a convex but non-smooth problem; subgradients, proximal operators, and KKT conditions underpin the algorithms.
- Calculus and gradients — Computing and interpreting gradients of the squared loss is necessary for ISTA and analysis.
- Probability and statistics — Noise models, mean squared error, and cross-validation are important for modeling and hyperparameter selection.
- C++ programming (vectors, loops, numerics) — Implementing iterative solvers efficiently and correctly requires familiarity with arrays, loops, and numeric stability.
- Data preprocessing (centering, scaling) — Feature standardization strongly affects L1 behavior and the practical choice of \(\lambda\).
Detailed Explanation
01 Overview
L1 regularization, widely known as Lasso (Least Absolute Shrinkage and Selection Operator), augments a predictive model’s loss function with the sum of absolute values of the weights. For linear regression with parameters w and intercept b, Lasso solves a convex optimization problem that trades off goodness of fit with model simplicity. The penalty term \(\lambda \lVert w \rVert_1 = \lambda \sum_i |w_i|\) encourages sparsity: many coefficients become exactly zero. This property makes Lasso a powerful, built-in feature selection method, particularly useful when the number of features p is large compared to the number of samples n, or when interpretability matters.
Unlike L2 (ridge) regularization, which spreads shrinkage smoothly across all coefficients, L1 has a kink at zero, creating non-differentiability and enabling coefficients to be set exactly to zero. Optimization leverages subgradient methods or proximal operators, most notably the soft-thresholding function. Practical algorithms include coordinate descent, which updates one coefficient at a time via closed-form soft-thresholding, and proximal gradient (ISTA), which takes gradient steps on the smooth part followed by soft-thresholding. Careful preprocessing—centering/standardizing features and leaving the intercept unpenalized—makes solutions numerically stable and makes \(\lambda\) meaningfully comparable across problems.
02 Intuition & Analogies
Imagine you’re packing a suitcase with a strict baggage fee that charges a fixed amount for every distinct item, regardless of its weight. If two small items add little value but each incurs a fee, you’ll drop many of them and keep only the most useful ones. L1 regularization acts similarly on model coefficients: it adds a fixed “tax” proportional to the absolute size of each coefficient. If a coefficient’s benefit to prediction is small, paying its tax is not worth it, and the optimizer sets it to zero—leaving only a handful of valuable features in the model.
Another analogy is cleaning background noise from an audio track using a gate. The gate mutes any signal below a threshold, but lets louder signals through (with a slight volume reduction). Soft-thresholding, the proximal operator of L1, does exactly this to coefficients: numbers with magnitude below a threshold become zero; larger ones are reduced by a constant amount. This is different from L2 (ridge), which is more like turning down the overall volume smoothly—everything gets a bit quieter but nothing is fully muted.
From a geometric point of view, the L1 penalty forms a diamond-shaped constraint region (an \(\ell_1\) ball). When this region touches the elliptical contours of the loss, the corners of the diamond often line up with axes, making it likely that some coordinates are exactly zero. That geometric “corner effect” explains why L1 favors sparse solutions while remaining convex and computationally tractable.
03 Formal Definition
Given a design matrix \(X \in \mathbb{R}^{n \times p}\) with columns \(X_j\), a target vector \(y \in \mathbb{R}^n\), weights \(w \in \mathbb{R}^p\), and an unpenalized intercept \(b\), the Lasso solves
\[ \min_{w,\, b} \;\; \frac{1}{2n} \lVert Xw + b\mathbf{1} - y \rVert_2^2 \;+\; \lambda \lVert w \rVert_1, \qquad \lambda > 0. \]
The first term measures fit; the penalty \(\lambda \lVert w \rVert_1 = \lambda \sum_j |w_j|\) induces sparsity. The objective is convex (a convex quadratic plus a convex norm) but non-differentiable wherever some \(w_j = 0\), which is exactly why subgradients and proximal operators are needed.
04 When to Use
Use Lasso when you expect sparsity—only a small subset of features truly matters—or when interpretability and automatic feature selection are important. It is effective in high-dimensional settings (p comparable to or greater than n), such as genomics, text with bag-of-words features, or sensor selection. Lasso can reduce variance and improve generalization when unregularized models overfit.
Prefer Lasso over ridge when your goal is to zero out irrelevant features; prefer ridge when all features contribute a bit and multicollinearity is high. If features are strongly correlated and you want grouped behavior, consider elastic net (a mixture of L1 and L2) to avoid Lasso’s tendency to select a single representative feature from a correlated cluster. Lasso also integrates well with pipelines that require model compression or fast inference, since many weights become zero and can be pruned.
Choose \(\lambda\) by cross-validation or criteria like AIC/BIC; larger \(\lambda\) increases sparsity but also bias. Always standardize/center features and avoid penalizing the intercept. For non-squared losses (e.g., logistic), L1 still works via proximal gradient; coordinate descent remains effective when individual coordinate updates have closed forms or cheap 1D solves.
⚠️ Common Mistakes
- Not standardizing features: Without centering and scaling, features with large units dominate the penalty, leading to arbitrary selections. Always center columns and scale to unit variance (or equal \(\ell_2\) norm).
- Penalizing the intercept: The bias term should typically be unpenalized; penalizing it can shift predictions and degrade fit.
- Mis-tuning \(\lambda\): Too small gives overfitting with little sparsity; too large collapses the model to zeros. Use cross-validation and, if necessary, warm starts along a \(\lambda\)-path.
- Using plain gradient descent: L1 is non-differentiable; use subgradient, coordinate descent, or proximal gradient (ISTA/FISTA). Forgetting the proximal step results in incorrect updates.
- Ignoring correlated features: Lasso may pick one feature from a correlated group unpredictably. If stability is desired, use elastic net or preprocessing (e.g., PCA or grouping).
- Early stopping or loose tolerances: Insufficient convergence checks yield biased estimates. Track objective decrease or parameter change norms.
- Numerical drift in coordinate descent: If you don’t maintain and update the residual efficiently, recomputing from scratch each step is slow and can accumulate floating-point error. Keep a residual vector and update it incrementally.
- Comparing coefficients across unscaled problems: The magnitude of \(\lambda\) is not directly comparable across datasets unless features are standardized and the loss is normalized by n.
Key Formulas
Lasso objective (squared loss)
\[ \min_{w,\, b} \;\; \frac{1}{2n} \lVert Xw + b\mathbf{1} - y \rVert_2^2 + \lambda \lVert w \rVert_1 \]
Explanation: We minimize prediction error plus an L1 penalty on coefficients. The parameter \(\lambda \ge 0\) controls the trade-off between fit and sparsity.
L1 norm
\[ \lVert w \rVert_1 = \sum_{j=1}^{p} |w_j| \]
Explanation: The L1 norm sums the absolute values of all coefficients. It encourages many coefficients to be exactly zero when used as a penalty.
Soft-thresholding (prox of L1)
\[ S_\tau(z) = \operatorname{sign}(z)\, \max(|z| - \tau,\; 0) \]
Explanation: This operator shrinks each component toward zero by \(\tau\), setting it to zero if its magnitude is below \(\tau\). It is the key building block of L1 optimization.
Subgradient of absolute value
\[ \partial |w| = \begin{cases} \{\operatorname{sign}(w)\} & w \neq 0 \\ [-1,\, 1] & w = 0 \end{cases} \]
Explanation: Because \(|w|\) is not differentiable at 0, we use its subgradient. Any value between −1 and 1 is valid at zero; elsewhere, the subgradient is the sign.
Stationarity (KKT) for Lasso
\[ \frac{1}{n} X_j^\top (y - Xw - b\mathbf{1}) = \lambda\, s_j, \qquad s_j \in \partial |w_j| \]
Explanation: At the optimum, the negative gradient of the loss must be balanced by a subgradient of the L1 norm scaled by \(\lambda\): \(s_j = \operatorname{sign}(w_j)\) for nonzero coordinates, while zero coordinates only require the correlation magnitude to be at most \(\lambda\). This characterizes which coordinates are zero or nonzero.
Coordinate descent update
\[ w_j \leftarrow \frac{S_{\lambda}\!\left( X_j^\top r + \lVert X_j \rVert_2^2\, w_j \right)}{\lVert X_j \rVert_2^2}, \qquad r = y - Xw - b\mathbf{1} \]
Explanation: Holding other coordinates fixed, the one-dimensional subproblem has a closed-form solution via soft-thresholding (shown here for the unnormalized loss \(\tfrac12 \lVert y - Xw - b\mathbf{1} \rVert_2^2\), as in the first code example). The denominator rescales by the column’s squared norm.
ISTA update (no intercept shown)
\[ w^{(k+1)} = S_{t\lambda}\!\left( w^{(k)} - t\, \nabla f(w^{(k)}) \right), \qquad \nabla f(w) = \frac{1}{n} X^\top (Xw - y) \]
Explanation: Each iteration takes a gradient step on the smooth squared loss and then applies soft-thresholding with threshold \(t\lambda\). The step size \(t\) must be at most the inverse Lipschitz constant of the gradient, \(t \le 1/L\).
Lipschitz constant for squared loss
\[ L = \lambda_{\max}\!\left( \tfrac{1}{n} X^\top X \right) \]
Explanation: The gradient of the squared loss \(f(w) = \tfrac{1}{2n} \lVert Xw - y \rVert_2^2\) has Lipschitz constant equal to the largest eigenvalue of \(\tfrac{1}{n} X^\top X\). Using \(t \le 1/L\) guarantees convergence of ISTA.
Complexity Analysis
For dense data, one full coordinate-descent sweep costs O(np): each of the p coordinate updates touches the n-length residual once. One ISTA iteration also costs O(np), dominated by the two matrix–vector products \(Xw\) and \(X^\top e\). Memory is dominated by storing \(X\) at O(np); the solvers themselves need only O(n + p) extra space.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding operator S_tau(z) = sign(z) * max(|z|-tau, 0)
double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic dense dataset: n samples, p features
    int n = 200, p = 50, k_true = 5; // k_true nonzero weights in ground truth
    unsigned seed = 42;
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    vector<double> w_true(p, 0.0), y(n, 0.0);

    // Create sparse true weights
    for (int i = 0; i < k_true; ++i) w_true[i] = (i % 2 ? -1.5 : 2.0);

    // Generate features ~ N(0,1)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) X[i][j] = nd(rng);
    }

    // Standardize each column: zero mean, unit variance
    vector<double> meanX(p, 0.0), stdX(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) meanX[j] += X[i][j];
        meanX[j] /= n;
        for (int i = 0; i < n; ++i) stdX[j] += (X[i][j] - meanX[j]) * (X[i][j] - meanX[j]);
        stdX[j] = sqrt(stdX[j] / max(1, n - 1));
        if (stdX[j] == 0) stdX[j] = 1; // guard against zero variance
        for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - meanX[j]) / stdX[j];
    }

    // Build noisy response: y = X w_true + b_true + noise
    double b_true = 0.5;
    normal_distribution<double> noise(0.0, 0.5);
    for (int i = 0; i < n; ++i) {
        double yi = b_true;
        for (int j = 0; j < p; ++j) yi += X[i][j] * w_true[j];
        y[i] = yi + noise(rng);
    }

    // Lasso hyperparameters
    double lambda = 0.2;   // regularization strength
    int max_sweeps = 200;  // max coordinate descent sweeps
    double tol = 1e-6;     // stopping tolerance on max |delta w|

    // Initialize weights and intercept (unpenalized)
    vector<double> w(p, 0.0);
    double b = 0.0;

    // Residual r = y - (X w + b)
    vector<double> r = y; // since w=0, b=0 initially

    // Precompute column squared norms ||X_j||^2
    vector<double> colNorm2(p, 0.0);
    for (int j = 0; j < p; ++j) {
        for (int i = 0; i < n; ++i) colNorm2[j] += X[i][j] * X[i][j];
        if (colNorm2[j] == 0) colNorm2[j] = 1; // guard
    }

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        double max_change = 0.0;

        // Update intercept (unpenalized): b <- b + mean(r), r <- r - mean(r)
        double mean_r = accumulate(r.begin(), r.end(), 0.0) / n;
        b += mean_r;
        for (int i = 0; i < n; ++i) r[i] -= mean_r;

        // Update each coordinate with soft-thresholding
        for (int j = 0; j < p; ++j) {
            // rho_j = X_j^T r + ||X_j||^2 * w_j (partial residual trick)
            double rho = 0.0;
            for (int i = 0; i < n; ++i) rho += X[i][j] * r[i];
            rho += colNorm2[j] * w[j];

            double w_new = soft_threshold(rho, lambda) / colNorm2[j];
            double delta = w_new - w[j];
            if (delta != 0.0) {
                // Update residual: r <- r - X_j * delta
                for (int i = 0; i < n; ++i) r[i] -= X[i][j] * delta;
                w[j] = w_new;
                max_change = max(max_change, fabs(delta));
            }
        }

        if (max_change < tol) break; // converged
    }

    // Evaluate training MSE and sparsity
    double mse = 0.0; int nnz = 0;
    for (int i = 0; i < n; ++i) {
        double pred = b;
        for (int j = 0; j < p; ++j) pred += X[i][j] * w[j];
        double err = pred - y[i];
        mse += err * err;
    }
    mse /= n;
    for (int j = 0; j < p; ++j) if (fabs(w[j]) > 1e-8) ++nnz;

    cout << fixed << setprecision(6);
    cout << "Coordinate Descent Lasso results\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    double l1 = 0.0; for (double wi : w) l1 += fabs(wi);
    cout << "L1 norm of w: " << l1 << "\n";

    return 0;
}
```
This program generates a synthetic regression dataset, standardizes features, and solves the Lasso using coordinate descent. It maintains the residual vector to make each coordinate update O(n), uses the closed-form soft-thresholding update for each weight, and separately updates the intercept without penalty. The loop stops when parameter changes are sufficiently small. Finally, it reports sparsity, intercept, mean squared error, and the L1 norm of the solution.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding applied elementwise to a vector
void soft_threshold_vec(vector<double>& v, double tau) {
    for (double &z : v) {
        if (z > tau) z -= tau;
        else if (z < -tau) z += tau;
        else z = 0.0;
    }
}

// Multiply y = X * w (X: n x p, w: p)
void matvec_Xw(const vector<vector<double>>& X, const vector<double>& w, vector<double>& y) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < p; ++j) s += X[i][j] * w[j];
        y[i] = s;
    }
}

// Multiply g = (1/n) X^T * e
void grad_XT_e_over_n(const vector<vector<double>>& X, const vector<double>& e, vector<double>& g) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    g.assign(p, 0.0);
    for (int j = 0; j < p; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += X[i][j] * e[i];
        g[j] = s / n;
    }
}

// Power iteration to estimate largest eigenvalue of (1/n) X^T X
double estimate_L(const vector<vector<double>>& X, int iters = 50) {
    int n = (int)X.size();
    int p = (int)X[0].size();
    vector<double> v(p, 0.0), Xv(n, 0.0), w(p, 0.0);
    // Random init
    mt19937 rng(123);
    normal_distribution<double> nd(0.0, 1.0);
    for (int j = 0; j < p; ++j) v[j] = nd(rng);
    // Normalize v
    auto norm = [&](const vector<double>& a){ double s = 0; for (double x : a) s += x * x; return sqrt(s); };
    double nv = norm(v); if (nv == 0) nv = 1;
    for (double &x : v) x /= nv;

    double lambda = 0.0;
    for (int t = 0; t < iters; ++t) {
        // w = (1/n) X^T (X v)
        matvec_Xw(X, v, Xv);
        grad_XT_e_over_n(X, Xv, w);
        // Rayleigh quotient approximation
        lambda = 0.0;
        for (int j = 0; j < p; ++j) lambda += v[j] * w[j];
        // v <- w / ||w||
        nv = norm(w); if (nv == 0) break;
        for (int j = 0; j < p; ++j) v[j] = w[j] / nv;
    }
    // lambda approximates the largest eigenvalue of (1/n) X^T X
    return max(lambda, 1e-12); // guard against zero
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data
    int n = 300, p = 80, k_true = 7;
    mt19937 rng(7);
    normal_distribution<double> nd(0.0, 1.0);
    vector<vector<double>> X(n, vector<double>(p));
    vector<double> y(n), w_true(p, 0.0);

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // Standardize columns
    for (int j = 0; j < p; ++j) {
        double mu = 0, ss = 0;
        for (int i = 0; i < n; ++i) mu += X[i][j];
        mu /= n;
        for (int i = 0; i < n; ++i) { X[i][j] -= mu; ss += X[i][j] * X[i][j]; }
        double st = sqrt(ss / max(1, n - 1)); if (st == 0) st = 1;
        for (int i = 0; i < n; ++i) X[i][j] /= st;
    }

    // True sparse weights and noisy target
    for (int j = 0; j < k_true; ++j) w_true[j] = (j % 2 ? -2.0 : 2.5);
    double b_true = -0.3;
    normal_distribution<double> noise(0.0, 0.6);
    for (int i = 0; i < n; ++i) {
        double s = b_true;
        for (int j = 0; j < p; ++j) s += X[i][j] * w_true[j];
        y[i] = s + noise(rng);
    }

    // ISTA hyperparameters
    double lambda = 0.25;
    int max_iters = 300;
    double tol = 1e-6;

    // Estimate Lipschitz constant L for f(w) = (1/2n)||Xw - y||^2
    double L = estimate_L(X, 40);
    double t = 1.0 / L; // safe fixed step size

    vector<double> w(p, 0.0);
    double b = 0.0;

    vector<double> Xw(n, 0.0), e(n, 0.0), g(p, 0.0), w_new(p, 0.0);

    auto l1_norm = [&](const vector<double>& a){ double s = 0; for (double z : a) s += fabs(z); return s; };

    double prev_obj = numeric_limits<double>::infinity();
    for (int it = 0; it < max_iters; ++it) {
        // Residual e = Xw + b - y
        matvec_Xw(X, w, Xw);
        for (int i = 0; i < n; ++i) e[i] = Xw[i] + b - y[i];
        // Gradient g = (1/n) X^T e
        grad_XT_e_over_n(X, e, g);
        // Gradient step then prox (soft-thresholding)
        for (int j = 0; j < p; ++j) w_new[j] = w[j] - t * g[j];
        soft_threshold_vec(w_new, t * lambda);

        // Intercept update (unpenalized): b <- b - t * mean(e)
        double mean_e = accumulate(e.begin(), e.end(), 0.0) / n;
        double b_new = b - t * mean_e;

        // Check convergence via relative parameter change
        double diff = 0.0, normw = 0.0;
        for (int j = 0; j < p; ++j) { double d = w_new[j] - w[j]; diff += d * d; normw += w[j] * w[j]; }
        diff = sqrt(diff);
        double denom = max(1e-12, sqrt(normw));
        if (diff / denom < tol && fabs(b_new - b) < 1e-8) { w = w_new; b = b_new; break; }

        // Optional: objective monitoring
        matvec_Xw(X, w_new, Xw);
        double mse = 0.0;
        for (int i = 0; i < n; ++i) { double r = Xw[i] + b_new - y[i]; mse += r * r; }
        double obj = 0.5 * mse / n + lambda * l1_norm(w_new);
        if (prev_obj - obj < 1e-10) {
            // Tiny improvement; could also stop early here
        }
        prev_obj = obj;

        // Commit updates
        w.swap(w_new);
        b = b_new;
    }

    // Report results
    int nnz = 0; for (double z : w) if (fabs(z) > 1e-8) ++nnz;
    double mse = 0.0;
    matvec_Xw(X, w, Xw);
    for (int i = 0; i < n; ++i) { double r = Xw[i] + b - y[i]; mse += r * r; }
    mse /= n;

    cout << fixed << setprecision(6);
    cout << "ISTA Lasso results\n";
    cout << "Step size t: " << t << ", estimated L: " << L << "\n";
    cout << "Intercept b: " << b << "\n";
    cout << "Nonzeros in w: " << nnz << " / " << p << "\n";
    cout << "Training MSE: " << mse << "\n";
    cout << "L1 norm of w: " << l1_norm(w) << "\n";

    return 0;
}
```
This implementation solves the Lasso with proximal gradient descent (ISTA). It estimates a safe step size using power iteration to approximate the Lipschitz constant L of the squared-loss gradient, applies a gradient step, and then performs soft-thresholding. The intercept is updated via a standard gradient step and is not penalized. The algorithm stops when the parameter change is small or the objective stops improving.