
Elastic Net Regularization

Key Points

  • Elastic Net regularization combines L1 (Lasso) and L2 (Ridge) penalties to produce models that are both sparse and stable.
  • It is especially effective when features are correlated, encouraging grouped selection rather than picking just one feature from a correlated set.
  • The optimization remains convex, so a global minimum exists and efficient algorithms like coordinate descent and proximal gradient work well.
  • Standardizing features (zero mean, unit variance) is crucial so the penalty treats all coefficients fairly.
  • The mixing of penalties can be written with two parameters (λ1, λ2) or one strength λ with a mixing factor α, where λ1 = λα and λ2 = λ(1−α).
  • For linear regression, coordinate descent has simple closed-form updates using soft-thresholding.
  • For logistic regression and other smooth losses, proximal gradient descent applies a gradient step (including the L2 term) followed by soft-thresholding (for the L1 term).
  • Hyperparameters are typically tuned via cross-validation to balance bias, variance, and sparsity.

Prerequisites

  • Linear and logistic regression — Elastic Net is applied to these models by adding penalties to their loss functions.
  • Vector norms (L1 and L2) — Understanding how \(\lVert w \rVert_1\) induces sparsity and \(\lVert w \rVert_2^2\) provides shrinkage explains Elastic Net’s behavior.
  • Convex optimization basics — Elastic Net uses convex objectives; knowing gradients, subgradients, and optimality conditions helps understand solvers.
  • Gradient descent and proximal methods — Implementation for the non-smooth L1 penalty relies on proximal steps or coordinate descent.
  • Feature standardization — Scaling ensures the penalty treats all coefficients comparably and improves numerical stability.
  • Cross-validation — Choosing λ and α requires validation to balance bias, variance, and sparsity.

Detailed Explanation


01 Overview

Elastic Net regularization is a technique for preventing overfitting by adding a penalty to the model’s loss function that blends the strengths of L1 (Lasso) and L2 (Ridge) penalties. In predictive modeling, especially with many features (possibly more than observations), models can fit noise rather than signal. L1 regularization encourages sparsity by driving many coefficients to exactly zero, effectively selecting features. L2 regularization spreads shrinkage across coefficients, stabilizing estimates and handling multicollinearity. Elastic Net combines these: it promotes sparsity while also grouping correlated features, often leading to better generalization than using L1 or L2 alone. The method is widely used in linear and logistic regression, and it extends to generalized linear models and beyond. Because the objective remains convex, a variety of efficient solvers—coordinate descent for squared loss and proximal gradient methods for smooth losses—can find globally optimal solutions. Hyperparameters controlling the balance and strength of the penalties are typically selected using k-fold cross-validation. In practice, Elastic Net has become a robust default when you have many, potentially correlated predictors and you need a model that is interpretable (sparse), stable, and accurate.

02 Intuition & Analogies

Imagine packing a backpack for a hike. You want to carry only the items that matter (sparsity) but also distribute weight evenly so the bag is comfortable (stability). L1 regularization is like a strict baggage fee that forces you to drop many items entirely; it drives some weights to zero, leaving a light, simple pack. L2 regularization is like using elastic straps around the backpack: it doesn’t remove items, but it presses everything closer to the center, reducing the influence of any single item. Now think about carrying two highly similar items—say, two near-identical water bottles. L1 alone often makes you keep one and drop the other, which can be unstable if they’re nearly indistinguishable. L2 alone keeps both but shrinks them, which may keep redundancy. Elastic Net combines the baggage fee and the elastic straps. It removes truly unnecessary items but, when two items are redundant, it tends to keep them together at reduced weight (the “grouping effect”). This makes the final pack both lean and balanced. In data terms, when features are correlated, Elastic Net tends to keep correlated features together with similar coefficients, rather than arbitrarily selecting one. This leads to more stable models across different samples. The penalty strength (λ) controls how aggressively you pack light, and the mixing factor (α) controls how much you care about dropping items entirely (L1) versus compressing them (L2).

03 Formal Definition

For a parameter vector \(w \in \mathbb{R}^p\) and intercept \(b\), Elastic Net minimizes a penalized loss. For linear regression with squared loss, the objective is: \[ \min_{w,b} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - b - x_i^\top w)^2 + \lambda_1 \sum_{j=1}^{p} |w_j| + \frac{\lambda_2}{2} \lVert w \rVert_2^2. \] Here, \(\lambda_1 \ge 0\) controls sparsity via the L1 norm (sum of absolute values), and \(\lambda_2 \ge 0\) controls shrinkage via the squared L2 norm. A common reparameterization uses a single strength \(\lambda\) and a mixing factor \(\alpha \in [0,1]\): \[ \lambda_1 = \lambda \alpha, \qquad \lambda_2 = \lambda (1-\alpha). \] The objective remains convex because the squared loss and the L2 term are smooth and convex, and the L1 norm is convex (though non-smooth). For generalized linear models, replace the squared loss with a convex differentiable loss \(\ell\) (e.g., logistic loss), yielding: \[ \min_{w,b} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, b + x_i^\top w) + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2. \] Optimization typically uses coordinate descent (for squared loss) or proximal gradient methods (for smooth losses), leveraging the separability of the L1 penalty via soft-thresholding.

04 When to Use

Use Elastic Net when you have many features, especially with substantial correlation among them, and you want both interpretability and robustness. Typical scenarios include genomics (many correlated gene expression measurements), text features (high-dimensional sparse vectors with co-occurring terms), and finance (highly correlated indicators). If pure Lasso is unstable because it arbitrarily selects one among correlated predictors, Elastic Net’s L2 component helps keep groups together, improving stability and predictive performance. When you need exact zeros for feature selection, keep a nonzero L1 component (α > 0). When overfitting is severe but you still value keeping related predictors, choose a moderate α (e.g., 0.2–0.8) and tune λ by cross-validation. If p > n (more features than samples), Elastic Net often outperforms Ridge and can outperform Lasso due to its grouping effect. It is also useful when you plan to compute regularization paths: sweeping λ while monitoring validation performance offers a clear bias–variance–sparsity trade-off. Remember to standardize features so that the same penalty scale applies across coefficients; otherwise, features with larger scales are penalized disproportionately.

⚠️Common Mistakes

  • Skipping feature standardization: Without zero-mean, unit-variance scaling, coefficients tied to large-scale features are over-penalized, biasing selection. Always standardize X (but do not penalize the intercept).
  • Penalizing the intercept: The intercept should be free; include it in the model but exclude it from the L1/L2 penalties.
  • Using plain gradient descent for L1: The L1 term is non-smooth; use coordinate descent or proximal gradient with soft-thresholding. A vanilla gradient step on |w| is undefined at zero.
  • Ignoring correlation structure: Pure Lasso may drop useful correlated predictors; if you see instability across folds, increase the L2 share (lower α) to encourage grouping.
  • Mis-scaling λ with sample size: When comparing across datasets with different n, remember the loss is usually averaged over n; keep λ on a comparable scale or re-tune via cross-validation.
  • Early stopping without convergence checks: For coordinate descent, monitor the maximum coefficient change; for proximal methods, track the objective decrease or the gradient-mapping norm.
  • Data leakage during scaling or CV: Fit scalers on training folds only, then transform validation folds; re-fit on all training data after model selection.

Key Formulas

Elastic Net Objective (Linear)

\[ \min_{w,b} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - b - x_i^\top w)^2 + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]

Explanation: Penalizes the mean squared error with a combination of L1 and L2 terms. The L1 induces sparsity while the L2 stabilizes and shrinks coefficients.

Elastic Net Objective (General)

\[ \min_{w,b} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, b + x_i^\top w) + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2 \]

Explanation: Generalizes to any convex differentiable loss such as logistic loss. Optimization uses proximal methods due to the non-smooth L1 term.

Mixing Parameterization

\[ \lambda_1 = \lambda \alpha, \qquad \lambda_2 = \lambda (1-\alpha), \qquad \alpha \in [0,1] \]

Explanation: Uses a single overall strength λ and a mixing factor α to balance L1 vs L2. α = 1 gives Lasso, α = 0 gives Ridge, and in-between gives Elastic Net.

Soft-thresholding Operator

\[ S(z, \tau) = \operatorname{sgn}(z) \max(|z| - \tau, 0) \]

Explanation: This is the proximal operator of the L1 norm. It zeroes small values and shrinks large ones by τ, enabling sparse solutions.

Coordinate Descent Update (Squared Loss)

\[ w_j \leftarrow \frac{S\!\left( \frac{1}{n} \sum_{i=1}^{n} x_{ij} \left( y_i - b - \sum_{k \neq j} x_{ik} w_k \right), \; \lambda_1 \right)}{\frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 + \lambda_2} \]

Explanation: Closed-form update for a single coefficient given others fixed. The numerator is a soft-thresholded partial residual correlation; the denominator includes data curvature plus the L2 penalty.

Logistic Loss Gradient

\[ \nabla_w \left[ \frac{1}{n} \sum_{i=1}^{n} \log\!\left( 1 + e^{-y_i (b + x_i^\top w)} \right) \right] = -\frac{1}{n} \sum_{i=1}^{n} y_i x_i \, \sigma\!\left( -y_i (b + x_i^\top w) \right) \]

Explanation: Gives the gradient of the smooth logistic data term. For Elastic Net, add \(\lambda_2 w\) to this gradient and handle the \(\lambda_1\) term via the proximal soft-thresholding step.

Proximal Operator for L1

\[ \operatorname{prox}_{\eta \lambda_1 \lVert \cdot \rVert_1}(v) = \arg\min_{w} \frac{1}{2\eta} \lVert w - v \rVert_2^2 + \lambda_1 \lVert w \rVert_1 = S(v, \eta \lambda_1) \]

Explanation: In proximal gradient, after a gradient step produces v, apply soft-thresholding with threshold \(\eta \lambda_1\) to obtain sparsity.

Optimality (KKT) Condition

\[ 0 \in \nabla f(w) + \lambda_2 w + \lambda_1 \partial \lVert w \rVert_1 \]

Explanation: At the solution, the gradient of the smooth part plus ridge term is balanced by a subgradient of the L1 norm. This characterizes optimality in convex Elastic Net problems.

Complexity Analysis

Let n be the number of samples, p the number of features, and T the number of outer iterations (epochs). For linear regression with squared loss, cyclic coordinate descent updates each coefficient with O(n) work if you maintain the residual vector and column-wise statistics. A full epoch costs O(n p). Precomputing per-column squared norms costs O(n p) once. Memory is O(n + p) for dense data (storing residuals and coefficients), plus O(n p) to store X unless it is streamed or sparse. For sparse feature matrices with nnz nonzeros, the per-epoch cost becomes O(nnz). Convergence is typically linear for strongly convex objectives (λ2 > 0), and the number of epochs T needed for a given tolerance depends on conditioning and feature correlations.

For logistic regression or other smooth losses, proximal gradient descent performs one gradient computation per iteration. Computing the gradient requires matrix–vector products: O(n p) for dense X (or O(nnz) for sparse). Each proximal step (soft-thresholding) is O(p). Thus each iteration is O(n p + p), dominated by O(n p). With a conservative fixed step size based on a Lipschitz constant (e.g., \(L \approx \tfrac{1}{4n} \max_j \lVert x_j \rVert^2 + \lambda_2\)), basic proximal gradient converges at rate O(1/k) in objective gap, or O(1/k²) with acceleration (FISTA). Memory remains O(n p) for X plus O(p) for parameters and O(n) for intermediate vectors.

In practice, coordinate descent often outperforms proximal methods on squared loss due to cheap closed-form updates and better cache locality. For generalized losses, proximal methods are preferred unless specialized coordinate updates are available. Cross-validation multiplies runtime roughly by the number of folds times the number of (λ, α) settings explored; warm starts along a regularization path can substantially reduce total cost.

Code Examples

Elastic Net Linear Regression via Coordinate Descent (Squared Loss)
#include <bits/stdc++.h>
using namespace std;

// Soft-thresholding operator S(z, tau) = sign(z) * max(|z| - tau, 0)
static inline double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

struct StandardScaler {
    vector<double> mean, stdv;
    // Fit on X (n x p) and transform in-place (intercept column excluded)
    void fit_transform(vector<vector<double>>& X) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        mean.assign(p, 0.0);
        stdv.assign(p, 0.0);
        for (int j = 0; j < p; ++j) {
            for (int i = 0; i < n; ++i) mean[j] += X[i][j];
            mean[j] /= n;
            for (int i = 0; i < n; ++i) stdv[j] += (X[i][j] - mean[j]) * (X[i][j] - mean[j]);
            stdv[j] = sqrt(stdv[j] / max(1, n - 1));
            if (stdv[j] == 0.0) stdv[j] = 1.0; // avoid divide-by-zero
            for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean[j]) / stdv[j];
        }
    }
};

struct ElasticNetCD {
    int max_epochs = 2000;
    double tol = 1e-6;
    double lambda1; // L1 strength
    double lambda2; // L2 strength

    ElasticNetCD(double l1, double l2) : lambda1(l1), lambda2(l2) {}

    // Fits: minimize (1/n) * ||y - b - X w||^2 + lambda1 * ||w||_1 + (lambda2/2) * ||w||_2^2
    // Returns pair {intercept b, weights w}
    pair<double, vector<double>> fit(const vector<vector<double>>& X, const vector<double>& y) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        vector<double> w(p, 0.0);
        vector<double> r = y; // residual r = y - b - Xw
        double b = 0.0;

        // Initialize intercept as mean(y)
        double y_mean = accumulate(y.begin(), y.end(), 0.0) / n;
        b = y_mean;
        for (int i = 0; i < n; ++i) r[i] = y[i] - b; // since w = 0 initially

        // Precompute per-column squared norms: (1/n) * sum x_{ij}^2
        vector<double> col_norm_sq(p, 0.0);
        for (int j = 0; j < p; ++j) {
            double s2 = 0.0;
            for (int i = 0; i < n; ++i) s2 += X[i][j] * X[i][j];
            col_norm_sq[j] = s2 / n;
        }

        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            double max_change = 0.0;

            // Coordinate-wise updates
            for (int j = 0; j < p; ++j) {
                if (col_norm_sq[j] == 0.0) { w[j] = 0.0; continue; }
                // Add back old contribution of feature j to the residual
                for (int i = 0; i < n; ++i) r[i] += X[i][j] * w[j];

                // Partial residual correlation: (1/n) * sum x_{ij} * r_i
                double zj = 0.0;
                for (int i = 0; i < n; ++i) zj += X[i][j] * r[i];
                zj /= n;

                // Closed-form update with soft-thresholding
                double new_wj = soft_threshold(zj, lambda1) / (col_norm_sq[j] + lambda2);
                double delta = new_wj - w[j];
                w[j] = new_wj;

                // Subtract new contribution from the residual
                for (int i = 0; i < n; ++i) r[i] -= X[i][j] * w[j];

                max_change = max(max_change, fabs(delta));
            }

            // Optimal (unpenalized) intercept for squared loss: b = mean(y - Xw).
            // Since r = y - b - Xw, the correction is b += mean(r).
            double mean_r = accumulate(r.begin(), r.end(), 0.0) / n;
            b += mean_r;
            for (int i = 0; i < n; ++i) r[i] -= mean_r; // keep r consistent

            if (max_change < tol) break;
        }

        return {b, w};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic dense dataset: n samples, p features
    int n = 200, p = 12;
    std::mt19937 rng(42);
    std::normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // True sparse weights
    vector<double> w_true(p, 0.0);
    w_true[0] = 3.0; w_true[1] = -2.0; w_true[5] = 1.5; // sparse support
    double b_true = 1.0;

    // Create correlation between features 0 and 1 (mix a large fraction of 0 into 1)
    for (int i = 0; i < n; ++i) X[i][1] = 0.8 * X[i][0] + 0.2 * X[i][1];

    // Generate y with noise
    vector<double> y(n, b_true);
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) y[i] += X[i][j] * w_true[j];
    for (int i = 0; i < n; ++i) y[i] += 0.5 * nd(rng); // noise

    // Standardize features (important!)
    StandardScaler scaler;
    scaler.fit_transform(X);

    // Fit Elastic Net with moderate mixing
    double lambda = 0.2; // overall strength (tune via CV in practice)
    double alpha = 0.6;  // mixing: 0 -> ridge, 1 -> lasso
    double lambda1 = lambda * alpha;
    double lambda2 = lambda * (1.0 - alpha);

    ElasticNetCD model(lambda1, lambda2);
    auto [b_hat, w_hat] = model.fit(X, y);

    cout << fixed << setprecision(4);
    cout << "Intercept: " << b_hat << "\nWeights:\n";
    for (int j = 0; j < p; ++j) cout << "w[" << j << "] = " << w_hat[j] << "\n";

    return 0;
}

This program fits an Elastic Net linear regression model using cyclic coordinate descent. Features are standardized so the L1/L2 penalties treat coefficients comparably. The algorithm maintains the residual vector to update each coefficient in O(n) time via a soft-thresholded closed-form. The intercept is updated to the mean of residual-adjusted targets and is not penalized.

Time: O(n p T) for T epochs; O(nnz T) if X is sparse.
Space: O(n + p) additional memory beyond the input matrix; O(n p) to store X if dense.
Elastic Net Logistic Regression via Proximal Gradient (Full-Batch)
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

static inline double soft_threshold(double z, double tau) {
    if (z > tau) return z - tau;
    if (z < -tau) return z + tau;
    return 0.0;
}

struct StandardScaler {
    vector<double> mean, stdv;
    void fit_transform(vector<vector<double>>& X) {
        int n = (int)X.size(), p = (int)X[0].size();
        mean.assign(p, 0.0); stdv.assign(p, 0.0);
        for (int j = 0; j < p; ++j) {
            for (int i = 0; i < n; ++i) mean[j] += X[i][j];
            mean[j] /= n;
            for (int i = 0; i < n; ++i) stdv[j] += (X[i][j] - mean[j]) * (X[i][j] - mean[j]);
            stdv[j] = sqrt(stdv[j] / max(1, n - 1));
            if (stdv[j] == 0.0) stdv[j] = 1.0;
            for (int i = 0; i < n; ++i) X[i][j] = (X[i][j] - mean[j]) / stdv[j];
        }
    }
};

struct ElasticNetLogisticPGD {
    int max_iters = 2000;
    double tol = 1e-6;
    double lambda1; // L1
    double lambda2; // L2

    ElasticNetLogisticPGD(double l1, double l2) : lambda1(l1), lambda2(l2) {}

    pair<double, vector<double>> fit(const vector<vector<double>>& X, const vector<int>& y01) {
        int n = (int)X.size();
        int p = (int)X[0].size();
        vector<double> w(p, 0.0);
        double b = 0.0; // intercept (unpenalized)

        // Step size from a column-norm curvature estimate:
        // L ~ (1/(4n)) * max_j ||x_j||^2 + lambda2
        // (heuristic; the exact Lipschitz constant uses the spectral norm of X)
        vector<double> col_norm_sq(p, 0.0);
        for (int j = 0; j < p; ++j) {
            double s2 = 0.0;
            for (int i = 0; i < n; ++i) s2 += X[i][j] * X[i][j];
            col_norm_sq[j] = s2 / n;
        }
        double max_col = 0.0;
        for (double v : col_norm_sq) max_col = max(max_col, v);
        double L = 0.25 * max_col + lambda2 + 1e-12; // avoid divide-by-zero
        double eta = 1.0 / L;

        vector<double> grad_w(p), p_hat(n);
        for (int it = 0; it < max_iters; ++it) {
            // Predictions
            for (int i = 0; i < n; ++i) {
                double z = b;
                for (int j = 0; j < p; ++j) z += X[i][j] * w[j];
                p_hat[i] = sigmoid(z);
            }

            // Gradient of the data term: (1/n) X^T (p_hat - y)
            fill(grad_w.begin(), grad_w.end(), 0.0);
            double grad_b = 0.0;
            for (int i = 0; i < n; ++i) {
                double diff = p_hat[i] - (double)y01[i];
                grad_b += diff;
                for (int j = 0; j < p; ++j) grad_w[j] += X[i][j] * diff;
            }
            grad_b /= n;
            for (int j = 0; j < p; ++j) grad_w[j] = grad_w[j] / n + lambda2 * w[j];

            // Gradient step on the smooth part (data + L2)
            vector<double> w_temp(p);
            for (int j = 0; j < p; ++j) w_temp[j] = w[j] - eta * grad_w[j];
            double b_new = b - eta * grad_b; // intercept not penalized

            // Proximal step for L1
            double thresh = eta * lambda1;
            double max_change = 0.0;
            for (int j = 0; j < p; ++j) {
                double new_wj = soft_threshold(w_temp[j], thresh);
                max_change = max(max_change, fabs(new_wj - w[j]));
                w[j] = new_wj;
            }
            max_change = max(max_change, fabs(b_new - b));
            b = b_new;

            if (max_change < tol) break;
        }
        return {b, w};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 400, p = 10;
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);

    vector<vector<double>> X(n, vector<double>(p));
    for (int i = 0; i < n; ++i) for (int j = 0; j < p; ++j) X[i][j] = nd(rng);

    // True weights and intercept
    vector<double> w_true(p, 0.0);
    w_true[0] = 2.0; w_true[1] = -1.5; w_true[3] = 1.0; w_true[7] = 0.5;
    double b_true = -0.3;

    // Correlate features 0 and 1
    for (int i = 0; i < n; ++i) X[i][1] = 0.7 * X[i][0] + 0.3 * X[i][1];

    // Generate probabilities and labels
    vector<int> y(n);
    for (int i = 0; i < n; ++i) {
        double z = b_true;
        for (int j = 0; j < p; ++j) z += X[i][j] * w_true[j];
        std::bernoulli_distribution bd(sigmoid(z));
        y[i] = bd(rng);
    }

    // Standardize features
    StandardScaler scaler;
    scaler.fit_transform(X);

    double lambda = 0.15, alpha = 0.5;
    double lambda1 = lambda * alpha;
    double lambda2 = lambda * (1.0 - alpha);

    ElasticNetLogisticPGD clf(lambda1, lambda2);
    auto [b_hat, w_hat] = clf.fit(X, y);

    cout << fixed << setprecision(4);
    cout << "Intercept: " << b_hat << "\nWeights:\n";
    for (int j = 0; j < p; ++j) cout << "w[" << j << "] = " << w_hat[j] << "\n";

    // Quick accuracy check on the training data
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = b_hat;
        for (int j = 0; j < p; ++j) z += X[i][j] * w_hat[j];
        int pred = sigmoid(z) >= 0.5 ? 1 : 0;
        if (pred == y[i]) ++correct;
    }
    cout << "Training accuracy: " << (double)correct / n << "\n";
    return 0;
}

This program trains an Elastic Net logistic regression classifier using proximal gradient descent. A gradient step on the data term (plus \(\lambda_2 w\)) is followed by the L1 proximal step (soft-thresholding), which produces sparsity. The step size is set from a column-norm curvature bound to keep the iterations stable. The intercept is not penalized.

Time: O(n p T) for T proximal iterations; O(nnz T) for sparse data.
Space: O(n p) to store X if dense, plus O(p) for parameters and O(n) for working vectors.