
Convex Optimization Problems

Key Points

  • A convex optimization problem minimizes a convex function over a convex set, guaranteeing that every local minimum is a global minimum.
  • Convex problems are shaped like bowls; moving downhill with gradient-based methods reliably reaches the bottom.
  • Standard form uses convex inequality constraints and affine equality constraints, enabling powerful duality theory and KKT optimality conditions.
  • First-order (gradient or subgradient) methods scale to large problems and have provable convergence rates under mild smoothness assumptions.
  • Projected and proximal gradient methods handle constraints and nonsmooth terms like L1 norms via simple geometric/proximal operations.
  • Newton and quasi-Newton methods converge faster when second-order curvature information is affordable and the problem is well-conditioned.
  • Strong convexity and Lipschitz-smoothness constants determine step sizes and convergence speed in practice.
  • C++ implementations rely on efficient vector/matrix operations and careful line search or step-size selection for stability.

Prerequisites

  • Multivariable calculus (gradients and Hessians) — Understanding derivatives in multiple dimensions is essential for defining convexity, gradients, and Newton steps.
  • Linear algebra (vectors, matrices, norms) — Optimization over R^n relies on matrix-vector operations, eigenvalues, and norms used in complexity and convergence analysis.
  • Basic real analysis/inequalities — Properties like Jensen's inequality and continuity support convexity proofs and bounds.
  • Numerical methods and conditioning — Practical solvers require stable linear algebra and awareness of conditioning for convergence speed.
  • Probability and statistics (optional for ML) — Interpreting loss functions and regularization in machine learning contexts benefits from probabilistic modeling.
  • Algorithm design and analysis — Selecting algorithms and analyzing time/space complexity is key for scalable implementations.

Detailed Explanation


01 Overview

Convex optimization is the study and practice of minimizing a convex objective function over a convex set of feasible points. A set is convex if for any two points in the set, the straight line segment connecting them remains entirely within the set. A function is convex if its graph always lies below the straight line between any two points on the graph. This structure leads to a remarkable property: every local minimum is a global minimum. That means algorithms that take small, local improving steps—like gradient descent—are much more reliable than in non-convex settings. The general form includes convex inequality constraints and affine equality constraints, unifying a wide range of practical problems in machine learning (e.g., LASSO, logistic regression), engineering design, finance (portfolio optimization), and operations research (resource allocation). The theory provides tools like duality and Karush–Kuhn–Tucker (KKT) conditions to certify optimality. Practically, we choose methods based on problem size and smoothness: first-order methods for massive data, proximal methods for nonsmooth regularization, and Newton-type methods for smaller problems needing high accuracy. Good implementations hinge on numerically stable linear algebra, step-size strategies, and exploiting problem structure (sparsity, separability).

02 Intuition & Analogies

Picture a smooth, wide bowl on a table. Drop a marble anywhere in the bowl: if you let it roll downhill, it will eventually settle at the very bottom. That bowl shape is the essence of convexity. There aren’t hidden pockets or multiple valleys to trap your marble; no matter where you begin, the lowest point is the same. Now add fences around the bowl to restrict where the marble is allowed to roll—those fences are constraints. If the fences form a convex region (like a circular ring or a rectangle), the marble can still slide along the surface but will stop either at the bottom or somewhere on the fence where it can’t go any lower while staying feasible. The slope at each point (the gradient) tells you the steepest downhill direction; following that slope reduces height, just like descending in a landscape. If the bowl is steeper in some directions than others (ill-conditioned), your steps may zig-zag; preconditioning or second-order curvature (Newton’s method) helps you stride directly toward the bottom. When the surface has sharp edges (like adding an L1 penalty that creates a ridged floor), you can’t use plain gradients everywhere; instead, you use subgradients or a proximal step—think of it as a smart move that both goes downhill and snaps you to a simpler shape (like encouraging exact zeros). Projected gradient methods are like taking a downhill step and then nudging the marble back inside the fences by the shortest possible move. This geometric picture explains why convex optimization is both powerful and dependable: the landscape’s shape makes finding the bottom tractable.

03 Formal Definition

A set C ⊆ R^n is convex if for any x, y ∈ C and any θ ∈ [0,1], we have θx + (1−θ)y ∈ C. A function f: R^n → R is convex if for all x, y and θ ∈ [0,1], f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y). A standard convex optimization problem is: minimize f_0(x) subject to f_i(x) ≤ 0 for i = 1, …, m and Ax = b, where each f_i is convex and Ax = b is affine. If f_0 is differentiable and convex, the first-order condition gives f(y) ≥ f(x) + ∇f(x)⊤(y − x); for nonsmooth f this generalizes to the subgradient inequality. For constrained problems, the Lagrangian is L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + ν⊤(Ax − b), with λ ≥ 0. Under Slater's condition (strict feasibility of the inequalities), strong duality holds; optimal (x∗, λ∗, ν∗) satisfy the KKT conditions: primal feasibility, dual feasibility, stationarity, and complementary slackness. Smoothness and curvature are characterized by Lipschitz gradients (∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥) and strong convexity (f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (μ/2)∥y − x∥²). These constants L and μ govern algorithm step sizes and convergence rates.
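As a small worked illustration of the KKT conditions (my own example, not from the original text), consider minimizing x² subject to x ≥ 1:

```latex
\begin{aligned}
&\text{minimize } f_0(x) = x^2 \quad \text{subject to } f_1(x) = 1 - x \le 0,\\
&\mathcal{L}(x,\lambda) = x^2 + \lambda(1 - x), \qquad \lambda \ge 0.\\
&\text{Stationarity: } 2x^\star - \lambda^\star = 0.\\
&\text{Complementary slackness: } \lambda^\star (1 - x^\star) = 0.\\
&\lambda^\star = 0 \Rightarrow x^\star = 0 \text{ (infeasible)},\qquad
\text{so } x^\star = 1,\ \lambda^\star = 2.
\end{aligned}
```

The positive multiplier λ∗ = 2 signals that the constraint is active and "pushes back" against the unconstrained minimizer at the origin.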

04 When to Use

Use convex optimization when your objective is convex and constraints define a convex feasible region, or when you can reformulate the problem into a convex one. Common scenarios include: training linear models (ridge regression, LASSO, logistic regression), designing portfolios that balance risk and return (quadratic programs), fitting signals with sparsity or smoothness priors (L1/L2 regularization), resource allocation and scheduling with linear or convex costs (linear and second-order cone programs), and control or robotics problems with convex safety margins. If the dimension is large and data are dense, prefer first-order methods (gradient, accelerated gradient, ADMM, proximal gradient) due to low per-iteration cost and easy parallelization. If variables are few to moderate and high accuracy is needed, Newton or quasi-Newton methods converge faster. If constraints are simple sets (boxes, balls, simplices), projected gradient is efficient because projections are cheap. If your objective includes nonsmooth terms like |x|_{1} or indicator functions, use proximal methods that admit closed-form proximal operators (soft-thresholding, clipping). When integer or combinatorial variables are essential, the problem is not convex; consider relaxations to get bounds or approximate solutions, or switch to mixed-integer methods.

⚠️ Common Mistakes

• Assuming convexity without verification: Many losses are convex only after specific transformations (e.g., squared loss is convex; some robust losses are not). Always check that the Hessian is positive semidefinite or apply the definition directly.
• Ignoring conditioning: Poorly scaled data lead to slow zig-zagging in gradient descent. Center and scale features, or use preconditioning/second-order methods.
• Bad step-size selection: Fixed steps that are too large can diverge; too small, and progress stagnates. Use backtracking line search or steps tied to Lipschitz constants.
• Mishandling constraints: Forgetting to project can yield infeasible iterates; an incorrect projection can break convergence. For complicated sets, use proximal or splitting methods (ADMM) instead of ad hoc clipping.
• Misusing KKT: KKT conditions are sufficient for global optimality in convex problems, but not in non-convex ones; don't over-interpret multipliers outside convex settings.
• Overlooking nonsmoothness: Applying vanilla gradient descent to L1-regularized problems can fail; use subgradients or proximal operators like soft-thresholding.
• Numerical instability: Forming normal equations (A⊤A) can square the condition number; prefer matrix-vector products, iterative solvers, or QR/Cholesky when exact solves are required.
• Expecting convex methods to solve integer problems: Convex solvers provide relaxations or heuristics; exact integrality requires mixed-integer techniques.

Key Formulas

Convexity of a function

f(θx+(1−θ)y)≤θf(x)+(1−θ)f(y),θ∈[0,1]

Explanation: This inequality states that the function lies below the straight line between any two points. It is the defining property of convex functions.

Lipschitz gradient

∥∇f(x)−∇f(y)∥≤L∥x−y∥

Explanation: The gradient does not change faster than linearly with distance. L bounds curvature and guides step-size choices like 1/L.

First-order condition

f(y)≥f(x)+∇f(x)⊤(y−x)

Explanation: For differentiable convex f, the tangent plane is a global underestimator. This underpins descent methods and optimality checks.

Subgradient inequality

f(y)≥f(x)+g⊤(y−x),g∈∂f(x)

Explanation: Extends the first-order condition to nonsmooth convex functions using subgradients. It is key for L1 penalties and hinge losses.

Strong convexity

f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (μ/2)∥y − x∥²

Explanation: Adds a quadratic term that lower-bounds curvature. Strong convexity yields faster (linear) convergence for gradient methods.

Standard convex program

min_x f(x)   s.t.   f_i(x) ≤ 0, i = 1, …, m;   Ax = b

Explanation: A canonical form with convex inequalities and affine equalities. Many problems can be reformulated into this structure.

Lagrangian

L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + ν⊤(Ax − b)

Explanation: Combines objective and constraints with multipliers to form the dual problem. Dual variables have economic interpretations.

KKT conditions

Primal feasibility: f_i(x∗) ≤ 0, Ax∗ = b
Dual feasibility: λ∗ ≥ 0
Stationarity: 0 ∈ ∇f_0(x∗) + Σ_i λ_i∗ ∇f_i(x∗) + A⊤ν∗
Complementary slackness: λ_i∗ f_i(x∗) = 0

Explanation: Necessary and sufficient for optimality in convex problems under mild conditions. They provide certificates of optimal solutions.

Gradient descent update

x_{k+1} = x_k − α_k ∇f(x_k)

Explanation: Moves opposite the gradient by a step size. For convex smooth functions, proper step sizes guarantee monotone decrease.

Proximal gradient (ISTA)

x_{k+1} = prox_{αg}(x_k − α ∇h(x_k))

Explanation: Splits objective into smooth part h and nonsmooth part g, taking a gradient step on h and a proximal step on g. Ideal for L1 regularization.

Projected gradient

x_{k+1} = Π_C(x_k − α_k ∇f(x_k))

Explanation: Takes a gradient step then projects onto feasible set C. Works well when projection is simple (boxes, balls, simplices).

Newton step

x_{k+1} = x_k − [∇²f(x_k)]^{−1} ∇f(x_k)

Explanation: Uses second-order curvature to adjust the step. Quadratically convergent near the optimum when Hessian is positive definite.

Sublinear rate (GD)

f(x_k) − f(x∗) ≤ L∥x_0 − x∗∥² / (2k)

Explanation: For L-smooth convex f, gradient descent with constant step achieves O(1/k) decrease in suboptimality. This sets expectations for large-scale runs.

Linear rate (strongly convex)

∥x_k − x∗∥² ≤ (1 − μ/L)^k ∥x_0 − x∗∥²

Explanation: With strong convexity (μ) and smoothness (L), gradient descent converges linearly. The rate depends on the condition number L/μ.

Complexity Analysis

The computational complexity of convex optimization algorithms hinges on the per-iteration cost and the number of iterations required to reach a target accuracy. For first-order methods (gradient, projected gradient, proximal gradient), the dominant cost is evaluating gradients and, if needed, performing projections or proximal mappings. For dense least squares with data matrix A ∈ R^{m×n}, one gradient evaluation requires O(mn) time (one Ax product and one A⊤-vector product), with O(mn) space to store A (or less if streaming). Projections onto simple sets like boxes are O(n); proximal steps for ℓ1 norms are O(n) via soft-thresholding. These methods typically require O(1/ϵ) iterations to reach ϵ suboptimality for general smooth convex objectives, improving to O(log(1/ϵ)) when strong convexity is present and step sizes are chosen with curvature constants. Accelerated variants (not implemented here) reach O(1/√ϵ). Second-order methods (Newton) incur higher per-iteration costs due to solving linear systems involving the Hessian: forming and factoring an n × n Hessian is O(n³) for dense problems, with O(n²) memory. However, they can achieve quadratic convergence near the optimum, often reaching high precision in few iterations for moderate n. Line search (backtracking) usually adds a small constant factor by re-evaluating the objective/gradient a few times. In well-structured sparse problems, exploiting sparsity can reduce costs dramatically, e.g., O(nnz(A)) per gradient step. Overall, for large-scale machine learning with millions of samples, first-order or stochastic methods are preferred; for smaller, high-accuracy engineering designs, Newton or interior-point methods are competitive.

Code Examples

Gradient Descent with Backtracking for a Convex Quadratic f(x) = 0.5 x^T Q x + b^T x
#include <bits/stdc++.h>
using namespace std;

// Compute y = Qx for symmetric positive definite Q
vector<double> matvec(const vector<vector<double>>& Q, const vector<double>& x) {
    int n = (int)Q.size();
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j) sum += Q[i][j] * x[j];
        y[i] = sum;
    }
    return y;
}

double dot(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const vector<double>& x) {
    return sqrt(dot(x, x));
}

// f(x) = 0.5 x^T Q x + b^T x
struct Quadratic {
    vector<vector<double>> Q; // SPD matrix
    vector<double> b;

    double value(const vector<double>& x) const {
        vector<double> Qx = matvec(Q, x);
        return 0.5 * dot(x, Qx) + dot(b, x);
    }

    vector<double> grad(const vector<double>& x) const {
        // ∇f(x) = Qx + b
        vector<double> g = matvec(Q, x);
        for (size_t i = 0; i < g.size(); ++i) g[i] += b[i];
        return g;
    }
};

vector<double> gradient_descent_backtracking(const Quadratic& prob, vector<double> x0,
                                             double tol=1e-8, int max_iters=1000,
                                             double alpha0=1.0, double c=1e-4, double rho=0.5) {
    vector<double> x = x0;
    for (int k = 0; k < max_iters; ++k) {
        vector<double> g = prob.grad(x);
        double gnorm = norm2(g);
        if (gnorm < tol) {
            cout << "Converged in " << k << " iterations.\n";
            break;
        }
        // Descent direction is -g
        vector<double> d = g;
        for (double &v : d) v = -v;

        // Backtracking line search (Armijo)
        double t = alpha0;
        double fx = prob.value(x);
        double dg = dot(g, d); // should be negative
        while (true) {
            vector<double> x_new = x;
            for (size_t i = 0; i < x.size(); ++i) x_new[i] += t * d[i];
            double fx_new = prob.value(x_new);
            if (fx_new <= fx + c * t * dg) {
                x = std::move(x_new);
                break;
            }
            t *= rho;
            if (t < 1e-20) { // safeguard
                // Take a very small step to ensure progress
                for (size_t i = 0; i < x.size(); ++i) x[i] += 1e-20 * d[i];
                break;
            }
        }
        if (k % 20 == 0) {
            cout << "Iter " << k << ": f = " << prob.value(x) << ", ||g|| = " << gnorm << "\n";
        }
    }
    return x;
}

int main() {
    // Example SPD Q and vector b for n=3
    vector<vector<double>> Q = {
        {4.0, 1.0, 0.0},
        {1.0, 3.0, 0.5},
        {0.0, 0.5, 2.0}
    };
    vector<double> b = {-1.0, 2.0, -3.0};

    Quadratic prob{Q, b};
    vector<double> x0 = {0.0, 0.0, 0.0};
    vector<double> x_star = gradient_descent_backtracking(prob, x0);

    cout << fixed << setprecision(6);
    cout << "Solution: ";
    for (double xi : x_star) cout << xi << ' ';
    cout << "\nFinal f(x): " << prob.value(x_star) << "\n";
    return 0;
}

This program minimizes a strictly convex quadratic using gradient descent with Armijo backtracking line search. The gradient is Qx + b and the line search ensures sufficient decrease each step. Because the objective is convex with Lipschitz gradient, the method converges to the unique global minimizer.

Time: O(n²) per iteration for dense Q (matrix-vector product), with a small constant factor for backtracking evaluations.
Space: O(n²) to store Q and O(n) for vectors.
Proximal Gradient (ISTA) for LASSO: min 0.5||Ax − y||^2 + λ||x||_1
#include <bits/stdc++.h>
using namespace std;

struct DenseMatrix {
    int m, n; // m rows, n cols
    vector<vector<double>> A; // A[m][n]

    DenseMatrix(int m_, int n_): m(m_), n(n_), A(m_, vector<double>(n_, 0.0)) {}

    vector<double> matvec(const vector<double>& x) const {
        vector<double> y(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            y[i] = s;
        }
        return y;
    }
    vector<double> matTvec(const vector<double>& y) const {
        vector<double> x(n, 0.0);
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int i = 0; i < m; ++i) s += A[i][j] * y[i];
            x[j] = s;
        }
        return x;
    }
};

vector<double> soft_threshold(const vector<double>& v, double tau) {
    vector<double> x(v.size());
    for (size_t i = 0; i < v.size(); ++i) {
        double z = v[i];
        if (z > tau) x[i] = z - tau;
        else if (z < -tau) x[i] = z + tau;
        else x[i] = 0.0;
    }
    return x;
}

double objective(const DenseMatrix& A, const vector<double>& x, const vector<double>& y, double lambda) {
    vector<double> r = A.matvec(x);
    for (size_t i = 0; i < r.size(); ++i) r[i] -= y[i];
    double data = 0.0; for (double ri : r) data += 0.5 * ri * ri;
    double l1 = 0.0; for (double xi : x) l1 += fabs(xi);
    return data + lambda * l1;
}

// Power iteration to estimate L = largest eigenvalue of A^T A (i.e., ||A||_2^2)
double estimate_L(const DenseMatrix& A, int iters = 50) {
    int n = A.n;
    vector<double> x(n, 0.0);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (int i = 0; i < n; ++i) x[i] = dist(rng);
    auto normalize = [](vector<double>& v){
        double s = 0; for (double vi : v) s += vi * vi;
        s = sqrt(s);
        if (s > 0) for (double &vi : v) vi /= s;
    };
    normalize(x);
    double lam = 0.0;
    for (int k = 0; k < iters; ++k) {
        vector<double> y = A.matvec(x);
        vector<double> z = A.matTvec(y); // z = A^T A x
        lam = 0.0; for (size_t i = 0; i < z.size(); ++i) lam += x[i] * z[i];
        x = z; normalize(x);
    }
    return max(lam, 1e-12); // avoid zero
}

vector<double> ista_lasso(const DenseMatrix& A, const vector<double>& y, double lambda,
                          int max_iters=1000, double tol=1e-6) {
    int n = A.n;
    vector<double> x(n, 0.0), x_old(n, 0.0);
    // Lipschitz constant for gradient of 0.5||Ax - y||^2 is L = ||A||_2^2
    double L = estimate_L(A, 80);
    double t = 1.0 / L;
    for (int k = 0; k < max_iters; ++k) {
        x_old = x;
        // Gradient: A^T (A x - y)
        vector<double> Ax = A.matvec(x);
        for (size_t i = 0; i < Ax.size(); ++i) Ax[i] -= y[i];
        vector<double> g = A.matTvec(Ax);
        // Gradient step then proximal step (soft-thresholding)
        vector<double> v(n);
        for (int i = 0; i < n; ++i) v[i] = x[i] - t * g[i];
        x = soft_threshold(v, lambda * t);

        // Check convergence by relative change
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; ++i) { double d = x[i] - x_old[i]; num += d * d; den += x_old[i] * x_old[i]; }
        double rel = sqrt(num) / (sqrt(den) + 1e-12);
        if (k % 50 == 0) {
            cout << "Iter " << k << ": obj = " << objective(A, x, y, lambda) << ", rel_change = " << rel << "\n";
        }
        if (rel < tol) {
            cout << "Converged in " << k << " iterations.\n";
            break;
        }
    }
    return x;
}

int main(){
    // Create a small synthetic LASSO instance (m = 50 samples, n = 20 features)
    int m = 50, n = 20;
    DenseMatrix A(m, n);
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            A.A[i][j] = nd(rng) / sqrt((double)m); // scale for conditioning

    vector<double> x_true(n, 0.0);
    // Make a sparse ground truth with 5 nonzeros
    for (int j = 0; j < 5; ++j) x_true[j] = (j % 2 ? -1.5 : 2.0);

    vector<double> y = A.matvec(x_true);
    // Add small noise
    for (double &yi : y) yi += 0.05 * nd(rng);

    double lambda = 0.1;
    vector<double> x_est = ista_lasso(A, y, lambda, 2000, 1e-6);

    cout << fixed << setprecision(6);
    cout << "Recovered x (first 10 entries): ";
    for (int i = 0; i < min(10, (int)x_est.size()); ++i) cout << x_est[i] << ' ';
    cout << "\nObjective: " << objective(A, x_est, y, lambda) << "\n";
    return 0;
}

This implementation solves the LASSO problem with ISTA: a gradient step on the quadratic loss followed by a soft-thresholding proximal step for the L1 penalty. The step size is chosen as 1/L where L approximates the largest eigenvalue of A^T A via power iteration. The method converges because the quadratic loss is smooth and convex, and the L1 norm is convex with a simple proximal operator.

Time: O(mn) per iteration for dense A (one Ax and one A⊤-vector product), plus O(n) for soft-thresholding; power iteration to estimate L costs O(mn·T) once for T iterations.
Space: O(mn) to store A and O(m + n) for vectors.
Projected Gradient Descent for Box-Constrained Least Squares: min 0.5||Ax − y||^2 s.t. 0 ≤ x ≤ 1
#include <bits/stdc++.h>
using namespace std;

struct DenseMat {
    int m, n;
    vector<vector<double>> A;
    DenseMat(int m_, int n_): m(m_), n(n_), A(m_, vector<double>(n_, 0.0)) {}
    vector<double> matvec(const vector<double>& x) const {
        vector<double> y(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i][j] * x[j];
            y[i] = s;
        }
        return y;
    }
    vector<double> matTvec(const vector<double>& y) const {
        vector<double> x(n, 0.0);
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int i = 0; i < m; ++i) s += A[i][j] * y[i];
            x[j] = s;
        }
        return x;
    }
};

void project_box(vector<double>& x, double l=0.0, double u=1.0) {
    for (double &xi : x) {
        if (xi < l) xi = l; else if (xi > u) xi = u;
    }
}

double obj_ls(const DenseMat& A, const vector<double>& x, const vector<double>& y) {
    vector<double> r = A.matvec(x);
    for (size_t i = 0; i < r.size(); ++i) r[i] -= y[i];
    double s = 0.0; for (double ri : r) s += 0.5 * ri * ri;
    return s;
}

vector<double> projected_gradient_ls_box(const DenseMat& A, const vector<double>& y, int max_iters=2000,
                                         double tol=1e-6, double alpha0=1.0, double c=1e-4, double rho=0.5) {
    int n = A.n;
    vector<double> x(n, 0.5); // start at center of box
    for (int k = 0; k < max_iters; ++k) {
        // Gradient: g = A^T (A x - y)
        vector<double> Ax = A.matvec(x);
        for (size_t i = 0; i < Ax.size(); ++i) Ax[i] -= y[i];
        vector<double> g = A.matTvec(Ax);

        double gnorm = 0.0; for (double gi : g) gnorm += gi * gi;
        gnorm = sqrt(gnorm);
        if (gnorm < tol) { cout << "Converged in " << k << " iterations.\n"; break; }

        // Backtracking on projected step
        double t = alpha0;
        double fx = obj_ls(A, x, y);
        while (true) {
            vector<double> x_trial(n);
            for (int i = 0; i < n; ++i) x_trial[i] = x[i] - t * g[i];
            project_box(x_trial, 0.0, 1.0);
            double fx_new = obj_ls(A, x_trial, y);

            // Generalized Armijo condition with the projected point:
            // d = x_trial - x approximates the feasible descent direction
            vector<double> d(n);
            for (int i = 0; i < n; ++i) d[i] = x_trial[i] - x[i];
            double dg = 0.0; for (int i = 0; i < n; ++i) dg += g[i] * d[i];

            if (fx_new <= fx + c * dg) { x = std::move(x_trial); break; }
            t *= rho;
            if (t < 1e-20) { x = std::move(x_trial); break; }
        }
        if (k % 50 == 0) {
            cout << "Iter " << k << ": f = " << obj_ls(A, x, y) << ", ||g|| = " << gnorm << "\n";
        }
    }
    return x;
}

int main(){
    int m = 80, n = 30;
    DenseMat A(m, n);
    std::mt19937 rng(7);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            A.A[i][j] = nd(rng) / sqrt((double)m);

    vector<double> x_true(n, 0.0);
    for (int j = 0; j < n; ++j) x_true[j] = min(1.0, max(0.0, 0.6 + 0.2 * nd(rng)));
    vector<double> y = A.matvec(x_true);
    for (double &yi : y) yi += 0.02 * nd(rng);

    vector<double> x = projected_gradient_ls_box(A, y, 1500, 1e-6);
    cout << fixed << setprecision(6);
    cout << "First 10 variables (should be in [0,1]): ";
    for (int i = 0; i < min(10, (int)x.size()); ++i) cout << x[i] << ' ';
    cout << "\nObjective: " << obj_ls(A, x, y) << "\n";
    return 0;
}

This example minimizes a convex least-squares objective subject to simple box constraints using projected gradient descent. Each step takes a gradient move and then projects back onto [0,1]^n by clipping, followed by a backtracking rule to guarantee sufficient decrease. The projection is cheap and preserves feasibility at every iteration.

Time: O(mn) per iteration for dense matrices, plus O(n) for projection.
Space: O(mn) to store A and O(m + n) for vectors.
Tags: convex optimization, gradient descent, projected gradient, proximal gradient, lasso, kkt conditions, duality, line search, strong convexity, lipschitz gradient, newton method, first-order methods, box constraints, soft-thresholding, least squares