ADMM (Alternating Direction Method of Multipliers)
Key Points
- ADMM splits a hard optimization problem into two easier subproblems that communicate through simple averaging-like steps.
- It alternates minimizing with respect to each block of variables and then updates a running "penalty-adjusted" dual variable.
- The method is especially effective for large, structured, and distributed convex problems, such as L1-regularized learning (LASSO) and consensus averaging.
- ADMM's core engine is the augmented Lagrangian, which stabilizes constraints with a quadratic penalty controlled by a parameter \(\rho\).
- Stopping is based on primal and dual residuals that measure constraint satisfaction and update stability.
- Per-iteration cost is usually dominated by linear algebra in each subproblem; warm starts and precomputed factorizations make ADMM fast in practice.
- ADMM is robust and easy to implement with proximal operators (like soft-thresholding) but typically converges sublinearly (often \(O(1/k)\)).
- Choosing \(\rho\) and good scaling strongly affects speed; heuristic adjustments of \(\rho\) during the iterations can help convergence.
Prerequisites
- Convex optimization basics — ADMM assumes convexity for convergence guarantees and uses concepts like convex functions and optimality.
- Linear algebra (matrix factorizations) — Efficient x-updates rely on solving linear systems using Cholesky/QR and understanding \(A^\top A\).
- Lagrangian duality — The augmented Lagrangian and dual variables are central to ADMM's design and convergence.
- Proximal operators — Many ADMM subproblems reduce to evaluating proximal mappings like soft-thresholding.
- Numerical stability and conditioning — Choosing formulations and factorizations that are stable is key for performance and accuracy.
Detailed Explanation
Overview
The Alternating Direction Method of Multipliers (ADMM) is a powerful algorithm for solving optimization problems that can be decomposed into parts. Think of problems where an objective naturally splits into two terms, each depending on a different block of variables, tied together by linear constraints. ADMM leverages this structure by alternating between optimizing each block while keeping the other fixed, and then enforcing agreement through a dual update. This combines the decomposability of dual ascent with the stability of the augmented Lagrangian. In practice, ADMM shines in large-scale convex optimization, especially when the cost function involves non-smooth penalties like the L1 norm. Examples include LASSO for sparse regression, total variation denoising, and distributed consensus problems. Because each subproblem can often be solved with simple operations (like soft-thresholding) or fast linear algebra (like solving a system with a precomputed factorization), ADMM can handle very large datasets efficiently and can be distributed across machines. While its theoretical convergence rate is typically sublinear, its robustness, simplicity, and ability to exploit problem structure make it a go-to method in modern optimization and machine learning.
Intuition & Analogies
Imagine two teams trying to write a joint report. Team X writes the content (x), Team Z designs the layout (z). They must agree on a final version (the constraint ties x and z). Working together on every sentence is slow, so they adopt a routine: Team X drafts content given the current layout, Team Z refines the layout given the new draft, and a coordinator checks how different their versions are, adding a “penalty note” if they disagree. Over time, the fear of accumulating penalties pushes both teams toward agreement while each focuses on their specialty. That is ADMM. More concretely, ADMM introduces a soft “rubber band” between the two variable blocks: if x and z drift apart (violating the constraint), the rubber band (quadratic penalty) pulls them back. The dual variable u is like an accountant tracking the disagreement; when disagreement persists, u tightens the rubber band by increasing the effective cost of misalignment. Because each team only solves their own subproblem (often easy), the overall method scales well. In distributed settings, think of multiple sensors estimating a common quantity: each computes a local update using its data, then everyone averages their answers and adjusts their internal bias (u). Repeat this, and the network reaches consensus without any single node solving the entire problem. The alternating, penalty-aided negotiation is what makes ADMM both stable and modular.
Formal Definition
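ADMM targets problems that split into two blocks of variables coupled by a linear constraint:

\[ \min_{x,\,z}\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c, \]

where \(f\) and \(g\) are convex (possibly non-smooth), \(x \in \mathbb{R}^n\), \(z \in \mathbb{R}^m\), \(A \in \mathbb{R}^{p \times n}\), \(B \in \mathbb{R}^{p \times m}\), and \(c \in \mathbb{R}^p\). The method works with the augmented Lagrangian

\[ L_\rho(x, z, y) = f(x) + g(z) + y^\top (Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2, \]

alternating exact minimization over \(x\) (with \(z\) fixed), then over \(z\) (with the new \(x\) fixed), followed by a gradient-ascent step on the dual variable \(y\) with step size \(\rho\). With the scaled dual variable \(u = y/\rho\), the iteration reads

\[ x^{k+1} = \arg\min_x\; f(x) + \tfrac{\rho}{2}\,\|Ax + Bz^k - c + u^k\|_2^2, \]
\[ z^{k+1} = \arg\min_z\; g(z) + \tfrac{\rho}{2}\,\|Ax^{k+1} + Bz - c + u^k\|_2^2, \]
\[ u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c. \]

Under convexity and standard assumptions (existence of a saddle point of the Lagrangian), the iterates satisfy objective convergence \(f(x^k) + g(z^k) \to p^\star\) and residual convergence \(Ax^k + Bz^k - c \to 0\).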
When to Use
Use ADMM when your problem naturally splits into two (or more) parts connected by simple linear constraints, and each part is easy to optimize on its own. Classic scenarios include: (1) Sparse learning, like LASSO, where the smooth least-squares term couples with a non-smooth L1 penalty; (2) Total variation and fused lasso problems where proximal operators are simple; (3) Distributed or federated optimization where data is partitioned across machines—ADMM allows local updates and lightweight global coordination; (4) Consensus and sharing problems where many local variables must agree on a common value; (5) Large-scale problems where precomputations (like a Cholesky factor) can be reused across many iterations. ADMM is less ideal when subproblems are themselves expensive or ill-conditioned without a good preconditioner, when ultra-fast high-accuracy is needed (second-order interior-point methods may win there), or when the problem does not decompose in a way that yields simple proximal steps. It is a strong default for convex, structured, and separable objectives where modest accuracy suffices and scalability matters.
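As a concrete instance of scenario (1), LASSO fits the canonical form by introducing a copy \(z\) of the coefficient vector:

\[ \min_{x,\,z}\; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|z\|_1 \quad \text{subject to} \quad x - z = 0. \]

The x-subproblem becomes a ridge-like linear solve and the z-subproblem becomes elementwise soft-thresholding, so both steps are cheap even at large scale.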
⚠️ Common Mistakes
- Ignoring scaling and normalization: Poorly scaled A, variables, or penalties can slow convergence dramatically. Normalize features and tune \(\rho\) (or adapt it during iterations).
- Not precomputing reusable factorizations: Re-solving the same linear system from scratch each iteration wastes time. Factorize \(A^\top A + \rho I\) once and reuse triangular solves.
- Loose or incorrect stopping criteria: Failing to monitor both primal and dual residuals can cause premature stopping or endless looping. Use standard absolute/relative tolerances that scale with problem size.
- Using too small or too large \(\rho\): Too small makes constraints weakly enforced (slow primal convergence); too large causes tiny steps (slow dual convergence). Implement heuristic \(\rho\) adjustment based on residual magnitudes.
- Forgetting warm starts: ADMM benefits from warm-starting x, z, and u, especially when solving a sequence of related problems (e.g., along a regularization path).
- Overlooking numerical stability: For ill-conditioned matrices, prefer numerically stable factorizations (Cholesky for SPD systems, QR otherwise) and avoid forming \(A^\top A\) explicitly when it worsens conditioning (use QR on A or an iterative method like CG instead).
Key Formulas
ADMM canonical form
\[ \min_{x,\,z}\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c \]
Explanation: This is the standard problem structure ADMM targets: a sum of two convex functions with a linear constraint tying the variables together. Many practical problems can be written this way.
Augmented Lagrangian
\[ L_\rho(x, z, y) = f(x) + g(z) + y^\top(Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^2 \]
Explanation: Adds a quadratic penalty to the classical Lagrangian to stabilize dual ascent. The parameter \(\rho > 0\) controls how strongly constraint violations are penalized.
Scaled ADMM updates
\[ x^{k+1} = \arg\min_x\; f(x) + \tfrac{\rho}{2}\,\|Ax + Bz^k - c + u^k\|_2^2 \]
\[ z^{k+1} = \arg\min_z\; g(z) + \tfrac{\rho}{2}\,\|Ax^{k+1} + Bz - c + u^k\|_2^2 \]
\[ u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c \]
Explanation: These are the core iteration steps using the scaled dual variable \(u = y/\rho\). Each step solves a simpler subproblem focusing on one variable block while the other is fixed.
Residuals (general form)
\[ r^{k+1} = Ax^{k+1} + Bz^{k+1} - c, \qquad s^{k+1} = \rho\, A^\top B\,(z^{k+1} - z^k) \]
Explanation: The primal residual r measures constraint violation; the dual residual s measures dual feasibility progress. They are used to decide when to stop.
Residuals (x = z splitting)
\[ r^{k+1} = x^{k+1} - z^{k+1}, \qquad s^{k+1} = -\rho\,(z^{k+1} - z^k) \]
Explanation: For problems written with A = I, B = -I, c = 0, the residuals simplify; only the norm of \(s\) matters, so the sign is often dropped. This is the common form used in LASSO and many proximal splitting problems.
Soft-thresholding
\[ S_\kappa(v)_i = \operatorname{sign}(v_i)\,\max(|v_i| - \kappa,\ 0) \]
Explanation: The proximal operator for the L1 norm. Apply it elementwise with threshold \(\kappa = \lambda/\rho\) in LASSO-like problems.
Stopping tolerances
\[ \epsilon^{\mathrm{pri}} = \sqrt{p}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \max\{\|Ax^k\|_2,\ \|Bz^k\|_2,\ \|c\|_2\} \]
\[ \epsilon^{\mathrm{dual}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\,\|A^\top y^k\|_2 \]
Explanation: Absolute and relative tolerances scale with the dimensions (\(p\) constraints, \(n\) variables) to make stopping criteria robust. Stop when the residual norms are below these thresholds.
LASSO x-update normal equations
\[ x^{k+1} = (A^\top A + \rho I)^{-1}\big(A^\top b + \rho\,(z^k - u^k)\big) \]
Explanation: For the LASSO splitting with x = z, the x-subproblem reduces to solving a linear system. Precomputing a factorization of \(A^\top A + \rho I\) makes each iteration cheap.
Consensus z-update
\[ z^{k+1} = \frac{1}{N}\sum_{i=1}^{N}\big(x_i^{k+1} + u_i^{k}\big) \]
Explanation: In consensus ADMM, the global variable is the average of local primal-plus-dual terms. This step communicates minimal information across workers.
Ergodic convergence rate
\[ f(\bar{x}^k) + g(\bar{z}^k) - p^\star = O(1/k), \qquad \bar{x}^k = \frac{1}{k}\sum_{t=1}^{k} x^{t} \]
Explanation: Under convexity, averages of iterates converge at a rate proportional to 1/k. This describes the typical sublinear convergence of first-order splitting methods.
Complexity Analysis
Per iteration, cost is dominated by the subproblem linear algebra. In the LASSO example below, forming \(A^\top A\) costs \(O(mn^2)\) and its Cholesky factorization \(O(n^3)\), but both are one-time precomputations; each iteration then needs only two triangular solves (\(O(n^2)\)) plus \(O(n)\) vector work for the z- and u-updates. In consensus ADMM, each worker's update is \(O(d)\) and the global averaging step is \(O(Nd)\) per round. With a typical \(O(1/k)\) convergence rate, reaching accuracy \(\epsilon\) takes on the order of \(1/\epsilon\) iterations, which is why warm starts, good scaling, and \(\rho\) tuning matter in practice.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Utility: 2-norm of a vector
static double norm2(const vector<double>& v){
    double s=0; for(double x: v) s += x*x; return sqrt(s);
}

// Matrix-vector multiply: y = A x, A is m x n
static vector<double> matvec(const vector<vector<double>>& A, const vector<double>& x){
    int m = (int)A.size(), n = (int)A[0].size();
    vector<double> y(m,0.0);
    for(int i=0;i<m;++i){
        double s=0.0; for(int j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

// Compute AtA (n x n) and Atb (n)
static void AtA_Atb(const vector<vector<double>>& A, const vector<double>& b,
                    vector<vector<double>>& AtA, vector<double>& Atb){
    int m = (int)A.size(), n = (int)A[0].size();
    AtA.assign(n, vector<double>(n, 0.0));
    Atb.assign(n, 0.0);
    for(int i=0;i<m;++i){
        for(int j=0;j<n;++j){
            Atb[j] += A[i][j]*b[i];
            for(int k=0;k<n;++k){
                AtA[j][k] += A[i][j]*A[i][k];
            }
        }
    }
}

// Cholesky decomposition for SPD matrix M (n x n): M = L L^T
static bool cholesky(vector<vector<double>> M, vector<vector<double>>& L){
    int n = (int)M.size();
    L.assign(n, vector<double>(n, 0.0));
    for(int i=0;i<n;++i){
        for(int j=0;j<=i;++j){
            double s = M[i][j];
            for(int k=0;k<j;++k) s -= L[i][k]*L[j][k];
            if(i==j){
                if(s <= 0.0) return false;
                L[i][i] = sqrt(s);
            }else{
                L[i][j] = s / L[j][j];
            }
        }
    }
    return true;
}

// Solve (L L^T) x = b
static vector<double> chol_solve(const vector<vector<double>>& L, const vector<double>& b){
    int n = (int)L.size();
    vector<double> y(n,0.0), x(n,0.0);
    // Forward solve: L y = b
    for(int i=0;i<n;++i){
        double s = b[i];
        for(int k=0;k<i;++k) s -= L[i][k]*y[k];
        y[i] = s / L[i][i];
    }
    // Backward solve: L^T x = y
    for(int i=n-1;i>=0;--i){
        double s = y[i];
        for(int k=i+1;k<n;++k) s -= L[k][i]*x[k];
        x[i] = s / L[i][i];
    }
    return x;
}

// Soft-thresholding (prox of L1): z_i = sign(v_i) * max(|v_i| - kappa, 0)
static void soft_threshold(vector<double>& v, double kappa){
    for(double &x : v){
        double a = fabs(x) - kappa;
        x = (a > 0) ? copysign(a, x) : 0.0;
    }
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example data: small LASSO instance (m samples, n features)
    int m = 6, n = 4;
    vector<vector<double>> A = {
        { 0.50, -0.10,  0.30,  0.00},
        { 0.20,  0.40, -0.20,  0.10},
        { 0.00,  0.10,  0.50, -0.30},
        { 0.40, -0.30,  0.10,  0.20},
        {-0.10,  0.20,  0.00,  0.50},
        { 0.30,  0.10, -0.40,  0.10}
    };
    vector<double> b = { 0.9, 0.1, 0.4, 0.7, -0.2, 0.3 };

    // ADMM parameters
    double lambda = 0.2;   // L1 regularization strength
    double rho = 1.0;      // augmented Lagrangian penalty
    double abs_tol = 1e-4, rel_tol = 1e-3;
    int max_iters = 200;

    // Precompute normal equations matrix (A^T A + rho I) and A^T b
    vector<vector<double>> AtA; vector<double> Atb;
    AtA_Atb(A, b, AtA, Atb);
    for(int i=0;i<n;++i) AtA[i][i] += rho;   // add rho I

    // Cholesky factor of (A^T A + rho I), reused every iteration
    vector<vector<double>> L;
    if(!cholesky(AtA, L)){
        cerr << "Matrix not SPD; try increasing rho or using QR/CG.\n";
        return 0;
    }

    // Initialize x, z, u (warm start with zeros)
    vector<double> x(n,0.0), z(n,0.0), u(n,0.0), z_old(n,0.0);

    cout << fixed << setprecision(6);
    for(int k=0;k<max_iters;++k){
        // x-update: solve (A^T A + rho I) x = A^T b + rho (z - u)
        vector<double> rhs(n,0.0);
        for(int i=0;i<n;++i) rhs[i] = Atb[i] + rho*(z[i] - u[i]);
        x = chol_solve(L, rhs);

        // z-update: soft-thresholding of x + u with kappa = lambda/rho
        z_old = z;
        z = x;
        for(int i=0;i<n;++i) z[i] += u[i];
        soft_threshold(z, lambda / rho);

        // u-update: u = u + x - z
        for(int i=0;i<n;++i) u[i] += x[i] - z[i];

        // Residuals for stopping
        // Primal: r = x - z; Dual: s = rho * (z - z_old)
        vector<double> r(n,0.0), s(n,0.0);
        for(int i=0;i<n;++i){ r[i] = x[i] - z[i]; s[i] = rho*(z[i] - z_old[i]); }
        double r_norm = norm2(r);
        double s_norm = norm2(s);

        // Tolerances
        double x_norm = norm2(x), z_norm = norm2(z), u_norm = norm2(u);
        double eps_pri  = sqrt((double)n)*abs_tol + rel_tol*max(x_norm, z_norm);
        double eps_dual = sqrt((double)n)*abs_tol + rel_tol*u_norm*rho;   // ||A^T y|| with A = I, y = rho*u

        if(k % 10 == 0){
            cout << "iter="      << setw(3)  << k
                 << " r_norm="   << setw(10) << r_norm
                 << " s_norm="   << setw(10) << s_norm
                 << " eps_pri="  << setw(10) << eps_pri
                 << " eps_dual=" << setw(10) << eps_dual << "\n";
        }
        if(r_norm <= eps_pri && s_norm <= eps_dual){
            cout << "Converged at iter " << k << "\n";
            break;
        }
    }

    // Output solution
    cout << "x (sparse): ";
    for(double xi: x) cout << xi << ' ';
    cout << "\n";
    cout << "z (prox result): ";
    for(double zi: z) cout << zi << ' ';
    cout << "\n";

    return 0;
}
```
Solves the LASSO problem by splitting x and z with the constraint x = z. The x-update solves a linear system with a precomputed Cholesky factor of (A^T A + ρ I); the z-update applies elementwise soft-thresholding (the proximal operator of the L1 norm). The dual variable u accumulates constraint disagreement. Stopping uses primal and dual residual norms relative to absolute/relative tolerances. Precomputing the factorization makes each iteration fast and numerically stable for moderately sized n.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute 2-norm
static double norm2(const vector<double>& v){ double s=0; for(double x: v) s+=x*x; return sqrt(s); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Suppose N workers each hold a d-dimensional observation a_i.
    int N = 5, d = 3;
    vector<vector<double>> a = {
        { 1.0,  2.0,  0.0},
        { 0.5,  1.5, -1.0},
        {-0.5,  2.5,  1.0},
        { 2.0, -1.0,  0.5},
        { 1.5,  0.0, -0.5}
    };

    // ADMM variables per worker: x_i, u_i; and global z
    vector<vector<double>> x(N, vector<double>(d, 0.0));
    vector<vector<double>> u(N, vector<double>(d, 0.0));
    vector<double> z(d, 0.0), z_prev(d, 0.0);

    double rho = 1.0;   // penalty parameter
    double abs_tol = 1e-5, rel_tol = 1e-3;
    int max_iters = 200;

    for(int k=0;k<max_iters;++k){
        // x-update (local, closed-form): x_i = (a_i + rho z - u_i) / (1 + rho)
        for(int i=0;i<N;++i)
            for(int j=0;j<d;++j)
                x[i][j] = (a[i][j] + rho*z[j] - u[i][j]) / (1.0 + rho);

        // z-update (global averaging): z = (1/N) * sum_i (x_i + u_i)
        z_prev = z;
        vector<double> sum(d, 0.0);
        for(int i=0;i<N;++i)
            for(int j=0;j<d;++j) sum[j] += x[i][j] + u[i][j];
        for(int j=0;j<d;++j) z[j] = sum[j] / (double)N;

        // u-update: u_i = u_i + x_i - z
        for(int i=0;i<N;++i)
            for(int j=0;j<d;++j) u[i][j] += x[i][j] - z[j];

        // Residuals: primal stacks r_i = x_i - z; dual is rho*(z - z_prev) per worker
        double r_sq = 0.0, x_sq = 0.0, u_sq = 0.0;
        for(int i=0;i<N;++i){
            for(int j=0;j<d;++j){
                double ri = x[i][j] - z[j];
                r_sq += ri*ri; x_sq += x[i][j]*x[i][j]; u_sq += u[i][j]*u[i][j];
            }
        }
        vector<double> s_vec(d, 0.0);
        for(int j=0;j<d;++j) s_vec[j] = rho * (z[j] - z_prev[j]);
        double r_norm = sqrt(r_sq);
        double s_norm = sqrt((double)N) * norm2(s_vec);   // stacked over N workers

        // Tolerances: the stacked constraint compares each x_i to a copy of z
        double eps_pri  = sqrt((double)(N*d))*abs_tol
                        + rel_tol*max(sqrt(x_sq), sqrt((double)N)*norm2(z));
        double eps_dual = sqrt((double)(N*d))*abs_tol + rel_tol*rho*sqrt(u_sq);

        if(k % 10 == 0){
            cout << "iter=" << setw(3) << k
                 << " r_norm=" << r_norm
                 << " s_norm=" << s_norm << "\n";
        }
        if(r_norm <= eps_pri && s_norm <= eps_dual){
            cout << "Converged at iter " << k << "\n";
            break;
        }
    }

    // Report result and compare to direct average of a_i
    vector<double> avg(d, 0.0);
    for(int i=0;i<N;++i) for(int j=0;j<d;++j) avg[j] += a[i][j];
    for(int j=0;j<d;++j) avg[j] /= (double)N;

    cout << fixed << setprecision(6);
    cout << "Consensus z: "; for(double v: z) cout << v << ' '; cout << "\n";
    cout << "Direct avg : "; for(double v: avg) cout << v << ' '; cout << "\n";

    return 0;
}
```
Solves a consensus problem where each worker i wants x_i close to its local vector a_i, with the constraint that all x_i equal a global consensus z. ADMM yields simple closed-form local updates, a global averaging step, and dual accumulation. The final z matches the direct average of the local data, illustrating distributed averaging with lightweight communication.