Iterative Methods for Linear Systems
Key Points
- The Conjugate Gradient (CG) method solves large, sparse, symmetric positive definite (SPD) linear systems Ax = b using only matrix–vector products and dot products.
- CG follows the gradient of a quadratic energy but uses smarter, A-conjugate directions to reach the solution in at most n steps in exact arithmetic.
- Its convergence rate depends on the condition number \(\kappa(A)\); preconditioning can dramatically speed it up by reducing \(\kappa\).
- Each iteration is cheap for sparse matrices: one SpMV (sparse matrix–vector multiply), a few vector updates, and two dot products.
- CG is matrix-free friendly: you only need a routine to compute y = A x, not the matrix entries themselves.
- Use Preconditioned CG (PCG) with simple preconditioners like Jacobi or incomplete Cholesky for tough problems.
- Stop based on the relative residual \(\|r_k\|_2/\|b\|_2\) and guard against breakdowns (e.g., division by zero) in finite precision.
- Do not use CG on non-symmetric or indefinite systems; choose MINRES (symmetric indefinite) or GMRES (general) instead.
Prerequisites
- Vector and matrix operations — CG uses dot products, norms, and matrix–vector products at every iteration.
- Symmetry and positive definiteness — Correctness and convergence of CG require A to be SPD.
- Quadratic optimization — CG minimizes a quadratic energy; understanding gradients explains the updates.
- Eigenvalues and condition number — They determine convergence rates and motivate preconditioning.
- Sparse matrix storage (CSR/CSC) — Efficient SpMV is crucial for large-scale performance.
- Floating-point arithmetic and stability — Finite precision affects conjugacy and stopping criteria.
- C++ STL vectors and algorithms — Implementations rely on vector containers, inner_product, and lambdas.
- Boundary-value problems and discretization (optional) — Common SPD systems arise from PDE discretizations like the Poisson equation.
Detailed Explanation
01 Overview
The Conjugate Gradient (CG) method is an iterative algorithm for solving linear systems Ax = b where A is symmetric positive definite (SPD). Instead of directly factoring A (as in Gaussian elimination or Cholesky), CG builds a sequence of search directions that are mutually A-conjugate (a generalized form of orthogonality). Along each direction, it performs an exact line search on a quadratic energy function and updates the solution. For large, sparse systems, CG is attractive because each iteration needs only: one matrix–vector multiplication, a few vector operations, and a handful of dot products. Memory use is light—only a small, fixed number of vectors are stored. In exact arithmetic, CG would converge in at most n iterations for an n×n system, but in practice, rounding errors and conditioning determine how many steps are needed. The convergence rate is governed by the condition number of A; well-conditioned problems converge fast, while ill-conditioned ones may stall. Preconditioning transforms the system into one with the same solution but with improved conditioning. CG is widely used in scientific computing, computer graphics, machine learning (e.g., kernel methods), and anywhere large SPD systems arise, especially from discretized partial differential equations. It also works in matrix-free settings where A is not explicitly formed but applied as an operator.
02 Intuition & Analogies
Imagine you are hiking to the lowest point of a bowl-shaped valley. Plain gradient descent walks straight downhill, but if the bowl is squashed (elongated like a valley), you zig-zag and progress is slow. CG is like a hiker who remembers where they’ve walked and picks smarter directions that avoid undoing previous progress. The valley here is the quadratic energy f(x) = 1/2 x^T A x − b^T x whose minimum is the solution to Ax = b. The gradient of this energy is Ax − b, the negative of the residual r = b − Ax. CG takes a step along a direction p that is not only downhill but is chosen to be A-conjugate to previous directions; that is, directions are arranged so that optimizing along the new one does not spoil the optimality achieved along the old ones—like walking along one axis of the valley, then along another perfectly coordinated axis. This coordination turns the skinny valley into an easy path of orthogonal (in an A-weighted sense) moves. Preconditioning is like changing your hiking boots to ones that adapt to the terrain: the same physical valley becomes more circular in your coordinate system, so each step cuts the remaining distance by a larger factor. Matrix-free application is like following a guide who knows the direction to the bottom without handing you a map: you don’t need to see the matrix A itself; it’s enough to be told how the terrain tilts at any point (i.e., how A acts on a vector).
03 Formal Definition
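For SPD \(A \in \mathbb{R}^{n \times n}\) and \(b \in \mathbb{R}^{n}\), the method can be stated as follows (consistent with the formulas listed later in this lesson):

```latex
\textbf{Conjugate Gradient.}\ \text{Given } x_0,\ \text{set } r_0 = b - A x_0,\ p_0 = r_0.
\text{ For } k = 0, 1, 2, \dots \text{ until convergence:}
\begin{align*}
\alpha_k &= \frac{r_k^{T} r_k}{p_k^{T} A p_k}, &
x_{k+1} &= x_k + \alpha_k p_k, \\
r_{k+1} &= r_k - \alpha_k A p_k, &
\beta_k &= \frac{r_{k+1}^{T} r_{k+1}}{r_k^{T} r_k}, \\
p_{k+1} &= r_{k+1} + \beta_k p_k. &&
\end{align*}
\text{In exact arithmetic } p_i^{T} A p_j = 0 \text{ for } i \neq j,\ \text{and } x_k
\text{ minimizes } \|x - x^{*}\|_A \text{ over } x_0 + \mathcal{K}_k(A, r_0).
```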
04 When to Use
Use CG when A is symmetric positive definite, especially if it is large and sparse. Typical sources include discretized elliptic PDEs (Poisson, diffusion), graph Laplacians, covariance kernels in Gaussian processes (after regularization), and normal equations A^T A x = A^T b when A has full column rank (though prefer LSQR/LSMR to avoid squaring the condition number). Choose CG if forming or factorizing A is too expensive, but you can cheaply compute y = A x (e.g., via sparse storage or matrix-free operators). Prefer Preconditioned CG whenever the condition number is moderate-to-large; simple choices like Jacobi (diagonal), SSOR, or incomplete Cholesky often pay off massively. Avoid CG if A is not symmetric or not positive definite; for symmetric but indefinite matrices, consider MINRES or SYMMLQ; for general nonsymmetric systems, use GMRES or BiCGSTAB. In real-time or memory-limited settings (graphics, physics simulation, optimization inner loops), CG’s low memory footprint and matrix-free nature are compelling. It’s also effective as an inner solver inside Newton or proximal methods, provided the SPD assumption holds at each step.
⚠️ Common Mistakes
- Applying CG to non-symmetric or indefinite matrices: this can cause stagnation or breakdown. Always verify symmetry (A = A^T) and positive definiteness (e.g., via x^T A x > 0 tests on random x or theory).
- Poor stopping criteria: using only an absolute residual can stop too early or too late. Prefer relative criteria like \(\|r_k\|_2/\|b\|_2 \leq \text{tol}\) and optionally monitor the A-norm or energy decrease.
- Ignoring preconditioning: on ill-conditioned problems, plain CG may require thousands of iterations. Even a simple Jacobi preconditioner (inverse diagonal) can reduce iterations drastically.
- Numerical breakdowns: divisions by tiny p_k^T A p_k or negative denominators due to round-off (indicating non-SPD or loss of conjugacy). Add guards and consider restarting.
- Building dense A for large problems: this destroys sparsity and performance. Use sparse formats (CSR/CSC) or matrix-free operators to keep per-iteration cost O(nnz).
- Recomputing A p inefficiently: always compute Ap once per iteration and reuse it; avoid redundant SpMVs.
- Overly tight tolerances in single precision: demanding 1e-12 with float is unrealistic. Match tolerance to precision and problem scale.
- Using CG on normal equations without care: \(\kappa(A^T A) = \kappa(A)^2\) can slow convergence; prefer CG on the normal equations only with strong preconditioning or use LSQR/LSMR.
Key Formulas
Quadratic minimization
\( f(x) = \tfrac{1}{2} x^{T} A x - b^{T} x, \qquad x^{*} = \arg\min_x f(x) \iff A x^{*} = b \)
Explanation: Solving Ax = b with SPD A is equivalent to minimizing a convex quadratic function. The unique minimizer x* is the solution of the linear system.
Gradient of quadratic
\( \nabla f(x) = A x - b = -r, \qquad r = b - A x \)
Explanation: The gradient of the quadratic objective equals the residual with opposite sign. Driving the gradient to zero solves the system.
Step length
\( \alpha_k = \dfrac{r_k^{T} r_k}{p_k^{T} A p_k} \)
Explanation: This exact line-search step minimizes the quadratic along the current direction \(p_k\). The denominator is positive for SPD A and nonzero \(p_k\).
Residual update
\( r_{k+1} = r_k - \alpha_k A p_k \)
Explanation: After moving along \(p_k\) by step \(\alpha_k\), the new residual is obtained by subtracting the component explained by \(A p_k\). This keeps residuals orthogonal in exact arithmetic.
Direction update factor
\( \beta_k = \dfrac{r_{k+1}^{T} r_{k+1}}{r_k^{T} r_k} \)
Explanation: This scalar ensures the new direction is A-conjugate to previous ones. It uses only inner products of residuals.
Direction recurrence
\( p_{k+1} = r_{k+1} + \beta_k p_k \)
Explanation: The next search direction is a combination of the new residual and the previous direction, ensuring mutual A-conjugacy.
Krylov subspace
\( \mathcal{K}_k(A, r_0) = \operatorname{span}\{ r_0, A r_0, \ldots, A^{k-1} r_0 \} \)
Explanation: CG’s k-th iterate minimizes the quadratic over the affine space \(x_0 + \mathcal{K}_k(A, r_0)\). This links CG to Krylov methods more broadly.
Condition number
\( \kappa(A) = \dfrac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \)
Explanation: For SPD A, the ratio of largest to smallest eigenvalues governs how fast CG converges. Smaller \(\kappa\) means faster convergence.
CG convergence bound
\( \|x_k - x^{*}\|_A \leq 2 \left( \dfrac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x^{*}\|_A \)
Explanation: The A-norm error decreases at least geometrically with a rate determined by the condition number. This explains why preconditioning is crucial.
PCG updates
\( z_k = M^{-1} r_k, \qquad \alpha_k = \dfrac{r_k^{T} z_k}{p_k^{T} A p_k}, \qquad \beta_k = \dfrac{r_{k+1}^{T} z_{k+1}}{r_k^{T} z_k}, \qquad p_{k+1} = z_{k+1} + \beta_k p_k \)
Explanation: Preconditioned CG replaces residual inner products with preconditioned residuals \(z_k = M^{-1} r_k\). A good preconditioner M makes the effective operator \(M^{-1}A\) better conditioned than A.
Relative residual stopping rule
\( \dfrac{\|r_k\|_2}{\|b\|_2} \leq \text{tol} \)
Explanation: Terminate when the residual is small relative to the right-hand side. This is a scale-invariant and practical stopping criterion.
Complexity Analysis
Each CG iteration costs one SpMV—O(nnz) for a sparse matrix in CSR/CSC—plus two dot products and three axpy-style updates, each O(n). Memory is a fixed handful of length-n vectors (x, r, p, Ap, plus z for PCG), so storage is O(n) beyond the matrix itself. By the convergence bound, reaching a relative tolerance \(\epsilon\) takes \(O(\sqrt{\kappa}\,\log(1/\epsilon))\) iterations, for total work \(O(\text{nnz} \cdot \sqrt{\kappa}\,\log(1/\epsilon))\); preconditioning reduces the effective \(\kappa\) and hence the iteration count, at the price of applying \(M^{-1}\) once per iteration.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Apply 2D Poisson (Dirichlet) operator on an N x N interior grid using a 5-point stencil.
// u is a flattened vector of size n = N*N; result y = A u.
struct Laplace2D {
    int N;        // interior points per dimension
    double h2inv; // 1 / h^2, with h = 1/(N+1)
    Laplace2D(int N_) : N(N_) { double h = 1.0 / (N + 1); h2inv = 1.0 / (h * h); }
    void apply(const vector<double>& x, vector<double>& y) const {
        int n = N * N;
        y.assign(n, 0.0);
        auto idx = [this](int i, int j){ return i * N + j; }; // i=row, j=col, 0..N-1
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                int k = idx(i,j);
                double center = 4.0 * x[k];
                double north = (i > 0)   ? -x[idx(i-1,j)] : 0.0;
                double south = (i < N-1) ? -x[idx(i+1,j)] : 0.0;
                double west  = (j > 0)   ? -x[idx(i,j-1)] : 0.0;
                double east  = (j < N-1) ? -x[idx(i,j+1)] : 0.0;
                y[k] = h2inv * (center + north + south + west + east);
            }
        }
    }
};

// Conjugate Gradient using a matrix-free apply() routine. Assumes SPD.
struct CGResult { int iters; double rel_res; bool converged; };

CGResult conjugate_gradient(
    function<void(const vector<double>&, vector<double>&)> applyA,
    const vector<double>& b,
    vector<double>& x,
    int maxIters = 1000,
    double tol = 1e-8
){
    int n = (int)b.size();
    vector<double> r(n), p(n), Ap(n);

    // r0 = b - A x0
    applyA(x, Ap);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];

    double bnorm = max(1e-32, sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0)));
    double rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
    if (rnorm / bnorm <= tol) return {0, rnorm / bnorm, true};

    p = r; // p0 = r0
    double rr_old = inner_product(r.begin(), r.end(), r.begin(), 0.0);

    int k = 0; bool ok = false;
    for (; k < maxIters; ++k) {
        applyA(p, Ap); // Ap = A p
        double denom = inner_product(p.begin(), p.end(), Ap.begin(), 0.0);
        if (!(denom > 0)) { // Guard: A must be SPD and p != 0
            break;          // breakdown or non-SPD
        }
        double alpha = rr_old / denom;
        // x_{k+1} = x_k + alpha * p_k, r_{k+1} = r_k - alpha * Ap
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = inner_product(r.begin(), r.end(), r.begin(), 0.0);
        rnorm = sqrt(rr_new);
        double rel = rnorm / bnorm;
        if (rel <= tol) { ok = true; ++k; break; }
        double beta = rr_new / rr_old;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr_old = rr_new;
    }
    double rel = rnorm / bnorm;
    return {k, rel, ok};
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int N = 64; // interior grid size -> n = N*N unknowns
    int n = N * N;
    Laplace2D A(N);

    // Right-hand side b: f = 1 everywhere (the 1/h^2 scaling lives inside apply)
    vector<double> b(n, 1.0);
    vector<double> x(n, 0.0); // initial guess

    auto applyA = [&](const vector<double>& v, vector<double>& y){ A.apply(v, y); };

    CGResult res = conjugate_gradient(applyA, b, x, 1000, 1e-8);

    cout << "Converged: " << boolalpha << res.converged
         << ", iters = " << res.iters
         << ", rel_res = " << setprecision(3) << scientific << res.rel_res << "\n";

    // Sanity: print norm of x
    double xn = sqrt(inner_product(x.begin(), x.end(), x.begin(), 0.0));
    cout << "||x||_2 = " << setprecision(3) << scientific << xn << "\n";
    return 0;
}
```
This program solves a 2D Poisson-like SPD system on an N×N interior grid without explicitly forming the matrix. The Laplace2D::apply method implements y = A x with a five-point stencil. The conjugate_gradient function is a standard CG loop that uses only applyA, dot products, and axpy-style updates. It stops when the relative residual falls below tolerance or when numerical breakdown is detected (e.g., nonpositive p^T A p). This matrix-free approach keeps memory low and demonstrates CG’s operator-only requirement.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct CSR {
    int n;               // dimension (n x n)
    vector<int> row_ptr; // size n+1
    vector<int> col_ind; // size nnz
    vector<double> val;  // size nnz
};

// y = A x for CSR A
void spmv(const CSR& A, const vector<double>& x, vector<double>& y){
    int n = A.n; y.assign(n, 0.0);
    for(int i=0;i<n;++i){
        double sum = 0.0;
        for(int e=A.row_ptr[i]; e<A.row_ptr[i+1]; ++e){
            sum += A.val[e] * x[A.col_ind[e]];
        }
        y[i] = sum;
    }
}

// Build tridiagonal SPD matrix from 1D Laplacian: diag=2, offdiag=-1 (Dirichlet)
CSR build_tridiag_laplacian(int n){
    CSR A; A.n = n;
    A.row_ptr.resize(n+1); A.col_ind.reserve(3*n-2); A.val.reserve(3*n-2);
    int nnz = 0; A.row_ptr[0] = 0;
    for(int i=0;i<n;++i){
        if(i-1>=0){ A.col_ind.push_back(i-1); A.val.push_back(-1.0); ++nnz; }
        A.col_ind.push_back(i); A.val.push_back( 2.0); ++nnz;
        if(i+1<n){ A.col_ind.push_back(i+1); A.val.push_back(-1.0); ++nnz; }
        A.row_ptr[i+1] = nnz;
    }
    return A;
}

struct PCGResult { int iters; double rel_res; bool converged; };

// Preconditioned CG with Jacobi (M = diag(A))
PCGResult pcg(const CSR& A, const vector<double>& b, vector<double>& x,
              int maxIters=1000, double tol=1e-8){
    int n = A.n;
    vector<double> r(n), z(n), p(n), Ap(n), Minv(n);

    // Build Jacobi preconditioner: Minv[i] = 1 / A[i,i]
    for(int i=0;i<n;++i){
        double di = 0.0;
        for(int e=A.row_ptr[i]; e<A.row_ptr[i+1]; ++e){ if(A.col_ind[e]==i){ di = A.val[e]; break; } }
        if(!(di>0)) di = 1.0; // guard: SPD should have positive diagonal
        Minv[i] = 1.0 / di;
    }

    // r0 = b - A x0
    spmv(A, x, Ap);
    for(int i=0;i<n;++i) r[i] = b[i] - Ap[i];

    double bnorm = max(1e-32, sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0)));
    double rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
    if(rnorm / bnorm <= tol) return {0, rnorm/bnorm, true};

    // z0 = M^{-1} r0, p0 = z0
    for(int i=0;i<n;++i) z[i] = Minv[i] * r[i];
    p = z;

    double rz_old = inner_product(r.begin(), r.end(), z.begin(), 0.0);

    int k=0; bool ok=false;
    for(; k<maxIters; ++k){
        spmv(A, p, Ap);
        double denom = inner_product(p.begin(), p.end(), Ap.begin(), 0.0);
        if(!(denom>0)) break; // breakdown or non-SPD
        double alpha = rz_old / denom;
        for(int i=0;i<n;++i){
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
        double rel = rnorm / bnorm;
        if(rel <= tol){ ok=true; ++k; break; }
        for(int i=0;i<n;++i) z[i] = Minv[i] * r[i];
        double rz_new = inner_product(r.begin(), r.end(), z.begin(), 0.0);
        double beta = rz_new / rz_old;
        for(int i=0;i<n;++i) p[i] = z[i] + beta * p[i];
        rz_old = rz_new;
    }
    return {k, rnorm/bnorm, ok};
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 10000; // 1D problem size (large, sparse, SPD)
    CSR A = build_tridiag_laplacian(n);

    vector<double> b(n, 1.0); // RHS
    vector<double> x(n, 0.0); // initial guess

    PCGResult res = pcg(A, b, x, 10000, 1e-8);
    cout << "Converged: " << boolalpha << res.converged
         << ", iters = " << res.iters
         << ", rel_res = " << setprecision(3) << scientific << res.rel_res << "\n";

    // Sanity: print the solution norm. For this system the exact solution is
    // u_i = i*(n+1-i)/2, so the norm is large (components grow like n^2/8).
    double xn = sqrt(inner_product(x.begin(), x.end(), x.begin(), 0.0));
    cout << "||x||_2 = " << setprecision(3) << scientific << xn << "\n";
    return 0;
}
```
This program builds a large sparse SPD tridiagonal matrix in CSR format and solves Ax = b using Preconditioned CG with a Jacobi preconditioner. The CSR representation enables O(nnz) SpMVs per iteration. Preconditioning uses only the inverse diagonal, which is cheap and often effective for diagonally dominant problems. Guards detect breakdown if p^T A p ≤ 0, which would indicate non-SPD behavior or numerical issues.