How I Study AI - Learn AI Papers & Lectures the Easy Way
Algorithm · Intermediate

Iterative Methods for Linear Systems

Key Points

  • The Conjugate Gradient (CG) method solves large, sparse, symmetric positive definite (SPD) linear systems Ax = b using only matrix–vector products and dot products.
  • CG follows the gradient of a quadratic energy but uses smarter, A-conjugate directions to reach the solution in at most n steps in exact arithmetic.
  • Its convergence rate depends on the condition number κ(A); preconditioning can dramatically speed it up by reducing κ.
  • Each iteration is cheap for sparse matrices: one SpMV (sparse matrix–vector multiply), a few vector updates, and two dot products.
  • CG is matrix-free friendly: you only need a routine to compute y = A x, not the matrix entries themselves.
  • Use Preconditioned CG (PCG) with simple preconditioners like Jacobi or incomplete Cholesky for tough problems.
  • Stop based on the relative residual ∥r_k∥_2 / ∥b∥_2 and guard against breakdowns (e.g., division by zero) in finite precision.
  • Do not use CG on non-symmetric or indefinite systems; choose MINRES (symmetric indefinite) or GMRES (general) instead.

Prerequisites

  • Vector and matrix operations — CG uses dot products, norms, and matrix–vector products at every iteration.
  • Symmetry and positive definiteness — Correctness and convergence of CG require A to be SPD.
  • Quadratic optimization — CG minimizes a quadratic energy; understanding gradients explains the updates.
  • Eigenvalues and condition number — They determine convergence rates and motivate preconditioning.
  • Sparse matrix storage (CSR/CSC) — Efficient SpMV is crucial for large-scale performance.
  • Floating-point arithmetic and stability — Finite precision affects conjugacy and stopping criteria.
  • C++ STL vectors and algorithms — Implementations rely on vector containers, inner_product, and lambdas.
  • Boundary-value problems and discretization (optional) — Common SPD systems arise from PDE discretizations like the Poisson equation.

Detailed Explanation


01 Overview

The Conjugate Gradient (CG) method is an iterative algorithm for solving linear systems Ax = b where A is symmetric positive definite (SPD). Instead of directly factoring A (as in Gaussian elimination or Cholesky), CG builds a sequence of search directions that are mutually A-conjugate (a generalized form of orthogonality). Along each direction, it performs an exact line search on a quadratic energy function and updates the solution. For large, sparse systems, CG is attractive because each iteration needs only one matrix–vector multiplication, a few vector operations, and a handful of dot products. Memory use is light—only a small, fixed number of vectors are stored.

In exact arithmetic, CG would converge in at most n iterations for an n×n system, but in practice, rounding errors and conditioning determine how many steps are needed. The convergence rate is governed by the condition number of A; well-conditioned problems converge fast, while ill-conditioned ones may stall. Preconditioning transforms the system into one with the same solution but with improved conditioning.

CG is widely used in scientific computing, computer graphics, machine learning (e.g., kernel methods), and anywhere large SPD systems arise, especially from discretized partial differential equations. It also works in matrix-free settings where A is not explicitly formed but applied as an operator.

02 Intuition & Analogies

Imagine you are hiking to the lowest point of a bowl-shaped valley. Plain gradient descent walks straight downhill, but if the bowl is squashed (elongated like a valley), you zig-zag and progress is slow. CG is like a hiker who remembers where they’ve walked and picks smarter directions that avoid undoing previous progress. The valley here is the quadratic energy f(x) = 1/2 x^T A x − b^T x, whose minimum is the solution to Ax = b. The gradient of this energy is Ax − b, the negative of the residual r = b − Ax. CG takes a step along a direction p that is not only downhill but is chosen to be A-conjugate to previous directions; that is, directions are arranged so that optimizing along the new one does not spoil the optimality achieved along the old ones—like walking along one axis of the valley, then along another perfectly coordinated axis. This coordination turns the skinny valley into an easy path of orthogonal (in an A-weighted sense) moves.

Preconditioning is like changing your hiking boots to ones that adapt to the terrain: the same physical valley becomes more circular in your coordinate system, so each step cuts the remaining distance by a larger factor. Matrix-free application is like following a guide who knows the direction to the bottom without handing you a map: you don’t need to see the matrix A itself; it’s enough to be told how the terrain tilts at any point (i.e., how A acts on a vector).

03 Formal Definition

Let A ∈ R^{n×n} be symmetric positive definite (SPD): x^T A x > 0 for all nonzero x. The linear system Ax = b is equivalent to minimizing the strictly convex quadratic f(x) = (1/2) x^T A x − b^T x. Starting from x_0, define the residual r_0 = b − A x_0 and search direction p_0 = r_0. At iteration k (k = 0, 1, …), compute α_k = (r_k^T r_k) / (p_k^T A p_k), then update x_{k+1} = x_k + α_k p_k and r_{k+1} = r_k − α_k A p_k. Then set β_{k+1} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k) and define p_{k+1} = r_{k+1} + β_{k+1} p_k. The directions {p_k} are A-conjugate: p_i^T A p_j = 0 for i ≠ j, and the residuals are mutually orthogonal: r_i^T r_j = 0 for i ≠ j. The iterate x_k lies in the affine space x_0 + K_k(A, r_0), where K_k(A, r_0) = span{r_0, A r_0, …, A^{k−1} r_0} is the k-th Krylov subspace. In exact arithmetic, CG finds the minimizer in at most n steps. For preconditioned CG, pick an SPD M ≈ A that is easy to invert. Replace the inner products r_k^T r_k by r_k^T z_k with z_k = M^{-1} r_k, and update p_{k+1} = z_{k+1} + β_{k+1} p_k with β_{k+1} = (r_{k+1}^T z_{k+1}) / (r_k^T z_k). Convergence depends on the eigenvalue distribution and κ(M^{-1} A).

04 When to Use

Use CG when A is symmetric positive definite, especially if it is large and sparse. Typical sources include discretized elliptic PDEs (Poisson, diffusion), graph Laplacians, covariance kernels in Gaussian processes (after regularization), and normal equations A^T A x = A^T b when A has full column rank (though prefer LSQR/LSMR to avoid squaring the condition number). Choose CG if forming or factorizing A is too expensive, but you can cheaply compute y = A x (e.g., via sparse storage or matrix-free operators). Prefer Preconditioned CG whenever the condition number is moderate-to-large; simple choices like Jacobi (diagonal), SSOR, or incomplete Cholesky often pay off massively. Avoid CG if A is not symmetric or not positive definite; for symmetric but indefinite matrices, consider MINRES or SYMMLQ; for general nonsymmetric systems, use GMRES or BiCGSTAB. In real-time or memory-limited settings (graphics, physics simulation, optimization inner loops), CG’s low memory footprint and matrix-free nature are compelling. It’s also effective as an inner solver inside Newton or proximal methods, provided the SPD assumption holds at each step.

⚠️Common Mistakes

  • Applying CG to non-symmetric or indefinite matrices: this can cause stagnation or breakdown. Always verify symmetry (A = A^T) and positive definiteness (e.g., via x^T A x > 0 tests on random x or theory).
  • Poor stopping criteria: using only an absolute residual can stop too early or too late. Prefer relative criteria like ∥r_k∥_2 / ∥b∥_2 ≤ tol and optionally monitor the A-norm or energy decrease.
  • Ignoring preconditioning: on ill-conditioned problems, plain CG may require thousands of iterations. Even a simple Jacobi preconditioner (inverse diagonal) can reduce iterations drastically.
  • Numerical breakdowns: divisions by tiny p_k^T A p_k or negative denominators due to round-off (indicating non-SPD or loss of conjugacy). Add guards and consider restarting.
  • Building dense A for large problems: this destroys sparsity and performance. Use sparse formats (CSR/CSC) or matrix-free operators to keep per-iteration cost O(nnz).
  • Recomputing A p inefficiently: always compute Ap once per iteration and reuse it; avoid redundant SpMVs.
  • Overly tight tolerances in single precision: demanding 1e-12 with float is unrealistic. Match tolerance to precision and problem scale.
  • Using CG on normal equations without care: κ(A^T A) = κ(A)^2 can slow convergence; prefer CG on the normal equations only with strong preconditioning, or use LSQR/LSMR.

Key Formulas

Quadratic minimization

x* = argmin_{x ∈ R^n} ( (1/2) x^T A x − b^T x )

Explanation: Solving Ax = b with SPD A is equivalent to minimizing a convex quadratic function. The unique minimizer x* is the solution of the linear system.

Gradient of quadratic

∇f(x) = Ax − b

Explanation: The gradient of the quadratic objective equals the residual with opposite sign. Driving the gradient to zero solves the system.

Step length

α_k = (r_k^T r_k) / (p_k^T A p_k)

Explanation: This exact line-search step minimizes the quadratic along the current direction p_k. The denominator is positive for SPD A and nonzero p_k.

Residual update

r_{k+1} = r_k − α_k A p_k

Explanation: After moving along p_k by step α_k, the new residual is obtained by subtracting the component explained by A p_k. This keeps residuals orthogonal in exact arithmetic.

Direction update factor

β_{k+1} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)

Explanation: This scalar ensures the new direction p_{k+1} is A-conjugate to previous ones. It uses only inner products of residuals.

Direction recurrence

p_{k+1} = r_{k+1} + β_{k+1} p_k

Explanation: The next search direction is a combination of the new residual and the previous direction, ensuring mutual A-conjugacy.

Krylov subspace

K_k(A, r_0) = span{r_0, A r_0, …, A^{k−1} r_0}

Explanation: CG’s k-th iterate minimizes the quadratic over the affine space x_0 + K_k(A, r_0). This links CG to Krylov methods more broadly.

Condition number

κ(A) = λ_max(A) / λ_min(A)

Explanation: For SPD A, the ratio of largest to smallest eigenvalues governs how fast CG converges. Smaller κ means faster convergence.

CG convergence bound

∥e_k∥_A / ∥e_0∥_A ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )^k

Explanation: The A-norm error decreases at least geometrically with a rate determined by the condition number. This explains why preconditioning is crucial.

PCG updates

α_k^{PCG} = (r_k^T z_k) / (p_k^T A p_k),  β_{k+1}^{PCG} = (r_{k+1}^T z_{k+1}) / (r_k^T z_k),  z_k = M^{-1} r_k

Explanation: Preconditioned CG replaces residual inner products with preconditioned residuals. Good preconditioners make M^{-1} A better conditioned.

Relative residual stopping rule

∥r_k∥_2 / ∥b∥_2 ≤ ε

Explanation: Terminate when the residual is small relative to the right-hand side. This is a scale-invariant and practical stopping criterion.

Complexity Analysis

Per iteration, CG performs one matrix–vector product y = A p, two or three axpy-style vector updates (x ← x + α p, r ← r − α y, p ← r + β p), and 1–2 dot products. For a sparse matrix with nnz nonzeros, the SpMV costs O(nnz), while each vector operation costs O(n); thus the total per-iteration time is O(nnz + n) ≈ O(nnz). For dense matrices, the dominant cost is the matrix–vector product at O(n^2). Memory use is modest: CG stores a fixed number of n-length vectors (e.g., x, r, p, Ap, and possibly z in PCG), giving O(n) auxiliary memory beyond A. If A is stored in CSR, its memory is O(nnz).

The total runtime to reach a tolerance ε depends on the iteration count k. In exact arithmetic, CG converges in at most n iterations, but in practice k is much smaller and is governed by the eigenvalue spread via the condition number κ(A). The A-norm error typically contracts like ((√κ − 1)/(√κ + 1))^k, so k = O(√κ log(1/ε)). Preconditioning changes the effective matrix to M^{-1} A, reducing κ and hence k.

Parallel considerations: the SpMV is bandwidth-bound and parallelizes over rows; dot products require global reductions (synchronization), which can limit scalability on distributed systems. Matrix-free implementations have identical asymptotic costs but avoid storing A, trading arithmetic for memory. Numerical stability is generally good, but finite precision can cause loss of conjugacy; occasional restarts or reorthogonalization can mitigate this with small overhead.

Code Examples

Matrix-free Conjugate Gradient for 2D Poisson (five-point stencil)
#include <bits/stdc++.h>
using namespace std;

// Apply the 2D Poisson (Dirichlet) operator on an N x N interior grid using a 5-point stencil.
// x is a flattened vector of size n = N*N; result y = A x.
struct Laplace2D {
    int N;        // interior points per dimension
    double h2inv; // 1 / h^2, with h = 1/(N+1)
    Laplace2D(int N_) : N(N_) { double h = 1.0 / (N + 1); h2inv = 1.0 / (h * h); }
    void apply(const vector<double>& x, vector<double>& y) const {
        int n = N * N;
        y.assign(n, 0.0);
        auto idx = [this](int i, int j){ return i * N + j; }; // i=row, j=col, 0..N-1
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                int k = idx(i, j);
                double center = 4.0 * x[k];
                double north = (i > 0)     ? -x[idx(i-1, j)] : 0.0;
                double south = (i < N - 1) ? -x[idx(i+1, j)] : 0.0;
                double west  = (j > 0)     ? -x[idx(i, j-1)] : 0.0;
                double east  = (j < N - 1) ? -x[idx(i, j+1)] : 0.0;
                y[k] = h2inv * (center + north + south + west + east);
            }
        }
    }
};

// Conjugate Gradient using a matrix-free apply() routine. Assumes SPD.
struct CGResult { int iters; double rel_res; bool converged; };

CGResult conjugate_gradient(
    function<void(const vector<double>&, vector<double>&)> applyA,
    const vector<double>& b,
    vector<double>& x,
    int maxIters = 1000,
    double tol = 1e-8
){
    int n = (int)b.size();
    vector<double> r(n), p(n), Ap(n);

    // r0 = b - A x0
    applyA(x, Ap);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];

    double bnorm = max(1e-32, sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0)));
    double rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
    if (rnorm / bnorm <= tol) return {0, rnorm / bnorm, true};

    p = r; // p0 = r0
    double rr_old = inner_product(r.begin(), r.end(), r.begin(), 0.0);

    int k = 0; bool ok = false;
    for (; k < maxIters; ++k) {
        applyA(p, Ap); // Ap = A p
        double denom = inner_product(p.begin(), p.end(), Ap.begin(), 0.0);
        if (!(denom > 0)) { // Guard: A must be SPD and p != 0
            break;          // breakdown or non-SPD
        }
        double alpha = rr_old / denom;
        // x_{k+1} = x_k + alpha * p_k, r_{k+1} = r_k - alpha * Ap
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = inner_product(r.begin(), r.end(), r.begin(), 0.0);
        rnorm = sqrt(rr_new);
        double rel = rnorm / bnorm;
        if (rel <= tol) { ok = true; ++k; break; }
        double beta = rr_new / rr_old;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr_old = rr_new;
    }
    return {k, rnorm / bnorm, ok};
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int N = 64; // interior grid size -> n = N*N unknowns
    int n = N * N;
    Laplace2D A(N);

    // Right-hand side b: f = 1 everywhere (the 1/h^2 scaling lives in apply)
    vector<double> b(n, 1.0);
    vector<double> x(n, 0.0); // initial guess

    auto applyA = [&](const vector<double>& v, vector<double>& y){ A.apply(v, y); };

    CGResult res = conjugate_gradient(applyA, b, x, 1000, 1e-8);

    cout << "Converged: " << boolalpha << res.converged
         << ", iters = " << res.iters
         << ", rel_res = " << setprecision(3) << scientific << res.rel_res << "\n";

    // Sanity: print norm of x
    double xn = sqrt(inner_product(x.begin(), x.end(), x.begin(), 0.0));
    cout << "||x||_2 = " << setprecision(3) << scientific << xn << "\n";
    return 0;
}

This program solves a 2D Poisson-like SPD system on an N×N interior grid without explicitly forming the matrix. The Laplace2D::apply method implements y = A x with a five-point stencil. The conjugate_gradient function is a standard CG loop that uses only applyA, dot products, and axpy-style updates. It stops when the relative residual falls below tolerance or when numerical breakdown is detected (e.g., nonpositive p^T A p). This matrix-free approach keeps memory low and demonstrates CG’s operator-only requirement.

Time: O(n·k) with a small constant for the 5-point stencil (≈ 5n per iteration), where k is the iteration count.
Space: O(n) for a fixed number of work vectors (x, r, p, Ap).
Preconditioned CG (PCG) with CSR matrix and Jacobi preconditioner
#include <bits/stdc++.h>
using namespace std;

struct CSR {
    int n;                // dimension (n x n)
    vector<int> row_ptr;  // size n+1
    vector<int> col_ind;  // size nnz
    vector<double> val;   // size nnz
};

// y = A x for CSR A
void spmv(const CSR& A, const vector<double>& x, vector<double>& y){
    int n = A.n;
    y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int e = A.row_ptr[i]; e < A.row_ptr[i+1]; ++e) {
            sum += A.val[e] * x[A.col_ind[e]];
        }
        y[i] = sum;
    }
}

// Build tridiagonal SPD matrix from the 1D Laplacian: diag = 2, offdiag = -1 (Dirichlet)
CSR build_tridiag_laplacian(int n){
    CSR A; A.n = n;
    A.row_ptr.resize(n+1); A.col_ind.reserve(3*n-2); A.val.reserve(3*n-2);
    int nnz = 0; A.row_ptr[0] = 0;
    for (int i = 0; i < n; ++i) {
        if (i-1 >= 0) { A.col_ind.push_back(i-1); A.val.push_back(-1.0); ++nnz; }
        A.col_ind.push_back(i); A.val.push_back(2.0); ++nnz;
        if (i+1 < n)  { A.col_ind.push_back(i+1); A.val.push_back(-1.0); ++nnz; }
        A.row_ptr[i+1] = nnz;
    }
    return A;
}

struct PCGResult { int iters; double rel_res; bool converged; };

// Preconditioned CG with Jacobi (M = diag(A))
PCGResult pcg(const CSR& A, const vector<double>& b, vector<double>& x,
              int maxIters = 1000, double tol = 1e-8){
    int n = A.n;
    vector<double> r(n), z(n), p(n), Ap(n), Minv(n);

    // Build Jacobi preconditioner: Minv[i] = 1 / A[i,i]
    for (int i = 0; i < n; ++i) {
        double di = 0.0;
        for (int e = A.row_ptr[i]; e < A.row_ptr[i+1]; ++e) {
            if (A.col_ind[e] == i) { di = A.val[e]; break; }
        }
        if (!(di > 0)) di = 1.0; // guard: SPD should have positive diagonal
        Minv[i] = 1.0 / di;
    }

    // r0 = b - A x0
    spmv(A, x, Ap);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];

    double bnorm = max(1e-32, sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0)));
    double rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
    if (rnorm / bnorm <= tol) return {0, rnorm / bnorm, true};

    // z0 = M^{-1} r0, p0 = z0
    for (int i = 0; i < n; ++i) z[i] = Minv[i] * r[i];
    p = z;

    double rz_old = inner_product(r.begin(), r.end(), z.begin(), 0.0);

    int k = 0; bool ok = false;
    for (; k < maxIters; ++k) {
        spmv(A, p, Ap);
        double denom = inner_product(p.begin(), p.end(), Ap.begin(), 0.0);
        if (!(denom > 0)) break; // breakdown or non-SPD
        double alpha = rz_old / denom;
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        rnorm = sqrt(inner_product(r.begin(), r.end(), r.begin(), 0.0));
        double rel = rnorm / bnorm;
        if (rel <= tol) { ok = true; ++k; break; }
        for (int i = 0; i < n; ++i) z[i] = Minv[i] * r[i];
        double rz_new = inner_product(r.begin(), r.end(), z.begin(), 0.0);
        double beta = rz_new / rz_old;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz_old = rz_new;
    }
    return {k, rnorm / bnorm, ok};
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 10000; // 1D problem size (large, sparse, SPD)
    CSR A = build_tridiag_laplacian(n);

    vector<double> b(n, 1.0); // RHS
    vector<double> x(n, 0.0); // initial guess

    PCGResult res = pcg(A, b, x, 10000, 1e-8);
    cout << "Converged: " << boolalpha << res.converged
         << ", iters = " << res.iters
         << ", rel_res = " << setprecision(3) << scientific << res.rel_res << "\n";

    // Sanity: print the norm of the computed solution
    double xn = sqrt(inner_product(x.begin(), x.end(), x.begin(), 0.0));
    cout << "||x||_2 = " << setprecision(3) << scientific << xn << "\n";
    return 0;
}

This program builds a large sparse SPD tridiagonal matrix in CSR format and solves Ax = b using Preconditioned CG with a Jacobi preconditioner. The CSR representation enables O(nnz) SpMVs per iteration. Preconditioning uses only the inverse diagonal, which is cheap and often effective for diagonally dominant problems. Guards detect breakdown if p^T A p ≤ 0, which would indicate non-SPD behavior or numerical issues.

Time: O(nnz · k), with nnz ≈ 3n − 2 for the tridiagonal case, where k is the iteration count.
Space: O(nnz + n) for CSR storage plus a constant number of n-length work vectors.
#conjugate gradient #iterative solver #krylov subspace #symmetric positive definite #preconditioned cg #jacobi preconditioner #sparse linear systems #matrix-free #poisson equation #csr format #residual #a-norm #condition number #spmv #numerical linear algebra