Positive Definite Matrices
Key Points
- A real symmetric matrix A is positive definite if x^T A x > 0 for every nonzero vector x, and positive semidefinite if x^T A x ≥ 0 for all x.
- All eigenvalues of a positive definite matrix are strictly positive; for semidefinite they are nonnegative.
- A positive definite matrix admits a unique Cholesky factorization A = L L^T with L lower triangular and positive diagonal.
- Sylvester’s criterion says A is positive definite iff all leading principal minors det(A_k) are positive.
- Cholesky-based solves are faster and more stable than generic LU for symmetric positive definite systems Ax = b.
- Any Gram matrix A = B^T B is positive semidefinite; it is positive definite if B has full column rank.
- Numerically, tiny negative pivots can appear due to roundoff; use tolerances and prefer Cholesky to naive minor tests.
- In optimization, covariance, kernels, and energy functions naturally produce (semi)definite matrices, enabling efficient algorithms.
Prerequisites
- Matrices and vectors — Positive definiteness is defined via quadratic forms x^T A x and requires familiarity with matrix–vector operations.
- Eigenvalues and eigenvectors — Eigenvalue characterizations (λ_i > 0 or ≥ 0) are central to understanding (semi)definiteness.
- Matrix factorizations (LU/QR) — Cholesky is a specialized factorization; comparing it with LU/QR clarifies when and why to use it.
- Convexity and Hessians — Definiteness of the Hessian determines local curvature and convexity in optimization problems.
Detailed Explanation
01 Overview
Positive definite (PD) and positive semidefinite (PSD) matrices are special classes of real symmetric matrices that behave like positive numbers in many algebraic and geometric ways. They guarantee strictly positive (or nonnegative) quadratic forms x^T A x, which implies important properties: invertibility for PD matrices, all real eigenvalues, and the existence of efficient factorizations such as the Cholesky decomposition A = L L^T.

These properties are fundamental in numerical linear algebra, optimization, statistics, and machine learning. For example, covariance matrices are PSD; Hessians of strictly convex functions are PD at minima; kernel Gram matrices are PSD by construction. Algorithmically, PD structure enables faster and more stable linear solves and log-determinant computations.

The eigenvalue characterization (all eigenvalues > 0 for PD, ≥ 0 for PSD) ties geometry (curvature) to algebra. Sylvester’s criterion provides a determinant-based test using leading principal minors. In practice, we often test PD by attempting a Cholesky factorization; success with positive diagonals confirms PD. Understanding (semi)definiteness helps diagnose model validity (e.g., ensuring covariance matrices are well-posed) and choose efficient computational strategies.
02 Intuition & Analogies
Think of a matrix as a machine that bends space. Feed in a vector x, and measure the “energy” it creates as x^T A x. For a PD matrix, no matter which nonzero direction you push, the energy is strictly positive—like compressing a perfect spring that always resists and stores energy. For PSD, the energy is never negative; it may be zero for some special directions (like pushing along a hinge that moves freely). The eigenvectors are the principal directions of bending, and the eigenvalues tell you how strongly each direction is resisted. PD means every direction has strictly positive resistance; PSD allows some “flat” directions with zero resistance.

Cholesky factorization is like finding a coordinate system (L) where this energy is just the sum of squared coordinates: x^T A x = ||L^T x||_2^2. This looks like Pythagoras—no cross terms—so computations become simpler. Gram matrices A = B^T B encode dot products between columns of B; energies become squared lengths after a linear transform, which can never be negative.

Sylvester’s criterion resembles checking whether every top-left submachine behaves like a good spring—if each smaller system stores positive energy, the whole one does. In practice, attempting Cholesky is like trying to take a stable square root of the matrix: if the attempt fails (you need the square root of a negative number), the matrix wasn’t PD.
03 Formal Definition
Let A ∈ R^{n×n} be symmetric (A = A^T). Then A is positive definite if x^T A x > 0 for every nonzero x ∈ R^n, and positive semidefinite if x^T A x ≥ 0 for every x ∈ R^n. Equivalently, A is PD iff all eigenvalues λ_i > 0 (PSD iff all λ_i ≥ 0), iff A = L L^T for a unique lower triangular L with positive diagonal (Cholesky), and iff all leading principal minors det(A_k) are positive (Sylvester’s criterion).
04 When to Use
- Solving symmetric positive definite linear systems Ax = b: use Cholesky instead of LU to gain speed (about half the flops) and numerical stability.
- Statistical modeling: covariance matrices must be PSD (often PD for strict invertibility). Cholesky is used to simulate correlated normals and compute log-likelihoods via log det(A).
- Optimization: Hessians of strictly convex quadratic objectives are PD; more general convex Hessians are PSD. Newton and trust-region methods exploit PD structure.
- Machine learning: kernel (Gram) matrices from inner products are PSD; enforcing PSD ensures valid kernels and stable training.
- Control and signal processing: Lyapunov and Riccati equations rely on PD solutions; energy and stability analyses often reduce to checking definiteness.
- Regularization: adding λI (λ > 0) to a PSD matrix yields PD, improving conditioning and guaranteeing unique solutions.

Choose PD methods when your matrix is symmetric and you can argue x^T A x is strictly positive (or enforce it via regularization). Use PSD-aware logic when zero-energy directions may exist (e.g., underdetermined problems, rank deficiency).
⚠️ Common Mistakes
- Confusing elementwise positivity (all a_{ij} > 0) with positive definiteness; they are unrelated. A matrix can have negative entries and still be PD, or have all positive entries and be indefinite.
- Forgetting symmetry: x^T A x depends only on the symmetric part (1/2)(A + A^T). Non-symmetric A cannot be PD in the standard sense; always symmetrize before testing.
- Numerical pitfalls in Sylvester’s criterion: determinants of leading minors can under/overflow and are extremely sensitive to roundoff. Prefer Cholesky as a practical PD test.
- Ignoring tolerances: tiny negative pivots during Cholesky can arise from rounding even for theoretically PD A. Use a small ε and treat pivots below ε as failure.
- Using generic LU/QR to solve SPD systems: this wastes structure and may reduce stability. Cholesky is simpler, faster, and more accurate for SPD.
- Misinterpreting PSD vs PD: PSD allows zero eigenvalues, so A may be singular; attempting to invert or Cholesky-factor a merely PSD matrix (without pivoting or jitter) will fail.
- Assuming positive trace or determinant implies PD; they are necessary (det > 0 for PD) but not sufficient alone. Always use rigorous tests (Cholesky, eigenvalues, or Sylvester’s full criterion).
Key Formulas
PD via quadratic form
x^T A x > 0 for all x ≠ 0 (x^T A x ≥ 0 for semidefinite)
Explanation: This is the defining property of positive definite matrices: every nonzero direction yields strictly positive energy; the inequality relaxes to ≥ 0 for semidefinite.
Symmetry requirement
A = A^T
Explanation: Definiteness is defined for symmetric matrices; the quadratic form depends only on the symmetric part (1/2)(A + A^T) of A.
Eigenvalue characterization
A is PD iff all eigenvalues λ_i > 0; PSD iff all λ_i ≥ 0
Explanation: All eigenvalues of a PD matrix are strictly positive; for PSD they are nonnegative. This ties curvature to spectral properties.
Cholesky factorization
A = L L^T, with L lower triangular and L_ii > 0
Explanation: Every PD matrix has a unique Cholesky factor L with positive diagonal. It is a numerically stable and efficient factorization.
Sylvester’s criterion
det(A_k) > 0 for k = 1, …, n, where A_k is the leading k×k principal submatrix
Explanation: All leading principal minors must be positive for positive definiteness. It provides a determinant-based test, though less stable numerically.
Rayleigh quotient bounds
λ_min ≤ (x^T A x) / (x^T x) ≤ λ_max for all x ≠ 0
Explanation: The Rayleigh quotient lies between the smallest and largest eigenvalues. It provides variational characterizations of eigenvalues.
Energy bounds
λ_min ||x||^2 ≤ x^T A x ≤ λ_max ||x||^2
Explanation: For PD A, the quadratic form is bounded above and below by eigenvalues times the squared norm, ensuring coercivity and stability.
Gram representation
A = B^T B is PSD for any real m×n matrix B; PD if rank(B) = n
Explanation: Any matrix of the form B^T B is PSD. If B has full column rank n, then A is PD.
Cholesky complexity
≈ n^3/3 flops, O(n^2) storage
Explanation: Cholesky requires roughly n^3/3 floating-point operations and stores O(n^2) numbers. It is about twice as fast as LU for SPD.
Log-determinant via Cholesky
det(A) = (∏_i L_ii)^2, so log det(A) = 2 Σ_i log L_ii
Explanation: For A = L L^T with positive diagonal L, the determinant is the square of the product of the diagonal entries. Taking logs turns products into sums for numerical stability.
Condition number (2-norm)
κ_2(A) = λ_max / λ_min
Explanation: For PD matrices, the 2-norm condition number is the ratio of extreme eigenvalues. It predicts error amplification in linear solves.
Complexity Analysis
- Cholesky factorization: ≈ n^3/3 flops for an n×n SPD matrix, about half the cost of LU.
- Forward and back substitution with the triangular factors: O(n^2) each.
- Log-determinant from an existing Cholesky factor: O(n).
- Storage: O(n^2); only the lower triangle of L is needed.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Check if a matrix is (approximately) symmetric within tolerance
bool isSymmetric(const vector<vector<double>>& A, double tol = 1e-10) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        if ((int)A[i].size() != n) return false;
        for (int j = i + 1; j < n; ++j) {
            if (fabs(A[i][j] - A[j][i]) > tol * (1.0 + max(fabs(A[i][j]), fabs(A[j][i]))))
                return false;
        }
    }
    return true;
}

// Computes the Cholesky factorization A = L * L^T for SPD A.
// Returns true on success and fills L (lower triangular).
// On failure (not PD), returns false and L is partial.
bool choleskyDecompose(const vector<vector<double>>& A,
                       vector<vector<double>>& L,
                       double eps = 1e-12) {
    int n = (int)A.size();
    L.assign(n, vector<double>(n, 0.0));

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            long double sum = A[i][j];
            for (int k = 0; k < j; ++k) sum -= (long double)L[i][k] * L[j][k];

            if (i == j) {
                // Diagonal element: must be positive for PD
                if (sum <= eps) return false; // not PD (or numerically singular)
                L[i][i] = sqrt((double)sum);
            } else {
                L[i][j] = (double)(sum / L[j][j]);
            }
        }
    }
    return true;
}

int main() {
    // Example SPD matrix
    vector<vector<double>> A = {
        { 4.0,  2.0, 2.0 },
        { 2.0, 10.0, 4.0 },
        { 2.0,  4.0, 9.0 }
    };

    if (!isSymmetric(A)) {
        cerr << "Matrix is not symmetric; definiteness not defined in the standard sense.\n";
        return 0;
    }

    vector<vector<double>> L;
    bool ok = choleskyDecompose(A, L);
    if (!ok) {
        cout << "Cholesky failed: A is not positive definite (or ill-conditioned).\n";
        return 0;
    }

    cout << "Cholesky factor L (lower triangular):\n";
    cout.setf(std::ios::fixed); cout << setprecision(6);
    for (int i = 0; i < (int)L.size(); ++i) {
        for (int j = 0; j < (int)L.size(); ++j) cout << setw(10) << L[i][j] << ' ';
        cout << '\n';
    }

    // Optional: compute log(det(A)) = 2 * sum log L_{ii}
    double logdet = 0.0;
    for (int i = 0; i < (int)L.size(); ++i) logdet += 2.0 * log(L[i][i]);
    cout << "log(det(A)) = " << logdet << "\n";

    return 0;
}
Attempts a Cholesky factorization A = L L^T. If any diagonal pivot becomes nonpositive, A is not PD (or is numerically singular), and the routine reports failure. On success, L has positive diagonal entries and can be used for efficient solves and determinant computations.
#include <bits/stdc++.h>
using namespace std;

bool choleskyDecompose(const vector<vector<double>>& A,
                       vector<vector<double>>& L,
                       double eps = 1e-12) {
    int n = (int)A.size();
    L.assign(n, vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            long double sum = A[i][j];
            for (int k = 0; k < j; ++k) sum -= (long double)L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= eps) return false;
                L[i][i] = sqrt((double)sum);
            } else {
                L[i][j] = (double)(sum / L[j][j]);
            }
        }
    }
    return true;
}

// Solve L y = b (L lower triangular with positive diagonal)
vector<double> forwardSubstitution(const vector<vector<double>>& L, const vector<double>& b) {
    int n = (int)L.size();
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        long double sum = b[i];
        for (int j = 0; j < i; ++j) sum -= (long double)L[i][j] * y[j];
        y[i] = (double)(sum / L[i][i]);
    }
    return y;
}

// Solve L^T x = y (back substitution)
vector<double> backSubstitutionLt(const vector<vector<double>>& L, const vector<double>& y) {
    int n = (int)L.size();
    vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        long double sum = y[i];
        for (int j = i + 1; j < n; ++j) sum -= (long double)L[j][i] * x[j];
        x[i] = (double)(sum / L[i][i]);
    }
    return x;
}

int main() {
    // SPD system Ax = b
    vector<vector<double>> A = {
        {25.0, 15.0, -5.0},
        {15.0, 18.0,  0.0},
        {-5.0,  0.0, 11.0}
    };
    vector<double> b = {35.0, 33.0, 6.0};

    vector<vector<double>> L;
    if (!choleskyDecompose(A, L)) {
        cerr << "Matrix is not PD; cannot use Cholesky.\n";
        return 0;
    }

    // Solve A x = b via L L^T x = b
    vector<double> y = forwardSubstitution(L, b);
    vector<double> x = backSubstitutionLt(L, y);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Solution x:\n";
    for (double xi : x) cout << xi << " ";
    cout << "\n";

    // Residual check
    int n = (int)A.size();
    vector<double> r(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            r[i] += A[i][j] * x[j];
    for (int i = 0; i < n; ++i) r[i] -= b[i];

    double rnorm = 0.0;
    for (double ri : r) rnorm = max(rnorm, fabs(ri));
    cout << "Max absolute residual: " << rnorm << "\n";

    return 0;
}
Factors A = L L^T, then solves L y = b and L^T x = y to obtain x. This is the standard SPD solve pipeline and is both efficient and numerically stable compared to generic methods.
#include <bits/stdc++.h>
using namespace std;

// Multiply A = B^T * B for B of size m x n (rows m, cols n)
vector<vector<double>> gramMatrix(const vector<vector<double>>& B) {
    int m = (int)B.size();
    int n = (int)B[0].size();
    vector<vector<double>> A(n, vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            long double sum = 0.0L;
            for (int k = 0; k < m; ++k) sum += (long double)B[k][i] * B[k][j];
            A[i][j] = A[j][i] = (double)sum;
        }
    }
    return A;
}

int main() {
    // Build B (m x n) with m >= n, then A = B^T B is PSD; PD if B has full column rank
    int m = 5, n = 3;
    vector<vector<double>> B(m, vector<double>(n));
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 1.0);
    for (int i = 0; i < m; ++i) for (int j = 0; j < n; ++j) B[i][j] = N(rng);

    vector<vector<double>> A = gramMatrix(B);

    // Empirical check: sample random x and confirm x^T A x >= 0
    bool ok = true;
    for (int t = 0; t < 1000; ++t) {
        vector<double> x(n);
        for (int i = 0; i < n; ++i) x[i] = N(rng);
        // Compute q = x^T A x
        long double q = 0.0L;
        for (int i = 0; i < n; ++i) {
            long double row = 0.0L;
            for (int j = 0; j < n; ++j) row += (long double)A[i][j] * x[j];
            q += row * x[i];
        }
        if (q < -1e-10L) { ok = false; break; }
    }

    cout << (ok ? "A appears PSD by sampling (x^T A x >= 0)."
                : "Potential issue: negative quadratic form found.") << "\n";

    // If B has full column rank, A is PD. Add small ridge to enforce PD if needed.
    for (int i = 0; i < n; ++i) A[i][i] += 1e-8; // regularization (optional)

    return 0;
}
Constructs a Gram matrix A = B^T B, which is always PSD by theory. It then empirically verifies nonnegativity of the quadratic form by random sampling. Adding a small diagonal “ridge” can enforce strict PD in practice.