Pseudoinverse (Moore-Penrose)
Key Points
- The Moore–Penrose pseudoinverse generalizes matrix inversion to rectangular or singular matrices and is denoted A⁺.
- For full column rank tall matrices (m ≥ n), the pseudoinverse is A⁺ = (AᵀA)^{-1}Aᵀ and solves least-squares problems.
- Geometrically, A⁺ maps a vector to the best-fit solution and projects data onto A’s column space.
- Numerically, forming AᵀA can amplify conditioning; QR or SVD-based methods are usually more stable.
- The pseudoinverse yields the minimum-norm solution among all least-squares solutions.
- You can compute A⁺b without forming A⁺ explicitly by solving normal equations or via QR: R x = Qᵀ b.
- When A is rank-deficient, use SVD with thresholding or Tikhonov regularization instead of (AᵀA)^{-1}.
- Time to compute A⁺ for m × n full-rank tall A is roughly O(m n²); memory is O(m n + n²).
Prerequisites
- Matrix multiplication and transpose — Understanding AᵀA and Aᵀb requires comfort with basic matrix operations.
- Linear independence and rank — Full column rank is essential for using A⁺ = (AᵀA)^{-1}Aᵀ and guarantees invertibility of AᵀA.
- Least squares and projections — Pseudoinverse solves least-squares problems and represents orthogonal projections.
- Cholesky and triangular solves — Efficiently applying (AᵀA)^{-1} without forming an explicit inverse uses these techniques.
- QR and SVD basics — Stable computation of least-squares solutions uses QR or SVD instead of normal equations.
Detailed Explanation
Overview
The Moore–Penrose pseudoinverse extends the idea of a matrix inverse to cases where the matrix is not square or not invertible. For a tall matrix A with more rows than columns (m ≥ n) and full column rank, the pseudoinverse A⁺ acts like a “best possible” inverse: it provides the least-squares solution to Ax ≈ b and projects vectors onto the column space of A. Concretely, when A has full column rank, A⁺ = (AᵀA)^{-1}Aᵀ. This formula appears naturally from the normal equations of least squares.

The pseudoinverse is unique and satisfies four defining equations (the Moore–Penrose conditions) that make it the canonical choice among generalized inverses. In data fitting, signal processing, control, and robotics, A⁺ is used to recover parameters from noisy measurements, compute minimum-norm solutions, and project onto subspaces.

Practically, while A⁺ can be formed explicitly using the formula above, it is often better numerically to avoid explicit inversion and instead solve linear systems using QR or SVD factorizations. This improves stability, especially when the columns of A are nearly linearly dependent. Understanding the pseudoinverse helps unify linear regression, projections, and constrained optimization under a single mathematical tool.
Intuition & Analogies
Imagine A as a machine that takes an input vector x (parameters) and produces an output Ax (measurements). If A were a perfect square and invertible machine, you could reverse it exactly with A^{-1}. But in real life, we often have more measurements than parameters (tall matrices), and they may be noisy or inconsistent. There is no exact reverse button because many measurement vectors b are not exactly producible by any x. The pseudoinverse A⁺ is the next best thing: it is the “undo as well as possible” button. It takes b and returns the x that makes Ax as close as possible to b in the least-squares sense (minimizes the sum of squared errors).

Geometrically, think of A’s columns spanning some plane or subspace inside a higher-dimensional space. Given b, you first drop a perpendicular from b onto that plane to find the closest point p (this is the projection onto the column space). Then you find the coordinates x of p in the basis given by A’s columns. The composition of “project then express coordinates” is exactly what A⁺ does. If the machine A slightly squashes or stretches space (some directions stronger than others), A⁺ unsquashes appropriately, but it refuses to chase the parts of b that lie outside the subspace (that would require guessing information that A never produced).

When the columns of A are nearly redundant (almost in the same direction), reversing those squashes becomes numerically risky—small measurement noise can blow up. That’s why stable methods (QR, SVD) are often preferred over directly inverting AᵀA.
Formal Definition
For A ∈ ℝ^{m×n}, the Moore–Penrose pseudoinverse A⁺ ∈ ℝ^{n×m} is the unique matrix X satisfying the four Moore–Penrose conditions:
1. AXA = A
2. XAX = X
3. (AX)ᵀ = AX
4. (XA)ᵀ = XA
Every matrix has exactly one such X. When A has full column rank, X = (AᵀA)^{-1}Aᵀ satisfies all four conditions; in general, if A = UΣVᵀ is an SVD, then A⁺ = VΣ⁺Uᵀ, where Σ⁺ transposes Σ and inverts its nonzero singular values.
When to Use
Use the pseudoinverse when you need to solve Ax ≈ b in a least-squares sense, especially for tall full-column-rank matrices. Typical applications include linear regression (fitting models to data), parameter estimation in system identification, and signal reconstruction from noisy measurements. In geometry and graphics, A⁺ is used to compute best-fit transformations and projections. In control and robotics, it appears in inverse kinematics to compute joint configurations that achieve a desired end-effector motion while minimizing joint effort (minimum-norm solutions). In statistics and machine learning, it provides closed-form solutions for ordinary least squares and ridge regression variants. Use the closed-form A⁺ = (AᵀA)^{-1}Aᵀ when A has full column rank and when the problem is reasonably well conditioned; otherwise, prefer numerically stable algorithms such as QR or SVD to compute x = A⁺b without explicitly forming A⁺. For rank-deficient or nearly singular problems, use SVD with small singular value thresholding or Tikhonov regularization (ridge) to control noise amplification.
⚠️ Common Mistakes
- Blindly forming (AᵀA)^{-1} when A is rank-deficient or nearly so. AᵀA then becomes singular or ill-conditioned, leading to massive numerical errors. Use QR/SVD or add regularization.
- Computing a literal matrix inverse instead of solving triangular systems. Inverse formation is slower and less stable. Use Cholesky/QR solves to apply (AᵀA)^{-1}Aᵀ.
- Ignoring scaling and conditioning. Poorly scaled columns can inflate the condition number, making solutions sensitive to noise. Normalize features or use QR/SVD.
- Dimension mismatches. For m × n tall A, A⁺ must be n × m; AA⁺ is m × m (a projector), while A⁺A is n × n.
- Assuming A⁺A equals the identity for any A. In general A⁺A only projects onto A’s row space, and AA⁺ acts as the identity only on A’s column space (for tall full-rank A, A⁺A = Iₙ but AA⁺ ≠ Iₘ).
- Using float instead of double. Accumulating dot products in low precision increases round-off error. Prefer double and use tolerances when checking identities.
- Forgetting that normal equations square the condition number. Forming AᵀA can worsen numerical stability by roughly κ(A)²; prefer QR/SVD for ill-conditioned problems.
Key Formulas
Full-column-rank pseudoinverse
A⁺ = (AᵀA)^{-1}Aᵀ
Explanation: If A has full column rank (m ≥ n, rank n), AᵀA is invertible and this closed form holds. It is the classic formula used to compute least-squares solutions via normal equations.
Least-squares solution
x = A⁺b, which minimizes ‖Ax − b‖₂ and has the smallest ‖x‖₂ among all minimizers
Explanation: The pseudoinverse maps b to the minimum‑residual, minimum‑norm solution. This connects the geometric projection view with an explicit algebraic solution.
Normal equations
AᵀA x = Aᵀb
Explanation: Stationary condition for minimizing the squared 2‑norm of residuals. Solving this SPD system yields the least-squares solution when A has full column rank.
SVD pseudoinverse
A = UΣVᵀ  ⇒  A⁺ = VΣ⁺Uᵀ
Explanation: By inverting nonzero singular values and swapping U and V, we obtain A⁺ that satisfies the Moore–Penrose conditions even when A is rectangular or rank deficient.
Spectral condition number
κ(A) = σ_max(A) / σ_min(A), with κ(AᵀA) = κ(A)²
Explanation: The ratio of the largest to smallest singular values measures sensitivity to perturbations. Squaring occurs when forming AᵀA, worsening numerical stability.
Orthogonal projectors
P = AA⁺,  Q = A⁺A
Explanation: P projects onto the column space of A; Q projects onto the row space. They are idempotent (P² = P) and symmetric for real matrices.
Symmetry of projectors
(AA⁺)ᵀ = AA⁺,  (A⁺A)ᵀ = A⁺A
Explanation: These are two of the Moore–Penrose conditions enforcing that the projectors are orthogonal (symmetric) projections.
Tikhonov-regularized pseudoinverse
A⁺_λ = (AᵀA + λI)^{-1}Aᵀ
Explanation: A damped inverse used when AᵀA is ill-conditioned. The parameter λ > 0 reduces noise amplification at the cost of bias.
Complexity Analysis
For a tall full-column-rank A (m × n, m ≥ n): forming the Gram matrix AᵀA costs about m n² flops, its Cholesky factorization about n³/3, and the triangular solves that apply (AᵀA)^{-1} to Aᵀ cost O(m n²), so the normal-equations route is roughly O(m n²) time with O(m n + n²) memory. Householder QR costs about 2 m n² − 2n³/3 flops and a thin SVD somewhat more; both are slower by a constant factor but avoid squaring the condition number.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utility functions for small dense matrices using vector<vector<double>>
using Matrix = vector<vector<double>>;

Matrix zeros(int r, int c) { return Matrix(r, vector<double>(c, 0.0)); }
Matrix identity(int n) { Matrix I = zeros(n, n); for (int i=0;i<n;++i) I[i][i]=1.0; return I; }

Matrix transpose(const Matrix &A){
    int m = (int)A.size(), n = (int)A[0].size();
    Matrix AT = zeros(n, m);
    for(int i=0;i<m;++i) for(int j=0;j<n;++j) AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B){
    int m = (int)A.size(), k = (int)A[0].size();
    int n = (int)B[0].size();
    Matrix C = zeros(m, n);
    for(int i=0;i<m;++i){
        for(int t=0;t<k;++t){
            double a = A[i][t];
            if (a==0.0) continue;
            for(int j=0;j<n;++j){
                C[i][j] += a * B[t][j];
            }
        }
    }
    return C;
}

// Cholesky factorization for SPD matrix G: returns lower triangular L such that G = L L^T
Matrix cholesky(const Matrix &G){
    int n = (int)G.size();
    Matrix L = zeros(n, n);
    for(int i=0;i<n;++i){
        for(int j=0;j<=i;++j){
            double sum = G[i][j];
            for(int k=0;k<j;++k) sum -= L[i][k]*L[j][k];
            if(i==j){
                if(sum <= 0) throw runtime_error("Matrix not SPD (Cholesky failed)");
                L[i][i] = sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return L;
}

// Solve L Y = B for Y, where L is lower triangular; B can have multiple RHS (columns)
Matrix solveLower(const Matrix &L, const Matrix &B){
    int n = (int)L.size();
    int m = (int)B[0].size();
    Matrix Y = zeros(n, m);
    for(int col=0; col<m; ++col){
        for(int i=0;i<n;++i){
            double sum = B[i][col];
            for(int k=0;k<i;++k) sum -= L[i][k]*Y[k][col];
            Y[i][col] = sum / L[i][i];
        }
    }
    return Y;
}

// Solve L^T X = Y for X (upper triangular solve)
Matrix solveUpperFromLowerT(const Matrix &L, const Matrix &Y){
    int n = (int)L.size();
    int m = (int)Y[0].size();
    Matrix X = zeros(n, m);
    for(int col=0; col<m; ++col){
        for(int i=n-1;i>=0;--i){
            double sum = Y[i][col];
            for(int k=i+1;k<n;++k) sum -= L[k][i]*X[k][col]; // using L^T entries
            X[i][col] = sum / L[i][i];
        }
    }
    return X;
}

Matrix pseudoinverse_full_column_rank(const Matrix &A){
    // A: m x n with full column rank, m >= n
    int m = (int)A.size();
    int n = (int)A[0].size();
    Matrix AT = transpose(A);              // n x m
    Matrix G = matmul(AT, A);              // n x n (Gram matrix)
    Matrix L = cholesky(G);                // n x n lower triangular
    // Compute A^+ = (A^T A)^{-1} A^T by solving twice: X = G^{-1} AT = L^{-T} L^{-1} AT
    Matrix Y = solveLower(L, AT);          // solve L Y = A^T
    Matrix X = solveUpperFromLowerT(L, Y); // solve L^T X = Y -> X = A^+
    return X;                              // X is n x m
}

void printMatrix(const Matrix &A, const string &name){
    cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    cout.setf(ios::fixed); cout<<setprecision(6);
    for(const auto &row: A){
        for(double v: row) cout << setw(12) << v << ' ';
        cout << '\n';
    }
}

int main(){
    // Example: A is 4x2 with full column rank (tall matrix)
    Matrix A = {
        {1.0, 1.0},
        {1.0, 2.0},
        {1.0, 3.0},
        {1.0, 4.0}
    }; // columns: [1,1,1,1]^T and [1,2,3,4]^T

    Matrix Aplus = pseudoinverse_full_column_rank(A);

    // Verify properties approximately: A^+ A ≈ I_2 and A A^+ A ≈ A
    Matrix ApA = matmul(Aplus, A);  // should be 2x2 identity
    Matrix AAp = matmul(A, Aplus);  // projector 4x4
    Matrix AApA = matmul(AAp, A);   // should recover A

    printMatrix(Aplus, "A^+");
    printMatrix(ApA, "A^+ A");
    printMatrix(AApA, "A A^+ A");

    return 0;
}
This program computes the pseudoinverse A⁺ of a tall full-column-rank matrix using the identity A⁺ = (AᵀA)^{-1}Aᵀ. It avoids forming an explicit inverse by using Cholesky factorization of G = AᵀA and two triangular solves, which is both faster and more stable than inverting G directly. The demo prints A⁺, checks that A⁺A ≈ I (on Rⁿ), and verifies AA⁺A ≈ A.
#include <bits/stdc++.h>
using namespace std;
using Matrix = vector<vector<double>>;
using Vec = vector<double>;

Matrix zeros(int r, int c){ return Matrix(r, vector<double>(c, 0.0)); }
Matrix transpose(const Matrix &A){ int m=A.size(), n=A[0].size(); Matrix AT=zeros(n,m); for(int i=0;i<m;++i) for(int j=0;j<n;++j) AT[j][i]=A[i][j]; return AT; }

// Modified Gram-Schmidt QR for m x n with m >= n
void qr_mgs(const Matrix &A, Matrix &Q, Matrix &R){
    int m = (int)A.size(), n = (int)A[0].size();
    Q = zeros(m, n); R = zeros(n, n);
    vector<vector<double>> V = A; // copy columns to orthogonalize
    for(int j=0;j<n;++j){
        // Compute norm of current vector
        double norm = 0.0;
        for(int i=0;i<m;++i) norm += V[i][j]*V[i][j];
        R[j][j] = sqrt(max(0.0, norm));
        if(R[j][j] == 0.0) throw runtime_error("Rank deficiency detected in QR");
        for(int i=0;i<m;++i) Q[i][j] = V[i][j] / R[j][j];
        // Orthogonalize remaining columns against Q[:,j]
        for(int k=j+1;k<n;++k){
            double r = 0.0;
            for(int i=0;i<m;++i) r += Q[i][j] * V[i][k];
            R[j][k] = r;
            for(int i=0;i<m;++i) V[i][k] -= Q[i][j] * r;
        }
    }
}

// Solve upper triangular R x = y (n x n)
Vec back_substitute(const Matrix &R, const Vec &y){
    int n = (int)R.size();
    Vec x(n, 0.0);
    for(int i=n-1;i>=0;--i){
        double sum = y[i];
        for(int k=i+1;k<n;++k) sum -= R[i][k]*x[k];
        x[i] = sum / R[i][i];
    }
    return x;
}

// Compute x = argmin ||Ax - b||_2 using QR: R x = Q^T b
Vec least_squares_qr(const Matrix &A, const Vec &b){
    int m = (int)A.size(), n = (int)A[0].size();
    if((int)b.size()!=m) throw runtime_error("Dimension mismatch");
    Matrix Q, R; qr_mgs(A, Q, R);
    // y = Q^T b
    Vec y(n, 0.0);
    for(int j=0;j<n;++j){
        double dot = 0.0;
        for(int i=0;i<m;++i) dot += Q[i][j] * b[i];
        y[j] = dot;
    }
    // Solve R x = y
    return back_substitute(R, y);
}

int main(){
    // Fit a line y = c0 + c1 * x to noisy data via least squares
    vector<double> xs = {0,1,2,3,4,5};
    vector<double> ys = {1.1, 2.0, 2.9, 4.2, 5.1, 5.9};
    int m = (int)xs.size();
    // Build A with columns [1, x]
    Matrix A(m, vector<double>(2));
    for(int i=0;i<m;++i){ A[i][0]=1.0; A[i][1]=xs[i]; }
    // b is observations
    vector<double> b = ys;

    // Solve x = [c0, c1]
    vector<double> coeff = least_squares_qr(A, b);

    cout.setf(ios::fixed); cout<<setprecision(6);
    cout << "c0 = " << coeff[0] << ", c1 = " << coeff[1] << "\n";

    // Predict and print residual norm
    double rss = 0.0;
    for(int i=0;i<m;++i){
        double pred = coeff[0] + coeff[1]*xs[i];
        double r = pred - ys[i];
        rss += r*r;
    }
    cout << "Residual sum of squares = " << rss << "\n";
    return 0;
}
This code solves a least-squares problem using QR decomposition via Modified Gram–Schmidt, computing x such that Ax ≈ b without ever forming A⁺ explicitly. It is numerically more stable than normal equations because it avoids forming AᵀA and reduces sensitivity to conditioning. The example fits a straight line to noisy data.