Math · Intermediate

Pseudoinverse (Moore-Penrose)

Key Points

  • The Moore–Penrose pseudoinverse generalizes matrix inversion to rectangular or singular matrices and is denoted A⁺.
  • For tall matrices with full column rank (m ≥ n), the pseudoinverse is A⁺ = (AᵀA)⁻¹Aᵀ and solves least-squares problems.
  • Geometrically, A⁺ maps a vector to the best-fit solution and projects data onto A’s column space.
  • Numerically, forming AᵀA squares the condition number; QR- or SVD-based methods are usually more stable.
  • The pseudoinverse yields the minimum-norm solution among all least-squares solutions.
  • You can compute A⁺b without forming A⁺ explicitly, by solving the normal equations or via QR: Rx = Qᵀb.
  • When A is rank-deficient, use the SVD with singular-value thresholding or Tikhonov regularization instead of (AᵀA)⁻¹.
  • Time to compute A⁺ for an m × n full-rank tall A is roughly O(mn²); memory is O(mn + n²).

Prerequisites

  • Matrix multiplication and transpose — Understanding AᵀA and Aᵀb requires comfort with basic matrix operations.
  • Linear independence and rank — Full column rank is essential for using A⁺ = (AᵀA)⁻¹Aᵀ and guarantees invertibility of AᵀA.
  • Least squares and projections — The pseudoinverse solves least-squares problems and represents orthogonal projections.
  • Cholesky and triangular solves — Efficiently applying (AᵀA)⁻¹ without forming an explicit inverse uses these techniques.
  • QR and SVD basics — Stable computation of least-squares solutions uses QR or SVD instead of the normal equations.

Detailed Explanation


01 Overview

The Moore–Penrose pseudoinverse extends the idea of a matrix inverse to cases where the matrix is not square or not invertible. For a tall matrix A with more rows than columns (m ≥ n) and full column rank, the pseudoinverse A⁺ acts like a “best possible” inverse: it provides the least-squares solution to Ax ≈ b and projects vectors onto the column space of A. Concretely, when A has full column rank, A⁺ = (AᵀA)^{-1}Aᵀ. This formula appears naturally from the normal equations of least squares. The pseudoinverse is unique and satisfies four defining equations (the Moore–Penrose conditions) that make it the canonical choice among generalized inverses. In data fitting, signal processing, control, and robotics, A⁺ is used to recover parameters from noisy measurements, compute minimum-norm solutions, and project onto subspaces. Practically, while A⁺ can be formed explicitly using the formula above, it is often better numerically to avoid explicit inversion and instead solve linear systems using QR or SVD factorizations. This improves stability, especially when the columns of A are nearly linearly dependent. Understanding the pseudoinverse helps unify linear regression, projections, and constrained optimization under a single mathematical tool.

02 Intuition & Analogies

Imagine A as a machine that takes an input vector x (parameters) and produces an output Ax (measurements). If A were a square, invertible machine, you could reverse it exactly with A⁻¹. But in real life, we often have more measurements than parameters (tall matrices), and they may be noisy or inconsistent. There is no exact reverse button because many measurement vectors b are not exactly producible by any x. The pseudoinverse A⁺ is the next best thing: it is the “undo as well as possible” button. It takes b and returns the x that makes Ax as close as possible to b in the least-squares sense (minimizes the sum of squared errors). Geometrically, think of A’s columns spanning some plane or subspace inside a higher-dimensional space. Given b, you first drop a perpendicular from b onto that plane to find the closest point p (this is the projection onto the column space). Then you find the coordinates x of p in the basis given by A’s columns. The composition of “project then express coordinates” is exactly what A⁺ does. If the machine A slightly squashes or stretches space (some directions stronger than others), A⁺ unsquashes appropriately, but it refuses to chase the parts of b that lie outside the subspace (that would require guessing information that A never produced). When the columns of A are nearly redundant (almost in the same direction), reversing those squashes becomes numerically risky—small measurement noise can blow up. That’s why stable methods (QR, SVD) are often preferred over directly inverting AᵀA.

03 Formal Definition

For any real matrix A ∈ ℝ^{m×n}, the Moore–Penrose pseudoinverse A⁺ ∈ ℝ^{n×m} is the unique matrix satisfying the four Penrose conditions: (1) AA⁺A = A, (2) A⁺AA⁺ = A⁺, (3) (AA⁺)ᵀ = AA⁺, and (4) (A⁺A)ᵀ = A⁺A. These conditions enforce that A⁺ behaves like a true inverse on A’s range and gives orthogonal projections onto the column and row spaces. When A has full column rank (rank(A) = n, m ≥ n), AᵀA is symmetric positive definite and invertible, and A⁺ can be written as A⁺ = (AᵀA)⁻¹Aᵀ. In general, using the singular value decomposition (SVD) A = UΣVᵀ, with Σ containing singular values σ₁ ≥ … ≥ σᵣ > 0, the pseudoinverse is A⁺ = VΣ⁺Uᵀ, where Σ⁺ replaces each nonzero σᵢ by 1/σᵢ and transposes the rectangular shape. For overdetermined systems, x⋆ = A⁺b is the minimum-norm least-squares solution to min_x ‖Ax − b‖₂. The orthogonal projectors AA⁺ and A⁺A project onto col(A) and row(A), respectively.

04 When to Use

Use the pseudoinverse when you need to solve Ax ≈ b in a least-squares sense, especially for tall full-column-rank matrices. Typical applications include linear regression (fitting models to data), parameter estimation in system identification, and signal reconstruction from noisy measurements. In geometry and graphics, A⁺ is used to compute best-fit transformations and projections. In control and robotics, it appears in inverse kinematics to compute joint configurations that achieve a desired end-effector motion while minimizing joint effort (minimum-norm solutions). In statistics and machine learning, it provides closed-form solutions for ordinary least squares and ridge regression variants. Use the closed-form A⁺ = (AᵀA)^{-1}Aᵀ when A has full column rank and when the problem is reasonably well conditioned; otherwise, prefer numerically stable algorithms such as QR or SVD to compute x = A⁺b without explicitly forming A⁺. For rank-deficient or nearly singular problems, use SVD with small singular value thresholding or Tikhonov regularization (ridge) to control noise amplification.

⚠️Common Mistakes

  • Blindly forming (AᵀA)⁻¹ when A is rank-deficient or nearly so. AᵀA then becomes singular or ill-conditioned, leading to massive numerical errors. Use QR/SVD or add regularization.
  • Computing a literal matrix inverse instead of solving triangular systems. Inverse formation is slower and less stable. Use Cholesky/QR solves to apply (AᵀA)⁻¹Aᵀ.
  • Ignoring scaling and conditioning. Poorly scaled columns can inflate the condition number, making solutions sensitive to noise. Normalize features or use QR/SVD.
  • Dimension mismatches. For m × n tall A, A⁺ must be n × m; AA⁺ is m × m (a projector), while A⁺A is n × n.
  • Assuming A⁺A equals the identity for any A. It acts as the identity only on A’s row space (for tall full-rank A, A⁺A = Iₙ but AA⁺ ≠ Iₘ).
  • Using float instead of double. Accumulating dot products in low precision increases round-off error. Prefer double and use tolerances when checking identities.
  • Forgetting that normal equations square the condition number. Forming AᵀA can worsen numerical stability by roughly κ(A)²; prefer QR/SVD for ill-conditioned problems.

Key Formulas

Full-column-rank pseudoinverse

A⁺ = (AᵀA)⁻¹Aᵀ

Explanation: If A has full column rank (m≥n, rank n), AᵀA is invertible and this closed form holds. It is the classic formula used to compute least-squares solutions via normal equations.

Least-squares solution

x⋆ = argmin_x ‖Ax − b‖₂ = A⁺b

Explanation: The pseudoinverse maps b to the minimum‑residual, minimum‑norm solution. This connects the geometric projection view with an explicit algebraic solution.

Normal equations

AᵀAx = Aᵀb

Explanation: Stationary condition for minimizing the squared 2‑norm of residuals. Solving this SPD system yields the least-squares solution when A has full column rank.

SVD pseudoinverse

A = UΣVᵀ,  A⁺ = VΣ⁺Uᵀ

Explanation: By inverting nonzero singular values and swapping U and V, we obtain A⁺ that satisfies the Moore–Penrose conditions even when A is rectangular or rank deficient.

Spectral condition number

κ₂(A) = σmax(A) / σmin(A)

Explanation: The ratio of the largest to smallest singular values measures sensitivity to perturbations. Squaring occurs when forming AᵀA, worsening numerical stability.

Orthogonal projectors

P = AA⁺,  Q = A⁺A

Explanation: P projects onto the column space of A; Q projects onto the row space. They are idempotent (P² = P) and symmetric for real matrices.

Symmetry of projectors

(AA⁺)ᵀ = AA⁺,  (A⁺A)ᵀ = A⁺A

Explanation: These are two of the Moore–Penrose conditions enforcing that the projectors are orthogonal (symmetric) projections.

Tikhonov-regularized pseudoinverse

(AᵀA + λI)⁻¹Aᵀ

Explanation: A damped inverse used when AᵀA is ill-conditioned. The parameter λ > 0 reduces noise amplification at the cost of bias.

Complexity Analysis

Let A be m × n with m ≥ n and full column rank. Forming the Gram matrix G = AᵀA costs O(mn²) time: there are n(n+1)/2 unique entries, each a dot product of length m. Computing a Cholesky factorization G = LLᵀ requires O(n³) time and O(n²) memory. To build the explicit pseudoinverse A⁺ = (AᵀA)⁻¹Aᵀ without forming the inverse, we solve two triangular systems with multiple right-hand sides: solve LY = Aᵀ (n × m) and then LᵀX = Y to obtain X = A⁺. Each triangular solve with m right-hand sides costs O(n²m). The overall time is O(mn² + n³ + n²m) = O(mn² + n³); for tall problems (m ≫ n), the O(mn²) term dominates. Memory usage is O(mn) to store A plus O(n²) for G and O(nm) for A⁺. If you only need x = A⁺b, you can avoid forming A⁺ entirely by solving Gx = Aᵀb using Cholesky in O(mn + n²) per right-hand side, saving O(nm) memory. Using QR decomposition (Householder) to solve least squares has time O(mn²) and avoids squaring the condition number, making it preferable for ill-conditioned A. In contrast, explicitly inverting G (forming G⁻¹) costs O(n³) and increases numerical error; it is better to solve via triangular systems.

Code Examples

Compute A⁺ for tall full-column-rank A using Cholesky without explicit inverse
#include <bits/stdc++.h>
using namespace std;

// Utility functions for small dense matrices using vector<vector<double>>
using Matrix = vector<vector<double>>;

Matrix zeros(int r, int c) { return Matrix(r, vector<double>(c, 0.0)); }
Matrix identity(int n) { Matrix I = zeros(n, n); for (int i = 0; i < n; ++i) I[i][i] = 1.0; return I; }

Matrix transpose(const Matrix &A) {
    int m = (int)A.size(), n = (int)A[0].size();
    Matrix AT = zeros(n, m);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) AT[j][i] = A[i][j];
    return AT;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    int m = (int)A.size(), k = (int)A[0].size();
    int n = (int)B[0].size();
    Matrix C = zeros(m, n);
    for (int i = 0; i < m; ++i) {
        for (int t = 0; t < k; ++t) {
            double a = A[i][t];
            if (a == 0.0) continue;
            for (int j = 0; j < n; ++j) {
                C[i][j] += a * B[t][j];
            }
        }
    }
    return C;
}

// Cholesky factorization for SPD matrix G: returns lower triangular L such that G = L L^T
Matrix cholesky(const Matrix &G) {
    int n = (int)G.size();
    Matrix L = zeros(n, n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = G[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0) throw runtime_error("Matrix not SPD (Cholesky failed)");
                L[i][i] = sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return L;
}

// Solve L Y = B for Y, where L is lower triangular; B can have multiple RHS (columns)
Matrix solveLower(const Matrix &L, const Matrix &B) {
    int n = (int)L.size();
    int m = (int)B[0].size();
    Matrix Y = zeros(n, m);
    for (int col = 0; col < m; ++col) {
        for (int i = 0; i < n; ++i) {
            double sum = B[i][col];
            for (int k = 0; k < i; ++k) sum -= L[i][k] * Y[k][col];
            Y[i][col] = sum / L[i][i];
        }
    }
    return Y;
}

// Solve L^T X = Y for X (upper triangular solve)
Matrix solveUpperFromLowerT(const Matrix &L, const Matrix &Y) {
    int n = (int)L.size();
    int m = (int)Y[0].size();
    Matrix X = zeros(n, m);
    for (int col = 0; col < m; ++col) {
        for (int i = n - 1; i >= 0; --i) {
            double sum = Y[i][col];
            for (int k = i + 1; k < n; ++k) sum -= L[k][i] * X[k][col]; // using L^T entries
            X[i][col] = sum / L[i][i];
        }
    }
    return X;
}

Matrix pseudoinverse_full_column_rank(const Matrix &A) {
    // A: m x n with full column rank, m >= n
    Matrix AT = transpose(A);              // n x m
    Matrix G = matmul(AT, A);              // n x n (Gram matrix)
    Matrix L = cholesky(G);                // n x n lower triangular
    // Compute A^+ = (A^T A)^{-1} A^T by solving twice: X = G^{-1} A^T = L^{-T} L^{-1} A^T
    Matrix Y = solveLower(L, AT);          // solve L Y = A^T
    Matrix X = solveUpperFromLowerT(L, Y); // solve L^T X = Y -> X = A^+
    return X;                              // X is n x m
}

void printMatrix(const Matrix &A, const string &name) {
    cout << name << " (" << A.size() << "x" << A[0].size() << ")\n";
    cout.setf(ios::fixed); cout << setprecision(6);
    for (const auto &row : A) {
        for (double v : row) cout << setw(12) << v << ' ';
        cout << '\n';
    }
}

int main() {
    // Example: A is 4x2 with full column rank (tall matrix)
    Matrix A = {
        {1.0, 1.0},
        {1.0, 2.0},
        {1.0, 3.0},
        {1.0, 4.0}
    }; // columns: [1,1,1,1]^T and [1,2,3,4]^T

    Matrix Aplus = pseudoinverse_full_column_rank(A);

    // Verify properties approximately: A^+ A ≈ I_2 and A A^+ A ≈ A
    Matrix ApA = matmul(Aplus, A);  // should be 2x2 identity
    Matrix AAp = matmul(A, Aplus);  // projector 4x4
    Matrix AApA = matmul(AAp, A);   // should recover A

    printMatrix(Aplus, "A^+");
    printMatrix(ApA, "A^+ A");
    printMatrix(AApA, "A A^+ A");

    return 0;
}

This program computes the pseudoinverse A⁺ of a tall full-column-rank matrix using the identity A⁺ = (AᵀA)^{-1}Aᵀ. It avoids forming an explicit inverse by using Cholesky factorization of G = AᵀA and two triangular solves, which is both faster and more stable than inverting G directly. The demo prints A⁺, checks that A⁺A ≈ I (on Rⁿ), and verifies AA⁺A ≈ A.

Time: O(mn² + n³) to form AᵀA, factor it, and solve two triangular systems with m right-hand sides.
Space: O(mn + n²) to store A, AᵀA, and intermediate factors, plus O(nm) for A⁺.

Solve least squares x = A⁺b using QR (Modified Gram–Schmidt) without forming A⁺
#include <bits/stdc++.h>
using namespace std;
using Matrix = vector<vector<double>>;
using Vec = vector<double>;

Matrix zeros(int r, int c) { return Matrix(r, vector<double>(c, 0.0)); }

Matrix transpose(const Matrix &A) {
    int m = (int)A.size(), n = (int)A[0].size();
    Matrix AT = zeros(n, m);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) AT[j][i] = A[i][j];
    return AT;
}

// Modified Gram-Schmidt QR for m x n with m >= n
void qr_mgs(const Matrix &A, Matrix &Q, Matrix &R) {
    int m = (int)A.size(), n = (int)A[0].size();
    Q = zeros(m, n); R = zeros(n, n);
    Matrix V = A; // copy columns to orthogonalize
    for (int j = 0; j < n; ++j) {
        // Compute norm of current column
        double norm = 0.0;
        for (int i = 0; i < m; ++i) norm += V[i][j] * V[i][j];
        R[j][j] = sqrt(max(0.0, norm));
        if (R[j][j] == 0.0) throw runtime_error("Rank deficiency detected in QR");
        for (int i = 0; i < m; ++i) Q[i][j] = V[i][j] / R[j][j];
        // Orthogonalize remaining columns against Q[:,j]
        for (int k = j + 1; k < n; ++k) {
            double r = 0.0;
            for (int i = 0; i < m; ++i) r += Q[i][j] * V[i][k];
            R[j][k] = r;
            for (int i = 0; i < m; ++i) V[i][k] -= Q[i][j] * r;
        }
    }
}

// Solve upper triangular R x = y (n x n)
Vec back_substitute(const Matrix &R, const Vec &y) {
    int n = (int)R.size();
    Vec x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= R[i][k] * x[k];
        x[i] = sum / R[i][i];
    }
    return x;
}

// Compute x = argmin ||Ax - b||_2 using QR: R x = Q^T b
Vec least_squares_qr(const Matrix &A, const Vec &b) {
    int m = (int)A.size(), n = (int)A[0].size();
    if ((int)b.size() != m) throw runtime_error("Dimension mismatch");
    Matrix Q, R; qr_mgs(A, Q, R);
    // y = Q^T b
    Vec y(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double dot = 0.0;
        for (int i = 0; i < m; ++i) dot += Q[i][j] * b[i];
        y[j] = dot;
    }
    // Solve R x = y
    return back_substitute(R, y);
}

int main() {
    // Fit a line y = c0 + c1 * x to noisy data via least squares
    vector<double> xs = {0, 1, 2, 3, 4, 5};
    vector<double> ys = {1.1, 2.0, 2.9, 4.2, 5.1, 5.9};
    int m = (int)xs.size();
    // Build A with columns [1, x]
    Matrix A(m, vector<double>(2));
    for (int i = 0; i < m; ++i) { A[i][0] = 1.0; A[i][1] = xs[i]; }
    // b is observations
    Vec b = ys;

    // Solve x = [c0, c1]
    Vec coeff = least_squares_qr(A, b);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "c0 = " << coeff[0] << ", c1 = " << coeff[1] << "\n";

    // Predict and print residual norm
    double rss = 0.0;
    for (int i = 0; i < m; ++i) {
        double pred = coeff[0] + coeff[1] * xs[i];
        double r = pred - ys[i];
        rss += r * r;
    }
    cout << "Residual sum of squares = " << rss << "\n";
    return 0;
}

This code solves a least-squares problem using QR decomposition via Modified Gram–Schmidt, computing x such that Ax ≈ b without ever forming A⁺ explicitly. It is numerically more stable than normal equations because it avoids forming AᵀA and reduces sensitivity to conditioning. The example fits a straight line to noisy data.

Time: O(mn²) for the QR factorization and O(mn) for forming Qᵀb; the O(mn²) term dominates.
Space: O(mn) to store Q plus O(n²) for R.
#pseudoinverse#moore-penrose#least squares#normal equations#qr decomposition#cholesky#svd#projection#full column rank#overdetermined#condition number#ridge regression#minimum norm#tall matrix#orthogonal projector