
Implicit Bias of Gradient Descent

Key Points

  • In underdetermined linear systems (more variables than equations), gradient descent started at zero converges to the minimum Euclidean norm solution without any explicit regularizer.
  • This happens because the gradient lies in the row space of A, so gradient descent never changes components in the null space of A.
  • The minimum-norm solution for Ax = b (full row rank and consistent) is x* = A^T(AA^T)^{-1}b, the Moore–Penrose pseudoinverse solution.
  • With a nonzero initialization that has a null-space component, gradient descent converges to the minimum-norm solution plus that unchanged null-space component.
  • A stable learning rate must satisfy 0 < η < 2/σ_max^2, where σ_max is the largest singular value of A.
  • For inconsistent systems, gradient descent converges to the minimum-norm least-squares solution, still given by the pseudoinverse.
  • In classification with separable data (logistic loss), gradient descent’s direction converges to the L2 maximum-margin separator—another example of implicit bias.
  • You can verify this phenomenon numerically in C++ by comparing gradient descent iterates to the closed-form pseudoinverse solution.

Prerequisites

  • Linear algebra: vector spaces, subspaces, orthogonality — Understanding row space, null space, and projections is essential for why GD preserves null-space components.
  • Least squares and normal equations — The objective f(x) = 1/2||Ax − b||^2 and its gradient structure underpin the GD updates.
  • SVD and pseudoinverse — Relates the closed-form minimum-norm solution x* = A^+ b and informs numerical stability considerations.
  • Gradient descent fundamentals — Knowing convergence conditions and step-size selection is critical for the implicit bias result.
  • Eigenvalues and singular values — The learning-rate bound 0 < η < 2/σ_max^2 and conditioning depend on spectral properties.

Detailed Explanation


01 Overview

Hook: Imagine you have infinitely many ways to tie a rope between two posts because the rope is much longer than the straight-line distance. Which tying pattern will you pick if no one tells you explicitly?

Concept: In underdetermined linear systems (more parameters than constraints), there are infinitely many solutions to Ax = b. Yet, when we run plain gradient descent on the squared loss starting from zero, it systematically converges to the one with the smallest Euclidean length (L2 norm). This behavior is called the implicit bias (or implicit regularization) of gradient descent.

Example: For a 2×4 matrix A with full row rank and a consistent b, the set of solutions forms a flat plane in 4D. Gradient descent started at the origin moves only in a special subspace (the row space) and thus ends up at the unique point on that plane closest to the origin—the minimum-norm solution x* = A^T(AA^T)^{-1}b. The punchline is that even without adding an explicit penalty like ||x||^2, the algorithm’s dynamics favor a simple solution.

02 Intuition & Analogies

Hook: Think of standing in a wide, flat valley with a gentle slope pointing east–west but perfectly flat north–south. If you roll a ball from the origin, gravity pulls it only along the east–west slope; it never moves north–south because there’s no slope in that direction.

Concept: For the least-squares objective f(x) = 1/2 ||Ax − b||^2, the gradient ∇f(x) = A^T(Ax − b) always lies in the row space of A. Directions orthogonal to this space—the null space of A—are perfectly flat: moving along them doesn’t change Ax and hence doesn’t change the loss.

Example: Start at x0 = 0. Every update x_{t+1} = x_t − ηA^T(Ax_t − b) is a step within the row space. Because you never pick up any null-space component, you end in the unique point where the plane of solutions intersects the row space, which is exactly the minimum-norm solution. If instead you start with a nonzero vector in the null space (like a sideways push along the flat direction), gradient descent can’t remove it because the gradient never points that way. You converge to the same minimum-norm point in the row space plus this unchanged null-space component. In this sense, the algorithm’s geometry—what directions the gradient can and cannot affect—implicitly biases the final answer, even without explicit regularization.

03 Formal Definition

Hook: The story becomes crisp with linear algebra.

Concept: Consider the convex quadratic f(x) = 1/2 ||Ax − b||^2 with A ∈ R^{m×n}. When m < n and A has full row rank, the system is underdetermined and, if consistent, has infinitely many solutions. The solution set is an affine space x = x_p + z with z ∈ N(A) (the null space). The unique minimum-norm solution is x* = A^T(AA^T)^{-1}b, i.e., A^+ b where A^+ is the Moore–Penrose pseudoinverse. Gradient descent with step size 0 < η < 2/σ_max^2 (σ_max is the largest singular value of A) obeys x_{t+1} = x_t − ηA^T(Ax_t − b). Because A^T r ∈ Row(A), the iterates preserve the decomposition x_t = P_Row(A) x_t + P_N(A) x_t, keeping the null-space component fixed.

Example (theorem): If x_0 ∈ Row(A) (e.g., x_0 = 0), then x_t → x* = A^T(AA^T)^{-1}b. In general, x_t → P_N(A) x_0 + x*. For inconsistent problems, the limit is the minimum-norm least-squares solution x* = A^+ b. This formalizes the implicit bias: among all minimizers, gradient descent selects the minimum-norm one (modulo the fixed null-space component from initialization).

04 When to Use

Hook: Why care if you can just compute the pseudoinverse once and be done?

Concept: In large-scale or streaming problems, forming A^+ explicitly (via SVD or matrix inversion) can be too costly in time or memory, while gradient descent scales with matrix–vector products and can be stopped early.

Use Cases:
1) High-dimensional linear regression where n ≫ m (more features than samples). Setting x0 = 0 and using a stable learning rate yields the minimum-norm interpolator, which often generalizes better.
2) Overparameterized models (e.g., wide neural networks). Even without explicit L2 regularization, gradient-based training picks low-complexity solutions (e.g., max-margin in separable classification), which helps explain generalization.
3) Signal processing and inverse problems, where the physics yields underdetermined linear maps A; gradient methods naturally recover the least-energy solution.
4) Distributed or online settings where storing or factorizing A is expensive; you can stream A and b, applying iterative updates.

Example: In practical C++, computing x* via x* = A^T(AA^T)^{-1}b requires inverting an m×m matrix—fine for small m but expensive for large m. Gradient descent instead uses repeated A and A^T multiplies, which are often cache-friendly and parallelizable.

⚠ Common Mistakes

Hook: The theory is clean, but implementations can quietly violate assumptions.

Concept:
1) Step size too large: If η ≄ 2/σ_max^2, gradient descent can diverge or oscillate. Always estimate or backtrack to a safe η.
2) Assuming any initialization gives the minimum-norm solution: Not true if x0 has a null-space component; that part remains forever. Start from x0 = 0 (or project x0 onto the row space) to get the pure minimum-norm solution.
3) Ignoring inconsistency: If b ∉ Col(A), there is no exact solution; gradient descent converges to the minimum-norm least-squares solution instead—don’t misinterpret residuals.
4) Numerical instability: Forming AA^T and inverting it can be ill-conditioned; prefer SVD for A^+ in production or add Tikhonov regularization.
5) Confusing implicit bias with explicit regularization: Weight decay (L2) changes the objective; implicit bias arises from algorithm dynamics even without penalties.
6) Stopping too early or too late: Early stopping can act like regularization; know whether you want the exact min-norm interpolator or a partially fitted model.

Example: If you warm-start from a previous solution containing null-space drift, gradient descent won’t remove it; explicitly project onto the row space if you want the pure min-norm solution.

Key Formulas

Gradient Descent Update (Least Squares)

x_{t+1} = x_t − η A^T (A x_t − b)

Explanation: Each step moves opposite to the gradient of 1/2 ||Ax − b||^2. The term A^T(Ax − b) lies in the row space, so updates never change null-space components.

Minimum-Norm Solution (Full Row Rank)

x* = A^T (A A^T)^{-1} b = A^+ b

Explanation: When A has full row rank and Ax=b is consistent, the unique minimum-norm solution is given by the Moore–Penrose pseudoinverse. This is the point of smallest Euclidean norm among all exact solutions.

Step Size Stability

0 < η < 2/σ_max^2

Explanation: For the quadratic least-squares objective, the gradient is Lipschitz with constant σ_max^2. Choosing η in this range guarantees convergence of gradient descent.

Decomposition and Projections

R^n = Row(A) ⊕ N(A),   P_Row(A) = A^T (A A^T)^{-1} A,   P_N(A) = I − P_Row(A)

Explanation: Any vector splits uniquely into row-space and null-space parts. These projection matrices extract those components when A has full row rank.

Solution Set Structure

{ x : Ax = b } = x* + N(A)

Explanation: All solutions form an affine set: the minimum-norm solution plus any vector in the null space. This shows why there are infinitely many solutions when m < n.

GD Limit with Arbitrary Initialization

x_∞ = P_N(A) x_0 + A^T (A A^T)^{-1} b

Explanation: Gradient descent preserves the null-space component of the initialization and converges to the minimum-norm point in the row space. Starting at zero removes the first term.

Gradient Lipschitz Constant

L = ||A^T A||_2 = σ_max^2

Explanation: For least squares, the gradient’s Lipschitz constant equals the spectral norm of A^T A. This value controls step-size choices for convergence.

Hard-Margin SVM

min_w ||w||_2  s.t.  y_i (w^T x_i) ≄ 1

Explanation: In separable classification, gradient descent on logistic loss converges in direction to this maximum-margin separator, showing implicit bias beyond linear regression.

Complexity Analysis

Per iteration of gradient descent for f(x) = 1/2 ||Ax − b||^2 with A ∈ R^{m×n}, the dominant costs are two matrix–vector products: r = Ax (O(mn)) and g = A^T r (O(mn)). The update x ← x − ηg is O(n). Therefore, one iteration runs in O(mn) time and O(m + n) space to store x, b, and the intermediate residual. Over T iterations, total time is O(Tmn). In many applications with m â‰Ș n (underdetermined), this scales linearly in the number of parameters n per pass.

If you compare with forming the pseudoinverse explicitly, a standard SVD-based A^+ costs roughly O(m^2 n) when m ≀ n (or O(n^2 m) symmetrically), which can be prohibitive for large n. Alternatively, using the identity x* = A^T (A A^T)^{-1} b (when A has full row rank) requires computing M = A A^T (O(m^2 n)) and inverting the m×m matrix M (O(m^3)), which is feasible only when m is small. Thus, gradient descent is often preferred when repeated A·v and A^T·w multiplications are cheap and memory-friendly.

Space-wise, storing A costs O(mn) unless A is structured (sparse, convolutional), in which case matrix–vector products can be done faster and with reduced memory. The numerical stability of gradient descent is governed by the condition number Îș = σ_max/σ_min (restricted to the row space). A larger Îș slows convergence along poorly conditioned directions, potentially requiring more iterations or preconditioning. Choosing a stable learning rate 0 < η < 2/σ_max^2 ensures convergence; backtracking or spectral estimates (e.g., power iteration) can provide practical bounds.

Code Examples

GD on an underdetermined system converges to the minimum-norm solution (zero initialization)
#include <bits/stdc++.h>
using namespace std;

// Basic linear algebra helpers for small dense matrices
using Mat = vector<vector<double>>;
using Vec = vector<double>;

Mat transpose(const Mat &A){
    size_t m=A.size(), n=A[0].size();
    Mat AT(n, vector<double>(m,0.0));
    for(size_t i=0;i<m;++i) for(size_t j=0;j<n;++j) AT[j][i]=A[i][j];
    return AT;
}

Vec matVec(const Mat &A, const Vec &x){
    size_t m=A.size(), n=A[0].size();
    Vec y(m,0.0);
    for(size_t i=0;i<m;++i){
        double s=0.0; for(size_t j=0;j<n;++j) s += A[i][j]*x[j];
        y[i]=s;
    }
    return y;
}

Mat matMul(const Mat &A, const Mat &B){
    size_t m=A.size(), k=A[0].size(), n=B[0].size();
    Mat C(m, vector<double>(n,0.0));
    for(size_t i=0;i<m;++i){
        for(size_t t=0;t<k;++t){ double a=A[i][t]; if(a==0) continue; for(size_t j=0;j<n;++j) C[i][j]+=a*B[t][j]; }
    }
    return C;
}

Vec vecAdd(const Vec &a, const Vec &b, double alpha=1.0){
    size_t n=a.size(); Vec c(n); for(size_t i=0;i<n;++i) c[i]=a[i]+alpha*b[i]; return c;
}

double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }

double norm2(const Vec &a){ return sqrt(max(0.0, dot(a,a))); }

// Gauss-Jordan inversion for a small square matrix (with partial pivoting)
Mat inverse(Mat A){
    size_t n=A.size();
    Mat I(n, vector<double>(n,0.0));
    for(size_t i=0;i<n;++i) I[i][i]=1.0;
    for(size_t col=0; col<n; ++col){
        // pivot
        size_t piv=col; double best=fabs(A[col][col]);
        for(size_t r=col+1;r<n;++r){ double v=fabs(A[r][col]); if(v>best){best=v;piv=r;} }
        if(best < 1e-12) throw runtime_error("Matrix is singular or ill-conditioned");
        if(piv!=col){ swap(A[piv],A[col]); swap(I[piv],I[col]); }
        // normalize row
        double diag=A[col][col]; double invd=1.0/diag;
        for(size_t j=0;j<n;++j){ A[col][j]*=invd; I[col][j]*=invd; }
        // eliminate other rows
        for(size_t r=0;r<n;++r){ if(r==col) continue; double f=A[r][col]; if(f==0) continue; for(size_t j=0;j<n;++j){ A[r][j]-=f*A[col][j]; I[r][j]-=f*I[col][j]; } }
    }
    return I;
}

int main(){
    // Construct a 2x4 full-row-rank underdetermined system A x = b
    // Rows are linearly independent; null space has dimension at least 2.
    Mat A = {
        {1, 0, 1, 0},
        {0, 1, 0, 1}
    }; // m=2, n=4
    size_t m=2, n=4;

    // Create a true x with a nonzero null-space component so min-norm != x_true
    Vec x_true = {1, 2, -1, 0}; // contains [1,0,-1,0] component (in null space)
    Vec b = matVec(A, x_true); // consistent system by construction

    // Compute the minimum-norm solution x_min = A^T (A A^T)^{-1} b
    Mat AT = transpose(A);
    Mat M = matMul(A, AT); // 2x2
    Mat Minv = inverse(M); // invert AA^T
    // Compute y = Minv * b
    Vec y(m,0.0); for(size_t i=0;i<m;++i){ for(size_t j=0;j<m;++j) y[i]+=Minv[i][j]*b[j]; }
    // x_min = A^T * y
    Vec x_min(n,0.0); for(size_t i=0;i<n;++i){ for(size_t j=0;j<m;++j) x_min[i]+=AT[i][j]*y[j]; }

    // Gradient Descent from zero initialization
    Vec x(n, 0.0);
    double eta = 0.25; // stable step size for this small example
    int T = 2000;
    for(int t=0; t<T; ++t){
        Vec r = vecAdd(matVec(A, x), b, -1.0); // r = A x - b
        // g = A^T r
        Vec g(n,0.0);
        for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s += AT[i][j]*r[j]; g[i]=s; }
        // x <- x - eta * g
        for(size_t i=0;i<n;++i) x[i] -= eta * g[i];
    }

    // Report distances
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "||A x - b|| (GD) = " << norm2(vecAdd(matVec(A,x), b, -1.0)) << "\n";
    cout << "||x_GD - x_min|| = " << norm2(vecAdd(x, x_min, -1.0)) << "\n";
    cout << "x_min = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x_min[i]; cout << "]\n";
    cout << "x_GD  = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x[i]; cout << "]\n";
    return 0;
}

We build a 2×4 underdetermined, consistent system and compute the minimum-norm solution via x* = A^T(AA^T)^{-1}b. Starting gradient descent at zero with a stable step size ensures updates stay in the row space and converge to x*. The output shows a small residual and that ||x_GD − x_min|| is near zero, confirming the implicit bias.

Time: O(Tmn) for T iterations; the one-time inverse of the 2×2 matrix AA^T is O(m^3). Space: O(mn) to store A plus O(n) for vectors.
Null-space component persists under GD (nonzero initialization)
#include <bits/stdc++.h>
using namespace std;
using Mat = vector<vector<double>>; using Vec = vector<double>;

// Helpers
Mat transpose(const Mat &A){ size_t m=A.size(), n=A[0].size(); Mat AT(n, vector<double>(m)); for(size_t i=0;i<m;++i) for(size_t j=0;j<n;++j) AT[j][i]=A[i][j]; return AT; }
Vec matVec(const Mat &A, const Vec &x){ size_t m=A.size(), n=A[0].size(); Vec y(m,0); for(size_t i=0;i<m;++i){ double s=0; for(size_t j=0;j<n;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }
Vec add(const Vec &a, const Vec &b, double alpha=1.0){ Vec c=a; for(size_t i=0;i<a.size();++i) c[i]+=alpha*b[i]; return c; }
double dot(const Vec &a, const Vec &b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
double norm2(const Vec &a){ return sqrt(max(0.0, dot(a,a))); }

int main(){
    // Same A as before
    Mat A = {
        {1, 0, 1, 0},
        {0, 1, 0, 1}
    }; size_t m=2, n=4;
    Mat AT = transpose(A);

    // Build a consistent right-hand side
    Vec x_row_only = {0.5, 1.0, 0.5, 1.0}; // lies in Row(A): it's A^T y for y=[0.5, 1.0]
    Vec b = matVec(A, x_row_only);

    // Compute minimum-norm solution: it should equal x_row_only here
    // because x_row_only has zero null-space component and satisfies Ax=b.
    // We'll just store it as x_min for comparison.
    Vec x_min = x_row_only;

    // Choose a nonzero null-space vector z with A z = 0
    Vec z_null = {1, 0, -1, 0};

    // Initialize GD at x0 = z_null (pure null-space component)
    Vec x = z_null;

    // Run GD: x_{t+1} = x_t - eta * A^T (A x_t - b)
    double eta = 0.25; int T=2000;
    for(int t=0;t<T;++t){
        Vec r = add(matVec(A, x), b, -1.0); // A x - b
        Vec g(n,0.0);
        for(size_t i=0;i<n;++i){ double s=0; for(size_t j=0;j<m;++j) s+=AT[i][j]*r[j]; g[i]=s; }
        for(size_t i=0;i<n;++i) x[i] -= eta * g[i];
    }

    // Theory predicts: limit = z_null + x_min
    Vec predicted = add(x_min, z_null, 1.0);

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "||A x - b|| (GD) = " << norm2(add(matVec(A,x), b, -1.0)) << "\n";
    cout << "||x_GD - (x_min + z_null)|| = " << norm2(add(x, predicted, -1.0)) << "\n";
    cout << "x_GD = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << x[i]; cout << "]\n";
    cout << "x_min + z_null = ["; for(size_t i=0;i<n;++i) cout << (i?", ":"") << predicted[i]; cout << "]\n";
    return 0;
}

We craft a consistent system with a known row-space solution x_min and start gradient descent from a vector in the null space. Because the gradient lies in the row space, the null-space component cannot be removed by updates. The final iterate equals x_min plus the initial null-space vector, confirming the persistence of null-space components and the role of initialization.

Time: O(Tmn) for T iterations. Space: O(mn) to store A plus O(n) for vectors.
#implicit-bias #gradient-descent #minimum-norm #pseudoinverse #moore-penrose #underdetermined-system #null-space #row-space #least-squares #overparameterization #maximum-margin #svm #learning-rate #singular-values #projection