Spectral Analysis of Neural Networks
Key Points
- Spectral analysis studies the distribution of eigenvalues and singular values of neural network weight matrices during training.
- The empirical spectral distribution (ESD) often shows a random-matrix "bulk" plus a few outlier eigenvalues that capture learned structure.
- The largest singular value (spectral norm) controls gradient amplification and stable learning-rate choices.
- For non-symmetric weight matrices, singular values are usually more informative than eigenvalues; use W^{\top}W to get a symmetric, positive semidefinite matrix.
- Random Matrix Theory (RMT) provides useful baselines such as the Marchenko–Pastur law to compare with observed spectra.
- Tracking spectra across epochs can diagnose exploding/vanishing gradients, overfitting, and implicit regularization.
- Exact full spectra cost O(n^3); scalable approximations (power iteration, Lanczos) run in O(k·nm) for k iterations on large n×m layers.
- In practice, analyze the symmetric part (W + W^{\top})/2 or the Gram matrix W^{\top}W, and use histograms to inspect the ESD.
Prerequisites
- Linear algebra (vectors, matrices, eigenvalues/SVD): spectral analysis relies on eigenvalues and singular values, Gram matrices, and orthogonality.
- Probability and Random Matrix Theory (basics): the Marchenko–Pastur law and bulk/outlier interpretations require understanding of i.i.d. models and limiting distributions.
- Gradient-based optimization: learning-rate stability bounds involve top eigenvalues of Hessians or Lipschitz constants from spectral norms.
- Neural network architectures (linear layers, ReLU, CNNs): you must know how weights map to matrices to form W, W^{\top}W, or symmetric parts for analysis.
- Numerical methods: iterative eigensolvers (power iteration, Lanczos) and stability concerns are essential for scalable analysis.
Detailed Explanation
01 Overview
Spectral analysis of neural networks examines the eigenvalues and singular values of matrices that arise in learning: primarily weight matrices, but also Hessians, Jacobians, and Gram/NTK matrices. During training, these spectra evolve in characteristic ways: early on they can resemble random matrices, while later certain directions (outliers) emerge that reflect the learned patterns in data. By inspecting the empirical spectral distribution (ESD), the histogram of eigenvalues or singular values, we gain a window into stability (e.g., whether gradients explode), expressivity (which directions dominate), and generalization (how heavy-tailed the spectrum is). Random Matrix Theory (RMT) provides a powerful reference: purely random weights have predictable spectra (e.g., Marchenko–Pastur for singular values), so deviations from these baselines indicate learned structure or pathologies. Practically, full eigenvalue decompositions are expensive for modern networks, but scalable approximations, such as tracking just the top singular value (spectral norm) or using Krylov methods, are often sufficient. This perspective is model-agnostic: it applies to linear layers and, with care, to convolutional and attention blocks by reshaping operators into matrices. Overall, spectral analysis offers a principled, quantitative tool to monitor and guide training beyond simple scalar metrics like loss and accuracy.
02 Intuition & Analogies
Think of a weight matrix as a machine that stretches and rotates space. Drop a bunch of arrows (vectors) into this machine: some directions get stretched a lot, others compressed. The stretching factors are the singular values; the most-stretched direction corresponds to the largest singular value. If that top stretch is huge, tiny errors can get blown up, like whispering into a loudspeaker cranked to max, leading to unstable training. Conversely, if everything is squashed, signals and gradients fade away. Now imagine plotting how much stretching happens in all directions as a histogram: that's the empirical spectral distribution. For a brand-new, randomly initialized network, the histogram typically has a smooth "bulk" predicted by probability theory (RMT). As the model learns structure in data, a few bars start poking out from the bulk: these are outliers, representing directions that the network has tuned to be especially sensitive or expressive. You can think of the bulk as background noise and the outliers as the melody: it's where the learning lives. Over training, watching those bars move helps you see if your learning rate is reasonable (top stretch not exploding), if regularization is working (not too many huge bars), and if you're starting to overfit (heavy tail growth). For non-symmetric matrices (most weights), eigenvalues can be complex and tricky; instead, look at the symmetric part or at W^{\top}W, whose eigenvalues are the squared singular values: clean, real, and easy to interpret.
03 Formal Definition
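A compact formal setup, sketched here using the standard conventions that the Key Formulas section also assumes:

```latex
Let $W \in \mathbb{R}^{n \times m}$ be a weight matrix. Its singular values
$\sigma_1 \ge \cdots \ge \sigma_{\min(n,m)} \ge 0$ are the square roots of the
eigenvalues of the Gram matrix $W^{\top} W$:
\[
  \sigma_i(W) = \sqrt{\lambda_i(W^{\top} W)} .
\]
For a Hermitian matrix $A \in \mathbb{R}^{n \times n}$ with eigenvalues
$\lambda_1, \dots, \lambda_n$, the empirical spectral distribution is the
probability measure
\[
  \mu_A = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i} .
\]
For a non-symmetric $W$, one studies either the singular values of $W$, the
eigenvalues of $W^{\top} W$ (the squared singular values), or the eigenvalues
of the symmetric part $\tfrac{1}{2}(W + W^{\top})$.
```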
04 When to Use
Use spectral analysis when you want to diagnose or improve training stability and generalization. Concretely: (1) Before training, inspect \|W\|_2 for new architectures or initializations; very large spectral norms suggest gradients may explode, motivating scaling or spectral norm regularization. (2) During training, track the top singular value of key layers; sudden growth can signal that the learning rate is too high. (3) To study generalization, compare the ESD of W^{\top}W to the Marchenko–Pastur bulk: outliers indicate learned structure; heavy tails can correlate with over-parameterization and implicit regularization effects. (4) For second-order diagnostics, examine the Hessian's top eigenvalues to gauge curvature; large \lambda_{\max}(H) suggests reducing the learning rate or using adaptive/second-order methods. (5) In pruning or compression, small singular values indicate near-redundant directions that can be truncated with limited performance impact. (6) For safety/robustness, large spectral norms in early layers imply strong input amplification and potential adversarial vulnerability; spectral normalization can help. (7) In sequence models and RNNs, keeping the spectral radius near 1 in recurrent matrices mitigates vanishing/exploding gradients.
⚠️ Common Mistakes
- Confusing eigenvalues with singular values for non-symmetric weights. Eigenvalues can be complex and do not directly measure amplification; use singular values via W^{\top}W instead.
- Looking only at the largest singular value and ignoring the distribution. The bulk shape, outliers, and tails carry distinct information about learned structure and regularization.
- Misapplying random-matrix baselines. Marchenko–Pastur assumes i.i.d. entries; convolutions, normalization layers, and weight sharing break assumptions, so treat RMT as a qualitative guide, not a strict test.
- Forming W^{\top}W explicitly for huge layers. This squares the condition number and costs O(n^2 m) memory/time; prefer implicit products A v = W^{\top}(W v) for iterative methods.
- Using power iteration without normalization or with too few iterations, leading to poor estimates of the spectral norm; monitor the Rayleigh quotient and convergence.
- Interpreting negative eigenvalues of the symmetric part (W+W^{\top})/2 as instability in the forward pass; they relate to non-normality and asymmetric components, so prefer singular values for amplification.
- Ignoring scaling. Compare spectra across epochs using consistent normalization (e.g., by layer width), or else apparent changes may just reflect size, not learning.
- Relying on tiny samples for histograms. Use enough bins and samples; for large matrices, adopt stochastic methods (Lanczos, Hutchinson) to estimate spectral densities reliably.
Key Formulas
Empirical Spectral Distribution (ESD)
\mu_A = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i(A)}
Explanation: The ESD is the uniform measure over the n eigenvalues of a Hermitian matrix A. In practice, we estimate it by a histogram of eigenvalues.
Spectral Norm
\|W\|_2 = \sigma_{\max}(W) = \sqrt{\lambda_{\max}(W^{\top} W)}
Explanation: The largest singular value equals the square root of the largest eigenvalue of the Gram matrix W^{\top}W. It measures the maximum amplification by W.
Spectral Radius
\rho(W) = \max_i |\lambda_i(W)|
Explanation: The spectral radius is the largest magnitude among eigenvalues. For recurrent dynamics, it governs stability of repeated application of W.
Stieltjes Transform
s_{\mu}(z) = \int \frac{d\mu(\lambda)}{\lambda - z}, \quad \operatorname{Im}(z) > 0
Explanation: This complex-valued function characterizes the spectrum of A. Its imaginary part along the real axis recovers the density of eigenvalues: \rho(\lambda) = \frac{1}{\pi} \lim_{\eta \to 0^+} \operatorname{Im}\, s_{\mu}(\lambda + i\eta).
Marchenko–Pastur Law
\rho_{\mathrm{MP}}(\lambda) = \frac{1}{2\pi \sigma^2 q \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \qquad \lambda_{\pm} = \sigma^2 (1 \pm \sqrt{q})^2
Explanation: For an n \times m matrix X with i.i.d. entries of variance \sigma^2 and aspect ratio q = m/n \le 1, the eigenvalues of \frac{1}{n} X^{\top} X concentrate on [\lambda_-, \lambda_+] with the above density. Deviations signal structure beyond randomness.
Gradient Descent Stability (Quadratic)
\eta < \frac{2}{\lambda_{\max}(H)}
Explanation: For minimizing a quadratic with Hessian H using gradient descent, the step size \eta must be below 2/\lambda_{\max}(H) to ensure convergence.
Rayleigh Quotient and Power Iteration
R(A, v) = \frac{v^{\top} A v}{v^{\top} v}, \qquad v_{k+1} = \frac{A v_k}{\|A v_k\|}
Explanation: The Rayleigh quotient approximates the dominant eigenvalue of symmetric A as the normalized iterates converge to the top eigenvector.
Trace–Eigenvalue Identity
\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i(A)
Explanation: The sum of eigenvalues equals the trace, which is easy to compute. Useful for sanity checks and moment estimates.
Frobenius Norm and Singular Values
\|W\|_F^2 = \sum_{i} \sigma_i(W)^2
Explanation: The squared Frobenius norm equals the sum of squared singular values, relating entrywise scale to spectral scale.
Condition Number
\kappa(W) = \frac{\sigma_{\max}(W)}{\sigma_{\min}(W)}
Explanation: A large condition number indicates ill-conditioning and potential optimization slowdowns or instability.
Complexity Analysis
Exact eigendecomposition or SVD of an n×n layer costs O(n^3) time and O(n^2) memory, which is prohibitive for large models. Power iteration and Lanczos need only matrix-vector products, costing O(k·nm) for k iterations on an n×m matrix, and they never require forming W^{\top}W explicitly. Stochastic methods (Lanczos quadrature, Hutchinson trace estimation) extend this to full spectral-density estimates at similar cost.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c;
    vector<double> a; // row-major
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_, double val=0.0) : r(r_), c(c_), a(r_*c_, val) {}
    double &operator()(int i, int j) { return a[i*c + j]; }
    const double &operator()(int i, int j) const { return a[i*c + j]; }
};

// Multiply A (r x c) by vector x (size c) -> y (size r)
vector<double> matvec(const Matrix &A, const vector<double> &x){
    vector<double> y(A.r, 0.0);
    for(int i=0;i<A.r;++i){
        const double *row = &A.a[i*A.c];
        double s=0.0;
        for(int j=0;j<A.c;++j) s += row[j] * x[j];
        y[i]=s;
    }
    return y;
}

// Multiply A^T by vector x (size r) -> y (size c)
vector<double> matTvec(const Matrix &A, const vector<double> &x){
    vector<double> y(A.c, 0.0);
    for(int i=0;i<A.r;++i){
        double xi = x[i];
        const double *row = &A.a[i*A.c];
        for(int j=0;j<A.c;++j) y[j] += row[j] * xi;
    }
    return y;
}

// Matrix-matrix multiply: C = A * B
Matrix matmul(const Matrix &A, const Matrix &B){
    assert(A.c == B.r);
    Matrix C(A.r, B.c, 0.0);
    for(int i=0;i<A.r;++i){
        for(int k=0;k<A.c;++k){
            double aik = A(i,k);
            for(int j=0;j<B.c;++j){
                C(i,j) += aik * B(k,j);
            }
        }
    }
    return C;
}

Matrix transpose(const Matrix &A){
    Matrix T(A.c, A.r, 0.0);
    for(int i=0;i<A.r;++i)
        for(int j=0;j<A.c;++j)
            T(j,i) = A(i,j);
    return T;
}

// Spectral norm via power iteration on the implicit Gram operator v -> A^T(A v)
// Returns an estimate of ||A||_2
double spectral_norm_power_iteration(const Matrix &A, int iters=50, unsigned seed=42){
    mt19937 rng(seed);
    normal_distribution<double> N(0.0,1.0);
    // Work in A.c-dimensional space (columns of A)
    vector<double> v(A.c);
    for(double &vi : v) vi = N(rng);
    auto normalize = [](vector<double> &x){
        double n=0.0; for(double xi: x) n += xi*xi; n = sqrt(max(n,1e-30));
        for(double &xi : x) xi /= n; return n;
    };
    normalize(v);
    double rayleigh = 0.0;
    for(int t=0;t<iters;++t){
        // z = A v
        vector<double> z = matvec(A, v);
        // v_next = A^T z = A^T (A v)
        vector<double> v_next = matTvec(A, z);
        // Rayleigh quotient for A^T A: r = (v^T (A^T A) v) / (v^T v) = ||A v||^2 / ||v||^2
        double num=0.0; for(double zi: z) num += zi*zi;
        double den=0.0; for(double vi: v) den += vi*vi;
        rayleigh = (den>0? num/den : 0.0);
        normalize(v_next);
        v.swap(v_next);
    }
    return sqrt(max(0.0, rayleigh));
}

// Simple 2-layer linear network: y_hat = X W1 W2; MSE loss; gradient descent
int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m=256;  // samples
    int d=64;   // input dim
    int h=64;   // hidden dim
    int o=10;   // output dim
    double lr = 0.05; // learning rate
    int epochs = 50;

    mt19937 rng(1);
    normal_distribution<double> N(0.0,1.0);

    Matrix X(m,d), W1(d,h), W2(h,o), Y(m,o);

    // Random data and target with a planted low-rank structure
    for(double &v: X.a) v = N(rng) / sqrt(d);
    Matrix U_true(d, 3), V_true(3, o);
    for(double &v: U_true.a) v = N(rng)/sqrt(d);
    for(double &v: V_true.a) v = N(rng)/sqrt(o);
    Matrix W_true = matmul(U_true, V_true);
    Matrix XW_true = matmul(X, W_true);
    // Add small noise to targets
    for(int i=0;i<m;++i)
        for(int j=0;j<o;++j)
            Y(i,j) = XW_true(i,j) + 0.05*N(rng);

    // Initialize weights
    for(double &v: W1.a) v = N(rng) * sqrt(2.0/d);
    for(double &v: W2.a) v = N(rng) * sqrt(2.0/h);

    auto mse = [&](const Matrix &P){
        double s=0.0; for(double x: P.a) s += x*x; return 0.5*s / m; };

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    for(int e=0;e<epochs;++e){
        // Forward: H = X W1; Yhat = H W2
        Matrix H = matmul(X, W1);
        Matrix Yhat = matmul(H, W2);
        // Error and loss
        Matrix E(m,o);
        for(int i=0;i<m;++i) for(int j=0;j<o;++j) E(i,j) = Yhat(i,j) - Y(i,j);
        double loss = mse(E);

        // Gradients: dW2 = (H^T E)/m ; dW1 = (X^T E W2^T)/m
        Matrix HT = transpose(H);
        Matrix dW2 = matmul(HT, E);
        for(double &v: dW2.a) v /= m;
        Matrix XT = transpose(X);
        Matrix EW2T = matmul(E, transpose(W2)); // E * W2^T (m x h)
        Matrix dW1 = matmul(XT, EW2T);
        for(double &v: dW1.a) v /= m;

        // Update
        for(size_t i=0;i<W1.a.size();++i) W1.a[i] -= lr * dW1.a[i];
        for(size_t i=0;i<W2.a.size();++i) W2.a[i] -= lr * dW2.a[i];

        // Track spectral norms of W1, W2, and product W_eff = W1 W2 (power iteration)
        double s1 = spectral_norm_power_iteration(W1, 40, 100+e);
        double s2 = spectral_norm_power_iteration(W2, 40, 200+e);
        Matrix Weff = matmul(W1, W2);
        double se = spectral_norm_power_iteration(Weff, 40, 300+e);

        cout << "epoch "<<setw(2)<<e
             << " loss="<<loss
             << " ||W1||2="<<s1
             << " ||W2||2="<<s2
             << " ||W1W2||2="<<se << "\n";
    }
    return 0;
}
```
This program trains a small 2-layer linear network with mean-squared error. After each epoch, it estimates the spectral norms of W1, W2, and their product W1W2 using power iteration without forming Gram matrices. The spectral norm tracks the maximum amplification of each layer, which is closely tied to stability and gradient behavior. The planted low-rank target induces outlier singular values as training progresses.
```cpp
#include <bits/stdc++.h>
#include <Eigen/Dense>
using namespace std;
using namespace Eigen;

// Compile with: g++ -O2 -std=c++17 main.cpp -I /path/to/eigen

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m=512, d=64, h=64, o=16; // small so eigendecomposition is feasible
    double lr = 0.05; int epochs = 30;

    std::mt19937 rng(42);
    std::normal_distribution<double> N(0.0,1.0);

    auto randn = [&](int r, int c){
        MatrixXd X(r,c);
        for(int i=0;i<r;i++) for(int j=0;j<c;j++) X(i,j)=N(rng);
        return X;
    };

    MatrixXd X = randn(m,d) / sqrt(d);
    MatrixXd W1 = randn(d,h) * sqrt(2.0/d);
    RowVectorXd b1 = RowVectorXd::Zero(h);
    MatrixXd W2 = randn(h,o) * sqrt(2.0/h);
    RowVectorXd b2 = RowVectorXd::Zero(o);

    // Create targets with a planted low-rank signal through a ReLU layer
    MatrixXd U = randn(d, 3) / sqrt(d);
    MatrixXd V = randn(3, o) / sqrt(o);
    MatrixXd Z = (X * U).array().max(0.0).matrix() * V; // nonlinearity
    MatrixXd Y = Z + 0.05 * randn(m,o);

    auto relu = [](const MatrixXd &A) -> MatrixXd { return A.array().max(0.0).matrix(); };

    for(int e=0;e<epochs;++e){
        // Forward
        MatrixXd Hpre = X * W1; Hpre.rowwise() += b1;
        MatrixXd H = relu(Hpre);
        MatrixXd Yhat = H * W2; Yhat.rowwise() += b2;
        MatrixXd E = Yhat - Y;
        double loss = 0.5 * E.squaredNorm() / m;

        // Backward
        MatrixXd dYhat = E / m;
        MatrixXd dW2 = H.transpose() * dYhat;
        RowVectorXd db2 = dYhat.colwise().sum();
        MatrixXd dH = dYhat * W2.transpose();
        MatrixXd dHpre = (dH.array() * (Hpre.array() > 0.0).cast<double>()).matrix();
        MatrixXd dW1 = X.transpose() * dHpre;
        RowVectorXd db1 = dHpre.colwise().sum();

        // Update
        W2 -= lr * dW2; b2 -= lr * db2;
        W1 -= lr * dW1; b1 -= lr * db1;

        cout.setf(std::ios::fixed); cout<<setprecision(6);
        cout << "epoch "<<setw(2)<<e<<" loss="<<loss;

        // Exact singular values of W1 (small h,d) and eigenvalues of symmetric part
        JacobiSVD<MatrixXd> svd(W1, ComputeThinU | ComputeThinV);
        VectorXd s = svd.singularValues();
        double smax = s(0);
        cout << " ||W1||2="<<smax;

        // Symmetric part eigenvalues (real): S = (W1 + W1^T)/2
        MatrixXd S = 0.5*(W1 + W1.transpose());
        SelfAdjointEigenSolver<MatrixXd> es(S);
        VectorXd evals = es.eigenvalues();
        double lam_max = evals.maxCoeff();
        cout << " lam_max(S)="<<lam_max << "\n";
    }

    // Final: ESD histogram of W1^T W1 (squared singular values)
    MatrixXd G = W1.transpose() * W1; // positive semidefinite
    SelfAdjointEigenSolver<MatrixXd> esG(G);
    VectorXd lams = esG.eigenvalues(); // these are sigma_i^2

    // Histogram
    int bins = 20;
    double mn = lams.minCoeff();
    double mx = lams.maxCoeff();
    vector<int> hist(bins,0);
    for(int i=0;i<lams.size();++i){
        int b = (mx==mn)?0: (int)((lams(i)-mn)/(mx-mn) * bins);
        if(b==bins) b=bins-1;
        hist[b]++;
    }
    cout << "\nESD of W1^T W1 (sigma^2) histogram:" << "\n";
    for(int b=0;b<bins;++b){
        double left = mn + (mx-mn)*b/bins;
        double right = mn + (mx-mn)*(b+1)/bins;
        cout << fixed << setprecision(4) << "["<<left<<","<<right<<"): ";
        int cnt = hist[b];
        int stars = min(60, cnt);
        for(int k=0;k<stars;++k) cout << '*';
        cout << " ("<<cnt<<")\n";
    }

    return 0;
}
```
This example trains a small 1-hidden-layer ReLU network (so that exact decompositions are feasible) and then computes: (1) the exact singular values of W1, (2) eigenvalues of its symmetric part S=(W1+W1^T)/2 each epoch, and (3) a final histogram of eigenvalues of W1^T W1 (squared singular values), which is an ESD estimate. This gives a compact, practical workflow to inspect spectra and their evolution during training.