Feature Learning vs Kernel Regime
Key Points
- The kernel (lazy) regime keeps neural network parameters close to their initialization, making training equivalent to kernel regression with a fixed kernel such as the Neural Tangent Kernel (NTK).
- The feature learning (rich) regime lets parameters move significantly, so the network actively discovers new, data-dependent features that improve with depth and width.
- Kernel methods are convex and stable but scale poorly with dataset size because they require O(n^2) memory and O(n^3) time to solve.
- Feature learning scales better with data size and can discover useful representations, but optimization is non-convex and sensitive to hyperparameters.
- In the infinite-width limit with small learning rate, gradient descent on neural nets converges to kernel regression using the NTK.
- Choosing between regimes depends on data size, computational budget, and whether task-specific features need to be learned.
- You can simulate both regimes in C++: kernel ridge regression for the lazy regime and a small two-layer network trained with gradient descent for feature learning.
Prerequisites
- Linear algebra (vectors, matrices, eigenvalues) — Kernel methods use Gram matrices and linear solves; NTK dynamics depend on eigen-decompositions.
- Multivariable calculus and gradients — Understanding Jacobians, linearization, and gradient descent requires differentiation.
- Supervised learning and loss functions — Both regimes minimize losses like mean squared error using gradient-based methods or closed-form solutions.
- Kernel methods and RKHS basics — The kernel regime is kernel ridge regression with specific kernels (e.g., RBF, NTK).
- Optimization (gradient descent, learning rates) — Regime depends on step size and how parameters evolve during training.
- Neural network architectures — Feature learning depends on depth, width, activations, and parameter scaling.
- Numerical linear algebra — Stability of solving (K+λI)α=y and conditioning issues are central in practice.
- C++ programming fundamentals — To implement solvers and training loops safely and efficiently.
- Probability and statistics — Regularization, noise models, and generalization considerations rely on probabilistic thinking.
- Approximation theory / generalization — Helps reason about when fixed kernels suffice versus when learned representations are needed.
Detailed Explanation
1. Overview
Modern neural networks can operate in two distinct training behaviors: the lazy (kernel) regime and the rich (feature learning) regime. In the kernel regime, the network behaves almost linearly around its initialization; weights barely move, and predictions can be described by a fixed kernel, most famously the Neural Tangent Kernel (NTK). Training is essentially kernel regression: you compute a Gram matrix, solve a linear system, and obtain a closed-form predictor. In the feature learning regime, parameters move significantly and the network adapts its internal representations to the data, improving performance especially on complex tasks. This dichotomy clarifies why infinitely wide networks trained with small learning rates mimic kernel methods, while finite, moderately wide networks trained more aggressively learn useful intermediate features. Understanding both regimes helps you select algorithms and hyperparameters: the kernel regime favors theoretical guarantees and small datasets; the feature learning regime favors large, complex datasets and tasks requiring hierarchical abstractions. This concept also connects optimization (gradient descent dynamics), generalization (double descent), and scaling laws with practical engineering choices like width, learning rate, and regularization.
2. Intuition & Analogies
Imagine two ways to learn to recognize a song. In the first (kernel/lazy), you keep your hearing unchanged and just compare new songs to a library of references using a fixed similarity measure. You never get better at hearing; you only memorize how similar things sound. In the second (feature learning/rich), your ear gradually becomes more sensitive to important patterns—bass lines, chord progressions, rhythm—so your perception itself improves as you practice. Neural networks in the kernel regime act like the first person: their internal filters (features) barely change, and training reduces to combining fixed features to fit the labels. Kernel methods use a similarity function (kernel) to compare inputs and interpolate labels accordingly. In contrast, feature learning changes the filters: the network discovers better features during training, like learning to detect edges, shapes, or semantics in images. Width and learning rate play the role of practice style: extremely wide networks with tiny learning rates barely change their internal filters (lazy practice); narrower networks or larger learning rates allow substantial internal adaptation (rich practice). Because the lazy approach is linear at heart, it offers clean math and stable optimization but can plateau when data requires new abstractions. The rich approach can surpass it by inventing better features but at the cost of harder optimization and less transparent theory.
3. Formal Definition
Let f(x; \theta) be a network with parameters \theta \in \mathbb{R}^p initialized at \theta_0. Training is in the lazy (kernel) regime when the first-order expansion f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0) remains accurate throughout training; gradient descent on the squared loss is then equivalent to kernel regression with the Neural Tangent Kernel K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0), which stays effectively fixed. Training is in the feature learning (rich) regime when \|\theta_t - \theta_0\| grows large enough that this linearization breaks down: the Jacobian, and hence the effective kernel, changes during training, so the network's internal features adapt to the data. Width, parameter scaling (e.g., 1/\sqrt{m}), and learning rate determine which regime dominates.
4. When to Use
Use the kernel/lazy regime when: (1) datasets are small to medium (so O(n^2) memory and O(n^3) time are feasible), (2) you need convex optimization with closed-form solutions and uncertainty estimates, (3) you want stable, reproducible baselines or theory-grounded comparisons, and (4) features are expected to be sufficiently captured by fixed similarities (e.g., smooth functions with RBF kernels). Use the feature learning regime when: (1) datasets are large and hierarchical (images, audio, language) where deep learned features excel, (2) you need scalability in training/inference and cannot store an O(n^2) kernel, (3) transfer learning and representation learning matter, and (4) you can tolerate non-convex optimization with careful tuning. Hybrid strategies also exist: start in the lazy regime to diagnose data issues, then switch to feature learning for performance; or use NTK to initialize and precondition training. In settings like few-shot learning, kernels give strong baselines; for long-horizon generalization (e.g., compositional reasoning), feature learning is often necessary.
⚠️ Common Mistakes
- Assuming infinite width is required for the kernel regime: in practice, sufficiently large width combined with small learning rate and proper parameter scaling already induces lazy behavior. To confirm, monitor parameter drift norms relative to initialization.
- Believing kernels always underperform: on smooth, low-noise problems or small datasets, kernel methods can match or beat deep nets while being simpler to train.
- Ignoring computational costs: kernel regression needs O(n^2) memory and O(n^3) time; with n in the tens of thousands this becomes impractical without approximations (Nyström, random features).
- Mis-scaling learning rates: using a large learning rate with very wide networks can still push out of the lazy regime, invalidating NTK-based predictions. Conversely, tiny learning rates with moderate width may trap you in a lazy regime that underfits.
- Confusing training loss with generalization: kernel methods can interpolate training data with minimal norm solutions yet still overfit when kernels are mismatched; feature learning can overfit too without regularization and data augmentation.
- Numerical instability: forming and inverting Gram matrices without regularization (\lambda I) can cause blow-ups; always add ridge terms and use stable solvers (Cholesky) when possible.
- Overlooking initialization scaling (1/\sqrt{m}): omitting this in wide networks distorts gradient magnitudes and ruins NTK approximations.
Key Formulas
First-order linearization
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)
Explanation: The network is approximated by a linear function of parameter changes around initialization. This is accurate in the lazy regime where parameters move little.
Neural Tangent Kernel (NTK)
K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)
Explanation: The NTK measures similarity by how similarly parameters affect outputs at two inputs. In the lazy regime, this kernel remains effectively constant during training.
Function-space gradient flow
\frac{d}{dt} f_t(X) = -K \, (f_t(X) - y)
Explanation: Under gradient flow with square loss, predictions at training points follow linear dynamics governed by the NTK. Solutions exponentially approach labels when the kernel matrix is positive definite.
Closed-form solution of dynamics
f_t(X) = y + e^{-tK} (f_0(X) - y)
Explanation: This expresses how predictions converge over time. Each eigenmode decays at a rate proportional to its NTK eigenvalue.
Kernel ridge coefficients
\alpha = (K + \lambda I)^{-1} y
Explanation: Training in kernel ridge regression reduces to solving a linear system for coefficients. The ridge term improves numerical stability and controls complexity.
Kernel prediction
\hat{f}(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i)
Explanation: Predictions are kernel-weighted sums over training labels. Each training example contributes according to similarity with the test point.
NTK Gram via Jacobian
K = J J^\top, \quad J_{ij} = \partial f(x_i; \theta_0) / \partial \theta_j
Explanation: Stacking parameter gradients into a Jacobian gives the NTK Gram matrix as J J^\top. This lets you perform NTK regression without a closed-form kernel.
Parameter drift bound (informal)
\|\theta_t - \theta_0\| = O(1/\sqrt{m})
Explanation: With appropriate scaling and large width m, parameter movement shrinks, keeping training in the lazy regime. This justifies the linearization.
RBF kernel
k(x, y) = \exp\!\left(-\|x - y\|^2 / (2\sigma^2)\right)
Explanation: A popular smooth kernel controlled by bandwidth \sigma. Works well when target functions are smooth relative to input distance.
Gradient descent update
\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)
Explanation: Parameters move in the direction of the negative gradient scaled by learning rate \eta. Larger \eta tends to push training toward the rich regime.
Complexity Analysis
Kernel ridge regression requires O(n^2) memory for the Gram matrix, O(n^2) kernel evaluations to build it, O(n^3) time for a dense solve, and O(n) kernel evaluations per test prediction. Full-batch gradient descent on a width-m two-layer network costs O(nm) per step with O(m) memory, so total training cost scales linearly in n; this is why feature learning remains practical at dataset sizes where exact kernel methods are not. The NTK-linearized example below additionally pays O(n^2 m) to form the Gram matrix from its 3m Jacobian features before the O(n^3) solve.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct KernelRidgeRBF {
    double sigma;           // RBF bandwidth
    double lambda;          // ridge regularization
    vector<double> X;       // 1D inputs for simplicity
    vector<double> alpha;   // dual coefficients

    static double sqr(double x){ return x*x; }

    KernelRidgeRBF(double sigma_=0.2, double lambda_=1e-6): sigma(sigma_), lambda(lambda_) {}

    double kernel(double x, double y) const {
        double dist2 = sqr(x - y);
        return exp(-dist2 / (2.0 * sqr(sigma)));
    }

    // Solve linear system A a = b via Gaussian elimination with partial pivoting
    static vector<double> solve_linear(vector<vector<double>> A, vector<double> b){
        int n = (int)A.size();
        for(int i=0;i<n;i++) A[i].push_back(b[i]);
        // Forward elimination
        for(int col=0; col<n; ++col){
            // Pivot
            int pivot = col;
            for(int r=col+1; r<n; ++r) if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
            swap(A[pivot], A[col]);
            double piv = A[col][col];
            if (fabs(piv) < 1e-12) throw runtime_error("Singular matrix (add larger lambda)");
            // Normalize row
            for(int c=col; c<=n; ++c) A[col][c] /= piv;
            // Eliminate below
            for(int r=col+1; r<n; ++r){
                double f = A[r][col];
                for(int c=col; c<=n; ++c) A[r][c] -= f * A[col][c];
            }
        }
        // Back substitution
        vector<double> x(n,0.0);
        for(int r=n-1; r>=0; --r){
            double s = A[r][n];
            for(int c=r+1;c<n;++c) s -= A[r][c]*x[c];
            x[r]=s; // row is normalized
        }
        return x;
    }

    void fit(const vector<double>& x, const vector<double>& y){
        X = x;
        int n = (int)x.size();
        vector<vector<double>> K(n, vector<double>(n, 0.0));
        for(int i=0;i<n;i++){
            for(int j=0;j<n;j++){
                K[i][j] = kernel(x[i], x[j]);
            }
            K[i][i] += lambda; // ridge for numerical stability and regularization
        }
        alpha = solve_linear(K, y);
    }

    double predict_one(double x) const {
        double s = 0.0;
        for(size_t i=0;i<X.size();++i) s += alpha[i]*kernel(x, X[i]);
        return s;
    }

    vector<double> predict(const vector<double>& Xtest) const {
        vector<double> out; out.reserve(Xtest.size());
        for(double xt: Xtest) out.push_back(predict_one(xt));
        return out;
    }
};

int main(){
    // Generate synthetic 1D data: y = sin(2πx) + small noise
    int n = 100;
    vector<double> x(n), y(n);
    mt19937 rng(42);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.05);
    for(int i=0;i<n;i++){
        x[i] = uni(rng);
        y[i] = sin(2.0*M_PI*x[i]) + noise(rng);
    }

    KernelRidgeRBF krr(0.15, 1e-4);
    krr.fit(x, y);

    // Predict on a grid
    int m = 50;
    vector<double> xt(m);
    for(int i=0;i<m;i++) xt[i] = i/(double)(m-1);
    vector<double> yp = krr.predict(xt);

    // Report a few predictions
    cout << fixed << setprecision(4);
    for(int i=0;i<m;i+=10){
        cout << "x=" << xt[i] << ", y_pred=" << yp[i] << "\n";
    }
    return 0;
}
```
This program implements kernel ridge regression using the RBF kernel to fit noisy samples of a sine wave. It constructs the Gram matrix K, adds ridge regularization λI, solves for α, and predicts via weighted kernel sums. This exemplifies the lazy/kernel regime where learning is entirely determined by a fixed similarity function rather than adapting features.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct TwoLayerNet {
    int m;       // width
    double lr;   // learning rate
    // Parameters: w (input->hidden), b (hidden bias), a (hidden->output)
    vector<double> w, b, a;
    // Initial copies to measure drift
    vector<double> w0, b0, a0;

    TwoLayerNet(int m_, double lr_, mt19937 &rng): m(m_), lr(lr_) {
        normal_distribution<double> nd(0.0, 1.0);
        w.resize(m); b.resize(m); a.resize(m);
        for(int j=0;j<m;j++){
            // NTK scaling 1/sqrt(m) keeps outputs O(1)
            w[j] = nd(rng) / sqrt((double)m);
            b[j] = nd(rng) / sqrt((double)m);
            a[j] = nd(rng) / sqrt((double)m);
        }
        w0=w; b0=b; a0=a;
    }

    static double relu(double z){ return z>0? z: 0.0; }
    static double relu_grad(double z){ return z>0? 1.0: 0.0; }

    // Forward pass for one input x
    double forward_one(double x, vector<double>* cache_h=nullptr, vector<double>* cache_z=nullptr) const {
        double sum = 0.0;
        for(int j=0;j<m;j++){
            double z = w[j]*x + b[j];
            double h = relu(z);
            if (cache_h) (*cache_h)[j] = h;
            if (cache_z) (*cache_z)[j] = z;
            sum += a[j]*h; // output is sum_j a_j * h_j
        }
        return sum;
    }

    // One full-batch GD step on MSE loss
    double train_step(const vector<double>& X, const vector<double>& Y){
        int n = (int)X.size();
        vector<double> grad_w(m,0.0), grad_b(m,0.0), grad_a(m,0.0);
        double loss = 0.0;
        // Accumulate gradients over all examples
        for(int i=0;i<n;i++){
            vector<double> h(m), z(m);
            double yhat = forward_one(X[i], &h, &z);
            double err = yhat - Y[i];
            loss += 0.5*err*err; // MSE/2
            for(int j=0;j<m;j++){
                grad_a[j] += err * h[j];
                double dz = err * a[j] * relu_grad(z[j]);
                grad_w[j] += dz * X[i];
                grad_b[j] += dz;
            }
        }
        // Average gradients
        for(int j=0;j<m;j++){
            grad_a[j] /= n; grad_w[j] /= n; grad_b[j] /= n;
        }
        loss /= n;
        // Gradient descent update
        for(int j=0;j<m;j++){
            a[j] -= lr * grad_a[j];
            w[j] -= lr * grad_w[j];
            b[j] -= lr * grad_b[j];
        }
        return loss;
    }

    double param_drift_norm() const {
        double s=0.0;
        for(int j=0;j<m;j++){
            s += (w[j]-w0[j])*(w[j]-w0[j]);
            s += (b[j]-b0[j])*(b[j]-b0[j]);
            s += (a[j]-a0[j])*(a[j]-a0[j]);
        }
        return sqrt(s);
    }
};

int main(){
    // Data: y = sin(2πx) + noise
    int n = 256;
    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.05);
    vector<double> X(n), Y(n);
    for(int i=0;i<n;i++){ X[i]=uni(rng); Y[i]=sin(2*M_PI*X[i]) + noise(rng); }

    TwoLayerNet net(/*width*/128, /*lr*/5e-2, rng);

    // Train for a few epochs
    int epochs = 500;
    for(int e=1;e<=epochs;e++){
        double loss = net.train_step(X, Y);
        if (e%50==0){
            cout << fixed << setprecision(6)
                 << "epoch=" << e
                 << ", loss=" << loss
                 << ", drift_norm=" << net.param_drift_norm() << "\n";
        }
    }

    // Show a few predictions
    for(double xt : {0.0, 0.25, 0.5, 0.75, 1.0}){
        double yhat = net.forward_one(xt);
        cout << fixed << setprecision(4) << "x=" << xt << ", y_pred=" << yhat << "\n";
    }
    return 0;
}
```
This code trains a simple two-layer ReLU network with full-batch gradient descent on the sine dataset. The parameter drift norm quantifies how far parameters move from initialization; significant drift indicates feature learning (rich regime). With a moderately large learning rate and finite width, features adapt during training.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Two-layer ReLU network at initialization, but we DO NOT train it.
// We compute Jacobian features and perform NTK kernel ridge on the residual y - f0.
struct NTKLinearized {
    int m;                    // width
    vector<double> w, b, a;   // parameters at init (scaled by 1/sqrt(m))
    double lambda;            // ridge

    NTKLinearized(int m_, double lambda_, mt19937 &rng): m(m_), lambda(lambda_) {
        normal_distribution<double> nd(0.0, 1.0);
        w.resize(m); b.resize(m); a.resize(m);
        for(int j=0;j<m;j++){
            w[j] = nd(rng) / sqrt((double)m);
            b[j] = nd(rng) / sqrt((double)m);
            a[j] = nd(rng) / sqrt((double)m);
        }
    }

    static double relu(double z){ return z>0? z: 0.0; }
    static double relu_grad(double z){ return z>0? 1.0: 0.0; }

    double f0(double x) const {
        double s=0.0;
        for(int j=0;j<m;j++) s += a[j]*relu(w[j]*x + b[j]);
        return s;
    }

    // Build Jacobian feature vector phi(x) for parameters [a_j, w_j, b_j] (length 3m)
    vector<double> phi(double x) const {
        vector<double> p(3*m);
        for(int j=0;j<m;j++){
            double z = w[j]*x + b[j];
            double h = relu(z);
            double g = relu_grad(z);
            p[j]       = h;            // d f / d a_j = h
            p[m + j]   = a[j] * g * x; // d f / d w_j = a_j * g * x
            p[2*m + j] = a[j] * g;     // d f / d b_j = a_j * g
        }
        return p;
    }

    // Fit NTK kernel ridge on residual r = y - f0
    void fit(const vector<double>& X, const vector<double>& Y){
        int n = (int)X.size();
        // Compute Phi (n x 3m); build K = Phi Phi^T using symmetry
        vector<vector<double>> K(n, vector<double>(n, 0.0));
        vector<vector<double>> PHI(n);
        vector<double> r(n);
        for(int i=0;i<n;i++){
            PHI[i] = phi(X[i]);
            r[i] = Y[i] - f0(X[i]);
        }
        for(int i=0;i<n;i++){
            for(int j=i;j<n;j++){
                double s=0.0;
                for(size_t k=0;k<PHI[i].size();++k) s += PHI[i][k]*PHI[j][k];
                K[i][j]=K[j][i]=s;
            }
            K[i][i] += lambda;
        }
        // Solve (K + lambda I) alpha = r; the ridge is already on the diagonal
        alpha = solve_linear(K, r);
        trainX = X;
        storePHI = move(PHI); // keep PHI to predict fast (or recompute on the fly)
    }

    // Gaussian elimination solver with partial pivoting
    static vector<double> solve_linear(vector<vector<double>> M, const vector<double>& b){
        int n = (int)M.size();
        for(int i=0;i<n;i++) M[i].push_back(b[i]);
        for(int col=0; col<n; ++col){
            int pivot = col;
            for(int r=col+1; r<n; ++r) if (fabs(M[r][col]) > fabs(M[pivot][col])) pivot = r;
            swap(M[pivot], M[col]);
            double piv = M[col][col];
            if (fabs(piv) < 1e-12) throw runtime_error("Singular matrix");
            for(int c=col; c<=n; ++c) M[col][c] /= piv;
            for(int r=col+1; r<n; ++r){
                double f = M[r][col];
                for(int c=col; c<=n; ++c) M[r][c] -= f * M[col][c];
            }
        }
        vector<double> x(n,0.0);
        for(int r=n-1; r>=0; --r){
            double s = M[r][n];
            for(int c=r+1;c<n;++c) s -= M[r][c]*x[c];
            x[r]=s;
        }
        return x;
    }

    // Predict using f_hat(x) = f0(x) + k_x^T alpha, where k_x[i] = phi(x) . phi(X_i)
    double predict_one(double x) const {
        vector<double> ph = phi(x);
        double s = f0(x);
        for(size_t i=0;i<trainX.size();++i){
            double dot=0.0;
            for(size_t k=0;k<ph.size();++k) dot += ph[k]*storePHI[i][k];
            s += alpha[i] * dot;
        }
        return s;
    }

    vector<double> predict(const vector<double>& Xtest) const {
        vector<double> out; out.reserve(Xtest.size());
        for(double xt: Xtest) out.push_back(predict_one(xt));
        return out;
    }

    // Stored state
    vector<double> alpha;
    vector<double> trainX;
    vector<vector<double>> storePHI;
};

int main(){
    // Data
    int n = 120;
    mt19937 rng(7);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.03);
    vector<double> X(n), Y(n);
    for(int i=0;i<n;i++){ X[i]=uni(rng); Y[i]=sin(2*M_PI*X[i]) + noise(rng); }

    NTKLinearized ntk(/*width*/64, /*lambda*/1e-4, rng);
    ntk.fit(X, Y);

    // Compare a few predictions
    for(double xt : {0.0, 0.25, 0.5, 0.75, 1.0}){
        cout << fixed << setprecision(4)
             << "x=" << xt
             << ", y_pred_NTK=" << ntk.predict_one(xt) << "\n";
    }
    return 0;
}
```
This code keeps a two-layer network at its random initialization and computes its Jacobian features to build the NTK Gram matrix. It then solves a kernel ridge regression on the residual y − f0 and predicts using the NTK kernel. This demonstrates that lazy training of the network matches kernel regression with the NTK, contrasting with the feature-learning behavior in the previous example.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Minimal skeleton (no training loop) showing how to toggle regimes:
// - Increase width m and decrease lr -> lazier behavior (smaller drift)
// - Decrease width or increase lr -> richer feature learning (larger drift)
struct ToggleExample {
    int m; double lr;
    ToggleExample(int m_, double lr_): m(m_), lr(lr_) {}
    void print_hint(){
        cout << "For width m=" << m << ", try lr=" << lr << ".\n";
        cout << "- If you increase m and reduce lr (e.g., m*4, lr/4), parameter drift shrinks (lazy).\n";
        cout << "- If you reduce m or increase lr, drift grows (feature learning).\n";
    }
};

int main(){
    ToggleExample a(1024, 1e-3); a.print_hint();
    ToggleExample b(128, 5e-2); b.print_hint();
    return 0;
}
```
This small skeleton emphasizes the practical control knobs: width and learning rate. Larger width with smaller learning rate promotes the lazy regime; smaller width or larger learning rate promotes feature learning. Use it alongside the previous examples to verify drift norms and behavior.