Feature Learning vs Kernel Regime
Key Points
- The kernel (lazy) regime keeps neural network parameters close to their initialization, making training equivalent to kernel regression with a fixed kernel such as the Neural Tangent Kernel (NTK).
- The feature learning (rich) regime lets parameters move significantly, so the network actively discovers new, data-dependent features that improve with depth and width.
- Kernel methods are convex and stable but scale poorly with dataset size because they require O(n^2) memory and O(n^3) time to solve.
- Feature learning scales better with data size and can discover useful representations, but optimization is non-convex and sensitive to hyperparameters.
- In the infinite-width limit with small learning rate, gradient descent on neural nets converges to kernel regression using the NTK.
- Choosing between regimes depends on data size, computational budget, and whether task-specific features need to be learned.
- You can simulate both regimes in C++: kernel ridge regression for the lazy regime and a small two-layer network trained with gradient descent for feature learning.
Prerequisites
- Linear algebra (vectors, matrices, eigenvalues) — Kernel methods use Gram matrices and linear solves; NTK dynamics depend on eigen-decompositions.
- Multivariable calculus and gradients — Understanding Jacobians, linearization, and gradient descent requires differentiation.
- Supervised learning and loss functions — Both regimes minimize losses like mean squared error using gradient-based methods or closed-form solutions.
- Kernel methods and RKHS basics — The kernel regime is kernel ridge regression with specific kernels (e.g., RBF, NTK).
- Optimization (gradient descent, learning rates) — Regime depends on step size and how parameters evolve during training.
- Neural network architectures — Feature learning depends on depth, width, activations, and parameter scaling.
- Numerical linear algebra — Stability of solving (K+λI)α=y and conditioning issues are central in practice.
- C++ programming fundamentals — To implement solvers and training loops safely and efficiently.
- Probability and statistics — Regularization, noise models, and generalization considerations rely on probabilistic thinking.
- Approximation theory / generalization — Helps reason about when fixed kernels suffice versus when learned representations are needed.
Detailed Explanation
1. Overview
Modern neural networks can operate in two distinct training behaviors: the lazy (kernel) regime and the rich (feature learning) regime. In the kernel regime, the network behaves almost linearly around its initialization; weights barely move, and predictions can be described by a fixed kernel, most famously the Neural Tangent Kernel (NTK). Training is essentially kernel regression: you compute a Gram matrix, solve a linear system, and obtain a closed-form predictor. In the feature learning regime, parameters move significantly and the network adapts its internal representations to the data, improving performance especially on complex tasks. This dichotomy clarifies why infinitely wide networks trained with small learning rates mimic kernel methods, while finite, moderately wide networks trained more aggressively learn useful intermediate features. Understanding both regimes helps you select algorithms and hyperparameters: the kernel regime favors theoretical guarantees and small datasets; the feature learning regime favors large, complex datasets and tasks requiring hierarchical abstractions. This concept also connects optimization (gradient descent dynamics), generalization (double descent), and scaling laws with practical engineering choices like width, learning rate, and regularization.
2. Intuition & Analogies
Imagine two ways to learn to recognize a song. In the first (kernel/lazy), you keep your hearing unchanged and just compare new songs to a library of references using a fixed similarity measure. You never get better at hearing; you only memorize how similar things sound. In the second (feature learning/rich), your ear gradually becomes more sensitive to important patterns—bass lines, chord progressions, rhythm—so your perception itself improves as you practice. Neural networks in the kernel regime act like the first person: their internal filters (features) barely change, and training reduces to combining fixed features to fit the labels. Kernel methods use a similarity function (kernel) to compare inputs and interpolate labels accordingly. In contrast, feature learning changes the filters: the network discovers better features during training, like learning to detect edges, shapes, or semantics in images. Width and learning rate play the role of practice style: extremely wide networks with tiny learning rates barely change their internal filters (lazy practice); narrower networks or larger learning rates allow substantial internal adaptation (rich practice). Because the lazy approach is linear at heart, it offers clean math and stable optimization but can plateau when data requires new abstractions. The rich approach can surpass it by inventing better features but at the cost of harder optimization and less transparent theory.
3. Formal Definition
Let f(x; \theta) be a network with parameters \theta \in \mathbb{R}^p initialized at \theta_0. Training is in the lazy (kernel) regime when the first-order expansion f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0) remains accurate throughout training; gradient descent on the squared loss is then equivalent to kernel regression with the Neural Tangent Kernel K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0), which stays effectively fixed. Training is in the feature learning (rich) regime when \|\theta_t - \theta_0\| grows large enough that this linearization breaks down: the Jacobian, and hence the effective kernel, changes during training, so the network's internal features adapt to the data. Width, parameter scaling (e.g., 1/\sqrt{m}), and learning rate determine which regime dominates.
4. When to Use
Use the kernel/lazy regime when: (1) datasets are small to medium (so O(n^2) memory and O(n^3) time are feasible), (2) you need convex optimization with closed-form solutions and uncertainty estimates, (3) you want stable, reproducible baselines or theory-grounded comparisons, and (4) features are expected to be sufficiently captured by fixed similarities (e.g., smooth functions with RBF kernels). Use the feature learning regime when: (1) datasets are large and hierarchical (images, audio, language) where deep learned features excel, (2) you need scalability in training/inference and cannot store an O(n^2) kernel, (3) transfer learning and representation learning matter, and (4) you can tolerate non-convex optimization with careful tuning. Hybrid strategies also exist: start in the lazy regime to diagnose data issues, then switch to feature learning for performance; or use NTK to initialize and precondition training. In settings like few-shot learning, kernels give strong baselines; for long-horizon generalization (e.g., compositional reasoning), feature learning is often necessary.
⚠️ Common Mistakes
- Assuming infinite width is required for the kernel regime: in practice, sufficiently large width combined with small learning rate and proper parameter scaling already induces lazy behavior. To confirm, monitor parameter drift norms relative to initialization.
- Believing kernels always underperform: on smooth, low-noise problems or small datasets, kernel methods can match or beat deep nets while being simpler to train.
- Ignoring computational costs: kernel regression needs O(n^2) memory and O(n^3) time; with n in the tens of thousands this becomes impractical without approximations (Nyström, random features).
- Mis-scaling learning rates: using a large learning rate with very wide networks can still push out of the lazy regime, invalidating NTK-based predictions. Conversely, tiny learning rates with moderate width may trap you in a lazy regime that underfits.
- Confusing training loss with generalization: kernel methods can interpolate training data with minimal norm solutions yet still overfit when kernels are mismatched; feature learning can overfit too without regularization and data augmentation.
- Numerical instability: forming and inverting Gram matrices without regularization (\lambda I) can cause blow-ups; always add ridge terms and use stable solvers (Cholesky) when possible.
- Overlooking initialization scaling (1/\sqrt{m}): omitting this in wide networks distorts gradient magnitudes and ruins NTK approximations.
Key Formulas
First-order linearization
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)
Explanation: The network is approximated by a linear function of parameter changes around initialization. This is accurate in the lazy regime where parameters move little.
Neural Tangent Kernel (NTK)
K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)
Explanation: The NTK measures similarity by how similarly parameters affect outputs at two inputs. In the lazy regime, this kernel remains effectively constant during training.
Function-space gradient flow
\frac{d}{dt} f_t(X) = -K \, (f_t(X) - y)
Explanation: Under gradient flow with square loss, predictions at training points follow linear dynamics governed by the NTK. Solutions exponentially approach labels when the kernel matrix is positive definite.
Closed-form solution of dynamics
f_t(X) = y + e^{-tK} (f_0(X) - y)
Explanation: This expresses how predictions converge over time. Each eigenmode decays at a rate proportional to its NTK eigenvalue.
Kernel ridge coefficients
\alpha = (K + \lambda I)^{-1} y
Explanation: Training in kernel ridge regression reduces to solving a linear system for coefficients. The ridge term improves numerical stability and controls complexity.
Kernel prediction
\hat{f}(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i)
Explanation: Predictions are kernel-weighted sums over training labels. Each training example contributes according to similarity with the test point.
NTK Gram via Jacobian
K = J J^\top, \quad J_{ij} = \partial f(x_i; \theta_0) / \partial \theta_j
Explanation: Stacking parameter gradients into a Jacobian gives the NTK Gram matrix as J J^\top. This lets you perform NTK regression without a closed-form kernel.
Parameter drift bound (informal)
\|\theta_t - \theta_0\| = O(1/\sqrt{m})
Explanation: With appropriate scaling and large width m, parameter movement shrinks, keeping training in the lazy regime. This justifies the linearization.
RBF kernel
k(x, y) = \exp\!\left(-\|x - y\|^2 / (2\sigma^2)\right)
Explanation: A popular smooth kernel controlled by bandwidth \sigma. Works well when target functions are smooth relative to input distance.
Gradient descent update
\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)
Explanation: Parameters move in the direction of the negative gradient scaled by learning rate \eta. Larger \eta tends to push training toward the rich regime.
Complexity Analysis
Kernel ridge regression requires O(n^2) memory for the Gram matrix, O(n^2) kernel evaluations to build it, O(n^3) time for a dense solve, and O(n) kernel evaluations per test prediction. Full-batch gradient descent on a width-m two-layer network costs O(nm) per step with O(m) memory, so total training cost scales linearly in n; this is why feature learning remains practical at dataset sizes where exact kernel methods are not. The NTK-linearized example below additionally pays O(n^2 m) to form the Gram matrix from its 3m Jacobian features before the O(n^3) solve.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct KernelRidgeRBF {
    double sigma;           // RBF bandwidth
    double lambda;          // ridge regularization
    vector<double> X;       // 1D inputs for simplicity
    vector<double> alpha;   // dual coefficients

    static double sqr(double x){ return x*x; }

    KernelRidgeRBF(double sigma_=0.2, double lambda_=1e-6): sigma(sigma_), lambda(lambda_) {}

    double kernel(double x, double y) const {
        double dist2 = sqr(x - y);
        return exp(-dist2 / (2.0 * sqr(sigma)));
    }

    // Solve linear system A a = b via Gaussian elimination with partial pivoting
    static vector<double> solve_linear(vector<vector<double>> A, vector<double> b){
        int n = (int)A.size();
        for(int i=0;i<n;i++) A[i].push_back(b[i]);
        // Forward elimination
        for(int col=0; col<n; ++col){
            // Pivot
            int pivot = col;
            for(int r=col+1; r<n; ++r) if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
            swap(A[pivot], A[col]);
            double piv = A[col][col];
            if (fabs(piv) < 1e-12) throw runtime_error("Singular matrix (add larger lambda)");
            // Normalize row
            for(int c=col; c<=n; ++c) A[col][c] /= piv;
            // Eliminate below
            for(int r=col+1; r<n; ++r){
                double f = A[r][col];
                for(int c=col; c<=n; ++c) A[r][c] -= f * A[col][c];
            }
        }
        // Back substitution
        vector<double> x(n,0.0);
        for(int r=n-1; r>=0; --r){
            double s = A[r][n];
            for(int c=r+1;c<n;++c) s -= A[r][c]*x[c];
            x[r]=s; // row is normalized
        }
        return x;
    }

    void fit(const vector<double>& x, const vector<double>& y){
        X = x;
        int n = (int)x.size();
        vector<vector<double>> K(n, vector<double>(n, 0.0));
        for(int i=0;i<n;i++){
            for(int j=0;j<n;j++){
                K[i][j] = kernel(x[i], x[j]);
            }
            K[i][i] += lambda; // ridge for numerical stability and regularization
        }
        alpha = solve_linear(K, y);
    }

    double predict_one(double x) const {
        double s = 0.0;
        for(size_t i=0;i<X.size();++i) s += alpha[i]*kernel(x, X[i]);
        return s;
    }

    vector<double> predict(const vector<double>& Xtest) const {
        vector<double> out; out.reserve(Xtest.size());
        for(double xt: Xtest) out.push_back(predict_one(xt));
        return out;
    }
};

int main(){
    // Generate synthetic 1D data: y = sin(2πx) + small noise
    int n = 100;
    vector<double> x(n), y(n);
    mt19937 rng(42);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.05);
    for(int i=0;i<n;i++){
        x[i] = uni(rng);
        y[i] = sin(2.0*M_PI*x[i]) + noise(rng);
    }

    KernelRidgeRBF krr(0.15, 1e-4);
    krr.fit(x, y);

    // Predict on a grid
    int m = 50;
    vector<double> xt(m);
    for(int i=0;i<m;i++) xt[i] = i/(double)(m-1);
    vector<double> yp = krr.predict(xt);

    // Report a few predictions
    cout << fixed << setprecision(4);
    for(int i=0;i<m;i+=10){
        cout << "x=" << xt[i] << ", y_pred=" << yp[i] << "\n";
    }
    return 0;
}
```
This program implements kernel ridge regression using the RBF kernel to fit noisy samples of a sine wave. It constructs the Gram matrix K, adds ridge regularization λI, solves for α, and predicts via weighted kernel sums. This exemplifies the lazy/kernel regime where learning is entirely determined by a fixed similarity function rather than adapting features.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct TwoLayerNet {
    int m;       // width
    double lr;   // learning rate
    // Parameters: w (input->hidden), b (hidden bias), a (hidden->output)
    vector<double> w, b, a;
    // Initial copies to measure drift
    vector<double> w0, b0, a0;

    TwoLayerNet(int m_, double lr_, mt19937 &rng): m(m_), lr(lr_) {
        normal_distribution<double> nd(0.0, 1.0);
        w.resize(m); b.resize(m); a.resize(m);
        for(int j=0;j<m;j++){
            // NTK scaling 1/sqrt(m) keeps outputs O(1)
            w[j] = nd(rng) / sqrt((double)m);
            b[j] = nd(rng) / sqrt((double)m);
            a[j] = nd(rng) / sqrt((double)m);
        }
        w0=w; b0=b; a0=a;
    }

    static double relu(double z){ return z>0? z: 0.0; }
    static double relu_grad(double z){ return z>0? 1.0: 0.0; }

    // Forward pass for one input x
    double forward_one(double x, vector<double>* cache_h=nullptr, vector<double>* cache_z=nullptr) const {
        double sum = 0.0;
        for(int j=0;j<m;j++){
            double z = w[j]*x + b[j];
            double h = relu(z);
            if (cache_h) (*cache_h)[j] = h;
            if (cache_z) (*cache_z)[j] = z;
            sum += a[j]*h; // output is sum_j a_j * h_j
        }
        return sum;
    }

    // One full-batch GD step on MSE loss
    double train_step(const vector<double>& X, const vector<double>& Y){
        int n = (int)X.size();
        vector<double> grad_w(m,0.0), grad_b(m,0.0), grad_a(m,0.0);
        double loss = 0.0;
        // Accumulate gradients over all examples
        for(int i=0;i<n;i++){
            vector<double> h(m), z(m);
            double yhat = forward_one(X[i], &h, &z);
            double err = yhat - Y[i];
            loss += 0.5*err*err; // MSE/2
            for(int j=0;j<m;j++){
                grad_a[j] += err * h[j];
                double dz = err * a[j] * relu_grad(z[j]);
                grad_w[j] += dz * X[i];
                grad_b[j] += dz;
            }
        }
        // Average gradients
        for(int j=0;j<m;j++){
            grad_a[j] /= n; grad_w[j] /= n; grad_b[j] /= n;
        }
        loss /= n;
        // Gradient descent update
        for(int j=0;j<m;j++){
            a[j] -= lr * grad_a[j];
            w[j] -= lr * grad_w[j];
            b[j] -= lr * grad_b[j];
        }
        return loss;
    }

    double param_drift_norm() const {
        double s=0.0;
        for(int j=0;j<m;j++){
            s += (w[j]-w0[j])*(w[j]-w0[j]);
            s += (b[j]-b0[j])*(b[j]-b0[j]);
            s += (a[j]-a0[j])*(a[j]-a0[j]);
        }
        return sqrt(s);
    }
};

int main(){
    // Data: y = sin(2πx) + noise
    int n = 256;
    mt19937 rng(123);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.05);
    vector<double> X(n), Y(n);
    for(int i=0;i<n;i++){ X[i]=uni(rng); Y[i]=sin(2*M_PI*X[i]) + noise(rng); }

    TwoLayerNet net(/*width*/128, /*lr*/5e-2, rng);

    // Train for a few epochs
    int epochs = 500;
    for(int e=1;e<=epochs;e++){
        double loss = net.train_step(X, Y);
        if (e%50==0){
            cout << fixed << setprecision(6)
                 << "epoch=" << e
                 << ", loss=" << loss
                 << ", drift_norm=" << net.param_drift_norm() << "\n";
        }
    }

    // Show a few predictions
    for(double xt : {0.0, 0.25, 0.5, 0.75, 1.0}){
        double yhat = net.forward_one(xt);
        cout << fixed << setprecision(4) << "x=" << xt << ", y_pred=" << yhat << "\n";
    }
    return 0;
}
```
This code trains a simple two-layer ReLU network with full-batch gradient descent on the sine dataset. The parameter drift norm quantifies how far parameters move from initialization; significant drift indicates feature learning (rich regime). With a moderately large learning rate and finite width, features adapt during training.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Two-layer ReLU network at initialization, but we DO NOT train it.
// We compute Jacobian features and perform NTK kernel ridge on the residual y - f0.
struct NTKLinearized {
    int m;                    // width
    vector<double> w, b, a;   // parameters at init (scaled by 1/sqrt(m))
    double lambda;            // ridge

    NTKLinearized(int m_, double lambda_, mt19937 &rng): m(m_), lambda(lambda_) {
        normal_distribution<double> nd(0.0, 1.0);
        w.resize(m); b.resize(m); a.resize(m);
        for(int j=0;j<m;j++){
            w[j] = nd(rng) / sqrt((double)m);
            b[j] = nd(rng) / sqrt((double)m);
            a[j] = nd(rng) / sqrt((double)m);
        }
    }

    static double relu(double z){ return z>0? z: 0.0; }
    static double relu_grad(double z){ return z>0? 1.0: 0.0; }

    double f0(double x) const {
        double s=0.0;
        for(int j=0;j<m;j++) s += a[j]*relu(w[j]*x + b[j]);
        return s;
    }

    // Build Jacobian feature vector phi(x) for parameters [a_j, w_j, b_j] (length 3m)
    vector<double> phi(double x) const {
        vector<double> p(3*m);
        for(int j=0;j<m;j++){
            double z = w[j]*x + b[j];
            double h = relu(z);
            double g = relu_grad(z);
            p[j]       = h;            // d f / d a_j = h
            p[m + j]   = a[j] * g * x; // d f / d w_j = a_j * g * x
            p[2*m + j] = a[j] * g;     // d f / d b_j = a_j * g
        }
        return p;
    }

    // Fit NTK kernel ridge on residual r = y - f0
    void fit(const vector<double>& X, const vector<double>& Y){
        int n = (int)X.size();
        // Compute Phi (n x 3m); build K = Phi Phi^T using symmetry
        vector<vector<double>> K(n, vector<double>(n, 0.0));
        vector<vector<double>> PHI(n);
        vector<double> r(n);
        for(int i=0;i<n;i++){
            PHI[i] = phi(X[i]);
            r[i] = Y[i] - f0(X[i]);
        }
        for(int i=0;i<n;i++){
            for(int j=i;j<n;j++){
                double s=0.0;
                for(size_t k=0;k<PHI[i].size();++k) s += PHI[i][k]*PHI[j][k];
                K[i][j]=K[j][i]=s;
            }
            K[i][i] += lambda;
        }
        // Solve (K + lambda I) alpha = r; the ridge is already on the diagonal
        alpha = solve_linear(K, r);
        trainX = X;
        storePHI = move(PHI); // keep PHI to predict fast (or recompute on the fly)
    }

    // Gaussian elimination solver with partial pivoting
    static vector<double> solve_linear(vector<vector<double>> M, const vector<double>& b){
        int n = (int)M.size();
        for(int i=0;i<n;i++) M[i].push_back(b[i]);
        for(int col=0; col<n; ++col){
            int pivot = col;
            for(int r=col+1; r<n; ++r) if (fabs(M[r][col]) > fabs(M[pivot][col])) pivot = r;
            swap(M[pivot], M[col]);
            double piv = M[col][col];
            if (fabs(piv) < 1e-12) throw runtime_error("Singular matrix");
            for(int c=col; c<=n; ++c) M[col][c] /= piv;
            for(int r=col+1; r<n; ++r){
                double f = M[r][col];
                for(int c=col; c<=n; ++c) M[r][c] -= f * M[col][c];
            }
        }
        vector<double> x(n,0.0);
        for(int r=n-1; r>=0; --r){
            double s = M[r][n];
            for(int c=r+1;c<n;++c) s -= M[r][c]*x[c];
            x[r]=s;
        }
        return x;
    }

    // Predict using f_hat(x) = f0(x) + k_x^T alpha, where k_x[i] = phi(x) . phi(X_i)
    double predict_one(double x) const {
        vector<double> ph = phi(x);
        double s = f0(x);
        for(size_t i=0;i<trainX.size();++i){
            double dot=0.0;
            for(size_t k=0;k<ph.size();++k) dot += ph[k]*storePHI[i][k];
            s += alpha[i] * dot;
        }
        return s;
    }

    vector<double> predict(const vector<double>& Xtest) const {
        vector<double> out; out.reserve(Xtest.size());
        for(double xt: Xtest) out.push_back(predict_one(xt));
        return out;
    }

    // Stored state
    vector<double> alpha;
    vector<double> trainX;
    vector<vector<double>> storePHI;
};

int main(){
    // Data
    int n = 120;
    mt19937 rng(7);
    uniform_real_distribution<double> uni(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.03);
    vector<double> X(n), Y(n);
    for(int i=0;i<n;i++){ X[i]=uni(rng); Y[i]=sin(2*M_PI*X[i]) + noise(rng); }

    NTKLinearized ntk(/*width*/64, /*lambda*/1e-4, rng);
    ntk.fit(X, Y);

    // Compare a few predictions
    for(double xt : {0.0, 0.25, 0.5, 0.75, 1.0}){
        cout << fixed << setprecision(4)
             << "x=" << xt
             << ", y_pred_NTK=" << ntk.predict_one(xt) << "\n";
    }
    return 0;
}
```
This code keeps a two-layer network at its random initialization and computes its Jacobian features to build the NTK Gram matrix. It then solves a kernel ridge regression on the residual y − f0 and predicts using the NTK kernel. This demonstrates that lazy training of the network matches kernel regression with the NTK, contrasting with the feature-learning behavior in the previous example.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Minimal skeleton (no training loop) showing how to toggle regimes:
// - Increase width m and decrease lr -> lazier behavior (smaller drift)
// - Decrease width or increase lr -> richer feature learning (larger drift)
struct ToggleExample {
    int m; double lr;
    ToggleExample(int m_, double lr_): m(m_), lr(lr_) {}
    void print_hint(){
        cout << "For width m=" << m << ", try lr=" << lr << ".\n";
        cout << "- If you increase m and reduce lr (e.g., m*4, lr/4), parameter drift shrinks (lazy).\n";
        cout << "- If you reduce m or increase lr, drift grows (feature learning).\n";
    }
};

int main(){
    ToggleExample a(1024, 1e-3); a.print_hint();
    ToggleExample b(128, 5e-2); b.print_hint();
    return 0;
}
```
This small skeleton emphasizes the practical control knobs: width and learning rate. Larger width with smaller learning rate promotes the lazy regime; smaller width or larger learning rate promotes feature learning. Use it alongside the previous examples to verify drift norms and behavior.