How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Advanced

Mean Field Theory of Neural Networks

Key Points

  • Mean field theory treats very wide, randomly initialized neural networks as averaging machines in which each neuron behaves like a sample from a common distribution.
  • As width grows, pre-activations in each layer become Gaussian by the Central Limit Theorem, letting us predict variances and correlations layer by layer.
  • In the infinite-width limit, the network’s output distribution becomes a Gaussian process with a kernel computable by a simple recurrence (the NNGP kernel).
  • For ReLU networks, variance and correlation propagate via closed-form arc-cosine formulas, enabling exact predictions at initialization.
  • Choosing weight variance using mean field (e.g., He or Xavier) prevents signal explosion/decay and keeps variances stable across depth.
  • The “edge of chaos” is characterized by a slope parameter \(\chi\); when \(\chi=1\), signals (and gradients) neither explode nor vanish.
  • Mean field predictions are most accurate at large width and at initialization; finite width and training dynamics introduce deviations.
  • These tools guide hyperparameters, diagnose vanishing/exploding behavior, and connect deep nets to kernels (NNGP, NTK) for theory and practice.

Prerequisites

  • Gaussian random variables — Mean field recurrences assume Gaussian pre-activations; understanding variance, covariance, and bivariate Gaussians is essential.
  • Central Limit Theorem and Law of Large Numbers — They justify Gaussian pre-activations and self-averaging across wide layers.
  • Inner products and norms — The base kernel \(K^{(0)}\) is the normalized inner product; covariance depends on norms and angles.
  • Feedforward neural networks — You need the structure of layers, activations, weights, and biases to apply mean-field ideas.
  • Derivatives of activations — NTK and edge-of-chaos analyses involve expectations over activation derivatives.
  • Random number generation — Simulations use Gaussian RNGs for weights, biases, and inputs.
  • Kernel methods — NNGP and NTK connect deep nets to kernel regression and learning dynamics.
  • Trigonometric functions and angles — ReLU arc-cosine formulas use angles between inputs via arccos and trigonometric identities.

Detailed Explanation


01 Overview

Mean field theory (MFT) of neural networks studies randomly initialized, very wide networks by replacing complicated, high-dimensional randomness with simpler aggregate behavior. The key idea is self-averaging: when each neuron receives a sum of many small, independent contributions, the Central Limit Theorem implies the pre-activation becomes approximately Gaussian. This lets us track only low-order statistics—means, variances, and correlations—through layers, rather than the entire weight matrices. Practically, MFT yields recurrences describing how variance and input correlations propagate through depth, providing principled rules for weight scaling (e.g., He or Xavier initialization). In the infinite-width limit, outputs of deep networks converge to a Gaussian process (NNGP), with a kernel defined by these recurrences. Closely related, the Neural Tangent Kernel (NTK) characterizes training dynamics for infinitesimal learning rates or at infinite width. These perspectives explain stability at initialization, predict signal propagation, and clarify when and why gradients vanish or explode. For popular activations like ReLU or tanh, the recurrences admit closed-form or efficiently computable expressions. Mean field tools thus bridge deep learning practice and probabilistic/statistical physics methods, yielding both intuition and concrete formulas usable for model design.

02 Intuition & Analogies

Imagine each neuron as a weather station summing many small, independent sensor readings (its inputs), each with tiny random weightings. For a single reading, noise dominates; but average enough sensors and the result becomes reliably bell-shaped (Gaussian) due to the Central Limit Theorem. In a very wide layer, each neuron receives a large number of such inputs, so its pre-activation is like a noisy average—well-approximated by a Gaussian with predictable mean and variance. Push this through a nonlinearity (like ReLU), and you get a new distribution with a new variance; feed that into the next layer and repeat. Instead of tracking every station’s detailed wiring, you track the weather pattern’s key summaries—variance (how turbulent) and correlation (how similar two cities’ weather is). If two input points are similar (highly correlated), their weather patterns remain similar after passing through layers, governed by a simple correlation update rule. When layers are immensely wide, randomness across different neurons averages out so thoroughly that the entire network acts like sampling from a single Gaussian process: for any set of inputs, the output vector is jointly Gaussian with a kernel that’s easy to compute. Tuning the weight scale is like adjusting sensor gain: too small and signals die out (calm weather everywhere); too large and everything saturates or becomes chaotic (storms everywhere). The sweet spot—edge of chaos—keeps distinctions among inputs alive across many layers, making learning easier.

03 Formal Definition

Consider a fully connected feedforward network with widths \(n_0, n_1, \dots, n_L\) and activation \(\phi\). Weights and biases are i.i.d. with \(W_{ij}^{(\ell)} \sim \mathcal{N}(0, \sigma_w^2/n_{\ell-1})\) and \(b_i^{(\ell)} \sim \mathcal{N}(0, \sigma_b^2)\). Define pre-activations and activations by \(z_i^{(\ell)}(x) = \sum_{j=1}^{n_{\ell-1}} W_{ij}^{(\ell)}\, a_j^{(\ell-1)}(x) + b_i^{(\ell)}\) and \(a_i^{(\ell)}(x) = \phi(z_i^{(\ell)}(x))\), with \(a^{(0)}(x) = x\). Under standard assumptions (i.i.d. inputs with finite second moments, independence across units), as \(\min_\ell n_\ell \to \infty\), the empirical distribution of \(z_i^{(\ell)}(x)\) across \(i\) converges to a Gaussian \(\mathcal{N}(0, q^{(\ell)}(x))\). Variance propagates via the mean-field recursion \(q^{(\ell+1)}(x) = \sigma_w^2\, \mathbb{E}_{z\sim\mathcal{N}(0,\,q^{(\ell)}(x))}[\phi(z)^2] + \sigma_b^2\). For two inputs \(x, x'\), define the layer-\(\ell\) kernel \(K^{(\ell)}(x,x') = \mathbb{E}[a_i^{(\ell)}(x)\, a_i^{(\ell)}(x')]\) (independent of \(i\) by symmetry). Then \(K^{(0)}(x,x') = \frac{1}{n_0}\, x^\top x'\), and for \(\ell \ge 0\), \(K^{(\ell+1)}(x,x') = \sigma_w^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\,\Sigma^{(\ell)})}[\phi(u)\phi(v)] + \sigma_b^2\), where \(\Sigma^{(\ell)} = \begin{pmatrix} K^{(\ell)}(x,x) & K^{(\ell)}(x,x') \\ K^{(\ell)}(x',x) & K^{(\ell)}(x',x') \end{pmatrix}\). Consequently, the network output (with a random linear readout) converges in finite-dimensional distributions to a Gaussian process \(\mathcal{GP}(0, K^{(L)})\).

04 When to Use

  • Selecting initialization scales: Use the variance recursion to choose \(\sigma_w^2\) and \(\sigma_b^2\) (e.g., He for ReLU, Xavier for tanh) so that \(q^{(\ell)}\) remains stable across depth.
  • Diagnosing vanishing/exploding signals or gradients: If \(q^{(\ell)}\) shrinks/grows, activations collapse/saturate; if the correlation map’s slope \(\chi\) is far from 1, gradients vanish/explode. Adjust \(\sigma_w^2\), \(\sigma_b^2\), or the activation.
  • Comparing architectures/activations: Predict which nonlinearity preserves information (correlation) best over many layers; choose depth accordingly.
  • Kernel viewpoints: For infinite width, use the NNGP to compute predictive distributions without training, or the NTK to approximate gradient descent dynamics. These give baselines and theoretical bounds.
  • Sanity checks for inputs: Normalize inputs so \(K^{(0)}(x,x)\) is \(\mathcal{O}(1)\) to match the assumptions of the recurrences.
  • Understanding batch norm or residual connections: Although classical MFT assumes i.i.d. layers without normalization, extensions analyze how these components affect \(q\), correlations, and \(\chi\).

⚠️ Common Mistakes

  • Confusing fan-in scaling: Mean field requires \(\operatorname{Var}(W_{ij}) = \sigma_w^2/n_{\ell-1}\), not a constant variance; otherwise variance blows up with width.
  • Ignoring biases in recurrences: \(\sigma_b^2\) shifts variances and correlations; setting it incorrectly changes fixed points and can induce unwanted drift.
  • Overtrusting infinite-width predictions at small width: Finite networks deviate due to higher-order correlations; treat MFT as a guide, not gospel.
  • Assuming results hold during training: NNGP characterizes initialization; NTK gives a training approximation in specific regimes (small learning rate, wide layers). General training can drift away from mean-field assumptions.
  • Misusing activation formulas: Closed-form arc-cosine kernels apply to ReLU and some homogeneous activations; tanh/sigmoid require numerical expectations.
  • Forgetting input scaling: If inputs are not normalized, the base kernel \(K^{(0)}\) is off, invalidating later layers’ predictions.
  • Mixing fan-in/fan-out variants (Xavier vs He): ReLU typically uses He (variance \(2/\text{fan-in}\)); tanh uses Xavier (\(1/\text{fan-in}\)).
  • Overlooking the correlation slope \(\chi\): Stability around the fixed point depends on \(\chi\); setting \(\sigma_w^2\) without checking \(\chi\) can cause vanishing/exploding gradients.

Key Formulas

Pre-activation

\[ z_i^{(\ell)} = \sum_{j=1}^{n_{\ell-1}} W_{ij}^{(\ell)}\, a_j^{(\ell-1)} + b_i^{(\ell)} \]

Explanation: Each neuron’s pre-activation is a sum of many small random contributions plus bias. With i.i.d. weights and large fan-in, this sum becomes approximately Gaussian.

Variance Recurrence

\[ q^{(\ell+1)} = \sigma_w^2\, \mathbb{E}_{z\sim\mathcal{N}(0,\,q^{(\ell)})}\big[\phi(z)^2\big] + \sigma_b^2 \]

Explanation: Layer-wise variance evolves by pushing a Gaussian through the activation and rescaling by weight variance, then adding bias variance. It predicts signal growth or decay across depth.

NNGP Kernel Recurrence

\[ K^{(0)}(x,x') = \frac{1}{n_0}\, x^\top x', \qquad K^{(\ell+1)}(x,x') = \sigma_w^2\, \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\,\Sigma^{(\ell)})}\big[\phi(u)\phi(v)\big] + \sigma_b^2 \]

Explanation: Starting from the input inner product (normalized by input dimension), the kernel at the next layer is the expected product of activations under a joint Gaussian with covariance from the previous layer.

Layer Covariance

\[ \Sigma^{(\ell)}(x,x') = \begin{pmatrix} K^{(\ell)}(x,x) & K^{(\ell)}(x,x') \\ K^{(\ell)}(x',x) & K^{(\ell)}(x',x') \end{pmatrix} \]

Explanation: This matrix captures variances and covariance of pre-activations for a pair of inputs. It parameterizes the bivariate Gaussian used in the kernel expectation.

ReLU Arc-cosine Expectation

\[ \mathbb{E}[\mathrm{ReLU}(u)\,\mathrm{ReLU}(v)] = \frac{\sqrt{q_u q_v}}{2\pi}\big(\sin\theta + (\pi - \theta)\cos\theta\big), \qquad \cos\theta = \frac{\mathrm{Cov}(u,v)}{\sqrt{q_u q_v}} \]

Explanation: For ReLU, the expectation of the product of activations has a closed form depending only on the variances and correlation (encoded by the angle \(θ\)). This yields exact NNGP recurrences.

ReLU Second Moment

\[ \mathbb{E}[\mathrm{ReLU}(z)^2] = \frac{q}{2} \quad \text{for } z \sim \mathcal{N}(0, q) \]

Explanation: A zero-mean Gaussian pushed through ReLU has half the variance kept. This simplifies the variance recurrence.
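The one-half factor follows from the symmetry of the zero-mean Gaussian: ReLU zeroes out the negative half-line, which carries exactly half the second moment,

\[ \mathbb{E}[\mathrm{ReLU}(z)^2] = \int_{0}^{\infty} z^2\, p(z)\, dz = \tfrac{1}{2}\int_{-\infty}^{\infty} z^2\, p(z)\, dz = \tfrac{q}{2}. \]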

Mean-field Scaling

\[ \mathrm{Var}\big(W_{ij}^{(\ell)}\big) = \frac{\sigma_w^2}{n_{\ell-1}}, \qquad \mathrm{Var}\big(b_i^{(\ell)}\big) = \sigma_b^2 \]

Explanation: Weights are scaled by fan-in so that pre-activation variance remains \(O(1)\) as width grows. Bias variance is typically small or zero.

Edge-of-Chaos Slope

\[ \chi = \sigma_w^2\, \mathbb{E}_{z\sim\mathcal{N}(0,\,q^*)}\big[\phi'(z)^2\big] \]

Explanation: The slope of the correlation map at the fixed point \(q^*\) controls stability. If \(χ<1\), signals contract; if \(χ>1\), they expand; \(χ=1\) is critical.

ReLU Criticality

\[ \chi_{\mathrm{ReLU}} = \frac{\sigma_w^2}{2} \quad (\text{with } \sigma_b^2 = 0,\ q^* = 1) \]

Explanation: For ReLU, \(\mathbb{E}[\phi'(z)^2] = 1/2\). Thus the edge of chaos occurs at \(\sigma_w^2 = 2\).

NNGP Limit

\[ y(x) \sim \mathcal{GP}\big(0,\ K^{(L)}(x,x')\big) \]

Explanation: At infinite width, the network induces a Gaussian process whose kernel is the depth-L kernel computed by the recurrence. Predictions can be made by kernel regression.

NTK Recurrence (schematic)

\[ \Theta^{(\ell+1)}(x,x') = \Theta^{(\ell)}(x,x')\,\dot{\Sigma}^{(\ell+1)}(x,x') + K^{(\ell+1)}(x,x'), \qquad \dot{\Sigma}^{(\ell+1)} = \sigma_w^2\, \mathbb{E}\big[\phi'(u)\phi'(v)\big] \]

Explanation: The NTK across layers evolves using a derivative kernel and the NNGP. At infinite width, \(Θ\) remains constant during training, predicting linearized dynamics.

Central Limit Theorem

\[ \frac{1}{\sqrt{n}} \sum_{j=1}^{n} X_j \xrightarrow{\ d\ } \mathcal{N}\big(0, \mathrm{Var}(X_1)\big) \quad \text{for zero-mean i.i.d. } X_j \]

Explanation: Sums of many i.i.d. contributions tend to a Gaussian. This underpins Gaussian pre-activations and mean-field recurrences.

Normalized Correlation Map (ReLU, He)

\[ f_{\mathrm{ReLU}}(c) = \frac{1}{\pi}\left(\sqrt{1 - c^2} + (\pi - \arccos c)\, c\right) \]

Explanation: With ReLU and variance-preserving weights (\(\sigma_w^2 = 2,\ \sigma_b^2 = 0\)), the next-layer correlation is a closed-form function of the current correlation.

Complexity Analysis

All provided C++ simulations operate by forward-propagating vectors through random layers and aggregating statistics; their costs depend on layer widths and input dimension. For a single fully connected layer with fan-in n_in and fan-out n_out, computing z = W a + b requires O(n_in · n_out) multiplications/additions and O(n_out) extra for bias and nonlinearity. When we avoid storing full matrices and instead stream weights from an RNG, space reduces to O(n_in + n_out) for holding the current and next activations (plus negligible RNG state), instead of O(n_in · n_out). For multiple layers, total time is O(∑_ℓ n_{ℓ−1} n_ℓ); in common wide settings with nearly constant width n, time is O(L n²) and space is O(n). Kernel estimation by sample averages across width (empirical NNGP) also costs O(n_ℓ) per layer once activations are computed; the dominant cost remains the matrix–vector multiplies. The theoretical kernel recursion (arc-cosine for ReLU) is O(1) per layer per input pair, since it uses only scalar algebra (angles and trigonometric functions), making it far faster than Monte Carlo or explicit network simulation. Monte Carlo checks of Gaussian expectations scale linearly with the number of samples M, i.e., O(M) time and O(1) space, but converge at rate O(1/√M), so large M may be required for tight confidence intervals. Overall, empirical demonstrations with n in the thousands and L ≤ 3 are tractable on commodity CPUs when using streamed weight generation, while the theoretical recurrences are essentially instantaneous and ideal for hyperparameter tuning and sanity checks before training.

Code Examples

Variance propagation and approximate Gaussianity in a wide ReLU network (He initialization)
#include <bits/stdc++.h>
using namespace std;

// ReLU activation
inline double relu(double x) { return x > 0.0 ? x : 0.0; }

// Compute sample mean and variance of a vector
pair<double,double> mean_var(const vector<double>& v) {
    double m = 0.0;
    for (double x : v) m += x;
    m /= (double)v.size();
    double s2 = 0.0;
    for (double x : v) { double d = x - m; s2 += d*d; }
    s2 /= (double)v.size();
    return {m, s2};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Settings
    int d0 = 512;           // input dimension (fan-in of first layer)
    int width = 4000;       // width for hidden layers (wide)
    int L = 5;              // number of layers to simulate
    double sigma_w2 = 2.0;  // He scaling, variance-preserving for ReLU: Var(W_ij) = 2/fan_in
    double sigma_b2 = 0.0;  // bias variance

    // Random generators
    std::mt19937_64 rng(12345);
    std::normal_distribution<double> stdn(0.0, 1.0);

    // Create Gaussian input with unit-variance entries
    vector<double> a_prev(d0);
    for (int j = 0; j < d0; ++j) a_prev[j] = stdn(rng); // mean 0, var 1

    // Theoretical variance recursion q^{(l)}
    // q^{(0)} is the input variance per coordinate. For N(0,1) inputs, q^{(0)} = 1.
    double q_theory = 1.0;

    cout << fixed << setprecision(6);
    cout << "Layer 0 (input): q_theory= " << q_theory << "\n";

    for (int ell = 1; ell <= L; ++ell) {
        int n_in = (ell == 1) ? d0 : width;
        int n_out = width; // keep width constant
        double w_scale = sqrt(sigma_w2 / (double)n_in);

        vector<double> z(n_out, 0.0), a(n_out, 0.0);

        // Stream weights: for each output neuron, accumulate the dot product
        for (int i = 0; i < n_out; ++i) {
            double zi = 0.0;
            for (int j = 0; j < n_in; ++j) {
                double wij = w_scale * stdn(rng); // W_ij ~ N(0, sigma_w2 / n_in)
                zi += wij * a_prev[j];
            }
            double bi = sqrt(sigma_b2) * stdn(rng);
            zi += bi;
            z[i] = zi;
            a[i] = relu(zi);
        }

        auto mvz = mean_var(z);
        auto mva = mean_var(a);

        // Theoretical update for ReLU: E[ReLU(z)^2] = q/2 when z ~ N(0, q)
        double q_next = sigma_w2 * (q_theory / 2.0) + sigma_b2;

        cout << "Layer " << ell << ": preact mean= " << mvz.first
             << ", preact var(empirical)= " << mvz.second
             << ", act var(empirical)= " << mva.second
             << ", q_theory(next)= " << q_next << "\n";

        // Prepare next layer
        a_prev.swap(a);
        q_theory = q_next;
    }

    // Note: Empirical pre-activations should have near-zero mean and variance close to q_theory at each layer.
    return 0;
}

This program simulates forward propagation through several wide ReLU layers with He scaling (Var(W)=2/fan_in). It streams random weights (no large matrices stored), computes pre-activations and activations, and reports empirical pre-activation variance alongside the mean-field theoretical variance computed by q^{l+1} = sigma_w^2 * (q^l / 2) + sigma_b^2. As width increases, empirical and theoretical variances match closely; pre-activation means approach zero and distributions look Gaussian by aggregation.

Time: O(L · width²) when width is constant across layers; more precisely O(∑_ℓ n_{ℓ−1} n_ℓ). Space: O(width) for activations (plus O(1) RNG state).
Empirical vs theoretical NNGP kernel for ReLU across layers
#include <bits/stdc++.h>
using namespace std;

inline double relu(double x) { return x > 0.0 ? x : 0.0; }

// Arc-cosine expectation for ReLU: E[ReLU(u) ReLU(v)] given variances q1, q2 and covariance c12
double relu_expectation(double q1, double q2, double c12) {
    if (q1 <= 0 || q2 <= 0) return 0.0;
    double corr = c12 / sqrt(q1 * q2);
    corr = max(-1.0, min(1.0, corr));
    double theta = acos(corr);
    return sqrt(q1 * q2) * (sin(theta) + (M_PI - theta) * cos(theta)) / (2.0 * M_PI);
}

// Theoretical NNGP kernel recursion for ReLU
struct KernelState { double Kxx, Kxpxp, Kxxp; };

KernelState relu_kernel_theory(const vector<double>& x, const vector<double>& xp,
                               int L, double sigma_w2, double sigma_b2) {
    int d = (int)x.size();
    auto dot = [&](const vector<double>& a, const vector<double>& b) {
        long double s = 0;
        for (int i = 0; i < d; ++i) s += (long double)a[i] * b[i];
        return (double)s;
    };
    double Kxx = dot(x, x) / (double)d;
    double Kxpxp = dot(xp, xp) / (double)d;
    double Kxxp = dot(x, xp) / (double)d;
    for (int ell = 0; ell < L; ++ell) {
        double q1 = Kxx, q2 = Kxpxp, c12 = Kxxp;
        double Exx = sigma_w2 * (q1 / 2.0) + sigma_b2; // E[ReLU(u)^2] = q/2
        double Expxp = sigma_w2 * (q2 / 2.0) + sigma_b2;
        double Exxp = sigma_w2 * relu_expectation(q1, q2, c12) + sigma_b2;
        Kxx = Exx; Kxpxp = Expxp; Kxxp = Exxp;
    }
    return {Kxx, Kxpxp, Kxxp};
}

// Empirical kernel via a single wide random network (streamed weights)
KernelState relu_kernel_empirical(const vector<double>& x, const vector<double>& xp,
                                  int L, int width, double sigma_w2, double sigma_b2, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> stdn(0.0, 1.0);

    vector<double> a = x, ap = xp;

    auto layer = [&](const vector<double>& ain, const vector<double>& apin, int n_in, int n_out) {
        double w_scale = sqrt(sigma_w2 / (double)n_in);
        vector<double> aout(n_out), aoutp(n_out);
        for (int i = 0; i < n_out; ++i) {
            double zi = 0.0, zip = 0.0;
            for (int j = 0; j < n_in; ++j) {
                double wij = w_scale * stdn(rng);
                zi += wij * ain[j];
                zip += wij * apin[j]; // same weights for both inputs
            }
            double bi = sqrt(sigma_b2) * stdn(rng);
            zi += bi; zip += bi; // same bias shared across inputs
            aout[i] = relu(zi);
            aoutp[i] = relu(zip);
        }
        return pair<vector<double>, vector<double>>(move(aout), move(aoutp));
    };

    int n_in = (int)x.size();
    for (int ell = 0; ell < L; ++ell) {
        auto [anext, apnext] = layer(a, ap, n_in, width);
        a.swap(anext); ap.swap(apnext);
        n_in = width;
    }

    auto dot = [&](const vector<double>& u, const vector<double>& v) {
        long double s = 0;
        for (int i = 0; i < (int)u.size(); ++i) s += (long double)u[i] * v[i];
        return (double)s;
    };
    double Kxx = dot(a, a) / (double)a.size();
    double Kxpxp = dot(ap, ap) / (double)ap.size();
    double Kxxp = dot(a, ap) / (double)a.size();
    return {Kxx, Kxpxp, Kxxp};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int d0 = 512;           // input dimension
    int L = 3;              // number of ReLU layers
    int width = 4000;       // wide hidden layers
    double sigma_w2 = 2.0;  // He scaling in mean field
    double sigma_b2 = 0.0;  // no bias

    // Build two inputs with controlled correlation
    std::mt19937_64 rng(4242);
    std::normal_distribution<double> stdn(0.0, 1.0);
    vector<double> x(d0), xp(d0);
    double target_corr = 0.5; // desired cosine similarity
    for (int i = 0; i < d0; ++i) {
        double u = stdn(rng), v = stdn(rng);
        x[i] = u;
        xp[i] = target_corr * u + sqrt(1.0 - target_corr * target_corr) * v;
    }

    auto th = relu_kernel_theory(x, xp, L, sigma_w2, sigma_b2);
    auto emp = relu_kernel_empirical(x, xp, L, width, sigma_w2, sigma_b2, 2024ULL);

    cout << fixed << setprecision(6);
    cout << "Theoretical K(x,x)= " << th.Kxx << ", K(x',x')= " << th.Kxpxp << ", K(x,x')= " << th.Kxxp << "\n";
    cout << " Empirical K(x,x)= " << emp.Kxx << ", K(x',x')= " << emp.Kxpxp << ", K(x,x')= " << emp.Kxxp << "\n";

    double c_th = th.Kxxp / sqrt(th.Kxx * th.Kxpxp);
    double c_emp = emp.Kxxp / sqrt(emp.Kxx * emp.Kxpxp);
    cout << "Correlation (theory)= " << c_th << ", (empirical)= " << c_emp << "\n";

    return 0;
}

This code compares the NNGP kernel predicted by mean-field theory to an empirical estimate computed by forward-propagating two inputs through the same wide random ReLU network. The theoretical kernel uses the arc-cosine expectation to update variances and covariance across layers. The empirical kernel is the average product of activations across the final layer. With sufficiently large width, the two match closely, illustrating the Gaussian process limit.

Time: O(L · width²) for the empirical network (dominant), O(L) for the theoretical recursion. Space: Empirical: O(width) for activations; Theoretical: O(1).
Correlation map and edge-of-chaos for ReLU (analytic iteration)
#include <bits/stdc++.h>
using namespace std;

// ReLU correlation map under mean field with given sigma_w^2, sigma_b^2.
// We propagate (q, c) pairs assuming identical variances for the two inputs.
struct QC { double q; double c; };

// E[ReLU(u) ReLU(v)] with angle theta (derived from correlation c)
static inline double relu_E(double q1, double q2, double c) {
    if (q1 <= 0 || q2 <= 0) return 0.0;
    c = max(-1.0, min(1.0, c));
    double theta = acos(c);
    return sqrt(q1*q2) * (sin(theta) + (M_PI - theta) * cos(theta)) / (2.0*M_PI);
}

QC step_relu(double q, double c, double sigma_w2, double sigma_b2) {
    double q_next = sigma_w2 * (q / 2.0) + sigma_b2; // ReLU second moment q/2
    double cov_next = sigma_w2 * relu_E(q, q, c) + sigma_b2;
    double c_next = cov_next / q_next; // normalize to correlation
    return {q_next, c_next};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Settings
    double sigma_b2 = 0.0;

    // Case A: subcritical (signals contract)
    double sigma_w2_A = 1.5; // chi = sigma_w2/2 = 0.75 < 1
    // Case B: supercritical (signals expand/chaotic)
    double sigma_w2_B = 3.0; // chi = 1.5 > 1
    // Case C: edge of chaos
    double sigma_w2_C = 2.0; // chi = 1

    // Initial variance and correlation
    double q0 = 1.0; // normalized inputs
    double c0 = 0.5; // initial cosine similarity

    auto run_case = [&](double sigma_w2, const string& name) {
        double q = q0, c = c0;
        cout << "\n" << name << ": sigma_w^2= " << sigma_w2 << ", chi= " << (sigma_w2/2.0) << "\n";
        for (int ell = 0; ell < 10; ++ell) {
            cout << "Layer " << ell << ": q= " << q << ", c= " << c << "\n";
            auto qc = step_relu(q, c, sigma_w2, sigma_b2);
            q = qc.q; c = qc.c;
        }
    };

    run_case(sigma_w2_A, "Subcritical");
    run_case(sigma_w2_B, "Supercritical");
    run_case(sigma_w2_C, "Edge of Chaos");

    return 0;
}

This analytic simulator iterates the mean-field map for ReLU, reporting how variance q and correlation c evolve with depth under three regimes: subcritical (chi < 1), supercritical (chi > 1), and critical (edge of chaos, chi = 1), where chi = sigma_w^2 / 2 for ReLU. With sigma_b^2 = 0, the ReLU correlation map itself is the same in all three cases (c drifts slowly toward its fixed point at 1); what chi controls is the variance: subcritical runs shrink q toward zero (signals die out), supercritical runs blow q up exponentially, and the critical setting keeps q fixed, preserving signal scale (and gradient norms) across many layers.

Time: O(L) per run (constant work per layer). Space: O(1).
#mean field theory #neural tangent kernel #neural network gaussian process #relu arc-cosine kernel #he initialization #xavier initialization #edge of chaos #variance propagation #wide networks #kernel methods #central limit theorem #correlation map #nngp #ntk #gaussian pre-activations