Batch Normalization
Key Points
- Batch Normalization rescales and recenters activations using mini-batch statistics to stabilize and speed up neural network training.
- It computes a per-feature mean and variance over the current batch, normalizes activations, and then applies a learnable scale (γ) and shift (β).
- During inference, BatchNorm uses running (moving-average) statistics instead of batch statistics to produce deterministic outputs.
- The forward pass is O(ND), where N is the batch size and D is the number of features; the backward pass is also O(ND).
- Key hyperparameters include epsilon (numerical stability) and momentum (for running averages).
- Common pitfalls include mixing training and inference modes, using the unbiased (m−1) variance during training, and misplacing epsilon outside the square root.
- For small batches, BatchNorm's statistics become noisy; alternatives include LayerNorm or GroupNorm.
- A correct backward pass uses a compact formula for dX that depends on sums over the batch of dY and dY·x̂.
Prerequisites
- Matrix and vector operations — BatchNorm is applied to matrices of shape (batch, features) or higher-rank tensors; understanding reductions and broadcasting is essential.
- Mean and variance — computing batch statistics requires a clear understanding of these moments and how they are estimated.
- Backpropagation and the chain rule — implementing the backward pass for BatchNorm depends on differentiating through normalization and affine transforms.
- Floating-point arithmetic and numerical stability — epsilon placement and precision issues directly affect BN's robustness.
- Neural network training loop — you need to know how forward, backward, parameter updates, and mode toggling (train/inference) interact.
Detailed Explanation
01 Overview
Hook: Imagine trying to run on a treadmill that keeps randomly speeding up and slowing down—your body constantly has to readjust. Neural networks feel something similar when the distribution of activations shifts during training. Concept: Batch Normalization (BN) acts like a stabilizer for each layer’s activations: it standardizes them to have zero mean and unit variance per feature across a mini-batch, then lets the network learn an appropriate scale (γ) and shift (β). This keeps the "training treadmill" steady. Example: For a dense layer output with dimension D and a mini-batch of size N, BN computes the mean and variance of each of the D features across the N examples, normalizes each feature, and then applies γ and β for that feature. This reduces internal covariate shift, enabling higher learning rates and faster convergence. In practice, BN also maintains running (moving average) estimates of the mean and variance to use during inference, so predictions are deterministic and independent of the current batch. BN became a foundational component of many CNN and MLP architectures because it improves optimization, regularizes the model slightly, and often leads to higher accuracy with fewer epochs.
02 Intuition & Analogies
Hook: Think of baking cookies in batches. If each batch’s oven temperature and ingredient amounts vary wildly, your cookies come out inconsistent and you keep adjusting the recipe. Concept: BatchNorm is like calibrating the oven and measuring cups every batch: you re-center (mean 0) and rescale (variance 1) the dough’s key attributes before baking, then you add a signature flavor (γ and β) so the cookies still taste the way you want. Example: Suppose a hidden layer sometimes outputs values mostly around 100 and other times around -5 for the same feature. The next layer’s weights then face a moving target and learning becomes jittery. With BN, we measure the batch mean and variance for that feature, subtract the mean, divide by the standard deviation (with a small epsilon to avoid division by zero), and then apply a learned scale and shift. Now the downstream layers always receive standardized inputs, so gradients propagate more predictably. The learned γ and β ensure that normalization doesn’t limit expressiveness—if the network really wants a certain distribution, it can learn γ and β to recreate it after stabilization. During inference, since you may process a single sample at a time, we can’t rely on batch statistics; instead, we use the accumulated running averages from training to keep the normalization consistent.
03 Formal Definition
Given a mini-batch B = {x_1, ..., x_m} of values for a single feature, BatchNorm computes the batch mean \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i and variance \sigma^{2}_{B} = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, normalizes \hat{x}_i = (x_i - \mu_B) / \sqrt{\sigma^{2}_{B} + \epsilon}, and outputs y_i = \gamma \hat{x}_i + \beta with learnable parameters \gamma and \beta. For an input matrix of shape (N, D), this is applied independently to each of the D features, with statistics taken over the N batch entries. During training, running estimates of the mean and variance are maintained by exponential moving average and substituted for the batch statistics at inference time.
04 When to Use
Use BatchNorm when training deep networks where optimization is unstable or slow, especially with fully connected layers or convolutional networks processing large batches. It helps when gradients vanish/explode or when the distribution of activations shifts during training. Typical use cases include image classification CNNs (per-channel BN after convolutions), MLPs for tabular data (BN after linear layers and before nonlinearity), and sequence models when applied carefully to feed-forward projections. BN also acts as a mild regularizer, often reducing the need for heavy dropout. During fine-tuning or transfer learning, you may freeze running statistics (and sometimes γ/β) if target data are limited or differ in distribution. If your batch sizes are very small (e.g., batch size 1–8), BN’s batch statistics become noisy; in such cases, consider LayerNorm or GroupNorm, which do not depend on batch statistics. For inference-only deployments, ensure running averages are properly accumulated and the model runs in inference mode so outputs are deterministic and independent of batch composition.
⚠️Common Mistakes
- Mixing modes: using batch statistics at inference time, or running statistics during training, inadvertently. Always switch between training and inference modes consistently.
- Epsilon placement: putting \epsilon outside the square root changes the scaling. Correct is division by \sqrt{\sigma^{2}_{B} + \epsilon}.
- Variance estimator: accidentally using the unbiased estimator (divide by m−1) instead of the population formula (divide by m) during training, causing drift between training and inference statistics.
- Momentum confusion: different libraries define momentum differently (weight on the new vs. the old value). Ensure your update matches your framework's convention.
- Tiny batches: with small m, estimated statistics are noisy and accuracy may degrade. Prefer LayerNorm/GroupNorm or accumulate statistics across multiple mini-batches.
- Per-feature axis errors: for images, compute statistics per channel across the batch and spatial dimensions; do not fold the channel axis in with the spatial ones.
- Forgetting to learn γ and β: omitting the affine step limits expressiveness; include the learnable scale and shift unless you have a reason not to.
- Numerical stability: too small an \epsilon (e.g., 1e-12 with float32) can cause NaNs; too large biases the normalization. Start with 1e-5 to 1e-3.
- Backward-pass mistakes: not using the compact formula for dX, or forgetting to cache \hat{x}, the mean, and the variance from the forward pass, leads to wrong gradients.
- Data leakage: computing statistics across the entire dataset or across time can leak information between samples; only use per-batch statistics during training.
Key Formulas
Batch Mean
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
Explanation: The mean of a feature over a mini-batch of size m. It centers the data so the average becomes zero after subtraction.
Batch Variance
\sigma^{2}_{B} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
Explanation: The average squared deviation of values from the batch mean. It measures how spread out the batch values are for that feature.
Normalization
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^{2}_{B} + \epsilon}}
Explanation: Each activation is centered and scaled to have unit variance, with a small epsilon added under the square root for numerical stability.
Affine Transform
y_i = \gamma \hat{x}_i + \beta
Explanation: After normalization, a learnable scale (\gamma) and shift (\beta) are applied so the model can represent any needed distribution.
Running Mean Update
\mu_{\text{run}} \leftarrow \rho\, \mu_{\text{run}} + (1 - \rho)\, \mu_B
Explanation: An exponential moving average updates the running mean using momentum parameter \rho. Larger \rho smooths updates more heavily.
Running Variance Update
\sigma^{2}_{\text{run}} \leftarrow \rho\, \sigma^{2}_{\text{run}} + (1 - \rho)\, \sigma^{2}_{B}
Explanation: The running variance is updated analogously to the mean and is used during inference for deterministic outputs.
Beta Gradient
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}
Explanation: The gradient with respect to the shift parameter beta is the sum of upstream gradients across the batch for that feature.
Gamma Gradient
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\, \hat{x}_i
Explanation: The gradient with respect to gamma is the sum over the batch of the upstream gradient times the normalized activation.
Compact Input Gradient
\frac{\partial L}{\partial x_i} = \frac{\gamma}{m \sqrt{\sigma^{2}_{B} + \epsilon}} \left( m\, \frac{\partial L}{\partial y_i} - \sum_{j=1}^{m} \frac{\partial L}{\partial y_j} - \hat{x}_i \sum_{j=1}^{m} \frac{\partial L}{\partial y_j}\, \hat{x}_j \right)
Explanation: A numerically stable, vectorized formula for the gradient of the loss with respect to the inputs. It depends only on batch-wide sums of the upstream gradients and their correlation with the normalized activations.
Time Complexity (Forward/Backward)
T_{\text{forward}} = T_{\text{backward}} = O(N \cdot D)
Explanation: For batch size N and feature dimension D, both forward and backward passes require a constant number of passes over the ND elements. This scales linearly with data size.
Complexity Analysis
Forward pass: each step (mean, variance, normalization, affine) is a constant number of sweeps over the N × D input, so time is O(ND), with O(D) extra space for the statistics plus O(ND) if x̂ is cached for the backward pass. Backward pass: the per-feature reductions over dY and dY·x̂ and the final compact formula are likewise O(ND) time. Parameter storage is O(D) for γ, β, and the running statistics.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// A simple 1D Batch Normalization for (N x D) matrices, double precision.
// - Per-feature statistics across the batch axis
// - Maintains running mean/variance for inference
// - Learnable gamma (scale) and beta (shift)

struct BatchNorm1D {
    int D;           // number of features
    double eps;      // epsilon for numerical stability
    double momentum; // momentum for running stats (EMA: run = rho*run + (1-rho)*batch)
    vector<double> gamma;        // scale parameters (size D)
    vector<double> beta;         // shift parameters (size D)
    vector<double> running_mean; // running averages (size D)
    vector<double> running_var;  // running variances (size D)

    // Cache for backward (optional if you plan to backprop later)
    vector<double> last_mean;         // batch mean (size D)
    vector<double> last_var;          // batch var (size D)
    vector<vector<double>> last_xhat; // normalized inputs (N x D)
    bool cached = false;

    BatchNorm1D(int D_, double eps_=1e-5, double momentum_=0.9)
        : D(D_), eps(eps_), momentum(momentum_),
          gamma(D_, 1.0), beta(D_, 0.0),
          running_mean(D_, 0.0), running_var(D_, 1.0) {}

    // Forward pass
    // x: N x D matrix
    // train: if true, use batch stats and update running stats; else use running stats
    vector<vector<double>> forward(const vector<vector<double>>& x, bool train=true) {
        int N = (int)x.size();
        if (N == 0) return {};
        if ((int)x[0].size() != D) throw runtime_error("Input feature dimension mismatch");
        vector<vector<double>> y(N, vector<double>(D, 0.0));

        vector<double> mean(D, 0.0), var(D, 0.0), inv_std(D, 0.0);

        if (train) {
            // 1) compute per-feature mean
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d)
                    mean[d] += x[n][d];
            for (int d = 0; d < D; ++d)
                mean[d] /= (double)N;

            // 2) compute per-feature variance (population form: divide by N)
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d) {
                    double diff = x[n][d] - mean[d];
                    var[d] += diff * diff;
                }
            for (int d = 0; d < D; ++d) {
                var[d] /= (double)N;
                inv_std[d] = 1.0 / sqrt(var[d] + eps);
            }

            // 3) normalize, scale and shift
            last_xhat.assign(N, vector<double>(D, 0.0));
            for (int n = 0; n < N; ++n) {
                for (int d = 0; d < D; ++d) {
                    double xhat = (x[n][d] - mean[d]) * inv_std[d];
                    last_xhat[n][d] = xhat; // cache
                    y[n][d] = gamma[d] * xhat + beta[d];
                }
            }

            // 4) update running stats (EMA)
            for (int d = 0; d < D; ++d) {
                running_mean[d] = momentum * running_mean[d] + (1.0 - momentum) * mean[d];
                running_var[d]  = momentum * running_var[d]  + (1.0 - momentum) * var[d];
            }

            last_mean = mean; last_var = var; cached = true;
        } else {
            // Inference path: use running stats
            for (int d = 0; d < D; ++d) inv_std[d] = 1.0 / sqrt(running_var[d] + eps);
            for (int n = 0; n < N; ++n) {
                for (int d = 0; d < D; ++d) {
                    double xhat = (x[n][d] - running_mean[d]) * inv_std[d];
                    y[n][d] = gamma[d] * xhat + beta[d];
                }
            }
            cached = false; // no training cache updated
        }
        return y;
    }
};

int main() {
    // Example usage: N=3 samples, D=4 features
    vector<vector<double>> X = {
        {1.0, 2.0, 3.0, 4.0},
        {2.0, 3.0, 4.0, 5.0},
        {3.0, 4.0, 5.0, 6.0}
    };

    BatchNorm1D bn(4, 1e-5, 0.9);

    // Training forward
    auto Y_train = bn.forward(X, true);
    cout << "Training output:\n";
    for (auto &row : Y_train) {
        for (double v : row) cout << fixed << setprecision(5) << v << ' ';
        cout << '\n';
    }

    // Inference forward (uses running stats)
    auto Y_infer = bn.forward(X, false);
    cout << "\nInference output:\n";
    for (auto &row : Y_infer) {
        for (double v : row) cout << fixed << setprecision(5) << v << ' ';
        cout << '\n';
    }

    return 0;
}
This program implements 1D Batch Normalization for a (N × D) matrix. In training mode, it computes batch mean and variance per feature, normalizes inputs, applies learnable γ and β, and updates running statistics using exponential moving averages. In inference mode, it relies on running statistics to produce deterministic outputs. The code caches normalized inputs, mean, and variance during training, which are useful for the backward pass.
#include <bits/stdc++.h>
using namespace std;

struct BatchNorm1DBackward {
    int D;
    double eps;
    vector<double> gamma, beta;

    // Saved from forward pass
    vector<double> mean, var, inv_std; // size D
    vector<vector<double>> xhat;       // N x D

    BatchNorm1DBackward(int D_, double eps_=1e-5)
        : D(D_), eps(eps_), gamma(D_, 1.0), beta(D_, 0.0) {}

    void cache_forward(const vector<vector<double>>& x, vector<double> mean_, vector<double> var_) {
        int N = (int)x.size();
        mean = move(mean_);
        var = move(var_);
        inv_std.assign(D, 0.0);
        for (int d = 0; d < D; ++d) inv_std[d] = 1.0 / sqrt(var[d] + eps);
        // compute xhat and cache
        xhat.assign(N, vector<double>(D, 0.0));
        for (int n = 0; n < N; ++n)
            for (int d = 0; d < D; ++d)
                xhat[n][d] = (x[n][d] - mean[d]) * inv_std[d];
    }

    // Backward pass: given upstream gradient dY (N x D), compute dX (N x D), dGamma (D), dBeta (D)
    void backward(const vector<vector<double>>& dY,
                  vector<vector<double>>& dX,
                  vector<double>& dGamma,
                  vector<double>& dBeta) {
        int N = (int)dY.size();
        if (N == 0) return;
        dX.assign(N, vector<double>(D, 0.0));
        dGamma.assign(D, 0.0);
        dBeta.assign(D, 0.0);

        // Compute per-feature reductions: sum(dY) and sum(dY * xhat)
        vector<double> sum_dY(D, 0.0), sum_dY_xhat(D, 0.0);
        for (int n = 0; n < N; ++n) {
            for (int d = 0; d < D; ++d) {
                dBeta[d] += dY[n][d];
                sum_dY[d] += dY[n][d];
                sum_dY_xhat[d] += dY[n][d] * xhat[n][d];
            }
        }
        // dGamma = sum(dY * xhat)
        for (int d = 0; d < D; ++d) dGamma[d] = sum_dY_xhat[d];

        // dX via compact formula:
        // dX = (1/N) * gamma * inv_std * (N*dY - sum(dY) - xhat*sum(dY*xhat))
        for (int n = 0; n < N; ++n) {
            for (int d = 0; d < D; ++d) {
                double coeff = gamma[d] * inv_std[d] / (double)N; // gamma / (N * sqrt(var+eps))
                double term = (double)N * dY[n][d] - sum_dY[d] - xhat[n][d] * sum_dY_xhat[d];
                dX[n][d] = coeff * term;
            }
        }
    }
};

int main() {
    // Toy example: N=2, D=3
    vector<vector<double>> X = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};

    // Prepare forward cache (usually obtained from an actual forward pass)
    int D = 3; double eps = 1e-5;
    vector<double> mean(D, 0.0), var(D, 0.0);
    int N = (int)X.size();
    for (int d = 0; d < D; ++d) {
        for (int n = 0; n < N; ++n) mean[d] += X[n][d];
        mean[d] /= (double)N;
        for (int n = 0; n < N; ++n) {
            double diff = X[n][d] - mean[d];
            var[d] += diff * diff;
        }
        var[d] /= (double)N;
    }

    BatchNorm1DBackward bn_bw(D, eps);
    bn_bw.cache_forward(X, mean, var);

    // Upstream gradient dY
    vector<vector<double>> dY = {{0.1, -0.2, 0.3}, {-0.4, 0.5, -0.6}};

    vector<vector<double>> dX; vector<double> dGamma, dBeta;
    bn_bw.backward(dY, dX, dGamma, dBeta);

    cout << fixed << setprecision(6);
    cout << "dGamma: "; for (double v : dGamma) cout << v << ' '; cout << '\n';
    cout << "dBeta : "; for (double v : dBeta) cout << v << ' '; cout << '\n';

    cout << "dX:\n";
    for (auto &row : dX) {
        for (double v : row) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}
This code computes gradients for BatchNorm’s parameters and inputs. It uses the compact formula for dX that depends on batch-wise reductions of dY and dY·x̂, producing O(ND) computation. In a real training loop, you would receive the cache (mean, var, x̂) from the forward pass and then propagate gradients to previous layers using dX. The example demonstrates the mechanics without an optimizer.
#include <bits/stdc++.h>
using namespace std;

// Minimal demo: Linear -> BatchNorm -> ReLU forward with train/inference toggle.

struct Linear {
    int inF, outF;
    vector<vector<double>> W; // outF x inF
    vector<double> b;         // outF
    Linear(int inF_, int outF_)
        : inF(inF_), outF(outF_), W(outF_, vector<double>(inF_, 0.0)), b(outF_, 0.0) {
        // Xavier init (simple)
        std::mt19937 rng(42);
        std::normal_distribution<double> nd(0.0, sqrt(2.0 / (inF + outF)));
        for (int o = 0; o < outF; ++o)
            for (int i = 0; i < inF; ++i)
                W[o][i] = nd(rng);
    }
    vector<vector<double>> forward(const vector<vector<double>>& x) const {
        int N = (int)x.size();
        vector<vector<double>> y(N, vector<double>(outF, 0.0));
        for (int n = 0; n < N; ++n) {
            for (int o = 0; o < outF; ++o) {
                double s = b[o];
                for (int i = 0; i < inF; ++i) s += W[o][i] * x[n][i];
                y[n][o] = s;
            }
        }
        return y;
    }
};

struct BatchNorm1D {
    int D; double eps, momentum; bool training = true;
    vector<double> gamma, beta, run_mean, run_var;
    BatchNorm1D(int D_, double eps_=1e-5, double momentum_=0.9)
        : D(D_), eps(eps_), momentum(momentum_),
          gamma(D_, 1.0), beta(D_, 0.0), run_mean(D_, 0.0), run_var(D_, 1.0) {}
    vector<vector<double>> forward(const vector<vector<double>>& x) {
        int N = (int)x.size(); if (N == 0) return {};
        vector<vector<double>> y(N, vector<double>(D, 0.0));
        vector<double> mean(D, 0.0), var(D, 0.0), invs(D, 0.0);
        if (training) {
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d) mean[d] += x[n][d];
            for (int d = 0; d < D; ++d) mean[d] /= N;
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d) { double diff = x[n][d] - mean[d]; var[d] += diff * diff; }
            for (int d = 0; d < D; ++d) { var[d] /= N; invs[d] = 1.0 / sqrt(var[d] + eps); }
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d) { double xhat = (x[n][d] - mean[d]) * invs[d]; y[n][d] = gamma[d] * xhat + beta[d]; }
            for (int d = 0; d < D; ++d) {
                run_mean[d] = momentum * run_mean[d] + (1.0 - momentum) * mean[d];
                run_var[d]  = momentum * run_var[d]  + (1.0 - momentum) * var[d];
            }
        } else {
            for (int d = 0; d < D; ++d) invs[d] = 1.0 / sqrt(run_var[d] + eps);
            for (int n = 0; n < N; ++n)
                for (int d = 0; d < D; ++d) { double xhat = (x[n][d] - run_mean[d]) * invs[d]; y[n][d] = gamma[d] * xhat + beta[d]; }
        }
        return y;
    }
};

static inline vector<vector<double>> relu(const vector<vector<double>>& x) {
    int N = (int)x.size(); if (N == 0) return {};
    int D = (int)x[0].size();
    vector<vector<double>> y(N, vector<double>(D, 0.0));
    for (int n = 0; n < N; ++n)
        for (int d = 0; d < D; ++d) y[n][d] = max(0.0, x[n][d]);
    return y;
}

int main() {
    // Create random input (N=5, inF=3)
    int N = 5, inF = 3, outF = 4;
    vector<vector<double>> X(N, vector<double>(inF, 0.0));
    std::mt19937 rng(7); std::normal_distribution<double> nd(0.0, 1.0);
    for (int n = 0; n < N; ++n)
        for (int i = 0; i < inF; ++i) X[n][i] = nd(rng);

    Linear lin(inF, outF);
    BatchNorm1D bn(outF, 1e-5, 0.9);

    // Training forward
    bn.training = true;
    auto H1 = lin.forward(X);
    auto H2 = bn.forward(H1);
    auto H3 = relu(H2);

    cout << "Training mode output (first sample):\n";
    for (double v : H3[0]) cout << fixed << setprecision(5) << v << ' ';
    cout << '\n';

    // Inference forward (toggle)
    bn.training = false;
    auto H1_inf = lin.forward(X);
    auto H2_inf = bn.forward(H1_inf);
    auto H3_inf = relu(H2_inf);

    cout << "Inference mode output (first sample):\n";
    for (double v : H3_inf[0]) cout << fixed << setprecision(5) << v << ' ';
    cout << '\n';

    return 0;
}
This demo places BatchNorm between a linear layer and ReLU and shows how toggling training versus inference mode changes which statistics are used. In training, batch mean/variance are computed; in inference, running statistics are used. This mirrors typical deep learning pipelines and highlights where BN slots in a layer stack.