
Variational Dropout & Bayesian Deep Learning

Key Points

  • Dropout can be interpreted as variational inference in a Bayesian neural network, where applying random masks approximates sampling from a posterior over weights.
  • The Bayesian view turns standard training into optimizing an evidence lower bound (ELBO) that balances data fit and model complexity using a KL divergence term.
  • Monte Carlo (MC) dropout at test time provides uncertainty estimates by averaging predictions over many random dropout masks.
  • Variational dropout replaces fixed Bernoulli masks with learned noise levels per weight using a Gaussian reparameterization, enabling principled sparsity.
  • Predictive uncertainty splits into epistemic (model) and aleatoric (data) parts; MC dropout primarily captures epistemic uncertainty.
  • Local reparameterization reduces gradient variance by sampling pre-activations instead of weights, improving training stability.
  • Careful scaling (inverted dropout) is essential so that expected activations remain consistent between training and evaluation.
  • C++ implementations can demonstrate MC dropout inference and simple variational Bayes for linear regression using the reparameterization trick.

Prerequisites

  • →Probability Theory Basics — Understanding distributions, expectations, and Bayes’ rule is essential for Bayesian modeling.
  • →Linear Algebra — Vector-matrix operations underlie neural networks and linear regression.
  • →Gradient-based Optimization — Training via ELBO maximization requires computing and applying gradients.
  • →Neural Network Fundamentals — Dropout, activations, and forward passes are assumed knowledge for interpreting MC dropout.
  • →Calculus and Chain Rule — Deriving gradients, especially through the reparameterization trick, depends on calculus.
  • →Statistics for Regression — Gaussian likelihoods, mean/variance, and MSE relate to the probabilistic objective.

Detailed Explanation


01 Overview

Variational Dropout and Bayesian Deep Learning provide a probabilistic lens on neural networks. Instead of treating weights as fixed numbers, Bayesian Neural Networks (BNNs) view them as random variables with prior distributions. Learning becomes the task of approximating the posterior distribution over weights given the data. Variational inference is a popular approach: we choose a family of distributions (the variational family) and find the member closest to the true posterior by maximizing an objective called the Evidence Lower Bound (ELBO). A surprising and powerful connection is that standard dropout—randomly zeroing activations during training—can be interpreted as a form of variational inference, where the randomness acts like sampling from an approximate posterior. This perspective explains why dropout improves generalization and, with Monte Carlo sampling at test time, yields uncertainty estimates. Variational dropout extends this idea by learning continuous noise levels per weight using Gaussian parameterizations and the reparameterization trick, which reduces gradient variance. Together, these ideas enable practical Bayesian reasoning in deep networks without drastically changing standard training pipelines.

02 Intuition & Analogies

Imagine you’re hiring a committee to make predictions. If you always ask the exact same committee members (a single deterministic network), you get one answer but no sense of how confident that answer is. Dropout turns this into a crowd: every time you ask, a random subset of committee members participates. If the crowd’s answers agree, you feel confident; if they disagree, you sense uncertainty. That’s the basic intuition of MC dropout: we keep dropout turned on at test time and ask many slightly different sub-networks to predict, then average. From a Bayesian viewpoint, this crowd behavior mimics drawing different weight samples from a posterior distribution—each mask corresponds to a different plausible model explaining the data. Variational dropout goes further by letting each committee member decide how loudly to speak. Instead of a hard on/off mask, each weight is perturbed with Gaussian noise whose scale is learned. If a weight isn’t helpful, the model inflates its noise (effectively silencing it); if it’s important, the model lowers its noise (keeping it consistent). This is like giving each committee member a volume knob that’s adjusted during training to balance fit and simplicity. Finally, the local reparameterization trick says: rather than roll dice for every weight, roll dice for the combined signal arriving at each neuron. This reduces randomness where it matters most (the neuron’s input), making learning more stable—like measuring the total chatter entering the room, not each whisper separately.

03 Formal Definition

Let D = {(x_n, y_n)}_{n=1}^{N} be data, W the weights, and p(W) a prior. The Bayesian posterior is p(W | D) ∝ p(D | W) p(W). In variational inference, we posit a tractable family q_θ(W) and optimize the Evidence Lower Bound (ELBO): L(θ) = E_{q_θ(W)}[log p(D | W)] − KL(q_θ(W) ‖ p(W)), which lower-bounds log p(D). Standard dropout can be cast as a variational approximation where the effective weights are W̃ = W ⊙ Z with Z_ij ~ Bernoulli(1 − p), and training with weight decay corresponds to placing a Gaussian prior on W. Test-time MC dropout estimates the predictive distribution p(y* | x*, D) ≈ ∫ p(y* | x*, W) q_θ(W) dW via Monte Carlo samples of masks Z. Variational dropout replaces Bernoulli multiplicative noise with a Gaussian posterior per weight: q_θ(w_ij) = N(μ_ij, σ_ij²), often parameterized by α_ij = σ_ij² / μ_ij². Sampling uses the reparameterization w_ij = μ_ij + σ_ij ε_ij with ε_ij ~ N(0, 1). The local reparameterization trick samples pre-activations z_j = Σ_i x_i w_ij from a Gaussian with mean Σ_i x_i μ_ij and variance Σ_i x_i² σ_ij², reducing the variance of stochastic gradients. The KL term between diagonal Gaussians and a Gaussian prior is available in closed form, enabling unbiased gradient estimates for ELBO optimization.

04 When to Use

Use dropout-as-variational-inference (MC dropout) when you already rely on dropout for regularization and need uncertainty estimates with minimal changes: e.g., regression with safety constraints, medical triage, or forecasting where confidence intervals matter. It’s also useful for active learning, where you query labels for inputs with high predictive variance, and for out-of-distribution detection, where disagreement across MC samples flags potential anomalies. Choose variational dropout when you want learned, weight-specific uncertainty and potential sparsification. It can act as automatic relevance determination, pruning unhelpful connections by inflating their noise—handy for model compression. The local reparameterization trick is particularly attractive in large fully connected layers because it reduces training variance and can improve convergence. If you need calibrated uncertainty on small to medium datasets, Bayesian treatments (MC dropout or variational layers) are often preferable to purely deterministic ensembles because they incorporate a prior that tempers overfitting. Conversely, on extremely large datasets where epistemic uncertainty shrinks, the benefits may be smaller, and simpler regularizers may suffice. Finally, consider computational cost: MC dropout requires multiple stochastic forward passes at inference; plan for that latency or amortize via batching.

⚠️Common Mistakes

  • Turning off dropout at test time yet expecting uncertainty: MC dropout requires keeping dropout active during inference and averaging many stochastic passes.
  • Using too few MC samples (e.g., T=5) and trusting the variance: uncertainty estimates may be noisy; use enough samples (e.g., T=20–100) for stable means/variances given your latency budget.
  • Forgetting inverted dropout scaling: without scaling by 1/(1−p) during training, the expected activation shifts, causing train-test mismatch.
  • Confusing aleatoric and epistemic uncertainty: MC dropout mainly captures epistemic (model) uncertainty; noisy labels/data variance (aleatoric) requires modeling the likelihood’s noise explicitly (e.g., heteroscedastic regression).
  • Mishandling variational parameters: parameterize standard deviations via log-std or softplus to keep them positive; directly optimizing σ can produce negative or unstable values.
  • Ignoring priors: the KL term encodes prior assumptions; priors that are too tight can underfit, while too-loose priors can overfit; tune the prior scale.
  • High-variance gradients: sampling weights per connection instead of using local reparameterization can slow convergence; prefer sampling pre-activations when possible.
  • Calibration blind spots: even Bayesian approximations can be miscalibrated; validate with reliability diagrams and consider temperature scaling if needed.

Key Formulas

ELBO

L(θ) = E_{q_θ(W)}[ log p(D | W) ] − KL( q_θ(W) ‖ p(W) )

Explanation: The ELBO trades off fitting the data (expected log-likelihood) against staying close to the prior (KL term). Maximizing it approximates the intractable posterior.

MC Predictive Approximation

p(y* | x*, D) = ∫ p(y* | x*, W) p(W | D) dW ≈ (1/T) Σ_{t=1}^{T} p(y* | x*, W^(t))

Explanation: We approximate the predictive distribution by averaging predictions over T samples of the weights from the variational posterior or dropout masks.

Dropout as Random Weights

W̃ = W ⊙ Z,   Z_ij ~ Bernoulli(1 − p)

Explanation: Dropout applies multiplicative Bernoulli noise to weights or activations, which can be interpreted as sampling from an approximate posterior.

Inverted Dropout

y = (x ⊙ m) / (1 − p),   m_i ~ Bernoulli(1 − p)

Explanation: By scaling with 1/(1-p) during training, the expected activation matches the test-time activation without dropout.

Reparameterization Trick

w = μ + σ ⊙ ε,   ε ~ N(0, I)

Explanation: Sampling is expressed as a differentiable transformation of noise, enabling unbiased, low-variance gradient estimates for variational parameters.

KL of Gaussians (General)

KL( N(μ, Σ) ‖ N(0, σ_p² I) ) = ½ ( tr(Σ)/σ_p² + ‖μ‖²₂/σ_p² − k − log( det(Σ) / σ_p^{2k} ) )

Explanation: Closed-form KL between a Gaussian with mean μ and covariance Σ (in k dimensions) and an isotropic Gaussian prior. For diagonal Σ, this simplifies to element-wise terms.

Local Reparameterization

z_j ~ N( Σ_i x_i μ_ij ,  Σ_i x_i² σ_ij² )

Explanation: Instead of sampling each weight, sample the pre-activation zj​, whose mean and variance are determined by input and variational parameters.

Gaussian Likelihood for Regression

log p(D | W) = −(N/2) log(2π σ_y²) − (1/(2σ_y²)) Σ_{n=1}^{N} (y_n − x_nᵀ W)²

Explanation: Assumes outputs are normally distributed around linear predictions with variance σy2​. It is commonly used in Bayesian linear regression.

MC Mean and Variance

E[y*] ≈ (1/T) Σ_{t=1}^{T} f(x*; W^(t)),   Var[y*] ≈ (1/(T−1)) Σ_{t=1}^{T} ( f(x*; W^(t)) − E[y*] )²

Explanation: Empirical mean and variance from T stochastic forward passes estimate the predictive mean and model uncertainty (epistemic).

KL for Diagonal Gaussians

KL( N(μ, diag(σ²)) ‖ N(0, σ_p² I) ) = ½ Σ_i ( (σ_i² + μ_i²)/σ_p² − 1 − log(σ_i² / σ_p²) )

Explanation: The KL penalty decomposes over dimensions and gives simple gradients for μi​ and σi​, facilitating variational updates.

Complexity Analysis

For MC dropout inference, let d be input dimension, h the hidden size, o the output size, and T the number of Monte Carlo samples. A single forward pass through a 1-hidden-layer MLP costs O(dh + ho). Performing T stochastic passes costs O(T(dh + ho)). Memory is O(dh + ho) to store parameters and O(h) for activations per pass; storing all T outputs adds O(T) for scalar regression or O(To) for vector outputs. The runtime scales linearly in T, so there is a latency-accuracy trade-off: more samples yield smoother uncertainty estimates but take proportionally longer.

For variational Bayesian linear regression with a diagonal Gaussian posterior over d weights, each iteration computing full-batch gradients costs O(Nd) where N is the number of data points (matrix-vector products Xw and Xᵀr). Sampling with the reparameterization trick is O(d), and the KL term is O(d), so per-iteration complexity is dominated by O(Nd). Memory usage is O(Nd) for the design matrix (if stored), O(d) for parameters (μ, ρ), and O(N) for residuals. If minibatches are used, per-iteration cost reduces to O(Bd) for batch size B, at the expense of more iterations for convergence.

In variational dropout with local reparameterization for a dense layer, the per-batch forward sampling of pre-activations is O(Bdh) to compute means and variances and O(Bh) for Gaussian sampling; this is comparable to a standard forward pass with modest overhead. The local trick avoids sampling per weight (which would cost O(Bdh) random draws) and reduces gradient variance, improving epochs-to-convergence even if per-epoch cost is similar. Overall, these Bayesian approximations add constant-factor overheads and a multiplicative T factor at inference for MC dropout.

Code Examples

Monte Carlo (MC) Dropout Inference for a Tiny MLP (Regression)
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    RNG(uint64_t seed = 42) : gen(seed) {}
    double normal(double mean = 0.0, double stddev = 1.0) {
        normal_distribution<double> dist(mean, stddev);
        return dist(gen);
    }
    bool bernoulli(double p_keep) {
        bernoulli_distribution dist(p_keep);
        return dist(gen);
    }
};

struct Dense {
    int in_dim, out_dim;
    vector<double> W; // row-major [out_dim x in_dim]
    vector<double> b; // [out_dim]

    Dense(int in_d, int out_d, RNG &rng)
        : in_dim(in_d), out_dim(out_d), W(out_d * in_d), b(out_d) {
        // He initialization for ReLU
        double stddev = sqrt(2.0 / in_dim);
        for (double &w : W) w = rng.normal(0.0, stddev);
        for (double &bi : b) bi = 0.0;
    }

    // y = W x + b
    vector<double> forward_linear(const vector<double> &x) const {
        vector<double> y(out_dim, 0.0);
        for (int o = 0; o < out_dim; ++o) {
            double sum = 0.0;
            const double *wrow = &W[o * in_dim];
            for (int i = 0; i < in_dim; ++i) sum += wrow[i] * x[i];
            y[o] = sum + b[o];
        }
        return y;
    }
};

// ReLU activation
static inline void relu_inplace(vector<double> &v) {
    for (double &x : v) if (x < 0) x = 0;
}

// Apply inverted dropout to a vector in-place
static inline void inverted_dropout_inplace(vector<double> &v, double p_drop, RNG &rng) {
    if (p_drop <= 0.0) return; // no dropout
    double p_keep = 1.0 - p_drop;
    double scale = 1.0 / p_keep;
    for (double &x : v) {
        bool keep = rng.bernoulli(p_keep);
        x = keep ? (x * scale) : 0.0;
    }
}

// One-hidden-layer MLP: y = W2 * ReLU( Dropout( W1 * x + b1 ) ) + b2
struct MLP {
    Dense l1, l2;
    double p_drop_hidden; // dropout probability for hidden activations

    MLP(int in_dim, int hidden, int out_dim, double p_drop, RNG &rng)
        : l1(in_dim, hidden, rng), l2(hidden, out_dim, rng), p_drop_hidden(p_drop) {}

    // Forward pass with optional dropout on hidden layer
    vector<double> forward(const vector<double> &x, bool training, RNG &rng) const {
        vector<double> h = l1.forward_linear(x);
        relu_inplace(h);
        if (training) {
            inverted_dropout_inplace(h, p_drop_hidden, rng);
        }
        return l2.forward_linear(h); // linear output for regression
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // Define a tiny MLP: input=3, hidden=16, output=1, dropout p=0.2 on hidden
    MLP net(3, 16, 1, 0.2, rng);

    // Example input
    vector<double> x = {0.5, -1.2, 2.0};

    // Deterministic evaluation (dropout off): single forward pass
    vector<double> y_det = net.forward(x, /*training=*/false, rng);
    cout << fixed << setprecision(6);
    cout << "Deterministic output (no dropout): " << y_det[0] << "\n";

    // Monte Carlo dropout: keep dropout ON at test time, average T samples
    int T = 100; // number of stochastic passes
    vector<double> samples; samples.reserve(T);
    for (int t = 0; t < T; ++t) {
        vector<double> y = net.forward(x, /*training=*/true, rng); // dropout active
        samples.push_back(y[0]);
    }
    // Compute mean and (unbiased) standard deviation
    double mean = accumulate(samples.begin(), samples.end(), 0.0) / T;
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= max(1, T - 1);
    double stddev = sqrt(var);

    cout << "MC Dropout mean: " << mean << ", std: " << stddev << " (T=" << T << ")\n";

    return 0;
}

This program builds a minimal 1-hidden-layer MLP with ReLU and dropout applied to hidden activations. It demonstrates inverted dropout scaling during training and shows how to obtain uncertainty at inference via MC dropout: perform T stochastic forward passes with dropout enabled and compute the empirical mean and standard deviation of predictions. The deterministic output (dropout off) represents the usual point prediction, while the MC mean and std approximate the Bayesian predictive mean and epistemic uncertainty.

Time: O(T(dh + ho)) for T MC passes; a single pass is O(dh + ho).
Space: O(dh + ho) for parameters plus O(h) for activations; O(T) to store MC samples.
Variational Bayesian Linear Regression via Reparameterization (ELBO Ascent)
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 gen;
    RNG(uint64_t seed = 42) : gen(seed) {}
    double normal(double mean = 0.0, double stddev = 1.0) {
        normal_distribution<double> dist(mean, stddev);
        return dist(gen);
    }
};

// Generate synthetic linear regression data: y = x^T w_true + noise
void make_data(int N, int d, const vector<double> &w_true, double sigma_y, RNG &rng,
               vector<vector<double>> &X, vector<double> &y) {
    X.assign(N, vector<double>(d));
    y.assign(N, 0.0);
    normal_distribution<double> noise(0.0, sigma_y);
    for (int n = 0; n < N; ++n) {
        for (int i = 0; i < d; ++i) X[n][i] = rng.normal(0.0, 1.0);
        double yn = 0.0;
        for (int i = 0; i < d; ++i) yn += X[n][i] * w_true[i];
        yn += noise(rng.gen);
        y[n] = yn;
    }
}

// Compute the log-likelihood gradient wrt w, (1/sigma_y^2) * X^T (y - X w), and the MSE
void xt_residual(const vector<vector<double>> &X, const vector<double> &y,
                 const vector<double> &w, double sigma_y2,
                 vector<double> &grad_w, double &mse) {
    int N = (int)X.size();
    int d = (int)X[0].size();
    grad_w.assign(d, 0.0);
    mse = 0.0;
    for (int n = 0; n < N; ++n) {
        double yhat = 0.0;
        for (int i = 0; i < d; ++i) yhat += X[n][i] * w[i];
        double r = y[n] - yhat; // residual
        mse += r * r;
        for (int i = 0; i < d; ++i) grad_w[i] += X[n][i] * r; // X^T r
    }
    // gradient of log-likelihood wrt w: (1/sigma_y^2) * X^T r
    for (int i = 0; i < d; ++i) grad_w[i] /= sigma_y2;
    mse /= N;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // Problem setup
    int N = 500;          // number of samples
    int d = 5;            // number of features
    double sigma_y = 0.3; // observation noise std
    double sigma_y2 = sigma_y * sigma_y;
    vector<double> w_true = {1.5, -2.0, 0.0, 0.7, -0.3};

    vector<vector<double>> X;
    vector<double> y;
    make_data(N, d, w_true, sigma_y, rng, X, y);

    // Variational parameters: q(w) = N(mu, diag(sigma^2)), with sigma = exp(rho)
    vector<double> mu(d, 0.0);
    vector<double> rho(d, -3.0); // sigma ~ exp(-3) ~ 0.05 initially

    // Prior p(w) = N(0, sigma_p^2 I)
    double sigma_p = 1.0;
    double sigma_p2 = sigma_p * sigma_p;

    // Optimization settings
    int iters = 2000;
    double lr = 1e-2;

    normal_distribution<double> standard_normal(0.0, 1.0);

    for (int it = 1; it <= iters; ++it) {
        // Sample epsilon ~ N(0, I)
        vector<double> eps(d);
        for (int i = 0; i < d; ++i) eps[i] = standard_normal(rng.gen);

        // Compute sigma, sample w = mu + sigma * eps
        vector<double> sigma(d), w(d);
        for (int i = 0; i < d; ++i) {
            sigma[i] = exp(rho[i]);
            w[i] = mu[i] + sigma[i] * eps[i];
        }

        // Gradient of expected log-likelihood via reparameterization: one MC sample
        vector<double> grad_w;
        double mse;
        xt_residual(X, y, w, sigma_y2, grad_w, mse); // grad wrt w of log-lik

        // KL gradients: d/dmu KL = mu/sigma_p^2; d/dsigma KL = sigma/sigma_p^2 - 1/sigma
        vector<double> dKL_dmu(d), dKL_dsigma(d);
        for (int i = 0; i < d; ++i) {
            dKL_dmu[i] = mu[i] / sigma_p2;
            dKL_dsigma[i] = sigma[i] / sigma_p2 - 1.0 / max(1e-12, sigma[i]);
        }

        // ELBO gradients (ascent): grad_mu = grad_w - dKL/dmu; grad_sigma = grad_w*eps - dKL/dsigma
        vector<double> grad_mu(d), grad_sigma(d), grad_rho(d);
        for (int i = 0; i < d; ++i) {
            grad_mu[i] = grad_w[i] - dKL_dmu[i];
            grad_sigma[i] = grad_w[i] * eps[i] - dKL_dsigma[i];
            grad_rho[i] = grad_sigma[i] * sigma[i]; // chain rule: dsigma/drho = sigma
        }

        // Parameter update (ascent on ELBO)
        for (int i = 0; i < d; ++i) {
            mu[i] += lr * grad_mu[i];
            rho[i] += lr * grad_rho[i];
        }

        if (it % 200 == 0) {
            // Compute KL for monitoring
            double KL = 0.0;
            for (int i = 0; i < d; ++i) {
                double s2 = exp(2.0 * rho[i]);
                KL += 0.5 * ((s2 + mu[i] * mu[i]) / sigma_p2 - 1.0 - log(s2 / sigma_p2));
            }
            cout << "Iter " << it << ": MSE ~ " << mse << ", KL ~ " << KL << "\n";
        }
    }

    // Predictive mean and variance for a new input x*:
    // mean = x*^T mu, var = sigma_y^2 + x*^T diag(sigma^2) x*
    vector<double> x_star = {0.3, -1.0, 0.2, 0.1, 0.5};
    double mean_pred = 0.0, var_model = 0.0;
    for (int i = 0; i < d; ++i) {
        double si = exp(rho[i]);
        mean_pred += x_star[i] * mu[i];
        var_model += (x_star[i] * x_star[i]) * (si * si);
    }
    double var_pred = sigma_y2 + var_model;

    cout << fixed << setprecision(6);
    cout << "Predictive mean: " << mean_pred << ", std: " << sqrt(var_pred)
         << " (model std: " << sqrt(var_model) << ")\n";

    return 0;
}

This example implements variational Bayesian linear regression with a diagonal Gaussian posterior over weights. It uses the reparameterization trick to obtain low-variance gradients of the ELBO. The expected log-likelihood gradient is computed in closed form (via X^T residuals), while the KL to an isotropic Gaussian prior is closed-form with simple derivatives. After training, the code reports predictive mean and variance for a test input, decomposing observation noise and model (epistemic) variance.

Time: Per iteration O(Nd) for the full-batch gradient, plus O(d) for sampling and the KL term.
Space: O(Nd) to store X, O(d) for parameters and gradients, O(N) for residuals.
Tags: bayesian neural networks · variational inference · dropout · mc dropout · elbo · kl divergence · reparameterization trick · local reparameterization · variational dropout · uncertainty estimation · aleatoric uncertainty · epistemic uncertainty · gaussian prior · mean-field · bayesian regression