📚 Theory · Intermediate

Variational Inference

Key Points

  • Variational Inference (VI) turns Bayesian inference into an optimization problem by choosing a simple family q(z) to approximate an intractable posterior p(z|x).
  • The Evidence Lower Bound (ELBO) is the objective VI maximizes; increasing the ELBO tightens the approximation to the true posterior.
  • Mean-field VI assumes independent factors in q(z), enabling fast coordinate-ascent updates using expected log-joint terms.
  • Black-box VI uses Monte Carlo gradients of the ELBO and works even without conjugacy by leveraging the reparameterization trick.
  • The KL direction used in VI, KL(q||p), is mode-seeking: it concentrates q on high-density regions and can underestimate uncertainty in multi-modal posteriors.
  • Stochastic VI scales to large datasets by estimating gradients from mini-batches while keeping the objective gradient unbiased.
  • Monitoring the ELBO and gradient variance is crucial for stable, correct optimization.
  • C++ implementations can demonstrate both closed-form CAVI (e.g., linear models) and reparameterized gradient-based VI (e.g., simple latent Gaussian models).

Prerequisites

  • →Basic probability and expectation — VI relies on expectations under q(z) and reasoning about densities.
  • →Bayes’ rule and posterior inference — Understanding p(z|x) and why it can be intractable motivates VI.
  • →Kullback–Leibler divergence — VI minimizes KL(q||p) equivalently by maximizing the ELBO.
  • →Multivariate Gaussian distributions — Common variational families and conjugate models use Gaussian forms.
  • →Jensen’s inequality — It justifies the ELBO as a lower bound on log evidence.
  • →Optimization and gradient descent — Black-box VI requires iterative gradient-based optimization.
  • →Monte Carlo estimation — ELBO and gradient estimates use sampling from q.
  • →Linear algebra (matrix–vector operations) — Efficient CAVI implementations depend on matrix–vector computations.
  • →Log-likelihoods and numerical stability — Working in log-space prevents underflow and overflow.
  • →Stochastic optimization (mini-batching) — Essential to scale VI to large datasets.

Detailed Explanation


01 Overview

Variational Inference (VI) is a method to perform approximate Bayesian inference when the true posterior p(z|x) is computationally intractable. The key idea is to replace exact inference with optimization: we pick a set of candidate distributions Q (the variational family) and then find q*(z) in Q that is closest to the true posterior according to Kullback–Leibler (KL) divergence. Rather than minimizing KL directly, VI maximizes the Evidence Lower Bound (ELBO), which is a tractable lower bound on the log evidence log p(x). This bound captures a trade-off between fitting the data (via the expected log-likelihood) and staying close to the prior (via a KL regularizer).

In practice, VI comes in two dominant flavors. First, mean-field VI uses a product-form q(z) = ∏ q_i(z_i) and iteratively updates each factor with closed-form formulas when the model is conjugate, which is known as Coordinate Ascent Variational Inference (CAVI). Second, black-box VI uses Monte Carlo estimates of the ELBO gradient and can be applied to virtually any model; the reparameterization trick provides low-variance gradients when q is reparameterizable (e.g., Gaussian).

VI scales naturally to large datasets through stochastic (mini-batch) optimization. It also underlies modern deep generative models such as Variational Autoencoders (VAEs), where neural networks amortize inference by predicting parameters of q(z|x).

02 Intuition & Analogies

Imagine trying to describe a complicated, twisty shape (the true posterior) using a simpler clay mold (our variational family q). We can’t reproduce every wrinkle, but we can press, stretch, and shift the mold to cover as much of the important mass as possible. The better the mold fits, the better our approximate inference. The ELBO is like a score telling us how well the mold fits without needing to see the exact shape; we can raise this score by improving how well the model explains the data and by not straying too far from our prior beliefs.

Another analogy: think of locating a radio signal in a noisy environment. The true posterior is the exact signal we’d like to isolate, but we only have noisy measurements. We choose a tunable filter q with knobs (its parameters). Turning the knobs to raise the ELBO is like improving the filter so more true signal passes and less noise remains. In mean-field VI, we make a simplifying assumption that each knob controls a separate band (independence), so we can tune one band at a time using closed-form rules. In black-box VI, we don’t have an instruction manual for the knobs; instead, we jiggle them in smart ways (Monte Carlo gradients via reparameterization) and follow the direction that most increases the signal (the ELBO). In both cases, we aim for a good balance: explain the data while not overfitting, which the ELBO enforces automatically through its built-in regularization.

03 Formal Definition

Let x be observed data and z latent variables. We select a variational family Q = { q_ϕ(z) } parameterized by ϕ. Variational Inference solves argmax_ϕ L(ϕ), where the ELBO is

L(ϕ) = E_{q_ϕ(z)}[log p(x,z) − log q_ϕ(z)].

By Jensen’s inequality, L(ϕ) ≤ log p(x), and the gap equals the KL divergence from q_ϕ to the true posterior:

log p(x) = L(ϕ) + KL(q_ϕ(z) ∥ p(z|x)).

Therefore maximizing the ELBO is equivalent to minimizing KL(q_ϕ(z) ∥ p(z|x)). In mean-field VI, we restrict q(z) = ∏_{i=1}^m q_i(z_i). The optimal update for factor i, holding all others fixed, is

q_i*(z_i) ∝ exp{ E_{q_{−i}}[log p(x,z)] },

where q_{−i} is the product of all factors except i. When p is in the conjugate exponential family, these expectations yield closed-form updates, enabling Coordinate Ascent Variational Inference (CAVI). In black-box VI, we estimate ∇_ϕ L(ϕ) with Monte Carlo. For reparameterizable q_ϕ, write z = g_ϕ(ε) with ε ∼ p(ε) independent of ϕ. Then

∇_ϕ L(ϕ) = E_ε[∇_ϕ (log p(x, g_ϕ(ε)) − log q_ϕ(g_ϕ(ε)))],

which is the pathwise derivative (reparameterization trick).

04 When to Use

  • Intractable posteriors: Use VI when exact p(z|x) is unavailable due to complex likelihoods or high-dimensional latent variables (e.g., topic models, large hierarchical Bayes).
  • Conjugate models with factorization: Use mean-field CAVI for fast, stable inference with closed-form updates (e.g., Bayesian linear regression, Gaussian mixture models with conjugate priors).
  • Nonconjugate or deep models: Use black-box VI with reparameterization (e.g., VAEs, Bayesian neural networks, probabilistic matrix factorization) to handle arbitrary likelihoods and priors.
  • Large-scale datasets: Use stochastic VI (mini-batches) for scalability; the ELBO gradients can be unbiasedly estimated from subsamples with appropriate weighting.
  • Need for speed and differentiability: VI often runs much faster than MCMC and provides a differentiable objective (the ELBO) suitable for integration with automatic differentiation and GPU acceleration.
  • Amortized inference: When you must infer z for many x quickly (e.g., at test time in VAEs), learn an inference network that outputs q_ϕ(z|x) parameters directly.

⚠️Common Mistakes

  • Overly restrictive variational family: A mean-field diagonal Gaussian may under-estimate posterior uncertainty or miss multimodality. Remedy: Enrich q with covariance, mixtures, or normalizing flows.
  • Ignoring ELBO monitoring: Not tracking the ELBO can hide divergence or numerical issues. Always report ELBO, its moving average, and gradient norms.
  • High-variance gradients in black-box VI: Using the score-function estimator without control variates can stall learning. Prefer reparameterization when possible, increase the number of samples, and use variance reduction (baselines, Rao–Blackwellization).
  • KL direction confusion: Minimizing KL(q||p) is mode-seeking; it can avoid low-probability regions between modes. If mode-covering is desired, consider alternative divergences or tempered posteriors.
  • Poor initialization and step sizes: Starting with too narrow q or using aggressive learning rates can cause numerical instability. Initialize with reasonable scale and use adaptive optimizers (e.g., Adam) or line searches.
  • Broken factor updates in CAVI: Forgetting to exclude the current coordinate when computing residuals leads to biased updates. Maintain and update residuals carefully to get O(nd) per sweep.
  • Numerical underflow/overflow: Log-probabilities with tiny variances or large norms can cause -inf or nan. Work in log-space and clip/regularize variances.

Key Formulas

ELBO Definition

L(q) = E_{q(z)}[log p(x,z) − log q(z)]

Explanation: The ELBO measures how well q explains the joint p(x,z), while the entropy of q discourages overconfident (overly concentrated) approximations. Maximizing it yields a better posterior approximation.

ELBO–Evidence Decomposition

log p(x) = L(q) + KL(q(z) ∥ p(z|x))

Explanation: The gap between log evidence and the ELBO is the KL divergence from q to the true posterior. As ELBO increases, KL decreases and q approaches p(z|x).

KL Divergence

KL(q ∥ p) = E_q[log q(z) − log p(z)]

Explanation: This is the direction minimized by VI. It prefers placing mass where q can safely fit within p’s high-density regions, often resulting in mode-seeking behavior.

Mean-field Coordinate Update

q_i*(z_i) ∝ exp{ E_{q_{−i}}[log p(x,z)] }

Explanation: Given all other factors, the optimal update for factor i exponentiates the expected log-joint. This yields closed-form updates in conjugate exponential-family models.

Reparameterization

z = μ + σε,  ε ∼ N(0,1)

Explanation: Samples from q are expressed as a deterministic transform of noise independent of parameters. This enables low-variance pathwise gradients for the ELBO.

Pathwise Gradient

∇_ϕ E_{q_ϕ(z)}[f(z)] = E_ε[∇_ϕ f(g_ϕ(ε))]

Explanation: By reparameterizing, we can move the gradient operator inside the expectation. This is central to black-box VI with differentiable models.

Score-Function Estimator

∇_ϕ E_{q_ϕ(z)}[f(z)] = E_{q_ϕ(z)}[f(z) ∇_ϕ log q_ϕ(z)]

Explanation: Also known as REINFORCE, it requires only samples from q and log q’s gradient. It has higher variance than pathwise gradients but works without reparameterization.

Gaussian–Standard Normal KL

KL(N(μ,Σ) ∥ N(0,I)) = (1/2) (tr(Σ) + μᵀμ − k − log det Σ)

Explanation: A closed form for the KL from a Gaussian to a standard normal in k dimensions. Useful for ELBOs that include a Gaussian prior on latents.

Gaussian Entropy

H(N(μ,Σ)) = (1/2) log((2πe)^k det Σ)

Explanation: The entropy of a multivariate normal depends on its covariance determinant. It appears in the ELBO via the −E_q[log q(z)] term.

Quadratic Expectation under Mean-field

E_q[∥y − Xw∥²] = ∥y − X E[w]∥² + Σ_{j=1}^d ∥X_{·j}∥² Var(w_j)

Explanation: With independent coordinates, cross-covariances vanish. This identity yields simple ELBO terms for Gaussian linear models in CAVI.

Complexity Analysis

Let n be the number of data points, d the number of parameters/latent dimensions, and I the number of outer iterations.

  • CAVI for mean-field Gaussian linear regression: A naive sweep re-computing residuals costs O(nd) per coordinate, i.e., O(nd²) per iteration. Using a maintained residual r = y − X m and per-column updates, each coordinate update becomes O(n), giving O(nd) per sweep. Precomputations (column squared norms) cost O(nd). Memory is O(nd) to store X (dense) plus O(n + d) for vectors. Convergence typically occurs in tens to a few hundred sweeps depending on conditioning and prior strength.
  • Black-box VI with reparameterization (single latent, full-batch): Per iteration, drawing S Monte Carlo samples and evaluating the joint over n observations costs O(Sn). Gradients are computed in the same pass. Memory is O(n) for data and O(1) for parameters. With mini-batches of size b, the cost is O(Sb) per step; one effective epoch takes n/b steps.
  • Numerical costs: ELBO evaluation adds negligible overhead compared to gradient computations (same asymptotic order). Enriching q with a full covariance requires O(d²) storage and O(d²)–O(d³) linear algebra per iteration. For models with local latents per data point (e.g., GMM responsibilities or VAE latents), mean-field VI incurs O(nk) updates for k clusters/components, while amortized VI shifts that cost to a learned encoder evaluated in O(n) forward passes.
  • Practical runtime considerations: Vectorized BLAS for X–vector products and cache-friendly memory layouts can reduce constants substantially. Gradient variance in black-box VI scales as O(1/S): doubling S halves variance but doubles compute, so adaptive S or control variates can optimize wall-clock convergence.

Code Examples

CAVI for Bayesian Linear Regression with Mean-field Diagonal q(w)
#include <bits/stdc++.h>
using namespace std;

// Coordinate Ascent Variational Inference (CAVI)
// Model: y = X w + noise, noise ~ N(0, sigma2 I)
// Prior: w ~ N(0, alpha^{-1} I)
// Variational family: q(w) = \prod_j N(w_j | m_j, s2_j)
// Updates:
//   s2_j = 1 / (alpha + (1/sigma2) * ||X_col_j||^2)
//   m_j  = s2_j * (1/sigma2) * sum_i X_{ij} * ( y_i - sum_{k!=j} X_{ik} m_k )
// We implement O(nd) per sweep using a maintained residual r = y - X m.

struct Dataset {
    int n, d;
    vector<vector<double>> X_cols;  // d columns, each of length n
    vector<double> y;               // length n
    vector<double> col_norm2;       // ||X_col_j||^2
};

Dataset make_synthetic(int n, int d, double sigma, double alpha_true, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    Dataset data; data.n = n; data.d = d;
    data.X_cols.assign(d, vector<double>(n));
    data.y.assign(n, 0.0);
    vector<double> w_true(d);
    // True weights ~ N(0, alpha_true^{-1} I)
    for (int j = 0; j < d; ++j) w_true[j] = N01(rng) / sqrt(alpha_true);
    // Design matrix X ~ N(0,1)
    for (int j = 0; j < d; ++j) {
        for (int i = 0; i < n; ++i) data.X_cols[j][i] = N01(rng);
    }
    // Generate y = X w_true + noise
    normal_distribution<double> noise(0.0, sigma);
    for (int i = 0; i < n; ++i) {
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += data.X_cols[j][i] * w_true[j];
        yi += noise(rng);
        data.y[i] = yi;
    }
    // Precompute column norms
    data.col_norm2.assign(d, 0.0);
    for (int j = 0; j < d; ++j) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += data.X_cols[j][i] * data.X_cols[j][i];
        data.col_norm2[j] = s;
    }
    return data;
}

struct VariationalParams {
    vector<double> m;   // means
    vector<double> s2;  // variances (diagonal)
};

// Compute ELBO under mean-field diagonal q(w):
//   E_q[log p(y|w)] + E_q[log p(w)] - E_q[log q(w)]
// E[||y - X w||^2] = ||y - X m||^2 + sum_j s2_j * ||X_col_j||^2
// E[||w||^2]       = ||m||^2 + sum_j s2_j
double elbo(const Dataset& D, const VariationalParams& vp, double sigma2, double alpha) {
    int n = D.n, d = D.d;
    // Compute residual r = y - X m
    vector<double> r = D.y;
    for (int j = 0; j < d; ++j) {
        double mj = vp.m[j];
        if (mj == 0.0) continue;
        for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * mj;
    }
    // ||r||^2
    double r2 = 0.0; for (double v : r) r2 += v * v;
    // sum_j s2_j * ||X_col_j||^2
    double trace_term = 0.0;
    for (int j = 0; j < d; ++j) trace_term += vp.s2[j] * D.col_norm2[j];
    // E_q[log p(y|w)]
    double ll = -0.5 * n * log(2.0 * M_PI * sigma2) - 0.5 * (r2 + trace_term) / sigma2;
    // E_q[log p(w)] with prior N(0, alpha^{-1} I)
    double m2 = 0.0, s2sum = 0.0;
    for (int j = 0; j < d; ++j) { m2 += vp.m[j] * vp.m[j]; s2sum += vp.s2[j]; }
    double lpw = -0.5 * d * log(2.0 * M_PI / alpha) - 0.5 * alpha * (m2 + s2sum);
    // Entropy of q (sum of 1D Gaussians)
    double H = 0.0;
    for (int j = 0; j < d; ++j) H += 0.5 * log(2.0 * M_E * M_PI * vp.s2[j]);
    return ll + lpw + H;
}

VariationalParams cavi_bayesian_lr(const Dataset& D, double sigma2, double alpha,
                                   int max_iters = 200, double tol = 1e-6) {
    int n = D.n, d = D.d;
    VariationalParams vp; vp.m.assign(d, 0.0); vp.s2.assign(d, 0.0);
    // Precompute s2_j, which doesn't depend on other coordinates in this model
    for (int j = 0; j < d; ++j) vp.s2[j] = 1.0 / (alpha + (1.0 / sigma2) * D.col_norm2[j]);
    // Maintain residual r = y - X m
    vector<double> r = D.y;  // initially m = 0 so r = y
    double prev_elbo = elbo(D, vp, sigma2, alpha);
    for (int it = 0; it < max_iters; ++it) {
        for (int j = 0; j < d; ++j) {
            double old_mj = vp.m[j];
            // Compute sum_i X_{ij} * (r_i + X_{ij} * m_j) efficiently
            double tmp = 0.0;
            for (int i = 0; i < n; ++i) tmp += D.X_cols[j][i] * (r[i] + D.X_cols[j][i] * old_mj);
            double new_mj = vp.s2[j] * (tmp / sigma2);
            double delta = new_mj - old_mj;
            if (delta != 0.0) {
                for (int i = 0; i < n; ++i) r[i] -= D.X_cols[j][i] * delta;
                vp.m[j] = new_mj;
            }
        }
        double cur_elbo = elbo(D, vp, sigma2, alpha);
        if (it % 10 == 0) {
            cerr << "Iter " << it << ": ELBO = " << cur_elbo << "\n";
        }
        if (abs(cur_elbo - prev_elbo) < tol * (1.0 + abs(prev_elbo))) break;
        prev_elbo = cur_elbo;
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 500, d = 20;
    double sigma = 0.5;            // noise std
    double sigma2 = sigma * sigma;
    double alpha = 1.0;            // prior precision

    Dataset D = make_synthetic(n, d, sigma, alpha, 123);
    VariationalParams vp = cavi_bayesian_lr(D, sigma2, alpha, 300, 1e-7);

    // Report learned mean and average variance
    double avg_s2 = accumulate(vp.s2.begin(), vp.s2.end(), 0.0) / d;
    cout << fixed << setprecision(6);
    cout << "Learned mean w (first 5): ";
    for (int j = 0; j < min(d, 5); ++j) cout << vp.m[j] << (j + 1 < min(d, 5) ? ", " : "\n");
    cout << "Average posterior variance per weight: " << avg_s2 << "\n";
    cout << "Final ELBO: " << elbo(D, vp, sigma2, alpha) << "\n";
    return 0;
}

We approximate the posterior over regression weights w with a factorized Gaussian q(w)=∏ N(w_j|m_j,s_j^2). For Gaussian likelihood and Gaussian prior, the CAVI updates are closed-form. We maintain the residual r=y−X m so each coordinate update is O(n), yielding O(nd) per sweep. We also compute the ELBO exactly under the mean-field q to monitor convergence. The code generates synthetic data, runs CAVI, and prints the learned posterior means and average variances.

Time: Per sweep O(nd); typically tens to hundreds of sweeps for convergence.
Space: O(nd) to store X, plus O(n + d) for vectors and variational parameters.
Black-box VI with Reparameterization for a Single Latent Gaussian
#include <bits/stdc++.h>
using namespace std;

// Model: z ~ N(0,1); x_i | z ~ N(z, sigma^2) for i=1..n (shared latent z)
// Variational family: q(z) = N(mu, s^2) with parameters (mu, l = log s^2)
// ELBO(mu,l) = E_q[ log p(z) + sum_i log p(x_i|z) - log q(z) ]
// Reparameterization: z = mu + s * eps, eps ~ N(0,1)
// Pathwise gradients:
//   Let A'(z) = d/dz [ log p(z) + sum_i log p(x_i|z) ]
//             = -z + (sum x_i)/sigma^2 - (n/sigma^2) z
//   With s = exp(0.5*l): dL/dmu = A'(z), dL/dl = 0.5 * (A'(z) * s * eps + 1)

struct BBVIConfig {
    int iters = 2000;     // optimization steps
    int S = 8;            // MC samples per step
    double lr_mu = 1e-2;  // learning rate for mu
    double lr_l = 1e-2;   // learning rate for log-variance l
};

struct Data { vector<double> x; double sigma; };

struct VarParams { double mu, l; };  // l = log s^2

// Monte Carlo estimate of the ELBO and its gradients w.r.t. (mu, l)
struct Est { double elbo; double gmu; double gl; };

Est estimate_elbo_and_grad(const Data& D, const VarParams& vp, int S, mt19937& rng) {
    normal_distribution<double> N01(0.0, 1.0);
    int n = (int)D.x.size();
    double sigma2 = D.sigma * D.sigma;
    double s = exp(0.5 * vp.l);
    double sumx = accumulate(D.x.begin(), D.x.end(), 0.0);
    const double c_log_pz = -0.5 * log(2.0 * M_PI);
    const double c_log_px = -0.5 * log(2.0 * M_PI * sigma2);
    double elbo_acc = 0.0, gmu_acc = 0.0, gl_acc = 0.0;

    for (int t = 0; t < S; ++t) {
        double eps = N01(rng);
        double z = vp.mu + s * eps;
        // log p(z)
        double log_pz = c_log_pz - 0.5 * (z * z);
        // sum log p(x_i | z)
        double quad = 0.0;
        for (double xi : D.x) {
            double diff = xi - z;
            quad += diff * diff;
        }
        double log_px = n * c_log_px - 0.5 * quad / sigma2;
        // log q(z; mu, s)
        double log_q = -0.5 * ((z - vp.mu) * (z - vp.mu) / (s * s) + 2.0 * log(s) + log(2.0 * M_PI));
        double L = log_pz + log_px - log_q;
        elbo_acc += L;
        // Gradients
        double Aprime = -z + (sumx / sigma2) - (n / sigma2) * z;  // d/dz of log p(z) + sum log p(x|z)
        double dL_dmu = Aprime;                                   // pathwise derivative
        double dL_dl = 0.5 * (Aprime * s * eps + 1.0);            // chain rule via s = exp(0.5*l)
        gmu_acc += dL_dmu;
        gl_acc += dL_dl;
    }
    Est out; out.elbo = elbo_acc / S; out.gmu = gmu_acc / S; out.gl = gl_acc / S;
    return out;
}

VarParams fit_bbvi(const Data& D, const BBVIConfig& cfg, unsigned seed = 7) {
    mt19937 rng(seed);
    VarParams vp{0.0, 0.0};  // initialize mu = 0, l = 0 => s^2 = 1
    for (int it = 1; it <= cfg.iters; ++it) {
        Est e = estimate_elbo_and_grad(D, vp, cfg.S, rng);
        // Simple SGD updates
        vp.mu += cfg.lr_mu * e.gmu;
        vp.l += cfg.lr_l * e.gl;
        if (it % 200 == 0) {
            double s2 = exp(vp.l);
            cerr << "Iter " << it << ": ELBO = " << e.elbo
                 << ", mu = " << vp.mu << ", s2 = " << s2 << "\n";
        }
    }
    return vp;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from the model with a true latent z_true
    int n = 200; double sigma = 0.75; double z_true = 1.75;
    mt19937 rng(123);
    normal_distribution<double> Nzt(z_true, sigma);
    Data D; D.sigma = sigma; D.x.resize(n);
    for (int i = 0; i < n; ++i) D.x[i] = Nzt(rng);

    BBVIConfig cfg; cfg.iters = 2000; cfg.S = 8; cfg.lr_mu = 5e-3; cfg.lr_l = 5e-3;
    VarParams vp = fit_bbvi(D, cfg, 321);

    cout << fixed << setprecision(6);
    cout << "Learned mu: " << vp.mu << "\n";
    cout << "Learned s2: " << exp(vp.l) << "\n";
    return 0;
}

This example performs black-box VI for a simple latent Gaussian model where a single latent z explains all observations. We use a diagonal Gaussian q(z)=N(μ,s^2) and optimize the ELBO via the reparameterization trick: z=μ+sε. The pathwise gradients have simple closed forms, enabling a concise SGD implementation without external libraries. The code reports ELBO progress and converges toward the true posterior (which is Gaussian for this model).

Time: O(Sn) per iteration, where S is the number of Monte Carlo samples.
Space: O(n) for data and O(1) for the two variational parameters.
#variational inference#elbo#kl divergence#mean-field#cavi#black-box vi#reparameterization trick#stochastic variational inference#gaussian posterior#bayesian linear regression#vae#amortized inference#monte carlo gradients#coordinate ascent#jensen's inequality