Evidence Lower Bound (ELBO)
Key Points
- The Evidence Lower Bound (ELBO) is a tractable lower bound on the log evidence log p(x) used to perform approximate Bayesian inference.
- ELBO splits into a reconstruction term E_q[log p(x|z)] and a regularization term −KL(q(z|x) || p(z)) that keeps the posterior close to the prior.
- Maximizing ELBO is equivalent to minimizing the KL divergence between the variational posterior q(z|x) and the true posterior p(z|x).
- Jensen’s inequality guarantees ELBO ≤ log p(x), and the gap equals KL(q(z|x) || p(z|x)), which is always nonnegative.
- For Gaussian models, the KL and many other terms have closed-form expressions, enabling efficient and stable C++ implementations.
- Monte Carlo estimation with the reparameterization trick provides unbiased, low-variance gradient estimates of the ELBO.
- Importance sampling with the log-sum-exp trick can estimate log p(x) and empirically verify the ELBO bound.
- ELBO underpins VAEs, variational inference in probabilistic models, and modern latent-variable learning.
Prerequisites
- Basic probability and random variables — Understanding priors, likelihoods, and expectations is necessary to interpret ELBO terms.
- Bayes’ rule — ELBO arises from approximating the intractable posterior p(z|x) defined via Bayes’ rule.
- KL divergence and information theory — ELBO uses KL(q||p) as a regularizer and its properties ensure the lower bound.
- Monte Carlo estimation — ELBO expectations are typically approximated via sampling.
- Gaussian distributions — Common closed-form KL and entropy results for Gaussians are heavily used in practice.
- Calculus and chain rule — Needed to derive gradients for ELBO, especially with reparameterization.
- Numerical stability in logs and exponentials — Stable computations (e.g., log-sum-exp) prevent underflow/overflow during ELBO estimation.
Detailed Explanation
01 Overview
The Evidence Lower Bound (ELBO) is a cornerstone concept in variational inference, a technique for approximating intractable Bayesian posteriors. In latent-variable models, we posit hidden variables z that generate observations x through a likelihood p(x|z) and are themselves drawn from a prior p(z). The true posterior p(z|x) is often hard to compute because its normalizer is the evidence p(x) = ∫ p(x, z) dz. ELBO introduces a family of simpler distributions q(z|x) and optimizes them so that they approximate p(z|x). The central result is a bound: log p(x) ≥ E_q[log p(x, z)] − E_q[log q(z|x)], which can be rewritten as E_q[log p(x|z)] − KL(q(z|x) || p(z)). This form decomposes the objective into a data reconstruction term and a complexity penalty via the Kullback–Leibler divergence to the prior. Maximizing ELBO both encourages accurate explanations of the data and prevents overfitting by limiting how much information is carried by z. Practically, ELBO can be estimated with Monte Carlo and differentiated via the reparameterization trick, enabling scalable learning in high dimensions. ELBO is foundational for Variational Autoencoders (VAEs), Bayesian neural networks, topic models (e.g., LDA), and many probabilistic graphical models.
02 Intuition & Analogies
Imagine compressing a photo album onto a USB drive. You want the files to look like the originals when decompressed (good reconstruction), but you also want the USB drive to be small (limited capacity). ELBO captures this exact trade-off for probabilistic models. The latent variable z is like the compressed code; the decoder p(x|z) reconstructs the data; the prior p(z) sets the shape and capacity of the codebook; and q(z|x) is the encoder that creates codes for each example. The first ELBO term, E_q[log p(x|z)], rewards reconstructions that look like the data—akin to having crisp photos after decompression. The second term, −KL(q(z|x) || p(z)), penalizes using too many unusual codes—like forcing us to pack information efficiently so that the codes remain near the prior’s typical set. Why is it a lower bound? Think of log p(x) as the ideal score for compressing and reconstructing images without constraints. Because we restrict ourselves to a simpler family q(z|x), we can’t reach that ideal unless our q matches the true posterior. Jensen’s inequality formalizes that the expected log of a random variable is less than or equal to the log of its expectation—hence a bound. The gap between the ELBO and the true log evidence is exactly how much our encoder q(z|x) deviates from the true posterior p(z|x). When the encoder is perfect, the gap vanishes and the bound becomes tight. In practice, we optimize the encoder and decoder jointly so that reconstructions are good yet the latent codes remain simple and smooth, which improves generalization.
03 Formal Definition
Let p(x, z) = p(x|z) p(z) be a latent-variable model and q(z|x) a variational family. The ELBO is defined as
ELBO(q; x) = E_{q(z|x)}[log p(x, z) − log q(z|x)] = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)).
It satisfies the identity log p(x) = ELBO(q; x) + KL(q(z|x) || p(z|x)); since the KL divergence is nonnegative, ELBO(q; x) ≤ log p(x), with equality exactly when q(z|x) = p(z|x).
04 When to Use
Use ELBO whenever you want Bayesian inference but the true posterior p(z|x) is intractable. This includes: (1) Latent variable models such as VAEs, where z represents compressed features and p(x|z) is a neural decoder; (2) Probabilistic topic models like LDA, where documents are mixtures over topics; (3) Hierarchical Bayesian models where conjugacy is absent or partial; (4) Large-scale problems that require amortized inference—learning a global encoder q(z|x) that works for many x efficiently; and (5) Scenarios requiring differentiable objectives for gradient-based optimization, where the reparameterization trick applies. ELBO is also useful as a training objective balancing data fit and model complexity via an interpretable regularizer. When q can be expressive (e.g., normalizing flows), ELBO can be very tight. When q is limited, ELBO still provides a principled target, and diagnostics like the bound gap (via importance sampling) inform whether q needs to be improved. If your latent variables are discrete, ELBO can still be used, but gradients may require REINFORCE-like estimators or continuous relaxations (e.g., Gumbel-Softmax).
⚠️ Common Mistakes
- Ignoring numerical stability in log-probabilities. Computing log p(x|z), log p(z), and log q(z|x) can underflow for extreme values; always work in log space and use tricks like log-sum-exp.
- Miscomputing the KL divergence. For Gaussians, use the correct closed-form KL with variance (not standard deviation) and ensure positivity using log-variance parametrization.
- Forgetting the sign: ELBO includes −KL(q || p). Accidentally maximizing reconstruction minus negative KL (i.e., adding KL) will over-regularize or destabilize training.
- Using too few Monte Carlo samples. This increases variance of gradient estimates and can obscure learning progress; use several samples per data point or average over minibatches.
- Confusing bound tightness with model quality. A tight ELBO can indicate a good variational family, but poor generative assumptions (bad p(x|z) or p(z)) can still yield poor samples.
- Mixing parameterizations. If you parametrize q with log-variance, derive gradients accordingly; treating it as variance can lead to exploding or vanishing gradients.
- Not checking the bound. Use importance sampling estimates of log p(x) to verify ELBO ≤ estimate; large gaps may suggest a more expressive q (e.g., flows) or improved encoder/decoder architectures.
Key Formulas
Evidence (Marginal Likelihood)
log p(x) = log ∫ p(x, z) dz = log ∫ p(x|z) p(z) dz
Explanation: The evidence is the log probability of data after integrating out latent variables. It is often intractable to compute directly.
ELBO (Joint Form)
ELBO(q; x) = E_{q(z|x)}[log p(x, z) − log q(z|x)]
Explanation: This is the basic definition of ELBO as an expectation under the variational posterior. It trades off data fit and complexity.
ELBO (Reconstruction + KL)
ELBO(q; x) = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))
Explanation: Decomposes ELBO into a reconstruction term and a KL regularizer to the prior. This is the standard VAE objective.
Bound Decomposition
log p(x) = ELBO(q; x) + KL(q(z|x) || p(z|x))
Explanation: The gap between log evidence and ELBO is the KL divergence between q and the true posterior, which is always nonnegative.
KL Divergence
KL(q || p) = E_q[log q(z) − log p(z)] = ∫ q(z) log (q(z)/p(z)) dz
Explanation: Definition of KL divergence. It measures how much q differs from p and is zero only when q equals p almost everywhere.
Gaussian-to-Standard Gaussian KL
KL(N(μ, σ²) || N(0, 1)) = 0.5 (μ² + σ² − 1 − log σ²)
Explanation: Closed-form KL for univariate Gaussian to standard normal. Useful for efficient ELBO computation in VAEs.
Reparameterization Trick
z = μ + σ ⊙ ε, ε ~ N(0, I)  ⇒  E_{q(z|x)}[f(z)] = E_{ε~N(0,I)}[f(μ + σ ⊙ ε)]
Explanation: Expresses sampling from q as a deterministic transform of noise, enabling gradients to pass through samples.
Jensen’s Inequality for ELBO
log p(x) = log E_q[p(x, z)/q(z|x)] ≥ E_q[log (p(x, z)/q(z|x))] = ELBO(q; x)
Explanation: The concavity of log implies the ELBO is a lower bound of the log evidence.
Monte Carlo ELBO
ELBO ≈ (1/S) Σ_{s=1..S} [log p(x, z^(s)) − log q(z^(s)|x)], z^(s) ~ q(z|x)
Explanation: Approximates the ELBO with S samples from q. Variance decreases with more samples.
Importance Sampling Estimate
log p(x) ≈ log [(1/S) Σ_{s=1..S} w^(s)], w^(s) = p(x, z^(s)) / q(z^(s)|x), z^(s) ~ q(z|x)
Explanation: Estimates log evidence using weighted samples from q. Using log-sum-exp yields numerical stability.
Gaussian Entropy
H[N(μ, Σ)] = 0.5 log det(2πe Σ) = (d/2) log(2πe) + 0.5 log det Σ
Explanation: Closed-form entropy of a multivariate Gaussian. Appears in ELBO via the −E_q[log q] term.
Gradients of Gaussian KL
For KL(N(μ, σ²) || N(0, 1)):  ∂KL/∂μ = μ,  ∂KL/∂(log σ²) = 0.5 (σ² − 1)
Explanation: Convenient gradients when parametrizing with mean and log-variance. Useful for manual gradient checks.
Complexity Analysis
For a d-dimensional latent variable and S Monte Carlo samples, a single ELBO estimate costs O(S·d) log-density evaluations (one likelihood, prior, and posterior evaluation per sample). Closed-form Gaussian KL terms cost O(d) and remove that part of the sampling variance entirely. Importance-sampling checks of log p(x) have the same per-sample cost but typically require far more samples for a tight estimate.
Code Examples
```cpp
#include <iostream>
#include <random>
#include <cmath>
#include <iomanip>

// This example computes ELBO for a simple model:
//   z ~ N(0, 1)
//   x | z ~ N(z, sigma_x^2)
//   q(z|x) = N(mu, var) with var = exp(logvar)
//   ELBO(x) = E_q[log p(x|z)] - KL(q(z|x) || p(z))

static const double PI = 3.14159265358979323846;

// Log PDF of univariate normal N(mean, var)
double normal_logpdf(double x, double mean, double var) {
    // add a small epsilon for numerical safety
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// KL(N(mu, var) || N(0,1)) in closed form
double kl_gaussian_to_standard(double mu, double var) {
    // 0.5 * (mu^2 + var - 1 - log var)
    double v = std::max(var, 1e-12);
    return 0.5 * (mu * mu + v - 1.0 - std::log(v));
}

int main() {
    // Fixed generative noise sigma_x
    double sigma_x = 0.5;  // std dev of p(x|z)
    double var_x = sigma_x * sigma_x;

    // A single observation x
    double x = 1.2;

    // Variational parameters q(z|x) = N(mu, var), with var = exp(logvar)
    double mu = 0.8;
    double logvar = -0.5;  // so var ~ 0.60653
    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    // Monte Carlo samples
    int S = 10000;

    std::mt19937 rng(123);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    // Estimate E_q[log p(x|z)] via Monte Carlo
    double sum_loglik = 0.0;
    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;  // reparameterization
        sum_loglik += normal_logpdf(x, z, var_x);
    }
    double recon_term = sum_loglik / static_cast<double>(S);

    // Analytic KL term
    double kl = kl_gaussian_to_standard(mu, var);

    // ELBO = recon - KL
    double elbo = recon_term - kl;

    // True log evidence log p(x) is available in closed form for this model:
    //   z ~ N(0,1), x|z ~ N(z, sigma_x^2)  =>  x ~ N(0, 1 + sigma_x^2)
    double log_p_x = normal_logpdf(x, 0.0, 1.0 + var_x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Reconstruction term E_q[log p(x|z)]: " << recon_term << "\n";
    std::cout << "KL(q||p): " << kl << "\n";
    std::cout << "ELBO: " << elbo << "\n";
    std::cout << "True log p(x): " << log_p_x << "\n";
    std::cout << "ELBO <= log p(x)? " << (elbo <= log_p_x ? "yes" : "no") << "\n";

    return 0;
}
```
We define a simple 1D latent Gaussian model with Gaussian likelihood and a Gaussian variational posterior. The reconstruction term is estimated by Monte Carlo using the reparameterization trick. The KL(q||p) to a standard normal prior is computed in closed form. Because the marginal p(x) is also Gaussian, we compute log p(x) exactly and verify the ELBO bound.
```cpp
#include <iostream>
#include <random>
#include <cmath>
#include <iomanip>

// Model: z ~ N(0,1), x|z ~ N(z, sigma_x^2)
// q(z|x) = N(mu, var), var = exp(logvar)
// ELBO = E_q[log p(x|z)] - KL(q||p)
// We compute gradients wrt mu and logvar using reparameterization.

static const double PI = 3.14159265358979323846;

double normal_logpdf(double x, double mean, double var) {
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// Derivative of log p(x|z) wrt z for Gaussian likelihood N(z, var_x)
double dloglik_dz(double x, double z, double var_x) {
    // loglik = -0.5*log(2pi var_x) - 0.5*(x - z)^2/var_x
    // derivative wrt z is (x - z)/var_x
    return (x - z) / var_x;
}

// Closed-form KL and its gradients w.r.t. mu and logvar (for q to N(0,1))
struct KLGrads {
    double value;
    double d_mu;
    double d_logvar;
};

KLGrads kl_and_grads(double mu, double logvar) {
    double var = std::exp(logvar);
    double kl = 0.5 * (mu * mu + var - 1.0 - std::log(std::max(var, 1e-12)));
    double d_mu = mu;                     // d/dmu KL = mu
    double d_logvar = 0.5 * (var - 1.0);  // d/d logvar KL = 0.5*(var - 1)
    return {kl, d_mu, d_logvar};
}

int main() {
    double x = 1.2;
    double sigma_x = 0.5;
    double var_x = sigma_x * sigma_x;

    // Initialize variational params
    double mu = 0.8;
    double logvar = -0.5;

    int S = 4096;      // MC samples for gradient estimate
    double lr = 0.05;  // learning rate for one gradient step

    std::mt19937 rng(42);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    // Reconstruction gradients via reparameterization
    double grad_mu_recon = 0.0;
    double grad_logvar_recon = 0.0;
    double recon_estimate = 0.0;

    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;
        // accumulate reconstruction value (optional, for logging)
        recon_estimate += normal_logpdf(x, z, var_x);
        // derivative of loglik wrt z
        double dl_dz = dloglik_dz(x, z, var_x);
        // chain rule: z = mu + sigma * eps, sigma = exp(0.5*logvar)
        // dsigma/dlogvar = 0.5 * sigma
        grad_mu_recon += dl_dz * 1.0;                      // dz/dmu = 1
        grad_logvar_recon += dl_dz * eps * (0.5 * sigma);  // dz/dlogvar = eps * 0.5*sigma
    }

    grad_mu_recon /= static_cast<double>(S);
    grad_logvar_recon /= static_cast<double>(S);
    recon_estimate /= static_cast<double>(S);

    // KL and its gradients
    KLGrads k = kl_and_grads(mu, logvar);

    // ELBO gradients: grad = grad_recon - grad_KL
    double grad_mu = grad_mu_recon - k.d_mu;
    double grad_logvar = grad_logvar_recon - k.d_logvar;

    // One gradient ascent step on ELBO
    mu += lr * grad_mu;
    logvar += lr * grad_logvar;

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Reconstruction term estimate: " << recon_estimate << "\n";
    std::cout << "KL value: " << k.value << "\n";
    std::cout << "Grad mu (recon, KL, total): " << grad_mu_recon << ", " << k.d_mu << ", " << grad_mu << "\n";
    std::cout << "Grad logvar (recon, KL, total): " << grad_logvar_recon << ", " << k.d_logvar << ", " << grad_logvar << "\n";
    std::cout << "Updated mu: " << mu << ", updated logvar: " << logvar << "\n";

    return 0;
}
```
This program computes Monte Carlo gradients of the ELBO with respect to the mean and log-variance of a Gaussian variational posterior using the reparameterization trick. The reconstruction term gradients are estimated via samples; the KL term and its gradients are analytic. We perform one gradient ascent step to illustrate parameter updates.
```cpp
#include <iostream>
#include <random>
#include <cmath>
#include <vector>
#include <algorithm>
#include <iomanip>

static const double PI = 3.14159265358979323846;

double normal_logpdf(double x, double mean, double var) {
    double v = std::max(var, 1e-12);
    double diff = x - mean;
    return -0.5 * (std::log(2.0 * PI * v) + (diff * diff) / v);
}

// Stable log-sum-exp of a vector of log-weights
double logsumexp(const std::vector<double>& logw) {
    double m = *std::max_element(logw.begin(), logw.end());
    double sum = 0.0;
    for (double lw : logw) sum += std::exp(lw - m);
    return m + std::log(std::max(sum, 1e-300));
}

int main() {
    // Model as before
    double sigma_x = 0.5;  // std dev of p(x|z)
    double var_x = sigma_x * sigma_x;
    double x = 1.2;

    // Variational q(z|x)
    double mu = 0.8;
    double logvar = -0.5;
    double var = std::exp(logvar);
    double sigma = std::sqrt(var);

    int S = 20000;  // number of importance samples

    std::mt19937 rng(7);
    std::normal_distribution<double> standard_normal(0.0, 1.0);

    std::vector<double> log_w;
    log_w.reserve(S);
    double sum_elbo_terms = 0.0;  // average of log w approximates the single-sample ELBO

    for (int s = 0; s < S; ++s) {
        double eps = standard_normal(rng);
        double z = mu + sigma * eps;  // sample from q via reparameterization

        double log_p_x_given_z = normal_logpdf(x, z, var_x);
        double log_p_z = normal_logpdf(z, 0.0, 1.0);
        double log_q = normal_logpdf(z, mu, var);

        double lw = log_p_x_given_z + log_p_z - log_q;  // log importance weight
        log_w.push_back(lw);
        sum_elbo_terms += lw;
    }

    // Importance sampling estimate of log p(x)
    double lse = logsumexp(log_w);
    double log_p_x_hat = lse - std::log(static_cast<double>(S));

    // Monte Carlo ELBO estimate (using single-sample form E_q[log w])
    double elbo_mc = sum_elbo_terms / static_cast<double>(S);

    // True log p(x) for verification
    double log_p_x_true = normal_logpdf(x, 0.0, 1.0 + var_x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "MC ELBO (E[log w]): " << elbo_mc << "\n";
    std::cout << "IS log p(x) (log E[w]): " << log_p_x_hat << "\n";
    std::cout << "True log p(x): " << log_p_x_true << "\n";
    std::cout << "Check: ELBO <= IS estimate? " << (elbo_mc <= log_p_x_hat ? "yes" : "no") << "\n";

    return 0;
}
```
We compare the Monte Carlo ELBO estimate E_q[log w] with an importance sampling estimate log E_q[w], where w = p(x, z)/q(z|x). By Jensen’s inequality, E[log w] ≤ log E[w], so ELBO should be below the IS estimate. We implement a numerically stable log-sum-exp to compute log E[w] reliably.