
Variational Autoencoders (VAE) Theory

Key Points

  • A Variational Autoencoder (VAE) is a probabilistic autoencoder that learns to generate data by inferring hidden causes (latent variables) and decoding them back to observations.
  • Because exact inference is intractable, VAEs optimize the Evidence Lower BOund (ELBO), which balances reconstruction quality and how close the latent distribution is to a simple prior.
  • The ELBO is log pĪø(x) ≄ E_{qĻ•(z|x)}[log pĪø(x|z)] āˆ’ D_KL(qĻ•(z|x) || p(z)), trading off fidelity and regularization.
  • The reparameterization trick z = μ(x) + σ(x) āŠ™ ε with ε ∼ N(0, I) allows low-variance, differentiable Monte Carlo estimates of gradients.
  • For Gaussian decoders the reconstruction term becomes a scaled mean-squared error, and the KL to a standard normal has a closed form.
  • Amortized inference uses a neural encoder qĻ•(z|x) so inference at test time is fast and shared across data points.
  • Common pitfalls include posterior collapse, unstable variance parameterization, and mis-scaled KL (often mitigated with β-VAEs or KL warm-up).
  • In C++, you can compute ELBO terms, KL divergences, and pathwise gradients using standard libraries without deep-learning frameworks.

Prerequisites

  • Basic probability and random variables — VAEs are probabilistic models; understanding expectations, variances, and independence is essential.
  • Gaussian distributions — Common VAE priors and posteriors are Gaussian; closed-form KL and sampling rely on Gaussian identities.
  • KL divergence and information theory — The ELBO trades off reconstruction and KL; knowing KL properties clarifies regularization behavior.
  • Bayes’ rule and latent-variable models — VAEs approximate the intractable posterior pĪø(z|x) with qĻ•(z|x) via variational inference.
  • Linear algebra — Neural networks and likelihoods involve vectors, matrices, and norms.
  • Calculus and chain rule — Backpropagation and reparameterization gradients use derivatives through composite functions.
  • Monte Carlo estimation — ELBO expectations are approximated by sampling; variance–bias trade-offs matter.
  • Neural networks (MLP/CNN basics) — Encoders/decoders are neural nets mapping between x and z distributions.
  • Optimization and SGD — Training maximizes the ELBO with stochastic gradient-based methods.
  • Numerical stability practices — Working with log-variances, clipping, and epsilons prevents NaNs during training.

Detailed Explanation


01 Overview

Hook → Concept → Example: Imagine compressing a photo into a short code that, when expanded, recreates a realistic photo—not necessarily the exact same pixels, but one that looks as if it came from the same world. That’s the idea behind Variational Autoencoders (VAEs). A VAE is a probabilistic model that assumes each observation x is generated from a low-dimensional hidden variable z through a decoder pĪø(x|z), while z itself comes from a simple prior p(z), often a standard normal. Because computing log pĪø(x) = log ∫ pĪø(x|z)p(z) dz is usually impossible exactly, VAEs introduce an encoder qĻ•(z|x) to approximate the true posterior pĪø(z|x). The training objective maximizes the Evidence Lower BOund (ELBO), balancing reconstruction fidelity with a regularization that keeps qĻ•(z|x) close to p(z). Conceptually, this creates a learned, stochastic compression scheme: the encoder maps data to a distribution over codes; the decoder maps codes back to data. Example: With a diagonal Gaussian encoder and an isotropic Gaussian decoder, the ELBO reduces to a mean-squared reconstruction term plus a closed-form KL divergence. Optimizing this objective yields a model that can sample new, realistic data by first sampling z ∼ p(z) and then decoding.

02 Intuition & Analogies

Hook: Think of a film studio building a digital actor. They don’t store every possible pose pixel-by-pixel; they store a small set of controls—sliders for head turn, smile, lighting—and a renderer that turns these sliders into a full image. Concept: VAEs learn exactly this: a set of hidden sliders (z) and a renderer (decoder) that turns them into data. Instead of finding a single exact slider position for each image, VAEs learn a distribution over sliders: some uncertainty is allowed and even encouraged, because many different slider settings can plausibly explain a given image. The encoder learns, for each input, how to set the mean and spread (variance) of these sliders. The decoder then tries to reconstruct the input from a sample of sliders. The training rule encourages two things: (1) reconstructions should match the input (so the renderer is accurate), and (2) slider settings should look like simple, standard noise (so the sliders are easy to sample at test time). Example: Suppose you model handwritten digits. The latent z may capture stroke thickness, slant, and general digit identity. For a specific ā€˜3’, the encoder might say ā€œz around (0.7, āˆ’0.2) with some uncertainty.ā€ Sampling z from this distribution and rendering it gives slightly different but plausible ā€˜3’s. The regularizer nudges these z’s toward standard normal, preventing the model from memorizing each digit with unique, incomparable codes. The reparameterization trick is like saying: instead of rolling dice inside the network in a way that blocks gradients, roll dice (ε) outside and then transform them deterministically (z = μ + σ āŠ™ ε), so you can still use calculus to learn μ and σ.

03 Formal Definition

Let p(z) be a prior over latent variables (commonly N(0, I)) and pĪø(x|z) a decoder (likelihood) parameterized by Īø. The joint model is pĪø(x, z) = p(z) pĪø(x|z). For an observation x, the marginal likelihood is log pĪø(x) = log ∫ pĪø(x|z) p(z) dz, which is generally intractable. Introduce a variational posterior qĻ•(z|x) to approximate pĪø(z|x). The ELBO on log pĪø(x) is L(Īø, Ļ•; x) = E_{qĻ•(z|x)}[log pĪø(x|z)] āˆ’ D_KL(qĻ•(z|x) || p(z)), and it satisfies log pĪø(x) ≄ L(Īø, Ļ•; x). Equivalently, log pĪø(x) = L(Īø, Ļ•; x) + D_KL(qĻ•(z|x) || pĪø(z|x)), showing the gap is exactly the inference error. For Gaussian choices, qĻ•(z|x) = N(μϕ(x), diag(σϕ²(x))) and p(z) = N(0, I), the KL admits a closed form. To optimize with stochastic gradients, sample z via the reparameterization trick: z = μϕ(x) + ĻƒĻ•(x) āŠ™ ε with ε ∼ N(0, I). A single-sample Monte Carlo estimate of the ELBO provides unbiased gradients for Īø and low-variance pathwise gradients for Ļ• (minus analytic KL gradients). Variants include β-VAEs (scaling the KL term) and alternative likelihoods (Bernoulli for binary data, Gaussian for continuous).

04 When to Use

Use VAEs when you want: (1) Generative modeling with controllable, continuous latent representations (e.g., image, audio, or text embeddings that can be smoothly interpolated). (2) Fast amortized inference—given x, you instantly get a distribution over z without running an expensive per-example optimizer. (3) Principled likelihood-based training, enabling model comparison via ELBO. (4) Semi-supervised learning—combine a supervised term with the ELBO. (5) Anomaly detection—unlikely x under the model (low ELBO) flags anomalies. (6) Missing data imputation—sample latent z and decode missing parts. (7) Compression—store μ and σ for z instead of raw x. Prefer VAEs over GANs when you need likelihoods, uncertainty, or reproducible training with stable gradients. Be cautious if your data are discrete with complex likelihoods (reparameterization for discrete z is harder), or when powerful decoders (e.g., autoregressive transformers) might cause posterior collapse; mitigations include β-VAEs, KL warm-up, free-bits, structured priors, or weaker decoders. VAEs are especially effective when latent structure is roughly continuous and the chosen likelihood fits the data type.

āš ļøCommon Mistakes

  • Ignoring variance parameterization: Predicting σ instead of log σ² can produce negative or numerically unstable variances. Prefer predicting the log-variance (logvar) and transforming with σ = exp(0.5Ā·logvar).
  • Using too few Monte Carlo samples: One sample works in practice, but for debugging and evaluation the variance of the ELBO estimate can mislead. Average across multiple samples when validating.
  • Forgetting analytic KL: For diagonal Gaussians, always use the closed-form KL to the standard normal. Estimating it via sampling adds noise without benefit.
  • Mismatched decoder likelihood: Using a Gaussian likelihood for inherently binary pixels or counts can distort training signals. Choose Bernoulli for binary data, Gaussian for continuous data, and Poisson/negative-binomial for counts.
  • Posterior collapse: An overly expressive decoder can ignore z, driving the KL to zero. Use β < 1 initially (warm-up), free-bits, architectural constraints, or weaker decoders.
  • Scaling issues: The reconstruction term scales with data dimension; forgetting this can make the KL seem too small. Monitor per-dimension terms or use β-VAEs to rebalance.
  • Numerical stability: Taking logs of variances or norms without an epsilon can cause NaNs. Add small epsilons to logs/variances and clip logvars.
  • Misinterpreting the ELBO: A higher ELBO means a better model on average, but sample quality can still lag GANs; assess with downstream metrics (FID, log-likelihood estimates) and reconstructions.

Key Formulas

ELBO

log pĪø(x) ≄ E_{qĻ•(z|x)}[log pĪø(x|z)] āˆ’ D_KL(qĻ•(z|x) || p(z))

Explanation: The ELBO lower-bounds the log evidence by a reconstruction term minus a KL regularizer. Maximizing it yields both good reconstructions and a posterior close to the prior.

Evidence Decomposition

log pĪø(x) = L(Īø, Ļ•; x) + D_KL(qĻ•(z|x) || pĪø(z|x))

Explanation: The gap between the true log-likelihood and the ELBO is exactly the KL between the approximate and true posterior. As q approaches the true posterior, the ELBO becomes tight.

KL to Standard Normal (Diagonal)

D_KL(N(μ, diag(σ²)) || N(0, I)) = (1/2) Ī£_{i=1}^{d} (μ_i² + σ_i² āˆ’ 1 āˆ’ log σ_i²)

Explanation: Closed-form KL for a diagonal Gaussian against a standard normal. This is used in nearly all practical VAEs to avoid sampling noise in the KL term.

Reparameterization Trick

z = μϕ(x) + ĻƒĻ•(x) āŠ™ ε,  ε ∼ N(0, I)

Explanation: Express the random variable z as a deterministic transform of base noise ε. This allows gradients to pass through z by differentiating the deterministic mapping.

Monte Carlo ELBO

L(Īø, Ļ•; x) ā‰ˆ (1/K) Ī£_{k=1}^{K} log pĪø(x|z^(k)) āˆ’ D_KL(qĻ•(z|x) || p(z)),  z^(k) ∼ qĻ•(z|x)

Explanation: A finite-sample estimate of the ELBO used in practice. Using reparameterized samples reduces estimator variance for gradients.

Gaussian Likelihood

log pĪø(x|z) = āˆ’(1/(2σ_x²)) ||x āˆ’ μθ(z)||₂² āˆ’ (d_x/2) log(2Ļ€Ļƒ_x²)

Explanation: For an isotropic Gaussian decoder, the log-likelihood decomposes into a scaled squared error term and a constant. This links VAEs with MSE-style reconstruction losses.

Decoder Gradient

āˆ‡Īø L(Īø, Ļ•; x) = E_{qĻ•(z|x)}[āˆ‡Īø log pĪø(x|z)]

Explanation: The gradient of the ELBO with respect to decoder parameters depends only on how the likelihood changes with z. It is estimated via reparameterized samples of z.

Pathwise Gradient for Encoder

āˆ‡Ļ• L(Īø, Ļ•; x) = E_ε[āˆ‡z log pĪø(x|z) Ā· āˆ‚z/āˆ‚Ļ•] āˆ’ āˆ‡Ļ• D_KL(qĻ•(z|x) || p(z))

Explanation: Using reparameterization, encoder gradients flow through z’s dependence on φ plus the analytic gradient of the KL term. This yields low-variance updates.

β-VAE Objective

L_β(Īø, Ļ•; x) = E_{qĻ•(z|x)}[log pĪø(x|z)] āˆ’ β D_KL(qĻ•(z|x) || p(z))

Explanation: Scaling the KL by β trades off reconstruction fidelity against latent regularization. Larger β encourages more disentangled or compressed representations.

Complexity Analysis

Consider a VAE with latent dimension d, observation dimension m, encoder/decoder forward costs C_enc and C_dec per example (often O(mĀ·h + hĀ·d) for hidden width h), batch size B, and K Monte Carlo samples per data point. A single ELBO estimate requires: (1) running the encoder once per example to obtain μ and logvar (cost BĀ·C_enc), (2) drawing K samples of ε and forming z = μ + σ āŠ™ ε (cost O(BĀ·KĀ·d)), (3) running the decoder K times per example to compute likelihood terms (cost BĀ·KĀ·C_dec), and (4) computing the analytic KL (cost O(BĀ·d)). Thus the total time per batch is O(BĀ·C_enc + BĀ·KĀ·C_dec + BĀ·KĀ·d + BĀ·d). In practice, the dominant term is typically BĀ·KĀ·C_dec, since decoder passes are far costlier than vector arithmetic in latent space. With K=1 (common in practice), time is roughly O(BĀ·(C_enc + C_dec)). Space complexity is driven by (a) parameters P = P_enc + P_dec, (b) activations for backprop (roughly proportional to batch size times layer widths), and (c) temporary storage for K latent samples. This gives O(P + BĀ·A + BĀ·KĀ·d), where A summarizes intermediate feature maps. When using diagonal Gaussians, KL computation adds negligible overhead (O(BĀ·d)). If one averages the ELBO over multiple samples or uses ensembles, both time (linear in K) and space (linear in K for cached activations if needed) increase accordingly. Numerical stability measures (storing log-variances) have no asymptotic cost but are critical for robust training. The standalone C++ examples here operate in O(KĀ·d + mĀ·d) time with O(d + m) memory, since they omit deep network layers.

Code Examples

Compute ELBO for a simple linear–Gaussian VAE (single data point)
#include <bits/stdc++.h>
using namespace std;

// Helper: dot product
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    for(size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}

// Helper: squared L2 norm of (a - b)
double sq_l2(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    for(size_t i = 0; i < a.size(); ++i){ double d = a[i] - b[i]; s += d*d; }
    return s;
}

// Matrix-vector multiply: y = W z + b
vector<double> matvec_add(const vector<vector<double>>& W, const vector<double>& z, const vector<double>& b){
    size_t m = W.size(), d = z.size();
    vector<double> y(m, 0.0);
    for(size_t i = 0; i < m; ++i){
        double s = 0.0;
        for(size_t j = 0; j < d; ++j) s += W[i][j]*z[j];
        y[i] = s + b[i];
    }
    return y;
}

// Log-density of isotropic Gaussian N(mean, sigma2 I) at x
double log_gaussian_isotropic(const vector<double>& x, const vector<double>& mean, double sigma2){
    const double TWO_PI = 2.0 * M_PI;
    size_t m = x.size();
    double quad = sq_l2(x, mean);
    return -0.5 * ((quad / sigma2) + m * log(TWO_PI * sigma2));
}

// Sample from diagonal Gaussian N(mu, diag(exp(logvar)))
vector<double> sample_diag_gaussian(const vector<double>& mu, const vector<double>& logvar, mt19937& rng){
    normal_distribution<double> stdn(0.0, 1.0);
    size_t d = mu.size();
    vector<double> z(d);
    for(size_t i = 0; i < d; ++i){
        double eps = stdn(rng);
        double sigma = exp(0.5 * logvar[i]);
        z[i] = mu[i] + sigma * eps;   // reparameterization: z = mu + sigma * eps
    }
    return z;
}

// Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_standard_normal(const vector<double>& mu, const vector<double>& logvar){
    double kl = 0.0;
    for(size_t i = 0; i < mu.size(); ++i){
        double m2 = mu[i]*mu[i];
        double v = exp(logvar[i]);
        kl += (m2 + v - 1.0 - logvar[i]);
    }
    return 0.5 * kl;
}

int main(){
    // Dimensions
    const size_t m = 3; // observation dim
    const size_t d = 2; // latent dim

    // One observed example x
    vector<double> x = {0.5, -1.0, 2.0};

    // Linear decoder: mean_x(z) = W z + b, with fixed isotropic variance sigma2_x
    vector<vector<double>> W = {{1.0, -0.5}, {0.2, 0.7}, {-0.3, 0.4}}; // m x d
    vector<double> b = {0.1, -0.2, 0.3};
    double sigma2_x = 0.05; // decoder observation noise variance

    // Encoder output for this x: q(z|x) = N(mu_q, diag(exp(logvar_q)))
    vector<double> mu_q = {0.3, -0.1};
    vector<double> logvar_q = {-0.2, 0.1}; // can be any real numbers

    // Monte Carlo estimate of E_q[ log p(x|z) ]
    mt19937 rng(42);
    size_t K = 2000; // number of samples for the expectation
    double recon_sum = 0.0;
    for(size_t k = 0; k < K; ++k){
        vector<double> z = sample_diag_gaussian(mu_q, logvar_q, rng);
        vector<double> mean_x = matvec_add(W, z, b);
        recon_sum += log_gaussian_isotropic(x, mean_x, sigma2_x);
    }
    double recon_term = recon_sum / static_cast<double>(K);

    // Analytic KL to N(0, I)
    double kl = kl_standard_normal(mu_q, logvar_q);

    // ELBO = E_q[ log p(x|z) ] - KL
    double elbo = recon_term - kl;

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Reconstruction term E_q[log p(x|z)] ā‰ˆ " << recon_term << "\n";
    cout << "KL(q||p) = " << kl << "\n";
    cout << "ELBO ā‰ˆ " << elbo << "\n";
    return 0;
}

This program constructs a tiny linear–Gaussian VAE: a linear decoder with isotropic Gaussian likelihood and a diagonal Gaussian encoder. It estimates the reconstruction term via Monte Carlo by sampling z from q(z|x) using the reparameterization trick internally (μ, logvar → σ, ε). The KL to a standard normal is computed in closed form. Subtracting the KL from the reconstruction term yields an ELBO estimate for a single data point.

Time: O(K Ā· (mĀ·d + m) + d) ā‰ˆ O(KĀ·mĀ·d) for K samples, observation dim m, latent dim d.
Space: O(mĀ·d + m + d) for parameters and temporary vectors; dominated by storing W (mĀ·d).

Analytical vs Monte Carlo KL for diagonal Gaussians
#include <bits/stdc++.h>
using namespace std;

// Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
double kl_diag_to_standard_normal_analytic(const vector<double>& mu, const vector<double>& logvar){
    double s = 0.0;
    for(size_t i = 0; i < mu.size(); ++i){
        double v = exp(logvar[i]);
        s += (mu[i]*mu[i] + v - 1.0 - logvar[i]);
    }
    return 0.5 * s;
}

// Log-density of diagonal Gaussian q at z
double log_q_diag(const vector<double>& z, const vector<double>& mu, const vector<double>& logvar){
    const double TWO_PI = 2.0 * M_PI;
    double s = 0.0;
    for(size_t i = 0; i < z.size(); ++i){
        double v = exp(logvar[i]);
        double diff = z[i] - mu[i];
        s += (diff*diff)/v + log(TWO_PI * v);
    }
    return -0.5 * s;
}

// Log-density of standard normal p at z
double log_p_standard_normal(const vector<double>& z){
    double s = 0.0;
    for(double zi : z) s += zi*zi;
    return -0.5 * (s + z.size()*log(2.0*M_PI));
}

// Reparameterized sample from q
vector<double> sample_q(const vector<double>& mu, const vector<double>& logvar, mt19937& rng){
    normal_distribution<double> stdn(0.0, 1.0);
    size_t d = mu.size();
    vector<double> z(d);
    for(size_t i = 0; i < d; ++i){
        double eps = stdn(rng);
        double sigma = exp(0.5*logvar[i]);
        z[i] = mu[i] + sigma*eps;
    }
    return z;
}

int main(){
    vector<double> mu = {0.2, -0.1, 0.3, 0.0, -0.4};
    vector<double> logvar = {-0.3, 0.2, -0.1, 0.0, 0.4};
    double kl_analytic = kl_diag_to_standard_normal_analytic(mu, logvar);

    // Monte Carlo estimate: E_q[ log q(z) - log p(z) ]
    mt19937 rng(123);
    size_t K = 20000; // increase for lower variance
    double sum_vals = 0.0;
    for(size_t k = 0; k < K; ++k){
        vector<double> z = sample_q(mu, logvar, rng);
        sum_vals += log_q_diag(z, mu, logvar) - log_p_standard_normal(z);
    }
    double kl_mc = sum_vals / static_cast<double>(K);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Analytic KL = " << kl_analytic << "\n";
    cout << "Monte Carlo KL ā‰ˆ " << kl_mc << "\n";
    cout << "Absolute error ā‰ˆ " << fabs(kl_analytic - kl_mc) << "\n";
    return 0;
}

This program verifies the closed-form KL divergence from a diagonal Gaussian to a standard normal using a Monte Carlo estimate E_q[log q(z) āˆ’ log p(z)]. As K grows, the Monte Carlo estimate concentrates around the analytic value, illustrating variance–accuracy trade-offs in stochastic estimation.

Time: O(K Ā· d) to sample and evaluate densities for K samples in d dimensions.
Space: O(d) for vectors; constant additional memory aside from parameters.

Reparameterization trick: pathwise gradient estimate for a quadratic loss
#include <bits/stdc++.h>
using namespace std;

// Monte Carlo estimates of gradients of E_q[f(z)] where
// q = N(mu, diag(exp(logvar))) and f(z) = 0.5 * ||z - a||^2.
//
// Pathwise (reparameterized) gradients:
//   z_i = mu_i + sigma_i * eps_i,  sigma_i = exp(0.5*logvar_i)
//   df/dz_i = z_i - a_i
//   d/dmu_i     E[f] = E[ z_i - a_i ]
//   d/dlogvar_i E[f] = E[ (z_i - a_i) * (0.5*sigma_i*eps_i) ]
//
// Exact gradients for comparison:
//   E[f] = 0.5 * sum_i ( (mu_i - a_i)^2 + sigma_i^2 )
//   d/dmu_i     = mu_i - a_i
//   d/dlogvar_i = 0.5 * sigma_i^2

int main(){
    size_t d = 3;
    vector<double> a = {0.5, -1.0, 0.3};        // target vector in the quadratic
    vector<double> mu = {0.2, -0.4, 0.1};       // encoder mean
    vector<double> logvar = {-0.1, 0.2, -0.3};  // encoder log-variance

    // Precompute sigmas
    vector<double> sigma(d), sigma2(d);
    for(size_t i = 0; i < d; ++i){ sigma[i] = exp(0.5*logvar[i]); sigma2[i] = sigma[i]*sigma[i]; }

    // Monte Carlo estimates
    mt19937 rng(7);
    normal_distribution<double> stdn(0.0, 1.0);
    size_t K = 30000; // number of samples
    vector<double> g_mu(d, 0.0), g_logvar(d, 0.0);

    for(size_t k = 0; k < K; ++k){
        vector<double> eps(d), z(d);
        for(size_t i = 0; i < d; ++i) eps[i] = stdn(rng);
        for(size_t i = 0; i < d; ++i) z[i] = mu[i] + sigma[i]*eps[i];
        for(size_t i = 0; i < d; ++i){
            double df_dz = z[i] - a[i];
            g_mu[i] += df_dz;                                  // contribution to d/dmu_i
            g_logvar[i] += df_dz * (0.5 * sigma[i] * eps[i]);  // contribution to d/dlogvar_i
        }
    }
    for(size_t i = 0; i < d; ++i){ g_mu[i] /= (double)K; g_logvar[i] /= (double)K; }

    // Exact gradients
    vector<double> g_mu_exact(d), g_logvar_exact(d);
    for(size_t i = 0; i < d; ++i){
        g_mu_exact[i] = mu[i] - a[i];
        g_logvar_exact[i] = 0.5 * sigma2[i];
    }

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Dimension-wise gradients (Monte Carlo ā‰ˆ Exact)\n";
    for(size_t i = 0; i < d; ++i){
        cout << "i=" << i
             << "  d/dmu: " << g_mu[i] << " vs " << g_mu_exact[i]
             << "  d/dlogvar: " << g_logvar[i] << " vs " << g_logvar_exact[i]
             << "\n";
    }
    return 0;
}

This example demonstrates the reparameterization trick by estimating pathwise gradients of an expectation using samples ε ∼ N(0, I) and chain rule through z = μ + σ āŠ™ ε. Because the function f is quadratic, exact gradients are known; the Monte Carlo estimates closely match them, showcasing low-variance, unbiased gradient estimation central to VAE training.

Time: O(K Ā· d) for K samples and latent dimension d.
Space: O(d) for storing vectors and gradient accumulators.