Langevin Dynamics & Score-Based Sampling
Key Points
- •Langevin dynamics is a noisy gradient-ascent method that moves particles toward high probability regions while adding Gaussian noise to ensure proper exploration.
- •Unadjusted Langevin (ULA) uses the update x_{t+1} = x_t + \eta \nabla \log p(x_t) + \sqrt{2\eta}\,\epsilon_t and targets p approximately; MALA adds a Metropolis–Hastings accept step to obtain exact samples from p.
- •Score-based sampling replaces the true score \nabla \log p(x) with a learned score s_\theta that estimates gradients of log-densities at different noise levels.
- •Annealed Langevin sampling runs Langevin steps from large to small noise scales, gradually sharpening samples toward the data distribution.
- •In practice, stability depends heavily on the step size \eta, preconditioning, and whether you include a Metropolis correction.
- •Computational cost per step is dominated by evaluating the score/gradient; MALA roughly doubles this cost due to proposal density evaluations.
- •SGLD uses stochastic gradients and the same noise scaling as ULA, enabling scalable Bayesian posterior sampling from minibatches.
- •C++ implementations require careful random number handling, numerical stability, and consistent noise scaling.
Prerequisites
- →Multivariable calculus (gradients, Jacobians) — Computing and interpreting the score \nabla \log p(x) requires gradient operations.
- →Probability and Gaussian distributions — Understanding log-densities, Gaussian noise, and mixture models is essential for Langevin updates and proposal densities.
- →Markov chains and Metropolis–Hastings — MALA’s correctness and acceptance rules rely on MH theory and detailed balance.
- →Stochastic differential equations (basics) — Langevin dynamics originates from an SDE; discretization and stability follow from SDE numerics.
- →Numerical methods (Euler–Maruyama) — ULA is a specific discretization; tuning step sizes relates to numerical stability.
- →Neural networks (for score-based models) — Learning s_\theta requires understanding training objectives like denoising score matching.
- →Random number generation — Implementing Gaussian and uniform RNGs correctly is critical for Langevin noise and acceptance tests.
- →Bayesian inference (posterior, priors, likelihood) — Common application of Langevin methods is sampling from posteriors with gradient-accessible log-likelihoods.
Detailed Explanation
01 Overview
Langevin dynamics is a way to sample from a probability distribution by combining gradient information (which points toward higher probability regions) with random noise (which keeps the process exploring). Imagine you want to draw samples from a complicated density p(x) in high dimensions. Classic random-walk Metropolis can be slow because it ignores the shape of p. Langevin dynamics instead follows the gradient of the log-density, often called the score, to move more intelligently. The continuous-time formulation is an SDE whose stationary distribution is p. A simple time-discretization gives the Unadjusted Langevin Algorithm (ULA), which is easy to implement but introduces discretization bias. Adding a Metropolis–Hastings correction yields MALA, which removes this bias at the cost of more computation. In modern machine learning, score-based generative models learn the score function with a neural network and then use annealed Langevin sampling to synthesize data. This connects Bayesian sampling, energy-based modeling, and diffusion-style generation. Practically, you choose a step size, compute a score/gradient at the current point, add Gaussian noise scaled by the step, and iterate. You also manage burn-in and thinning to reduce correlation between samples. Langevin methods are powerful when gradients are available or can be learned, and they scale better than gradient-free MCMC in high dimensions.
02 Intuition & Analogies
Suppose you are hiking in fog toward the most likely places to find water (high probability). If you only walk uphill using the steepest ascent (pure gradient), you’ll converge to a peak and stop exploring, missing other lakes. If you wander completely randomly (random walk), you might eventually find lakes but very slowly. Langevin dynamics blends these: you mostly head uphill (following the score, the gradient of log probability) but with a calibrated jitter that keeps you exploring. The size of your steps is like the step size \eta; the jitter’s intensity is scaled as \sqrt{2\eta}. With small \eta, your motion better approximates a physical process (a particle buffeted by thermal noise) whose long-run position frequency matches p(x). Now picture a blurred photograph that becomes sharper as you clean your glasses. Score-based sampling uses a learned guide (the score network) that tells you, for a given blur level, which direction to nudge pixels to look more like real photos. You start with a lot of blur (large noise), make many Langevin-like adjustments (following the score plus noise), then reduce blur gradually (annealing), refining details at each level. This staircase of blurs prevents you from getting stuck and helps you home in on realistic images. In Bayesian inference, think of the score as a tug from the posterior density; the noise prevents overconfidence and maintains proper uncertainty. The Metropolis correction in MALA is like a quality-check that sometimes rejects steps that would otherwise miscalibrate the distribution due to discretization errors.
03 Formal Definition
Langevin dynamics is the diffusion dX_t = \nabla \log p(X_t)\,dt + \sqrt{2}\,dW_t, where W_t is standard Brownian motion; under mild regularity conditions, p is its stationary distribution. The Unadjusted Langevin Algorithm (ULA) is the Euler–Maruyama discretization x_{t+1} = x_t + \eta \nabla \log p(x_t) + \sqrt{2\eta}\,\epsilon_t with \epsilon_t \sim N(0, I), which targets p only approximately for \eta > 0. MALA uses this update as a proposal q(y|x) = N(x + \eta \nabla \log p(x),\ 2\eta I) and accepts with probability \min\{1,\ p(y)q(x|y)\,/\,p(x)q(y|x)\}, restoring p as the exact stationary distribution.
04 When to Use
Use Langevin dynamics when you can compute (or approximate) gradients of the log-density. Typical settings include Bayesian posterior sampling where the log posterior’s gradient is available up to a constant, energy-based models where p(x) \propto \exp(-U(x)) and \nabla \log p(x) = -\nabla U(x), and high-dimensional problems where random-walk proposals mix poorly. Choose ULA for quick approximate sampling when small bias is acceptable and you want simplicity. Choose MALA when exactness matters or when larger steps are desired while maintaining correctness. Use SGLD when datasets are large and you can only afford minibatch gradients; the injected noise compensates for stochastic gradient noise. Use score-based sampling (annealed Langevin) when you have a learned score network s_\theta that estimates gradients across noise levels, as in modern generative modeling (images, audio, 3D). It is preferred when you can’t write p(x) explicitly but can train s_\theta via denoising score matching. Avoid plain Langevin if gradients are unavailable or extremely noisy without control variates, or when the target is multimodal with narrow modes and you cannot tune steps/temperatures. In very complex geometries with strong correlations, Hamiltonian Monte Carlo or Riemannian preconditioning may be more efficient.
⚠️Common Mistakes
- Wrong noise scaling: using \sqrt{\eta} instead of \sqrt{2\eta} breaks the stationary distribution. Always ensure noise standard deviation is \sqrt{2\eta}.
- Sign errors: if p(x) \propto \exp(-U(x)), then \nabla \log p(x) = -\nabla U(x). Using +\nabla U(x) causes divergence.
- Step size too large: large \eta leads to instability and low acceptance in MALA. Start small, increase cautiously, and monitor acceptance rates (aim roughly 50–70% in moderate d).
- Ignoring burn-in and autocorrelation: early samples haven’t mixed; later samples can be highly correlated. Discard burn-in and thin if necessary.
- Forgetting Metropolis correction for exactness: ULA is biased; use MALA if unbiased stationary sampling is required.
- Inconsistent random seeds across threads: correlated noise can bias results; use independent RNG streams.
- Poor preconditioning: isotropic steps in highly anisotropic targets cause slow mixing. Use a mass/preconditioner matrix approximating the target’s covariance or Hessian.
- For score-based sampling: mismatch between noise schedule and step sizes can cause oversmoothing or artifacts; use steps proportional to \sigma^2 and enough iterations per level.
Key Formulas
Langevin SDE
dX_t = \nabla \log p(X_t)\, dt + \sqrt{2}\, dW_t
Explanation: This stochastic differential equation has p as a stationary distribution under mild conditions. The drift follows the score and the diffusion adds isotropic Gaussian noise.
Unadjusted Langevin (ULA)
x_{t+1} = x_t + \eta \nabla \log p(x_t) + \sqrt{2\eta}\, \epsilon_t, \quad \epsilon_t \sim N(0, I)
Explanation: Euler–Maruyama discretization of Langevin diffusion. Small step size reduces bias but increases computation.
MALA Acceptance Probability
\alpha(x, y) = \min\left\{1,\ \frac{p(y)\, q(x \mid y)}{p(x)\, q(y \mid x)}\right\}
Explanation: Given proposal q(y|x) = N(x + \eta\, s(x),\ 2\eta I), this accept/reject rule corrects discretization error to target p exactly.
Gaussian Proposal Density
q(y \mid x) = N\big(y;\ x + \eta\, s(x),\ 2\eta I\big)
Explanation: The MALA proposal is Gaussian with mean x + \eta\, s(x) and covariance 2\eta I. Only differences of \log q matter in the acceptance ratio.
Fokker–Planck for Langevin
\partial_t \rho_t = \nabla \cdot \left( -\rho_t \nabla \log p + \nabla \rho_t \right)
Explanation: This PDE describes how densities evolve under Langevin dynamics. Setting \rho = p makes the right-hand side zero, showing p is stationary.
Preconditioned Langevin
x_{t+1} = x_t + \eta\, M \nabla \log p(x_t) + \sqrt{2\eta}\, M^{1/2} \epsilon_t
Explanation: A positive-definite matrix M adapts step sizes along different directions to improve mixing on anisotropic targets.
SGLD Update
\theta_{t+1} = \theta_t + \eta \left( \nabla \log p(\theta_t) + \frac{N}{|B|} \sum_{i \in B} \nabla \log p(y_i \mid \theta_t) \right) + \sqrt{2\eta}\, \epsilon_t
Explanation: Uses a stochastic (minibatch) estimate of the score. The added Gaussian noise maintains the correct stationary behavior asymptotically.
Denoising Score Matching
\mathcal{L}(\theta) = \mathbb{E}_{\sigma}\, \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{\tilde{x} \sim N(x, \sigma^2 I)} \left[ \lambda(\sigma) \left\| s_\theta(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right]
Explanation: Trains a score network to predict the score of Gaussian-smoothed data at multiple noise levels. Minimizing this recovers scores needed for annealed Langevin sampling.
Annealed Step Size Rule
\eta_l = \epsilon \cdot \sigma_l^2 / \sigma_L^2
Explanation: A practical heuristic sets the Langevin step at level l proportional to the squared noise level \sigma_l^2 to balance drift and diffusion.
KL Descent Under Langevin Flow
\frac{d}{dt} \mathrm{KL}(\rho_t \,\|\, p) = -\int \rho_t \left\| \nabla \log \frac{\rho_t}{p} \right\|^2 dx \le 0
Explanation: In continuous time, Langevin diffusion decreases KL divergence to p monotonically, justifying its convergence in distribution.
Complexity Analysis
Per iteration, ULA costs one score evaluation plus O(d) work for the noise and update; the score/gradient evaluation dominates. MALA roughly doubles the per-step cost because it evaluates the log-density and score at both the current and proposed points for the acceptance ratio. SGLD replaces the full-data gradient with a minibatch estimate, reducing the per-step gradient cost from O(Nd) to O(Bd) for batch size B. The total cost is the per-step cost times the number of iterations needed to mix, which grows with the conditioning of the target; annealed Langevin multiplies this by the number of noise levels times the steps per level.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple 2D vector utilities
struct Vec2 {
    double x, y;
};

Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }

double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

typedef vector<Vec2> Vec2List;

// Target: mixture of K isotropic Gaussians N(mu_k, s2 I)
struct GaussianMixture2D {
    Vec2List means;    // component means
    vector<double> w;  // weights (sum to 1)
    double s2;         // component variance (scalar)

    // log unnormalized density at x
    double logp(const Vec2 &x) const {
        // log sum_k w_k * N(x|mu_k, s2 I)
        // N(x|mu, s2 I) = (2pi s2)^(-d/2) * exp(-||x-mu||^2/(2 s2))
        // constants cancel in gradients and MH ratios, but we keep them for clarity
        double log_norm = -0.5 * 2 * log(2*M_PI*s2); // d=2
        vector<double> logs;
        logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 diff = x - means[k];
            double quad = dot(diff, diff);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        // log-sum-exp for numerical stability
        double m = *max_element(logs.begin(), logs.end());
        double sumexp = 0.0;
        for(double v: logs) sumexp += exp(v - m);
        return m + log(sumexp);
    }

    // score: grad_x log p(x)
    Vec2 score(const Vec2 &x) const {
        // \nabla log p(x) = sum_k r_k(x) * (mu_k - x)/s2, where r_k are posterior weights
        vector<double> logs;
        logs.reserve(means.size());
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        for(size_t k=0;k<means.size();++k){
            Vec2 diff = x - means[k];
            double quad = dot(diff, diff);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        vector<double> r(means.size());
        double Z = 0.0;
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){
            Vec2 term = means[k] - x; // (mu_k - x)
            g = g + (r[k] / s2) * term;
        }
        return g;
    }
};

// Draw standard normal N(0,1)
double randn(std::mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a simple 2-component mixture
    GaussianMixture2D mix;
    mix.means = { Vec2{2.0, 2.0}, Vec2{-2.0, -2.0} };
    mix.w = {0.5, 0.5};
    mix.s2 = 0.5; // component variance

    // Langevin parameters
    double eta = 0.02;          // step size (must be small enough for stability)
    size_t burn_in = 2000;      // discard initial samples
    size_t total_iters = 12000; // total iterations

    // RNG
    std::random_device rd;
    std::mt19937_64 rng(rd());

    // Initialize at the origin
    Vec2 x{0.0, 0.0};

    // Collect some samples after burn-in
    vector<Vec2> samples;
    for(size_t t=0; t<total_iters; ++t){
        // Compute score (gradient of log-density)
        Vec2 g = mix.score(x);
        // Gaussian noise with variance 2*eta per coordinate
        Vec2 noise{ randn(rng), randn(rng) };
        // ULA update: x <- x + eta * score + sqrt(2*eta) * noise
        x = x + eta * g + sqrt(2.0*eta) * noise;
        if(t >= burn_in && (t - burn_in) % 10 == 0){ // thin every 10 steps
            samples.push_back(x);
        }
    }

    // Print a few samples
    cout.setf(std::ios::fixed); cout<<setprecision(4);
    for(size_t i=0; i<min<size_t>(samples.size(), 20); ++i){
        cout << samples[i].x << "," << samples[i].y << "\n";
    }
    cerr << "Collected " << samples.size() << " samples after burn-in.\n";
    return 0;
}
```
Implements ULA for a 2D Gaussian mixture with known score. The update moves toward nearby mixture means while adding Gaussian noise scaled by \sqrt{2\eta}. After burn-in and thinning, printed points should cluster around the two modes near (2,2) and (-2,-2).
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }
double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

typedef vector<Vec2> Vec2List;

struct GaussianMixture2D {
    Vec2List means; vector<double> w; double s2;
    double logp(const Vec2 &x) const {
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        vector<double> logs; logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        double sumexp = 0.0; for(double v: logs) sumexp += exp(v-m);
        return m + log(sumexp);
    }
    Vec2 score(const Vec2 &x) const {
        vector<double> logs; logs.reserve(means.size());
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        double Z = 0.0; vector<double> r(means.size());
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){ Vec2 term = {means[k].x - x.x, means[k].y - x.y}; g = g + (r[k]/s2) * term; }
        return g;
    }
};

double randn(mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }
double randu(mt19937_64 &rng){ static uniform_real_distribution<double> ud(0.0,1.0); return ud(rng); }

double log_q_gaussian(const Vec2 &to, const Vec2 &from, const Vec2 &score_from, double eta){
    // q(to | from) = N(from + eta * score_from, 2 eta I)
    Vec2 mean = from + eta * score_from;
    Vec2 diff = {to.x - mean.x, to.y - mean.y};
    double quad = dot(diff, diff);
    // log density up to constant: -||diff||^2 / (4 eta) + C
    return - quad / (4.0 * eta);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    GaussianMixture2D mix;
    mix.means = { Vec2{2.0, 2.0}, Vec2{-2.0, -2.0} };
    mix.w = {0.5, 0.5};
    mix.s2 = 0.5;

    double eta = 0.05; // MALA can often use larger steps due to MH correction; still tune carefully
    size_t iters = 15000, burn_in = 3000;

    mt19937_64 rng(random_device{}());

    Vec2 x{0.0, 0.0};
    size_t accepted = 0, proposed = 0;
    vector<Vec2> samples;

    for(size_t t=0; t<iters; ++t){
        Vec2 g_x = mix.score(x);
        Vec2 noise{ randn(rng), randn(rng) };
        // Propose y
        Vec2 y = x + eta * g_x + sqrt(2.0*eta) * noise;

        // Compute acceptance log ratio
        double logp_x = mix.logp(x);
        double logp_y = mix.logp(y);
        Vec2 g_y = mix.score(y);
        double log_q_yx = log_q_gaussian(x, y, g_y, eta);
        double log_q_xy = log_q_gaussian(y, x, g_x, eta);
        double log_alpha = (logp_y - logp_x) + (log_q_yx - log_q_xy);

        double u = randu(rng);
        bool accept = (log(u) < log_alpha);
        ++proposed;
        if(accept){ x = y; ++accepted; }

        if(t >= burn_in && (t-burn_in)%10==0){ samples.push_back(x); }
    }

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    for(size_t i=0;i<min<size_t>(samples.size(),20);++i){ cout<<samples[i].x<<","<<samples[i].y<<"\n"; }
    cerr << "Acceptance rate: " << (100.0*accepted/proposed) << "%\n";
    return 0;
}
```
Implements MALA on the same Gaussian mixture. The proposal follows a Langevin step and is accepted/rejected with the Metropolis–Hastings rule based on target and proposal densities. This yields exact samples from the target in the limit of long runs.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }
double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

// Mixture model used as "data"; we will assume the oracle provides the score of the noise-convolved density p_\sigma
struct Mixture2D {
    vector<Vec2> means; vector<double> w; double s2; // intrinsic variance of components

    // Score of p_\sigma(x) = (p * N(0, sigma^2 I))(x) for an isotropic mixture
    // Each component's covariance becomes s2 + sigma^2
    Vec2 score_sigma(const Vec2 &x, double sigma) const {
        double v = s2 + sigma*sigma; // inflated variance
        double log_norm = -0.5 * 2 * log(2*M_PI*v);
        vector<double> logs; logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / v));
        }
        double m = *max_element(logs.begin(), logs.end());
        double Z = 0.0; vector<double> r(means.size());
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){ Vec2 term = {means[k].x - x.x, means[k].y - x.y}; g = g + (r[k] / v) * term; }
        return g; // equals \nabla_x log p_\sigma(x)
    }
};

double randn(mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a simple "data" mixture; treat score_sigma as an oracle (e.g., from a trained network)
    Mixture2D data;
    data.means = { {2.0, 0.0}, {-2.0, 0.0}, {0.0, 2.5} };
    data.w = {0.4, 0.4, 0.2};
    data.s2 = 0.3; // intrinsic variance

    // Noise schedule (from large to small)
    vector<double> sigmas; // geometric schedule
    double sigma_max = 2.0, sigma_min = 0.1; int L = 8;
    for(int i=0;i<L;++i){
        double p = (double)i/(L-1);
        double s = sigma_max * pow(sigma_min/sigma_max, p); // geometric interpolation
        sigmas.push_back(s);
    }

    // Steps per level and step-size rule: eta_l = c * sigma_l^2
    int steps_per_level = 100;
    double c = 0.1; // tune for stability/quality

    // RNG and initialization from pure noise
    mt19937_64 rng(random_device{}());
    Vec2 x{ randn(rng)*sigma_max, randn(rng)*sigma_max };

    for(int l=0; l<L; ++l){
        double sigma = sigmas[l];
        double eta = c * sigma * sigma;
        for(int k=0; k<steps_per_level; ++k){
            Vec2 g = data.score_sigma(x, sigma); // oracle score at this noise level
            Vec2 z{ randn(rng), randn(rng) };
            x = x + eta * g + sqrt(2.0*eta) * z;
        }
        // Optional: small printout to track progress
        cerr << "Level " << l+1 << "/" << L << ", sigma=" << sigma << ", x=(" << x.x << "," << x.y << ")\n";
    }

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    cout << x.x << "," << x.y << "\n"; // final synthesized sample
    return 0;
}
```
Demonstrates annealed Langevin sampling using an oracle score \nabla_x log p_\sigma(x) for a noise-convolved target. In practice, the oracle would be a trained score network s_\theta(x,\sigma). The schedule starts with large noise (easy landscape) and gradually sharpens, guiding x toward realistic samples.