Langevin Dynamics & Score-Based Sampling

Key Points

  • Langevin dynamics is a noisy gradient-ascent method that moves particles toward high-probability regions while adding Gaussian noise to ensure proper exploration.
  • Unadjusted Langevin (ULA) uses x_{k+1} = x_k + η ∇ log p(x_k) + √(2η) z_k and targets p approximately; MALA adds a Metropolis–Hastings accept step to obtain exact samples from p.
  • Score-based sampling replaces the true score ∇ log p with a learned score s_θ that estimates gradients of log-densities at different noise levels.
  • Annealed Langevin sampling runs Langevin steps from large to small noise scales, gradually sharpening samples toward the data distribution.
  • In practice, stability depends heavily on the step size η, preconditioning, and whether you include a Metropolis correction.
  • Computational cost per step is dominated by evaluating the score/gradient; MALA roughly doubles this cost due to proposal density evaluations.
  • SGLD uses stochastic gradients and the same noise scaling as ULA, enabling scalable Bayesian posterior sampling from minibatches.
  • C++ implementations require careful random number handling, numerical stability, and consistent noise scaling √(2η).

Prerequisites

  • Multivariable calculus (gradients, Jacobians) — Computing and interpreting the score ∇ log p(x) requires gradient operations.
  • Probability and Gaussian distributions — Understanding log-densities, Gaussian noise, and mixture models is essential for Langevin updates and proposal densities.
  • Markov chains and Metropolis–Hastings — MALA’s correctness and acceptance rules rely on MH theory and detailed balance.
  • Stochastic differential equations (basics) — Langevin dynamics originates from an SDE; discretization and stability follow from SDE numerics.
  • Numerical methods (Euler–Maruyama) — ULA is a specific discretization; tuning step sizes relates to numerical stability.
  • Neural networks (for score-based models) — Learning s_θ requires understanding training objectives like denoising score matching.
  • Random number generation — Implementing Gaussian and uniform RNGs correctly is critical for Langevin noise and acceptance tests.
  • Bayesian inference (posterior, priors, likelihood) — A common application of Langevin methods is sampling from posteriors with gradient-accessible log-likelihoods.

Detailed Explanation


01 Overview

Langevin dynamics is a way to sample from a probability distribution by combining gradient information (which points toward higher probability regions) with random noise (which keeps the process exploring). Imagine you want to draw samples from a complicated density p(x) in high dimensions. Classic random-walk Metropolis can be slow because it ignores the shape of p. Langevin dynamics instead follows the gradient of the log-density, often called the score, to move more intelligently. The continuous-time formulation is an SDE whose stationary distribution is p. A simple time-discretization gives the Unadjusted Langevin Algorithm (ULA), which is easy to implement but introduces discretization bias. Adding a Metropolis–Hastings correction yields MALA, which removes this bias at the cost of more computation. In modern machine learning, score-based generative models learn the score function with a neural network and then use annealed Langevin sampling to synthesize data. This connects Bayesian sampling, energy-based modeling, and diffusion-style generation. Practically, you choose a step size, compute a score/gradient at the current point, add Gaussian noise scaled by the step, and iterate. You also manage burn-in and thinning to reduce correlation between samples. Langevin methods are powerful when gradients are available or can be learned, and they scale better than gradient-free MCMC in high dimensions.

02 Intuition & Analogies

Suppose you are hiking in fog toward the most likely places to find water (high probability). If you only walk uphill using steepest ascent (pure gradient), you’ll converge to a peak and stop exploring, missing other lakes. If you wander completely randomly (random walk), you might eventually find lakes but very slowly. Langevin dynamics blends these: you mostly head uphill (following the score, the gradient of log probability) but with a calibrated jitter that keeps you exploring. The size of your steps is like the step size η; the jitter’s intensity is scaled as √(2η). With small η, your motion better approximates a physical process (a particle buffeted by thermal noise) whose long-run position frequency matches p(x). Now picture a blurred photograph that becomes sharper as you clean your glasses. Score-based sampling uses a learned guide (the score network) that tells you, for a given blur level, which direction to nudge pixels to look more like real photos. You start with a lot of blur (large noise), make many Langevin-like adjustments (following the score plus noise), then reduce blur gradually (annealing), refining details at each level. This staircase of blurs prevents you from getting stuck and helps you home in on realistic images. In Bayesian inference, think of the score as a tug from the posterior density; the noise prevents overconfidence and maintains proper uncertainty. The Metropolis correction in MALA is like a quality check that sometimes rejects steps that would otherwise miscalibrate the distribution due to discretization errors.

03 Formal Definition

The overdamped Langevin diffusion targeting density p(x) on R^d is the stochastic differential equation dX_t = ∇ log p(X_t) dt + √2 dW_t, where W_t is standard Brownian motion. Under regularity conditions (smoothness, confinement), p is the stationary distribution of this diffusion. Discretizing with step η > 0 yields the Unadjusted Langevin Algorithm (ULA): x_{k+1} = x_k + η ∇ log p(x_k) + √(2η) z_k, where z_k ∼ N(0, I). ULA has discretization bias that shrinks as η → 0. The Metropolis-Adjusted Langevin Algorithm (MALA) uses the same proposal y = x + η ∇ log p(x) + √(2η) z but accepts it with probability α(x, y) = min{1, [p(y) q(x|y)] / [p(x) q(y|x)]}, where q(·|x) is N(x + η ∇ log p(x), 2η I). This correction yields an exact Markov chain with stationary distribution p. Score-based sampling generalizes by using a score function s(x, t) ≈ ∇_x log p_t(x), the gradient of the log-density of x under a noise-perturbed distribution p_t (e.g., Gaussian convolution with variance σ(t)²). Annealed Langevin sampling runs a sequence of ULA steps across decreasing noise scales σ_1 > σ_2 > ⋯ > σ_L, using s(x, σ_ℓ) to guide updates, gradually transforming noise into samples from the target (data) distribution.

04 When to Use

Use Langevin dynamics when you can compute (or approximate) gradients of the log-density. Typical settings include Bayesian posterior sampling where the log posterior’s gradient is available up to a constant, energy-based models where p(x) ∝ exp(−U(x)) and ∇ log p(x) = −∇U(x), and high-dimensional problems where random-walk proposals mix poorly. Choose ULA for quick approximate sampling when small bias is acceptable and you want simplicity. Choose MALA when exactness matters or when larger steps are desired while maintaining correctness. Use SGLD when datasets are large and you can only afford minibatch gradients; the injected noise compensates for stochastic gradient noise. Use score-based sampling (annealed Langevin) when you have a learned score network s_θ that estimates gradients across noise levels, as in modern generative modeling (images, audio, 3D). It is preferred when you can’t write p(x) explicitly but can train s_θ via denoising score matching. Avoid plain Langevin if gradients are unavailable or extremely noisy without control variates, or when the target is multimodal with narrow modes and you cannot tune steps/temperatures. In very complex geometries with strong correlations, Hamiltonian Monte Carlo or Riemannian preconditioning may be more efficient.

⚠️Common Mistakes

  • Wrong noise scaling: using √η instead of √(2η) breaks the stationary distribution. Always ensure the noise standard deviation is √(2η).
  • Sign errors: if p(x) ∝ exp(−U(x)), then ∇ log p(x) = −∇U(x). Using +∇U(x) causes divergence.
  • Step size too large: large η leads to instability and low acceptance in MALA. Start small, increase cautiously, and monitor acceptance rates (aim roughly 50–70% in moderate d).
  • Ignoring burn-in and autocorrelation: early samples haven’t mixed; later samples can be highly correlated. Discard burn-in and thin if necessary.
  • Forgetting the Metropolis correction for exactness: ULA is biased; use MALA if unbiased stationary sampling is required.
  • Inconsistent random seeds across threads: correlated noise can bias results; use independent RNG streams.
  • Poor preconditioning: isotropic steps in highly anisotropic targets cause slow mixing. Use a mass/preconditioner matrix approximating the target’s covariance or Hessian.
  • For score-based sampling: mismatch between the noise schedule and step sizes can cause oversmoothing or artifacts; use steps proportional to σ² and enough iterations per level.

Key Formulas

Langevin SDE

dX_t = ∇ log p(X_t) dt + √2 dW_t

Explanation: This stochastic differential equation has p as a stationary distribution under mild conditions. The drift follows the score and the diffusion adds isotropic Gaussian noise.

Unadjusted Langevin (ULA)

x_{k+1} = x_k + η ∇ log p(x_k) + √(2η) z_k,  z_k ∼ N(0, I)

Explanation: Euler–Maruyama discretization of Langevin diffusion. Small step size η reduces bias but increases computation.

MALA Acceptance Probability

α(x, y) = min{1, exp(log p(y) − log p(x) + log q(x|y) − log q(y|x))}

Explanation: Given proposal q(y|x) = N(x+η s(x), 2η I), this accept/reject rule corrects discretization error to target p exactly.

Gaussian Proposal Density

log q(y|x) = −(1/(4η)) ‖y − x − η s(x)‖² + C

Explanation: The MALA proposal is Gaussian with mean x + η s(x) and covariance 2η I. Only differences of log q matter in acceptance.

Fokker–Planck for Langevin

∂_t ρ_t = −∇·(ρ_t ∇ log p) + Δρ_t

Explanation: This PDE describes how densities evolve under Langevin dynamics. Setting ρ_t = p makes the right-hand side zero, showing p is stationary.

Preconditioned Langevin

x_{k+1} = x_k + η M ∇ log p(x_k) + √(2η) M^{1/2} z_k

Explanation: A positive-definite matrix M adapts step sizes along different directions to improve mixing on anisotropic targets.

SGLD Update

x_{k+1} = x_k + η ∇ log p̂(x_k) + √(2η) z_k

Explanation: Uses a stochastic (minibatch) estimate of the score. The added Gaussian noise maintains the correct stationary behavior asymptotically.

Denoising Score Matching

L_DSM(θ) = E_σ E_{x∼p_data} E_{z∼N(0,I)} [ ‖ s_θ(x + σz, σ) + z/σ ‖² ]

Explanation: Trains a score network to predict the score of Gaussian-smoothed data at multiple noise levels. Minimizing this recovers scores needed for annealed Langevin sampling.

Annealed Step Size Rule

η_ℓ = c σ_ℓ²

Explanation: A practical heuristic sets the Langevin step at level ℓ proportional to the squared noise level σ_ℓ², balancing drift and diffusion.

KL Descent Under Langevin Flow

(d/dt) KL(ρ_t ‖ p) = −E_{ρ_t}[ ‖∇ log(ρ_t / p)‖² ] ≤ 0

Explanation: In continuous time, Langevin diffusion decreases KL divergence to p monotonically, justifying its convergence in distribution.

Complexity Analysis

Let d be the dimension and C_s the cost to evaluate a score/gradient and, if needed, the log-density. For many analytic targets C_s = O(d); for neural score networks, C_s is proportional to the network’s FLOPs and may dominate runtime. One ULA iteration performs one score evaluation and generates d Gaussian random numbers, costing O(C_s + d). Over T iterations, time is O(T(C_s + d)); memory is O(d) to hold the current state, plus O(N d) if you store N samples. MALA requires evaluating both the forward and reverse proposal densities, which entails two score evaluations (at x and y) and two quadratic forms. Thus, per iteration it is roughly 2× more expensive than ULA: O(C_s + d) for proposing, then another O(C_s + d) to compute acceptance, totaling O(2C_s + d). The acceptance test also adds branching but negligible asymptotic overhead. In annealed Langevin with L noise levels and K_ℓ steps at each level, total time is O((Σ_{ℓ=1..L} K_ℓ)(C_s + d)). Choosing K_ℓ proportional to 1/η_ℓ is common, so finer levels may require more steps. SGLD replaces exact gradients with minibatch estimates; if a minibatch of size m is used for a dataset of size n, C_s scales like O(m d) instead of O(n d), trading bias/variance for speed. Space complexity for all methods remains O(d) for the running state and temporary buffers. If you cache gradients or store full chains, memory grows with the number of stored samples. Parallel chains increase memory linearly with the number of chains but can reduce wall-clock time via independent RNG streams. Stability and effective sample size depend on η, preconditioning, and the target’s geometry; too large an η increases rejection (MALA) or bias (ULA), while too small an η slows mixing.

Code Examples

ULA: Unadjusted Langevin sampling from a 2D Gaussian mixture
#include <bits/stdc++.h>
using namespace std;

// Simple 2D vector utilities
struct Vec2 {
    double x, y;
};

Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }

double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

typedef vector<Vec2> Vec2List;

// Target: mixture of K isotropic Gaussians N(mu_k, s2 I)
struct GaussianMixture2D {
    Vec2List means;    // component means
    vector<double> w;  // weights (sum to 1)
    double s2;         // component variance (scalar)

    // log density at x (normalizing constants kept for clarity)
    double logp(const Vec2 &x) const {
        // log sum_k w_k * N(x|mu_k, s2 I)
        // N(x|mu, s2 I) = (2 pi s2)^(-d/2) * exp(-||x-mu||^2/(2 s2))
        // constants cancel in gradients and MH ratios, but we keep them anyway
        double log_norm = -0.5 * 2 * log(2*M_PI*s2); // d=2
        vector<double> logs;
        logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 diff = x - means[k];
            double quad = dot(diff, diff);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        // log-sum-exp for numerical stability
        double m = *max_element(logs.begin(), logs.end());
        double sumexp = 0.0;
        for(double v: logs) sumexp += exp(v - m);
        return m + log(sumexp);
    }

    // score: grad_x log p(x)
    Vec2 score(const Vec2 &x) const {
        // grad log p(x) = sum_k r_k(x) * (mu_k - x)/s2, where r_k are responsibilities
        vector<double> logs;
        logs.reserve(means.size());
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        for(size_t k=0;k<means.size();++k){
            Vec2 diff = x - means[k];
            double quad = dot(diff, diff);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        vector<double> r(means.size());
        double Z = 0.0;
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){
            Vec2 term = means[k] - x; // (mu_k - x)
            g = g + (r[k] / s2) * term;
        }
        return g;
    }
};

// Draw standard normal N(0,1)
double randn(std::mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a simple 2-component mixture
    GaussianMixture2D mix;
    mix.means = { Vec2{2.0, 2.0}, Vec2{-2.0, -2.0} };
    mix.w = {0.5, 0.5};
    mix.s2 = 0.5; // component variance

    // Langevin parameters
    double eta = 0.02;          // step size (must be small enough for stability)
    size_t burn_in = 2000;      // discard initial samples
    size_t total_iters = 12000; // total iterations

    // RNG
    std::random_device rd;
    std::mt19937_64 rng(rd());

    // Initialize at the origin
    Vec2 x{0.0, 0.0};

    // Collect some samples after burn-in
    vector<Vec2> samples;
    for(size_t t=0; t<total_iters; ++t){
        // Compute score (gradient of log-density)
        Vec2 g = mix.score(x);
        // Gaussian noise with variance 2*eta per coordinate
        Vec2 noise{ randn(rng), randn(rng) };
        // ULA update: x <- x + eta * score + sqrt(2*eta) * noise
        x = x + eta * g + sqrt(2.0*eta) * noise;
        if(t >= burn_in && (t - burn_in) % 10 == 0){ // thin every 10 steps
            samples.push_back(x);
        }
    }

    // Print a few samples
    cout.setf(std::ios::fixed); cout<<setprecision(4);
    for(size_t i=0; i<min<size_t>(samples.size(), 20); ++i){
        cout << samples[i].x << "," << samples[i].y << "\n";
    }
    cerr << "Collected " << samples.size() << " samples after burn-in.\n";
    return 0;
}

Implements ULA for a 2D Gaussian mixture with known score. The update moves toward nearby mixture means while adding Gaussian noise scaled by √(2η). After burn-in and thinning, printed points should cluster around the two modes near (2,2) and (−2,−2).

Time: O(T K d), where T is the number of iterations, K the number of mixture components (here 2), and d = 2. Space: O(d) for the state; O(N d) if storing N samples.
MALA: Metropolis-Adjusted Langevin for exact sampling
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }
double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

typedef vector<Vec2> Vec2List;

struct GaussianMixture2D {
    Vec2List means; vector<double> w; double s2;
    double logp(const Vec2 &x) const {
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        vector<double> logs; logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        double sumexp = 0.0; for(double v: logs) sumexp += exp(v-m);
        return m + log(sumexp);
    }
    Vec2 score(const Vec2 &x) const {
        vector<double> logs; logs.reserve(means.size());
        double log_norm = -0.5 * 2 * log(2*M_PI*s2);
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / s2));
        }
        double m = *max_element(logs.begin(), logs.end());
        double Z = 0.0; vector<double> r(means.size());
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){
            Vec2 term = {means[k].x - x.x, means[k].y - x.y};
            g = g + (r[k]/s2) * term;
        }
        return g;
    }
};

double randn(mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }
double randu(mt19937_64 &rng){ static uniform_real_distribution<double> ud(0.0,1.0); return ud(rng); }

double log_q_gaussian(const Vec2 &to, const Vec2 &from, const Vec2 &score_from, double eta){
    // q(to | from) = N(from + eta * score_from, 2 eta I)
    Vec2 mean = from + eta * score_from;
    Vec2 diff = {to.x - mean.x, to.y - mean.y};
    double quad = dot(diff, diff);
    // log density up to a constant: -||diff||^2 / (4 eta) + C
    return - quad / (4.0 * eta);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    GaussianMixture2D mix;
    mix.means = { Vec2{2.0, 2.0}, Vec2{-2.0, -2.0} };
    mix.w = {0.5, 0.5};
    mix.s2 = 0.5;

    double eta = 0.05; // MALA can often use larger steps thanks to the MH correction; still tune carefully
    size_t iters = 15000, burn_in = 3000;

    mt19937_64 rng(random_device{}());

    Vec2 x{0.0, 0.0};
    size_t accepted = 0, proposed = 0;
    vector<Vec2> samples;

    for(size_t t=0; t<iters; ++t){
        Vec2 g_x = mix.score(x);
        Vec2 noise{ randn(rng), randn(rng) };
        // Propose y
        Vec2 y = x + eta * g_x + sqrt(2.0*eta) * noise;

        // Compute the acceptance log ratio
        double logp_x = mix.logp(x);
        double logp_y = mix.logp(y);
        Vec2 g_y = mix.score(y);
        double log_q_rev = log_q_gaussian(x, y, g_y, eta); // log q(x | y), reverse move
        double log_q_fwd = log_q_gaussian(y, x, g_x, eta); // log q(y | x), forward move
        double log_alpha = (logp_y - logp_x) + (log_q_rev - log_q_fwd);

        double u = randu(rng);
        bool accept = (log(u) < log_alpha);
        ++proposed;
        if(accept){ x = y; ++accepted; }

        if(t >= burn_in && (t-burn_in)%10==0){ samples.push_back(x); }
    }

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    for(size_t i=0;i<min<size_t>(samples.size(),20);++i){ cout<<samples[i].x<<","<<samples[i].y<<"\n"; }
    cerr << "Acceptance rate: " << (100.0*accepted/proposed) << "%\n";
    return 0;
}

Implements MALA on the same Gaussian mixture. The proposal follows a Langevin step and is accepted/rejected with the Metropolis–Hastings rule based on target and proposal densities. This yields exact samples from the target in the limit of long runs.

Time: O(T K d) with a larger constant than ULA (two score and two log p evaluations per iteration). Space: O(d) for the state; O(N d) to store N samples.
Annealed Langevin sampling with an oracle score across noise levels
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
Vec2 operator+(const Vec2 &a, const Vec2 &b){ return {a.x+b.x, a.y+b.y}; }
Vec2 operator-(const Vec2 &a, const Vec2 &b){ return {a.x-b.x, a.y-b.y}; }
Vec2 operator*(double s, const Vec2 &a){ return {s*a.x, s*a.y}; }
Vec2 operator*(const Vec2 &a, double s){ return {s*a.x, s*a.y}; }
double dot(const Vec2 &a, const Vec2 &b){ return a.x*b.x + a.y*b.y; }

// Mixture model used as "data"; we assume an oracle provides the score of the
// noise-convolved density p_sigma
struct Mixture2D {
    vector<Vec2> means; vector<double> w; double s2; // intrinsic variance of components

    // Score of p_sigma(x) = (p * N(0, sigma^2 I))(x) for an isotropic mixture
    // Each component's variance becomes s2 + sigma^2
    Vec2 score_sigma(const Vec2 &x, double sigma) const {
        double v = s2 + sigma*sigma; // inflated variance
        double log_norm = -0.5 * 2 * log(2*M_PI*v);
        vector<double> logs; logs.reserve(means.size());
        for(size_t k=0;k<means.size();++k){
            Vec2 d = {x.x - means[k].x, x.y - means[k].y};
            double quad = dot(d, d);
            logs.push_back(log(w[k]) + log_norm + (-0.5 * quad / v));
        }
        double m = *max_element(logs.begin(), logs.end());
        double Z = 0.0; vector<double> r(means.size());
        for(size_t k=0;k<means.size();++k){ r[k] = exp(logs[k]-m); Z += r[k]; }
        for(size_t k=0;k<means.size();++k){ r[k] /= Z; }
        Vec2 g{0.0, 0.0};
        for(size_t k=0;k<means.size();++k){
            Vec2 term = {means[k].x - x.x, means[k].y - x.y};
            g = g + (r[k] / v) * term;
        }
        return g; // equals grad_x log p_sigma(x)
    }
};

double randn(mt19937_64 &rng){ static normal_distribution<double> nd(0.0,1.0); return nd(rng); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a simple "data" mixture; treat score_sigma as an oracle (e.g., from a trained network)
    Mixture2D data;
    data.means = { {2.0, 0.0}, {-2.0, 0.0}, {0.0, 2.5} };
    data.w = {0.4, 0.4, 0.2};
    data.s2 = 0.3; // intrinsic variance

    // Noise schedule (from large to small)
    vector<double> sigmas; // geometric schedule
    double sigma_max = 2.0, sigma_min = 0.1; int L = 8;
    for(int i=0;i<L;++i){
        double p = (double)i/(L-1);
        double s = sigma_max * pow(sigma_min/sigma_max, p); // geometric interpolation
        sigmas.push_back(s);
    }

    // Steps per level and step-size rule: eta_l = c * sigma_l^2
    int steps_per_level = 100;
    double c = 0.1; // tune for stability/quality

    // RNG and initialization from pure noise
    mt19937_64 rng(random_device{}());
    Vec2 x{ randn(rng)*sigma_max, randn(rng)*sigma_max };

    for(int l=0; l<L; ++l){
        double sigma = sigmas[l];
        double eta = c * sigma * sigma;
        for(int k=0; k<steps_per_level; ++k){
            Vec2 g = data.score_sigma(x, sigma); // oracle score at this noise level
            Vec2 z{ randn(rng), randn(rng) };
            x = x + eta * g + sqrt(2.0*eta) * z;
        }
        // Optional: small printout to track progress
        cerr << "Level " << l+1 << "/" << L << ", sigma=" << sigma
             << ", x=(" << x.x << "," << x.y << ")\n";
    }

    cout.setf(std::ios::fixed); cout<<setprecision(4);
    cout << x.x << "," << x.y << "\n"; // final synthesized sample
    return 0;
}

Demonstrates annealed Langevin sampling using an oracle score ∇_x log p_σ(x) for a noise-convolved target. In practice, the oracle would be a trained score network s_θ(x, σ). The schedule starts with large noise (an easy landscape) and gradually sharpens, guiding x toward realistic samples.

Time: O(L K C_s), where L is the number of noise levels, K the steps per level, and C_s the cost of a score evaluation (here O(K_mix d)). Space: O(d).
#langevin-dynamics #mala #ula #sgld #score-based-sampling #annealed-langevin #denoising-score-matching #fokker-planck #energy-based-models #bayesian-mcmc #diffusion-models #metropolis-hastings #stochastic-differential-equations #gradient-based-mcmc #preconditioning