
Diffusion Models (Score-Based)

Key Points

  • Score-based diffusion models corrupt data by gradually adding Gaussian noise and then learn to reverse this process by estimating the score, the gradient of the log-density.
  • The forward process q(x_t | x_0) has a closed-form Gaussian: x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε where ε ∼ N(0, I).
  • Training commonly uses the epsilon-prediction objective, a mean-squared error between the true noise and the model's predicted noise conditioned on x_t and t.
  • Sampling runs the reverse diffusion using the learned score/noise predictor, stepping from pure noise back to data through many discrete time steps.
  • Careful choice of the noise schedule (β_t), timestep embeddings, and numerical stability tricks strongly affects quality and speed.
  • Score-based methods relate to SDEs/ODEs; reverse-time SDEs use the score to transport noise into data, while probability flow ODEs allow deterministic sampling.
  • In C++, we can simulate the forward process, implement annealed Langevin dynamics with analytic scores, and even train a tiny epsilon-predictor for 2D toy data.
  • Common pitfalls include wrong time scaling, mismatched variance in reverse updates, forgotten data normalization, and unstable discretization.

Prerequisites

  • Gaussian distributions and multivariate normal — Forward noising and many formulas assume properties of Gaussian random variables.
  • Gradients and vector calculus — The score is defined as a gradient of the log-density and appears in SDEs and reverse updates.
  • Stochastic differential equations (basics) — Continuous-time score-based models use forward/reverse SDEs and related solvers.
  • Numerical integration and discretization — Reverse diffusion requires stable discrete updates (Euler–Maruyama, ODE solvers).
  • Linear algebra and matrix calculus — Model parameterization, feature mappings, and variance computations rely on linear algebra.
  • Optimization and SGD — Training minimizes mean squared error between predicted and true noise via gradient-based updates.
  • Probability density functions and Bayes' rule — Posterior mean/variance in reverse steps and responsibilities in mixtures use Bayes' rule.

Detailed Explanation


01 Overview

Score-based diffusion models are a class of generative models that synthesize data (images, audio, point clouds) by reversing a progressive noising process. The idea is simple: take a clean sample x_0 from the data distribution and iteratively add small amounts of Gaussian noise, producing a sequence x_1, x_2, …, x_T that becomes nearly pure noise. This forward process has a tractable Gaussian form that we can write and sample from exactly. The magic happens in reverse: we train a neural network to approximate the score, the gradient of the log probability density of the noisy data at various noise levels. With this score, we numerically integrate the reverse stochastic differential equation (SDE) or use a discrete-time update to transport noise back into clean samples. These models are robust and stable to train (no adversarial game like GANs), and they reach state-of-the-art quality in many domains. Practical implementations rely on a variance schedule that spreads noise across time steps, a time-conditioning mechanism so the model knows the current noise level, and a loss that is easy to compute—typically predicting the added Gaussian noise (epsilon) and minimizing mean squared error. Sampling quality trades off with speed: more steps usually means better fidelity but longer runtime. Recent improvements (e.g., improved SDE solvers, distillation, consistency models) reduce the number of steps while preserving quality.

02 Intuition & Analogies

Imagine taking a sharp photograph and repeatedly overlaying a faint layer of fog. After many layers, everything looks like gray noise, and the original picture is hidden. That’s the forward diffusion: easy to do, just keep adding a tiny bit of Gaussian fog at each step. Now imagine trying to “un-fog” the image. If you simply subtract a fixed amount of fog, you might sharpen some areas but over-correct others. What you really need is a direction field telling you, for each pixel arrangement, how to move slightly so that the image becomes a bit more like a real photograph and less like random fog. That direction field is the score—the gradient of the log-density—pointing toward regions where real data are more likely. Another analogy: think of a mountain landscape hidden in morning mist. The forward process is like thickening the mist until visibility drops to almost zero; you can do this blindly. The reverse process needs a guide. The score is like tiny arrows everywhere in space indicating which way the terrain slopes upward toward the ridgelines where true data live. If you follow these arrows carefully—taking small steps and adding a little random jiggle to match the physics of diffusion—you can walk from the foggy plains (pure noise) back up into the mountain ridges (clean samples). Different times t correspond to different mist densities; the arrows change with t, so your guide must know the time. A learned model takes a noisy input and the current t, and returns the arrow (or equivalently the noise direction) to follow for the next small step out of the fog.

03 Formal Definition

Let x_0 ∼ p_data be a data sample in ℝ^d. The forward (diffusion) process is a Markov chain that adds Gaussian noise according to a variance schedule {β_t}_{t=1}^T. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. Then the marginal noising has the closed form q(x_t | x_0) = N(√(ᾱ_t) x_0, (1 − ᾱ_t) I). A common training parameterization predicts the noise ε in x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, with loss E_{x_0, t, ε}[‖ε − ε_θ(x_t, t)‖²]. In continuous time, score-based diffusion uses SDEs. A general diffusion SDE can be written dx = f(x, t) dt + g(t) dw_t, with g(t) controlling the noise intensity; the variance-preserving (VP) case takes f(x, t) = −½ β(t) x and g(t) = √β(t). The reverse-time SDE that produces samples is dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄_t, where ∇_x log p_t(x) is the score of the noisy marginal at time t, approximated by a neural network s_θ(x, t). The probability flow ODE dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt yields deterministic sampling. In discrete-time DDPMs, the reverse step has closed form: x_{t−1} is Gaussian with mean depending on x_t and the predicted noise, and variance set by the schedule. Tweedie's formula relates the predictor to an estimate of x_0: x̂_0 = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t)) / √(ᾱ_t).

04 When to Use

Use diffusion models when you need high-fidelity, diverse generative samples and you can afford iterative sampling. They shine in image, audio, and 3D generative tasks, super-resolution, inpainting, and conditional synthesis (text-to-image, class-conditional) by adding conditioning signals. If training stability is a concern (e.g., GANs collapse), diffusion models are attractive because the loss is a simple regression and the forward process is fixed and tractable. They also integrate naturally with SDE/ODE solvers and can leverage advances in numerical integration and control variates. Choose VP/VE/sub-VP variants based on data scaling and desired properties: VP keeps variance near constant (suitable for bounded, normalized data), while VE diffuses variance to infinity and uses annealed Langevin dynamics; sub-VP offers improved likelihood training. When compute or latency is limited, consider reduced step samplers (e.g., DDIM, higher-order solvers) or distillation to fewer steps. For small, low-dimensional problems (e.g., 2D toy data), simple feature models can learn usable scores, making C++ from-scratch demos feasible without deep libraries.

⚠️Common Mistakes

  • Mismatched variance in reverse updates: the posterior variance β̃_t differs from β_t; using the wrong σ_t inflates or destroys samples. Always use β̃_t = ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) β_t in DDPM sampling unless intentionally using an alternative schedule.
  • Ignoring data normalization: training on unnormalized data (e.g., images not scaled to [−1, 1]) shifts the effective noise scale and breaks the time embedding's meaning. Normalize consistently and reverse it for output.
  • Poor timestep conditioning: feeding the raw integer t without an embedding often hurts learning. Use sinusoidal or learned embeddings so the model understands continuous noise levels.
  • Inadequate noise schedule: overly small early β_t produce slow mixing; overly large late β_t cause numerical instability. Start with standard linear or cosine schedules, then tune.
  • Too few sampling steps: aggressive step reduction without a better solver (e.g., DDIM, higher-order ODE methods) leads to artifacts. Balance speed against quality or use solvers designed for few steps.
  • Overfitting the score: tiny models on trivial datasets may memorize. Monitor validation loss on held-out data and visualize samples at various t.
  • Confusing score vs. epsilon prediction: they are equivalent up to scale factors tied to ᾱ_t; ensure the conversion is correct when mixing formulas or code paths.

Key Formulas

Forward marginal

q(x_t | x_0) = N(√(ᾱ_t) x_0, (1 − ᾱ_t) I)

Explanation: Given the product of alphas up to time t, the noised variable x_t is a Gaussian around a scaled version of x_0 with variance matching the accumulated noise. This provides a closed-form way to sample x_t at any t without simulating all earlier steps.

Alpha definitions

α_t = 1 − β_t,  ᾱ_t = ∏_{s=1}^{t} α_s

Explanation: Alphas are complements of betas (per-step noise). The product ᾱ_t determines how much signal from x_0 remains after t steps.

Closed-form noising

x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε,  ε ∼ N(0, I)

Explanation: This reparameterization directly constructs x_t from x_0 and Gaussian noise ε. It is used for efficient training and computing losses.

Epsilon-prediction loss

L_ε = E_{x_0, t, ε} ‖ε − ε_θ(x_t, t)‖₂²

Explanation: The model predicts the noise added at time t. Minimizing mean squared error is simple and stable, and it implicitly trains a score network.

Score–epsilon relation (VP)

s_θ(x_t, t) = ∇_x log q_t(x_t) = −ε_θ(x_t, t) / √(1 − ᾱ_t)

Explanation: For VP diffusions, the score and the epsilon predictor are related by a scale factor depending on ᾱ_t. This allows converting between the two parameterizations.

Tweedie’s formula

x̂_0(x_t, t) = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t)) / √(ᾱ_t)

Explanation: An estimate of the clean sample from a noisy input and the predicted noise. Useful in reverse updates and diagnostics.

Posterior variance

β̃_t = ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) β_t

Explanation: The conditional variance in the reverse step x_t → x_{t−1}. Using this instead of β_t improves numerical correctness and sample quality.

DDPM reverse update

x_{t−1} = (1/√(α_t)) (x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t)) + σ_t z,  z ∼ N(0, I)

Explanation: A discrete-time sampling step using the noise predictor. The standard deviation σ_t is typically set to √(β̃_t) or chosen for specific samplers (e.g., DDIM sets it to 0).

Forward SDE (generic)

dx = f(x, t) dt + g(t) dw_t

Explanation: A stochastic differential equation describes the continuous-time diffusion. The functions f and g control drift and diffusion (noise intensity).

Reverse-time SDE

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄_t

Explanation: The reverse dynamics that generate data from noise require the score. Replacing the true score with a learned approximation yields practical samplers.

Probability flow ODE

dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt

Explanation: A deterministic ODE sharing the same marginals as the SDE. This enables sampling without additional randomness and is useful with ODE solvers.

Signal-to-noise ratio

SNR_t = ᾱ_t / (1 − ᾱ_t)

Explanation: Measures how much of the original signal remains at time t relative to noise. It guides schedule design and timestep selection.

Complexity Analysis

Let d be data dimensionality, P the number of model parameters (or feature dimension for simple linear/RFF models), T the number of diffusion steps, B the batch size, N the number of training samples per epoch, and E the number of epochs. Training with epsilon-prediction processes batches of noised samples. Using the closed-form q(x_t | x_0), each batch requires O(B d) to generate x_t and the noise ε. A forward pass through a neural model typically costs O(P) to O(P d) depending on architecture; backprop doubles to triples that. So per step the cost is roughly O(B (d + P)). Over N samples and E epochs, training is O(E N (d + P)). Memory usage is dominated by parameters and activations: O(P + B d) for simple feed-forward networks; with deeper models it becomes O(P + sum of layer activations), often manageable on CPU for toy 2D data but large for images. Sampling runs T reverse steps. Each step applies the model once, O(P) to O(P d), so total sampling cost is O(T (d + P)) per sample (or O(S T (d + P)) for S samples). Reducing T directly speeds sampling but may degrade quality unless better solvers (e.g., DDIM, ODE solvers) are used. The forward simulation alone (to visualize noising) is cheap: O(T d) per trajectory. Annealed Langevin dynamics with K noise levels and M inner steps costs O(K M d) when using an analytic score; replacing it with a learned score adds O(K M (d + P)). Space during sampling is light: O(P + d) per sample. Numerical stability constraints require small steps when T is small or β_t is large; otherwise discretization error increases, effectively amplifying the compute needed to reach a fixed quality target. Overall, diffusion models exchange compute for stability and high sample quality: more steps and larger P yield better fidelity at higher time and memory cost.

Code Examples

Forward diffusion: schedule, closed-form noising, and trajectory sampling
#include <bits/stdc++.h>
using namespace std;

// Utility: generate a linear beta schedule (assumes T >= 2)
vector<double> linear_betas(int T, double beta_start = 1e-4, double beta_end = 0.02) {
    vector<double> betas(T + 1, 0.0); // 1-indexed for convenience
    for (int t = 1; t <= T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (double)(t - 1) / (double)(T - 1);
    }
    return betas;
}

// Precompute alpha_t, alpha_bar_t, and posterior variance tilde_beta_t
struct Schedule {
    int T;
    vector<double> beta, alpha, alpha_bar, tilde_beta;
};

Schedule make_schedule(int T, double beta_start = 1e-4, double beta_end = 0.02) {
    Schedule s; s.T = T;
    s.beta = linear_betas(T, beta_start, beta_end);
    s.alpha.assign(T + 1, 0.0);
    s.alpha_bar.assign(T + 1, 0.0);
    s.tilde_beta.assign(T + 1, 0.0);
    s.alpha[0] = 1.0;
    s.alpha_bar[0] = 1.0;
    for (int t = 1; t <= T; ++t) {
        s.alpha[t] = 1.0 - s.beta[t];
        s.alpha_bar[t] = s.alpha_bar[t - 1] * s.alpha[t];
    }
    for (int t = 1; t <= T; ++t) {
        double ab_t = s.alpha_bar[t];
        double ab_tm1 = (t == 1) ? 1.0 : s.alpha_bar[t - 1];
        s.tilde_beta[t] = (1.0 - ab_tm1) / (1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Sample x_t from q(x_t | x_0) in closed form
vector<double> sample_q_xt_given_x0(const vector<double>& x0, int t, const Schedule& s, mt19937& rng) {
    int d = (int)x0.size();
    normal_distribution<double> nd(0.0, 1.0);
    vector<double> xt(d);
    double mean_scale = sqrt(max(0.0, s.alpha_bar[t]));
    double std_scale = sqrt(max(0.0, 1.0 - s.alpha_bar[t]));
    for (int i = 0; i < d; ++i) {
        double eps = nd(rng);
        xt[i] = mean_scale * x0[i] + std_scale * eps;
    }
    return xt;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: 2D point diffused over T steps
    int T = 1000;
    Schedule sched = make_schedule(T);
    // A single clean point
    vector<double> x0 = {2.0, -1.0};

    // RNG
    random_device rd; mt19937 rng(rd());

    // Sample x_t at arbitrary t
    int t = 500;
    auto xt = sample_q_xt_given_x0(x0, t, sched, rng);
    cout << fixed << setprecision(4);
    cout << "x_t at t=" << t << ": (" << xt[0] << ", " << xt[1] << ")\n";

    // Sample a full trajectory x_0 -> x_T (iterating t=1..T)
    vector<double> x = x0;
    normal_distribution<double> nd(0.0, 1.0);
    for (int step = 1; step <= T; ++step) {
        double a = sqrt(max(0.0, sched.alpha[step]));
        double b = sqrt(max(0.0, 1.0 - sched.alpha[step]));
        // Per-step form: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t
        for (int i = 0; i < (int)x.size(); ++i) {
            x[i] = a * x[i] + b * nd(rng);
        }
    }
    cout << "x_T after iterative forward diffusion: (" << x[0] << ", " << x[1] << ")\n";

    // Demonstrate that closed-form at t=T matches iterative (distributionally)
    auto xT_closed = sample_q_xt_given_x0(x0, T, sched, rng);
    cout << "x_T (closed-form): (" << xT_closed[0] << ", " << xT_closed[1] << ")\n";

    return 0;
}

This program builds a linear beta schedule, precomputes alpha, alpha_bar, and posterior variances, and then samples noised data x_t in two ways: directly from the closed-form q(x_t | x_0) and iteratively by applying per-step noising. It illustrates the equivalence of the marginal and the Markov chain and provides a foundation for training and visualization.

Time: O(T d) to simulate a full trajectory; O(d) to sample a single x_t directly using the closed form. Space: O(T) to store the schedule plus O(d) for vectors.
Annealed Langevin dynamics with analytic score for a 2D Gaussian mixture
#include <bits/stdc++.h>
using namespace std;

struct Gaussian2D {
    array<double, 2> mu;  // mean
    double sigma;         // isotropic std
    double weight;        // mixture weight (weights sum to 1)
};

// Log-density of an isotropic 2D Gaussian N(x; mu, var * I)
static inline double log_gauss(const array<double, 2>& x, const array<double, 2>& mu, double var) {
    double dx = x[0] - mu[0], dy = x[1] - mu[1];
    double quad = (dx * dx + dy * dy) / var;
    double logc = -log(2 * M_PI) - log(var); // normalizer for 2 dimensions
    return logc - 0.5 * quad;
}

// Score of the smoothed mixture p_sigma(x) = sum_k w_k N(x; mu_k, (sigma_k^2 + sigma^2) I)
array<double, 2> score_mixture(const array<double, 2>& x, const vector<Gaussian2D>& comps, double sigma) {
    vector<double> logw; logw.reserve(comps.size());
    for (auto& g : comps) {
        double var = g.sigma * g.sigma + sigma * sigma;
        logw.push_back(log(g.weight) + log_gauss(x, g.mu, var));
    }
    // Log-sum-exp trick for numerically stable responsibilities
    double m = *max_element(logw.begin(), logw.end());
    double denom = 0.0;
    for (double v : logw) denom += exp(v - m);
    vector<double> gamma(logw.size());
    for (size_t k = 0; k < logw.size(); ++k) gamma[k] = exp(logw[k] - m) / denom;

    array<double, 2> s = {0.0, 0.0};
    for (size_t k = 0; k < comps.size(); ++k) {
        double var = comps[k].sigma * comps[k].sigma + sigma * sigma;
        // grad log N wrt x = -(x - mu) / var
        s[0] += gamma[k] * (-(x[0] - comps[k].mu[0]) / var);
        s[1] += gamma[k] * (-(x[1] - comps[k].mu[1]) / var);
    }
    return s;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 2D mixture (two clusters)
    vector<Gaussian2D> mix = {
        {{-2.0, 0.0}, 0.4, 0.5},
        {{+2.0, 0.0}, 0.6, 0.5}
    };

    // Annealed noise levels (geometric from large to small)
    int K = 10;
    vector<double> sigmas(K);
    double sigma_max = 3.0, sigma_min = 0.1;
    for (int i = 0; i < K; ++i) {
        double t = (double)i / (K - 1);
        sigmas[i] = sigma_max * pow(sigma_min / sigma_max, t);
    }

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0, 1.0);

    // Start from standard normal
    int S = 1000; // number of samples to draw
    vector<array<double, 2>> X(S);
    for (int i = 0; i < S; ++i) X[i] = {nd(rng), nd(rng)};

    // ALD parameters
    int M = 50; // steps per sigma (increase for better quality)
    for (double sigma : sigmas) {
        double step = 0.1 * (sigma * sigma); // step size scales with sigma^2
        for (int m = 0; m < M; ++m) {
            for (int i = 0; i < S; ++i) {
                auto s = score_mixture(X[i], mix, sigma);
                // Langevin update: x <- x + eta * score + sqrt(2 eta) * z
                double z1 = nd(rng), z2 = nd(rng);
                X[i][0] += step * s[0] + sqrt(2.0 * step) * z1;
                X[i][1] += step * s[1] + sqrt(2.0 * step) * z2;
            }
        }
    }

    // Print a few samples (they should form two clusters near the means)
    cout << fixed << setprecision(3);
    for (int i = 0; i < 20; ++i) {
        cout << X[i][0] << ", " << X[i][1] << "\n";
    }
    return 0;
}
93

This implements annealed Langevin dynamics (ALD) to sample from a known 2D Gaussian mixture using an analytic score. At each noise level sigma, it computes the score of the Gaussian-smoothed mixture and performs Langevin updates. As sigma decreases, samples concentrate near mixture modes, illustrating score-based generation without training a network.

Time: O(K M S K_mix d), i.e., K noise levels × M inner steps × S samples × the O(K_mix d) cost of the analytic mixture score, where K_mix is the number of mixture components (here d = 2). Space: O(S d + K_mix) to store samples and mixture parameters.
Tiny DDPM-style epsilon predictor with Random Fourier Features (2D toy data) and sampling
#include <bits/stdc++.h>
using namespace std;

// Simple 2D toy dataset: mixture of two Gaussians
struct Mixture2D { vector<array<double, 2>> mu; vector<double> sigma; vector<double> w; };

array<double, 2> sample_mixture(const Mixture2D& mix, mt19937& rng) {
    discrete_distribution<int> pick(mix.w.begin(), mix.w.end());
    int k = pick(rng);
    normal_distribution<double> nd(0.0, 1.0);
    return { mix.mu[k][0] + mix.sigma[k] * nd(rng), mix.mu[k][1] + mix.sigma[k] * nd(rng) };
}

// Schedule: linear betas and derived terms
struct Schedule { int T; vector<double> beta, alpha, alpha_bar, tilde_beta; };

vector<double> linear_betas(int T, double b0 = 1e-4, double b1 = 0.02) {
    vector<double> b(T + 1, 0.0);
    for (int t = 1; t <= T; ++t) b[t] = b0 + (b1 - b0) * (double)(t - 1) / (double)(T - 1);
    return b;
}

Schedule make_schedule(int T) {
    Schedule s; s.T = T;
    s.beta = linear_betas(T);
    s.alpha.assign(T + 1, 0.0); s.alpha_bar.assign(T + 1, 0.0); s.tilde_beta.assign(T + 1, 0.0);
    s.alpha[0] = 1.0; s.alpha_bar[0] = 1.0;
    for (int t = 1; t <= T; ++t) {
        s.alpha[t] = 1.0 - s.beta[t];
        s.alpha_bar[t] = s.alpha_bar[t - 1] * s.alpha[t];
    }
    for (int t = 1; t <= T; ++t) {
        double ab_t = s.alpha_bar[t];
        double ab_tm1 = (t == 1 ? 1.0 : s.alpha_bar[t - 1]);
        s.tilde_beta[t] = (1.0 - ab_tm1) / (1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Random Fourier Features: phi([x, y, tau]) in R^D with cosine features
struct RFF {
    int D;       // feature dim
    int in_dim;  // 3: x, y, tau
    vector<vector<double>> W; // D x in_dim, Gaussian rows
    vector<double> b;         // D phases uniform(0, 2pi)
    RFF(int D_, int in_dim_, mt19937& rng) : D(D_), in_dim(in_dim_) {
        normal_distribution<double> nd(0.0, 1.0);
        uniform_real_distribution<double> ud(0.0, 2.0 * M_PI);
        W.assign(D, vector<double>(in_dim));
        b.assign(D, 0.0);
        for (int i = 0; i < D; ++i) {
            for (int j = 0; j < in_dim; ++j) W[i][j] = nd(rng);
            b[i] = ud(rng);
        }
    }
    vector<double> phi(const array<double, 3>& z) const {
        vector<double> feat(D);
        double scale = sqrt(2.0 / (double)D);
        for (int i = 0; i < D; ++i) {
            double dot = 0.0;
            for (int j = 0; j < in_dim; ++j) dot += W[i][j] * z[j];
            feat[i] = scale * cos(dot + b[i]);
        }
        return feat;
    }
};

// Linear head W_out (2 x D) predicts epsilon in R^2
struct LinearHead {
    int D; vector<vector<double>> W; // 2 x D
    LinearHead(int D_, mt19937& rng) : D(D_) {
        normal_distribution<double> nd(0.0, 0.01);
        W.assign(2, vector<double>(D));
        for (int o = 0; o < 2; ++o)
            for (int i = 0; i < D; ++i) W[o][i] = nd(rng);
    }
    array<double, 2> forward(const vector<double>& phi) const {
        array<double, 2> y = {0.0, 0.0};
        for (int i = 0; i < D; ++i) { y[0] += W[0][i] * phi[i]; y[1] += W[1][i] * phi[i]; }
        return y;
    }
    // SGD update: W <- W - lr * gradW, where gradW = (dL/dy) ⊗ phi for the MSE loss
    void sgd_update(const vector<double>& phi, const array<double, 2>& grad, double lr) {
        for (int i = 0; i < D; ++i) {
            W[0][i] -= lr * grad[0] * phi[i];
            W[1][i] -= lr * grad[1] * phi[i];
        }
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0, 1.0);

    // Dataset: two Gaussians in 2D
    Mixture2D data{ { {-2.0, 0.0}, {+2.0, 0.0} }, {0.4, 0.6}, {0.5, 0.5} };

    // Diffusion schedule
    int T = 100;
    Schedule sched = make_schedule(T);

    // Model: Random Fourier Features + linear head
    int D = 256;
    RFF rff(D, 3, rng);
    LinearHead head(D, rng);

    auto to_tau = [&](int t) { return (double)t / (double)T; }; // normalize time to [0, 1]

    // Training hyperparameters (kept tiny for demo; increase for real learning)
    int steps = 5000; double lr = 1e-2; int batch = 64;

    // Training loop: epsilon-prediction MSE with per-sample SGD updates
    // (each outer iteration applies `batch` individual updates rather than
    // averaging one gradient over a true mini-batch)
    for (int it = 0; it < steps; ++it) {
        for (int b = 0; b < batch; ++b) {
            // Sample clean data
            auto x0 = sample_mixture(data, rng);
            // Pick a random t in [1, T]
            uniform_int_distribution<int> udt(1, T);
            int t = udt(rng);
            double alpha_bar = sched.alpha_bar[t];
            double stdn = sqrt(max(0.0, 1.0 - alpha_bar));
            // Generate noisy x_t
            array<double, 2> eps = { nd(rng), nd(rng) };
            array<double, 2> xt = { sqrt(alpha_bar) * x0[0] + stdn * eps[0],
                                    sqrt(alpha_bar) * x0[1] + stdn * eps[1] };
            // Input z = [x_t.x, x_t.y, tau]
            array<double, 3> z = { xt[0], xt[1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto y = head.forward(phi); // predicted epsilon
            // Gradient of 1/2 ||y - eps||^2 wrt y is (y - eps)
            array<double, 2> gy = { y[0] - eps[0], y[1] - eps[1] };
            head.sgd_update(phi, gy, lr);
        }
        if ((it + 1) % 1000 == 0) cerr << "Step " << (it + 1) << "/" << steps << " done\n";
    }

    // Sampling: DDPM reverse updates using the predicted epsilon
    int S = 200; // number of samples
    vector<array<double, 2>> X(S);
    for (int i = 0; i < S; ++i) X[i] = { nd(rng), nd(rng) }; // start from noise ~ N(0, I)

    for (int t = T; t >= 1; --t) {
        double alpha = sched.alpha[t];
        double alpha_bar = sched.alpha_bar[t];
        double sqrt_alpha = sqrt(alpha);
        double sqrt_one_minus_ab = sqrt(max(0.0, 1.0 - alpha_bar));
        double sigma_t = sqrt(max(1e-20, sched.tilde_beta[t])); // posterior std
        for (int i = 0; i < S; ++i) {
            array<double, 3> z = { X[i][0], X[i][1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto eps_pred = head.forward(phi);
            // DDPM mean
            double coeff = (1.0 - alpha) / max(1e-12, sqrt_one_minus_ab);
            array<double, 2> mean = {
                (X[i][0] - coeff * eps_pred[0]) / max(1e-12, sqrt_alpha),
                (X[i][1] - coeff * eps_pred[1]) / max(1e-12, sqrt_alpha)
            };
            if (t > 1) {
                X[i][0] = mean[0] + sigma_t * nd(rng);
                X[i][1] = mean[1] + sigma_t * nd(rng);
            } else {
                X[i] = mean; // last step: no noise
            }
        }
    }

    // Print a few generated samples (they should cluster near the two modes)
    cout << fixed << setprecision(3);
    for (int i = 0; i < 20; ++i) cout << X[i][0] << ", " << X[i][1] << "\n";
    return 0;
}
145

This program trains a tiny epsilon predictor on a 2D toy dataset using Random Fourier Features (no deep library). The model takes (x_t, t) and predicts the added noise ε with an SGD-updated linear head. After training, it uses the DDPM reverse update with the predicted ε to generate samples from pure noise. Although simplistic, it demonstrates the full pipeline: schedule, noising, ε-prediction training, and reverse sampling.

Time: training is O(steps × batch × D) for the RFF features and linear head; sampling is O(S × T × D). Space: O(D) for features plus O(T) for the schedule and O(S) for samples.
Tags: diffusion models · score-based modeling · ddpm · reverse sde · epsilon prediction · annealed langevin · variance schedule · tweedie formula · probability flow ode · gaussian mixture