
Diffusion Models (Score-Based)

Key Points

  • Score-based diffusion models corrupt data by gradually adding Gaussian noise and then learn to reverse this process by estimating the score, the gradient of the log-density.
  • The forward process q(x_t | x_0) has a closed-form Gaussian: x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε where ε ∼ N(0, I).
  • Training commonly uses the epsilon-prediction objective, a mean-squared error between the true noise and the model's predicted noise conditioned on x_t and t.
  • Sampling runs the reverse diffusion using the learned score/noise predictor, stepping from pure noise back to data through many discrete time steps.
  • Careful choice of the noise schedule (β_t), timestep embeddings, and numerical stability tricks strongly affects quality and speed.
  • Score-based methods relate to SDEs/ODEs; reverse-time SDEs use the score to transport noise into data, while probability flow ODEs allow deterministic sampling.
  • In C++, we can simulate the forward process, implement annealed Langevin dynamics with analytic scores, and even train a tiny epsilon-predictor for 2D toy data.
  • Common pitfalls include wrong time scaling, mismatched variance in reverse updates, forgotten data normalization, and unstable discretization.

Prerequisites

  • Gaussian distributions and multivariate normal — Forward noising and many formulas assume properties of Gaussian random variables.
  • Gradients and vector calculus — The score is defined as a gradient of the log-density and appears in SDEs and reverse updates.
  • Stochastic differential equations (basics) — Continuous-time score-based models use forward/reverse SDEs and related solvers.
  • Numerical integration and discretization — Reverse diffusion requires stable discrete updates (Euler–Maruyama, ODE solvers).
  • Linear algebra and matrix calculus — Model parameterization, feature mappings, and variance computations rely on linear algebra.
  • Optimization and SGD — Training minimizes mean squared error between predicted and true noise via gradient-based updates.
  • Probability density functions and Bayes' rule — Posterior mean/variance in reverse steps and responsibilities in mixtures use Bayes' rule.

Detailed Explanation


01 Overview

Score-based diffusion models are a class of generative models that synthesize data (images, audio, point clouds) by reversing a progressive noising process. The idea is simple: take a clean sample x_0 from the data distribution and iteratively add small amounts of Gaussian noise, producing a sequence x_1, x_2, …, x_T that becomes nearly pure noise. This forward process has a tractable Gaussian form that we can write and sample from exactly. The magic happens in reverse: we train a neural network to approximate the score, the gradient of the log probability density of the noisy data at various noise levels. With this score, we numerically integrate the reverse stochastic differential equation (SDE) or use a discrete-time update to transport noise back into clean samples. These models are robust and stable to train (no adversarial game like GANs), and they reach state-of-the-art quality in many domains. Practical implementations rely on a variance schedule that spreads noise across time steps, a time-conditioning mechanism so the model knows the current noise level, and a loss that is easy to compute—typically predicting the added Gaussian noise (epsilon) and minimizing mean squared error. Sampling quality trades off with speed: more steps usually means better fidelity but longer runtime. Recent improvements (e.g., improved SDE solvers, distillation, consistency models) reduce the number of steps while preserving quality.

02 Intuition & Analogies

Imagine taking a sharp photograph and repeatedly overlaying a faint layer of fog. After many layers, everything looks like gray noise, and the original picture is hidden. That’s the forward diffusion: easy to do, just keep adding a tiny bit of Gaussian fog at each step. Now imagine trying to “un-fog” the image. If you simply subtract a fixed amount of fog, you might sharpen some areas but over-correct others. What you really need is a direction field telling you, for each pixel arrangement, how to move slightly so that the image becomes a bit more like a real photograph and less like random fog. That direction field is the score—the gradient of the log-density—pointing toward regions where real data are more likely. Another analogy: think of a mountain landscape hidden in morning mist. The forward process is like thickening the mist until visibility drops to almost zero; you can do this blindly. The reverse process needs a guide. The score is like tiny arrows everywhere in space indicating which way the terrain slopes upward toward the ridgelines where true data live. If you follow these arrows carefully—taking small steps and adding a little random jiggle to match the physics of diffusion—you can walk from the foggy plains (pure noise) back up into the mountain ridges (clean samples). Different times t correspond to different mist densities; the arrows change with t, so your guide must know the time. A learned model takes a noisy input and the current t, and returns the arrow (or equivalently the noise direction) to follow for the next small step out of the fog.

03 Formal Definition

Let x_0 ∼ p_data be a data sample in ℝ^d. The forward (diffusion) process is a Markov chain that adds Gaussian noise according to a variance schedule {β_t}_{t=1}^T. Define α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. Then the marginal noising has the closed form q(x_t | x_0) = N(√(ᾱ_t) x_0, (1 − ᾱ_t) I). A common training parameterization predicts the noise ε in x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, with loss E_{x_0, t, ε}[‖ε − ε_θ(x_t, t)‖²]. In continuous time, score-based diffusion uses SDEs. A general diffusion SDE can be written dx = f(x, t) dt + g(t) dw_t, with g(t) controlling the noise intensity; the variance-preserving (VP) case takes f(x, t) = −½ β(t) x and g(t) = √β(t). The reverse-time SDE that produces samples is dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄_t, where ∇_x log p_t(x) is the score of the noisy marginal at time t, approximated by a neural network s_θ(x, t). The probability flow ODE dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt yields deterministic sampling. In discrete-time DDPMs, the reverse step has closed form: x_{t−1} is Gaussian with mean depending on x_t and the predicted noise, and variance set by the schedule. Tweedie's formula relates the predictor to an estimate of x_0: x̂_0 = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t)) / √(ᾱ_t).

04 When to Use

Use diffusion models when you need high-fidelity, diverse generative samples and you can afford iterative sampling. They shine in image, audio, and 3D generative tasks, super-resolution, inpainting, and conditional synthesis (text-to-image, class-conditional) by adding conditioning signals. If training stability is a concern (e.g., GANs collapse), diffusion models are attractive because the loss is a simple regression and the forward process is fixed and tractable. They also integrate naturally with SDE/ODE solvers and can leverage advances in numerical integration and control variates. Choose VP/VE/sub-VP variants based on data scaling and desired properties: VP keeps variance near constant (suitable for bounded, normalized data), while VE diffuses variance to infinity and uses annealed Langevin dynamics; sub-VP offers improved likelihood training. When compute or latency is limited, consider reduced step samplers (e.g., DDIM, higher-order solvers) or distillation to fewer steps. For small, low-dimensional problems (e.g., 2D toy data), simple feature models can learn usable scores, making C++ from-scratch demos feasible without deep libraries.

⚠️Common Mistakes

  • Mismatched variance in reverse updates: the posterior variance β̃_t differs from β_t; using the wrong σ_t inflates or destroys samples. Always use β̃_t = ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) β_t in DDPM sampling unless intentionally using an alternative schedule.
  • Ignoring data normalization: training on unnormalized data (e.g., images not scaled to [−1, 1]) shifts the effective noise scale and breaks the time embedding's meaning. Normalize consistently and reverse it for output.
  • Poor timestep conditioning: feeding the raw integer t without an embedding often hurts learning. Use sinusoidal or learned embeddings so the model understands continuous noise levels.
  • Inadequate noise schedule: overly small early β_t produce slow mixing; overly large late β_t cause numerical instability. Start with standard linear or cosine schedules, then tune.
  • Too few sampling steps: aggressive step reduction without a better solver (e.g., DDIM, higher-order ODE methods) leads to artifacts. Balance speed against quality or use solvers designed for few steps.
  • Overfitting the score: tiny models on trivial datasets may memorize. Monitor validation loss on held-out data and visualize samples at various t.
  • Confusing score vs. epsilon prediction: they are equivalent up to scale factors tied to ᾱ_t; ensure the conversion is correct when mixing formulas or code paths.

Key Formulas

Forward marginal

q(x_t | x_0) = N(√(ᾱ_t) x_0, (1 − ᾱ_t) I)

Explanation: Given the product of alphas up to time t, the noised variable x_t is a Gaussian around a scaled version of x_0 with variance matching the accumulated noise. This provides a closed-form way to sample x_t at any t without simulating all earlier steps.

Alpha definitions

α_t = 1 − β_t,  ᾱ_t = ∏_{s=1}^{t} α_s

Explanation: Alphas are complements of betas (per-step noise). The product ᾱ_t determines how much signal from x_0 remains after t steps.

Closed-form noising

x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε,  ε ∼ N(0, I)

Explanation: This reparameterization directly constructs x_t from x_0 and Gaussian noise ε. It is used for efficient training and computing losses.

Epsilon-prediction loss

L_ε = E_{x_0, t, ε} ‖ε − ε_θ(x_t, t)‖₂²

Explanation: The model predicts the noise added at time t. Minimizing mean squared error is simple and stable, and it implicitly trains a score network.

Score–epsilon relation (VP)

s_θ(x_t, t) = ∇_x log q_t(x_t) = −ε_θ(x_t, t) / √(1 − ᾱ_t)

Explanation: For VP diffusions, the score and the epsilon predictor are related by a scale factor depending on ᾱ_t. This allows converting between the two parameterizations.

Tweedie’s formula

x̂_0(x_t, t) = (x_t − √(1 − ᾱ_t) ε_θ(x_t, t)) / √(ᾱ_t)

Explanation: An estimate of the clean sample from a noisy input and the predicted noise. Useful in reverse updates and diagnostics.

Posterior variance

β̃_t = ((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) β_t

Explanation: The conditional variance in the reverse step x_t → x_{t−1}. Using this instead of β_t improves numerical correctness and sample quality.

DDPM reverse update

x_{t−1} = (1/√(α_t)) (x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t)) + σ_t z,  z ∼ N(0, I)

Explanation: A discrete-time sampling step using the noise predictor. The standard deviation σ_t is typically set to √(β̃_t) or chosen for specific samplers (e.g., DDIM sets it to 0).

Forward SDE (generic)

dx = f(x, t) dt + g(t) dw_t

Explanation: A stochastic differential equation describes the continuous-time diffusion. The functions f and g control drift and diffusion (noise intensity).

Reverse-time SDE

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄_t

Explanation: The reverse dynamics that generate data from noise require the score. Replacing the true score with a learned approximation yields practical samplers.

Probability flow ODE

dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt

Explanation: A deterministic ODE sharing the same marginals as the SDE. This enables sampling without additional randomness and is useful with ODE solvers.

Signal-to-noise ratio

SNR_t = ᾱ_t / (1 − ᾱ_t)

Explanation: Measures how much of the original signal remains at time t relative to noise. It guides schedule design and timestep selection.

Complexity Analysis

Let d be data dimensionality, P the number of model parameters (or feature dimension for simple linear/RFF models), T the number of diffusion steps, B the batch size, N the number of training samples per epoch, and E the number of epochs. Training with epsilon-prediction processes batches of noised samples. Using the closed-form q(x_t | x_0), each batch requires O(B d) to generate x_t and the noise ε. A forward pass through a neural model typically costs O(P) to O(P d) depending on architecture; backprop doubles to triples that. So per step the cost is roughly O(B (d + P)). Over N samples and E epochs, training is O(E N (d + P)). Memory usage is dominated by parameters and activations: O(P + B d) for simple feed-forward networks; with deeper models it becomes O(P + sum of layer activations), often manageable on CPU for toy 2D data but large for images. Sampling runs T reverse steps. Each step applies the model once, O(P) to O(P d), so total sampling cost is O(T (d + P)) per sample (or O(S T (d + P)) for S samples). Reducing T directly speeds sampling but may degrade quality unless better solvers (e.g., DDIM, ODE solvers) are used. The forward simulation alone (to visualize noising) is cheap: O(T d) per trajectory. Annealed Langevin dynamics with K noise levels and M inner steps costs O(K M d) when using an analytic score; replacing it with a learned score adds O(K M (d + P)). Space during sampling is light: O(P + d) per sample. Numerical stability constraints require small steps when T is small or β_t is large; otherwise discretization error increases, effectively amplifying the compute needed to reach a fixed quality target. Overall, diffusion models exchange compute for stability and high sample quality: more steps and larger P yield better fidelity at higher time and memory cost.

Code Examples

Forward diffusion: schedule, closed-form noising, and trajectory sampling
#include <bits/stdc++.h>
using namespace std;

// Utility: generate a linear beta schedule (assumes T >= 2)
vector<double> linear_betas(int T, double beta_start = 1e-4, double beta_end = 0.02) {
    vector<double> betas(T + 1, 0.0); // 1-indexed for convenience
    for (int t = 1; t <= T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (double)(t - 1) / (double)(T - 1);
    }
    return betas;
}

// Precompute alpha_t, alpha_bar_t, and posterior variance tilde_beta_t
struct Schedule {
    int T;
    vector<double> beta, alpha, alpha_bar, tilde_beta;
};

Schedule make_schedule(int T, double beta_start = 1e-4, double beta_end = 0.02) {
    Schedule s; s.T = T;
    s.beta = linear_betas(T, beta_start, beta_end);
    s.alpha.assign(T + 1, 0.0);
    s.alpha_bar.assign(T + 1, 0.0);
    s.tilde_beta.assign(T + 1, 0.0);
    s.alpha[0] = 1.0;
    s.alpha_bar[0] = 1.0;
    for (int t = 1; t <= T; ++t) {
        s.alpha[t] = 1.0 - s.beta[t];
        s.alpha_bar[t] = s.alpha_bar[t - 1] * s.alpha[t];
    }
    for (int t = 1; t <= T; ++t) {
        double ab_t = s.alpha_bar[t];
        double ab_tm1 = (t == 1) ? 1.0 : s.alpha_bar[t - 1];
        s.tilde_beta[t] = (1.0 - ab_tm1) / (1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Sample x_t from q(x_t | x_0) in closed form
vector<double> sample_q_xt_given_x0(const vector<double>& x0, int t, const Schedule& s, mt19937& rng) {
    int d = (int)x0.size();
    normal_distribution<double> nd(0.0, 1.0);
    vector<double> xt(d);
    double mean_scale = sqrt(max(0.0, s.alpha_bar[t]));
    double std_scale = sqrt(max(0.0, 1.0 - s.alpha_bar[t]));
    for (int i = 0; i < d; ++i) {
        double eps = nd(rng);
        xt[i] = mean_scale * x0[i] + std_scale * eps;
    }
    return xt;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: 2D point diffused over T steps
    int T = 1000;
    Schedule sched = make_schedule(T);
    // A single clean point
    vector<double> x0 = {2.0, -1.0};

    // RNG
    random_device rd; mt19937 rng(rd());

    // Sample x_t at arbitrary t
    int t = 500;
    auto xt = sample_q_xt_given_x0(x0, t, sched, rng);
    cout << fixed << setprecision(4);
    cout << "x_t at t=" << t << ": (" << xt[0] << ", " << xt[1] << ")\n";

    // Sample a full trajectory x_0 -> x_T (iterating t=1..T)
    vector<double> x = x0;
    normal_distribution<double> nd(0.0, 1.0);
    for (int step = 1; step <= T; ++step) {
        double a = sqrt(max(0.0, sched.alpha[step]));
        double b = sqrt(max(0.0, 1.0 - sched.alpha[step]));
        // Per-step form: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t
        for (int i = 0; i < (int)x.size(); ++i) {
            x[i] = a * x[i] + b * nd(rng);
        }
    }
    cout << "x_T after iterative forward diffusion: (" << x[0] << ", " << x[1] << ")\n";

    // Demonstrate that closed-form at t=T matches iterative (distributionally)
    auto xT_closed = sample_q_xt_given_x0(x0, T, sched, rng);
    cout << "x_T (closed-form): (" << xT_closed[0] << ", " << xT_closed[1] << ")\n";

    return 0;
}

This program builds a linear beta schedule, precomputes alpha, alpha_bar, and posterior variances, and then samples noised data x_t in two ways: directly from the closed-form q(x_t | x_0) and iteratively by applying per-step noising. It illustrates the equivalence of the marginal and the Markov chain and provides a foundation for training and visualization.

Time: O(T d) to simulate a full trajectory; O(d) to sample a single x_t directly using the closed form. Space: O(T) to store the schedule plus O(d) for vectors.
Annealed Langevin dynamics with analytic score for a 2D Gaussian mixture
#include <bits/stdc++.h>
using namespace std;

struct Gaussian2D {
    array<double, 2> mu;  // mean
    double sigma;         // isotropic std
    double weight;        // mixture weight (weights sum to 1)
};

// Log-density of an isotropic 2D Gaussian N(x; mu, var * I)
static inline double log_gauss(const array<double, 2>& x, const array<double, 2>& mu, double var) {
    double dx = x[0] - mu[0], dy = x[1] - mu[1];
    double quad = (dx * dx + dy * dy) / var;
    double logc = -log(2 * M_PI) - log(var); // normalizer for 2 dimensions
    return logc - 0.5 * quad;
}

// Score of the smoothed mixture p_sigma(x) = sum_k w_k N(x; mu_k, (sigma_k^2 + sigma^2) I)
array<double, 2> score_mixture(const array<double, 2>& x, const vector<Gaussian2D>& comps, double sigma) {
    vector<double> logw; logw.reserve(comps.size());
    for (auto& g : comps) {
        double var = g.sigma * g.sigma + sigma * sigma;
        logw.push_back(log(g.weight) + log_gauss(x, g.mu, var));
    }
    // Log-sum-exp trick for numerically stable responsibilities
    double m = *max_element(logw.begin(), logw.end());
    double denom = 0.0;
    for (double v : logw) denom += exp(v - m);
    vector<double> gamma(logw.size());
    for (size_t k = 0; k < logw.size(); ++k) gamma[k] = exp(logw[k] - m) / denom;

    array<double, 2> s = {0.0, 0.0};
    for (size_t k = 0; k < comps.size(); ++k) {
        double var = comps[k].sigma * comps[k].sigma + sigma * sigma;
        // grad log N wrt x = -(x - mu) / var
        s[0] += gamma[k] * (-(x[0] - comps[k].mu[0]) / var);
        s[1] += gamma[k] * (-(x[1] - comps[k].mu[1]) / var);
    }
    return s;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 2D mixture (two clusters)
    vector<Gaussian2D> mix = {
        {{-2.0, 0.0}, 0.4, 0.5},
        {{+2.0, 0.0}, 0.6, 0.5}
    };

    // Annealed noise levels (geometric from large to small)
    int K = 10;
    vector<double> sigmas(K);
    double sigma_max = 3.0, sigma_min = 0.1;
    for (int i = 0; i < K; ++i) {
        double t = (double)i / (K - 1);
        sigmas[i] = sigma_max * pow(sigma_min / sigma_max, t);
    }

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0, 1.0);

    // Start from standard normal
    int S = 1000; // number of samples to draw
    vector<array<double, 2>> X(S);
    for (int i = 0; i < S; ++i) X[i] = {nd(rng), nd(rng)};

    // ALD parameters
    int M = 50; // steps per sigma (increase for better quality)
    for (double sigma : sigmas) {
        double step = 0.1 * (sigma * sigma); // step size scales with sigma^2
        for (int m = 0; m < M; ++m) {
            for (int i = 0; i < S; ++i) {
                auto s = score_mixture(X[i], mix, sigma);
                // Langevin update: x <- x + eta * score + sqrt(2 eta) * z
                double z1 = nd(rng), z2 = nd(rng);
                X[i][0] += step * s[0] + sqrt(2.0 * step) * z1;
                X[i][1] += step * s[1] + sqrt(2.0 * step) * z2;
            }
        }
    }

    // Print a few samples (they should form two clusters near the means)
    cout << fixed << setprecision(3);
    for (int i = 0; i < 20; ++i) {
        cout << X[i][0] << ", " << X[i][1] << "\n";
    }
    return 0;
}
93

This implements annealed Langevin dynamics (ALD) to sample from a known 2D Gaussian mixture using an analytic score. At each noise level sigma, it computes the score of the Gaussian-smoothed mixture and performs Langevin updates. As sigma decreases, samples concentrate near mixture modes, illustrating score-based generation without training a network.

Time: O(K M S K_mix d), i.e., K noise levels × M inner steps × S samples × the O(K_mix d) cost of the analytic mixture score, where K_mix is the number of mixture components (here d = 2). Space: O(S d + K_mix) to store samples and mixture parameters.
Tiny DDPM-style epsilon predictor with Random Fourier Features (2D toy data) and sampling
#include <bits/stdc++.h>
using namespace std;

// Simple 2D toy dataset: mixture of two Gaussians
struct Mixture2D { vector<array<double, 2>> mu; vector<double> sigma; vector<double> w; };

array<double, 2> sample_mixture(const Mixture2D& mix, mt19937& rng) {
    discrete_distribution<int> pick(mix.w.begin(), mix.w.end());
    int k = pick(rng);
    normal_distribution<double> nd(0.0, 1.0);
    return { mix.mu[k][0] + mix.sigma[k] * nd(rng), mix.mu[k][1] + mix.sigma[k] * nd(rng) };
}

// Schedule: linear betas and derived terms
struct Schedule { int T; vector<double> beta, alpha, alpha_bar, tilde_beta; };

vector<double> linear_betas(int T, double b0 = 1e-4, double b1 = 0.02) {
    vector<double> b(T + 1, 0.0);
    for (int t = 1; t <= T; ++t) b[t] = b0 + (b1 - b0) * (double)(t - 1) / (double)(T - 1);
    return b;
}

Schedule make_schedule(int T) {
    Schedule s; s.T = T;
    s.beta = linear_betas(T);
    s.alpha.assign(T + 1, 0.0); s.alpha_bar.assign(T + 1, 0.0); s.tilde_beta.assign(T + 1, 0.0);
    s.alpha[0] = 1.0; s.alpha_bar[0] = 1.0;
    for (int t = 1; t <= T; ++t) {
        s.alpha[t] = 1.0 - s.beta[t];
        s.alpha_bar[t] = s.alpha_bar[t - 1] * s.alpha[t];
    }
    for (int t = 1; t <= T; ++t) {
        double ab_t = s.alpha_bar[t];
        double ab_tm1 = (t == 1 ? 1.0 : s.alpha_bar[t - 1]);
        s.tilde_beta[t] = (1.0 - ab_tm1) / (1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Random Fourier Features: phi([x, y, tau]) in R^D with cosine features
struct RFF {
    int D;       // feature dim
    int in_dim;  // 3: x, y, tau
    vector<vector<double>> W; // D x in_dim, Gaussian rows
    vector<double> b;         // D phases uniform(0, 2pi)
    RFF(int D_, int in_dim_, mt19937& rng) : D(D_), in_dim(in_dim_) {
        normal_distribution<double> nd(0.0, 1.0);
        uniform_real_distribution<double> ud(0.0, 2.0 * M_PI);
        W.assign(D, vector<double>(in_dim));
        b.assign(D, 0.0);
        for (int i = 0; i < D; ++i) {
            for (int j = 0; j < in_dim; ++j) W[i][j] = nd(rng);
            b[i] = ud(rng);
        }
    }
    vector<double> phi(const array<double, 3>& z) const {
        vector<double> feat(D);
        double scale = sqrt(2.0 / (double)D);
        for (int i = 0; i < D; ++i) {
            double dot = 0.0;
            for (int j = 0; j < in_dim; ++j) dot += W[i][j] * z[j];
            feat[i] = scale * cos(dot + b[i]);
        }
        return feat;
    }
};

// Linear head W_out (2 x D) predicts epsilon in R^2
struct LinearHead {
    int D; vector<vector<double>> W; // 2 x D
    LinearHead(int D_, mt19937& rng) : D(D_) {
        normal_distribution<double> nd(0.0, 0.01);
        W.assign(2, vector<double>(D));
        for (int o = 0; o < 2; ++o)
            for (int i = 0; i < D; ++i) W[o][i] = nd(rng);
    }
    array<double, 2> forward(const vector<double>& phi) const {
        array<double, 2> y = {0.0, 0.0};
        for (int i = 0; i < D; ++i) { y[0] += W[0][i] * phi[i]; y[1] += W[1][i] * phi[i]; }
        return y;
    }
    // SGD update: W <- W - lr * gradW, where gradW = (dL/dy) ⊗ phi for the MSE loss
    void sgd_update(const vector<double>& phi, const array<double, 2>& grad, double lr) {
        for (int i = 0; i < D; ++i) {
            W[0][i] -= lr * grad[0] * phi[i];
            W[1][i] -= lr * grad[1] * phi[i];
        }
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0, 1.0);

    // Dataset: two Gaussians in 2D
    Mixture2D data{ { {-2.0, 0.0}, {+2.0, 0.0} }, {0.4, 0.6}, {0.5, 0.5} };

    // Diffusion schedule
    int T = 100;
    Schedule sched = make_schedule(T);

    // Model: Random Fourier Features + linear head
    int D = 256;
    RFF rff(D, 3, rng);
    LinearHead head(D, rng);

    auto to_tau = [&](int t) { return (double)t / (double)T; }; // normalize time to [0, 1]

    // Training hyperparameters (kept tiny for demo; increase for real learning)
    int steps = 5000; double lr = 1e-2; int batch = 64;

    // Training loop: epsilon-prediction MSE with per-sample SGD updates
    // (each outer iteration applies `batch` individual updates rather than
    // averaging one gradient over a true mini-batch)
    for (int it = 0; it < steps; ++it) {
        for (int b = 0; b < batch; ++b) {
            // Sample clean data
            auto x0 = sample_mixture(data, rng);
            // Pick a random t in [1, T]
            uniform_int_distribution<int> udt(1, T);
            int t = udt(rng);
            double alpha_bar = sched.alpha_bar[t];
            double stdn = sqrt(max(0.0, 1.0 - alpha_bar));
            // Generate noisy x_t
            array<double, 2> eps = { nd(rng), nd(rng) };
            array<double, 2> xt = { sqrt(alpha_bar) * x0[0] + stdn * eps[0],
                                    sqrt(alpha_bar) * x0[1] + stdn * eps[1] };
            // Input z = [x_t.x, x_t.y, tau]
            array<double, 3> z = { xt[0], xt[1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto y = head.forward(phi); // predicted epsilon
            // Gradient of 1/2 ||y - eps||^2 wrt y is (y - eps)
            array<double, 2> gy = { y[0] - eps[0], y[1] - eps[1] };
            head.sgd_update(phi, gy, lr);
        }
        if ((it + 1) % 1000 == 0) cerr << "Step " << (it + 1) << "/" << steps << " done\n";
    }

    // Sampling: DDPM reverse updates using the predicted epsilon
    int S = 200; // number of samples
    vector<array<double, 2>> X(S);
    for (int i = 0; i < S; ++i) X[i] = { nd(rng), nd(rng) }; // start from noise ~ N(0, I)

    for (int t = T; t >= 1; --t) {
        double alpha = sched.alpha[t];
        double alpha_bar = sched.alpha_bar[t];
        double sqrt_alpha = sqrt(alpha);
        double sqrt_one_minus_ab = sqrt(max(0.0, 1.0 - alpha_bar));
        double sigma_t = sqrt(max(1e-20, sched.tilde_beta[t])); // posterior std
        for (int i = 0; i < S; ++i) {
            array<double, 3> z = { X[i][0], X[i][1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto eps_pred = head.forward(phi);
            // DDPM mean
            double coeff = (1.0 - alpha) / max(1e-12, sqrt_one_minus_ab);
            array<double, 2> mean = {
                (X[i][0] - coeff * eps_pred[0]) / max(1e-12, sqrt_alpha),
                (X[i][1] - coeff * eps_pred[1]) / max(1e-12, sqrt_alpha)
            };
            if (t > 1) {
                X[i][0] = mean[0] + sigma_t * nd(rng);
                X[i][1] = mean[1] + sigma_t * nd(rng);
            } else {
                X[i] = mean; // last step: no noise
            }
        }
    }

    // Print a few generated samples (they should cluster near the two modes)
    cout << fixed << setprecision(3);
    for (int i = 0; i < 20; ++i) cout << X[i][0] << ", " << X[i][1] << "\n";
    return 0;
}
145

This program trains a tiny epsilon predictor on a 2D toy dataset using Random Fourier Features (no deep library). The model takes (x_t, t) and predicts the added noise ε with an SGD-updated linear head. After training, it uses the DDPM reverse update with the predicted ε to generate samples from pure noise. Although simplistic, it demonstrates the full pipeline: schedule, noising, ε-prediction training, and reverse sampling.

Time: training is O(steps × batch × D) for the RFF features and linear head; sampling is O(S × T × D). Space: O(D) for features plus O(T) for the schedule and O(S) for samples.
Tags: diffusion models · score-based modeling · ddpm · reverse sde · epsilon prediction · annealed langevin · variance schedule · tweedie formula · probability flow ode · gaussian mixture