Diffusion Models (Score-Based)
Key Points
- Score-based diffusion models corrupt data by gradually adding Gaussian noise and then learn to reverse this process by estimating the score, the gradient of the log-density.
- The forward process q(x_t | x_0) has a closed-form Gaussian: x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon where \epsilon \sim \mathcal{N}(0, I).
- Training commonly uses the epsilon-prediction objective, a mean-squared error between the true noise and a model's predicted noise conditioned on x_t and t.
- Sampling runs the reverse diffusion using the learned score/noise predictor, stepping from pure noise back to data through many discrete time steps.
- Careful choice of the noise schedule (\beta_t), timestep embeddings, and numerical stability tricks strongly affects quality and speed.
- Score-based methods relate to SDEs/ODEs; reverse-time SDEs use the score to transport noise into data, while probability flow ODEs allow deterministic sampling.
- In C++, we can simulate the forward process, implement annealed Langevin dynamics with analytic scores, and even train a tiny epsilon-predictor for 2D toy data.
- Common pitfalls include wrong time scaling, mismatched variance in reverse updates, forgotten data normalization, and unstable discretization.
Prerequisites
- Gaussian distributions and multivariate normal — Forward noising and many formulas assume properties of Gaussian random variables.
- Gradients and vector calculus — The score is defined as a gradient of the log-density and appears in SDEs and reverse updates.
- Stochastic differential equations (basics) — Continuous-time score-based models use forward/reverse SDEs and related solvers.
- Numerical integration and discretization — Reverse diffusion requires stable discrete updates (Euler–Maruyama, ODE solvers).
- Linear algebra and matrix calculus — Model parameterization, feature mappings, and variance computations rely on linear algebra.
- Optimization and SGD — Training minimizes mean squared error between predicted and true noise via gradient-based updates.
- Probability density functions and Bayes’ rule — Posterior mean/variance in reverse steps and responsibilities in mixtures use Bayes’ rule.
Detailed Explanation
01 Overview
Score-based diffusion models are a class of generative models that synthesize data (images, audio, point clouds) by reversing a progressive noising process. The idea is simple: take a clean sample x_0 from the data distribution and iteratively add small amounts of Gaussian noise, producing a sequence x_1, x_2, …, x_T that becomes nearly pure noise. This forward process has a tractable Gaussian form that we can write and sample from exactly. The magic happens in reverse: we train a neural network to approximate the score, the gradient of the log probability density of the noisy data at various noise levels. With this score, we numerically integrate the reverse stochastic differential equation (SDE) or use a discrete-time update to transport noise back into clean samples. These models are robust and stable to train (no adversarial game like GANs), and they reach state-of-the-art quality in many domains. Practical implementations rely on a variance schedule that spreads noise across time steps, a time-conditioning mechanism so the model knows the current noise level, and a loss that is easy to compute—typically predicting the added Gaussian noise (epsilon) and minimizing mean squared error. Sampling quality trades off with speed: more steps usually means better fidelity but longer runtime. Recent improvements (e.g., improved SDE solvers, distillation, consistency models) reduce the number of steps while preserving quality.
02 Intuition & Analogies
Imagine taking a sharp photograph and repeatedly overlaying a faint layer of fog. After many layers, everything looks like gray noise, and the original picture is hidden. That’s the forward diffusion: easy to do, just keep adding a tiny bit of Gaussian fog at each step. Now imagine trying to “un-fog” the image. If you simply subtract a fixed amount of fog, you might sharpen some areas but over-correct others. What you really need is a direction field telling you, for each pixel arrangement, how to move slightly so that the image becomes a bit more like a real photograph and less like random fog. That direction field is the score—the gradient of the log-density—pointing toward regions where real data are more likely. Another analogy: think of a mountain landscape hidden in morning mist. The forward process is like thickening the mist until visibility drops to almost zero; you can do this blindly. The reverse process needs a guide. The score is like tiny arrows everywhere in space indicating which way the terrain slopes upward toward the ridgelines where true data live. If you follow these arrows carefully—taking small steps and adding a little random jiggle to match the physics of diffusion—you can walk from the foggy plains (pure noise) back up into the mountain ridges (clean samples). Different times t correspond to different mist densities; the arrows change with t, so your guide must know the time. A learned model takes a noisy input and the current t, and returns the arrow (or equivalently the noise direction) to follow for the next small step out of the fog.
03 Formal Definition
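A standard formal statement of the discrete-time (DDPM-style) setup, consistent with the Key Formulas section below, can be sketched in LaTeX:

```latex
% Forward (noising) Markov chain with variance schedule \beta_1, \dots, \beta_T:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed-form marginal, with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\right)

% The score of the marginal at noise level t:
s(x, t) = \nabla_x \log q_t(x)

% Epsilon-prediction training objective:
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\left[\ \left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2 \right]
```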
04 When to Use
Use diffusion models when you need high-fidelity, diverse generative samples and you can afford iterative sampling. They shine in image, audio, and 3D generative tasks, super-resolution, inpainting, and conditional synthesis (text-to-image, class-conditional) by adding conditioning signals. If training stability is a concern (e.g., GANs collapse), diffusion models are attractive because the loss is a simple regression and the forward process is fixed and tractable. They also integrate naturally with SDE/ODE solvers and can leverage advances in numerical integration and control variates. Choose VP/VE/sub-VP variants based on data scaling and desired properties: VP keeps variance near constant (suitable for bounded, normalized data), while VE diffuses variance to infinity and uses annealed Langevin dynamics; sub-VP offers improved likelihood training. When compute or latency is limited, consider reduced step samplers (e.g., DDIM, higher-order solvers) or distillation to fewer steps. For small, low-dimensional problems (e.g., 2D toy data), simple feature models can learn usable scores, making C++ from-scratch demos feasible without deep libraries.
⚠️Common Mistakes
- Mismatched variance in reverse updates: the posterior variance \tilde{\beta}_t differs from \beta_t; using the wrong \sigma_t inflates or destroys samples. Always use \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t in DDPM sampling unless intentionally using an alternative schedule.
- Ignoring data normalization: training on unnormalized data (e.g., images not scaled to [−1,1]) shifts the effective noise scale and breaks the time embedding’s meaning. Normalize consistently and reverse it for output.
- Poor timestep conditioning: feeding raw integer t without embedding often hurts learning. Use sinusoidal or learned embeddings so the model understands continuous noise levels.
- Inadequate noise schedule: overly small early \beta_t values produce slow mixing; overly large late \beta_t values cause numerical instability. Start with standard linear or cosine schedules, then tune.
- Too few sampling steps: aggressive step reduction without a better solver (e.g., DDIM, higher-order ODE methods) leads to artifacts. Balance speed-quality or use solvers designed for few steps.
- Overfitting the score: tiny models on trivial datasets may memorize. Monitor validation loss on held-out data and visualize samples at various t.
- Confusing score vs. epsilon prediction: they are equivalent up to scale factors tied to \bar{\alpha}_t; ensure conversion is correct when mixing formulas or code paths.
Key Formulas
Forward marginal
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)
Explanation: Given \bar{\alpha}_t, the product of alphas up to time t, the noised variable x_t is a Gaussian around a scaled version of x_0 with variance matching the accumulated noise. This provides a closed-form way to sample x_t at any t without simulating all earlier steps.
Alpha definitions
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
Explanation: Alphas are complements of betas (per-step noise). The product \bar{\alpha}_t determines how much signal from x_0 remains after t steps.
Closed-form noising
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
Explanation: This reparameterization directly constructs x_t from x_0 and Gaussian noise \epsilon. It is used for efficient training and computing losses.
Epsilon-prediction loss
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\right]
Explanation: The model predicts the noise added at time t. Minimizing mean squared error is simple and stable, and it implicitly trains a score network.
Score–epsilon relation (VP)
\nabla_{x_t} \log q_t(x_t) = -\,\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}
Explanation: For VP diffusions, the score and epsilon predictor are related by scale factors depending on \bar{\alpha}_t. This allows converting between parameterizations.
Tweedie’s formula
\hat{x}_0 = \left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)\right) / \sqrt{\bar{\alpha}_t}
Explanation: An estimate of the clean sample x_0 from a noisy input x_t and the predicted noise. Useful in reverse updates and diagnostics.
Posterior variance
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t
Explanation: The conditional variance in the reverse step x_t \to x_{t-1}. Using this instead of \beta_t improves numerical correctness and sample quality.
DDPM reverse update
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
Explanation: A discrete-time sampling step using the noise predictor. The variance \sigma_t^2 is typically set to \tilde{\beta}_t or chosen for specific samplers (e.g., DDIM sets it to 0).
Forward SDE (generic)
dx = f(x, t)\, dt + g(t)\, dw
Explanation: A stochastic differential equation describes the continuous-time diffusion. The functions f and g control drift and diffusion (noise intensity).
Reverse-time SDE
dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}
Explanation: The reverse dynamics that generate data from noise require the score. Replacing the true score with a learned approximation yields practical samplers.
Probability flow ODE
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)
Explanation: A deterministic ODE sharing the same marginals as the SDE. This enables sampling without additional randomness and is useful with ODE solvers.
Signal-to-noise ratio
\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
Explanation: Measures how much of the original signal remains at time t relative to noise. It guides schedule design and timestep selection.
Complexity Analysis
Sampling x_t from the closed-form marginal q(x_t | x_0) costs O(d) per draw after an O(T) one-time schedule precomputation, with no need to simulate earlier steps. Each training step costs one model evaluation (plus its gradient) per sampled (x_0, t, \epsilon) triple. Ancestral (DDPM) sampling requires T sequential model evaluations per generated sample, which is the main motivation for reduced-step samplers and distillation.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utility: generate a linear beta schedule
vector<double> linear_betas(int T, double beta_start=1e-4, double beta_end=0.02) {
    vector<double> betas(T+1, 0.0); // 1-indexed for convenience
    for (int t = 1; t <= T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (double)(t - 1) / (double)(T - 1);
    }
    return betas;
}

// Precompute alpha_t, alpha_bar_t, and posterior variance tilde_beta_t
struct Schedule {
    int T;
    vector<double> beta, alpha, alpha_bar, tilde_beta;
};

Schedule make_schedule(int T, double beta_start=1e-4, double beta_end=0.02) {
    Schedule s; s.T = T;
    s.beta = linear_betas(T, beta_start, beta_end);
    s.alpha.assign(T+1, 0.0);
    s.alpha_bar.assign(T+1, 0.0);
    s.tilde_beta.assign(T+1, 0.0);
    s.alpha[0] = 1.0;
    s.alpha_bar[0] = 1.0;
    for (int t = 1; t <= T; ++t) {
        s.alpha[t] = 1.0 - s.beta[t];
        s.alpha_bar[t] = s.alpha_bar[t-1] * s.alpha[t];
    }
    for (int t = 1; t <= T; ++t) {
        double ab_t = s.alpha_bar[t];
        double ab_tm1 = (t==1) ? 1.0 : s.alpha_bar[t-1];
        s.tilde_beta[t] = (1.0 - ab_tm1) / (1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Sample x_t from q(x_t | x_0) in closed form
vector<double> sample_q_xt_given_x0(const vector<double>& x0, int t, const Schedule& s, mt19937& rng) {
    int d = (int)x0.size();
    normal_distribution<double> nd(0.0, 1.0);
    vector<double> xt(d);
    double mean_scale = sqrt(max(0.0, s.alpha_bar[t]));
    double std_scale = sqrt(max(0.0, 1.0 - s.alpha_bar[t]));
    for (int i = 0; i < d; ++i) {
        double eps = nd(rng);
        xt[i] = mean_scale * x0[i] + std_scale * eps;
    }
    return xt;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: 2D point diffused over T steps
    int T = 1000;
    Schedule sched = make_schedule(T);
    // A single clean point
    vector<double> x0 = {2.0, -1.0};

    // RNG
    random_device rd; mt19937 rng(rd());

    // Sample x_t at arbitrary t
    int t = 500;
    auto xt = sample_q_xt_given_x0(x0, t, sched, rng);
    cout << fixed << setprecision(4);
    cout << "x_t at t=" << t << ": (" << xt[0] << ", " << xt[1] << ")\n";

    // Sample a full trajectory x_0 -> x_T (iterating t=1..T)
    vector<double> x = x0;
    normal_distribution<double> nd(0.0, 1.0);
    for (int step = 1; step <= T; ++step) {
        double a = sqrt(max(0.0, sched.alpha[step]));
        double b = sqrt(max(0.0, 1.0 - sched.alpha[step]));
        // Per-step form: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps_t
        for (int i = 0; i < (int)x.size(); ++i) {
            x[i] = a * x[i] + b * nd(rng);
        }
    }
    cout << "x_T after iterative forward diffusion: (" << x[0] << ", " << x[1] << ")\n";

    // Demonstrate that closed-form at t=T matches iterative (distributionally)
    auto xT_closed = sample_q_xt_given_x0(x0, T, sched, rng);
    cout << "x_T (closed-form): (" << xT_closed[0] << ", " << xT_closed[1] << ")\n";

    return 0;
}
This program builds a linear beta schedule, precomputes alpha, alpha_bar, and posterior variances, and then samples noised data x_t in two ways: directly from the closed-form q(x_t | x_0) and iteratively by applying per-step noising. It illustrates the equivalence of the marginal and the Markov chain and provides a foundation for training and visualization.
#include <bits/stdc++.h>
using namespace std;

struct Gaussian2D {
    array<double,2> mu; // mean
    double sigma;       // isotropic std
    double weight;      // mixture weight (sum to 1)
};

// Compute log N(x; mu, var * I) and its gradient wrt x
static inline double log_gauss(const array<double,2>& x, const array<double,2>& mu, double var) {
    // var = sigma^2
    double dx = x[0]-mu[0], dy = x[1]-mu[1];
    double quad = (dx*dx + dy*dy)/var;
    double logdet = log(var);
    double logc = -0.5*(2*log(2*M_PI) + 2*logdet);
    return logc - 0.5*quad;
}

// Score of smoothed mixture p_sigma(x) = sum_k w_k N(x; mu_k, (sigma_k^2 + sigma^2) I)
array<double,2> score_mixture(const array<double,2>& x, const vector<Gaussian2D>& comps, double sigma) {
    vector<double> logw; logw.reserve(comps.size());
    for (auto &g : comps) {
        double var = (g.sigma*g.sigma + sigma*sigma);
        logw.push_back(log(g.weight) + log_gauss(x, g.mu, var));
    }
    double m = *max_element(logw.begin(), logw.end());
    double denom = 0.0; for (double v: logw) denom += exp(v - m);
    // responsibilities
    vector<double> gamma(logw.size());
    for (size_t k=0;k<logw.size();++k) gamma[k] = exp(logw[k]-m)/denom;

    array<double,2> s = {0.0, 0.0};
    for (size_t k=0;k<comps.size();++k) {
        double var = comps[k].sigma*comps[k].sigma + sigma*sigma;
        // grad log N wrt x = -(x - mu)/var
        s[0] += gamma[k] * (-(x[0]-comps[k].mu[0]) / var);
        s[1] += gamma[k] * (-(x[1]-comps[k].mu[1]) / var);
    }
    return s;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 2D mixture (two clusters)
    vector<Gaussian2D> mix = {
        {{-2.0, 0.0}, 0.4, 0.5},
        {{+2.0, 0.0}, 0.6, 0.5}
    };

    // Annealed noise levels (geometric from large to small)
    int K = 10;
    vector<double> sigmas(K);
    double sigma_max = 3.0, sigma_min = 0.1;
    for (int i = 0; i < K; ++i) {
        double t = (double)i/(K-1);
        sigmas[i] = sigma_max * pow(sigma_min/sigma_max, t);
    }

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0, 1.0);

    // Start from standard normal
    int S = 1000; // number of samples to draw
    vector<array<double,2>> X(S);
    for (int i=0;i<S;++i) X[i] = {nd(rng), nd(rng)};

    // ALD parameters
    int M = 50; // steps per sigma (increase for better quality)
    for (double sigma : sigmas) {
        double step = 0.1 * (sigma * sigma); // step size scales with sigma^2
        for (int m = 0; m < M; ++m) {
            for (int i = 0; i < S; ++i) {
                auto s = score_mixture(X[i], mix, sigma);
                // Langevin update: x <- x + eta * score + sqrt(2 eta) * z
                double z1 = nd(rng), z2 = nd(rng);
                X[i][0] += step * s[0] + sqrt(2.0 * step) * z1;
                X[i][1] += step * s[1] + sqrt(2.0 * step) * z2;
            }
        }
    }

    // Print a few samples (x will form two clusters near the means)
    cout << fixed << setprecision(3);
    for (int i=0;i<20;++i) {
        cout << X[i][0] << ", " << X[i][1] << "\n";
    }
    return 0;
}
This implements annealed Langevin dynamics (ALD) to sample from a known 2D Gaussian mixture using an analytic score. At each noise level sigma, it computes the score of the Gaussian-smoothed mixture and performs Langevin updates. As sigma decreases, samples concentrate near mixture modes, illustrating score-based generation without training a network.
#include <bits/stdc++.h>
using namespace std;

// Simple 2D toy dataset: mixture of two Gaussians
struct Mixture2D { vector<array<double,2>> mu; vector<double> sigma; vector<double> w; };

array<double,2> sample_mixture(const Mixture2D& mix, mt19937& rng) {
    discrete_distribution<int> pick(mix.w.begin(), mix.w.end());
    int k = pick(rng);
    normal_distribution<double> nd(0.0, 1.0);
    array<double,2> x = { mix.mu[k][0] + mix.sigma[k]*nd(rng), mix.mu[k][1] + mix.sigma[k]*nd(rng) };
    return x;
}

// Schedule: linear betas, and derived terms
struct Schedule { int T; vector<double> beta, alpha, alpha_bar, tilde_beta; };
vector<double> linear_betas(int T, double b0=1e-4, double b1=0.02) {
    vector<double> b(T+1,0.0);
    for (int t=1;t<=T;++t) b[t] = b0 + (b1-b0)*(double)(t-1)/(double)(T-1);
    return b;
}
Schedule make_schedule(int T) {
    Schedule s; s.T=T;
    s.beta=linear_betas(T);
    s.alpha.assign(T+1,0.0); s.alpha_bar.assign(T+1,0.0); s.tilde_beta.assign(T+1,0.0);
    s.alpha[0]=1.0; s.alpha_bar[0]=1.0;
    for(int t=1;t<=T;++t){ s.alpha[t]=1.0 - s.beta[t]; s.alpha_bar[t]=s.alpha_bar[t-1]*s.alpha[t]; }
    for(int t=1;t<=T;++t){
        double ab_t=s.alpha_bar[t];
        double ab_tm1 = (t==1 ? 1.0 : s.alpha_bar[t-1]);
        s.tilde_beta[t] = (1.0 - ab_tm1)/(1.0 - ab_t) * s.beta[t];
    }
    return s;
}

// Random Fourier Features: phi([x, y, tau]) in R^D with cosine features
struct RFF {
    int D;      // feature dim
    int in_dim; // 3: x, y, tau
    vector<vector<double>> W; // D x in_dim, Gaussian rows
    vector<double> b;         // D phases uniform(0, 2pi)
    RFF(int D_, int in_dim_, mt19937& rng): D(D_), in_dim(in_dim_) {
        normal_distribution<double> nd(0.0,1.0);
        uniform_real_distribution<double> ud(0.0, 2.0*M_PI);
        W.assign(D, vector<double>(in_dim)); b.assign(D,0.0);
        for(int i=0;i<D;++i){ for(int j=0;j<in_dim;++j) W[i][j]= nd(rng); b[i]= ud(rng); }
    }
    vector<double> phi(const array<double,3>& z) const {
        vector<double> feat(D);
        double scale = sqrt(2.0 / (double)D);
        for (int i=0;i<D;++i){
            double dot=0.0;
            for(int j=0;j<in_dim;++j) dot += W[i][j]*z[j];
            feat[i] = scale * cos(dot + b[i]);
        }
        return feat;
    }
};

// Linear head W_out (2 x D) predicts epsilon in R^2
struct LinearHead {
    int D;
    vector<vector<double>> W; // 2 x D
    LinearHead(int D_, mt19937& rng): D(D_) {
        normal_distribution<double> nd(0.0, 0.01);
        W.assign(2, vector<double>(D));
        for(int o=0;o<2;++o) for(int i=0;i<D;++i) W[o][i]= nd(rng);
    }
    array<double,2> forward(const vector<double>& phi) const {
        array<double,2> y = {0.0, 0.0};
        for (int i=0;i<D;++i){ y[0]+= W[0][i]*phi[i]; y[1]+= W[1][i]*phi[i]; }
        return y;
    }
    // SGD update: W <- W - lr * gradW (grad from MSE)
    void sgd_update(const vector<double>& phi, const array<double,2>& grad, double lr){
        // grad is dL/dy (size 2); dL/dW = grad ⊗ phi
        for (int i=0;i<D;++i){ W[0][i] -= lr * grad[0] * phi[i]; W[1][i] -= lr * grad[1] * phi[i]; }
    }
};

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG
    random_device rd; mt19937 rng(rd());
    normal_distribution<double> nd(0.0,1.0);

    // Dataset: two Gaussians in 2D
    Mixture2D data{ { {-2.0,0.0}, {+2.0,0.0} }, {0.4, 0.6}, {0.5, 0.5} };

    // Diffusion schedule
    int T = 100;
    Schedule sched = make_schedule(T);

    // Model: Random Fourier Features + linear head
    int D = 256;
    RFF rff(D, 3, rng);
    LinearHead head(D, rng);

    auto to_tau = [&](int t){ return (double)t / (double)T; }; // normalize time to [0,1]

    // Training hyperparams (kept tiny for demo; increase for real learning)
    int steps = 5000; double lr = 1e-2; int batch = 64;

    // Training loop: epsilon-prediction MSE
    for (int it=0; it<steps; ++it){
        // Accumulate over a mini-batch
        for (int b=0; b<batch; ++b){
            // Sample clean data
            auto x0 = sample_mixture(data, rng);
            // Pick a random t in [1, T]
            uniform_int_distribution<int> udt(1, T);
            int t = udt(rng);
            double alpha_bar = sched.alpha_bar[t];
            double stdn = sqrt(max(0.0, 1.0 - alpha_bar));
            // Generate noisy x_t
            array<double,2> eps = { nd(rng), nd(rng) };
            array<double,2> xt = { sqrt(alpha_bar)*x0[0] + stdn*eps[0], sqrt(alpha_bar)*x0[1] + stdn*eps[1] };
            // Input z = [x_t.x, x_t.y, tau]
            array<double,3> z = { xt[0], xt[1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto y = head.forward(phi); // predicted epsilon
            // Gradient of 1/2||y - eps||^2 wrt y is (y - eps)
            array<double,2> gy = { y[0] - eps[0], y[1] - eps[1] };
            head.sgd_update(phi, gy, lr);
        }
        if ((it+1) % 1000 == 0) cerr << "Step " << (it+1) << "/" << steps << " done\n";
    }

    // Sampling: DDPM reverse updates using predicted epsilon
    int S = 200; // number of samples
    vector<array<double,2>> X(S);
    for (int i=0;i<S;++i) X[i] = { nd(rng), nd(rng) }; // start from noise ~ N(0, I)

    for (int t=T; t>=1; --t){
        double alpha = sched.alpha[t];
        double alpha_bar = sched.alpha_bar[t];
        double sqrt_alpha = sqrt(alpha);
        double sqrt_one_minus_ab = sqrt(max(0.0, 1.0 - alpha_bar));
        double sigma_t = sqrt(max(1e-20, sched.tilde_beta[t])); // posterior std
        for (int i=0;i<S;++i){
            array<double,3> z = { X[i][0], X[i][1], to_tau(t) };
            vector<double> phi = rff.phi(z);
            auto eps_pred = head.forward(phi);
            // DDPM mean
            double coeff = (1.0 - alpha) / max(1e-12, sqrt_one_minus_ab);
            array<double,2> mean = {
                (X[i][0] - coeff * eps_pred[0]) / max(1e-12, sqrt_alpha),
                (X[i][1] - coeff * eps_pred[1]) / max(1e-12, sqrt_alpha)
            };
            if (t > 1) {
                X[i][0] = mean[0] + sigma_t * nd(rng);
                X[i][1] = mean[1] + sigma_t * nd(rng);
            } else {
                X[i] = mean; // last step: no noise
            }
        }
    }

    // Print a few generated samples (should cluster near the two modes)
    cout << fixed << setprecision(3);
    for (int i=0;i<20;++i) cout << X[i][0] << ", " << X[i][1] << "\n";
    return 0;
}
This program trains a tiny epsilon predictor on a 2D toy dataset using Random Fourier Features (no deep library). The model takes (x_t, t) and predicts the added noise ε with an SGD-updated linear head. After training, it uses the DDPM reverse update with the predicted ε to generate samples from pure noise. Although simplistic, it demonstrates the full pipeline: schedule, noising, ε-prediction training, and reverse sampling.