Diffusion Models Theory
Key Points
- Diffusion models learn to reverse a simple noising process by estimating the score (the gradient of the log density) of data at different noise levels.
- The forward process adds Gaussian noise with a schedule \(\beta_t\), leading to a closed-form noisy variable \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon\).
- Training optimizes an ELBO that reduces to denoising score matching, often implemented as predicting the noise \(\varepsilon\) or the score \(s_\theta(x_t, t)\).
- Sampling can be viewed either as a discrete reverse diffusion chain or as integrating a continuous-time SDE/ODE; both need the score function at each time.
- Langevin dynamics uses the score to push samples toward high-density regions while injecting noise to explore the distribution.
- Classifier-free guidance linearly interpolates unconditional and conditional scores to trade off sample fidelity and diversity.
- Diffusion theory underpins modern image generators like Stable Diffusion by offering stable training and high-quality samples without adversarial losses.
- C++ implementations can simulate the forward noising, run Langevin samplers with analytic scores, and illustrate guidance without deep frameworks.
Prerequisites
- Multivariate Gaussian distributions — Forward diffusion, reverse posteriors, and Langevin updates are all built on Gaussian densities and their gradients.
- Gradient and Jacobian calculus — The score is a gradient of log-density; understanding updates and score matching requires gradient intuition.
- Basic probability and KL divergence — ELBO derivations and variational bounds rely on expectations and divergence measures.
- Markov chains — The forward and reverse processes are Markovian; conditional independence assumptions are key.
- Stochastic differential equations (intro) — Continuous-time diffusion and probability flow are expressed as SDEs/ODEs.
- Numerical optimization (MSE, SGD) — Training minimizes mean-squared-error objectives equivalent to denoising score matching.
- Linear algebra and vector norms — Implementations and complexity rely on vector operations and norms in \(\mathbb{R}^d\).
Detailed Explanation
01Overview
Diffusion models are a class of generative models that construct complex data (like images or audio) by learning to reverse a gradual noising process. The idea is to start from clean data, add small amounts of Gaussian noise step by step (the forward process), and then train a model to remove that noise step by step (the reverse process). If the forward process is chosen carefully, we know exactly how noise accumulates over time. This lets us write clean mathematical objectives—variational bounds and score-matching losses—that are easy to optimize. At inference, we begin with pure noise and iteratively denoise using the learned model to produce a new sample that looks like the training data.

The theory connects discrete-time Markov chains and continuous-time stochastic differential equations (SDEs). In discrete time, the model learns conditional distributions that reverse each noising step. In continuous time, the model estimates the score (the gradient of the log-density) at each noise level, and sampling integrates an SDE or a probability flow ODE.

A key practical trick is to parameterize the model to predict the injected noise or the data itself, which yields simple mean-squared-error losses. Another widely used technique, classifier-free guidance, blends unconditional and conditional scores to bias the generation toward a prompt or class label. Together, these ideas explain why diffusion models train stably and scale to high-fidelity generation, as seen in systems like Stable Diffusion and DALL·E.
02Intuition & Analogies
Imagine blurring a photo by repeatedly sprinkling a tiny bit of fog over it. Each sprinkle barely changes the picture, but after many sprinkles the image turns into pure gray noise. That blurring is the forward diffusion process. Now, suppose you could learn a perfect “anti-fog” brush that, given a slightly foggy image, knows precisely how to nudge each pixel back toward what a clean photo should look like. If you applied this brush repeatedly—starting from pure noise—you could paint a believable photo from scratch. That reverse “anti-fog” operation is what diffusion models learn.

A more tactile analogy: think of a hill covered in snow with a hidden path underneath. The forward process is like repeated snowfalls that gradually cover the footprints until the path disappears. The reverse process is like a skilled tracker who, given a snapshot of the current snow, can infer where the footprints likely are and brush away just the right amount of snow to reveal the trail step by step. The key skill the tracker needs is the direction in which the snow thickness decreases fastest—that’s exactly a gradient. In diffusion, that gradient is the score: the direction that most increases the data’s probability.

Langevin dynamics adds another layer to the story. If you always walk uphill along the gradient of log-probability, you get stuck at peaks; but if you add a little random shake while moving uphill, you can explore the whole landscape of likely images. That’s what Langevin does: move a bit in the direction of the score, then jitter. Classifier-free guidance is like asking two guides—the general “what looks like an image” guide and the “what looks like a dog” guide—and then following a weighted combination of their advice to emphasize your chosen subject without losing realism.
03Formal Definition
Fix a number of steps \(T\) and a variance schedule \(\beta_1, \dots, \beta_T \in (0,1)\), and write \(\alpha_t = 1 - \beta_t\), \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\). The forward process is the Markov chain \(q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)\) started from data \(x_0 \sim q(x_0)\). The reverse model is a second Markov chain \(p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t))\) started from \(p(x_T) = \mathcal{N}(0, I)\). Training maximizes a variational lower bound on \(\log p_\theta(x_0)\), which, up to weighting, reduces to the denoising objective given under Key Formulas; equivalently, the model estimates the time-dependent score \(\nabla_x \log p_t(x)\).
04When to Use
Use diffusion models when you need high-fidelity, diverse generative modeling with stable training: images, audio, 3D textures, or scientific data. They excel when: (1) you can afford multi-step sampling (tens to hundreds of steps), (2) you have powerful function approximators (e.g., U-Nets) and accelerators, and (3) you want robustness without adversarial games (unlike GANs). For conditional generation (e.g., text-to-image, class-conditional), diffusion pairs naturally with conditioning mechanisms like cross-attention or classifier-free guidance to steer outputs. Choose score-based SDE formulations when you want flexibility in sampler design (predictor-corrector, ODE solvers) and potentially fewer steps via high-order integrators. Opt for DDPM-style discrete reverse chains when implementation simplicity and widely available code are priorities. If you have analytic scores (e.g., simple mixture distributions), Langevin dynamics offers a minimal setup to demonstrate principles or to sample without training a neural net. For deployment, consider distillation or consistency models to reduce steps when latency matters, and use guidance to trade diversity for fidelity according to application needs (e.g., product imagery vs exploratory art).
⚠️Common Mistakes
- Ignoring the noise schedule: Choosing \(\beta_t\) poorly leads to either vanishing signal (too aggressive noise early) or ineffective mixing (too gentle). Use well-tested schedules (linear/cosine) and verify that \(\bar{\alpha}_t\) spans from near 1 to near 0 over \(T\).
- Confusing score and noise parameterizations: The score relates to noise prediction via \(s_\theta(x_t,t) \approx -\varepsilon_\theta(x_t,t)/\sqrt{1-\bar{\alpha}_t}\). Mixing these up breaks the reverse mean formula and yields bad samples.
- Skipping variance terms: In reverse diffusion, the variance \(\tilde{\beta}_t\) (or its learned version) stabilizes sampling. Setting it to zero without using an ODE-consistent update can cause collapse.
- Over-guiding: Large classifier-free guidance scales can overfit to the condition, reducing diversity and causing artifacts. Tune the guidance scale and monitor FID/CLIP metrics as well as qualitative variety.
- Misusing Langevin step sizes: Too large an \(\epsilon\) diverges; too small mixes slowly. Use decreasing schedules or noise annealing, and check acceptance diagnostics (when available) or sample statistics.
- Forgetting time conditioning: The model must know “how noisy” its input is. Omitting time embeddings or using inconsistent time units (discrete vs continuous) harms performance.
- Evaluation mismatch: Training with one parameterization (\(\varepsilon\), \(x_0\), or score) but sampling with formulas for another leads to bias. Keep formulas consistent throughout.
Key Formulas
Forward Step
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big) \]
Explanation: Each forward step adds Gaussian noise with variance \(\beta_t\) and scales the signal by \(\sqrt{1-\beta_t}\). This builds a simple, tractable corruption process.
Closed-form Noising
\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \quad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s) \]
Explanation: After t steps, the noisy variable is a linear combination of the original data and Gaussian noise. This lets us sample any time t directly without simulating all previous steps.
True Reverse Posterior
\[ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big) \]
Explanation: Given both \(x_t\) and \(x_0\), the exact reverse conditional is Gaussian. Its mean is linear in \(x_t\) and \(x_0\), and the variance \(\tilde{\beta}_t\) depends on the schedule.
Posterior Parameters
\[ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t \]
Explanation: These are the closed-form mean and variance of the reverse conditional under the forward process. They are used to construct learning targets for the model’s reverse mean.
DDPM Reverse Mean (\(\varepsilon\)-param)
\[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t) \right) \]
Explanation: When the model predicts the injected noise, this formula produces the mean of the reverse Gaussian. It’s central to sampling with a learned noise predictor.
Score Definition
\[ s(x) = \nabla_x \log p(x) \]
Explanation: The score points in the direction of steepest ascent of log-probability. Diffusion models learn time-dependent versions of this vector field.
Simplified DDPM Loss
\[ L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \varepsilon} \left[ \big\| \varepsilon - \varepsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon,\ t\big) \big\|^2 \right] \]
Explanation: The training objective compares the model’s noise prediction to the true noise that produced \(x_t\). Minimizing this MSE implements denoising score matching under the forward corruption.
Score–Noise Relation
\[ s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}} \]
Explanation: A noise predictor can be converted to a score estimate via a time-dependent scaling. This equivalence underpins classifier-free guidance in either parameterization.
Langevin Update
\[ x_{k+1} = x_k + \epsilon\, \nabla_x \log p(x_k) + \sqrt{2\epsilon}\, z_k, \qquad z_k \sim \mathcal{N}(0, I) \]
Explanation: Each iteration climbs the log-density by a small step and adds Gaussian noise to maintain the correct stationary distribution. It is used as a basic score-based sampler.
General SDE
\[ \mathrm{d}x = f(x, t)\, \mathrm{d}t + g(t)\, \mathrm{d}w \]
Explanation: This SDE defines the forward diffusion in continuous time with drift \(f\) and diffusion \(g\). Different choices recover VP/VE processes used in score-based models.
Probability Flow ODE
\[ \frac{\mathrm{d}x}{\mathrm{d}t} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x) \]
Explanation: Solving this ODE deterministically transports noise into data while matching the SDE’s marginal densities. It enables sampler variants without injected randomness.
Classifier-free Guidance
\[ \tilde{s}(x_t, t, y) = s_\theta(x_t, t) + w\,\big(s_\theta(x_t, t, y) - s_\theta(x_t, t)\big) \]
Explanation: Guidance scales the conditional direction to emphasize adherence to a prompt/class. The weight \(w \ge 0\) trades diversity (lower \(w\)) for fidelity (higher \(w\)).
Complexity Analysis
Forward noising via the closed form costs \(O(d)\) per sample per checkpoint (one scale-and-add in \(\mathbb{R}^d\)), independent of \(t\). Reverse sampling requires one score/noise-network evaluation per step, so generating a sample costs \(O(T)\) model evaluations over \(T\) steps, which is why step-reduction techniques (higher-order ODE solvers, distillation) matter for latency. In the demos below, the exact mixture score costs \(O(Kd)\) per point for a \(K\)-component mixture, so the Langevin sampler runs in \(O(\text{chains} \times \text{steps} \times Kd)\).
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct Schedule {
    vector<double> beta, alpha, alpha_bar;
};

Schedule make_linear_schedule(int T, double beta_start = 1e-4, double beta_end = 2e-2) {
    Schedule sch;
    sch.beta.resize(T); sch.alpha.resize(T); sch.alpha_bar.resize(T);
    for (int t = 0; t < T; ++t) {
        double bt = beta_start + (beta_end - beta_start) * (double)t / (double)(T - 1);
        sch.beta[t] = bt;
        sch.alpha[t] = 1.0 - bt;
        if (t == 0) sch.alpha_bar[t] = sch.alpha[t];
        else sch.alpha_bar[t] = sch.alpha_bar[t - 1] * sch.alpha[t];
    }
    return sch;
}

// Sample from N(0, I_d)
vector<double> sample_normal(int d, mt19937 &rng) {
    normal_distribution<double> nd(0.0, 1.0);
    vector<double> z(d);
    for (int i = 0; i < d; ++i) z[i] = nd(rng);
    return z;
}

// Generate a simple 2D mixture of Gaussians dataset
vector<array<double,2>> sample_mixture2D(int N, mt19937 &rng) {
    vector<array<double,2>> X; X.reserve(N);
    normal_distribution<double> nd(0.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);
    array<double,2> mu1{ -2.0, 0.0 }, mu2{ 2.0, 0.0 };
    double sigma = 0.5; // isotropic std
    for (int i = 0; i < N; ++i) {
        array<double,2> mu = (unif(rng) < 0.5 ? mu1 : mu2);
        array<double,2> x;
        x[0] = mu[0] + sigma * nd(rng);
        x[1] = mu[1] + sigma * nd(rng);
        X.push_back(x);
    }
    return X;
}

// Compute mean and trace of covariance for a set of 2D points
pair<array<double,2>, double> stats2D(const vector<array<double,2>>& X) {
    array<double,2> m{0.0, 0.0};
    for (auto &x : X) { m[0] += x[0]; m[1] += x[1]; }
    m[0] /= X.size(); m[1] /= X.size();
    // Covariance trace = var(x) + var(y)
    double v = 0.0;
    for (auto &x : X) {
        v += (x[0]-m[0])*(x[0]-m[0]) + (x[1]-m[1])*(x[1]-m[1]);
    }
    double trace = v / (double)X.size();
    return {m, trace};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(42);
    int N = 10000; // number of data points
    auto X0 = sample_mixture2D(N, rng);

    // Create schedule
    int T = 1000;
    auto sch = make_linear_schedule(T);

    // Time steps at which we will examine the noised data distribution
    vector<int> checkpoints = {0, 250, 500, 750, 999};

    cout << fixed << setprecision(4);
    for (int t : checkpoints) {
        double a_bar = sch.alpha_bar[t];
        double mean_scale = sqrt(a_bar);
        double noise_scale = sqrt(1.0 - a_bar);
        vector<array<double,2>> Xt; Xt.reserve(N);
        for (int i = 0; i < N; ++i) {
            array<double,2> x0 = X0[i];
            vector<double> eps = sample_normal(2, rng);
            array<double,2> xt{
                mean_scale * x0[0] + noise_scale * eps[0],
                mean_scale * x0[1] + noise_scale * eps[1]
            };
            Xt.push_back(xt);
        }
        auto [m, tr] = stats2D(Xt);
        cout << "t=" << setw(3) << t
             << " mean=(" << m[0] << ", " << m[1] << ")"
             << " cov_trace=" << tr << "\n";
    }

    return 0;
}
This program constructs a standard DDPM-style linear noise schedule and uses the closed-form noising formula x_t = sqrt(alpha_bar_t) x0 + sqrt(1 - alpha_bar_t) eps to corrupt a simple 2D mixture of Gaussians at selected time steps. It reports the sample mean and covariance trace so you can see the distribution drift toward zero-mean, high-variance noise as t increases. This mirrors the forward diffusion used for training diffusion models.
#include <bits/stdc++.h>
using namespace std;

struct Gaussian2D {
    array<double,2> mu;
    double sigma;
    double log_norm; // log normalizing constant
    Gaussian2D(array<double,2> mu_, double sigma_) : mu(mu_), sigma(sigma_) {
        log_norm = -log(2.0 * M_PI * sigma * sigma);
    }
    // log N(x | mu, sigma^2 I)
    double logp(const array<double,2>& x) const {
        double dx = x[0]-mu[0], dy = x[1]-mu[1];
        double q = (dx*dx + dy*dy) / (2.0 * sigma * sigma);
        return log_norm - q;
    }
    // score = grad_x log N(x | mu, sigma^2 I) = -(x - mu) / sigma^2
    array<double,2> score(const array<double,2>& x) const {
        double inv = 1.0 / (sigma * sigma);
        return { -(x[0]-mu[0]) * inv, -(x[1]-mu[1]) * inv };
    }
};

// Mixture with uniform weights; provides the exact score via responsibilities
struct Mixture2D {
    vector<Gaussian2D> comps;
    Mixture2D(const vector<Gaussian2D>& cs) : comps(cs) {}
    array<double,2> score_unconditional(const array<double,2>& x) const {
        int K = (int)comps.size();
        vector<double> logw(K);
        for (int k = 0; k < K; ++k) logw[k] = comps[k].logp(x);
        // log-sum-exp for numerically stable normalization
        double m = *max_element(logw.begin(), logw.end());
        double denom = 0.0;
        for (int k = 0; k < K; ++k) denom += exp(logw[k] - m);
        vector<double> r(K);
        for (int k = 0; k < K; ++k) r[k] = exp(logw[k] - m) / denom; // responsibilities
        // score = sum_k r_k * score_k
        array<double,2> s{0.0, 0.0};
        for (int k = 0; k < K; ++k) {
            auto sk = comps[k].score(x);
            s[0] += r[k] * sk[0];
            s[1] += r[k] * sk[1];
        }
        return s;
    }
    // Conditional score given a label y = k* (idealized: density = component k*)
    array<double,2> score_conditional(const array<double,2>& x, int y) const {
        return comps[y].score(x);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Define a 3-component mixture
    vector<Gaussian2D> comps = {
        Gaussian2D({-2.0, 0.0}, 0.6),
        Gaussian2D({ 2.0, 0.0}, 0.6),
        Gaussian2D({ 0.0, 2.0}, 0.6)
    };
    Mixture2D mix(comps);

    mt19937 rng(123);
    normal_distribution<double> nd(0.0, 1.0);

    // Langevin parameters
    int chains = 2000;  // independent chains
    int steps = 2000;   // iterations per chain
    double eps = 2e-3;  // step size (try decreasing over time for annealing)

    // Classifier-free guidance parameters
    bool use_guidance = true;
    int y_label = 1;    // target component index (0, 1, 2)
    double w = 2.0;     // guidance scale; 0 = unconditional, larger -> more conditional

    vector<array<double,2>> x(chains);
    // Initialize from a standard normal
    for (int i = 0; i < chains; ++i) x[i] = { nd(rng), nd(rng) };

    for (int t = 0; t < steps; ++t) {
        for (int i = 0; i < chains; ++i) {
            // Compute score
            array<double,2> s_u = mix.score_unconditional(x[i]);
            array<double,2> s = s_u;
            if (use_guidance) {
                array<double,2> s_c = mix.score_conditional(x[i], y_label);
                // classifier-free guidance: s = s_u + w * (s_c - s_u)
                s = { s_u[0] + w*(s_c[0]-s_u[0]), s_u[1] + w*(s_c[1]-s_u[1]) };
            }
            // Langevin update: x <- x + eps * s + sqrt(2*eps) * z
            array<double,2> noise{ nd(rng), nd(rng) };
            x[i][0] += eps * s[0] + sqrt(2.0*eps) * noise[0];
            x[i][1] += eps * s[1] + sqrt(2.0*eps) * noise[1];
        }
        // Optional: decay the step size for stability
        // eps *= 0.9995;
    }

    // Report sample mean and covariance trace after sampling
    array<double,2> m{0.0, 0.0};
    for (auto &xi : x) { m[0] += xi[0]; m[1] += xi[1]; }
    m[0] /= chains; m[1] /= chains;
    double tr = 0.0;
    for (auto &xi : x) {
        tr += (xi[0]-m[0])*(xi[0]-m[0]) + (xi[1]-m[1])*(xi[1]-m[1]);
    }
    tr /= chains;

    cout << fixed << setprecision(4);
    cout << "Samples: mean=(" << m[0] << ", " << m[1] << ") cov_trace=" << tr << "\n";
    cout << "Guidance: " << (use_guidance ? "ON" : "OFF")
         << ", y=" << y_label << ", w=" << w << "\n";

    return 0;
}
This program samples from a 2D Gaussian mixture using Langevin dynamics. Because the mixture’s score is analytic, we can implement an exact score-based sampler without training. It also demonstrates classifier-free guidance by linearly interpolating between unconditional (mixture) and conditional (single component) scores, controlled by a guidance scale w. Larger w biases samples toward the chosen component while still leveraging the unconditional prior for realism.