How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Classifier-Free Guidance

Key Points

  • Classifier-Free Guidance (CFG) steers diffusion sampling toward a condition (like a text prompt) without needing a separate classifier.
  • It combines a conditional prediction with an unconditional one using a guidance scale w to push samples toward the condition.
  • The core formula in the epsilon-parameterization is \(\tilde{ε} = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t)\).
  • CFG usually improves prompt adherence but can reduce diversity and cause artifacts if w is too large.
  • You pay roughly 2× compute per step because you must run the model once with the prompt and once without it (unless you batch or fuse the two passes).
  • CFG can be applied in different parameterizations (\(ε\)-pred, \(x_0\)-pred, or v-pred), but you must convert between them consistently.
  • Scheduling w over time (larger early, smaller late) often stabilizes results and reduces oversaturation.
  • In C++, implement CFG by linearly combining two model outputs per step and then proceeding with your chosen sampler (DDIM or DDPM).

Prerequisites

  • →Gaussian noise and basic probability — Diffusion training and sampling add and remove Gaussian noise at controlled levels.
  • →Gradient/score interpretation — CFG relates to moving along gradients of log-probability densities.
  • →Diffusion model basics (DDPM/DDIM) — You need to understand x_t, x_0, epsilon prediction, and schedulers to apply CFG correctly.
  • →Linear algebra and vector operations — Implementing CFG requires elementwise linear combinations and norm stability checks.
  • →Parameterizations (epsilon, x0, v) — CFG must be applied consistently; conversions between parameterizations are essential.
  • →Time complexity (Big-O) — CFG roughly doubles per-step compute; understanding cost helps in system design.

Detailed Explanation


01Overview

Classifier-Free Guidance (CFG) is a technique used in diffusion models to make generated samples better match a conditioning signal (like text, class labels, or other metadata) while avoiding a separate external classifier. During training, the model is sometimes given the condition and sometimes not (by dropping it). This teaches the same network to produce both conditional and unconditional predictions. At inference time, we query the model twice at each sampling step: once with the condition (conditional) and once without (unconditional). We then combine the two predictions with a scalar guidance weight w to bias the update toward the condition. In the common epsilon-parameterization (where the model predicts the noise), the combined prediction is \(\tilde{ε} = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t)\). This simple linear combination yields strong prompt adherence in text-to-image systems and is widely used in practice (e.g., Stable Diffusion). CFG improves controllability without training a classifier for gradients, simplifying deployment and often stabilizing guidance relative to classifier-based methods. However, it increases sampling cost (two forward passes per step) and, if overused, can degrade image quality or reduce diversity.
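The training-side conditioning dropout described above can be sketched in a few lines. This is an illustrative sketch only: the dropout probability and the "null embedding" convention are common choices, not tied to any specific codebase.

```cpp
#include <random>
#include <vector>

// Illustrative sketch of conditioning dropout during training.
// With probability p_drop (commonly 0.1-0.2), replace the condition
// embedding with a "null" embedding so the same network learns both
// conditional and unconditional prediction.
std::vector<float> maybe_drop_condition(const std::vector<float>& cond_embedding,
                                        const std::vector<float>& null_embedding,
                                        float p_drop,
                                        std::mt19937& rng) {
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    if (u(rng) < p_drop) {
        return null_embedding;  // this sample trains the unconditional branch
    }
    return cond_embedding;      // this sample trains the conditional branch
}
```

At inference time, the same null embedding is fed to the model to obtain the unconditional prediction used in the CFG combination.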

02Intuition & Analogies

Imagine asking an artist to draw a scene. First, you ask them to doodle freely without instructions (unconditional). Then you give a detailed prompt and ask them to draw again (conditional). If you overlay the two sketches, the difference shows how the instructions changed the drawing. Classifier-Free Guidance uses this difference as a steering wheel: it pushes the generation in the direction that the prompt suggests. The guidance scale w is like turning the steering wheel more or less; small w subtly nudges the artwork toward the prompt, large w aggressively bends it, sometimes too much. In diffusion, each sampling step starts with a noisy canvas x_t. The model predicts how to denoise it. With the prompt, it predicts a denoising direction tailored to your request; without the prompt, it predicts a general denoising direction that just makes the sample look realistic. Subtracting the two highlights the portion of the denoising specifically due to the prompt. Adding this difference back (scaled by w) makes the step follow the conditional path more strongly. Over many steps, this accumulates to produce an output that aligns closely with the condition. The elegance is that we do not need a separate classifier to tell us how the prompt affects likelihood; the same model, trained with conditioning dropout, already knows both behaviors.

03Formal Definition

Let \(x_t\) be the noisy latent at time t and c be a conditioning signal. Suppose a diffusion model predicts a target \(y_θ(x_t, c)\). In the \(ε\)-parameterization, \(y_θ = ε_θ\) approximates the noise in \(x_t\). The unconditional prediction is \(ε_θ(x_t) := ε_θ(x_t, ∅)\), where \(∅\) denotes dropped/empty conditioning. Classifier-Free Guidance defines a guided prediction \(\tilde{y}\) as a linear combination of conditional and unconditional outputs. In epsilon space, \[ \tilde{ε}(x_t, c; w) = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t). \] Equivalently, \(\tilde{ε} = ε_θ(x_t, c) + w\,\big(ε_θ(x_t, c) - ε_θ(x_t)\big)\): the base conditional prediction plus w times the difference between conditional and unconditional directions. In score space, if \(s_θ(x_t, c)\) approximates the conditional score \(∇_{x_t} \log p(x_t \mid c)\), then CFG uses \(\tilde{s} = s_θ(x_t, c) + w\,\big(s_θ(x_t, c) - s_θ(x_t)\big)\). For other parameterizations (e.g., v-pred or \(x_0\)-pred), the same idea applies after converting to a common space (usually \(ε\)) with the standard reparameterization identities.

04When to Use

Use CFG when you have a conditional diffusion model trained with conditioning dropout (i.e., sometimes the model sees the condition, sometimes an empty token) and you want stronger adherence to the condition at inference. It is particularly effective in text-to-image generation, class-conditional image synthesis, image-to-image with prompts, and other modalities like audio or video where a descriptive condition is available. CFG is helpful when a separate classifier is impractical or unstable, since it avoids estimating \(∇_x \log p(c \mid x)\) directly. Employ CFG when you need a simple, drop-in way to improve prompt faithfulness with minimal architectural changes. Consider dynamic guidance schedules (large early, smaller late) if you observe oversaturation or detail washout at high w. If you are concerned about compute cost, batch unconditional and conditional passes together to reuse compute (e.g., shared encoder) or reduce precision to mitigate the 2× overhead. Avoid CFG if your model was not trained with conditioning dropout; the unconditional branch will be poor and guidance may misbehave.
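The batching trick mentioned above can be sketched as follows. Flat vectors stand in for stacking the unconditional and conditional inputs into one 2× batch, so a single (hypothetical) forward pass produces both outputs before the CFG combination:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the batching trick: concatenate the unconditional and
// conditional inputs into one 2x batch, run ONE forward pass over it,
// then split the output halves. One kernel launch instead of two
// improves device utilization, though FLOPs stay roughly doubled.
struct SplitOutputs {
    std::vector<float> uncond;
    std::vector<float> cond;
};

std::vector<float> concat_batch(const std::vector<float>& uncond_in,
                                const std::vector<float>& cond_in) {
    std::vector<float> batch(uncond_in);
    batch.insert(batch.end(), cond_in.begin(), cond_in.end());
    return batch;
}

SplitOutputs split_batch(const std::vector<float>& batched_out, std::size_t n) {
    return { {batched_out.begin(), batched_out.begin() + n},
             {batched_out.begin() + n, batched_out.end()} };
}
```

Keeping a fixed convention for which half is unconditional avoids the swapped-batch bug listed under Common Mistakes.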

⚠️Common Mistakes

• Using excessive guidance (very large w) can oversaturate colors, produce high-contrast artifacts, or collapse diversity. Start with moderate values (e.g., 3–9) and consider a decaying schedule.
• Mixing parameterizations incorrectly: applying CFG directly in v- or \(x_0\)-space without consistent conversions leads to bias. Safest is to convert to \(ε\), apply CFG, then convert back if needed.
• Forgetting to train with conditioning dropout (e.g., 10–20% of the time). Without a good unconditional head, the difference term is noisy and CFG becomes unstable.
• Swapping conditional and unconditional batches, or broadcasting mistakes when batching both in one forward pass (shapes misaligned). Always track which half is cond vs uncond.
• Applying CFG with the wrong timestep scaling (e.g., misusing \(α_t\), \(\bar{α}_t\), or \(σ_t\)) or mixing scheduler conventions across libraries. Keep a single, consistent scheduler API.
• Assuming CFG helps everything: it improves condition adherence but may reduce diversity; for highly creative outputs, try smaller w or add stochasticity (nonzero \(η\) in DDIM).
• Ignoring precision/NaNs when w is large: clamp intermediate norms or use dynamic thresholding to avoid exploding values late in sampling.
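The last mistake above (exploding values at large w) can be guarded against by rescaling the guided epsilon when its norm grows too far beyond the conditional prediction's norm. This is a simplified illustration of such norm guards; the ratio threshold is an assumed hyperparameter, not a standard value:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rescale eps_guided in place if its L2 norm exceeds max_ratio times
// the conditional epsilon's norm. A simplified guard against exploding
// values at large guidance scales; max_ratio is illustrative.
void clamp_guided_epsilon(std::vector<float>& eps_guided,
                          const std::vector<float>& eps_cond,
                          float max_ratio) {
    double norm_g = 0.0, norm_c = 0.0;
    for (std::size_t i = 0; i < eps_guided.size(); ++i) {
        norm_g += eps_guided[i] * eps_guided[i];
        norm_c += eps_cond[i] * eps_cond[i];
    }
    norm_g = std::sqrt(norm_g);
    norm_c = std::sqrt(norm_c);
    if (norm_c > 0.0 && norm_g > max_ratio * norm_c) {
        float scale = static_cast<float>(max_ratio * norm_c / norm_g);
        for (float& v : eps_guided) v *= scale;  // shrink back toward a safe norm
    }
}
```

Dynamic thresholding in pixel/latent value space is a related alternative that clips the reconstructed \(x_0\) instead of the epsilon.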

Key Formulas

Classifier-Free Guidance (epsilon space)

\[ \tilde{ε}(x_t, c; w) = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t) \]

Explanation: Combine conditional and unconditional noise predictions with guidance scale w. Larger w pushes the denoising more toward the conditional direction.

CFG in score space

\[ \tilde{s}(x_t, c; w) = s_θ(x_t, c) + w\,\big(s_θ(x_t, c) - s_θ(x_t)\big) \]

Explanation: Apply the same linear guidance to score estimates. This matches the intuition of pushing along the condition-specific score direction.

x0 from epsilon

\[ x_0 = \frac{x_t - \sqrt{1 - \bar{α}_t}\,ε}{\sqrt{\bar{α}_t}} \]

Explanation: Recover a clean estimate x0 given the predicted noise and current noise level. Used by DDPM/DDIM updates.

DDIM update (η = 0, deterministic, with \(ε_{\text{DDIM}} = \tilde{ε}\))

\[ x_{t-1} = \sqrt{\bar{α}_{t-1}}\,x_0 + \sqrt{1 - \bar{α}_{t-1}}\,ε_{\text{DDIM}} \]

Explanation: Deterministic DDIM uses the guided epsilon directly to move to the next step, often with no extra noise when eta=0.

v–epsilon conversion

\[ v = \frac{ε - σ_t\,x_t}{α_t}, \qquad ε = α_t\,v + σ_t\,x_t \]

Explanation: Relates v-prediction to epsilon. Here \(α_t = \sqrt{\bar{α}_t}\) and \(σ_t = \sqrt{1 - \bar{α}_t}\). Convert to epsilon to apply CFG consistently.

Score decomposition

\[ ∇_x \log p(x \mid c) = ∇_x \log p(x) + ∇_x \log p(c \mid x) \]

Explanation: The conditional score equals the unconditional score plus the classifier gradient. CFG implicitly approximates this difference without training a classifier.
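This identity follows directly from Bayes' rule, because \(\log p(c)\) does not depend on x and so its gradient vanishes:

```latex
\log p(x \mid c) = \log p(x) + \log p(c \mid x) - \log p(c)
\;\Longrightarrow\;
\nabla_x \log p(x \mid c) = \nabla_x \log p(x) + \nabla_x \log p(c \mid x).
```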

Generic CFG linear rule

\[ \tilde{y} = y_c + w\,(y_c - y_u) \]

Explanation: A template to apply CFG in any parameterization y (epsilon, score, x0, or v) if conversions are handled correctly.

Linear decreasing guidance schedule

\[ w_t = w_{\min} + (w_{\max} - w_{\min})\,\frac{t}{T-1} \]

Explanation: Start with strong guidance early and decrease over time to stabilize late steps. Here t counts down from T−1 to 0 during sampling, so w starts at \(w_{\max}\) and decays to \(w_{\min}\).

Noise schedule relations

\[ \bar{α}_t = \prod_{i=1}^{t} (1 - β_i), \qquad α_t = \sqrt{\bar{α}_t}, \qquad σ_t = \sqrt{1 - \bar{α}_t} \]

Explanation: Defines cumulative noise levels used in conversions between parameterizations and in sampler updates.

Compute overhead of CFG

\[ O(T \cdot C_{\text{model}}) \rightarrow O(2T \cdot C_{\text{model}}) \text{ with CFG} \]

Explanation: CFG doubles per-step model evaluations (conditional and unconditional), though batching can reduce wall-clock time.

Complexity Analysis

Let T be the number of sampling steps and C_model be the average cost of one model forward pass for a single conditioning. Without CFG, the total compute is roughly O(T · C_model). With CFG, you need two predictions per step (conditional and unconditional), so the naive cost becomes O(2T · C_model). In practice, batching the conditional and unconditional inputs together can reduce wall-clock time through better device utilization, but FLOPs remain close to doubled. Memory usage increases slightly because you must hold two outputs per step (or a 2× larger batch). If you also maintain per-step buffers for x_t, ε, and x_0, the additional memory is O(n) for n-dimensional data (e.g., latent tensors), dominated by model activations during forward passes. For samplers like DDIM with η=0, there is no extra stochastic noise to store, and intermediate tensors can be reused in-place to keep memory overhead low. If you convert between parameterizations (e.g., v ↔ ε ↔ x_0), the arithmetic cost is linear in tensor size and negligible compared to the model forward passes. Scheduling the guidance scale w_t adds only O(1) time per step. Overall, CFG's trade-off is higher compute for improved adherence to conditions; careful batching and mixed precision can mitigate runtime without changing asymptotic complexity.

Code Examples

Core CFG combination in epsilon-space
#include <iostream>
#include <string>
#include <vector>
#include <iomanip>

// Elementwise linear combination implementing CFG in epsilon space:
// eps_guided = (1 + w) * eps_cond - w * eps_uncond
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    }
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Toy epsilon predictions for a 4D latent
    std::vector<float> eps_cond   = { 0.10f, -0.05f, 0.20f, 0.00f }; // with prompt c
    std::vector<float> eps_uncond = { 0.02f, -0.01f, 0.05f, 0.01f }; // with empty prompt

    float w = 7.5f; // typical strong guidance

    std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

    print_vec(eps_cond, "eps_cond");
    print_vec(eps_uncond, "eps_uncond");
    print_vec(eps_guided, "eps_guided (CFG)");

    return 0;
}

This program demonstrates the core arithmetic of Classifier-Free Guidance in epsilon space. It linearly combines conditional and unconditional noise predictions using the guidance scale w. In real systems, eps_cond and eps_uncond are produced by the same neural network queried with and without the conditioning input.

Time: O(n) per vector (n elements), negligible compared to model inference. Space: O(n) for the output vector.
DDIM sampling loop with CFG and a toy model
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <iomanip>

// A tiny toy "model" that deterministically produces an epsilon prediction
// from x_t, a timestep t, and a condition id (0=uncond, 1=cond). In practice,
// this would be a deep network.
struct ToyModel {
    // Deterministic pseudo-noise based on x_t, t, and cond_id
    std::vector<float> predict_epsilon(const std::vector<float>& x_t, int t, int cond_id) const {
        std::vector<float> eps(x_t.size());
        float a = 0.12f + 0.01f * cond_id; // slight shift with condition
        float b = 0.007f * (t + 1);
        float c = 0.21f * cond_id;
        for (size_t i = 0; i < x_t.size(); ++i) {
            float xi = x_t[i];
            eps[i] = std::sin(a * xi + b) + 0.1f * std::cos(0.3f * (i + 1) + c);
        }
        return eps;
    }
};

// Guidance scale schedule: start high, end lower (linear decrease)
float guidance_scale(int step_k, int T, float w_max = 7.5f, float w_min = 2.0f) {
    if (T <= 1) return w_max;
    float ratio = static_cast<float>(step_k) / static_cast<float>(T - 1); // 0..1
    return w_max - (w_max - w_min) * ratio;
}

// Compute x0 from x_t and eps in epsilon parameterization:
// x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
void compute_x0(const std::vector<float>& x_t, const std::vector<float>& eps,
                float sqrt_abar_t, std::vector<float>& x0_out) {
    float sqrt_one_minus = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_t * sqrt_abar_t));
    size_t n = x_t.size();
    x0_out.resize(n);
    for (size_t i = 0; i < n; ++i) {
        x0_out[i] = (x_t[i] - sqrt_one_minus * eps[i]) / sqrt_abar_t;
    }
}

// DDIM eta=0 deterministic update:
// x_{t-1} = sqrt(abar_{t-1}) * x0 + sqrt(1 - abar_{t-1}) * eps_guided
void ddim_update(std::vector<float>& x_t, const std::vector<float>& x0,
                 const std::vector<float>& eps_guided, float sqrt_abar_prev) {
    float sqrt_one_minus_prev = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_prev * sqrt_abar_prev));
    size_t n = x_t.size();
    for (size_t i = 0; i < n; ++i) {
        x_t[i] = sqrt_abar_prev * x0[i] + sqrt_one_minus_prev * eps_guided[i];
    }
}

// CFG combination in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

// Create a simple linear beta schedule and precompute sqrt(abar_t)
std::vector<float> make_sqrt_abar_schedule(int T, float beta_start = 0.0001f, float beta_end = 0.02f) {
    std::vector<float> betas(T);
    for (int t = 0; t < T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (static_cast<float>(t) / std::max(1, T - 1));
    }
    std::vector<float> abar(T);
    float prod = 1.0f;
    for (int t = 0; t < T; ++t) {
        prod *= (1.0f - betas[t]);
        abar[t] = prod;
    }
    std::vector<float> sqrt_abar(T);
    for (int t = 0; t < T; ++t) sqrt_abar[t] = std::sqrt(std::max(0.0f, abar[t]));
    return sqrt_abar;
}

int main() {
    const int T = 20;     // number of sampling steps (small for demo)
    const size_t dim = 8; // toy latent dimension
    ToyModel model;

    // Initialize x_T ~ N(0, I)
    std::mt19937 rng(42);
    std::normal_distribution<float> nd(0.0f, 1.0f);
    std::vector<float> x_t(dim);
    for (size_t i = 0; i < dim; ++i) x_t[i] = nd(rng);

    // Precompute sqrt(abar_t)
    std::vector<float> sqrt_abar = make_sqrt_abar_schedule(T);

    auto l2 = [](const std::vector<float>& v){ double s = 0; for (float x : v) s += x * x; return std::sqrt(s); };

    std::cout << "Initial ||x_T||2 = " << l2(x_t) << "\n";

    // DDIM with CFG
    std::vector<float> x0(dim);
    for (int step = 0; step < T; ++step) {
        int t = T - 1 - step; // current timestep index (descending)

        // Predict eps with and without condition (cond_id: 1=cond, 0=uncond)
        std::vector<float> eps_cond   = model.predict_epsilon(x_t, t, /*cond_id=*/1);
        std::vector<float> eps_uncond = model.predict_epsilon(x_t, t, /*cond_id=*/0);

        // Guidance scale schedule
        float w = guidance_scale(step, T, /*w_max=*/7.5f, /*w_min=*/2.0f);

        // Combine via CFG
        std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

        // Reconstruct x0 from guided epsilon
        compute_x0(x_t, eps_guided, /*sqrt_abar_t=*/sqrt_abar[t], x0);

        // Move to x_{t-1}
        float sqrt_abar_prev = (t > 0) ? sqrt_abar[t - 1] : 1.0f; // at t=0, abar_prev=1 makes the update return x0
        ddim_update(x_t, x0, eps_guided, sqrt_abar_prev);
    }

    std::cout << "Final ||x_0||2 = " << l2(x_t) << "\n";
    std::cout << "(Toy demo complete; in real use, decode latent x_0 to output)\n";
    return 0;
}

This end-to-end toy sampler demonstrates DDIM with CFG. We generate an initial Gaussian latent, run T denoising steps, and at each step evaluate the toy model twice (with and without condition), combine the predictions using CFG, reconstruct x0, and update x_{t-1}. The guidance scale decreases linearly to reduce oversaturation late in sampling.

Time: O(T · (C_model + n)) ≈ O(2T · C_model) since two model calls dominate. Space: O(n) for storing x_t, eps, and x0 (model activations not modeled here).
Applying CFG correctly with v-prediction via epsilon conversion
#include <iostream>
#include <string>
#include <vector>
#include <cmath>
#include <iomanip>

// Convert between v and epsilon parameterizations.
// alpha = sqrt(abar), sigma = sqrt(1 - abar)
void v_to_epsilon(const std::vector<float>& v, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& eps_out) {
    size_t n = v.size();
    eps_out.resize(n);
    for (size_t i = 0; i < n; ++i) eps_out[i] = alpha * v[i] + sigma * x_t[i];
}

void epsilon_to_v(const std::vector<float>& eps, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& v_out) {
    size_t n = eps.size();
    v_out.resize(n);
    for (size_t i = 0; i < n; ++i) v_out[i] = (eps[i] - sigma * x_t[i]) / alpha;
}

// CFG in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Example v-pred outputs (e.g., from a v-parameterized model)
    std::vector<float> v_cond   = {0.05f, -0.10f, 0.02f};
    std::vector<float> v_uncond = {0.01f, -0.02f, 0.00f};

    // Current noisy latent x_t and schedule values
    std::vector<float> x_t = {0.7f, -0.3f, 0.1f};
    float abar  = 0.8f; // example cumulative alpha
    float alpha = std::sqrt(abar);
    float sigma = std::sqrt(1.0f - abar);

    // Convert both to epsilon, apply CFG, then convert back to v if the sampler expects v
    std::vector<float> eps_c, eps_u, eps_guided, v_guided;
    v_to_epsilon(v_cond, x_t, alpha, sigma, eps_c);
    v_to_epsilon(v_uncond, x_t, alpha, sigma, eps_u);

    float w = 6.0f;
    eps_guided = cfg_epsilon(eps_c, eps_u, w);

    epsilon_to_v(eps_guided, x_t, alpha, sigma, v_guided);

    print_vec(v_cond, "v_cond");
    print_vec(v_uncond, "v_uncond");
    print_vec(v_guided, "v_guided (after CFG via epsilon)");

    return 0;
}

Many modern diffusion models predict v instead of epsilon. This example shows how to convert v to epsilon, apply CFG in epsilon space (recommended), and convert the guided epsilon back to v for samplers that expect v. This avoids parameterization mismatch errors.

Time: O(n) for conversions and combination. Space: O(n) for intermediate vectors.
#classifier-free guidance#diffusion models#epsilon prediction#v-prediction#ddim#ddpm#guidance scale#conditional sampling#text-to-image#score function#noise schedule#x0 reconstruction#dynamic thresholding#batching trick