How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Classifier-Free Guidance

Key Points

  • Classifier-Free Guidance (CFG) steers diffusion sampling toward a condition (like a text prompt) without needing a separate classifier.
  • It combines a conditional prediction with an unconditional one using a guidance scale w to push samples toward the condition.
  • The core formula in the epsilon-parameterization is \(\tilde{ε} = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t)\).
  • CFG usually improves prompt adherence but can reduce diversity and cause artifacts if w is too large.
  • You pay roughly 2× compute per step because you must run the model once with the prompt and once without it (unless you batch or fuse the two passes).
  • CFG can be applied in different parameterizations (\(ε\)-pred, \(x_0\)-pred, or v-pred), but you must convert between them consistently.
  • Scheduling w over time (larger early, smaller late) often stabilizes results and reduces oversaturation.
  • In C++, implement CFG by linearly combining two model outputs per step and then proceeding with your chosen sampler (DDIM or DDPM).

Prerequisites

  • →Gaussian noise and basic probability — Diffusion training and sampling add and remove Gaussian noise at controlled levels.
  • →Gradient/score interpretation — CFG relates to moving along gradients of log-probability densities.
  • →Diffusion model basics (DDPM/DDIM) — You need to understand x_t, x_0, epsilon prediction, and schedulers to apply CFG correctly.
  • →Linear algebra and vector operations — Implementing CFG requires elementwise linear combinations and norm stability checks.
  • →Parameterizations (epsilon, x0, v) — CFG must be applied consistently; conversions between parameterizations are essential.
  • →Time complexity (Big-O) — CFG roughly doubles per-step compute; understanding cost helps in system design.

Detailed Explanation


01Overview

Classifier-Free Guidance (CFG) is a technique used in diffusion models to make generated samples better match a conditioning signal (like text, class labels, or other metadata) while avoiding a separate external classifier. During training, the model is sometimes given the condition and sometimes not (by dropping it). This teaches the same network to produce both conditional and unconditional predictions. At inference time, we query the model twice at each sampling step: once with the condition (conditional) and once without (unconditional). We then combine the two predictions with a scalar guidance weight w to bias the update toward the condition. In the common epsilon-parameterization (where the model predicts the noise), the combined prediction is \(\tilde{ε} = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t)\). This simple linear combination yields strong prompt adherence in text-to-image systems and is widely used in practice (e.g., Stable Diffusion). CFG improves controllability without training a classifier for gradients, simplifying deployment and often stabilizing guidance relative to classifier-based methods. However, it increases sampling cost (two forward passes per step) and, if overused, can degrade image quality or reduce diversity.
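The training-side conditioning dropout described above can be sketched in a few lines. This is an illustrative sketch only: the dropout probability and the "null embedding" convention are common choices, not tied to any specific codebase.

```cpp
#include <random>
#include <vector>

// Illustrative sketch of conditioning dropout during training.
// With probability p_drop (commonly 0.1-0.2), replace the condition
// embedding with a "null" embedding so the same network learns both
// conditional and unconditional prediction.
std::vector<float> maybe_drop_condition(const std::vector<float>& cond_embedding,
                                        const std::vector<float>& null_embedding,
                                        float p_drop,
                                        std::mt19937& rng) {
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    if (u(rng) < p_drop) {
        return null_embedding;  // this sample trains the unconditional branch
    }
    return cond_embedding;      // this sample trains the conditional branch
}
```

At inference time, the same null embedding is fed to the model to obtain the unconditional prediction used in the CFG combination.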

02Intuition & Analogies

Imagine asking an artist to draw a scene. First, you ask them to doodle freely without instructions (unconditional). Then you give a detailed prompt and ask them to draw again (conditional). If you overlay the two sketches, the difference shows how the instructions changed the drawing. Classifier-Free Guidance uses this difference as a steering wheel: it pushes the generation in the direction that the prompt suggests. The guidance scale w is like turning the steering wheel more or less; small w subtly nudges the artwork toward the prompt, large w aggressively bends it, sometimes too much. In diffusion, each sampling step starts with a noisy canvas x_t. The model predicts how to denoise it. With the prompt, it predicts a denoising direction tailored to your request; without the prompt, it predicts a general denoising direction that just makes the sample look realistic. Subtracting the two highlights the portion of the denoising specifically due to the prompt. Adding this difference back (scaled by w) makes the step follow the conditional path more strongly. Over many steps, this accumulates to produce an output that aligns closely with the condition. The elegance is that we do not need a separate classifier to tell us how the prompt affects likelihood; the same model, trained with conditioning dropout, already knows both behaviors.

03Formal Definition

Let \(x_t\) be the noisy latent at time t and c be a conditioning signal. Suppose a diffusion model predicts a target \(y_θ(x_t, c)\). In the \(ε\)-parameterization, \(y_θ = ε_θ\) approximates the noise in \(x_t\). The unconditional prediction is \(ε_θ(x_t) := ε_θ(x_t, ∅)\), where \(∅\) denotes dropped/empty conditioning. Classifier-Free Guidance defines a guided prediction \(\tilde{y}\) as a linear combination of conditional and unconditional outputs. In epsilon space, \[ \tilde{ε}(x_t, c; w) = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t). \] Equivalently, \(\tilde{ε} = ε_θ(x_t, c) + w\,\big(ε_θ(x_t, c) - ε_θ(x_t)\big)\): the base conditional prediction plus w times the difference between conditional and unconditional directions. In score space, if \(s_θ(x_t, c)\) approximates the conditional score \(∇_{x_t} \log p(x_t \mid c)\), then CFG uses \(\tilde{s} = s_θ(x_t, c) + w\,\big(s_θ(x_t, c) - s_θ(x_t)\big)\). For other parameterizations (e.g., v-pred or \(x_0\)-pred), the same idea applies after converting to a common space (usually \(ε\)) with the standard reparameterization identities.

04When to Use

Use CFG when you have a conditional diffusion model trained with conditioning dropout (i.e., sometimes the model sees the condition, sometimes an empty token) and you want stronger adherence to the condition at inference. It is particularly effective in text-to-image generation, class-conditional image synthesis, image-to-image with prompts, and other modalities like audio or video where a descriptive condition is available. CFG is helpful when a separate classifier is impractical or unstable, since it avoids estimating \(∇_x \log p(c \mid x)\) directly. Employ CFG when you need a simple, drop-in way to improve prompt faithfulness with minimal architectural changes. Consider dynamic guidance schedules (large early, smaller late) if you observe oversaturation or detail washout at high w. If you are concerned about compute cost, batch unconditional and conditional passes together to reuse compute (e.g., shared encoder) or reduce precision to mitigate the 2× overhead. Avoid CFG if your model was not trained with conditioning dropout; the unconditional branch will be poor and guidance may misbehave.
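The batching trick mentioned above can be sketched as follows. Flat vectors stand in for stacking the unconditional and conditional inputs into one 2× batch, so a single (hypothetical) forward pass produces both outputs before the CFG combination:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the batching trick: concatenate the unconditional and
// conditional inputs into one 2x batch, run ONE forward pass over it,
// then split the output halves. One kernel launch instead of two
// improves device utilization, though FLOPs stay roughly doubled.
struct SplitOutputs {
    std::vector<float> uncond;
    std::vector<float> cond;
};

std::vector<float> concat_batch(const std::vector<float>& uncond_in,
                                const std::vector<float>& cond_in) {
    std::vector<float> batch(uncond_in);
    batch.insert(batch.end(), cond_in.begin(), cond_in.end());
    return batch;
}

SplitOutputs split_batch(const std::vector<float>& batched_out, std::size_t n) {
    return { {batched_out.begin(), batched_out.begin() + n},
             {batched_out.begin() + n, batched_out.end()} };
}
```

Keeping a fixed convention for which half is unconditional avoids the swapped-batch bug listed under Common Mistakes.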

⚠️Common Mistakes

• Using excessive guidance (very large w) can oversaturate colors, produce high-contrast artifacts, or collapse diversity. Start with moderate values (e.g., 3–9) and consider a decaying schedule.
• Mixing parameterizations incorrectly: applying CFG directly in v- or \(x_0\)-space without consistent conversions leads to bias. Safest is to convert to \(ε\), apply CFG, then convert back if needed.
• Forgetting to train with conditioning dropout (e.g., 10–20% of the time). Without a good unconditional head, the difference term is noisy and CFG becomes unstable.
• Swapping conditional and unconditional batches, or broadcasting mistakes when batching both in one forward pass (shapes misaligned). Always track which half is cond vs uncond.
• Applying CFG with the wrong timestep scaling (e.g., misusing \(α_t\), \(\bar{α}_t\), or \(σ_t\)) or mixing scheduler conventions across libraries. Keep a single, consistent scheduler API.
• Assuming CFG helps everything: it improves condition adherence but may reduce diversity; for highly creative outputs, try smaller w or add stochasticity (nonzero \(η\) in DDIM).
• Ignoring precision/NaNs when w is large: clamp intermediate norms or use dynamic thresholding to avoid exploding values late in sampling.
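The last mistake above (exploding values at large w) can be guarded against by rescaling the guided epsilon when its norm grows too far beyond the conditional prediction's norm. This is a simplified illustration of such norm guards; the ratio threshold is an assumed hyperparameter, not a standard value:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rescale eps_guided in place if its L2 norm exceeds max_ratio times
// the conditional epsilon's norm. A simplified guard against exploding
// values at large guidance scales; max_ratio is illustrative.
void clamp_guided_epsilon(std::vector<float>& eps_guided,
                          const std::vector<float>& eps_cond,
                          float max_ratio) {
    double norm_g = 0.0, norm_c = 0.0;
    for (std::size_t i = 0; i < eps_guided.size(); ++i) {
        norm_g += eps_guided[i] * eps_guided[i];
        norm_c += eps_cond[i] * eps_cond[i];
    }
    norm_g = std::sqrt(norm_g);
    norm_c = std::sqrt(norm_c);
    if (norm_c > 0.0 && norm_g > max_ratio * norm_c) {
        float scale = static_cast<float>(max_ratio * norm_c / norm_g);
        for (float& v : eps_guided) v *= scale;  // shrink back toward a safe norm
    }
}
```

Dynamic thresholding in pixel/latent value space is a related alternative that clips the reconstructed \(x_0\) instead of the epsilon.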

Key Formulas

Classifier-Free Guidance (epsilon space)

\[ \tilde{ε}(x_t, c; w) = (1+w)\,ε_θ(x_t, c) - w\,ε_θ(x_t) \]

Explanation: Combine conditional and unconditional noise predictions with guidance scale w. Larger w pushes the denoising more toward the conditional direction.

CFG in score space

\[ \tilde{s}(x_t, c; w) = s_θ(x_t, c) + w\,\big(s_θ(x_t, c) - s_θ(x_t)\big) \]

Explanation: Apply the same linear guidance to score estimates. This matches the intuition of pushing along the condition-specific score direction.

x0 from epsilon

\[ x_0 = \frac{x_t - \sqrt{1 - \bar{α}_t}\,ε}{\sqrt{\bar{α}_t}} \]

Explanation: Recover a clean estimate x0 given the predicted noise and current noise level. Used by DDPM/DDIM updates.

DDIM update (η = 0, deterministic, with \(ε_{\text{DDIM}} = \tilde{ε}\))

\[ x_{t-1} = \sqrt{\bar{α}_{t-1}}\,x_0 + \sqrt{1 - \bar{α}_{t-1}}\,ε_{\text{DDIM}} \]

Explanation: Deterministic DDIM uses the guided epsilon directly to move to the next step, often with no extra noise when eta=0.

v–epsilon conversion

\[ v = \frac{ε - σ_t\,x_t}{α_t}, \qquad ε = α_t\,v + σ_t\,x_t \]

Explanation: Relates v-prediction to epsilon. Here \(α_t = \sqrt{\bar{α}_t}\) and \(σ_t = \sqrt{1 - \bar{α}_t}\). Convert to epsilon to apply CFG consistently.

Score decomposition

\[ ∇_x \log p(x \mid c) = ∇_x \log p(x) + ∇_x \log p(c \mid x) \]

Explanation: The conditional score equals the unconditional score plus the classifier gradient. CFG implicitly approximates this difference without training a classifier.
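This identity follows directly from Bayes' rule, because \(\log p(c)\) does not depend on x and so its gradient vanishes:

```latex
\log p(x \mid c) = \log p(x) + \log p(c \mid x) - \log p(c)
\;\Longrightarrow\;
\nabla_x \log p(x \mid c) = \nabla_x \log p(x) + \nabla_x \log p(c \mid x).
```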

Generic CFG linear rule

\[ \tilde{y} = y_c + w\,(y_c - y_u) \]

Explanation: A template to apply CFG in any parameterization y (epsilon, score, x0, or v) if conversions are handled correctly.

Linear decreasing guidance schedule

\[ w_t = w_{\min} + (w_{\max} - w_{\min})\,\frac{t}{T-1} \]

Explanation: Start with strong guidance early and decrease over time to stabilize late steps. Here t counts down from T−1 to 0 during sampling, so w starts at \(w_{\max}\) and decays to \(w_{\min}\).

Noise schedule relations

\[ \bar{α}_t = \prod_{i=1}^{t} (1 - β_i), \qquad α_t = \sqrt{\bar{α}_t}, \qquad σ_t = \sqrt{1 - \bar{α}_t} \]

Explanation: Defines cumulative noise levels used in conversions between parameterizations and in sampler updates.

Compute overhead of CFG

\[ O(T \cdot C_{\text{model}}) \rightarrow O(2T \cdot C_{\text{model}}) \text{ with CFG} \]

Explanation: CFG doubles per-step model evaluations (conditional and unconditional), though batching can reduce wall-clock time.

Complexity Analysis

Let T be the number of sampling steps and C_model be the average cost of one model forward pass for a single conditioning. Without CFG, the total compute is roughly O(T · C_model). With CFG, you need two predictions per step (conditional and unconditional), so the naive cost becomes O(2T · C_model). In practice, batching the conditional and unconditional inputs together can reduce wall-clock time through better device utilization, but FLOPs remain close to doubled. Memory usage increases slightly because you must hold two outputs per step (or a 2× larger batch). If you also maintain per-step buffers for x_t, ε, and x_0, the additional memory is O(n) for n-dimensional data (e.g., latent tensors), dominated by model activations during forward passes. For samplers like DDIM with η=0, there is no extra stochastic noise to store, and intermediate tensors can be reused in-place to keep memory overhead low. If you convert between parameterizations (e.g., v ↔ ε ↔ x_0), the arithmetic cost is linear in tensor size and negligible compared to the model forward passes. Scheduling the guidance scale w_t adds only O(1) time per step. Overall, CFG's trade-off is higher compute for improved adherence to conditions; careful batching and mixed precision can mitigate runtime without changing asymptotic complexity.

Code Examples

Core CFG combination in epsilon-space
#include <iostream>
#include <string>
#include <vector>
#include <iomanip>

// Elementwise linear combination implementing CFG in epsilon space:
// eps_guided = (1 + w) * eps_cond - w * eps_uncond
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    }
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Toy epsilon predictions for a 4D latent
    std::vector<float> eps_cond   = { 0.10f, -0.05f, 0.20f, 0.00f }; // with prompt c
    std::vector<float> eps_uncond = { 0.02f, -0.01f, 0.05f, 0.01f }; // with empty prompt

    float w = 7.5f; // typical strong guidance

    std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

    print_vec(eps_cond, "eps_cond");
    print_vec(eps_uncond, "eps_uncond");
    print_vec(eps_guided, "eps_guided (CFG)");

    return 0;
}

This program demonstrates the core arithmetic of Classifier-Free Guidance in epsilon space. It linearly combines conditional and unconditional noise predictions using the guidance scale w. In real systems, eps_cond and eps_uncond are produced by the same neural network queried with and without the conditioning input.

Time: O(n) per vector (n elements), negligible compared to model inference. Space: O(n) for the output vector.
DDIM sampling loop with CFG and a toy model
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <iomanip>

// A tiny toy "model" that deterministically produces an epsilon prediction
// from x_t, a timestep t, and a condition id (0=uncond, 1=cond). In practice,
// this would be a deep network.
struct ToyModel {
    // Deterministic pseudo-noise based on x_t, t, and cond_id
    std::vector<float> predict_epsilon(const std::vector<float>& x_t, int t, int cond_id) const {
        std::vector<float> eps(x_t.size());
        float a = 0.12f + 0.01f * cond_id; // slight shift with condition
        float b = 0.007f * (t + 1);
        float c = 0.21f * cond_id;
        for (size_t i = 0; i < x_t.size(); ++i) {
            float xi = x_t[i];
            eps[i] = std::sin(a * xi + b) + 0.1f * std::cos(0.3f * (i + 1) + c);
        }
        return eps;
    }
};

// Guidance scale schedule: start high, end lower (linear decrease)
float guidance_scale(int step_k, int T, float w_max = 7.5f, float w_min = 2.0f) {
    if (T <= 1) return w_max;
    float ratio = static_cast<float>(step_k) / static_cast<float>(T - 1); // 0..1
    return w_max - (w_max - w_min) * ratio;
}

// Compute x0 from x_t and eps in epsilon parameterization:
// x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
void compute_x0(const std::vector<float>& x_t, const std::vector<float>& eps,
                float sqrt_abar_t, std::vector<float>& x0_out) {
    float sqrt_one_minus = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_t * sqrt_abar_t));
    size_t n = x_t.size();
    x0_out.resize(n);
    for (size_t i = 0; i < n; ++i) {
        x0_out[i] = (x_t[i] - sqrt_one_minus * eps[i]) / sqrt_abar_t;
    }
}

// DDIM eta=0 deterministic update:
// x_{t-1} = sqrt(abar_{t-1}) * x0 + sqrt(1 - abar_{t-1}) * eps_guided
void ddim_update(std::vector<float>& x_t, const std::vector<float>& x0,
                 const std::vector<float>& eps_guided, float sqrt_abar_prev) {
    float sqrt_one_minus_prev = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_prev * sqrt_abar_prev));
    size_t n = x_t.size();
    for (size_t i = 0; i < n; ++i) {
        x_t[i] = sqrt_abar_prev * x0[i] + sqrt_one_minus_prev * eps_guided[i];
    }
}

// CFG combination in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

// Create a simple linear beta schedule and precompute sqrt(abar_t)
std::vector<float> make_sqrt_abar_schedule(int T, float beta_start = 0.0001f, float beta_end = 0.02f) {
    std::vector<float> betas(T);
    for (int t = 0; t < T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (static_cast<float>(t) / std::max(1, T - 1));
    }
    std::vector<float> abar(T);
    float prod = 1.0f;
    for (int t = 0; t < T; ++t) {
        prod *= (1.0f - betas[t]);
        abar[t] = prod;
    }
    std::vector<float> sqrt_abar(T);
    for (int t = 0; t < T; ++t) sqrt_abar[t] = std::sqrt(std::max(0.0f, abar[t]));
    return sqrt_abar;
}

int main() {
    const int T = 20;     // number of sampling steps (small for demo)
    const size_t dim = 8; // toy latent dimension
    ToyModel model;

    // Initialize x_T ~ N(0, I)
    std::mt19937 rng(42);
    std::normal_distribution<float> nd(0.0f, 1.0f);
    std::vector<float> x_t(dim);
    for (size_t i = 0; i < dim; ++i) x_t[i] = nd(rng);

    // Precompute sqrt(abar_t)
    std::vector<float> sqrt_abar = make_sqrt_abar_schedule(T);

    auto l2 = [](const std::vector<float>& v){ double s = 0; for (float x : v) s += x * x; return std::sqrt(s); };

    std::cout << "Initial ||x_T||2 = " << l2(x_t) << "\n";

    // DDIM with CFG
    std::vector<float> x0(dim);
    for (int step = 0; step < T; ++step) {
        int t = T - 1 - step; // current timestep index (descending)

        // Predict eps with and without condition (cond_id: 1=cond, 0=uncond)
        std::vector<float> eps_cond   = model.predict_epsilon(x_t, t, /*cond_id=*/1);
        std::vector<float> eps_uncond = model.predict_epsilon(x_t, t, /*cond_id=*/0);

        // Guidance scale schedule
        float w = guidance_scale(step, T, /*w_max=*/7.5f, /*w_min=*/2.0f);

        // Combine via CFG
        std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

        // Reconstruct x0 from guided epsilon
        compute_x0(x_t, eps_guided, /*sqrt_abar_t=*/sqrt_abar[t], x0);

        // Move to x_{t-1}
        float sqrt_abar_prev = (t > 0) ? sqrt_abar[t - 1] : 1.0f; // at t=0, abar_prev=1 makes the update return x0
        ddim_update(x_t, x0, eps_guided, sqrt_abar_prev);
    }

    std::cout << "Final ||x_0||2 = " << l2(x_t) << "\n";
    std::cout << "(Toy demo complete; in real use, decode latent x_0 to output)\n";
    return 0;
}

This end-to-end toy sampler demonstrates DDIM with CFG. We generate an initial Gaussian latent, run T denoising steps, and at each step evaluate the toy model twice (with and without condition), combine the predictions using CFG, reconstruct x0, and update x_{t-1}. The guidance scale decreases linearly to reduce oversaturation late in sampling.

Time: O(T · (C_model + n)) ≈ O(2T · C_model) since two model calls dominate. Space: O(n) for storing x_t, eps, and x0 (model activations not modeled here).
Applying CFG correctly with v-prediction via epsilon conversion
#include <iostream>
#include <string>
#include <vector>
#include <cmath>
#include <iomanip>

// Convert between v and epsilon parameterizations.
// alpha = sqrt(abar), sigma = sqrt(1 - abar)
void v_to_epsilon(const std::vector<float>& v, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& eps_out) {
    size_t n = v.size();
    eps_out.resize(n);
    for (size_t i = 0; i < n; ++i) eps_out[i] = alpha * v[i] + sigma * x_t[i];
}

void epsilon_to_v(const std::vector<float>& eps, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& v_out) {
    size_t n = eps.size();
    v_out.resize(n);
    for (size_t i = 0; i < n; ++i) v_out[i] = (eps[i] - sigma * x_t[i]) / alpha;
}

// CFG in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Example v-pred outputs (e.g., from a v-parameterized model)
    std::vector<float> v_cond   = {0.05f, -0.10f, 0.02f};
    std::vector<float> v_uncond = {0.01f, -0.02f, 0.00f};

    // Current noisy latent x_t and schedule values
    std::vector<float> x_t = {0.7f, -0.3f, 0.1f};
    float abar  = 0.8f; // example cumulative alpha
    float alpha = std::sqrt(abar);
    float sigma = std::sqrt(1.0f - abar);

    // Convert both to epsilon, apply CFG, then convert back to v if the sampler expects v
    std::vector<float> eps_c, eps_u, eps_guided, v_guided;
    v_to_epsilon(v_cond, x_t, alpha, sigma, eps_c);
    v_to_epsilon(v_uncond, x_t, alpha, sigma, eps_u);

    float w = 6.0f;
    eps_guided = cfg_epsilon(eps_c, eps_u, w);

    epsilon_to_v(eps_guided, x_t, alpha, sigma, v_guided);

    print_vec(v_cond, "v_cond");
    print_vec(v_uncond, "v_uncond");
    print_vec(v_guided, "v_guided (after CFG via epsilon)");

    return 0;
}

Many modern diffusion models predict v instead of epsilon. This example shows how to convert v to epsilon, apply CFG in epsilon space (recommended), and convert the guided epsilon back to v for samplers that expect v. This avoids parameterization mismatch errors.

Time: O(n) for conversions and combination. Space: O(n) for intermediate vectors.
#classifier-free guidance#diffusion models#epsilon prediction#v-prediction#ddim#ddpm#guidance scale#conditional sampling#text-to-image#score function#noise schedule#x0 reconstruction#dynamic thresholding#batching trick