Classifier-Free Guidance
Key Points
- Classifier-Free Guidance (CFG) steers diffusion sampling toward a condition (such as a text prompt) without needing a separate classifier.
- It combines a conditional prediction with an unconditional one using a guidance scale w to push samples toward the condition.
- The core formula in the epsilon-parameterization is \(\tilde{\varepsilon} = (1+w)\,\varepsilon_\theta(x_t, c) - w\,\varepsilon_\theta(x_t)\).
- CFG usually improves prompt adherence but can reduce diversity and cause artifacts if w is too large.
- You pay roughly 2× compute per step because you must run the model once with the prompt and once without it (unless you batch or fuse them).
- CFG can be applied in different parameterizations (\(\varepsilon\)-pred, \(x_0\)-pred, or v-pred), but you must convert consistently.
- Scheduling w over time (larger early, smaller late) often stabilizes results and reduces oversaturation.
- In C++, implement CFG by linearly combining two model outputs per step and then proceeding with your chosen sampler.
Prerequisites
- Gaussian noise and basic probability — Diffusion training and sampling add and remove Gaussian noise at controlled levels.
- Gradient/score interpretation — CFG relates to moving along gradients of log-probability densities.
- Diffusion model basics (DDPM/DDIM) — You need to understand x_t, x_0, epsilon prediction, and schedulers to apply CFG correctly.
- Linear algebra and vector operations — Implementing CFG requires elementwise linear combinations and norm stability checks.
- Parameterizations (epsilon, x0, v) — CFG must be applied consistently; conversions between parameterizations are essential.
- Time complexity (Big-O) — CFG roughly doubles per-step compute; understanding cost helps in system design.
Detailed Explanation
01 Overview
Classifier-Free Guidance (CFG) is a technique used in diffusion models to make generated samples better match a conditioning signal (like text, class labels, or other metadata) while avoiding a separate external classifier. During training, the model is sometimes given the condition and sometimes not (by dropping it). This teaches the same network to produce both conditional and unconditional predictions. At inference time, we query the model twice at each sampling step: once with the condition (conditional) and once without (unconditional). We then combine the two predictions with a scalar guidance weight w to bias the update toward the condition. In the common epsilon-parameterization (where the model predicts the noise), the combined prediction is \(\tilde{\varepsilon} = (1+w)\,\varepsilon_\theta(x_t,c) - w\,\varepsilon_\theta(x_t)\). This simple linear combination yields strong prompt adherence in text-to-image systems and is widely used in practice (e.g., Stable Diffusion). CFG improves controllability without training a classifier for gradients, simplifying deployment and often stabilizing guidance relative to classifier-based methods. However, it increases sampling cost (two forward passes per step) and, if overused, can degrade image quality or reduce diversity.
02 Intuition & Analogies
Imagine asking an artist to draw a scene. First, you ask them to doodle freely without instructions (unconditional). Then you give a detailed prompt and ask them to draw again (conditional). If you overlay the two sketches, the difference shows how the instructions changed the drawing. Classifier-Free Guidance uses this difference as a steering wheel: it pushes the generation in the direction that the prompt suggests. The guidance scale w is like turning the steering wheel more or less; small w subtly nudges the artwork toward the prompt, large w aggressively bends it, sometimes too much. In diffusion, each sampling step starts with a noisy canvas x_t. The model predicts how to denoise it. With the prompt, it predicts a denoising direction tailored to your request; without the prompt, it predicts a general denoising direction that just makes the sample look realistic. Subtracting the two highlights the portion of the denoising specifically due to the prompt. Adding this difference back (scaled by w) makes the step follow the conditional path more strongly. Over many steps, this accumulates to produce an output that aligns closely with the condition. The elegance is that we do not need a separate classifier to tell us how the prompt affects likelihood; the same model, trained with conditioning dropout, already knows both behaviors.
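The steering-wheel intuition maps directly onto algebra: the CFG rule can be rewritten as \(\tilde{\varepsilon} = \varepsilon_\theta(x_t,c) + w\,(\varepsilon_\theta(x_t,c) - \varepsilon_\theta(x_t))\), i.e. start from the conditional prediction and push further along the prompt-specific difference. A minimal sketch (per-element scalar view, toy values) confirming the two forms agree:

```cpp
#include <cmath>

// Standard CFG form: (1 + w) * eps_cond - w * eps_uncond
float cfg_standard(float eps_cond, float eps_uncond, float w) {
    return (1.0f + w) * eps_cond - w * eps_uncond;
}

// "Steering" form: start at the conditional prediction and push further
// along the prompt-specific difference direction (eps_cond - eps_uncond)
float cfg_steering(float eps_cond, float eps_uncond, float w) {
    return eps_cond + w * (eps_cond - eps_uncond);
}
```

Both forms are used interchangeably in implementations; the difference-based form makes the "steering" interpretation explicit.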
03 Formal Definition
Let \(\varepsilon_\theta(x_t, c)\) be a noise-prediction network trained with conditioning dropout: during training, the condition \(c\) is replaced by a null token \(\varnothing\) with some probability (commonly 10–20%), so the same network learns both the conditional prediction \(\varepsilon_\theta(x_t, c)\) and the unconditional prediction \(\varepsilon_\theta(x_t) \equiv \varepsilon_\theta(x_t, \varnothing)\). At inference, given a guidance scale \(w \ge 0\), the guided prediction is \(\tilde{\varepsilon} = (1+w)\,\varepsilon_\theta(x_t, c) - w\,\varepsilon_\theta(x_t)\), and \(\tilde{\varepsilon}\) replaces the raw model output in the chosen sampler (DDPM, DDIM, etc.). Setting \(w = 0\) recovers ordinary conditional sampling; larger \(w\) amplifies the condition-specific component. Note that some papers use the equivalent convention \(\tilde{\varepsilon} = \varepsilon_\theta(x_t) + s\,(\varepsilon_\theta(x_t,c) - \varepsilon_\theta(x_t))\) with \(s = 1 + w\), so reported guidance-scale values are offset by one between conventions.
04 When to Use
Use CFG when you have a conditional diffusion model trained with conditioning dropout (i.e., sometimes the model sees the condition, sometimes an empty token) and you want stronger adherence to the condition at inference. It is particularly effective in text-to-image generation, class-conditional image synthesis, image-to-image with prompts, and other modalities like audio or video where a descriptive condition is available. CFG is helpful when a separate classifier is impractical or unstable, since it avoids estimating \(\nabla_x \log p(c\mid x)\) directly. Employ CFG when you need a simple, drop-in way to improve prompt faithfulness with minimal architectural changes. Consider dynamic guidance schedules (large early, smaller late) if you observe oversaturation or detail washout at high w. If you are concerned about compute cost, batch unconditional and conditional passes together to reuse compute (e.g., shared encoder) or reduce precision to mitigate the 2× overhead. Avoid CFG if your model was not trained with conditioning dropout; the unconditional branch will be poor and guidance may misbehave.
⚠️ Common Mistakes
- Using excessive guidance (very large w) can oversaturate colors, produce high-contrast artifacts, or collapse diversity. Start with moderate values (e.g., 3–9) and consider a decaying schedule.
- Mixing parameterizations incorrectly: applying CFG directly in v- or x_0-space without consistent conversions leads to bias. Safest is to convert to \(\varepsilon\), apply CFG, then convert back if needed.
- Forgetting to train with conditioning dropout (e.g., 10–20% of the time). Without a good unconditional head, the difference term is noisy and CFG becomes unstable.
- Swapping conditional and unconditional batches, or broadcasting mistakes when batching both in one forward pass (shapes misaligned). Always track which half is cond vs. uncond.
- Applying CFG with the wrong timestep scaling (e.g., misusing \(\alpha_t\), \(\bar{\alpha}_t\), or \(\sigma_t\)) or mixing scheduler conventions across libraries. Keep a single, consistent scheduler API.
- Assuming CFG helps everything: it improves condition adherence but may reduce diversity; for highly creative outputs, try smaller w or add stochasticity (nonzero \(\eta\) in DDIM).
- Ignoring precision/NaNs when w is large: clamp intermediate norms or use dynamic thresholding to avoid exploding values late in sampling.
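For the precision/NaN point above, a simple safeguard is to rescale the guided epsilon whenever its norm grows too large. This is a minimal sketch of L2-norm clamping, not a full dynamic-thresholding implementation; the threshold `max_norm` is an assumed tuning parameter:

```cpp
#include <cmath>
#include <vector>

// Rescale eps_guided in place if its L2 norm exceeds max_norm.
// A cheap guard against exploding values at large w; dynamic thresholding
// (percentile-based clipping of the x0 estimate) is a related alternative.
void clamp_norm(std::vector<float>& eps_guided, float max_norm) {
    double sq = 0.0;
    for (float x : eps_guided) sq += static_cast<double>(x) * x;
    float norm = static_cast<float>(std::sqrt(sq));
    if (norm > max_norm && norm > 0.0f) {
        float scale = max_norm / norm;
        for (float& x : eps_guided) x *= scale;
    }
}
```

Apply it to `eps_guided` right after the CFG combination, before the sampler update.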
Key Formulas
Classifier-Free Guidance (epsilon space)
\(\tilde{\varepsilon} = (1+w)\,\varepsilon_\theta(x_t, c) - w\,\varepsilon_\theta(x_t)\)
Explanation: Combine conditional and unconditional noise predictions with guidance scale w. Larger w pushes the denoising more toward the conditional direction.
CFG in score space
\(\tilde{s}(x_t) = (1+w)\,s_\theta(x_t, c) - w\,s_\theta(x_t)\)
Explanation: Apply the same linear guidance to score estimates. This matches the intuition of pushing along the condition-specific score direction.
x0 from epsilon
\(\hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon}{\sqrt{\bar{\alpha}_t}}\)
Explanation: Recover a clean estimate x0 given the predicted noise and current noise level. Used by DDPM/DDIM updates.
DDIM update (deterministic when \(\eta = 0\), using \(\varepsilon_{\text{DDIM}} = \tilde{\varepsilon}\))
\(x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\tilde{\varepsilon}\)
Explanation: Deterministic DDIM uses the guided epsilon directly to move to the next step, adding no extra noise when \(\eta = 0\).
v–epsilon conversion
\(\varepsilon = \alpha_t\,v + \sigma_t\,x_t, \qquad v = (\varepsilon - \sigma_t\,x_t)/\alpha_t\)
Explanation: Relates v-prediction to epsilon. Here \(\alpha_t = \sqrt{\bar{\alpha}_t}\) and \(\sigma_t = \sqrt{1-\bar{\alpha}_t}\). Convert to epsilon to apply CFG consistently.
Score decomposition
\(\nabla_{x_t}\log p(x_t \mid c) = \nabla_{x_t}\log p(x_t) + \nabla_{x_t}\log p(c \mid x_t)\)
Explanation: The conditional score equals the unconditional score plus the classifier gradient. CFG implicitly approximates this difference without training a classifier.
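Substituting this decomposition into the score-space CFG rule makes explicit what guidance implicitly targets: the unconditional score plus an amplified classifier-gradient term, i.e. sampling from a density proportional to \(p(x_t)\,p(c\mid x_t)^{1+w}\). A short derivation:

```latex
% Substitute the score decomposition into the guided score:
\begin{align*}
\tilde{s}(x_t)
  &= (1+w)\,\nabla_{x_t}\log p(x_t \mid c) - w\,\nabla_{x_t}\log p(x_t) \\
  &= \nabla_{x_t}\log p(x_t) + (1+w)\,\nabla_{x_t}\log p(c \mid x_t) \\
  &= \nabla_{x_t}\log\!\left[\, p(x_t)\, p(c \mid x_t)^{\,1+w} \right].
\end{align*}
```

The exponent \(1+w\) on the classifier term explains why large w sharpens the conditional distribution at the cost of diversity.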
Generic CFG linear rule
\(\tilde{y} = (1+w)\,y_{\text{cond}} - w\,y_{\text{uncond}}\)
Explanation: A template to apply CFG in any parameterization y (epsilon, score, x0, or v) if conversions are handled correctly.
Linear decreasing guidance schedule
\(w(t) = w_{\min} + (w_{\max} - w_{\min})\,\dfrac{t}{T-1}\)
Explanation: Start with strong guidance early and decrease over time to stabilize late steps. t counts down from T-1 to 0 during sampling, so w falls from \(w_{\max}\) to \(w_{\min}\).
Noise schedule relations
\(\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s), \qquad \alpha_t = \sqrt{\bar{\alpha}_t}, \qquad \sigma_t = \sqrt{1-\bar{\alpha}_t}\)
Explanation: Defines cumulative noise levels used in conversions between parameterizations and in sampler updates.
Compute overhead of CFG
\(\text{NFE}_{\text{CFG}} = 2T\)
Explanation: CFG doubles per-step model evaluations (conditional and unconditional), so T sampling steps cost 2T forward passes; batching can reduce wall-clock time.
Complexity Analysis
Per sampling step, CFG requires two network evaluations (conditional and unconditional) instead of one, so total sampling cost is \(O(2\,T\,C)\) for \(T\) steps, where \(C\) is the cost of one forward pass. The CFG linear combination itself is only \(O(d)\) per step for a d-dimensional latent, which is negligible next to the network. Memory overhead is one extra prediction buffer, or a doubled batch if both passes are fused into a single forward pass.
Code Examples
```cpp
#include <iostream>
#include <string>
#include <vector>
#include <iomanip>

// Elementwise linear combination implementing CFG in epsilon space:
// eps_guided = (1 + w) * eps_cond - w * eps_uncond
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    }
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Toy epsilon predictions for a 4D latent
    std::vector<float> eps_cond   = { 0.10f, -0.05f, 0.20f, 0.00f }; // with prompt c
    std::vector<float> eps_uncond = { 0.02f, -0.01f, 0.05f, 0.01f }; // with empty prompt

    float w = 7.5f; // typical strong guidance

    std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

    print_vec(eps_cond, "eps_cond");
    print_vec(eps_uncond, "eps_uncond");
    print_vec(eps_guided, "eps_guided (CFG)");

    return 0;
}
```
This program demonstrates the core arithmetic of Classifier-Free Guidance in epsilon space. It linearly combines conditional and unconditional noise predictions using the guidance scale w. In real systems, eps_cond and eps_uncond are produced by the same neural network queried with and without the conditioning input.
```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// A tiny toy "model" that deterministically produces an epsilon prediction
// from x_t, a timestep t, and a condition id (0=uncond, 1=cond). In practice,
// this would be a deep network.
struct ToyModel {
    // Deterministic pseudo-noise based on x_t, t, and cond_id
    std::vector<float> predict_epsilon(const std::vector<float>& x_t, int t, int cond_id) const {
        std::vector<float> eps(x_t.size());
        float a = 0.12f + 0.01f * cond_id; // slight shift with condition
        float b = 0.007f * (t + 1);
        float c = 0.21f * cond_id;
        for (size_t i = 0; i < x_t.size(); ++i) {
            float xi = x_t[i];
            eps[i] = std::sin(a * xi + b) + 0.1f * std::cos(0.3f * (i + 1) + c);
        }
        return eps;
    }
};

// Guidance scale schedule: start high, end lower (linear decrease)
float guidance_scale(int step_k, int T, float w_max = 7.5f, float w_min = 2.0f) {
    if (T <= 1) return w_max;
    float ratio = static_cast<float>(step_k) / static_cast<float>(T - 1); // 0..1
    return w_max - (w_max - w_min) * ratio;
}

// Compute x0 from x_t and eps in epsilon parameterization:
// x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
void compute_x0(const std::vector<float>& x_t, const std::vector<float>& eps,
                float sqrt_abar_t, std::vector<float>& x0_out) {
    float sqrt_one_minus = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_t * sqrt_abar_t));
    size_t n = x_t.size();
    x0_out.resize(n);
    for (size_t i = 0; i < n; ++i) {
        x0_out[i] = (x_t[i] - sqrt_one_minus * eps[i]) / sqrt_abar_t;
    }
}

// DDIM eta=0 deterministic update:
// x_{t-1} = sqrt(abar_{t-1}) * x0 + sqrt(1 - abar_{t-1}) * eps_guided
void ddim_update(std::vector<float>& x_t, const std::vector<float>& x0,
                 const std::vector<float>& eps_guided, float sqrt_abar_prev) {
    float sqrt_one_minus_prev = std::sqrt(std::max(0.0f, 1.0f - sqrt_abar_prev * sqrt_abar_prev));
    size_t n = x_t.size();
    for (size_t i = 0; i < n; ++i) {
        x_t[i] = sqrt_abar_prev * x0[i] + sqrt_one_minus_prev * eps_guided[i];
    }
}

// CFG combination in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

// Create a simple linear beta schedule and precompute sqrt(abar_t)
std::vector<float> make_sqrt_abar_schedule(int T, float beta_start = 0.0001f, float beta_end = 0.02f) {
    std::vector<float> betas(T);
    for (int t = 0; t < T; ++t) {
        betas[t] = beta_start + (beta_end - beta_start) * (static_cast<float>(t) / std::max(1, T - 1));
    }
    std::vector<float> abar(T);
    float prod = 1.0f;
    for (int t = 0; t < T; ++t) {
        prod *= (1.0f - betas[t]);
        abar[t] = prod;
    }
    std::vector<float> sqrt_abar(T);
    for (int t = 0; t < T; ++t) sqrt_abar[t] = std::sqrt(std::max(0.0f, abar[t]));
    return sqrt_abar;
}

int main() {
    const int T = 20;     // number of sampling steps (small for demo)
    const size_t dim = 8; // toy latent dimension
    ToyModel model;

    // Initialize x_T ~ N(0, I)
    std::mt19937 rng(42);
    std::normal_distribution<float> nd(0.0f, 1.0f);
    std::vector<float> x_t(dim);
    for (size_t i = 0; i < dim; ++i) x_t[i] = nd(rng);

    // Precompute sqrt(abar_t)
    std::vector<float> sqrt_abar = make_sqrt_abar_schedule(T);

    auto l2 = [](const std::vector<float>& v) {
        double s = 0;
        for (float x : v) s += x * x;
        return std::sqrt(s);
    };

    std::cout << "Initial ||x_T||2 = " << l2(x_t) << "\n";

    // DDIM with CFG
    std::vector<float> x0(dim);
    for (int step = 0; step < T; ++step) {
        int t = T - 1 - step; // current timestep index (descending)

        // Predict eps with and without condition (cond_id: 1=cond, 0=uncond)
        std::vector<float> eps_cond   = model.predict_epsilon(x_t, t, /*cond_id=*/1);
        std::vector<float> eps_uncond = model.predict_epsilon(x_t, t, /*cond_id=*/0);

        // Guidance scale schedule
        float w = guidance_scale(step, T, /*w_max=*/7.5f, /*w_min=*/2.0f);

        // Combine via CFG
        std::vector<float> eps_guided = cfg_epsilon(eps_cond, eps_uncond, w);

        // Reconstruct x0 from guided epsilon
        compute_x0(x_t, eps_guided, /*sqrt_abar_t=*/sqrt_abar[t], x0);

        // Move to x_{t-1}; at t=0, sqrt_abar_prev = 1 so the update reduces to x_t = x0
        float sqrt_abar_prev = (t > 0) ? sqrt_abar[t - 1] : 1.0f;
        ddim_update(x_t, x0, eps_guided, sqrt_abar_prev);
    }

    std::cout << "Final ||x_0||2 = " << l2(x_t) << "\n";
    std::cout << "(Toy demo complete; in real use, decode latent x_0 to output)\n";
    return 0;
}
```
This end-to-end toy sampler demonstrates DDIM with CFG. We generate an initial Gaussian latent, run T denoising steps, and at each step evaluate the toy model twice (with and without condition), combine the predictions using CFG, reconstruct x0, and update x_{t-1}. The guidance scale decreases linearly to reduce oversaturation late in sampling.
```cpp
#include <cmath>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// Convert between v and epsilon parameterizations.
// alpha = sqrt(abar), sigma = sqrt(1 - abar)
void v_to_epsilon(const std::vector<float>& v, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& eps_out) {
    size_t n = v.size();
    eps_out.resize(n);
    for (size_t i = 0; i < n; ++i) eps_out[i] = alpha * v[i] + sigma * x_t[i];
}

void epsilon_to_v(const std::vector<float>& eps, const std::vector<float>& x_t,
                  float alpha, float sigma, std::vector<float>& v_out) {
    size_t n = eps.size();
    v_out.resize(n);
    for (size_t i = 0; i < n; ++i) v_out[i] = (eps[i] - sigma * x_t[i]) / alpha;
}

// CFG in epsilon space
std::vector<float> cfg_epsilon(const std::vector<float>& eps_cond,
                               const std::vector<float>& eps_uncond,
                               float w) {
    size_t n = eps_cond.size();
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = (1.0f + w) * eps_cond[i] - w * eps_uncond[i];
    return out;
}

void print_vec(const std::vector<float>& v, const std::string& name) {
    std::cout << name << ": [";
    for (size_t i = 0; i < v.size(); ++i) {
        std::cout << std::fixed << std::setprecision(3) << v[i];
        if (i + 1 < v.size()) std::cout << ", ";
    }
    std::cout << "]\n";
}

int main() {
    // Example v-pred outputs (e.g., from a v-parameterized model)
    std::vector<float> v_cond   = {0.05f, -0.10f, 0.02f};
    std::vector<float> v_uncond = {0.01f, -0.02f, 0.00f};

    // Current noisy latent x_t and schedule values
    std::vector<float> x_t = {0.7f, -0.3f, 0.1f};
    float abar  = 0.8f; // example cumulative alpha
    float alpha = std::sqrt(abar);
    float sigma = std::sqrt(1.0f - abar);

    // Convert both to epsilon, apply CFG, then convert back to v if sampler expects v
    std::vector<float> eps_c, eps_u, eps_guided, v_guided;
    v_to_epsilon(v_cond, x_t, alpha, sigma, eps_c);
    v_to_epsilon(v_uncond, x_t, alpha, sigma, eps_u);

    float w = 6.0f;
    eps_guided = cfg_epsilon(eps_c, eps_u, w);

    epsilon_to_v(eps_guided, x_t, alpha, sigma, v_guided);

    print_vec(v_cond, "v_cond");
    print_vec(v_uncond, "v_uncond");
    print_vec(v_guided, "v_guided (after CFG via epsilon)");

    return 0;
}
```
Many modern diffusion models predict v instead of epsilon. This example shows how to convert v to epsilon, apply CFG in epsilon space (recommended), and convert the guided epsilon back to v for samplers that expect v. This avoids parameterization mismatch errors.