
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Intermediate
Hanyang Wang, Yiyang Liu, Jiawei Chi et al. · 3/3/2026
arXiv

Key Summary

  • This paper turns a popular image-guidance trick (Classifier-Free Guidance) into a feedback-control problem, just like keeping a car steady in its lane.
  • It shows that standard CFG is basically a simple "turn the wheel in proportion to the error" controller (P-control), which can wobble or overshoot at high guidance scales.
  • The authors design SMC-CFG, a Sliding Mode Control version that pulls the model’s trajectory onto a safe lane (a sliding surface) and keeps it there.
  • They use an error signal between conditional and unconditional predictions and add a switching correction that quickly reduces this error.
  • A Lyapunov-style energy analysis shows the method converges in finite time, meaning the error must reach near-zero instead of bouncing around.
  • Across Stable Diffusion 3.5, Flux, and Qwen-Image, SMC-CFG gives better text-image matching, cleaner details, and fewer artifacts, especially when guidance is strong.
  • It stays robust over a wide range of guidance scales, letting users turn guidance up without wrecking image quality.
  • The method adds almost no extra compute cost and keeps inference speed nearly the same as standard CFG.
  • It also transfers to text-to-video, improving temporal stability and semantic consistency in motion.
  • Overall, the work unifies many CFG tricks under one control-theory view and then upgrades them with a robust, nonlinear controller.

Why This Research Matters

Better guidance means you can ask for precise things—like exact positions, colors, or readable text—and actually get them without ugly artifacts. Artists and designers save time by turning guidance up without wrecking the look, which speeds iteration and improves quality. Developers gain a principled, plug-in controller that works across popular models with almost no extra compute. For video and 3D, steadier guidance also means more consistent objects over time and space, reducing flicker and drift. In domains like education, accessibility, and prototyping, more faithful images make communication clearer. The control-theory view also unifies many existing tricks, helping future research build stronger, safer generative tools.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine steering a bike on a windy day. Gentle nudges keep you straight, but if the wind gets strong and you turn the handle too much, you can wobble or even fall. AI image makers face a similar problem when they try to strongly follow a text prompt.

🥬 Filling (The Actual Concept)

  • What it is: Modern image generators (diffusion and flow-matching models) are systems that slowly transform random noise into a picture that matches a caption.
  • How it works (like a recipe):
    1. Start with random noise (like TV static).
    2. A learned “velocity field” tells the system which tiny step to take so the picture gets a bit clearer and more on-topic.
    3. Repeat many small steps until the noise becomes a detailed image.
  • Why it matters: Without careful steering, the model can drift off-topic, become too colorful, bend shapes, or ignore important words in the prompt.

🍞 Bottom Bread (Anchor): Think of sculpting from a block of marble: each careful chip reveals the statue inside. The velocity field is the tool guiding each chip so the final statue matches your idea.
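The noise-to-image recipe above can be sketched in a few lines of Python (a toy illustration, not a real diffusion model; `velocity`, `sample`, and the target vector are all made up for the demo):

```python
import numpy as np

# Toy illustration of iterative denoising (NOT a real diffusion model):
# a velocity field that always points from the current latent x toward
# a fixed target, integrated with many small Euler steps.
def velocity(x, x_goal):
    return x_goal - x                      # direction of improvement at x

def sample(x_goal, steps=100, dt=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_goal.shape)  # start from pure noise ("TV static")
    for _ in range(steps):
        x = x + dt * velocity(x, x_goal)   # one tiny step toward clarity
    return x

x_goal = np.array([1.0, -2.0, 0.5])        # stand-in for "the finished image"
x_final = sample(x_goal)                   # noise gradually becomes the target
```

One step barely moves the latent; only the accumulation of many small steps turns static into the target, which is the point of the sculpting analogy.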

🍞 Top Bread (Hook): You know how you check your homework by comparing your answer to the teacher’s hint? That comparison tells you what to fix.

🥬 Filling (The Actual Concept)

  • What it is: Semantic alignment is making sure the final image truly matches the meaning of the text.
  • How it works:
    1. Read the prompt (e.g., “a red bus on the left of a tree”).
    2. Build an image where color, objects, and positions match the text.
    3. Keep checking and adjusting until the details line up.
  • Why it matters: If alignment is weak, you might get the right bus but the wrong color or the bus appears on the wrong side.

🍞 Bottom Bread (Anchor): Asking for “two red roses and three white lilies” should give exactly that—not pink roses, not lilies mixed up, and not the wrong numbers.

🍞 Top Bread (Hook): Picture talking to two friends: one gives a generic answer, the other tailors it to your question. Comparing them shows what your question really adds.

🥬 Filling (The Actual Concept)

  • What it is: Classifier-Free Guidance (CFG) mixes two model predictions: one that ignores the text (unconditional) and one that uses it (conditional).
  • How it works:
    1. Ask the model with the prompt (conditional) and without it (unconditional).
    2. Subtract to get the “what the text adds” part (an error signal).
    3. Add some multiple of that back into the update to steer the image.
  • Why it matters: Without CFG, the model may miss important words. With too much, it can overshoot and look unnatural.

🍞 Bottom Bread (Anchor): It’s like tasting soup, then adding salt based on the difference between “plain” and “spiced.” Add a little: tasty; add too much: yikes.
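The soup-tasting blend can be written as a minimal sketch (illustrative 2-D numbers; real models predict high-dimensional velocity fields):

```python
import numpy as np

# Minimal sketch of the standard CFG blend (toy numbers, not a real model).
def cfg_blend(v_cond, v_uncond, w):
    e = v_cond - v_uncond        # "what the text adds" (the error signal)
    return v_uncond + w * e      # steer the update along that direction

v_cond   = np.array([2.0, 1.0])  # prediction with the prompt
v_uncond = np.array([1.6, 0.5])  # prediction without it
mild   = cfg_blend(v_cond, v_uncond, w=1.0)  # w = 1 just recovers v_cond
strong = cfg_blend(v_cond, v_uncond, w=5.0)  # a large w pushes far past it
```

With w = 5 the blend lands at [3.1 + 0.5, 2.5 + 0.5] territory well beyond v_cond itself, which is exactly the "too much salt" overshoot the section describes.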

🍞 Top Bread (Hook): If you slam a door too hard, it bounces back—overcorrecting can make things unstable.

🥬 Filling (The Actual Concept)

  • What it is: Guidance scale is the knob that decides how strongly the text steers the image.
  • How it works:
    1. Small scale: gentle nudge toward the prompt.
    2. Big scale: strong push that can overshoot.
  • Why it matters: Too strong often causes wobbles—blown-out colors, warped shapes, and artifacts.

🍞 Bottom Bread (Anchor): Turning a radio’s volume a little can be nice; turning it to max can distort the sound.

🍞 Top Bread (Hook): Ever kept your balance on a skateboard by watching how far and how fast you’re tipping, then moving just enough? That’s control.

🥬 Filling (The Actual Concept)

  • What it is: Control theory studies how to use feedback (what’s wrong now) to steer systems to where we want.
  • How it works:
    1. Measure the error (how far from the goal).
    2. Decide how much to correct.
    3. Apply correction; repeat until steady.
  • Why it matters: Without feedback, you can’t stay balanced when things change.

🍞 Bottom Bread (Anchor): Cruise control in a car measures speed (error from target speed) and adjusts the gas to hold steady.
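The cruise-control loop can be written out as a tiny proportional controller (a toy sketch with made-up numbers), showing both the gentle case and the overshooting wobble the paper worries about:

```python
# Tiny proportional (P) controller, the cruise-control analogy
# (illustrative numbers only).
def p_control_speed(target, speed, gain, steps):
    history = []
    for _ in range(steps):
        error = target - speed        # 1. measure the error
        speed = speed + gain * error  # 2-3. correct in proportion, repeat
        history.append(speed)
    return history

gentle = p_control_speed(target=100.0, speed=60.0, gain=0.3, steps=10)
harsh  = p_control_speed(target=100.0, speed=60.0, gain=1.8, steps=10)
# A moderate gain glides smoothly up toward 100; an overly large gain
# blows past the target and swings back and forth -- the "wobble."
```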

The world before this paper: CFG was widely used to align images with text in diffusion and newer flow-matching models. People typically treated CFG as a simple linear push—like pulling a ruler straight between two points. That worked okay at modest settings, but at higher guidance scales many models became unstable. Colors clipped, structures bent, and prompts with positions or text often failed.

The problem: As models grew stronger and prompts got more complex, the simple linear rule (amplify the conditional-unconditional difference) couldn’t guarantee stability. It didn’t consider how fast the error was changing, so big pushes caused overshooting and oscillations.

Failed attempts: Researchers tried making the push change over time (weight schedulers), projecting guidance to avoid oversaturation (APG), or adding predictor-corrector tricks (Rectified-CFG++). These helped but were still mostly linear fixes. Under strong guidance or tricky prompts, wobbling remained.

The gap: The field lacked a unified way to see CFG as a feedback-control system with stability guarantees—especially one that could survive strong guidance and nonlinear model behavior.

Real stakes: Better guidance makes images match prompts more reliably, saves trial-and-error time for artists and developers, and unlocks harder tasks (accurate spatial layouts, readable text in images, consistent characters across frames in video). In everyday terms, it’s the difference between an assistant that vaguely follows instructions versus one that nails the details without breaking the look.

02Core Idea

🍞 Top Bread (Hook): You know how a GPS doesn’t just tell you the destination—it keeps checking where you actually are, then nudges you back on course if you drift?

🥬 Filling (The Actual Concept)

  • What it is: The key idea is to treat CFG as feedback control over an error signal and then use Sliding Mode Control (SMC) to snap the system onto a safe, fast path where the error must shrink.
  • How it works:
    1. Compute the semantic error e(t): the difference between conditional and unconditional predictions.
    2. Build a sliding surface s(t) that combines “how big the error is” and “how fast it’s changing.”
    3. Add a switching correction that pushes the system toward s(t) = 0 and keeps it there.
    4. Prove the energy goes down until the error is tiny (finite-time convergence).
  • Why it matters: Simple linear pushes wobble under strong guidance. Sliding-mode feedback is designed for nonlinear, wiggly systems—it’s robust and converges faster.

🍞 Bottom Bread (Anchor): It’s like putting bumpers in a bowling lane and gently bouncing the ball so it travels straight down the middle to the pins.

Multiple analogies for the same idea:

  1. Thermostat + snap correction: A normal thermostat turns up heat in proportion to how cold it is. SMC adds a quick “snap” when you drift off the desired warm-up path, so the room settles faster without overshooting.
  2. Coach on the sidelines: The coach (controller) watches both the score gap (error) and how fast it’s changing, then calls plays that force the team back to a winning strategy lane (sliding surface).
  3. Biking with side-rails: Gentle steering keeps you centered, and if a gust pushes you off, low-friction rails nudge you back without wobble.

Before vs After:

  • Before: CFG = proportional control (P-control). Increase the gain to push harder toward the prompt, but risk overshooting and artifacts.
  • After: SMC-CFG adds a sliding rule and switching term to force a quick, stable return to the desired path—strong guidance without the usual instability.

Why it works (intuition):

  • The sliding surface s(t) = de/dt + λ·e(t) encodes the ideal, smooth way the error should shrink (roughly an exponential decay). If s(t) = 0, you’re on the perfect decay line.
  • The switching term looks at which side of that line you’re on and applies a bounded, decisive push back toward it.
  • A Lyapunov “energy” function proves these pushes always drain energy until you land on the surface, and then along it, the error dies out quickly.
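A toy one-dimensional simulation makes this intuition concrete (illustrative constants, not the paper's exact discrete update): the sign-based push drives the state onto the decay line and holds it there.

```python
import numpy as np

# Toy 1-D sliding-mode loop (illustrative constants, not the paper's
# exact update). The surface s = de/dt + lam*e is the ideal decay line;
# a bounded, sign-based push drives the state onto it and keeps it there.
def smc_trajectory(e0, lam=6.0, k=0.5, dt=0.01, steps=400):
    e, de = e0, 0.0
    errors = []
    for _ in range(steps):
        s = de + lam * e              # how far we are from the decay line
        de = de - k * np.sign(s)      # switching control: push toward s = 0
        e = e + dt * de               # the error evolves with its rate
        errors.append(e)
    return errors

errors = smc_trajectory(e0=1.0)
# The error slides onto the surface, then decays toward zero and stays
# small (with tiny "chattering") instead of oscillating wildly.
```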

Building blocks (the idea in pieces):

  • 🍞 Hook: Imagine dimming a flashlight beam steadily until it’s dark—smooth, no flickers. 🥬 The Concept: Proportional Control (P-control)
    • What it is: Adjust in direct proportion to current error.
    • How: More error → stronger correction; less error → gentler.
    • Why needed: It’s the base of CFG—simple and effective at low gains.
    • 🍞 Anchor: Cruise control adds more gas if you’re far under speed, and less if you’re close.
  • 🍞 Hook: Two friends giving advice—one says “how you’re off,” the other says “how fast you’re changing.” You want both. 🥬 The Concept: Sliding Surface
    • What it is: A target lane combining “error now” and “error change.”
    • How: s(t) = de/dt + λ·e(t). Make s(t) → 0.
    • Why needed: It encodes the best path to shrink the error without wobbling.
    • 🍞 Anchor: When landing a paper airplane, you watch height and descent rate so you don’t stall or slam.
  • 🍞 Hook: Think of a light tap that flips direction if you drift the wrong way. 🥬 The Concept: Switching Control
    • What it is: A sign-based nudge that always pushes back toward the sliding surface.
    • How: If s > 0, push down; if s < 0, push up; controlled by strength k.
    • Why needed: It provides fast, robust correction even when dynamics are messy.
    • 🍞 Anchor: Bumpers in bowling gently push you back no matter which side you drift.
  • 🍞 Hook: Rolling a ball down a hill—its energy keeps dropping until it settles. 🥬 The Concept: Lyapunov Stability (finite-time convergence)
    • What it is: A mathematical “energy” that must decrease over time.
    • How: Show the switching control drains energy until s(t) hits zero.
    • Why needed: Guarantees the method won’t get stuck wobbling forever.
    • 🍞 Anchor: A spinning top stops wobbling as friction drains its energy.
  • 🍞 Hook: Schedules and directions matter—like choosing both how hard to push and which way to face. 🥬 The Concept: CFG-Ctrl (unified framework)
    • What it is: A recipe that splits guidance into a schedule (how strong) and a direction operator (which way).
    • How: K_t sets strength over time; Π_t shapes the correction direction (e.g., projections).
    • Why needed: It organizes existing tricks (weight schedulers, projections, predictors) under one roof.
    • 🍞 Anchor: It’s like choosing speed (K_t) and steering angle (Π_t) when driving.
These pieces together turn CFG into a robust feedback controller that aligns images to text strongly without flying off the rails.

03Methodology

At a high level: Text prompt + random noise → model predicts two velocities (with and without text) → compute semantic error → build a sliding surface → apply a switching correction → form guided velocity → advance one step → repeat until the image is done.

Step-by-step recipe with what, why, and examples:

  1. Get two predictions from the model
  • What happens: For each step t, ask the model for two velocity fields at the current latent x_t: one conditional v(c) (uses the text) and one unconditional v(∅) (ignores the text).
  • Why this step exists: We need both to know exactly what the text adds. Without the unconditional piece, we can’t isolate the semantic signal.
  • Example: Suppose v(c) = [2, 1] and v(∅) = [1.6, 0.5] along two abstract axes of change.
  2. Compute the semantic error e(t)
  • What happens: e(t) = v(c) − v(∅). This is the “pure prompt effect.”
  • Why: This is the signal we will shape. Without e(t), guidance is guesswork.
  • Example: e(t) = [2 − 1.6, 1 − 0.5] = [0.4, 0.5].
  3. Build the sliding surface s(t)
  • What happens: We combine how big the error is and how fast it’s changing. In discrete steps, that looks like s(t) ≈ (e(t) − e(t+1)) + λ·e(t+1) (a step-based version of de/dt + λ·e).
  • Why: This surface represents the “ideal decay lane” for error. If s(t) = 0, you’re shrinking error smoothly and fast.
  • Example: Say last step’s error was e(t+1) = [0.5, 0.6] and λ = 6. Then s(t) = ([0.4, 0.5] − [0.5, 0.6]) + 6·[0.5, 0.6] = ([-0.1, -0.1]) + [3.0, 3.6] = [2.9, 3.5]. Positive s(t) says: you’re above the lane; push down.
  4. Apply the switching control Δe
  • What happens: Δe = −k·sign(s(t)). If a component of s is positive, subtract k; if negative, add k. This flips as needed each step.
  • Why: It gives a decisive, bounded push back toward the sliding surface, robust to nonlinearities.
  • Example: With k = 0.1 and s = [2.9, 3.5], sign(s) = [+1, +1], so Δe = −0.1·[1, 1] = [−0.1, −0.1]. New e becomes e + Δe = [0.3, 0.4].
  5. Form the guided velocity v̂
  • What happens: v̂ = v(∅) + w·e (after the SMC update). Here w is the guidance scale.
  • Why: This is the usual CFG blend—but using the corrected error for stability and alignment.
  • Example: With v(∅) = [1.6, 0.5], w = 5, and corrected e = [0.3, 0.4], we get v̂ = [1.6, 0.5] + 5·[0.3, 0.4] = [1.6 + 1.5, 0.5 + 2.0] = [3.1, 2.5].
  6. Advance the latent x_t
  • What happens: Take an ODE step using v̂ to update x_t → x_{t−1}.
  • Why: This is how we move the picture from noisy to clear. Without this, nothing changes.
  • Example: Think of x_t sliding a tiny amount in the direction v̂; repeat many times to reveal the image.
  7. Repeat until done
  • What happens: Loop over t from noisy start to final clean image.
  • Why: Each pass nudges alignment and clarity; together they produce the finished picture.
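The whole recipe, using the same toy numbers as the steps above, fits in a few lines (a sketch of one guided step for intuition, not the authors' implementation):

```python
import numpy as np

# One SMC-CFG guided step with the toy numbers from the recipe above
# (a sketch for intuition, NOT the authors' code).
v_cond   = np.array([2.0, 1.0])   # conditional prediction v(c)
v_uncond = np.array([1.6, 0.5])   # unconditional prediction v(∅)
e_prev   = np.array([0.5, 0.6])   # previous step's error e(t+1)
lam, k, w = 6.0, 0.1, 5.0         # surface slope, switching gain, guidance scale

e = v_cond - v_uncond                  # step 2: semantic error -> [0.4, 0.5]
s = (e - e_prev) + lam * e_prev        # step 3: sliding surface -> [2.9, 3.5]
e_corr = e - k * np.sign(s)            # step 4: switching correction -> [0.3, 0.4]
v_hat = v_uncond + w * e_corr          # step 5: guided velocity -> [3.1, 2.5]
# Step 6 would advance the latent with an ODE step: x_next = x + dt * v_hat
```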

What breaks without each part:

  • Skip unconditional v(∅): You can’t isolate the text effect; guidance gets noisy and inconsistent.
  • Skip e(t): No clear error signal; you’re pushing in the dark.
  • Skip s(t): You won’t know if the error is shrinking properly; easier to wobble or overshoot.
  • Skip switching Δe: You lose robust correction; high guidance may destabilize the image.
  • Skip v̂ formation: No controlled way to combine the base and the correction.
  • Skip ODE step: The latent never changes; no image emerges.

Concrete mini walk-through:

  • Suppose a prompt asks for “A blue bus labeled subway shuttle.” Early on, e(t) says “lean into bus shapes and blue textures; add readable text.” If the model pushes too hard (letters warp, colors clip), s(t) spikes positive. The switching Δe trims e(t) slightly, calming the update. Over steps, letters sharpen, blue settles, and the bus label stays readable without neon blowouts.

The secret sauce:

  • The sliding surface s(t) encodes the best way for the error to fade. The switching Δe ensures you get pushed back to that surface quickly, even when the model’s behavior is nonlinear. The Lyapunov analysis shows this isn’t just a hope—the “energy” must drop, so the method converges in finite time.

Bonus: A unified view of other methods (CFG-Ctrl)

  • Guidance schedule K_t (how strong): constant (standard CFG), or time-varying (weight schedulers that start gentle and grow stronger).
  • Direction operator Π_t (which way): identity (plain CFG), or projections (APG, CFG-Zero*) that reshape the signal to avoid oversaturation.
  • Predictor-corrector flavor (Rectified-CFG++): incorporates a peek at a nearby future state to anticipate errors (like a short-term prediction in control).

SMC-CFG fits neatly in this family but adds robust nonlinear feedback that handles strong guidance without wobble.
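The CFG-Ctrl split can be sketched as a schedule plus a direction operator (illustrative signatures; `project_out` is a made-up example of a projection-style Π_t, not APG's exact rule):

```python
import numpy as np

# Sketch of the CFG-Ctrl decomposition (illustrative, not the paper's API):
# any guidance rule = a schedule K_t (how strong) applied to a direction
# operator Pi_t (which way) acting on the error e.
def guided_velocity(v_uncond, e, K_t, Pi_t):
    return v_uncond + K_t * Pi_t(e)

identity = lambda e: e                    # plain CFG: use e as-is

def project_out(direction):
    # Hypothetical projection-style operator: remove the component of e
    # along a chosen unit direction (e.g., an "oversaturating" axis).
    d = direction / np.linalg.norm(direction)
    return lambda e: e - (e @ d) * d

v_uncond = np.array([1.6, 0.5])
e = np.array([0.4, 0.5])
plain = guided_velocity(v_uncond, e, K_t=5.0, Pi_t=identity)
proj  = guided_velocity(v_uncond, e, K_t=5.0,
                        Pi_t=project_out(np.array([0.0, 1.0])))
```

Swapping K_t and Pi_t reproduces the family members: a constant K_t with identity Pi_t is standard CFG, while a time-varying K_t or a projecting Pi_t recovers the scheduler and projection variants.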

04Experiments & Results

🍞 Top Bread (Hook): Picture a school field day where teams try the same tasks—running, balance beam, and puzzle solving—and you compare scores across all events to pick the true winner.

🥬 Filling (The Actual Concept)

  • What it is: The authors tested SMC-CFG against standard CFG and recent variants on multiple image generators and measured quality, alignment, and human-preference signals.
  • How it works:
    1. Models: Stable Diffusion 3.5, Flux, and Qwen-Image (spanning different sizes and styles).
    2. Data: MS-COCO subset (5,000 text–image pairs); plus compositional benchmarks like T2I-CompBench.
    3. Baselines: Standard CFG, CFG-Zero*, Rectified-CFG++.
    4. Metrics: FID (image realism/diversity), CLIP Score (text-image match), and preference/aesthetic scores (Aesthetic, ImageReward, PickScore, HPSv2/2.1, MPS).
  • Why it matters: Numbers are meaningful when they reflect both machine similarity and human preference. This mix shows whether images look real, match text, and please people.

🍞 Bottom Bread (Anchor): It’s like grading a project on accuracy (did you answer the question), neatness (is it clean and readable), and popularity (do people like it).

The test: They measured whether SMC-CFG keeps images realistic (low FID), aligns with the prompt (high CLIP), and is preferred by humans (higher preference models). They also checked robustness across a wide range of guidance scales.

The competition: Standard CFG is the main baseline. CFG-Zero* and Rectified-CFG++ are stronger, recent designs tailored to flow-based models—tough opponents.

The scoreboard (with context):

  • Stable Diffusion 3.5: Standard CFG improved alignment but could degrade visuals at high guidance. SMC-CFG achieved lower FID and matched or slightly exceeded the best CLIP among baselines, meaning better image realism with equal or better text alignment—like getting an A on neatness and still an A on correctness when others get an A- or B+.
  • Flux-dev: Across metrics, SMC-CFG consistently edged out standard CFG and competed closely with CFG-Zero* and Rectified-CFG++. Key win: robustness as the guidance scale increases. While others wobble, SMC-CFG stays steady—like a runner that doesn’t slow in the second half.
  • Qwen-Image: SMC-CFG delivered the best CLIP among compared methods and improved FID relative to CFG at stronger guidance, with stronger preference scores—like pleasing both the judges and the audience.

Compositional benchmarks (T2I-CompBench):

  • SMC-CFG improved color, shape, texture binding, and especially spatial relations across SD3.5, Flux, and Qwen-Image. This is the hardest part for text-to-image (e.g., “the bird on the left of a clock”). Scores rose like moving from a B to a clear A- or A.

Transfer to video (Wan2.2 text-to-video):

  • Qualitatively smoother motion and better semantic consistency across frames.
  • Quantitatively improved VBench total, quality, and semantic scores—fewer flickers and steadier subjects.

Efficiency:

  • Memory, FLOPs, and runtime were basically unchanged from standard CFG at both 512×512 and 1024×1024, meaning you get more stability without paying extra compute.

Surprising findings:

  • The method remains strong even at very large guidance scales, where others collapse or produce artifacts. Instead of falling apart when you turn the dial up, SMC-CFG keeps the picture steady and aligned, as promised by the sliding-mode design.
  • A single pair of SMC hyperparameters (λ, k) per model worked across varied prompts and datasets, suggesting a reasonably wide “stability corridor” in practice.

05Discussion & Limitations

🍞 Top Bread (Hook): Imagine a great pair of training wheels—they keep you upright on bumpy roads, but you still have to pick the right height and tightness or you’ll feel wobbly or too stiff.

🥬 Filling (The Actual Concept)

  • Limitations:
    1. Extra knobs: SMC-CFG adds two hyperparameters (λ and k). Though stable ranges exist, some models or tasks may need light tuning.
    2. Discrete steps: Sliding control’s sign-based updates can cause tiny jitters (“chattering”) if k is too big or steps are too coarse. The paper’s settings avoid this, but it’s a general risk.
    3. Bounds are implicit: Theory assumes certain bounds on model drift and Jacobian deviations. These aren’t measured directly during sampling.
    4. Extreme prompts: Very long, conflicting, or stylized prompts can still be tricky; SMC-CFG improves robustness but isn’t magic.
  • Required resources: Similar memory, FLOPs, and runtime to standard CFG; no extra training required. You just swap in the SMC guidance at inference time.
  • When NOT to use:
    • If you already run at very low guidance scales and are happy with results, SMC’s extra knobs may not be worth it.
    • If your model is heavily guidance-distilled to behave well without CFG, the gains may be smaller.
    • Ultra-low-latency environments with extremely large steps might prefer smoother-than-switching variants to avoid numerical jitter.
  • Open questions:
    1. Adaptive control: Can λ and k adjust automatically based on the current error or its change rate, removing manual tuning?
    2. Hybrid controllers: Combine SMC with projections (APG/CFG-Zero*) or predictor-corrector (Rectified-CFG++) for even stronger stability.
    3. Discrete-time theory: Provide tighter, step-size-aware convergence guarantees.
    4. Beyond images: Systematic studies for video, 3D, and multimodal tasks where temporal or geometric consistency matters more.

🍞 Bottom Bread (Anchor): It’s like upgrading from a basic bike to one with better shocks and brakes—you ride more confidently, but you’ll still want a good fit and might tune the seat and tire pressure for your trail.

06Conclusion & Future Work

Three-sentence summary: This paper reframes Classifier-Free Guidance as a feedback-control problem and shows that standard CFG is just a proportional controller that can wobble at high guidance. It introduces Sliding Mode Control CFG, which adds a sliding surface and a switching correction to force rapid, stable convergence of the semantic error. The method improves text-image alignment and visual quality across strong models and remains efficient, with theory-backed stability.

Main achievement: A unified control-theory framework (CFG-Ctrl) for guidance in flow-based diffusion models, plus a robust, nonlinear SMC controller that delivers finite-time convergence and practical gains over standard CFG.

Future directions: Develop adaptive strategies that auto-tune λ and k from the evolving error; combine SMC with projection or predictive components; extend systematic evaluations to video, 3D, and complex multimodal settings; and strengthen discrete-time convergence analysis.

Why remember this: It turns a widely used heuristic (CFG) into a principled, robust controller with clear guarantees and real improvements. In everyday terms, it gives you the confidence to turn the guidance knob higher without breaking your image, and it points the way to more reliable, controllable generative systems.

Practical Applications

  • Use SMC-CFG in text-to-image pipelines to get better spatial relations (e.g., ‘the bird on the left of the clock’) at higher guidance without artifacts.
  • Enable readable, on-image text (posters, labels, signs) by turning guidance higher with SMC-CFG for sharper lettering.
  • Apply SMC-CFG in design tools to lock in brand colors and object counts (e.g., ‘three blue mugs’) without over-saturation.
  • Generate instructional diagrams that precisely match step-by-step prompts, improving clarity for education and documentation.
  • Adopt SMC-CFG in text-to-video for steadier subjects and fewer flickers across frames.
  • Combine SMC-CFG with projection-based methods (like APG) to further reduce color clipping while keeping strong alignment.
  • Run hyperparameter sweeps to find a single (λ, k) per model, then standardize it for production for robust, low-maintenance guidance.
  • Scale up guidance (w) in compositional benchmarks to improve hard cases (color/shape/texture/spatial) without losing realism.
  • Integrate SMC-CFG into 3D or multi-view generation loops to better preserve object identity and placement across views.
  • Use SMC-CFG for safer retries: if a prompt is tricky, increase guidance with stability instead of risking artifacts.
#Classifier-Free Guidance #Sliding Mode Control #Diffusion Models #Flow Matching #Semantic Alignment #Lyapunov Stability #Feedback Control #Image Synthesis #Guidance Scale #Projection-based Guidance #Predictor-Corrector #Robust Control #Text-to-Image #Nonlinear Dynamics #Finite-time Convergence