Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

Intermediate
Yuanyang Yin, Yufan Deng, Shenghai Yuan et al. · 1/12/2026
arXiv · PDF

Key Summary

  • Image-to-Video models often keep the picture looking right but ignore parts of the text instructions.
  • This paper finds the root cause: some middle layers in the model barely listen to the text, called Semantic-Weak Layers.
  • A second cause is Condition Isolation, where text, image, and reference-frame signals are injected separately and don’t line up well.
  • The authors introduce Focal Guidance, a simple add-on that helps those weak layers pay attention to the right words and regions.
  • Focal Guidance has two parts: Fine-grained Semantic Guidance (ties keywords to exact spots in the image) and Attention Cache (borrows good attention patterns from strong layers).
  • On a new benchmark that checks instruction-following, Focal Guidance boosts Wan2.1-I2V by +3.97% and HunyuanVideo-I2V by +7.44%.
  • It improves actions like motions, interactions, and changing attributes, while keeping subject and background consistent.
  • It works across different architectures and needs only light post-training or even none in some cases.
  • Traditional metrics missed these gains, so the paper also provides a better evaluation focused on following instructions.
  • Overall, the method makes videos that both look like the starting image and actually do what the text says.

Why This Research Matters

Better controllability means videos that not only look nice but actually do what you asked, which is crucial for creative work, education, and storytelling. It reduces trial-and-error for artists and editors, saving time and resources. In settings like training videos or science explainers, following instructions exactly can make content clearer and more trustworthy. For accessibility, precise word-to-pixel grounding can help systems respond more faithfully to spoken or written commands. It also encourages fairer evaluations by measuring what truly matters—did the video follow the prompt—rather than only how pretty it looks.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook) You know how a flipbook turns drawings into a moving story when you flip the pages? Now imagine you start with a real photo on the first page and a short script telling what should happen next.

🄬 Filling (The Actual Concept)

  • What it is: Image-to-Video (I2V) models turn a starting picture and a text instruction into a short video.
  • How it works: Step by step, the model takes random-looking frames and steadily cleans them up while listening to two helpers: the image (to keep looks right) and the text (to make the action right).
  • Why it matters: Without balancing both, you either get a video that looks like the image but ignores the instruction, or follows the words but changes the person or style.

šŸž Bottom Bread (Anchor) Example: You give a photo of a girl on a bike and say, ā€œThe girl takes off her purple helmet.ā€ A good I2V model should keep the same girl and bike, and show her removing the helmet.

—

šŸž Top Bread (Hook) Imagine drawing a picture by first scribbling fuzzy shapes, then sharpening details over time until it looks real.

🄬 Filling (The Actual Concept)

  • What it is: Diffusion models are AI artists that start from noisy images and gradually denoise them into clear pictures and videos.
  • How it works:
    1. Start with noise that looks like TV static.
    2. Use a learned guide to remove noise step by step.
    3. At each step, listen to conditions like text and a reference image.
    4. End with a clean, meaningful frame.
  • Why it matters: This slow-and-steady clean-up process makes high-quality, detailed results possible.

šŸž Bottom Bread (Anchor) When you ask the model for ā€œa red ball rolling left,ā€ diffusion steps help keep the ball round and the motion smooth.

—

šŸž Top Bread (Hook) Think of a movie director telling actors what to do with words, but also handing them a photo to match their costumes.

🄬 Filling (The Actual Concept)

  • What it is: Text-to-Video (T2V) turns words into videos; Image-to-Video (I2V) adds a starting picture so the main subject stays the same.
  • How it works:
    1. The text describes actions and changes.
    2. The first image locks in the subject’s appearance and style.
    3. The model combines both while denoising.
  • Why it matters: You want both faithfulness to the look (from the image) and obedience to the instruction (from the text).

šŸž Bottom Bread (Anchor) Prompt: ā€œThe man with a skull face picks up the guitarā€ plus a photo of that man. The video should show that exact man lifting the guitar.

—

šŸž Top Bread (Hook) You know how group projects go wrong if teammates never really talk to each other?

🄬 Filling (The Actual Concept)

  • What it is: Condition Isolation means the three helpers—the reference frame, image features, and text—are added into the model in their own separate ways and don’t line up closely.
  • How it works:
    1. The reference frame gives high-detail looks.
    2. The image encoder gives mid-level visual clues.
    3. The text encoder gives low-frequency meanings.
    4. They’re mixed only later by general attention, without fine matching early on.
  • Why it matters: Without tight, early alignment, the model may fail to connect words like ā€œhelmetā€ to the right pixels of the helmet in the picture.

šŸž Bottom Bread (Anchor) If you say ā€œpick up the spoon,ā€ but the model never firmly links the word ā€œspoonā€ to the spoon region in the first image, it might pick up a cup instead.

—

šŸž Top Bread (Hook) Imagine in a school play, most actors follow the script, but a few in the middle scenes don’t listen well and ad-lib.

🄬 Filling (The Actual Concept)

  • What it is: Semantic-Weak Layers are middle parts of the model that don’t respond strongly to the text meaning.
  • How it works:
    1. Early layers and final layers pay good attention to keywords.
    2. Some middle layers drift and rely on generic visual habits (priors).
    3. Text–visual similarity drops, so guidance from the prompt fades.
  • Why it matters: If these layers ignore the text, the final video may look right but do the wrong action.

šŸž Bottom Bread (Anchor) With ā€œthe little girl takes off the purple helmet,ā€ weak layers may keep her biking but never remove the helmet.

—

šŸž Top Bread (Hook) Have you ever graded a project with the wrong rubric? You might miss what really matters.

🄬 Filling (The Actual Concept)

  • What it is: Traditional I2V metrics focus on looks (subject/background consistency, aesthetics) and under-measure instruction-following.
  • How it works:
    1. They check if the person and background match the first image.
    2. They score smoothness and quality.
    3. They don’t fully test if the action or change matched the text.
  • Why it matters: Without the right test, progress on ā€œdoing what was askedā€ stays hidden.

šŸž Bottom Bread (Anchor) Two videos may look equally pretty, but only one actually shows ā€œpicking up the spoon.ā€ Old metrics would call them a tie; a better benchmark would not.

02 Core Idea

šŸž Top Bread (Hook) You know how a coach gives special tips to the teammates who are struggling, so the whole team improves?

🄬 Filling (The Actual Concept)

  • What it is: Focal Guidance is a lightweight add-on that helps the model’s weak layers focus on the right words and regions, restoring strong instruction-following.
  • How it works:
    1. Find where text understanding is weak (Semantic-Weak Layers).
    2. Fine-grained Semantic Guidance (FSG) links keywords to exact spots in the reference image.
    3. Attention Cache (AC) copies good attention patterns from strong layers to weak ones.
    4. Keep denoising while constantly reinforcing the right text–visual matches.
  • Why it matters: Without this help, those weak layers keep drifting, and the final video ignores parts of the prompt.

šŸž Bottom Bread (Anchor) Prompt: ā€œThe woman slowly picks up the spoon beside the coffee.ā€ With Focal Guidance, the model locks onto the actual spoon area and follows the action as written.

—

Three analogies for the same idea:

  • Coach analogy: The coach (FG) pairs a player’s to-do list (text) with field positions (image regions) and shows clips of perfect plays (attention patterns) so struggling players (weak layers) improve.
  • GPS analogy: FG pins the destination (keywords) to exact map spots (visual anchors) and shares successful routes (attention cache) with drivers who got lost (weak layers).
  • Highlighter analogy: FG highlights key words and then highlights the matching parts of the picture, and lets quiet readers (weak layers) peek at top students’ notes (attention from strong layers).

Before vs. After:

  • Before: Models often favored looks over instructions. Middle layers went off-script. Some actions didn’t happen or were wrong.
  • After: Layers agree on who/what/where. Motions and interactions match the words. The subject and background stay consistent.

Why it works (intuition):

  • Text alone is too blurry; pixels alone are too detailed. FG builds a bridge at just the right places: it plants ā€œvisual anchorsā€ for important words and reuses proven attention maps to prevent drift.
  • This reduces guesswork in the middle of denoising, so the model stays aligned with the prompt.

Building blocks (with Sandwich explanations):

  1. šŸž Top Bread (Hook) Imagine label stickers that match words to exact spots in a picture.

🄬 Filling (The Actual Concept)

  • What it is: Fine-grained Semantic Guidance (FSG) binds each important word (like ā€œspoonā€) to its matching region in the reference image.
  • How it works:
    1. Use a vision–language encoder to score how much each image patch matches each keyword.
    2. Pick top words and compute a visual anchor (a compact summary) for each word.
    3. Nudge the model’s internal features so these anchors sit right where they belong.
  • Why it matters: Without FSG, words float around without a home in the image, and the model can choose the wrong object.

šŸž Bottom Bread (Anchor) For ā€œpick up the guitar,ā€ FSG points the word ā€œguitarā€ directly to the guitar-shaped area in the first frame.

  1. šŸž Top Bread (Hook) Think of borrowing a friend’s perfect outline when your own drawing gets messy in the middle.

🄬 Filling (The Actual Concept)

  • What it is: Attention Cache (AC) saves attention patterns from strong, text-responsive layers and replays them to weaker ones.
  • How it works:
    1. Record which pixels the strong layers focused on for each keyword.
    2. Combine these maps into a cache.
    3. Feed this cache into weak layers so they aim at the same correct regions.
  • Why it matters: Without AC, weak layers fall back on habits and ignore the text’s specific instructions.

šŸž Bottom Bread (Anchor) If the strong layers already showed ā€œhelmetā€ focus on the head, AC helps the weak layers keep that focus so ā€œtake off the helmetā€ really happens.

03 Methodology

High-level recipe: Reference Image + Text Prompt → Find Weak Layers → FSG (word-to-pixel anchors) → AC (share strong attention) → Denoised Video that looks right and acts right.

Step-by-step (what, why, example):

  1. Input and setup
  • What happens: You provide a reference frame (the first image) and a text prompt. The model starts denoising from noise toward a clean video.
  • Why this step exists: The reference frame preserves who/what; the text tells what to do.
  • Example: Image of a woman at a table with coffee; text: ā€œThe woman slowly picks up the spoon beside the coffee.ā€
  2. Detect Semantic-Weak Layers
  • What happens: The system checks which layers respond poorly to the text (these are often in the middle of the network). It looks for low text–visual similarity and unstable focus patterns.
  • Why this step exists: You need to know where guidance collapses to insert help exactly there.
  • Example: Layers 11–26 (for one model) show weak alignment with ā€œspoon.ā€
  3. Fine-grained Semantic Guidance (FSG)
  • What happens (detailed):
    a) Keyword selection: Split the prompt into tokens; choose important words like ā€œspoon,ā€ ā€œpicks up,ā€ ā€œcoffee.ā€
    b) Word-to-region matching: Use a vision–language encoder (like CLIP) to measure how similar each image patch is to each keyword.
    c) Make visual anchors: For each chosen word, blend the best-matching patches into a compact anchor that represents that object/region.
    d) Gentle injections: Slightly enrich the model’s internal features so each keyword carries its visual anchor; also nudge the matching pixels so the anchor sits where it belongs.
  • Why this step exists: It fixes Condition Isolation by giving each word a clear home in the picture.
  • Example with data: For ā€œspoon,ā€ FSG locks onto the thin metallic shape near the cup. The ā€œspoonā€ word now drags the model’s focus to that exact spot in every denoising step.
  4. Attention Cache (AC)
  • What happens (detailed):
    a) Attention recording: In layers that already track the text well (early and late layers), record where they focus for each keyword.
    b) Cache building: Combine these focus maps into a weighted cache (think of a heatmap per keyword).
    c) Guided replay: Feed the cache into the weak layers so they attend to the same right places.
  • Why this step exists: It prevents mid-layers from drifting back to visual habits and forgetting the instruction.
  • Example with data: A strong layer’s map for ā€œspoonā€ glows brightest on the spoon pixels. AC reuses this map in a weak layer so it also boosts those spoon pixels.
  5. Ongoing denoising with guidance
  • What happens: As frames get cleaner, FSG keeps words tied to the correct regions, and AC keeps weak layers on track (a toy sketch of this cached-attention replay appears after this step list).
  • Why this step exists: Guidance must persist across steps; otherwise, the model can fix focus one moment and lose it the next.
  • Example: Over a few steps, the spoon lifts smoothly; the woman’s face, clothes, and table stay consistent with the reference image.
  6. Output
  • What happens: The model finishes denoising and outputs the video.
  • Why this step exists: You now have a clip that both looks like the starting photo and does what you asked.
  • Example: A short video shows the woman’s hand grasping and raising the real spoon next to the coffee.
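
To make steps 2–5 a bit more concrete, here is a toy sketch of the attention-cache idea inside a simplified denoising loop: record where a text-responsive layer's attention lands, then blend that cached map into a weak layer at every step. The blend weights, layer roles, and shapes are illustrative assumptions, not the paper's exact algorithm; the FSG anchors from the earlier sketch would be re-applied in the same loop.

```python
import torch
import torch.nn.functional as F

def text_attention(video_tokens, text_tokens):
    # How strongly each video patch attends to each word (rows sum to 1).
    scores = video_tokens @ text_tokens.T / text_tokens.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1)            # (num_patches, num_words)

video_tokens = torch.randn(1024, 512)           # noisy video patches (placeholder)
text_tokens = torch.randn(12, 512)              # prompt words (placeholder)
cache = None

for step in range(50):                          # simplified denoising loop
    # Strong layer: compute text attention normally and fold it into the cache.
    strong_map = text_attention(video_tokens, text_tokens)
    cache = strong_map if cache is None else 0.5 * cache + 0.5 * strong_map

    # Weak layer: blend its own (drifting) attention with the cached map so it
    # keeps pointing at the same words and regions as the strong layers do.
    weak_map = text_attention(video_tokens, text_tokens)
    guided_map = 0.5 * weak_map + 0.5 * cache
    weak_output = guided_map @ text_tokens      # (num_patches, 512)

    # Keep denoising; guidance persists across steps so early fixes don't fade.
    video_tokens = video_tokens - 0.01 * weak_output
```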

The secret sauce:

  • Precise word-to-pixel anchoring (FSG) cuts confusion. It’s like placing name tags on objects in the photo.
  • Sharing proven focus (AC) rescues weak layers. It’s like letting mid-scene actors peek at the best rehearsal notes.
  • Both are lightweight: they don’t overhaul the whole model, they just strengthen the exact places that need help.

What breaks without each step:

  • Without detecting weak layers: You don’t know where to help; guidance gets spread too thin.
  • Without FSG: Words aren’t grounded; the model might choose the wrong object.
  • Without AC: Middle layers drift; text obedience fades mid-denoising.
  • Without persistence across steps: Early fixes vanish; later frames go off-script.

Sandwich reminders for key pieces:

  • FSG šŸž Hook: Like sticking labels on objects in a photo. 🄬 Concept: Bind keywords to their exact regions with visual anchors and gentle injections. šŸž Anchor: ā€œGuitarā€ sticks to the guitar area; the hand moves to that spot.
  • AC šŸž Hook: Borrow a perfect outline when your drawing gets messy. 🄬 Concept: Reuse strong-layer attention maps to guide weak layers. šŸž Anchor: ā€œHelmetā€ focus stays on the head, so ā€œtake off the helmetā€ is performed.

04 Experiments & Results

The test (what and why):

  • What: The authors built a benchmark to grade instruction-following in I2V, across three everyday skills: Dynamic Attributes (things that change color/state), Human Motion (actions), and Human Interaction (two or more people doing something together).
  • Why: Old scores were like judging only costumes and scenery. This benchmark checks if the actors actually perform the lines.

The competition (who vs. who):

  • Baselines included top open-source I2V systems such as Wan2.1-I2V (CrossDiT-based), HunyuanVideo-I2V (MMDiT-based), SkyReels-V2, LTX-Video, Open-Sora, and CogVideoX.
  • Focal Guidance (FG) was tested as an add-on to Wan2.1-I2V and HunyuanVideo-I2V.

The scoreboard (with context):

  • FG + Wan2.1-I2V: Total Score rose from 0.6973 to 0.7250 (+3.97%). That’s like raising a solid B to a B+ by finally answering the main question correctly.
  • FG + HunyuanVideo-I2V: Total Score went from 0.5185 to 0.5571 (+7.44%). That’s like jumping from a C to a strong C+, closing a big gap.
  • Breaking it down (Wan2.1-I2V with FG):
    • Dynamic Attributes: +9.91% (better at changes like colors or states over time).
    • Human Motion: +8.38% (actions match the prompt more reliably).
    • Human Interaction: +9.02% (two-person activities are more faithful to instructions).
  • Subject/Background Consistency stayed essentially unchanged (tiny fluctuations around tenths of a percent), meaning the method kept the look while fixing the action.
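
The headline numbers are relative improvements over each baseline's total score. A quick sanity check of that arithmetic (assuming the gains are reported as relative, not absolute, percentages):

```python
wan_base, wan_fg = 0.6973, 0.7250
hun_base, hun_fg = 0.5185, 0.5571

print(f"Wan2.1-I2V:       +{(wan_fg - wan_base) / wan_base:.2%}")   # ā‰ˆ +3.97%
print(f"HunyuanVideo-I2V: +{(hun_fg - hun_base) / hun_base:.2%}")   # ā‰ˆ +7.44%
```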

Traditional metrics missed the gains:

  • On usual scores like aesthetics, smoothness, and simple consistency, improvements looked small or invisible. This confirms we needed a purpose-built benchmark that asks: ā€œDid the video actually do what you asked?ā€

Surprising findings:

  • FG helps even with zero or very light fine-tuning, showing strong training efficiency.
  • Post-training alone gave only modest improvements; combining it with FG was best — explicit guidance (FG) plus a little learning (post-training) worked hand-in-hand.
  • The approach generalized across different architectures (CrossDiT and MMDiT), suggesting the weak-layer problem is common and FG’s fix is broadly useful.

Concrete examples (what changed in practice):

  • ā€œThe woman picks up the spoon beside the coffeeā€: Without FG, models sometimes grabbed the wrong object or froze. With FG, the hand found the actual spoon and moved it.
  • ā€œThe little girl on the bike takes off the purple helmetā€: FG helped the model focus on the helmet region, so the removal action actually happened, not just continued biking.
  • ā€œLeaves gradually change from green to redā€: FG maintained attention on the leaves, so the color shift was clear and progressive, not random or ignored.

05 Discussion & Limitations

Limitations (be specific):

  • If the base model’s vision–language understanding is weak, FSG’s word-to-region matching can be less accurate, limiting gains.
  • AC depends on having some strong layers to copy from; if almost all layers are weak, the cache has little to offer.
  • Very ambiguous prompts (vague wording or multiple similar objects) still pose challenges for precise grounding.
  • Extremely fast or complex motions may require additional motion-specific control signals.

Required resources:

  • A vision–language encoder (such as CLIP-like features) to match words to image regions.
  • Light post-training can help but isn’t mandatory; the method is designed to be lightweight.
  • Usual I2V inference resources (GPU memory/time comparable to the base model, with a small overhead for guidance and caching).

When NOT to use:

  • If you only care about style or background looks and don’t need precise instruction-following, FG may be unnecessary overhead.
  • If your prompts never refer to specific objects or actions (purely atmospheric prompts), word-to-region anchoring offers little benefit.
  • If the reference image is misleading (e.g., the required object truly isn’t there), stronger editing tools may be needed instead of FG.

Open questions:

  • Can the model learn to avoid Semantic-Weak Layers during pretraining, removing the need for guidance later?
  • How can we make FSG even more robust when objects are tiny, occluded, or heavily stylized?
  • Can AC be extended across time, reusing good attention patterns from earlier frames for later ones more intelligently?
  • What is the best automatic way to choose keywords and thresholds so that no manual tuning is needed across diverse prompts?

06 Conclusion & Future Work

Three-sentence summary:

  • Some middle layers in I2V models don’t listen well to the text, causing videos that look right but do the wrong thing.
  • Focal Guidance fixes this by pinning keywords to exact image regions (FSG) and borrowing proven attention from strong layers (AC).
  • On a new instruction-following benchmark, it boosts controllability across major I2V models while keeping subjects and backgrounds faithful.

Main achievement:

  • Turning a diagnosis (Condition Isolation and Semantic-Weak Layers) into a practical, lightweight remedy that measurably improves instruction-following.

Future directions:

  • Bake this guidance into pretraining so fewer layers become weak; make keyword selection and thresholds fully adaptive; extend caching across time and modalities (e.g., audio or depth).

Why remember this:

  • It shows that precise word-to-pixel grounding plus sharing successful focus patterns can transform how well video models follow instructions — a small, smart nudge in the right places makes a big difference.

Practical Applications

  • Video prototyping for ads: Ensure the on-screen actions (e.g., ā€œraise the product capā€) match the script exactly.
  • Educational animations: Make molecules change color or shape on command for clearer science lessons.
  • Storyboarding and previz: Keep character identity from a sketch while precisely acting out new directions.
  • How-to content: Show hands performing the exact steps (e.g., ā€œtighten the blue screwā€) from a reference image.
  • Game asset iteration: Animate static concept art to test motions that follow text notes from designers.
  • Accessibility tools: Generate faithful visualizations from spoken commands that reference objects in a picture.
  • Product demos: Maintain the real product look while carrying out precise scripted interactions.
  • Interactive art: Audience text prompts trigger accurate motions of elements seen in a starting poster or frame.
  • Scientific visualization: Track and modify specific regions (like leaves or cells) according to textual hypotheses.
  • Compliance and QA: Automatically check whether generated clips actually follow the instruction script.
#Image-to-Video generation #Diffusion Transformer #Controllability #Semantic-Weak Layers #Condition Isolation #Fine-grained Semantic Guidance #Attention Cache #Visual Anchors #Instruction Following #Text-Visual Alignment #CLIP-based grounding #DiT-based I2V #Cross-modal alignment #Video diffusion #Benchmark for I2V