
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Intermediate
Chengzhuo Tong, Mingkun Chang, Shenglong Zhang et al. · 1/15/2026
arXiv · PDF

Key Summary

  • This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
  • Instead of jumping straight to a finished image, the model creates three frames: a rough draft, a better version, and a final polished picture.
  • Each frame is like one reasoning step, so we can see and train the model to fix mistakes and add details gradually.
  • They built a special dataset (CoF-Evol-Instruct) with 64,000 three-step examples that go from meaning-first fixes to beauty-and-detail polish.
  • They also encode each frame independently so no fake motion or flicker leaks between steps, keeping each frame crisp.
  • The method uses a flow-based training recipe so the model can smoothly travel from noise to good images.
  • On GenEval it scores 0.86, beating many text-planning systems, and on Imagine-Bench it hits 7.468, a big jump over its base video model.
  • Ablations show that learning from the whole three-step chain (not just the final picture) is important for the best results.
  • This shows video models can be pure visual reasoners for text-to-image, improving accuracy and image quality without extra language planning.
  • The idea opens a path to more controllable, reliable image generation that fixes meaning first and style second.

Why This Research Matters

This work shows that AI can make better pictures by thinking visually in steps, just like people do when they sketch, fix, and finish. It reduces common mistakes (wrong counts, misplaced objects, wrong colors) that frustrate users and waste time. Designers, teachers, and online sellers benefit from images that match prompts more precisely, which means faster workflows and fewer retries. The approach is efficient at inference: only the final frame is decoded, while the earlier steps guide the quality. It also paves the way for more controllable generation where users could someday preview or nudge intermediate steps. By avoiding heavy language interleaving, pixel-level corrections become more direct and faithful. Overall, it’s a practical shift toward trustworthy, higher-fidelity image generation.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how drawing a great picture usually starts with a sketch, then you clean it up, and finally you color and shade it? Artists don’t do everything at once—they improve step by step.

🥬 The Concept: Text-to-image (T2I) generation is when a computer makes a picture from a sentence.

  • What it is: A system that reads your words and makes an image that matches them.
  • How it works: 1) Read the prompt. 2) Imagine a scene. 3) Render pixels. 4) Adjust details.
  • Why it matters: If the computer jumps straight to pixels without careful thinking, it may miss objects, mess up counts, or confuse colors and positions.

🍞 Anchor: If you say, “a photo of three red apples on a table,” a good T2I model should show exactly three, not two or five, and they should be red, on a table.

🍞 Hook: Imagine telling a friend a complex scene to draw—like “a backpack with a Victorian pocket watch sewn on the front.” If they rush, they might forget the watch or sew it in the wrong place.

🥬 The Concept: Visual reasoning is how AI thinks with pictures, not just words.

  • What it is: Step-by-step understanding and decision-making based on what the model sees (or is generating).
  • How it works: 1) Notice key objects and layout. 2) Check relations and counts. 3) Add details and textures. 4) Re-check and correct.
  • Why it matters: Without visual reasoning, models often follow the “vibes” of training data (like typical scenes) and ignore tricky parts of prompts.

🍞 Anchor: When asked for “a hot dog the size of a thumbnail,” a reasoning model keeps the hot dog tiny instead of drawing a full-sized snack.

The world before: Classic T2I systems either generated the final image in one pass or used extra helpers like reward models or language-planning models to guide fixes. Two big issues popped up: (1) They kept switching between text and vision, which made pixel corrections indirect and lossy. (2) Unified multimodal models weren’t pretrained on large visual step-by-step refinements, so their self-corrections were shaky.

🍞 Hook: Think of a flipbook animation: each page shows the scene getting a bit closer to the final moment. Video models are pros at making those pages flow.

🥬 The Concept: Chain-of-Frame (CoF) reasoning is step-by-step visual thinking across frames.

  • What it is: Treat each frame as one reasoning step that refines the scene.
  • How it works: 1) Make a rough layout. 2) Fix meaning (objects, counts, positions). 3) Polish looks (lighting, texture). 4) Output final.
  • Why it matters: Without CoF, visual fixes get tangled or skipped, like trying to color before you finish the sketch.

🍞 Anchor: For “a bridge connecting two giant floating jellyfish,” frame 1 roughs in two jellyfish, frame 2 clearly draws the bridge between them, frame 3 adds glow and sunset tones.

The problem: Even though video models naturally refine scenes frame by frame, using them to help T2I was hard because T2I lacked a clear visual starting point and interpretable middle steps. We needed training data that shows how an image should evolve—from getting the meaning right to making it pretty.

🍞 Hook: Imagine teaching a kid to write an essay by giving them examples with drafts, revisions, and finals—not just the final essay.

🥬 The Concept: CoF-Evol-Instruct is a dataset of three-frame improvement paths for T2I.

  • What it is: 64,000 short chains where each step gets closer to a perfect image.
  • How it works: 1) Gather diverse prompts. 2) Generate anchor images from models of different strengths. 3) Route each anchor by quality. 4) Use a careful edit tool to build forward or backward steps that fix meaning then improve looks.
  • Why it matters: Without these supervised steps, models only learn final outcomes, not how to fix mistakes along the way.

🍞 Anchor: For “five candles,” the chain goes from wrong count → correct five → crisper wax textures and realistic lighting.

Failed attempts and the gap: Prior methods relied on (a) external verifiers that score images or (b) language plans inserted between visual steps. Both require hopping between text and vision, which can blur pixel-level corrections. What was missing was a purely visual, causally ordered refinement process that a model could internalize. Video models already think in sequences—so the idea is to reuse that for images.

Real stakes: This impacts designers needing precise mockups, teachers wanting accurate illustrations, shops requiring exact product counts and colors, and creators who need imaginative but correct compositions. A system that fixes meaning first and polish second is more trustworthy and easier to guide.

02Core Idea

🍞 Hook: You know how you build LEGO: first the base, then the walls, then decorations? Doing it in order keeps everything sturdy and right.

🥬 The Concept: The key insight—use a video model as a pure visual reasoner to generate a tiny three-frame chain, then output only the last frame as the final image.

  • What it is: A text-to-image method (CoF-T2I) that makes three visual reasoning steps: draft → refine → final.
  • How it works: 1) Start from noise. 2) Denoise into a short latent video of three frames. 3) Each frame fixes issues and adds details. 4) Decode only the last frame into pixels.
  • Why it matters: Without visible middle steps, models can’t easily learn how to correct mistakes—and users can’t see the thinking.

🍞 Anchor: Prompt: “a parrot flying outside a closed birdcage.” Frame 1 places a bird and cage. Frame 2 puts the parrot outside and cage door shut. Frame 3 sharpens feathers and lighting.

Multiple analogies:

  1. Cooking: Frame 1 is the basic batter, frame 2 bakes it to the right shape, frame 3 frosts it beautifully.
  2. Sketching: Frame 1 is the pencil outline, frame 2 checks proportions and fixes errors, frame 3 inks and shades.
  3. Sports practice: Frame 1 learns the move slowly, frame 2 adjusts form, frame 3 executes cleanly at full speed.

Before vs After:

  • Before: One-shot images or text-heavy planning often miss precise counts, positions, or weird creative combos, and pixel fixes are indirect.
  • After: The model self-corrects visually in ordered steps, so composition and attributes lock in before style polish—fewer mistakes, better fidelity.

🍞 Hook: Imagine cleaning your room: first you put things in the right place, then you dust and decorate. If you decorate first, you’ll push things around and ruin it.

🥬 The Concept: Progressive visual refinement is fixing meaning first, then improving looks.

  • What it is: A two-stage mindset—semantic correction then aesthetic enhancement.
  • How it works: 1) Ensure objects, counts, and relations are right. 2) Add textures, lighting, and realism.
  • Why it matters: Jumping to style too early locks in errors or hides them under pretty details.

🍞 Anchor: “a MacBook in vibrant, fiery orange.” Step 1 gets the MacBook shape. Step 2 makes sure the color and intensity match “fiery orange.” Step 3 adds reflections and metal shine.

Why it works (intuition): Video models already learn to evolve scenes over time, so asking them to evolve a single image through three micro-frames fits their strengths. Training with chains teaches the model not just what a good picture looks like, but how to move from wrong to right. Independent frame encoding prevents accidental “motion” from bleeding into still images, keeping each reasoning step clean.

Building blocks (first mentions use sandwich):

🍞 Hook: Imagine describing a scene quietly in your head before drawing it.

🥬 The Concept: Latent frames are compressed, invisible versions of images the model works with.

  • What it is: Tiny, information-rich representations of frames.
  • How it works: 1) Compress images into latents. 2) Edit and refine latents. 3) Decode to pixels at the end.
  • Why it matters: Working in latents is faster and keeps structure clear.

🍞 Anchor: The model edits a small “idea” of the picture, then reveals the final image.

🍞 Hook: Think of a zip file that shrinks a video while remembering what matters.

🥬 The Concept: A video VAE compresses and decompresses frames.

  • What it is: A tool that turns images/videos into latents and back.
  • How it works: 1) Encode frames to latents. 2) Store and refine. 3) Decode the final one.
  • Why it matters: Without it, training would be slow and less stable.

🍞 Anchor: The VAE lets the model carry a mini-version of the picture while thinking.

🍞 Hook: Picture sliding a microscope over each frame so you study it alone.

🥬 The Concept: Independent frame encoding treats each frame separately.

  • What it is: Encoding each reasoning step on its own.
  • How it works: 1) Feed one frame at a time to the VAE. 2) Avoid temporal coupling. 3) Keep frames crisp.
  • Why it matters: Without it, you get fake motion or tangled steps.

🍞 Anchor: Each step is a clean page in a three-page comic, not smudged by the previous page (a frame-wise encoding sketch follows below).
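To make frame-wise encoding concrete, here is a minimal PyTorch sketch. TinyImageVAE and the function names are illustrative stand-ins, not the paper's frozen video VAE or its API; the only point shown is that each reasoning frame passes through the encoder on its own, with no temporal window that could couple neighboring steps.

```python
# Minimal sketch of independent (frame-wise) encoding with a toy stand-in VAE.
import torch
import torch.nn as nn

class TinyImageVAE(nn.Module):
    """Toy per-image autoencoder standing in for the real (frozen) video VAE."""
    def __init__(self, channels=3, latent_dim=8):
        super().__init__()
        self.encoder = nn.Conv2d(channels, latent_dim, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose2d(latent_dim, channels, kernel_size=4, stride=4)

    def encode(self, x):   # x: [B, C, H, W] -> latent [B, D, H/4, W/4]
        return self.encoder(x)

    def decode(self, z):   # latent -> image
        return self.decoder(z)

def encode_frames_independently(vae, frames):
    """frames: [B, T, C, H, W]; slide a one-frame window so steps never mix."""
    latents = [vae.encode(frames[:, i]) for i in range(frames.shape[1])]
    return torch.stack(latents, dim=1)            # [B, T, D, h, w]

if __name__ == "__main__":
    vae = TinyImageVAE()
    chain = torch.randn(2, 3, 3, 64, 64)          # a batch of 3-frame reasoning chains
    z = encode_frames_independently(vae, chain)   # [2, 3, 8, 16, 16]
    final_image = vae.decode(z[:, -1])            # only the last frame becomes pixels
    print(z.shape, final_image.shape)
```

Because every frame is encoded through the same one-frame window, there is no channel through which "motion" from one step can bleed into the next.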

🍞 Hook: Imagine plotting a straight, safe hiking path from start to finish.

🥬 The Concept: Rectified flow learns a smooth path from noise to image.

  • What it is: Training the model to follow a straightened trajectory through latent space.
  • How it works: 1) Mix noise and data. 2) Predict the direction to data. 3) Follow the learned velocity.
  • Why it matters: It’s efficient and yields high-quality generations.

🍞 Anchor: The model walks a direct route from scribbles to a sharp picture (a training-objective sketch follows below).
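Here is a minimal sketch of the rectified-flow objective in the common straight-path (flow-matching) formulation: mix noise and data linearly and train a network to predict the constant velocity pointing from noise to data. ToyVelocityNet is a toy stand-in for the video backbone, and conventions such as timestep direction or loss weighting in the actual paper may differ.

```python
# Minimal rectified-flow training sketch: learn a velocity field from noise to data.
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Tiny MLP standing in for the video backbone's velocity predictor."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))   # condition on the timestep

def rectified_flow_loss(model, x1):
    """x1: clean latents [B, D]. Supervise the velocity of the straight path."""
    x0 = torch.randn_like(x1)            # pure noise at t = 0
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation between noise and data
    target_velocity = x1 - x0            # d(x_t)/dt along the straight path
    return ((model(x_t, t) - target_velocity) ** 2).mean()

if __name__ == "__main__":
    model = ToyVelocityNet(dim=32)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    clean_latents = torch.randn(16, 32)  # pretend these are clean image latents
    loss = rectified_flow_loss(model, clean_latents)
    loss.backward()
    optimizer.step()
    print(float(loss))
```

At sampling time the learned velocity is integrated from noise toward data, which is the "direct route from scribbles to a sharp picture" in the anchor above.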

03Methodology

Overview: At a high level, the pipeline is: Text Prompt → Sample three latent frames (draft → refine → final) with a video model → Encode each frame independently with a video VAE → Decode only the last frame → Output image.
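Below is a hedged Python sketch of that flow. The helper names (sample_latent_chain, decode_frame) and the exact prefix wording are placeholders rather than the paper's real interface; the structure worth noticing is that three latent frames are sampled but only the last one is decoded into pixels.

```python
# Sketch of the CoF-T2I inference flow: sample a 3-frame latent chain, decode only z3.
from typing import Callable, List

def cof_t2i_generate(
    prompt: str,
    sample_latent_chain: Callable[[str, int], List[object]],  # text -> [z1, z2, z3]
    decode_frame: Callable[[object], object],                  # latent -> image
    num_frames: int = 3,
):
    # 1) Prepend an instruction so the backbone produces a refinement chain.
    system_prefix = "Generate a short refinement chain, improving step by step."
    conditioned_prompt = f"{system_prefix} {prompt}"

    # 2) Denoise a short latent "video": draft -> refine -> final.
    latent_chain = sample_latent_chain(conditioned_prompt, num_frames)

    # 3) Intermediate frames stay internal; only the final latent becomes pixels.
    return decode_frame(latent_chain[-1])

if __name__ == "__main__":
    # Dummy callables just to show the control flow.
    image = cof_t2i_generate(
        "a photo of three red apples on a table",
        sample_latent_chain=lambda text, n: [f"z{i + 1}" for i in range(n)],
        decode_frame=lambda z: f"decoded({z})",
    )
    print(image)  # decoded(z3)
```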

Step-by-step (each key step uses sandwich):

  1. Read the prompt and set the plan 🍞 Hook: You know how a coach gives a short game plan before play starts? 🥬 The Concept: A system prefix tells the model to generate a short refinement chain.
  • What it is: A small instruction added to the prompt: “Generate a short refinement chain… improving step by step.”
  • How it works: 1) Attach prefix. 2) Tokenize the text. 3) Feed to the video backbone.
  • Why it matters: Without this nudge, the model might produce one frame, not a useful chain. 🍞 Anchor: For “a bow and arrow made entirely of crystalline ice,” the plan ensures stepwise building: shape first, then icy details.
  2. Generate a three-frame latent chain 🍞 Hook: Imagine making a mini flipbook with just three pages: sketch, fix, final. 🥬 The Concept: The video backbone samples z1, z2, z3 as latent frames.
  • What it is: A tiny sequence that encodes the chain-of-frame logic (semantics → aesthetics).
  • How it works: 1) Start from noise. 2) Denoise with rectified flow. 3) Produce z1 (draft), z2 (refine), z3 (final).
  • Why it matters: Without multiple steps, the model can’t separate fixing meaning from adding style. 🍞 Anchor: Prompt “five candles”: z1 shows candles but maybe wrong count, z2 corrects to five, z3 adds wax texture and glow.
  3. Encode frames independently with the video VAE 🍞 Hook: Think of scanning each drawing page one by one to avoid smears. 🥬 The Concept: Frame-wise encoding keeps steps clean.
  • What it is: The VAE encodes each frame separately, avoiding cross-frame motion artifacts.
  • How it works: 1) Slide a one-frame window. 2) Encode each frame. 3) Keep latents tidy and stable.
  • Why it matters: Without it, frames leak into each other and details blur. 🍞 Anchor: A jellyfish-bridge scene won’t wobble between frames—the bridge stays where it should.
  4. Train with rectified flow (velocity prediction) 🍞 Hook: Imagine a GPS that always points you straight toward your destination. 🥬 The Concept: The model learns a velocity field that moves from noise to data (a chain-supervision sketch follows after this list).
  • What it is: Predicting the direction from a noisy mix to the clean latent target.
  • How it works: 1) Interpolate between noise and data. 2) Predict the vector toward data. 3) Minimize the difference.
  • Why it matters: Without a good path, generation is slower or lower quality. 🍞 Anchor: The model quickly “travels” to the right draft, then refine, then final.
  5. Decode only the final frame 🍞 Hook: Think of a magic curtain—you practice behind it, then show only the best take. 🥬 The Concept: Only z3 is turned into pixels.
  • What it is: The final latent is decoded; earlier frames stay internal.
  • How it works: 1) Keep z1 and z2 inside for reasoning. 2) Decode z3 with the VAE. 3) Output the image.
  • Why it matters: Saves time and focuses on the final result while still learning from the steps. 🍞 Anchor: Users see the polished crystal-feathered eagle, not the rough in-betweens.
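As flagged in step 4, here is a minimal sketch contrasting full-chain supervision with the target-only ablation, reusing the rectified-flow idea from the earlier sketch. How the paper weights or masks individual frames is an assumption here; the point is only that the velocity target is defined for every frame of the chain, not just the final one.

```python
# Sketch: flow loss over the whole 3-frame chain vs. only the final frame.
import torch

def chain_flow_loss(model, chain_latents, supervise_all_frames=True):
    """chain_latents: [B, T, D] clean latents for the chain (z1, z2, z3)."""
    noise = torch.randn_like(chain_latents)
    t = torch.rand(chain_latents.size(0), 1, 1)        # one timestep per sample
    x_t = (1 - t) * noise + t * chain_latents           # straight interpolation
    target_velocity = chain_latents - noise
    pred_velocity = model(x_t, t)                        # backbone sees the whole chain

    per_frame_err = ((pred_velocity - target_velocity) ** 2).mean(dim=-1)  # [B, T]
    if supervise_all_frames:
        return per_frame_err.mean()          # learn the full draft -> refine -> final path
    return per_frame_err[:, -1].mean()       # "Target-Only SFT" ablation: final frame only

if __name__ == "__main__":
    toy_model = lambda x, t: torch.zeros_like(x)   # trivial stand-in for the backbone
    z = torch.randn(4, 3, 16)
    print(float(chain_flow_loss(toy_model, z)),
          float(chain_flow_loss(toy_model, z, supervise_all_frames=False)))
```

The ablation discussed in the results section (0.86 vs. 0.81 on GenEval) corresponds to the difference between these two supervision choices, under the simplifying assumptions above.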

Data: How the three-step chains are built

🍞 Hook: Imagine sorting homework into piles: wrong answer, okay but messy, perfect—and fixing accordingly.

🥬 The Concept: Quality-based routing chooses how to construct the chain from an anchor image.

  • What it is: A classifier decides if the anchor is semantically wrong (F1), semantically right but unrefined (F2), or high fidelity (F3).
  • How it works: 1) Sample anchors from weak/medium/strong T2I models. 2) Classify quality. 3) Pick a build strategy.
  • Why it matters: Without routing, we’d waste anchors or build bad chains.

🍞 Anchor: A semantically wrong backpack image goes to forward refinement; a perfect one builds the chain backward (a routing sketch follows below).
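A minimal sketch of that routing decision follows, assuming a simple three-way quality label. The enum and route names are illustrative and mirror the three construction routes listed after the next concept; the actual classifier is an auxiliary model that is not reproduced here.

```python
# Sketch of quality-based routing: an anchor's quality tier picks the build strategy.
from enum import Enum

class AnchorQuality(Enum):
    F1 = "semantically_wrong"
    F2 = "semantically_right_but_unrefined"
    F3 = "high_fidelity"

def route_anchor(quality: AnchorQuality) -> str:
    """Choose a chain-construction route from the anchor's quality tier."""
    if quality is AnchorQuality.F1:
        return "forward_refinement"        # fix meaning first, then polish
    if quality is AnchorQuality.F2:
        return "bidirectional_completion"  # draft backward, polished final forward
    return "backward_synthesis"            # step a perfect final back to earlier drafts

if __name__ == "__main__":
    print(route_anchor(AnchorQuality.F2))  # bidirectional_completion
```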

🍞 Hook: Think of a tiny, careful edit brush that changes only what you ask.

🥬 The Concept: Unified Editing Primitive (UEP) makes precise, controlled edits.

  • What it is: A planner–editor–verifier loop that adjusts either semantics or aesthetics.
  • How it works: 1) Plan a minimal edit. 2) Apply it. 3) Verify success. 4) Retry up to K times if needed.
  • Why it matters: Without tight control, frames would drift and break consistency.

🍞 Anchor: To fix “Attribute Binding,” UEP changes only color/material while preserving the object and layout (a plan-edit-verify sketch follows below).
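Here is a minimal sketch of the plan-edit-verify loop with up to K retries, as described above. The planner, editor, and verifier are passed in as callables because the paper's auxiliary models are not reproduced here; this only illustrates the control flow of a UEP-style edit.

```python
# Sketch of a plan -> edit -> verify loop with bounded retries (K = max_retries).
from typing import Callable, Optional

def unified_editing_primitive(
    image,
    instruction: str,
    plan: Callable[[object, str], str],      # propose a minimal edit
    edit: Callable[[object, str], object],   # apply the edit
    verify: Callable[[object, str], bool],   # check the edit succeeded
    max_retries: int = 3,
) -> Optional[object]:
    for _ in range(max_retries):
        edit_plan = plan(image, instruction)   # e.g., "change only the color to red"
        candidate = edit(image, edit_plan)
        if verify(candidate, instruction):     # accept only clean, targeted fixes
            return candidate
    return None                                 # give up after K failed attempts

if __name__ == "__main__":
    # Dummy callables: "verify" passes once the instruction's last word shows up.
    result = unified_editing_primitive(
        image="anchor_image",
        instruction="make the apples red",
        plan=lambda img, instr: f"minimal edit for: {instr}",
        edit=lambda img, p: f"{img} + {p}",
        verify=lambda img, instr: instr.split()[-1] in img,
    )
    print(result)
```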

The three construction routes (sketched in code after this list):

  • Forward Refinement (F1 → F2 → F3): Fix wrong meaning, then improve looks.
  • Bidirectional Completion (F1 ← F2 → F3): From a decent middle, synthesize a plausible draft backward and a polished final forward.
  • Backward Synthesis (F1 ← F2 ← F3): From a perfect final, step back to a clean middle, then a slightly flawed draft—verified for consistency.
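Composing the routing and UEP sketches above, here is how a three-frame chain might be assembled for each route. The semantic_fix, aesthetic_polish, and degrade helpers are hypothetical placeholders for UEP calls with different instructions; the paper's exact construction and verification details are not reproduced.

```python
# Sketch: assemble a (draft, refine, final) chain from an anchor image, per route.
def build_chain(anchor, route, semantic_fix, aesthetic_polish, degrade):
    if route == "forward_refinement":            # anchor is F1: meaning is wrong
        f2 = semantic_fix(anchor)
        return [anchor, f2, aesthetic_polish(f2)]
    if route == "bidirectional_completion":      # anchor is F2: right but unrefined
        return [degrade(anchor), anchor, aesthetic_polish(anchor)]
    # backward_synthesis: anchor is F3, a high-fidelity final image
    f2 = degrade(anchor)                          # step back to a clean but plainer middle
    return [degrade(f2), f2, anchor]              # then a slightly flawed draft

if __name__ == "__main__":
    chain = build_chain(
        "anchor", "forward_refinement",
        semantic_fix=lambda x: x + ">fixed",
        aesthetic_polish=lambda x: x + ">polished",
        degrade=lambda x: x + ">rougher",
    )
    print(chain)  # ['anchor', 'anchor>fixed', 'anchor>fixed>polished']
```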

What breaks without each step:

  • No plan prefix: Model may skip multi-frame thinking.
  • No three frames: Meaning and style get tangled; errors persist.
  • No independent encoding: Motion artifacts and blurry reasoning steps.
  • No rectified flow: Slower, noisier path; worse quality.
  • No UEP/routing: Inconsistent chains; poor supervision.

Concrete example with data: Prompt: “a photo of three red apples on a table.”

  • z1 (draft): Two apples, orange-ish. UEP plans a semantic fix.
  • z2 (refine): Three apples, clearly red, on a table. Now aesthetics.
  • z3 (final): Crisp textures, realistic reflections, correct count and color.

Secret sauce:

  • Using a video backbone as a pure visual reasoner fits its natural strength (frame evolution).
  • Training on explicit chains teaches how to fix, not just what to draw.
  • Independent encoding keeps each reasoning step clean and controllable.

04Experiments & Results

The test: They measured how well images follow object-centric instructions (GenEval) and imaginative, compositional prompts (Imagine-Bench). Why these? GenEval checks the tough basics: objects, counts, colors, positions, and attributes. Imagine-Bench checks creative combinations and transformations that stretch understanding beyond simple matching.

The competition: CoF-T2I is compared with standard image generators (like SDXL, SD3, FLUX), unified multimodal models that interleave language reasoning (like Janus-Pro, BLIP3-o, OmniGen2, BAGEL-Think, T2I-R1), and its own video backbone (Wan2.1-T2V-14B). There’s also a strong ablation: Target-Only SFT, which fine-tunes just on final frames (no intermediate steps).

The scoreboard with context:

  • GenEval Overall: 0.86. Think of this as scoring an A when many strong rivals are at B to B+ (0.74–0.82). It surpasses BAGEL-Think (+0.04) and T2I-R1 (+0.07), showing that pure visual reasoning can beat language-heavy planning for pixel-precise tasks.
  • Imagine-Bench Overall: 7.468 versus the base video model’s 5.939—like moving from a solid C+ to a strong A-. The biggest jump is in Multi-Object (7.797 vs 5.383), showing better handling of complex scenes with multiple items.

Surprising findings:

  • Intermediate supervision matters: Target-Only SFT reaches 0.81 on GenEval—good, but still clearly below 0.86. Learning the chain (how to fix) adds real power beyond just the final target.
  • Clean steps beat continuous video encoding: Removing independent frame encoding drops performance to 0.83. Treating steps as separate states prevents entanglement and fake motion.
  • Step-by-step quality rises monotonically: On GenEval, frame scores climb from about 0.56 (draft) → 0.79 (refine) → 0.86 (final). On Imagine-Bench, a similar steady rise appears. This shows true iterative self-correction.
  • Scales robustly: The 1.3B and 14B backbones both improve when trained with CoF-T2I, with especially large relative gains on the smaller model—suggesting the method teaches a transferable refinement habit.

Category insights:

  • Counting and spatial relations benefit strongly from the explicit draft→refine step that locks semantics before aesthetics.
  • Attribute binding (like “fiery orange MacBook” or “crystalline ice bow”) improves when the chain dedicates attention to the property at the right time.

Big picture: A video model acting as a pure visual reasoner, trained on three-step chains, can outperform or rival systems that rely on language interleaving or external verifiers—especially on pixel-faithful, composition-heavy prompts.

05Discussion & Limitations

Limitations:

  • Three steps may not be enough for extremely complex scenes or fine-grained typography in images (like long text within the picture). Longer chains could help but would cost more compute.
  • The method is designed for still images; it doesn’t generate full videos. Extending to text-to-video brings challenges: longer sequences, stable motion, and bigger training costs.
  • Some styles (very abstract art or heavy painterly effects) may not fully benefit from the strict semantic-then-aesthetic schedule, since style and meaning blur together.
  • Data curation relies on auxiliary models (planner/editor/verifier) to build training chains, which adds engineering complexity (though this is offline, not at inference).

Required resources:

  • A capable video backbone (e.g., Wan2.1-T2V), a frozen video VAE, and GPUs that can handle 1024×1024 training.
  • The 64K CoF-Evol-Instruct dataset for supervised chains.
  • Standard flow-matching training setup.

When not to use:

  • Ultra-fast, low-latency generation on very small devices where even a three-step chain may be too heavy.
  • Tasks demanding very long text rendered inside the image (fine typography), where specialized text-rendering modules might be better.
  • Purely stylistic doodles where strict semantic correctness isn’t a priority.

Open questions:

  • How much do we gain from more than three steps? Is there a point of diminishing returns?
  • Can reinforcement learning further improve the chain by rewarding tighter alignment and fewer errors?
  • What’s the best way to extend CoF to video and 3D tasks while keeping steps clean and independent?
  • Can we build lighter versions for edge devices without losing the benefits of visual reasoning?
  • How to make the chain interpretable to users (e.g., optional previews of frames 1–2 for debugging and control)?

06Conclusion & Future Work

Three-sentence summary: This paper turns a video generation model into a pure visual reasoner for text-to-image by producing a short three-frame refinement chain and decoding only the final frame. It trains on a curated dataset (CoF-Evol-Instruct) that teaches the model how to fix meaning first and polish aesthetics second, with independent frame encoding to keep steps clean. The approach achieves top-tier results on GenEval and Imagine-Bench and beats several language-planning systems.

Main achievement: Showing that video models’ natural Chain-of-Frame capability can directly power higher-quality, more accurate image generation—without relying on interleaved language reasoning or external verifiers.

Future directions: Explore longer chains, integrate reinforcement learning for adaptive step improvements, extend to video and 3D generation, and develop user-facing controls that let people steer or preview intermediate steps.

Why remember this: It reframes image generation as a visible, trainable thinking process—draft, refine, finalize—unlocking accuracy, control, and quality by aligning how models create with how humans create.

Practical Applications

  • Product listing generation with exact counts, colors, and layouts for e-commerce images.
  • Storyboarding scenes where stepwise control helps ensure characters and props are placed correctly before styling.
  • Educational illustrations that must accurately depict quantities and spatial relations (e.g., science diagrams).
  • Ad and marketing mockups that match brand colors and object arrangements precisely.
  • UX/UI asset creation where composition must be locked before visual polish to minimize rework.
  • Data augmentation for vision tasks that require controlled counts, colors, or spatial setups.
  • Concept art for films/games where imaginative combinations (e.g., hybrid objects) must still follow clear instructions.
  • Catalog or brochure generation that enforces attribute binding (correct materials, finishes, or textures).
  • Quality-controlled content pipelines where intermediate steps can be optionally audited for compliance.
  • Rapid prototyping of product variations by editing only attributes (color/material) while preserving layout.
Tags: Chain-of-Frame · visual reasoning · text-to-image · video models · progressive refinement · latent frames · video VAE · independent frame encoding · rectified flow · CoF-Evol-Instruct · Unified Editing Primitive · quality-based routing · GenEval · Imagine-Bench · flow matching