
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Intermediate
Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan et al. · 2/12/2026
arXiv

Key Summary

  • UniT teaches one multimodal model to think in steps with pictures and words, so it can check its own work and fix mistakes as it goes.
  • Instead of answering in one shot, the model plans, verifies, and edits across several rounds during test time (when you actually use it).
  • A smart data factory creates step-by-step training stories where an image is made, critiqued, and improved, showing verification, subgoal decomposition, and content memory.
  • Sequential chain-of-thought scaling (think → try → check → fix) beats parallel sampling (make many tries and pick one) and uses compute more efficiently.
  • Models trained on short reasoning chains can generalize to longer ones at test time, showing they learn how to reason, not just copy patterns.
  • On generation and editing, UniT boosts alignment and quality (e.g., +10.34% on OneIG-Bench and higher scores on CompBench).
  • For multi-turn editing, human raters greatly prefer UniT, showing strong memory of past edits and better instruction understanding.
  • On visual reasoning (MIRA), UniT gets much better with more rounds, proving that step-by-step checking helps understanding too.
  • Budget forcing lets you pick how many rounds (and thus time/compute) the model spends thinking and refining.
  • This unified approach makes one model that can understand, generate, and edit images more reliably in the real world.

Why This Research Matters

Real creative and editing tasks rarely fit into one-shot answers; they evolve over multiple steps and require careful checking and memory. UniT shows that a single model can plan, verify, remember, and refine, delivering higher-quality, more faithful results with just a few extra rounds. This means better product photos, design iterations, educational tools, and assistive technologies that truly follow detailed, changing instructions. Because sequential scaling is more compute-efficient than making many random tries, teams can get stronger results without exploding costs. As unified models get stronger, UniT’s approach will scale with them, improving both generation and understanding. In short, it makes AI feel more like a thoughtful collaborator than a single-shot guesser.

Detailed Explanation


01Background & Problem Definition

You know how when you draw a picture, you often sketch first, look at it, fix the parts that look off, and repeat until it looks right? Most AI models that work with images and text don’t do that. They answer in a single pass: one prompt in, one result out—no checking, no improving, no trying again. That’s fine for simple requests, but it breaks down when instructions are long, involve many objects, or change over time across multiple turns.

🍞 Top Bread (Hook) Imagine a chef who cooks a complex meal in one go without tasting in between. Sometimes it will be okay—but for tricky dishes, you need taste–adjust–taste cycles.

🥬 Filling (The Actual Concept) What it is: Unified multimodal models are single AI systems that can read text, look at images, and also create or edit images using one brain. How it works:

  1. The model reads text and images together.
  2. It keeps track of what has been said and seen in a single conversation.
  3. It can switch smoothly between understanding (analyzing) and generation (creating) without swapping models.

Why it matters: Without one unified system, you need separate tools for understanding, checking, and editing, which causes clunky handoffs and lost context.

🍞 Bottom Bread (Anchor) Example: You upload a photo and say, “Make the sky pink and add a kite above the tallest tree.” A unified model can understand the request, edit the photo, and verify the change in one place.

Before this paper, unified models usually worked in one shot. They did not have a built-in habit of thinking through steps, checking if they followed all parts of the instruction, or making careful edits round by round. In language-only AI, a powerful idea called test-time scaling had already shown that giving models more “thinking time” at inference (the moment you use them) can greatly improve results on hard problems like math proofs or code debugging.

🍞 Top Bread (Hook) You know how giving yourself five more minutes on a tough puzzle can help you spot a mistake and fix it?

🥬 Filling (The Actual Concept) What it is: Test-Time Scaling (TTS) means letting the model spend more compute during use to think longer, reflect, and improve answers. How it works:

  1. Allow extra steps of reasoning when answering.
  2. Use verification to check if the steps make sense.
  3. If needed, try again with corrections.

Why it matters: Without TTS, models rush and make avoidable mistakes on complex tasks.

🍞 Bottom Bread (Anchor) Example: When solving a riddle, the AI can write out thoughts, notice a flaw, and correct itself before finalizing the answer.
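The propose–verify–retry loop behind test-time scaling can be sketched in a few lines. This is a minimal toy, not the paper's implementation: the "model" is a deliberately noisy adder and the verifier is exact arithmetic, so all names here are illustrative.

```python
# Toy sketch of test-time scaling: spend extra rounds of compute to
# propose, verify, and correct an answer instead of answering once.

def noisy_solver(x, y, attempt):
    """Toy 'model': gets the sum wrong on its first attempt."""
    return x + y + (1 if attempt == 0 else 0)

def verify(x, y, answer):
    """Toy verifier: checks whether the candidate answer is correct."""
    return answer == x + y

def answer_with_tts(x, y, budget=3):
    """Try up to `budget` rounds, stopping at the first verified answer."""
    answer = None
    for attempt in range(budget):
        answer = noisy_solver(x, y, attempt)
        if verify(x, y, answer):
            break  # verified: stop spending compute
    return answer

print(answer_with_tts(2, 3, budget=1))  # 6 — one-shot: the mistake survives
print(answer_with_tts(2, 3, budget=3))  # 5 — an extra round lets the check catch it
```

The point of the sketch: with budget 1, the wrong first try is final; with a larger budget, verification converts extra compute into a corrected answer.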

But bringing that same magic to multimodal models (that handle text and images together) is tricky. You need not only step-by-step thinking but also the ability to verify visual details, remember content across turns, and refine images safely without ruining the picture quality.

🍞 Top Bread (Hook) Imagine doing a long division problem: you don’t jump straight to the answer—you write each step.

🥬 Filling (The Actual Concept) What it is: Chain-of-Thought Reasoning is the model explaining its steps as it solves a problem. How it works:

  1. Break a big task into smaller steps.
  2. Write out reasoning for each step.
  3. Use the reasoning to guide the next step.

Why it matters: Without steps, it’s hard to spot and fix specific errors.

🍞 Bottom Bread (Anchor) Example: For “two cats playing with one ball,” the model lists what to check: two cats exist; they interact with one ball; positions match the prompt.

Early attempts often used separate models: a generator makes an image, a vision-language model (VLM) checks it, and an editor fixes it. This teacher pipeline is great for producing training examples, but at deployment, running three models is slow and messy. We want one model to do all of that, and to do it in rounds. Past attempts at image generation tried making many versions and picking the best (parallel sampling), but that wastes compute and doesn’t learn from mistakes between tries. What was missing was a way to teach one unified model to plan, verify, remember, and refine, then let it scale its thinking at test time in a compute-efficient way.

Why should anyone care? Because real requests are messy and multi-step: design mockups with many objects, product photos with precise edits across multiple client rounds, or educational puzzles that require looking closely and reasoning about shapes. A model that can plan, check, and refine makes fewer silly errors and aligns better with what people actually want.

🍞 Top Bread (Hook) Think of building a LEGO castle: you follow steps in order.

🥬 Filling (The Actual Concept) What it is: Sequential Reasoning is solving a task step by step in a specific order. How it works:

  1. Define the first small goal.
  2. Complete it and re-check.
  3. Move to the next small goal, using what you learned.

Why it matters: Without order, steps can clash, and progress gets undone.

🍞 Bottom Bread (Anchor) Example: First “zoom into the bird,” then “change background,” then “increase brightness,” instead of trying all edits at once.

🍞 Top Bread (Hook) When you erase and redraw parts of a sketch several times, each pass makes it cleaner.

🥬 Filling (The Actual Concept) What it is: Iterative Refinement means improving the result by repeating small fix-up cycles. How it works:

  1. Produce an initial result.
  2. Check what’s wrong.
  3. Make a targeted edit.
  4. Repeat until it’s good.

Why it matters: Without refinement, small early mistakes stay forever.

🍞 Bottom Bread (Anchor) Example: The model adds missing picture frames to a bookshelf after first removing books.

That’s the world before UniT: unified models that act in one shot, and ad-hoc multi-model systems that are hard to deploy. The gap: we needed a way to train a single unified model to think in multimodal steps and to scale that thinking during use.

02Core Idea

The “aha!” in one sentence: Teach a single multimodal model to plan, verify, remember, and refine across multiple rounds at test time—so more compute becomes more careful thinking, not just more random sampling.

Three analogies:

  1. Chef analogy: Taste–adjust–taste cycles. UniT is the chef who tastes the soup (verify), decides what spice to add (plan), remembers the last changes (content memory), and tries again (refine) until it’s just right.
  2. Homework checker: Work out steps, check each step, fix the wrong steps, and keep your notes so you don’t repeat past mistakes.
  3. GPS rerouting: You follow the route (plan), check if you’re on track (verify), update route if traffic changes (refine), and remember past turns (content memory) to avoid wrong loops.

Before vs. After:

  • Before: One-pass outputs; limited self-checking; parallel sampling wastes tries; separate models do generation, checking, and editing with clumsy handoffs.
  • After (UniT): One unified model does step-by-step multimodal thinking; verification, subgoal decomposition, and content memory emerge; sequential scaling is more compute-efficient than making many independent samples.

Why it works (intuition):

  • Steps create handles for fixing: When requirements are decomposed into subgoals, the model can target a specific fix (like “remove the grass” before “add yellow flowers”).
  • Verification reduces drift: Each round compares the current image to the instructions, cutting off mistake chains early.
  • Content memory keeps coherence: By recalling prior content, the model preserves identities and layouts over multiple edits.
  • Sequential beats parallel: Each step learns from the last; parallel tries don’t share information, so they plateau sooner.
  • Short-to-long generalization: Training on short chains teaches the “habit” of stepping; at test time, the model extends that habit to longer chains when allowed more rounds.

Building blocks (explained with Sandwich):

🍞 Top Bread (Hook) You know how you break a big chore into smaller to-dos?

🥬 Filling (The Actual Concept) What it is: Subgoal Decomposition means splitting a complex instruction into bite-sized steps in a smart order. How it works:

  1. Identify key parts of the instruction.
  2. Order them so that early steps unlock later ones.
  3. Execute steps one by one, checking progress.

Why it matters: Without subgoals, the model may try everything at once and mess up.

🍞 Bottom Bread (Anchor) Example: “Remove shoes,” then “put a helmet on the skateboard,” then “change background to a skatepark.”

🍞 Top Bread (Hook) Imagine a friend who remembers the exact spot of every sticker on your notebook through many redesigns.

🥬 Filling (The Actual Concept) What it is: Content Memory is the model’s ability to remember visual details across rounds. How it works:

  1. Keep a running multimodal context (text + images) of what changed.
  2. Refer back when making new edits.
  3. Preserve identities, positions, and styles unless told otherwise.

Why it matters: Without memory, the subject’s identity drifts, and consistency breaks.

🍞 Bottom Bread (Anchor) Example: The same bear keeps its features after style changes and background edits across rounds.

🍞 Top Bread (Hook) Think of a workshop where the planner, inspector, and builder are all in the same room and talk instantly.

🥬 Filling (The Actual Concept) What it is: Agentic Data Synthesis is an automated way to create training stories where images are generated, checked, and edited with explicit step-by-step reasoning. How it works:

  1. A generator makes an image.
  2. A VLM inspects and writes down what’s wrong and how to fix it.
  3. An editor applies targeted edits.
  4. Repeat until the VLM declares success, saving the whole conversation.

Why it matters: Without these stories, the model won’t learn to plan, verify, and refine like a human.

🍞 Bottom Bread (Anchor) Example: “Bookshelf with no books, only frames” → remove books → add frames → confirm success.

🍞 Top Bread (Hook) Picture a multitool that gets smarter the longer you use it on a project.

🥬 Filling (The Actual Concept) What it is: Multimodal Test-Time Scaling means letting the unified model spend more rounds at inference to improve results via multimodal chain-of-thought. How it works:

  1. Set a budget for rounds (how many images to generate/edit).
  2. In each round, reason in text, then generate or edit an image.
  3. Verify and continue until the budget is spent or you’re satisfied.

Why it matters: Without scaling, you miss easy gains from a few extra rounds of careful thinking.

🍞 Bottom Bread (Anchor) Example: With 4 rounds instead of 1, the model can fix missing objects and wrong colors to match the full prompt.

Put together, UniT shows that one model can learn these habits (plan, verify, remember, refine) from synthetic agentic data and then use extra compute at test time to get consistently better results across both making images and understanding them.

03Methodology

At a high level: User prompt and (optionally) an input image → Round 1: think (plan + verify) → generate/edit image → Round 2: think → refine → … up to the chosen budget → final image (and reasoning trace).

Data: How the “teacher” pipeline creates training stories

  • What happens: Three roles form a loop. A generator (Flux Pro) makes an initial image from a prompt. A VLM (Qwen3-VL) verifies it, writes step-by-step thoughts, and produces a concrete edit plan. An editor (Flux Kontext or Qwen-Image-Edit) applies the edit. Repeat until the VLM says the prompt is satisfied. Keep all the text–image steps.
  • Why this step exists: The unified student model needs examples of how to plan, verify, remember content, and refine. Without these explicit stories, it won’t pick up the habits.
  • Example: Prompt: “Two cats playing with a single ball.” Round 1 image has two cats but two balls. VLM says: remove one ball. Editor removes it. VLM re-checks and confirms success.
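The generate → inspect → edit loop above can be sketched as code. This is a hedged toy under heavy assumptions: the three roles (Flux Pro generator, Qwen3-VL verifier, Flux Kontext / Qwen-Image-Edit editor in the paper) are replaced by stubs operating on a symbolic "image" dictionary, and every function name here is illustrative.

```python
# Toy sketch of the agentic data-synthesis loop: a generator makes an image,
# a verifier critiques it, an editor applies a targeted fix, and every step
# is saved as a training trajectory.

def generate(prompt):
    """Stand-in generator: returns a symbolic 'image' with one error."""
    return {"cats": 2, "balls": 2}  # oops: one ball too many

def inspect(prompt, image, target):
    """Stand-in VLM: returns (ok, edit_plan)."""
    if image == target:
        return True, None
    return False, "remove one ball"

def edit(image, plan):
    """Stand-in editor: applies the targeted fix."""
    fixed = dict(image)
    fixed["balls"] -= 1
    return fixed

def synthesize_trajectory(prompt, target, max_rounds=8):
    """Run generate → inspect → edit until the verifier approves; keep all steps."""
    image = generate(prompt)
    trajectory = [("generate", image)]
    for _ in range(max_rounds):
        ok, plan = inspect(prompt, image, target)
        trajectory.append(("inspect", ok, plan))
        if ok:
            break
        image = edit(image, plan)
        trajectory.append(("edit", image))
    return trajectory

steps = synthesize_trajectory("two cats playing with a single ball",
                              target={"cats": 2, "balls": 1})
print(steps[-1])  # ('inspect', True, None) — final check confirms success
```

The saved `trajectory` is the "training story" the student model later learns from: not just the final image, but the critique and fix at each round.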

Training the unified model (Bagel → UniT)

  • What happens: Fine-tune a single unified model (Bagel base) on about 12K of these multi-round trajectories (average 3.6 rounds). The model learns to interleave text reasoning with image generation/editing and to keep multimodal context.
  • Why this step exists: We want one deployable model. If we kept three models, inference would be slow and hard to manage; also, the student wouldn’t internalize the habits.
  • Example: After training, given “Replace the bread with grilled salmon, then add teriyaki sauce,” the model itself plans steps, applies edits, and checks results.

Inference: how UniT reasons at test time

  • What happens: You set a computational budget C (number of image rounds). In each round, the model first writes multimodal chain-of-thought (plan + verify), then generates or edits an image accordingly. If it finishes too early, we gently nudge it to continue (“Let’s edit the image”) until C is used or it’s clearly done.
  • Why this step exists: Budget controls latency and quality. Without it, you can’t trade time for accuracy.
  • Example: With C=1, you get a one-shot try. With C=4, the model can fix missing frames on a shelf or correct colors and positions.

The Secret Sauce: nested guidance to balance text alignment and visual consistency

  • What happens: The model uses two levels of classifier-free guidance (CFG):
    1. Text guidance: pull the image toward following the current instruction.
    2. Image-history guidance: pull the image toward staying consistent with earlier images in the session. They are applied one after the other (nesting), so you can tune prompt adherence separately from visual continuity.
  • Why this step exists: Without balanced guidance, you either drift away from the prompt or lose consistency across edits.
  • Example: If you’ve already established a skateboard’s look, image-history guidance helps keep it the same while text guidance applies a new background.
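One way to picture the nested guidance is as two successive interpolations over the model's predictions. This is a numeric sketch of one plausible reading, not the paper's exact formulation: real classifier-free guidance operates on diffusion noise predictions, and plain scalars stand in for them here; `s_history` and `s_text` are hypothetical knob names.

```python
# Hedged numeric sketch of nested classifier-free guidance: two guidance
# pulls applied in sequence, so prompt adherence (text scale) and visual
# continuity (history scale) can be tuned independently.

def nested_cfg(p_uncond, p_history, p_text, s_history, s_text):
    """Pull the prediction toward image history first, then toward the text prompt."""
    guided = p_uncond + s_history * (p_history - p_uncond)  # stay consistent with past images
    guided = guided + s_text * (p_text - guided)            # then follow the current instruction
    return guided

# With s_text = 1.0 the history pull is overridden entirely;
# intermediate scales blend continuity against the new instruction.
print(nested_cfg(0.0, 1.0, 3.0, s_history=0.5, s_text=0.5))  # 1.75
```

The nesting order matters in this sketch: the text term operates on the already history-adjusted prediction, which is what lets the two scales be tuned separately rather than fighting over one blended target.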

Budget forcing for multimodal scaling

  • What happens: You pick C (e.g., 1–10). Each round does: think (text chain-of-thought) → image generation/editing. If the model tries to stop early, we prevent EOS and ask it to continue. If it goes beyond C, we keep the Cth image.
  • Why this step exists: In multimodal settings, the main cost is image generation, not extra text tokens. Controlling image rounds is the right knob.
  • Example: On a puzzle image, the model might use Round 1 to zoom-in and describe parts, Round 2 to compare candidates, and Round 3 to finalize the answer.
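Budget forcing reduces to a simple control loop. A minimal sketch under the stated assumptions: compute is counted in image rounds, an early stop is suppressed until the budget C is used, and the C-th image is kept. The model calls are stubs with illustrative names.

```python
# Sketch of budget forcing: run exactly C think-then-render rounds,
# overriding the model's attempts to stop early.

def think_then_render(r):
    """Stub for one round: text chain-of-thought followed by one image."""
    return f"image_v{r}"

def model_emits_eos(r):
    """Stub: the model tries to declare itself done after round 2."""
    return r >= 2

def run_with_budget(C):
    """Force C image rounds; the C-th image is the final output."""
    image = None
    for r in range(1, C + 1):
        if model_emits_eos(r):
            # suppress EOS and nudge the model: "Let's edit the image"
            pass
        image = think_then_render(r)
    return image

print(run_with_budget(1))  # image_v1 — one-shot
print(run_with_budget(4))  # image_v4 — four refinement rounds
```

The knob being forced is the number of images, not text tokens, matching the observation that image generation dominates cost in the multimodal setting.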

Sequential vs. parallel at test time

  • What happens: Sequential scaling reuses past work: each round learns from prior images and thoughts. Parallel sampling makes N totally separate images and picks the best.
  • Why this step exists: We want compute efficiency. Without sharing information between rounds, each new try doesn’t get smarter.
  • Example: In sequential mode, Round 2 removes the wrong extra balloon identified in Round 1. In parallel mode, you might get many different scenes, but none fix the exact issue.
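The information-sharing difference can be made concrete with a toy error model, purely illustrative and not from the paper: sequential rounds each fix one error identified in the previous check, while parallel tries are independent coin flips over the same errors.

```python
# Toy comparison of sequential refinement vs. best-of-N parallel sampling,
# counting compute as the number of images produced.

import random

def sequential(errors, rounds):
    """Each round removes one remaining error (it learned from the last check)."""
    for _ in range(rounds):
        errors = max(0, errors - 1)
    return errors

def parallel_best_of_n(errors, n, rng):
    """Each try is a fresh independent sample; keep the one with fewest errors."""
    tries = [sum(rng.random() < 0.5 for _ in range(errors)) for _ in range(n)]
    return min(tries)

rng = random.Random(0)
print(sequential(3, rounds=3))              # 0 — errors fixed one by one
print(parallel_best_of_n(3, n=3, rng=rng))  # typically > 0 — tries share no information
```

At equal compute (three images each), the sequential strategy is guaranteed to clear all three errors in this toy, while best-of-3 only clears them if one lucky sample happens to avoid every error at once.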

Data quality filters

  • What happens: Keep only useful learning examples by removing: overly long trajectories (over 8 rounds), regressions where edits worsen results, irrelevant edits, tiny visual changes that add noise, and any overlap with test benchmarks.
  • Why this step exists: Without careful curation, the model may learn to waffle, make tiny edits, or drift off-topic.
  • Example: If LPIPS change between rounds is <0.03 (almost no visual difference), we drop that round so the model doesn’t learn to waste a step.
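The filtering rules above can be expressed as a small function. The thresholds (8 rounds, 0.03 visual change) come from the text; the per-round `change` and `regressed` fields are a simplified stand-in for real LPIPS scores and quality judgments.

```python
# Sketch of the trajectory-quality filters: drop over-long trajectories,
# regression rounds, and near-no-op edits.

MAX_ROUNDS = 8
MIN_CHANGE = 0.03  # LPIPS-style threshold: below this, the edit is visual noise

def filter_trajectory(rounds):
    """rounds: list of dicts with 'change' (vs. previous image) and 'regressed'."""
    if len(rounds) > MAX_ROUNDS:
        return None                       # too long: drop the whole trajectory
    kept = [r for r in rounds
            if r["change"] >= MIN_CHANGE  # skip tiny-change rounds
            and not r["regressed"]]       # skip edits that made things worse
    return kept or None

traj = [{"change": 0.40, "regressed": False},
        {"change": 0.02, "regressed": False},  # tiny edit: filtered out
        {"change": 0.25, "regressed": True}]   # regression: filtered out
print(len(filter_trajectory(traj)))  # 1
```

Only the first round survives: the model never sees examples of wasted steps or backsliding, so it doesn't learn to imitate them.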

Putting it together (recipe view)

  • Input: A prompt and optionally an image.
  • Step A: Think—describe, compare with instruction, plan one small change.
  • Step B: Do—generate or edit the image.
  • Step C: Check—verify what still fails, remember content so identities and layouts persist.
  • Loop: Repeat A–B–C up to C rounds or until satisfied.
  • Output: The final improved image plus a coherent trail of reasoning.
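The A–B–C recipe above compresses into one loop. Everything model-shaped below is a stub with hypothetical names (the counter "image" and threshold verifier are not the paper's API); the structure of the loop is the point.

```python
# The think → do → check recipe as a compact loop, with stub steps.

def unit_loop(prompt, image=None, budget=4):
    trace = []
    for _ in range(budget):
        plan = f"compare current image to: {prompt}"  # Step A: think (plan + verify)
        image = (image or 0) + 1                      # Step B: do (stub generate/edit)
        ok = image >= 3                               # Step C: check (stub verifier)
        trace.append((plan, image, ok))
        if ok:
            break  # satisfied before the budget is spent
    return image, trace

final, trace = unit_loop("two cats, one ball")
print(final, len(trace))  # 3 3 — the stub converges after three rounds
```

Note that the loop can end early when the check passes; budget forcing, described earlier, is the inverse knob that keeps it running to a fixed round count.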

Failure modes and mitigations (practical notes)

  • Precise physics or complex geometry can be stubborn; consider stronger verifiers or physics-aware rules.
  • Verification hallucinations can trigger unnecessary edits; better verifiers or confidence checks help.
  • Too many low-impact edits can slowly reduce quality; skip tiny-change rounds or occasionally reset from scratch guided by the accumulated reasoning.

04Experiments & Results

The tests: What did they measure?

  • Compositional generation (OneIG-Bench): Can the model follow complicated text instructions for making images with correct objects, attributes, and spatial relations?
  • Multi-object editing (CompBench): Can it edit specific parts while keeping the rest consistent?
  • Multi-turn editing (ImgEdit): Across three instruction turns, can it remember past changes, interpret new requests, and keep versions coherent (judged by humans)?
  • Visual reasoning (MIRA): Can the model analyze images to solve geometry, physics, puzzles, and causal questions—far beyond its training distribution?

The competition: Baselines

  • Bagel: the base unified model with no chain-of-thought.
  • Bagel+CoT: text-only reasoning but no multimodal iterative refinement.
  • UniT: full multimodal chain-of-thought with iterative planning, verification, memory, and refinement.

Scoreboard with context

  • OneIG-Bench (alignment): UniT improves alignment by about 10.34% over the base single-pass model at C=10. Think of it as moving from a solid B to an A- in following complex instructions.
  • CompBench (overall quality): UniT gains +5.56% going from C=1 to C=10. This is like steadily polishing the same statue until the details match the blueprint.
  • ImgEdit (human ratings on a 0–10 scale): With up to 4 rounds per turn, UniT achieves a 225% relative improvement over C=1 baselines. That’s like judges at a science fair moving you from “needs work” to “excellent” because you remembered past steps and made each new step clear.
  • MIRA (reasoning accuracy): UniT gets +53.33% improvement from C=1 to C=10. It’s the difference between guessing and carefully working through a puzzle by checking each piece.

Sequential vs. parallel (compute framing)

  • Measuring compute as the number of images generated, sequential chain-of-thought scaling achieves similar or better results with fewer images compared to best-of-N parallel sampling. In simple terms: learning from your last try beats starting from scratch over and over.
  • Example: On OneIG-Bench, C=4 sequential can rival N=10 parallel, about 2.5× compute savings.

Surprising findings

  • Short-to-long generalization: Even though training trajectories averaged only 3.6 rounds, at test time the model effectively handled longer chains (averaging ~4.7 rounds) without special retraining. This suggests UniT learned the skill of stepping, not just memorized specific step counts.
  • Cognitive behavior ablation: Removing subgoal planning hurts compositional tasks most; removing content memory crushes multi-turn editing; removing verification weakens visual reasoning—each habit plays a distinct role.
  • Data quality ablation: Off-topic edits damaged compositional performance; tiny-change rounds hurt multi-turn editing; poor-quality end states harmed reasoning—showing that different filters support different capabilities.

What this means in practice

  • If you have time for a few extra rounds, you get noticeably better alignment, fewer silly mistakes, and stronger consistency across turns.
  • If you need lowest latency, parallel sampling might still be okay—but expect it to plateau earlier and waste more tries.
  • As base unified models get stronger, UniT’s method should improve directly because the habits (plan, verify, remember, refine) make better use of stronger skills.

05Discussion & Limitations

Limitations (honest take)

  • Extra rounds mean extra compute and latency; not ideal when you need instant results.
  • Fine-grained physics or tricky spatial layouts can still fail if the base generator can’t represent them well; more rounds can’t fix missing core skills.
  • Verification hallucinations sometimes trigger unnecessary edits, degrading image quality.
  • Very long sequences of tiny edits can slowly add noise; you may need to skip low-impact rounds or occasionally regenerate from a clean slate using the accumulated plan.

Required resources

  • A unified multimodal model fine-tuned on multi-round trajectories (about 12K here) and enough GPU memory to hold several rounds of image generation context.
  • Inference-time budget control to choose how many rounds—common settings are C=1 to C=10, with 1–4 being a sweet spot for many tasks.

When NOT to use

  • Ultra-low-latency deployments where even one extra round is too expensive.
  • Cases demanding precise, verifiable physics or geometry beyond the base model’s representational power.
  • Highly adversarial prompts that can exploit verification weaknesses and lead to oscillating, unhelpful edits.

Open questions

  • How to make verification more robust (e.g., physics-aware checks, stronger visual verifiers) so the model doesn’t chase phantom errors?
  • Can we adaptively allocate rounds—more for hard cases, fewer for easy ones—without separate oracles?
  • What’s the best way to prevent quality drift over very long sequences (smart reset points, noise scheduling, or hybrid regeneration)?
  • How does this extend to video and audio, where temporal consistency and synchronization add new constraints?
  • Can reinforcement learning from human feedback (RLHF) further polish the planning and verification steps for even better human preference alignment?

06Conclusion & Future Work

Three-sentence summary: UniT shows that one unified multimodal model can be taught to plan, verify, remember, and refine across multiple rounds at test time, turning extra compute into smarter, not just bigger, thinking. A teacher pipeline creates step-by-step training stories that induce three key cognitive behaviors—verification, subgoal decomposition, and content memory—so the student model learns to improve iteratively. In practice, sequential chain-of-thought scaling beats parallel sampling and yields strong gains in compositional generation, multi-turn editing, and visual reasoning.

Main achievement: Establishing multimodal chain-of-thought test-time scaling as a practical, compute-efficient paradigm for both making images and understanding them within a single unified model.

Future directions: Add physics-aware verification, adaptive budget allocation, RLHF-tuned reflection, and expansion to audio and video where cross-modal timing and consistency matter. Explore safeguards that skip tiny edits or reset with accumulated plans to prevent quality drift in very long sessions.

Why remember this: UniT turns “think more at test time” from a language-only trick into a general multimodal strategy. It proves that careful step-by-step verification, smart subgoals, and strong memory can make one model handle real-world, multi-step, changing instructions far more reliably—just like how people work through tough tasks.

Practical Applications

  • Multi-turn photo editing assistants that remember past edits and keep subjects consistent across revisions.
  • Design mockup tools that follow complex, step-by-step art director notes with fewer mistakes.
  • E-commerce image refinement that sequentially fixes attribute mismatches (colors, counts, placements).
  • Educational puzzle solvers that show visual chain-of-thought to teach geometry or spatial reasoning.
  • Content moderation and verification that check visual claims step by step before flagging issues.
  • Scientific figure generation where iterative verification ensures labels, counts, and layouts are correct.
  • Storyboarding tools that maintain character identity and scene continuity across multiple panels.
  • Medical or industrial annotation helpers that iteratively refine segmentations or highlight findings.
  • AR/VR asset editing that preserves style and proportions while applying targeted updates.
  • Marketing A/B creative generation that refines toward precise brand guidelines over several rounds.
Tags: Unified multimodal model, Chain-of-thought, Test-time scaling, Iterative refinement, Subgoal decomposition, Content memory, Verification, Sequential scaling, Parallel sampling, Classifier-free guidance, Budget forcing, Compositional generation, Multi-turn editing, Visual reasoning, Agentic data synthesis