
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

Intermediate
Sen Ye, Mengde Xu, Shuyang Gu et al. · 2/17/2026
arXiv

Key Summary

  • Big idea: Make image-making AIs stop, think, check, and fix their own work so they get better at both creating pictures and understanding them.
  • The paper shows that training only for pretty pictures often weakens understanding, and training only for understanding often weakens creativity.
  • Their Reason–Reflect–Refine (R3) framework turns one-shot image generation into a loop: plan → draft → check against the prompt → edit → repeat.
  • They use reinforcement learning (a reward-and-practice method) so the model learns which plans and edits actually improve alignment with the user’s request.
  • A special tree-style RL training gives feedback at each step, making learning stable and faster than training a single long chain all at once.
  • On GenEval++ (a hard instruction-following test), overall generation accuracy jumps from 0.371 to 0.689—like going from a C to a strong A-.
  • Understanding also improves: image–text alignment rises from 60.6% to 73.37%, and VQA accuracy increases from 86.48% to 89.63%.
  • The loop can stop early when the image is already good (“No further edit needed”), saving time on easy prompts.
  • Even in general settings (TIIF benchmark), the approach beats strong baselines, showing the idea transfers beyond one dataset.
  • Takeaway: If you make the model use its understanding while it draws, both skills grow together instead of fighting each other.

Why This Research Matters

When AIs can both understand and generate well, they can follow your instructions exactly while still being creative. Designers get images that match precise briefs without endless manual edits. Teachers and students can create visuals that are faithful to lessons, like correct counts, positions, and labels. Scientists and engineers can prototype accurate diagrams or scene variations that respect constraints. Everyday users get tools that self-correct, saving time and frustration. And by learning to check and fix their own work, future AIs become more trustworthy collaborators, not just flashy image makers.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how when you write a story or draw a picture, your first try isn’t perfect? You plan, make a draft, look it over, then fix the parts that don’t match your idea. That loop makes your work better and helps you understand what you’re trying to say.

🥬 Filling (The Actual Concept):

  • What it is: This paper studies multimodal models—AIs that handle both words and images—and asks why they often get worse at understanding when they get better at generating pictures, and vice versa.
  • How it works (story of the field):
    1. The world before: Big AIs could write and draw. Some were great at making beautiful images (generation). Others were great at answering questions about images (understanding). But doing both well at the same time was tough.
    2. The problem: When researchers trained a model to make images look more realistic, it often forgot how to count objects or follow tricky instructions. When they trained it to answer questions accurately, its pictures got less creative or less faithful to the prompt.
    3. Failed attempts: People tried mixing tasks in one big training soup, or splitting the model into parts (one part for understanding, one for generating), or inventing new ways to turn images into tokens. These helped a bit but didn’t truly stop the tug-of-war.
    4. The gap: What was missing was a way to make the model actually use its understanding while it was in the act of generating—not as a separate, after-the-fact skill.
    5. Real stakes: If AIs can both understand and generate well, they can make posters that precisely follow your instructions, plan step-by-step edits for designers, help kids learn by making accurate visuals, and assist scientists by generating images that truly match detailed descriptions.
  • Why it matters: Without a method that ties understanding into generation, training pushes the model to focus on just making images likely under the data—pretty but sometimes wrong—letting its “understanding muscles” weaken.

🍞 Bottom Bread (Anchor): Imagine asking for “a photo of four cats sitting on a red bench, two wearing hats.” A model that only optimizes for pretty pictures might give a lovely scene—but with three cats and mismatched hats. A model that plans, checks, and fixes uses its understanding to count and correct until the picture matches your words.

New Concept 1 — Multimodal Models 🍞 Hook: Imagine a friend who can read a story, look at a picture, and then write a caption that matches the picture perfectly. 🥬 The Concept: Multimodal models are AIs that work with more than one kind of information at once—like text and images. How it works: (1) They read text inputs, (2) look at or create images, and (3) connect meanings across both. Why it matters: Without this, you’d need two separate AIs—one to understand and one to draw—that don’t learn from each other. 🍞 Anchor: When you type “a blue kite above two trees,” a multimodal model can both check an image for those details and also generate one that follows them.

New Concept 2 — Generative Modeling 🍞 Hook: Think of an artist who can paint new scenes from imagination, not just copy existing pictures. 🥬 The Concept: Generative modeling means making new content—like drawing a new image from a prompt. How it works: (1) The model reads your instructions, (2) forms an internal plan, and (3) produces pixels step-by-step (often using a process called diffusion). Why it matters: If a model can’t generate, it can only talk about images, not create them. 🍞 Anchor: “A cozy cabin under the northern lights” turns into a brand-new image, not a search result.

New Concept 3 — Understanding vs. Generation Trade-off 🍞 Hook: You know how cramming only for art class might hurt your math grade? Focusing on just one thing can make another slip. 🥬 The Concept: Training for super-strong image generation can weaken careful understanding (like counting or spatial reasoning), and training for sharp understanding can dull creative generation. How it works: (1) Generation often learns to match the data distribution, (2) understanding requires precise reasoning, (3) the same model weights get pulled in different directions, creating competition. Why it matters: Without balancing, “pretty” and “precise” don’t grow together. 🍞 Anchor: A diffusion model might draw gorgeous scenes but miss “exactly five apples on the left.” A VQA-tuned model might count well but draw less compelling images.

02Core Idea

🍞 Top Bread (Hook): Imagine building LEGO from instructions. First, you plan, then you build, then you check if it matches the picture, and if not, you tweak pieces until it does. That loop is what makes your model stick together correctly.

🥬 Filling (The Actual Concept):

  • What it is: The Reason–Reflect–Refine (R3) framework turns one-shot image generation into an iterative loop that explicitly uses understanding while generating.
  • Aha! in one sentence: Make the model use its understanding to judge and fix its own drawings as it goes, so both skills improve together.

Multiple Analogies (3 ways):

  1. Painter’s process: Sketch (Reason), step back to judge (Reflect), rework details (Refine), repeat.
  2. Writing an essay: Outline (Reason), proofread for alignment to the prompt (Reflect), edit sentences (Refine), repeat.
  3. Cooking a new recipe: Plan ingredients (Reason), taste and compare to the goal (Reflect), adjust seasoning (Refine), repeat.

Before vs. After:

  • Before: Generation was often a single leap from words to image. If it missed the target, too bad.
  • After: Generation is a staircase: plan → draft → check → edit. Errors are opportunities to learn and improve.

Why It Works (intuition, no equations):

  • Checking changes what you learn: If rewards come from how well the final image matches the prompt, then the model is pushed to inspect, reason, and improve. Reflection becomes practice for understanding.
  • Shorter learning paths: Giving feedback at intermediate steps (plan made sense? edit helped?) reduces guesswork and stabilizes training.
  • Shared win-win: When fixing images requires counting, layout checks, and attribute binding, the understanding circuits light up and get stronger, which then makes the next generation step better too.

Building Blocks (Sandwich style for each): New Concept 4 — R3 Framework 🍞 Hook: You know how teachers ask you to show your work, check it, and then correct mistakes? 🥬 The Concept: R3 is a “think, check, fix” loop inside image generation. How it works: (1) Reason: expand the prompt and draft an image; (2) Reflect: compare the image to the prompt and write what’s wrong or say it’s done; (3) Refine: edit the image based on the reflection; repeat until good. Why it matters: Without this loop, models stop too early and lock in mistakes. 🍞 Anchor: For “four cats wearing hats,” R3 drafts cats, checks the count and hats, and edits until exactly four hatted cats appear.

New Concept 5 — Chain-of-Thought for Images 🍞 Hook: When solving a puzzle, you don’t just jump to the answer—you write steps on paper. 🥬 The Concept: Chain-of-Thought means the model writes small reasoning notes (“<think>…</think>”) to guide generation. How it works: (1) Plan details, (2) note mismatches during checking, (3) turn notes into edit instructions. Why it matters: Without explicit steps, mistakes hide and never get fixed. 🍞 Anchor: “Place two blue cups on the left” becomes a plan, then a check “I see only one cup,” then an edit “Add one more blue cup on the left.”

New Concept 6 — Reinforcement Learning (RL) 🍞 Hook: Like getting points for each good move in a game so you learn winning strategies. 🥬 The Concept: RL gives rewards when images better match prompts, teaching the model which plans and edits help. How it works: (1) Try, (2) get a score, (3) adjust to do better next time. Why it matters: Without rewards tied to outcomes, the model can’t tell which choices made images more aligned. 🍞 Anchor: If an edit changes “three ducks” to actually show three ducks, the model earns more points and learns that edit style.
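The reward-and-practice idea can be shown as a toy loop. This is purely illustrative (the `practice` function, the edit names, and the scores are invented, not the paper's algorithm): edits that earn higher alignment scores accumulate more preference and win out.

```python
import random

# Toy reward-and-practice loop (illustrative only, not the paper's
# algorithm): each tried edit accumulates preference in proportion
# to the reward (alignment score) it earns.
def practice(edit_scores, steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {edit: 0.0 for edit in edit_scores}
    for _ in range(steps):
        edit = rng.choice(list(edit_scores))     # try an edit
        prefs[edit] += lr * edit_scores[edit]    # score it, adjust
    return max(prefs, key=prefs.get)             # best-learned edit

best = practice({"add one duck on the left": 0.9, "recolor the pond": 0.2})
```

After enough tries, the high-reward edit style dominates, which is the core RL intuition here.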

New Concept 7 — Tree-RL Strategy 🍞 Hook: Imagine grading each paragraph of an essay instead of waiting until the whole book is finished. 🥬 The Concept: Tree-RL breaks long generation chains into stages (Reason vs. Reflect–Refine) and gives feedback at each stage. How it works: (1) Store results from Reason; (2) train Reflect–Refine using those; (3) sample diverse cases to learn faster. Why it matters: Without staged feedback, learning from a single long chain is noisy and slow. 🍞 Anchor: The model learns sooner that a clearer plan leads to better drafts, and that a precise edit instruction raises the score.

New Concept 8 — Reward Model (Image–Text Alignment Judge) 🍞 Hook: Think of a fair referee who checks if the picture matches the rules you set. 🥬 The Concept: A reward model scores how well the image fits the prompt. How it works: (1) Compare image to text, (2) give a 0–1 score, (3) guide learning. Why it matters: Without a judge, the model can’t tell “better” from “worse.” 🍞 Anchor: “Yellow grasshopper under the fence” gets a higher score when the grasshopper is actually yellow and under a fence.
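The real judge is a large vision-language model; as a stand-in, here is a minimal keyword-overlap scorer (the function name and inputs are invented for illustration) that returns a 0–1 alignment score:

```python
# Minimal stand-in for an image-text alignment judge (illustrative;
# the real reward model is a large vision-language model).
def alignment_score(prompt_terms, detected):
    """Fraction of required prompt terms found in the image, in [0, 1]."""
    if not prompt_terms:
        return 1.0
    hits = sum(1 for term in prompt_terms if term in detected)
    return hits / len(prompt_terms)

score = alignment_score(
    ["yellow", "grasshopper", "fence"],      # required by the prompt
    {"green", "grasshopper", "fence"},       # found in the image
)  # grasshopper and fence match; the color does not
```

A wrong-colored grasshopper lowers the score, which is exactly the signal the generator is trained against.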

03Methodology

At a high level: Prompt → Reason (plan text + draft image) → Reflect (check + edit instruction or stop) → Refine (apply edit to image) → Repeat → Final image.
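The pipeline above can be sketched as a small control loop. This is a sketch with stubbed stages; `reason`, `reflect`, and `refine` here are placeholders, not the paper's actual models:

```python
STOP = "No further edit needed."

def reason(prompt):
    # Stub: expand the prompt into a <think> plan and draft an image.
    plan = f"<think>draw exactly what the prompt asks: {prompt}</think>"
    image = {"state": "draft", "prompt": prompt}
    return plan, image

def reflect(prompt, image):
    # Stub: compare image to prompt; emit an edit instruction or STOP.
    return STOP if image["state"] == "edited" else "fix the mismatch"

def refine(image, edit):
    # Stub: apply the edit instruction to the image.
    return {"state": "edited", "prompt": image["prompt"]}

def r3_generate(prompt, max_turns=5):
    plan, image = reason(prompt)
    for _ in range(max_turns):
        feedback = reflect(prompt, image)
        if feedback == STOP:          # stop early when the image is good
            break
        image = refine(image, feedback)
    return image
```

The early-exit check is what lets the model skip extra turns on easy prompts.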

Step-by-step (with what, why, example):

New Concept 9 — Reason Step 🍞 Hook: Before drawing, you sketch a blueprint so you don’t forget details. 🥬 The Concept: Reason expands the prompt into a detailed plan and drafts the first image. How it works: (1) Read the user prompt, (2) write a “<think>” plan that clarifies counts, colors, positions, and styles, (3) generate an initial image from that plan. Why it matters: Without a plan, the first image might miss key instructions, making later fixes harder. 🍞 Anchor: Prompt: “A photo of four cats.” Plan: “<think>Show four cats, two on the bench, two on the ground, warm light.</think>” Draft image: shows four cats in those spots.

New Concept 10 — Reflect Step 🍞 Hook: After a draft, you step back and ask, “Does this match what I wanted?” 🥬 The Concept: Reflect compares the current image to the original prompt and writes either (a) “No further edit needed.” or (b) a clear edit instruction. How it works: (1) Look for mismatches (counts, colors, positions, text), (2) create a short, actionable edit command, (3) or stop if everything fits. Why it matters: Without reflection, the model keeps mistakes or over-edits a good image. 🍞 Anchor: For “three ducks by the pond,” if it sees two, it writes: “Add one more duck on the left side near the pond.”

New Concept 11 — Refine Step 🍞 Hook: Like erasing a wrong line and redrawing it neatly. 🥬 The Concept: Refine applies the edit instruction to the image via an image-editing generation process. How it works: (1) Condition on the current image and the edit text, (2) regenerate only what’s needed, (3) output the improved image. Why it matters: Without targeted edits, the model might redraw the whole scene and lose good parts. 🍞 Anchor: If the sign shape is wrong, the edit changes just the sign to an octagon while keeping the bench and background intact.

Putting the loop together:

  • Input: “A picture of a birthday card with the words: ‘HAPPY’, ‘BIRTHDAY’, ‘OUR’, ‘MERMAID’.”
  • Reason: Plan typography and layout, then draft the card image.
  • Reflect: Check if all four words are correct. If one word is misspelled, output: “Fix text to exactly: HAPPY BIRTHDAY OUR MERMAID.” Or stop if perfect.
  • Refine: Edit the text on the card. Repeat if needed.
  • Output: The final image matches all requested words and layout.

Training recipe (how the model learns the loop):

New Concept 12 — Group-Relative Policy Optimization (GRPO) 🍞 Hook: When judging a contest, it helps to compare entries to each other, not just to an absolute scale. 🥬 The Concept: GRPO is an RL method that normalizes rewards within a group of attempts to reduce noise. How it works: (1) Sample several plans/edits for the same prompt, (2) score them, (3) update the policy using each attempt’s advantage compared to the group. Why it matters: Without this, training can be unstable and chase lucky wins. 🍞 Anchor: Try 8 different edit instructions; reward the ones that improve alignment the most compared to the rest.
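The group-relative idea reduces to normalizing each attempt's reward against its own group. Here is a minimal sketch of just that normalization (the real update also involves policy gradients and clipping, omitted here):

```python
import statistics

# Group-relative advantages, the core of the GRPO idea: score each
# attempt against the mean and spread of its own group of attempts.
def group_advantages(rewards, eps=1e-8):
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    return [(r - mean) / (spread + eps) for r in rewards]

advs = group_advantages([0.2, 0.4, 0.6, 0.8])  # 4 attempts at one prompt
```

Attempts above the group mean get positive advantage and are reinforced; below-average attempts are discouraged, regardless of the absolute reward scale.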

New Concept 13 — FlowGRPO for Diffusion 🍞 Hook: If drawing is a multi-step brushstroke process, you want feedback that guides each stroke toward the final goal. 🥬 The Concept: FlowGRPO adapts GRPO to diffusion-style image generation, where images are formed by gradually removing noise. How it works: (1) Treat the denoising as steps in a decision process, (2) assign a final reward when the image is done, (3) adjust all steps to favor paths leading to better images. Why it matters: Without this, the image generator doesn’t learn which sampling paths give better alignment. 🍞 Anchor: Among several noisy-to-clean paths, the ones ending with “exactly five balloons on the right” get reinforced.
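A minimal sketch of the credit-assignment idea, under an assumed simplification: the terminal alignment reward, made group-relative, is shared across every denoising step of its trajectory.

```python
# Sketch: spread each trajectory's final, group-relative reward over
# all of its denoising steps (a simplification of the FlowGRPO idea).
def per_step_credit(step_counts, final_rewards):
    mean = sum(final_rewards) / len(final_rewards)
    credit = []
    for steps, reward in zip(step_counts, final_rewards):
        advantage = reward - mean           # group-relative advantage
        credit.append([advantage] * steps)  # one credit per denoising step
    return credit

# Two sampled paths of 3 denoising steps; the second ends better aligned.
credits = per_step_credit([3, 3], [0.2, 0.8])
```

Every step along the better-aligned path is nudged toward being sampled again, which is how the denoising process itself learns from a reward given only at the end.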

New Concept 14 — Tree-RL Training 🍞 Hook: It’s easier to learn from short math problems than one giant marathon problem. 🥬 The Concept: The training alternates between optimizing Reason and optimizing Reflect–Refine, giving each clear feedback. How it works: (1) Reason creates a buffer of plan+image results, (2) Reflect–Refine trains on these, (3) sampling favors a mix of easy and hard cases to learn faster. Why it matters: Without splitting, long chains make cause-and-effect too fuzzy to learn well. 🍞 Anchor: If a plan format is off (“forgot <think> tags”), that stage gets direct feedback; if an edit instruction helps, that stage is rewarded.
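In outline, the alternating schedule might look like this (the structure is assumed from the description above; the buffer and stage names are illustrative, not the paper's code):

```python
# Outline of a Tree-RL-style alternating schedule (illustrative):
# the Reason stage fills a buffer of plan+draft results, and the
# Reflect-Refine stage trains on buffered results instead of
# replaying the whole chain end to end.
def tree_rl_schedule(rounds):
    buffer, log = [], []
    for step in range(rounds):
        if step % 2 == 0:
            buffer.append(f"plan+draft #{step}")   # Reason stage
            log.append("update Reason")
        else:
            sample = buffer[-1]                    # reuse buffered result
            log.append(f"update Reflect-Refine on {sample}")
    return log

log = tree_rl_schedule(4)
```

Because each stage gets its own feedback on short segments, credit assignment is much less fuzzy than grading one long chain at the end.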

New Concept 15 — Reward Models for Stages 🍞 Hook: Like different judges for different parts of a contest—one checks structure, another checks final performance. 🥬 The Concept: Stage-wise rewards score initial images, reflection correctness (did it improve or stop correctly), and final edits. How it works: (1) Use a vision-language model to score image–prompt alignment, (2) add format bonuses (e.g., correct “<think>…</think>”), (3) reward correct “No further edit needed” only when true. Why it matters: Without stage-specific feedback, the model can game the system or stop too early. 🍞 Anchor: If an edit boosts the alignment score from 0.6 to 0.8, both the reflection and the edit stage get positive credit.
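A toy version of such stage-wise shaping (the weights, bonus values, and format check are assumptions for illustration, not the paper's exact recipe):

```python
# Toy stage-wise reward (illustrative weights): the judge's alignment
# score, plus a format bonus for well-formed <think> tags, plus credit
# or penalty for stopping ("No further edit needed.") only when the
# image truly matches the prompt.
def stage_reward(alignment, trace, stopped, truly_done,
                 format_bonus=0.1, stop_bonus=0.2):
    reward = alignment                       # judge's score in [0, 1]
    if "<think>" in trace and "</think>" in trace:
        reward += format_bonus
    if stopped:
        reward += stop_bonus if truly_done else -stop_bonus
    return reward

good = stage_reward(0.8, "<think>all four cats present</think>",
                    stopped=True, truly_done=True)
bad = stage_reward(0.8, "<think>looks fine</think>",
                   stopped=True, truly_done=False)  # stopped too early
```

Penalizing a premature "No further edit needed." is what keeps the model from gaming the loop by always declaring victory.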

The secret sauce:

  • Explicit self-checking (Reflect) turns understanding into a habit, not a side task.
  • Iterative editing (Refine) makes fixes cheap and focused.
  • Tree-RL with GRPO/FlowGRPO stabilizes learning and points credit to the right steps.

Concrete mini-examples:

  • Counting: Prompt “two suitcases above, three donuts below.” First draft shows four donuts. Reflect: “Remove one donut on the right.” Refine: edits to three. Next Reflect: “No further edit needed.”
  • Color binding: “yellow grasshopper under wooden fence.” If it’s green, Reflect says: “Change grasshopper color to yellow.” Refine: recolors the insect; stop.
  • Text rendering: “SIMPLY BULLET PROOF NOTHING SAUCER.” If letters are wrong, Reflect outputs the exact phrase to render; Refine fixes it; stop.

04Experiments & Results

🍞 Top Bread (Hook): When you study with quizzes after each lesson, you usually learn faster and score higher because you catch mistakes early.

🥬 Filling (The Actual Concept):

  • The Test: The team measured two things—how well images follow instructions (generation) and how well the model can judge and answer about images (understanding).
  • The Competition: They compared their R3-trained model to BAGEL (the strong baseline it’s built on) and to other leading systems.

Scoreboard with context:

  • GenEval++ (hard instruction-following):
    • BAGEL baseline overall: 0.371 (like a C).
    • R3 with Reason only: 0.593 (big jump).
    • Full R3 (Reason + Reflect–Refine): 0.689 (like a strong A-). That’s a +0.32 absolute improvement over BAGEL.
    • In complex “Multi-Count” cases, R3 does especially well, beating strong SOTA like Echo-4o.
  • Understanding (their new tests):
    • ITA (Image–Text Alignment): From 60.60% (BAGEL) to 73.37% (R3 full). That’s like going from a D+ judge to a solid B+ judge.
    • VQA (compositional questions about generated images): From 86.48% to 89.63% overall. A steady, meaningful gain.
  • General benchmark (TIIF): R3 also improves broadly beyond one dataset, showing transfer.

Surprising findings:

  • Reflection matters a lot: Reason-only gave small understanding gains (e.g., ITA +1.16), but adding Reflect–Refine unlocked big jumps (ITA +12.77). So the self-check-and-fix loop is the real engine.
  • Inference-time scaling: Most gains happen after the first Reflect–Refine turn; extra turns help but saturate by 4–5 turns.
  • Co-evolution over training: Early steps look like normal generation, but once reflection “kicks in,” understanding rises—and that then boosts generation further.
  • Domain specificity: Training on counting helps counting most; color training helps color most. Understanding grows where it’s exercised—like muscles used at the gym.

Concrete examples (qualitative):

  • Text correction: From misspellings on a concert poster to the exact requested words.
  • Color fixes: Turning a green grasshopper yellow to match the prompt.
  • Counting fixes: Removing an extra donut so “three donuts below” becomes true.
  • Spatial fixes: Moving or resizing items to satisfy “above/below” or “left/right.”

Why these results make sense:

  • The model isn’t just told “be better”; it’s taught to look, critique, and revise with rewards that care about final alignment. That’s how humans improve, too.

🍞 Bottom Bread (Anchor): Think of an art assignment: “Draw five balloons tied to a red bench.” The R3-trained model sketches, counts, and corrects until exactly five balloons hang from a red bench—then it stops itself to save time.

05Discussion & Limitations

Limitations:

  • Compute and latency: Iterative checking and editing adds time. While the model stops early on easy prompts, each Reflect–Refine turn takes extra seconds. For real-time apps, this may be too slow unless optimized or batched.
  • Domain-specific gains: Understanding improves most where it is trained (e.g., counting or color). Generalizing to brand-new kinds of reasoning needs more diverse training or better reward design.
  • Reliance on external judges: Rewards often come from a large vision-language model. If that judge is biased or wrong, the training signal can mislead the generator.
  • Text rendering remains tricky: Fine-grained typography (e.g., long, exact phrases) can still produce small errors and sometimes the model stops too early.
  • Credit assignment complexity: Even with Tree-RL, deciding which step deserves the credit can be noisy; careful tuning is needed.

Required resources:

  • A unified multimodal backbone (like BAGEL) that can both write and draw (including image editing).
  • Strong reward models (e.g., Qwen-2.5-VL-72B) to score alignment.
  • GPU resources capable of diffusion sampling and RL updates.
  • A prompt dataset and evaluation tools (e.g., GenEval++, TIIF) to measure progress.

When NOT to use it:

  • Ultra-low-latency use cases where even a single extra second is too much.
  • Very simple prompts that a one-shot generator already nails (the loop gives small extra benefit).
  • Settings without a reliable reward model; poor judges can teach bad habits.

Open questions:

  • How to make understanding gains more general, not just domain-specific? Can we design broader, curriculum-style rewards?
  • Can we reduce the number of refinement turns further with better planning or learned stopping rules?
  • How do we best evaluate understanding without relying on proprietary models? Can open, standardized judges replace them?
  • Can the same approach unify even more modalities (audio, video) with tight, reflection-driven loops?
  • How do we make text rendering and exact symbol placement as robust as counting and colors?

06Conclusion & Future Work

Three-sentence summary: This paper confronts a core problem in multimodal AI—improving image generation often weakens understanding, and vice versa. The authors introduce Reason–Reflect–Refine (R3), a loop that makes the model plan, check, and edit its own images, using understanding as part of generation. With reinforcement learning and stage-wise rewards, R3 lifts both creation quality and comprehension across tough benchmarks.

Main achievement: Turning generation from a one-shot jump into a “think, check, fix” loop that trains understanding and generation to grow together.

Future directions: Build reward models and datasets that encourage broader, more general understanding beyond specific domains; speed up the loop with smarter planning and early stopping; push the method to video, audio, and multi-step interactive tasks; and develop open, reliable evaluators to reduce dependence on proprietary judges.

Why remember this: R3 shows that the best way to make an AI draw better is to make it truly look at what it drew—and use that understanding to fix mistakes. That simple shift—from generation alone to generation powered by reflection—turns a tug-of-war into teamwork, pointing the way toward unified models that are both imaginative and accurate.

Practical Applications

  • Interactive design tools that iteratively self-correct layouts, colors, and counts to meet a creative brief.
  • Educational content generators that ensure images match lesson requirements (e.g., exact numbers or spatial relationships).
  • E-commerce listing creators that reliably render product attributes (right color, size, count) and fix mismatches automatically.
  • Data labeling assistants that judge image–text alignment (ITA) and propose precise edits to improve dataset quality.
  • Storyboarding tools that plan scenes (Reason), check for continuity and counts (Reflect), and adjust frames (Refine).
  • Marketing automation that drafts visuals, audits them against campaign specs, and refines until compliant.
  • Accessibility aids that verify images match alt-text, adjusting visuals or text for clarity.
  • Scientific illustration helpers that respect exact constraints (e.g., number of molecules, positions) via reflect–refine loops.
  • Game asset pipelines that generate and then auto-correct props to match style guides and level design rules.
  • Content moderation pre-checkers that spot and fix prompt adherence issues before publication.
#multimodal models · #image generation · #reasoning · #reflection · #iterative refinement · #reinforcement learning · #GRPO · #FlowGRPO · #diffusion models · #image-text alignment · #VQA · #instruction following · #self-correction · #Tree-RL · #BAGEL baseline