
Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Intermediate
Siqi Kou, Jiachun Jin, Zetong Zhou et al. Ā· 1/15/2026
arXiv Ā· PDF

Key Summary

  • Most text-to-image models act like word-to-pixel copy machines and miss the hidden meaning in our prompts.
  • This paper teaches the text part (an LLM) to think first, rewrite the prompt clearly, and then feed that better plan to the image maker.
  • They call this the think-then-generate (T2G) paradigm: plan with reasoning, then paint.
  • A special two-part training (Dual-GRPO) rewards the LLM for good understanding and the image model for beautiful, correct pictures.
  • The method uses step-by-step Chain-of-Thought to turn fuzzy ideas into precise visual instructions.
  • On the WISE benchmark, the model scores 0.79, close to GPT-4o and much better than many open models.
  • On T2I-ReasonBench, it reaches 92.2 quality, showing strong reasoning-to-visual alignment.
  • For image editing, it makes smarter changes (like melting ice cream in the sun) rather than just restyling the photo.
  • A small supervised fine-tuning step activates the LLM’s ā€˜think and rewrite’ habit without breaking the image generator.
  • This is a step toward unified models that can reason, explain, and show their answers as images.

Why This Research Matters

When AI can think before drawing, it makes pictures that truly match what people mean, not just what they say literally. This helps teachers create clear diagrams and step-by-step visuals that students can understand. Designers and storytellers can turn complex ideas into accurate scenes without endless prompt tweaking. Scientists and communicators get images that respect cause and effect and basic physics, improving trust and clarity. Everyday users can edit photos more intelligently—like showing changes over time or explaining how something works—rather than just adding filters. In short, this approach makes AI a better helper for learning, creating, and explaining.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how a friend might draw exactly what you say, even if you meant something deeper? If you say ā€œDraw a holiday about Jesus’s birth,ā€ they might literally draw a baby instead of a Christmas celebration, unless they think about what you really meant.

🄬 The Concept (Text-to-Image Diffusion Models): What it is: These are programs that turn your words into pictures by gradually shaping noise into an image that fits your prompt. How it works: 1) Start with random noise, 2) Read your text, 3) Step-by-step remove noise while following the text clues, 4) End with a picture that matches the description. Why it matters: Without this process, the model can’t reliably turn words into clear, detailed images. šŸž Anchor: Say ā€œa red fox in a snowy forest at sunset.ā€ The model slowly carves a fox, snow, and warm sky out of static, like a sculptor revealing a statue from marble.
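
To make the denoising loop concrete, here is a minimal, framework-free sketch in Python. The `denoiser` callable and the linear noise schedule are stand-ins for a real trained network and scheduler, assumed for illustration only; this is not the paper's implementation.

```python
import numpy as np

def generate_image(denoiser, text_embedding, steps=50, shape=(64, 64, 3), seed=0):
    """Toy diffusion sampling loop: start from noise, repeatedly remove a bit of it.

    `denoiser` is a hypothetical stand-in for a trained network that predicts
    the noise present in `x` at timestep `t`, conditioned on the text embedding.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                        # 1) start from pure random noise
    for t in reversed(range(steps)):                      # walk the timesteps backwards
        predicted_noise = denoiser(x, t, text_embedding)  # 2)-3) follow the text clues
        alpha = (t + 1) / steps                           # toy linear schedule (illustrative only)
        x = x - alpha * predicted_noise                   # remove a fraction of the predicted noise
    return x                                              # 4) a picture that should match the text
```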

šŸž Hook: Imagine you ask a super-smart librarian to explain a tricky sentence before you draw it. That’s what Large Language Models (LLMs) can do—they don’t just read; they reason.

🄬 The Concept (Large Language Models): What it is: LLMs are text experts that can understand and generate language and often know lots of world facts. How it works: 1) Read your words, 2) Predict likely next words, 3) String those into useful thoughts or explanations, 4) Use their knowledge to clarify meaning. Why it matters: If the drawing system only reads words literally, it misses hidden context; an LLM can uncover what you really meant. šŸž Anchor: If you say ā€œa holiday celebrating Jesus’s birth,ā€ the LLM infers ā€œChristmasā€ and adds details like a decorated tree, warm lights, and family dinner.

The world before: Many text-to-image models, even those that include LLM encoders, froze the LLM as a plain ā€œtext to numbersā€ tool. They trained on piles of descriptive captions, so they handled literal prompts well (like ā€œa green striped shirtā€) but stumbled on conceptual prompts (like ā€œillustrate how a lever worksā€). These systems behaved like text–pixel mappers: good at obeying surface words, weak at deeper intent.

The problem: We often want images that reflect ideas, stories, rules, or cultural knowledge. Without reasoning, models confuse concepts (like chemistry reactions, idioms, or time changes) and generate mismatched or shallow pictures. This is frustrating for education, design, and communication.

Failed attempts: 1) Prompt engineering—humans adding lots of hints to the prompt—helped a bit but was tiring and inconsistent. 2) Adding an LLM but keeping it frozen gave stronger word embeddings but didn’t unlock reasoning. 3) Pure image-model RL improved style or legibility but not abstract understanding, because the text encoder never learned to think.

The gap: We needed a pipeline where the LLM doesn’t just encode but actively reasons, rewrites the prompt into a clear visual plan, and then passes that plan to the image generator. Also, both parts must learn together so the ā€œthinkerā€ and the ā€œpainterā€ speak the same language.

šŸž Hook: Imagine a chef writes a recipe before cooking; clear steps prevent kitchen disasters.

🄬 The Concept (Think-Then-Generate, T2G): What it is: A two-step plan—first the LLM thinks and rewrites the prompt into a precise recipe, then the diffusion model cooks the image. How it works: 1) Do step-by-step reasoning (Chain-of-Thought), 2) Summarize into a refined prompt, 3) Turn refined prompt into embeddings, 4) Guide the image generator. Why it matters: Without thinking first, the model guesses and often misinterprets conceptual tasks. šŸž Anchor: ā€œShow the story of marking a boat to find a swordā€ becomes panels detailing each step, not random boat pictures.

šŸž Hook: When you solve math, you write steps to avoid mistakes. Models can do that too.

🄬 The Concept (Chain-of-Thought, CoT): What it is: The model’s step-by-step reasoning notes. How it works: 1) List important facts, 2) Infer what must appear, 3) Organize scene elements, 4) Summarize into a clean prompt. Why it matters: Without CoT, subtle details (like time changes or cause and effect) get lost. šŸž Anchor: For ā€œ2xāˆ’4=10 on a classroom board,ā€ CoT ensures the board shows each solving step, not just random math doodles.
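
To show what the four CoT steps might look like in practice, here is a hypothetical rewrite template as a Python string. The wording is an assumption for illustration, not the prompt used in the paper.

```python
COT_REWRITE_TEMPLATE = """You are the text encoder for an image generator.
User prompt: {user_prompt}

Step 1 - List the important facts implied by the prompt.
Step 2 - Infer what must visually appear (objects, relations, time, cause/effect).
Step 3 - Organize the scene elements (layout, panels, labels).
Step 4 - Output a single refined prompt the image model can follow.

Reasoning:
"""

def build_cot_prompt(user_prompt: str) -> str:
    """Fill the hypothetical template with the user's raw prompt."""
    return COT_REWRITE_TEMPLATE.format(user_prompt=user_prompt)

print(build_cot_prompt("2x - 4 = 10 written and solved on a classroom board"))
```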

šŸž Hook: Training a team is better than training only the striker—you also train the playmaker.

🄬 The Concept (Dual-GRPO): What it is: A two-part reinforcement learning method that trains both the LLM (the thinker) and the diffusion model (the painter) together using image-grounded rewards. How it works: 1) Sample several reasonings for the same prompt, 2) For each, sample several images, 3) Score images on meaning, looks, and physics, 4) Use group-relative advantages to update both models. Why it matters: If you only train the painter or only the thinker, they drift apart; training both keeps them aligned. šŸž Anchor: For ā€œEinstein’s favorite instrument,ā€ the LLM proposes ā€œviolinā€ and the painter draws a wooden violin; the reward reinforces both the correct choice and a beautiful, accurate picture.

Real stakes: This matters for classrooms (diagrams that truly teach), content creation (storyboards that follow logic), science communication (accurate experiments), cultural depictions (holidays shown correctly), and everyday tools (editing photos with cause-and-effect). Thinking before drawing turns AI from an obedient copier into a helpful visual explainer.

02 Core Idea

šŸž Hook: Picture a builder who sketches a blueprint before raising the walls—mistakes drop when planning comes first.

🄬 The Concept (Aha! Moment): What it is: Let the text brain (LLM) think and rewrite first, then hand that refined plan to the image brain (diffusion), and train both together with image-based rewards. How it works: 1) Activate the LLM’s think-then-rewrite habit via supervised examples, 2) For each rewrite, generate images, 3) Score images for meaning and quality, 4) Use Dual-GRPO to improve both the LLM and the image model in tandem. Why it matters: Without planning and co-training, models misread intent or paint pretty but wrong pictures. šŸž Anchor: Asking for ā€œa holiday celebrating Jesus’s birth,ā€ the system outputs a warm Christmas scene—tree, lights, gathering—rather than a literal baby portrait.

Multiple analogies:

  • Teacher and student: The LLM is the teacher writing a clear assignment; the diffusion model is the student drawing it. Better instructions, better drawings.
  • Coach and athlete: The LLM designs the play; the diffusion model runs it. Dual training sharpens both strategy and execution.
  • Recipe and cooking: The LLM writes the recipe (ingredients and steps); the diffusion model cooks the dish. Judging taste, nutrition, and presentation gives feedback to both.

Before vs After:

  • Before: Prompts were treated literally; abstract tasks led to confused or shallow images.
  • After: The model unpacks hidden meaning (using CoT), writes a precise visual plan, and renders images that align with ideas, not just words.

šŸž Hook: You know how understanding the goal makes every step easier?

🄬 The Concept (Why It Works—Intuition): What it is: A feedback loop where reasoning shapes the prompt, and image quality teaches the reasoner what works. How it works: 1) The LLM’s rewrite clarifies key objects, relations, and steps, 2) The image model tries to paint that plan, 3) Rewards tell both parts which choices matched the intended meaning and looked right, 4) Over time, planning and painting synchronize. Why it matters: Without feedback from images, the LLM might write unrenderable or vague instructions; without better instructions, the painter can’t express concept-heavy ideas. šŸž Anchor: If the LLM writes ā€œlavender fields in winter with frost,ā€ but the image looks like summer, the reward nudges the LLM to emphasize frost and the painter to show icy textures.

Building blocks:

šŸž Hook: Imagine selecting the best plan by trying a few drafts, then picking the clearest one.

🄬 The Concept (Group-wise reasoning samples): What it is: Generate several reasoning rewrites for the same prompt. How it works: 1) Sample J different CoT rewrites, 2) For each rewrite, make K images, 3) Compare groups by scores, 4) Improve future rewrites using the group’s relative standing. Why it matters: One rewrite might miss a key fact; comparing groups finds better plans. šŸž Anchor: For ā€œ1912 ship hitting an iceberg,ā€ one rewrite might forget the year; the higher-scoring one includes the Titanic’s look and night setting.
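
A minimal sketch of this group-wise sampling, assuming hypothetical `llm_rewrite`, `diffuse`, and `score_semantic` callables; J and K follow the notation above.

```python
def sample_groups(prompt, llm_rewrite, diffuse, score_semantic, J=4, K=4):
    """For one prompt, sample J reasoning rewrites and K images per rewrite,
    then attach each rewrite's mean image score as its group-level feedback."""
    groups = []
    for _ in range(J):
        rewrite = llm_rewrite(prompt)                    # one CoT + refined prompt
        images = [diffuse(rewrite) for _ in range(K)]    # K images for this plan
        scores = [score_semantic(prompt, img) for img in images]
        groups.append({
            "rewrite": rewrite,
            "images": images,
            "scores": scores,
            "mean_score": sum(scores) / K,               # feedback for this rewrite
        })
    return groups
```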

šŸž Hook: When you learn drawing, you’re judged on accuracy, neatness, and realism.

🄬 The Concept (Three kinds of rewards): What it is: Scores for semantic alignment (did it match meaning?), aesthetics (does it look good?), and physical consistency (is it believable?). How it works: 1) A meaning checker compares prompt and image, 2) A style judge rates beauty, 3) A physics checker scores realism, 4) Combine them with weights. Why it matters: Only focusing on looks makes pretty-but-wrong images; only focusing on meaning can be ugly. šŸž Anchor: A science diagram should be both correct (labels, steps) and clear to read.
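
The weighted mix can be written in a single line; the weights below are illustrative placeholders, not the values used in the paper.

```python
def combined_reward(semantic, aesthetic, physics,
                    w_sem=0.6, w_aes=0.2, w_phy=0.2):
    """Blend the three judges into a single scalar reward.
    The weights are hypothetical; the paper tunes its own mix."""
    return w_sem * semantic + w_aes * aesthetic + w_phy * physics

# Example: a correct but slightly plain diagram still scores well on meaning.
print(combined_reward(semantic=0.9, aesthetic=0.6, physics=0.8))  # 0.82
```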

šŸž Hook: If two friends speak different dialects, they misunderstand each other.

🄬 The Concept (Co-optimization via Dual-GRPO): What it is: Train the LLM and the diffusion model at the same time so their languages match. How it works: 1) Share image-based rewards across both, 2) Keep each close to a safe reference (KL regularization), 3) Use clipping to avoid wild updates, 4) Repeat until stable. Why it matters: Training only one side leads to mismatches and poor results. šŸž Anchor: The LLM stops suggesting odd tokens the painter can’t draw; the painter learns the LLM’s new style of instruction.
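
Here is a sketch of a clipped, KL-regularized update of the kind described above, written with NumPy. The inputs (log-probabilities, advantages) and coefficients are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def clipped_kl_objective(logp_new, logp_old, logp_ref, advantages,
                         clip_eps=0.2, kl_coef=0.04):
    """PPO/GRPO-style surrogate: favor better-than-average samples, clip the
    policy ratio to avoid wild updates, and penalize drift away from a frozen
    reference model (the KL term). All coefficients are illustrative."""
    logp_new, logp_old, logp_ref, advantages = map(
        np.asarray, (logp_new, logp_old, logp_ref, advantages))
    ratio = np.exp(logp_new - logp_old)                   # how far the policy moved
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    kl_penalty = logp_new - logp_ref                      # simple per-sample KL estimate
    return (surrogate - kl_coef * kl_penalty).mean()      # quantity to maximize
```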

Together, these pieces turn ā€œthink-then-generateā€ from a slogan into a working recipe that handles culture, science, space, time, and logic in pictures.

03 Methodology

At a high level: Input prompt → LLM thinks with Chain-of-Thought → LLM rewrites into a refined prompt → Convert refined prompt to embeddings → Diffusion model generates images → Image-grounded rewards score results → Dual-GRPO updates both LLM and diffusion model.
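
At inference time this flow reduces to a few calls. The method names below (`think_and_rewrite`, `encode`, `generate`) are hypothetical stand-ins for the LLM and diffusion components, shown only to make the hand-off explicit.

```python
def think_then_generate(user_prompt, llm, text_encoder, diffusion_model):
    """Inference-time T2G flow: reason, rewrite, embed, then paint."""
    reasoning, refined_prompt = llm.think_and_rewrite(user_prompt)  # CoT + clean plan
    embedding = text_encoder.encode(refined_prompt)                 # plan -> numbers
    image = diffusion_model.generate(embedding)                     # numbers -> pixels
    return image, reasoning, refined_prompt                         # keep the plan for inspection
```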

šŸž Hook: Before cooking a new dish, you collect example recipes and practice them.

🄬 The Concept (Supervised Fine-Tuning, SFT): What it is: A small training step that teaches the LLM to always ā€œthink, then rewrite.ā€ How it works: 1) Build a dataset: raw prompt → long reasoning → refined prompt, 2) Train the LLM to mimic this pattern, 3) Keep its embedding space stable so the image model doesn’t break, 4) Check with visualization (like t-SNE) to confirm stability. Why it matters: Without SFT, the LLM won’t reliably produce clear visual plans. šŸž Anchor: For ā€œtraditional food for Dragon Boat Festival,ā€ SFT pushes the LLM to reason ā€œzongziā€ and describe it clearly for the painter.
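
A sketch of what one SFT record could look like, assuming a simple JSON-style layout; the field names are hypothetical, chosen to mirror the raw prompt, reasoning, refined prompt pattern described above (the zongzi example comes from the text).

```python
import json

sft_record = {
    "raw_prompt": "traditional food for Dragon Boat Festival",
    "reasoning": (
        "The Dragon Boat Festival is traditionally celebrated with zongzi: "
        "sticky rice wrapped in bamboo leaves, often tied with string and steamed."
    ),
    "refined_prompt": (
        "A plate of zongzi (pyramid-shaped sticky rice dumplings wrapped in "
        "green bamboo leaves, tied with string) on a wooden table, warm light."
    ),
}

# The LLM is fine-tuned to produce `reasoning` and then `refined_prompt`
# given only `raw_prompt`, so the think-then-rewrite habit becomes automatic.
print(json.dumps(sft_record, ensure_ascii=False, indent=2))
```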

Concrete recipe steps:

  1. Reasoning activation: The LLM (e.g., Qwen2.5-VL) is fine-tuned on pairs where it writes Chain-of-Thought and a refined prompt. The goal is a dependable habit: think first, then summarize clearly.
  2. Hierarchical sampling: For a user prompt, sample J different reasoning rewrites. For each rewrite, sample K images from the diffusion model. This creates a tree of possibilities.
  3. Image-grounded scoring: For each generated image, compute three scores: semantic alignment (meaning match), aesthetics (looks), and physical consistency (believability). Combine them with chosen weights.
  4. Group-relative updates (LLM): Compare the sets of K images that came from each rewrite. The average score of each set becomes feedback for that rewrite. Normalize scores within the group so the LLM learns which rewrite style works best.
  5. Group-relative updates (Diffusion): Within each rewrite’s K images, scores guide the diffusion model to prefer steps that produce accurate, beautiful, plausible pictures.
  6. Joint training loop: Update both LLM and diffusion model together with Dual-GRPO so the planner and painter stay in sync.

šŸž Hook: When choosing class posters, you compare groups fairly to find the best ideas.

🄬 The Concept (Group Relative Policy Optimization, GRPO): What it is: A training rule that normalizes rewards within related groups to reduce noise. How it works: 1) Gather outputs for the same input, 2) Compute each one’s score relative to the group’s average and spread, 3) Update the model to favor better-than-average samples, 4) Keep changes controlled with a safety clip and stay close to a reference model. Why it matters: Without group-relative scoring, training can be unstable or chase lucky flukes. šŸž Anchor: If four rewrites exist, the one whose images score best gets boosted; the worst gets gently discouraged.
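
A minimal sketch of the group-relative advantage computation using NumPy; the small epsilon guards against a zero spread and is an implementation detail assumed here, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs for the same input:
    better-than-average samples get positive advantages, worse get negative."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rewrites of the same prompt, scored by their images:
print(group_relative_advantages([0.82, 0.55, 0.91, 0.60]))
# The 0.91 rewrite gets the biggest boost; the 0.55 one is gently discouraged.
```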

šŸž Hook: A painter needs a bit of creative wiggle room to explore styles.

🄬 The Concept (Flow-GRPO for Diffusion): What it is: A tweak that gives the image model controlled randomness during generation so it can explore and learn, while staying trainable. How it works: 1) Add tiny noise to the backward steps (turn a fixed path into a slightly random one), 2) This creates multiple trajectories for the same prompt, 3) Score final images and attribute credit to the steps, 4) Learn to choose better steps next time. Why it matters: Without this, the model can’t explore alternatives, and learning stalls. šŸž Anchor: Trying a few brushstroke paths helps the painter find the one that best matches the plan.
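
A toy sketch of turning a deterministic backward step into a slightly stochastic one, so several trajectories exist for the same prompt. The `deterministic_step` callable and the noise scale are hypothetical stand-ins, not the paper's SDE formulation.

```python
import numpy as np

def stochastic_backward_step(x, t, deterministic_step, noise_scale=0.05, rng=None):
    """Take the usual denoising step, then add a small amount of fresh noise.
    The tiny randomness lets the model explore alternative trajectories that
    the image-level reward can later credit or discourage."""
    if rng is None:
        rng = np.random.default_rng()
    x_next = deterministic_step(x, t)                     # the usual fixed path
    x_next = x_next + noise_scale * rng.standard_normal(x.shape)
    return x_next
```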

šŸž Hook: Judges at a fair look for meaning, beauty, and realism, not just one thing.

🄬 The Concept (Stage-specific rewards): What it is: Different rewards for different stages. How it works: 1) During LLM reasoning, average the semantic scores from the K images made from that rewrite and feed that back to the LLM, 2) During diffusion, use a weighted mix of semantic, aesthetic, and physical consistency for each image to guide rendering, 3) Use a scheduler to balance how much each stage’s rewards matter over time. Why it matters: If all steps share one final reward, important parts of the process get ignored. šŸž Anchor: The LLM learns to mention key elements the painter can draw; the painter learns to present them clearly.

šŸž Hook: Sometimes you balance your study time between math and art to grow in both.

🄬 The Concept (Reward Scheduler): What it is: A rule for how strongly LLM rewards and diffusion rewards count during training. How it works: 1) Balanced scheduler: give both equal weight all the time, 2) Staged scheduler: focus on LLM early, diffusion later, 3) Pick what performs best on validation. Why it matters: Bad balance makes either the planner too fancy to draw or the painter too literal to teach. šŸž Anchor: The paper finds the balanced scheduler works slightly better overall.
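
The two scheduler choices can be expressed as simple weight functions over training progress; the 0.5 split and the early/late cutoff below are illustrative assumptions, not the paper's exact schedule.

```python
def balanced_scheduler(progress: float) -> tuple[float, float]:
    """Equal weight on LLM and diffusion rewards throughout training."""
    return 0.5, 0.5

def staged_scheduler(progress: float) -> tuple[float, float]:
    """Emphasize the LLM (planner) early and the diffusion model (painter) later.
    `progress` runs from 0.0 (start of training) to 1.0 (end)."""
    return (0.8, 0.2) if progress < 0.5 else (0.2, 0.8)

# The paper reports the balanced variant working slightly better overall.
w_llm, w_diffusion = balanced_scheduler(progress=0.3)
```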

Putting it together with an example:

  • Prompt: ā€œShow how a lever lifts a heavy rock in three panels.ā€
  • LLM CoT: Defines the pivot, arm, effort, load, and sequence across panels.
  • Refined prompt: ā€œThree-panel diagram: Panel 1 shows a long rod over a small pivot with a rock on one side; Panel 2 shows a child pushing down the other side; Panel 3 shows the rock raisedā€¦ā€
  • Diffusion: Generates K images per rewrite, scores them, and both models update via Dual-GRPO. Over time, the LLM consistently mentions pivots and forces, and the images consistently show correct, clear mechanics.

04 Experiments & Results

šŸž Hook: Think of a school fair with different booths—culture, science, space, and time—testing how well drawings match tricky ideas.

🄬 The Concept (WISE Benchmark): What it is: A big test with 1000 prompts across culture, space-time, and natural science that checks whether images match world knowledge and reasoning. How it works: 1) Give models complex prompts, 2) Score their images for correctness and alignment, 3) Compare across many sub-domains, 4) Report an overall score. Why it matters: Without such tests, models might look good but fail on real understanding. šŸž Anchor: A prompt about a 1912 iceberg collision expects Titanic-like scenes, not random ships.

šŸž Hook: Another fair booth asks for diagrams, idioms, and scientific steps, then grades how well the picture follows the plan.

🄬 The Concept (T2I-ReasonBench): What it is: A test focused on reasoning-to-visual alignment across idioms, textual image design, entity reasoning, and scientific reasoning. How it works: 1) For each prompt, an evaluator lists what must appear, 2) A multimodal judge checks if the picture includes those elements, 3) Scores are aggregated by category, 4) Both accuracy and quality are reported. Why it matters: It measures if a model can turn step-by-step thinking into the right visuals. šŸž Anchor: For ā€œsolve 2xāˆ’4=10 on a board,ā€ credit is given only if the image shows the right steps, not just any math scribbles.

The competition: The paper compares against diffusion models (like FLUX and Stable Diffusion variants), unified multimodal models (like Bagel, HunyuanImage), and strong closed systems (GPT-4o, Gemini-2.0). Many baselines are great at pretty pictures but struggle with deep semantic tasks when prompts are conceptual.

The scoreboard with context:

  • On WISE, the proposed method scores 0.79. That’s like getting an A when others got B’s and C’s. It is close to GPT-4o’s score of about 0.80 and much higher than vanilla Qwen-Image’s (about 0.61).
  • On T2I-ReasonBench, it reaches 92.2 in quality, surpassing even Gemini-2.0 in this assessment, and showing that planning-then-painting pays off in reasoning-heavy prompts.
  • Zero-shot ā€œjust think a bitā€ (adding CoT at inference without training) gives only a small bump (e.g., WISE from ~0.61 to ~0.65). The big gains come from SFT plus Dual-GRPO co-training.
  • For image editing, the T2G-trained editor beats strong open baselines and even surpasses a closed model on UniREditBench, and shows strong results on RISEBench, especially in temporal and causal edits.

Surprising findings:

  • Balanced scheduler (training both thinker and painter together) outperforms staged training (first thinker, then painter). Joint growth keeps the languages aligned.
  • After Dual-GRPO, the LLM stops suggesting odd or unrenderable tokens; the diffusion model adapts to the LLM’s clearer instructions, raising both semantic accuracy and aesthetics.
  • The method fixes conceptual misunderstandings in editing: for ā€œice cream after one hour in the sun,ā€ it shows melting (cause and effect) rather than just a sunny filter.

Qualitative highlights:

  • Cultural prompts (e.g., Christmas) become rich, context-aware scenes rather than literal-only renderings.
  • Science and math prompts show correct steps, labels, and plausible physics.
  • Multi-panel stories follow logical order and stay visually consistent across panels.

Overall, the results show that thinking first and co-training both parts turn the model from a literal copier into a reliable visual reasoner.

05 Discussion & Limitations

Limitations:

  • Data dependence: The supervised fine-tuning relies on curated examples with high-quality reasoning and rewrites. If coverage is narrow or biased, the model may underperform on unseen concepts.
  • Compute and sampling: Generating J rewrites and K images per prompt for training is compute-heavy, and inference can be slower when thinking steps are long.
  • Reward design: If semantic, aesthetic, or physics scorers are imperfect, the model can be nudged in the wrong direction.
  • Safety and culture: Better reasoning can amplify biases if training data is skewed; cultural depictions need careful evaluation.

šŸž Hook: If a game only rewards flashy moves, players might learn to show off instead of playing correctly.

🄬 The Concept (Reward Hacking): What it is: When a model learns to chase the scoring function instead of true task quality. How it works: 1) The model detects patterns that trigger high scores, 2) It repeats them even if they miss real meaning, 3) Over time, it optimizes the metric, not the user’s intent. Why it matters: Misaligned rewards can create pretty-but-wrong images. šŸž Anchor: A model could overuse saturated colors to impress an aesthetic scorer while missing key scene details.

Required resources:

  • A capable LLM encoder (e.g., Qwen2.5-VL or similar), a diffusion backbone (e.g., a DiT), and GPU resources for Dual-GRPO.
  • Access to or construction of reasoning-rich SFT data (raw prompt → CoT → refined prompt).
  • Image evaluators or reward models for semantic, aesthetic, and physical checks.

When not to use:

  • Ultra-low-latency generation with minimal compute, where extra reasoning steps are too slow.
  • Tasks that demand strict photorealism from minimal text (e.g., quick style swaps), where simple text–pixel mapping might suffice.
  • Domains with scarce or unreliable reward signals (e.g., unusual physics) where scoring can mislead training.

Open questions:

  • How to automatically learn better reward models that track human judgment across cultures and domains?
  • Can we compress or distill the think-then-generate process for faster inference without losing reasoning quality?
  • How far can this approach go for video or 3D, where time and physics are even more complex?
  • What is the best balance between LLM creativity and the diffusion model’s renderability to avoid unpaintable plans?

06 Conclusion & Future Work

Three-sentence summary: This paper turns text-to-image systems from literal copiers into thoughtful visual explainers by making the LLM think first and then guide the painter. A joint reinforcement learning method (Dual-GRPO) uses image-based rewards to improve both the LLM’s planning and the diffusion model’s rendering at the same time. The result is strong gains on reasoning-heavy benchmarks and smarter image editing that respects cause, effect, and context.

Main achievement: Establishing a practical, training-stable think-then-generate pipeline that co-optimizes reasoning (LLM) and rendering (diffusion) to close the gap between user intent and final images.

Future directions: Expand to video and 3D with temporal reasoning; develop richer, fairer reward models that better reflect human preferences; distill or cache reasoning for faster inference; and integrate safety and cultural sensitivity checks directly into the reward loop.

Why remember this: It shows that planning before painting—and training both planner and painter together—lets AI handle deeper ideas, not just surface words. That shift moves us toward unified models that can reason, explain, and demonstrate knowledge visually, making AI more helpful for teaching, design, and everyday problem-solving.

Practical Applications

  • Classroom diagrams that show correct scientific steps (e.g., phases of the Moon, lever mechanics).
  • Storyboarding with consistent, logically ordered panels for films or comics.
  • Concept art that reflects cultural and historical context accurately (e.g., holidays, notable events).
  • Math instruction images that display full solution steps on a board or worksheet.
  • Smart image editing that applies cause-and-effect changes (aging, weathering, melting, growth).
  • Technical illustrations for manuals that highlight parts and operational flow.
  • Marketing visuals that faithfully depict product features and use-cases without misleading details.
  • Accessible explainers for news and science articles that combine accuracy with visual clarity.
  • Prototype-to-visual pipelines where rough ideas are rewritten into precise renderable prompts.
  • Educational quiz generators that create images matching reasoning-based criteria.
#think-then-generate, #reasoning-aware text-to-image, #LLM encoder, #Chain-of-Thought, #Dual-GRPO, #Flow-GRPO, #diffusion transformer, #semantic alignment, #aesthetic and physics rewards, #text-to-image benchmarks, #WISE, #T2I-ReasonBench, #image editing reasoning, #reinforcement learning for diffusion, #Qwen-Image