Show, Don't Tell: Morphing Latent Reasoning into Image Generation
Key Summary
- LatentMorph teaches an image-making AI to quietly think in its head while it draws, instead of stopping to write out its thoughts in words.
- It keeps a small visual memory of what has been drawn so far and decides adaptively when to pause and rethink.
- When rethinking is needed, it turns those silent thoughts into gentle steering signals that guide the next brushstrokes.
- Because the thinking stays in the AI's hidden space (not text), it keeps rich details and avoids losing information.
- A special "invoker" learns with reinforcement learning to trigger thinking only when it will help, saving time and tokens.
- On tough tests like GenEval and T2I-CompBench, it boosts the base model Janus-Pro by 16% and 25% respectively.
- It also beats text-based reasoning baselines on abstract tasks (like WISE and IPV-Txt) by up to 15% and 11%.
- Inference is much faster (44% less time) and uses fewer tokens (51% fewer) than explicit reasoning methods.
- Humans agreed with the model's timing for when to think 71.8% of the time, showing good cognitive alignment.
- The method plugs into autoregressive generators without changing their structure by injecting control tokens into the KV cache.
Why This Research Matters
LatentMorph makes image generation more faithful to your instructions while running faster and using fewer tokens. It fixes a long-standing bottleneck by keeping the model's reasoning in its natural hidden space instead of forcing it into text. This helps with tricky requests, like exact object counts and precise layouts, that often trip up other systems. Designers, teachers, and creators can get better first results, saving time and reducing retries. The adaptive timing feels human, which means fewer awkward pauses and smoother workflows. Because it is model-agnostic and lightweight, it can be adopted by many autoregressive generators. Overall, it points toward creative AI that thinks more like we do: quietly, continuously, and effectively.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine drawing a picture while telling a story. Sometimes you pause, look at what you've sketched, and adjust before adding the next part. You don't say your thoughts out loud every time; you just "feel" what to fix.
The Situation Before: Text-to-image (T2I) AIs turned words into pictures very well, but mostly like a one-shot translation machine: text in, pixels out. They didn't naturally pause to reflect or refine mid-drawing like people do when creating art. Some systems tried to add "reasoning" by asking a language model to write down intermediate thoughts (chain-of-thought) and feed them back to the image model at set times.
- How that worked: (1) Generate part of the image, (2) Decode it to pixels, (3) Ask a text reasoner to explain what to do next in words, (4) Re-encode those words back into the image generator, (5) Repeat.
- What went wrong: Every decode/re-encode pass costs time, uses up token budget, and squeezes rich visual hunches into narrow text, which can miss subtle details like fine textures or lighting.
The Problem: Forcing the AI to explain itself in words at fixed steps causes three pains:
- 1) Information loss: rich visual thoughts get flattened into text.
- 2) Inefficiency: frequent decoding/re-encoding slows everything down.
- 3) Cognitive mismatch: humans don't narrate each micro-thought while drawing; we guide ourselves with quiet, continuous intuitions.
Anchor Example: Think of trying to describe a sunset's glow using only a few words versus just adjusting the colors with your eyes and brush; the second way keeps more nuance and is faster.
Hook: You know how sometimes you keep a quick mental snapshot of what your drawing already looks like so you don't forget the big picture?
What People Tried and Why It Fell Short: Two main styles emerged.
- External-loop: A separate language model edits prompts or critiques results after seeing images. Good feedback, but lots of back-and-forth.
- Internal-loop: A unified model pauses at fixed steps to explain what to fix in text during generation. Better integration, but still text bottlenecks and fixed timing. Both styles usually rely on explicit written thoughts at preset times, not the flexible, quiet tweaks humans do.
Anchor Example: It's like asking a friend to stop you every 20 seconds while you draw and make you write a paragraph about what to do next: helpful sometimes, but slow and not how artists naturally work.
Hook: Imagine if the AI could keep its thoughts as colors and shapes in its head instead of turning them into sentences.
The Missing Piece: Let the model think in its hidden space (continuous latents) and only pause when needed. Keep a compact visual memory of progress, let a small module decide if it's time to think, and then steer the next strokes directly, with no text detour.
Anchor Example: A painter silently steps back, scans the canvas, and adds a few guiding brushstrokes to fix composition; no diary entry required.
Hook: Why should anyone care? Because better, faster, and more faithful images mean nicer book covers, more accurate product mockups, and less fuss for artists.
Real Stakes: If models can refine like humans do, we get images that follow instructions precisely (like correct counts and positions), capture abstract ideas (like surreal physics), and do it faster with fewer resources. This helps designers, teachers, storytellers, and anyone who turns ideas into visuals.
Anchor Example: When you ask for "three red birds on the left branch and one blue on the right," the model that thinks silently mid-draw is more likely to deliver exactly that, without five slow retries.
02 Core Idea
Hook: You know how you can silently adjust your plan while building LEGO, with no need to say every step out loud?
Aha in One Sentence: LatentMorph lets an image model quietly reason inside its hidden layers and gently steer the next tokens at the right moments, skipping slow text explanations.
Multiple Analogies:
- Art Teacher: Instead of writing a long critique note, the teacher lightly taps the canvas to nudge the student's brush in real time.
- GPS Recalculation: The car GPS doesn't read you a novel; it quietly recalculates and updates your route when needed.
- Orchestra Conductor: The conductor doesn't stop the concert to give a speech; they adjust tempo and volume with subtle gestures.
Anchor Example: When generating "a rabbit near a train," the model notices the rabbit is missing and silently adjusts the next strokes so the rabbit appears at the right place and size.
Hook: Imagine keeping a tiny scrapbook of what you've drawn so far to avoid repeating mistakes.
Concept 1: Visual Memory
- What it is: A tiny, smart summary of the recent and overall drawing progress stored as hidden vectors.
- How it works:
- Short-term condenser packs the latest steps into a small local memory.
- Long-term condenser keeps a compact summary of the whole history.
- These memories are updated as the image grows.
- Why it matters: Without it, the model forgets what's already on the canvas and can't judge when to rethink.
- Anchor: Like glancing at a thumbnail of your drawing to remember global composition.
Hook: You often "just know" what tweak to make next.
Concept 2: Implicit Latent Reasoning
- What it is: The model's quiet, continuous thinking inside hidden states instead of written text.
- How it works:
- Read visual memory.
- Form a latent thought (a dense vector) about what to fix.
- Turn that thought into guidance signals.
- Steer the next tokens while staying silent.
- Why it matters: No wordy detours; fewer lost details; faster decisions.
- Anchor: Adjusting color balance by eye instead of explaining it.
Hook: When do you pause to check your work? Not every second, only when it feels necessary.
Concept 3: Cognitive Alignment
- What it is: Making the model's pause-and-fix rhythm match how humans naturally create.
- How it works:
- Track alignment with the prompt, confidence, and changes over time.
- Use reinforcement learning to learn good timing.
- Trigger reflection only when signals say it will help.
- Why it matters: Avoids annoying, wasteful pauses; fixes at the right moments.
- Anchor: Like checking a map only when the streets look unfamiliar.
Hook: Picture shaping a lump of clay into a neat figure: subtle pushes at the right spots.
Concept 4: Latent Morphing
- What it is: Turning those quiet thoughts into the exact kind of signals the image generator understands.
- How it works:
- Combine latent thought + long-term memory + prompt embedding.
- Translate them into control vectors.
- Insert them so future tokens follow better paths.
- Why it matters: Thoughts become actions without breaking the flow.
- Anchor: Whispering cues to a performer through in-ear monitors.
Hook: Imagine adding a few helpful guide-notes to a music score mid-performance so the next bars sound right.
Concept 5: Dynamic Control Injection
- What it is: Injecting small control tokens into the model's attention cache to guide upcoming predictions.
- How it works:
- Build control tokens from the translated thought.
- Insert them into the key-value (KV) cache of the transformer.
- Let the next attention look at these tokens for better choices.
- Why it matters: Guidance is added softly and instantly; no need to rewrite the prompt or stop the show.
- Anchor: Sticky notes added beside a recipe while you keep cooking.
Before vs After:
- Before: Stop-and-talk loops, text bottlenecks, fixed steps.
- After: Quiet, continuous, adaptive steering that's faster and more faithful.
Why It Works (intuition): Hidden vectors can store rich, fine-grained visual cues that words struggle to express (like subtle textures and precise spatial layouts). By keeping thinking inside this space and injecting guidance into attention, the generator gets precise hints exactly when needed.
Building Blocks:
- Short-term condenser, Long-term condenser (visual memory)
- Invoker (RL-trained timing of reflection)
- Translator (latent thoughts → control signals)
- Shaper (inject signals into KV cache)
- All running inside an autoregressive generator stream.
03 Methodology
High-Level Recipe: Prompt → Generate tokens while monitoring → (Sometimes) Reason in latent space → Translate thought → Inject control tokens → Continue generating → Final image.
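The recipe above can be pictured as a plain control-flow loop. Everything in this sketch is a hypothetical stand-in: the callable names, their signatures, and the list-based "KV cache" are illustrative assumptions, not the paper's actual code.

```python
def run_latentmorph(n_tokens, gen_step, monitor, invoke, reason_and_inject):
    """Control-flow sketch: generate, monitor, and sometimes reason.

    All four callables are hypothetical stand-ins:
      gen_step(t, kv)                -> hidden state for token t
      monitor(history)               -> dict of monitoring signals
      invoke(signals)                -> True when latent reasoning should fire
      reason_and_inject(history, kv) -> new KV cache with control tokens added
    """
    kv, history, pauses = [], [], 0
    for t in range(n_tokens):
        history.append(gen_step(t, kv))          # autoregressive generation step
        if invoke(monitor(history)):             # invoker: CONTINUE vs REASON
            kv = reason_and_inject(history, kv)  # latent thought -> control tokens
            pauses += 1
    return history, kv, pauses
```

With stub callables (e.g., an invoker that fires every fourth token), the loop shows how reasoning pauses interleave with ordinary generation without ever leaving the model's hidden space.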
Hook: Think of it like drawing while wearing smart glasses that show tiny hints only when you're drifting off-plan.
Step 1: Autoregressive Generation with Monitoring
- What: The model creates image tokens one by one, while we watch its recent hidden activity.
- How:
- As tokens are generated, hidden states form a trail.
- Every window (e.g., 64 tokens), a short-term condenser compresses the last steps into a small vector memory.
- We compute signals: prompt similarity, uncertainty, recent change, and stability.
- Why needed: Without monitoring, we can't know when to rethink.
- Example: If similarity to the prompt drops and uncertainty rises, that's a hint we might be going off track.
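A toy version of these four monitoring signals, with mean-pooling standing in for the learned short-term condenser. The exact formulas (cosine similarity, Shannon entropy, endpoint delta, pooled variance) are illustrative assumptions, not the paper's definitions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def entropy(probs):
    # Shannon entropy of the next-token distribution (higher = more uncertain).
    return -sum(p * math.log(p) for p in probs if p > 0)

def monitor_signals(prompt_emb, window_states, next_token_probs):
    """Summarise one window of hidden states into the four monitoring signals."""
    dim = len(window_states[0])
    # Short-term memory: mean-pool the window (stand-in for the condenser).
    local_mem = [sum(s[d] for s in window_states) / len(window_states)
                 for d in range(dim)]
    sim = cosine(prompt_emb, local_mem)                  # prompt alignment
    unc = entropy(next_token_probs)                      # token uncertainty
    delta = math.sqrt(sum((window_states[-1][d] - window_states[0][d]) ** 2
                          for d in range(dim)))          # recent change
    var = sum(sum((s[d] - local_mem[d]) ** 2 for d in range(dim))
              for s in window_states) / len(window_states)  # stability
    return {"similarity": sim, "uncertainty": unc,
            "delta": delta, "variance": var}
```

Low similarity plus high entropy is exactly the "going off track" pattern the example describes.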
Hook: You don't always stop to think, only when you sense something's off.
Step 2: The Invoker (When to Think)
- What: A tiny policy network decides CONTINUE vs REASON.
- How:
- Input signals: semantic consistency with the prompt, token uncertainty (entropy), recent change (delta), and variance over time.
- Output: probability of invoking reasoning now.
- Training: Reinforcement Learning (GRPO) rewards better images and lightly penalizes over-invoking.
- Why needed: Fixed schedules waste time or miss critical moments.
- Example: On an easy prompt, it rarely pauses; on complex spatial scenes, it pauses more often.
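A minimal sketch of such a policy, assuming a single logistic layer over the monitoring signals. The real invoker is a small network trained with GRPO; the weights, bias, and threshold below are made-up placeholders.

```python
import math

def invoker(signals, weights, bias, threshold=0.5):
    """Tiny logistic policy mapping monitoring signals to P(invoke reasoning).

    `weights` and `bias` are hypothetical learned parameters; in the paper
    the invoker is trained with reinforcement learning (GRPO).
    """
    score = bias + sum(weights[k] * signals[k] for k in weights)
    p_reason = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    return ("REASON" if p_reason > threshold else "CONTINUE", p_reason)
```

With (made-up) negative weight on prompt similarity and positive weight on uncertainty, the policy pauses when the image drifts and stays quiet on easy prompts.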
Hook: When you do pause, you look at the whole canvas, not just the last stroke.
Step 3: Long-Term Visual Memory (What We've Done So Far)
- What: A long-term condenser summarizes the entire token history into a compact memory.
- How:
- Process the history in chunks with attention to pick the most informative bits.
- Keep a small set of memory tokens and a pooled summary.
- Feed this, plus the prompt, to the reasoning branch.
- Why needed: The reasoner needs a big-picture snapshot without decoding a full image.
- Example: It remembers that "two birds already placed on the left" so it won't add extra birds accidentally.
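A toy stand-in for the long-term condenser: each chunk of the history is pooled with attention weights computed against a query such as the prompt embedding, yielding one memory token per chunk plus a pooled summary. The chunking, scoring, and pooling choices here are illustrative assumptions, not the paper's architecture.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def condense_history(history, chunk_size, query):
    """Compress the full token history into one memory vector per chunk.

    `query` (e.g. a prompt embedding) picks out the most informative states;
    a hypothetical stand-in for the learned long-term condenser.
    """
    dim = len(query)
    memory_tokens = []
    for i in range(0, len(history), chunk_size):
        chunk = history[i:i + chunk_size]
        scores = [sum(q * h for q, h in zip(query, state)) for state in chunk]
        weights = softmax(scores)
        pooled = [sum(w * state[d] for w, state in zip(weights, chunk))
                  for d in range(dim)]
        memory_tokens.append(pooled)
    # Pooled summary: mean of the memory tokens.
    summary = [sum(m[d] for m in memory_tokens) / len(memory_tokens)
               for d in range(dim)]
    return memory_tokens, summary
```

The point of the design is that the reasoner gets a big-picture snapshot whose size stays small no matter how long the token history grows.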
Hook: Instead of writing notes, you form a clear mental intention about the next fix.
Step 4: Implicit Latent Reasoning (Quiet Thinking)
- What: The understanding branch forms a latent thought vector (z) about how to refine.
- How:
- Read long-term memory and prompt embeddings.
- Deliberate inside hidden space to produce z.
- Keep it continuous; no text decoding.
- Why needed: Preserves rich cues (textures, lighting, layout) that text can't capture well.
- Example: "The rabbit is too small and too far right," stored as a dense vector, not a sentence.
Hook: Now turn that intention into a gentle nudge the painter can follow.
Step 5: Translator (Thought → Guidance)
- What: Convert z + memory + prompt into a control signal c.
- How:
- Fuse them with a small MLP and a gate that filters noise.
- Output a clean control vector.
- Why needed: The generator speaks in hidden vectors; translator makes thoughts compatible and grounded in context.
- Example: The control signal says, in vector form, "slightly enlarge rabbit; shift left; keep color tone consistent."
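A minimal sketch of the translator under simple assumptions: Python list concatenation stands in for feature fusion, a single linear map stands in for the small MLP, and an elementwise sigmoid gate filters noisy components. All parameters (`W`, `gate_w`) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def translate(z, memory, prompt_emb, W, gate_w):
    """Fuse latent thought z, long-term memory, and prompt into control vector c.

    W and gate_w are hypothetical learned parameters; one linear layer plus
    an elementwise sigmoid gate stands in for the paper's small gated MLP.
    """
    fused = z + memory + prompt_emb  # list concatenation = simple fusion
    raw = [sum(w * x for w, x in zip(row, fused)) for row in W]          # linear map
    gate = [sigmoid(sum(g * x for g, x in zip(row, fused))) for row in gate_w]
    # The gate scales each component of the raw signal between 0 and 1.
    return [g * r for g, r in zip(gate, raw)]
```

The gate is what keeps the control signal "clean": components the gate drives toward zero are effectively filtered out before injection.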
Hook: Add helpful sticky notes to the model's attention so future strokes listen.
Step 6: Shaper (Inject Guidance into KV Cache)
- What: Turn c into a few control tokens and insert them into the transformer's key-value (KV) cache.
- How:
- Build j control tokens from c.
- Append them to KV so the next attention reads them like extra context.
- Do not overwrite the original prompt; softly add guidance.
- Why needed: This preserves the global goal while steering the next decisions.
- Example: The very next tokens attend to these control tokens and correct object size/position.
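The injection itself can be pictured as appending extra (key, value) pairs to the cache so that ordinary attention picks them up on the next step. This single-head, unbatched sketch only illustrates the mechanism; real transformer caches are per-layer tensors.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    # Single-head scaled dot-product attention over the cache.
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                       for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def inject_controls(kv_cache, control_tokens):
    """Append control (key, value) pairs without touching the existing cache.

    The original prompt/history entries stay intact; future queries simply
    see a few extra entries to attend over.
    """
    keys, values = kv_cache
    return (keys + [k for k, _ in control_tokens],
            values + [v for _, v in control_tokens])
```

Note that the original cache entries are untouched; the control tokens only add attendable context, which is why the global prompt goal is preserved while the next decisions are steered.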
Hook: Train the helpers first, then teach the timing.
Two-Stage Training Recipe
- Stage 1 (SFT):
- Train long-term condenser, translator, and shaper on 20k text-image pairs.
- Randomly pick one intervention step per sample and inject control tokens.
- Optimize normal token prediction loss; no extra reasoning labels needed.
- Stage 2 (RL):
- Freeze the above modules.
- Train the invoker and short-term condenser with GRPO.
- Reward = CLIP score + Human Preference Score; add a penalty to avoid over-invoking.
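The Stage-2 reward can be pictured as a weighted sum with an invocation penalty. The weights and penalty value below are placeholders, not the paper's settings.

```python
def invoker_reward(clip_score, hps_score, n_invocations,
                   alpha=1.0, beta=1.0, penalty=0.05):
    """Stage-2 reward sketch: image-quality rewards minus an over-invoking cost.

    alpha, beta, and penalty are hypothetical; the paper combines a CLIP score
    with a Human Preference Score and discourages needless reasoning pauses.
    """
    return alpha * clip_score + beta * hps_score - penalty * n_invocations
```

Under this shape, two policies that produce equally good images are separated by how often they paused, which is what pushes the invoker toward reasoning only when it helps.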
- Secret Sauce: All thinking happens in latent space; we never force thoughts into text, and we inject guidance directly into attention, so it's fast and information-rich.
Anchor Example: For the prompt "three red birds on the left branch, one blue on the right," the model starts placing birds. When it senses a counting drift, it pauses, forms a latent thought, injects control tokens, and neatly finishes with the correct counts and positions, without any text back-and-forth.
04 Experiments & Results
Hook: Imagine a school contest where artists must follow tricky instructions quickly and accurately; now swap in AI models.
The Tests
- What measured: Instruction following (fidelity), compositional reasoning (colors, shapes, spatial relations, numeracy), abstract understanding (physics-defying prompts), efficiency (time and tokens), and human-like timing.
- Why: To see if quiet, in-head reasoning really beats writing thoughts out loud.
The Competition
- Compared against: Vanilla Janus-Pro, SFT, GRPO, Self-CoT, T2I-R1, TIR, T2I-Copilot, MILR, and TwiG (ZS and RL).
- Benchmarks: GenEval, T2I-CompBench (+PlusPlus), WISE, IPV-Txt.
The Scoreboard (with context)
- General + Compositional:
  - +16% on GenEval over base Janus-Pro: like jumping from a solid B to an A.
  - +25% on T2I-CompBench overall: big gains on hard categories (e.g., Non-Spatial, 3D-Spatial, Texture).
  - On T2I-CompBench++, LatentMorph tops TwiG-RL by about 7% overall and >8% in 3D-Spatial.
- Abstract Reasoning:
  - Beats explicit reasoning (e.g., TwiG) on WISE and IPV-Txt by up to 15% and 11%: like solving riddles better because it keeps subtle, non-verbalizable cues.
- Efficiency:
  - 44% less inference time and 51% fewer tokens than explicit reasoning setups: closer to real-time drawing.
- Cognitive Alignment:
  - 71.8% agreement with human judges on when to pause and think: timing feels natural, not robotic.
Surprises
- For "impossible prompts" (like walking on water), text explanations weren't enough; latent reasoning captured the feel of the paradox better.
- Even a variant that forced thoughts into text (w/o latent) lost subtle textures and lighting, confirming the value of staying in latent space.
Anchor Example: In side-by-sides, LatentMorph fixes missing objects, counts better (e.g., exact number of birds), and places items more precisely, all while finishing faster than methods that stop to write their thoughts.
05 Discussion & Limitations
Hook: Even great tools have edges you should know before using them.
Limitations
- What it can't do (yet):
- If the base generator lacks knowledge (e.g., rare objects), latent reasoning can't invent it from thin air.
- Extremely long, multi-scene prompts may still need more than one latent intervention.
- Safety/bias mirrors the base model; this method doesn't retrain its values.
- Requires access to hidden states and KV cache; not all closed APIs allow this.
Required Resources
- Needs an autoregressive T2I model (e.g., Janus-Pro), GPU acceleration, and modest extra modules (condensers, translator, shaper, invoker). RL training uses reward models like CLIP and HPS.
When NOT to Use
- If your deployment forbids KV-cache access or custom adapters.
- If you must provide human-readable rationales (this method keeps thoughts silent).
- If latency is ultra-critical and hardware is extremely limited (though LatentMorph is faster than explicit reasoning methods, invoking still costs more than a bare vanilla run).
Open Questions
- Can we learn even better timing signals (e.g., from gaze-like attention patterns)?
- How to blend safety guidance into the same latent control pipeline?
- Can similar latent steering improve video generation's long-range coherence?
- How many control tokens are optimal across different model sizes?
Anchor Example: If a client needs step-by-step written justifications for every change, LatentMorph's silent style isn't the right fit; choose an explicit reasoning method instead.
06 Conclusion & Future Work
Hook: Think of an artist who quietly adjusts as they go: fewer pauses, better results.
3-Sentence Summary
- LatentMorph lets text-to-image models think silently in their hidden space and steer the next strokes without writing out thoughts.
- It uses visual memory, an RL-trained invoker to time reflections, and a translator-shaper pair to inject control tokens into the attention cache.
- This delivers higher fidelity, stronger abstract reasoning, and faster inference than text-based reasoning loops, while better matching human creative timing.
Main Achievement
- Turning implicit latent reasoning into direct, dynamic control of image token generation: no text detours, just smooth, on-the-fly refinement.
What's Next
- Extending to video and multi-scene stories, integrating safety constraints into latent control, and exploring adaptive numbers of control tokens for larger models.
Why Remember This
- It marks a shift from "explain then act" to "quietly think while acting," proving that keeping thoughts in latent space can be both smarter and faster for creative AI.
Practical Applications
- Product design mockups that honor exact counts, colors, and placements without many retries.
- Educational illustrations that need precise spatial relations (e.g., "three planets aligned left to right").
- Storybook art where characters must appear consistently across pages with correct poses and sizes.
- Marketing visuals that match detailed brand guides (exact tones, textures, and logo positions).
- Scientific diagrams that require faithful composition and accurate object relationships.
- UI/UX concept renders with strict layout constraints (elements in precise positions).
- Architectural sketches that keep global structure consistent while refining local details.
- Fashion lookbooks where outfits and accessories must follow exact styling prompts.
- Game asset generation that respects spatial logic and object counts in scenes.
- Fast iterative ideation: better first-pass images reduce back-and-forth revision cycles.