Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
Key Summary
- This paper shows that many AI models that both read images and write images are not truly unified inside: they often understand well but fail to generate (or the other way around).
- The authors build GAPEVAL, a fair, two-way test where every question can be answered as text or as an image, so the same knowledge must work in both directions.
- They introduce a Gap Score, based on a careful testing method (MIRT), to measure how big the split is between understanding and generation while accounting for question difficulty.
- Across nine unified multimodal models, there is a persistent gap: models frequently get one side right and the other side wrong on the same item.
- Higher overall performance does not always mean better unification; some strong models still show large gaps between understanding and generation.
- There is a lagging effect: as models get better, understanding tends to improve earlier than generation, which temporarily widens the gap before it later shrinks.
- Knowledge injection/editing experiments show knowledge often stays separated across modalities: training one side rarely fixes the other.
- Even state-of-the-art unified models can underperform non-unified baselines on some tasks, suggesting current "unification" is mostly engineering-level, not truly cognitive.
- GAPEVAL's Gap Score correlates with synergy benchmarks, implying that closing the gap is a prerequisite for real cross-modal teamwork inside models.
Why This Research Matters
Real-world assistants must both understand your photos and create accurate edits or illustrations when asked; otherwise they're unreliable. In education, a tutor that explains diagrams but can't draw them (or draws them wrong) fails students who learn visually. For accessibility, tools that convert between words and images must be consistent so people who rely on them get trustworthy information. In robotics, a system that reads a floor plan but can't sketch a correct path risks safety and efficiency. GAPEVAL gives builders a fair, difficulty-aware way to measure and fix these issues, so future models share knowledge across modalities instead of acting like two tools taped together.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine a student who can read a story out loud perfectly (understanding) but struggles to write their own story (generation). Or another student who writes beautiful stories but doesn't really understand the ones they read. Would you say either student has mastered language? Probably not; you'd want both skills to work together.
🥬 Filling (Concept 1: Unified Multimodal Models, UMMs)
- What it is: UMMs are single AI systems that can both understand inputs like images and text and also generate outputs like answers (text) or pictures (images).
- How it works:
- Take in text and/or images.
- Reason about whatâs being asked.
- Output either text (like an answer) or an image (like a drawing).
- Why it matters: Without one brain handling both sides well, the model may act like two loosely connected tools rather than one smart, coordinated thinker. 🍞 Bottom Bread (Anchor): Think of a Swiss Army knife: it's supposed to be one tool that can cut, twist, and open. If the screwdriver works but the knife doesn't, it isn't truly unified.
🍞 Top Bread (Hook): You know how in class, sometimes you need to explain what you see in a picture? That's understanding. And sometimes, you're asked to draw something based on words? That's generation.
🥬 Filling (Concept 2: Understanding)
- What it is: Understanding is when the AI reads text and/or looks at images and answers with text that explains or solves the task.
- How it works:
- Look at the input (image and/or text).
- Figure out what is important.
- Produce a clear, correct text answer.
- Why it matters: Without solid understanding, the AI can't reason about what it sees; it might miss the point and answer incorrectly. 🍞 Bottom Bread (Anchor): If shown a photo of two cats and asked, "How many cats?" understanding means answering "Two."
🍞 Top Bread (Hook): Now flip it: imagine getting a sentence like "Draw a red apple on a plate," and you have to create the image.
🥬 Filling (Concept 3: Generation)
- What it is: Generation is when the AI reads text and produces an image that matches the instruction.
- How it works:
- Read the text instruction.
- Plan which objects, colors, and layouts are needed.
- Create an image that matches the plan.
- Why it matters: Without reliable generation, the AI can't show what it knows visually, which limits creativity and utility. 🍞 Bottom Bread (Anchor): If asked, "Make a picture with three balloons and one cake," generation is drawing that exact scene.
The world before: Researchers made great progress training models to do image understanding (answering questions about pictures) and image generation (drawing from text). But they mostly tested these skills separately. That's like testing reading on Monday and writing on Tuesday: good scores don't prove the skills connect inside one brain. When newer "unified" models arrived, promising both skills in one, they were mostly still judged with one-way tests (only text answers or only images), so no one could tell if the two abilities reinforced each other.
The problem: Do understanding and generation really share one set of smarts inside a unified model? Or are they just two tools taped together? If the model can name "American Gothic" in text but fails to draw a correct image of it, those skills aren't truly in sync.
Failed attempts: Past benchmarks like MMMU and MMBench focused on understanding. Others like GenEval focused on generation. Newer unified tests (e.g., RealUnify, GIR-Bench) mixed tasks but often didn't directly compare the same knowledge in both directions. That made it hard to measure the "gap" between understanding and generation on identical content.
🍞 Top Bread (Hook): You know how a great athlete should dribble and shoot well in the same game, not just be good at one drill on a different day?
🥬 Filling (Concept 4: Cross-modal Consistency)
- What it is: Cross-modal consistency checks that what a model understands in words and pictures matches what it can create back as pictures or words, on the very same idea.
- How it works:
- Ask the same question in two ways: one needs a text answer; the other needs an image.
- Compare whether both answers are correct and agree.
- Track mismatches to find where the model falls apart.
- Why it matters: Without consistency, the model's "left hand" and "right hand" don't coordinate; you can't trust it to transfer knowledge between seeing and drawing. 🍞 Bottom Bread (Anchor): If the model says "This sport is Ssireum" from a photo but cannot generate a plausible image of Ssireum when asked, that's inconsistency.
The gap: What was missing was a fair, symmetric, two-way test that uses the same question to demand both a text answer and an image answer, and a principled score that accounts for question difficulty. That's exactly what this paper creates with GAPEVAL.
Real stakes: In everyday life, assistants that edit photos from instructions, tutoring systems that both explain diagrams and draw them, robots that read a map and sketch a safe path: all need understanding and generation to agree. If your assistant can count apples in a photo but can't draw the right number when asked, you'll waste time, make mistakes, or lose trust. That's why measuring and shrinking this gap matters.
02 Core Idea
🍞 Top Bread (Hook): Imagine a two-sided quiz: you answer a question in words, and then you must draw the same answer. If your drawing doesn't match your words, your teacher spots the mismatch immediately.
🥬 Filling (Concept 5: GAPEVAL)
- What it is: GAPEVAL is a two-way benchmark where every question can be answered in text (understanding) and as an image (generation) with the same meaning.
- How it works:
- Build paired prompts that ask the same thing in both directions (e.g., "Who is this?" vs. "Generate an image of this person").
- Collect correct references for both text and image.
- Have models answer both sides; use a careful judge to mark correct/incorrect.
- Compute a Gap Score that tells how far apart the two abilities are.
- Why it matters: Without a paired, symmetric test, we can't tell if a model's abilities are truly integrated or just coexisting. 🍞 Bottom Bread (Anchor): If a model reads a puzzle and picks option D in text, it should also draw an image circling D when asked to answer visually. GAPEVAL checks both.
The "Aha!" moment in one sentence: Test the same knowledge in both directions and measure the difference with a fairness-aware metric so we can finally see how unified a model truly is.
Multiple analogies for the same idea:
- Language and art class: You must explain a scene and then paint it; the two should match, or you didn't internalize the idea.
- Recipe and dish: You write the recipe (understanding) and then cook the dish (generation); if the dish doesn't taste like the recipe, something's off.
- Map and route: You read a map (understanding) and then draw the route (generation); if you can't draw it, maybe you didn't really get the map.
Before vs. After:
- Before: Models were praised for single skills (great readers or great artists) but not checked for how well those skills aligned inside one brain.
- After: With GAPEVAL and the Gap Score, we can tell whether a model's reading and drawing share the same internal knowledge and where they fail to meet.
🍞 Top Bread (Hook): You know how some test questions are just harder than others? If two students both miss the hardest question, it doesn't mean they're bad at the subject.
🥬 Filling (Concept 6: MIRT (Multidimensional Item Response Theory))
- What it is: MIRT is a testing method that estimates both the model's ability and each item's difficulty to score fairly across different skills.
- How it works:
- Treat understanding and generation as two ability dimensions.
- For each question, consider how tough it is.
- Fit a model that explains which errors come from difficulty vs. limited ability.
- Output ability estimates that are adjusted for question difficulty.
- Why it matters: Without adjusting for difficulty, a model that fails only the hardest items might look unfairly weak. 🍞 Bottom Bread (Anchor): If two math problems are hard, MIRT prevents us from blaming a student for missing only those toughest ones.
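The difficulty adjustment described above can be sketched with a standard item-response curve: the probability of answering an item correctly rises smoothly as the model's ability exceeds the item's difficulty. This is a minimal illustrative sketch, not the paper's exact MIRT parameterization; the function name and default discrimination value are assumptions.

```python
import math

def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Logistic item-response curve (2PL-style, illustrative only).

    Probability of a correct answer approaches 1 as ability exceeds the
    item's difficulty; `discrimination` controls how sharp the transition is.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A model whose ability matches the item's difficulty succeeds half the time;
# the same model on a much harder item succeeds far less often.
print(round(p_correct(0.0, 0.0), 2))  # 0.5
print(round(p_correct(0.0, 2.0), 2))
```

This is why two students missing only the single hardest item end up with similar ability estimates: the curve attributes those misses mostly to difficulty, not ability.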
🍞 Top Bread (Hook): Think of a ruler that measures not height, but the gap between your reading and drawing skills.
🥬 Filling (Concept 7: Gap Score)
- What it is: A single number that shows how far apart a model's understanding and generation abilities are, after fairly accounting for difficulty.
- How it works:
- Use MIRT to estimate the model's ability on understanding and on generation.
- Take the difference between those abilities.
- Normalize it to a 0–100 scale with extra penalties for always failing both and rewards when both succeed.
- Why it matters: Without a clear, comparable score, we can't track progress or compare models on true unification. 🍞 Bottom Bread (Anchor): If a model always answers correctly in text but only sometimes draws the right picture, its Gap Score will reveal that split.
Building blocks (broken down):
- Bidirectional items: Each question has twin prompts (understanding vs. generation) that share the same meaning.
- Four skill areas: Instruction Following, Numerical Perception, World Knowledge, and Reasoning, covering how models follow rules, count/transform numbers, use facts, and think logically.
- Reliable judging: GPT-5-mini is used as an automatic judge with 92% agreement with human ratings on a held-out set.
- Difficulty-aware math: MIRT separates "it was hard" from "the model can't do it," making the Gap Score meaningful.
Why it works (intuition): If two abilities really share one brain, then learning or recalling knowledge on one side should show up on the other. By asking the same thing in both directions and scoring with difficulty in mind, GAPEVAL exposes when the internal bridge between understanding and generation is strong, or when it's wobbly.
03 Methodology
High-level pipeline: Input (paired, two-way question) → Model produces text answer and image answer → Judge marks correct/incorrect → Aggregate results → MIRT estimates abilities and question difficulty → Gap Score summarizes the split.
Step 1: Build paired questions (the heart of symmetry)
- What happens: Each item includes an image (or none) and two prompts with the same meaning: one expects a text answer (understanding), the other expects an image (generation).
- Why this step exists: It forces the model to use the same knowledge in both directions, removing excuses like "that was a different task."
- Example: Numerical Perception: if the image shows 4 cameras and 2 headphones, the understanding prompt asks you to say the swapped counts, and the generation prompt asks you to draw the swapped counts.
🍞 Top Bread (Hook): Like following a recipe in words and also cooking it correctly. 🥬 Filling (Concept 8: Instruction Following)
- What it is: The ability to obey rules or edits consistently in text and images.
- How it works:
- Read the instruction (explicit or implicit rule).
- Apply the change consistently.
- Describe it (text) or show it (image) without contradictions.
- Why it matters: If a model can't follow rules the same way in both modes, edits become unreliable. 🍞 Bottom Bread (Anchor): Add a fox next to a horse: describe it correctly and also draw it correctly.
🍞 Top Bread (Hook): Imagine counting apples in a photo and then drawing exactly that many apples. 🥬 Filling (Concept 9: Numerical Perception)
- What it is: Seeing, understanding, and manipulating quantities across text and images.
- How it works:
- Detect object categories and counts.
- Apply a numerical change (like swapping numbers).
- Report via text or generate an image with correct counts.
- Why it matters: Without precise numbers, models overcount, undercount, or just copy the input instead of reasoning. 🍞 Bottom Bread (Anchor): If there are 3 ducks and 2 birds, the model must produce 2 ducks and 3 birds (text or image).
🍞 Top Bread (Hook): Think of facts you know, like book authors or famous paintings, and matching them to pictures. 🥬 Filling (Concept 10: World Knowledge)
- What it is: Using factual knowledge consistently between reading images/text and drawing images from text.
- How it works:
- Link words to real entities (people, places, culture).
- Recognize them in images; recall them when drawing from text.
- Keep the mapping stable in both directions.
- Why it matters: If the model names "American Gothic" but can't draw it when asked, knowledge isn't unified. 🍞 Bottom Bread (Anchor): Correctly say "American Gothic by Grant Wood" and also generate a compatible scene.
🍞 Top Bread (Hook): Picture solving a puzzle, then sketching the solution. 🥬 Filling (Concept 11: Reasoning)
- What it is: Logical thinking that connects clues to answers in text and images.
- How it works:
- Understand the context (diagrams, choices, physical setups).
- Infer what happens next or which option is correct.
- Express the answer in text and render it visually.
- Why it matters: Without reasoning, models guess or copy instead of truly solving. 🍞 Bottom Bread (Anchor): Choose option D in text and also draw a red circle around D in an image version.
Step 2: Judging correctness fairly
- What happens: An LLM judge (GPT-5-mini) compares the modelâs outputs with reference answers. It marks correct (1) or incorrect (0), with anti-plagiarism checks for images.
- Why this step exists: Many answers are semantic (is the key idea right?), so an automatic but careful judge speeds evaluation while staying reliable (92% agreement with humans on a hold-out set).
- Example: In Instruction Following, the judge checks whether the rule-driven change is present in both text and image outputs, even if style differs.
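The 92% figure is a plain label-agreement rate between the automatic judge and human raters on the hold-out set. A minimal sketch, with names of my own choosing:

```python
def agreement(judge_labels, human_labels):
    """Fraction of items where the automatic judge's 0/1 verdict
    matches the human rater's 0/1 verdict."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Three matches out of four -> 0.75 agreement.
print(agreement([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

A rate like 0.92 on held-out items is what licenses using the LLM judge at scale in place of humans.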
Step 3: Aggregate outcomes into counts
- What happens: For each model, count how often it got: text-only correct, image-only correct, both correct, or both wrong.
- Why this step exists: These patterns reveal whether successes are shared or siloed by modality.
- Example: If a model often gets text right but image wrong on the same items, that's a warning sign for unification.
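The per-item tallies in this step reduce to a simple 2×2 contingency count over the two modalities. A minimal sketch (function and field names are my own, not from the paper):

```python
from collections import Counter

def outcome_counts(text_correct, image_correct):
    """Tally how each paired item was answered across the two modalities.

    text_correct / image_correct: parallel lists of 0/1 judge verdicts,
    one entry per bidirectional item.
    """
    counts = Counter()
    for t, i in zip(text_correct, image_correct):
        if t and i:
            counts["both_correct"] += 1  # shared success
        elif t:
            counts["text_only"] += 1     # understanding without generation
        elif i:
            counts["image_only"] += 1    # generation without understanding
        else:
            counts["both_wrong"] += 1    # shared failure
    return counts

# Many "text_only" items on the same content is the warning sign for unification.
print(outcome_counts([1, 1, 1, 0], [1, 0, 0, 0]))
```

Heavy off-diagonal mass (text_only plus image_only) is exactly the siloed-success pattern the step is meant to expose.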
🍞 Top Bread (Hook): Like a coach who knows which drills are hard and which skills each player has. 🥬 Filling (Concept 12: MIRT in practice)
- What it is: A tool to estimate two abilities (understanding, generation) while separating out question difficulty.
- How it works:
- Fit abilities per model and difficulty per task/modality.
- Use a smooth function to map ability–difficulty differences to success probabilities.
- Apply a prior to tie the two abilities together and avoid overfitting.
- Why it matters: It ensures the Gap Score reflects true ability differences, not just that some items were tougher. 🍞 Bottom Bread (Anchor): Two students missing the same very hard item shouldn't look as far apart as if one missed lots of easy items.
Step 4: Compute the Gap Score
- What happens: Take the difference between the model's estimated understanding and generation abilities, normalize it to 0–100, and adjust using observed co-success (reward) and co-failure (penalty) patterns.
- Why this step exists: A single interpretable number makes it easy to compare models and track progress toward true unification.
- Example: A model that often succeeds on both sides will earn a smaller gap than one that splits successes between modalities.
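The step above can be sketched in code. Since the exact formula is not reproduced here, this is a hypothetical illustration that combines the described ingredients (ability difference, 0–100 normalization, co-success reward, co-failure penalty); every name and coefficient is an assumption, not the paper's definition.

```python
def gap_score(theta_und, theta_gen, co_success, co_failure, theta_range=(-3.0, 3.0)):
    """Hypothetical Gap Score sketch: NOT the paper's exact formula.

    theta_und / theta_gen: MIRT ability estimates for understanding and
    generation. co_success / co_failure: observed rates of both-correct
    and both-wrong items, used as reward and penalty adjustments.
    """
    lo, hi = theta_range
    base = abs(theta_und - theta_gen) / (hi - lo) * 100.0  # normalized ability split
    adjusted = base * (1.0 + co_failure - co_success)      # penalty grows, reward shrinks
    return max(0.0, min(100.0, adjusted))                  # clamp to the 0-100 scale

# Identical abilities yield a zero gap; frequent co-failure widens a given split.
print(gap_score(1.0, 1.0, co_success=0.9, co_failure=0.0))  # 0.0
```

The design intent matches the text: a model that splits its successes between modalities scores a larger gap than one that succeeds (or fails) on both sides together.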
🍞 Top Bread (Hook): Suppose you try to teach the model a new fact; will it show up in both reading and drawing? 🥬 Filling (Concept 13: Knowledge Manipulation (Injection and Editing))
- What it is: Directly adding new facts (injection) or changing old ones (editing) during fine-tuning to see whether updates transfer across modalities.
- How it works:
- Pick knowledge triples (subject, relation, object) with text and images.
- Fine-tune only on understanding or only on generation.
- Test both sides to see if knowledge transfers.
- Why it matters: If updates don't transfer, the model's knowledge pools are disconnected, not unified. 🍞 Bottom Bread (Anchor): Teach "Microwave Oven → Rice Cooker" as an edit. If only the image side changes but the text side doesn't (or vice versa), knowledge isn't shared well.
The secret sauce: Symmetric, bidirectional items plus difficulty-aware scoring (MIRT) make GAPEVAL sensitive to true alignment rather than raw performance. It cleanly separates "can do both in sync" from "can do each separately"; that is the leap from engineering-level coupling to cognitive-level unification.
04 Experiments & Results
The test: GAPEVAL includes 646 carefully built, bidirectional items across four areas: Instruction Following, Numerical Perception, World Knowledge, and Reasoning. Each item can be answered in text and as an image, enabling a direct comparison on the same knowledge. An LLM judge rates correctness; MIRT estimates abilities and difficulty; the Gap Score summarizes the split.
The competition: The authors evaluate nine unified multimodal models (open- and closed-source) and compare them to strong understanding-only and generation-only baselines (e.g., GPT, Gemini, Qwen, FLUX). This spans different architectures, from LLM+diffusion hybrids to token-based unified transformers and mixture-of-experts/transformers designs.
The scoreboard (with context):
- Big picture: A persistent, significant gap appears across many UMMs: models often get one modality right and the other wrong on the same item.
- Understanding vs. generation: Understanding-only models like GPT-5-mini and Gemini-Flash achieve very high text accuracy (around mid-90% or more), often beating unified models on comprehension. On pure image generation, closed-source unified models (e.g., GPT-Image-1, Gemini 2.5 Flash Image) outperform generation-only baselines, but still show noticeable gaps between modalities.
- A striking case: OmniGen2, built on a strong diffusion generator (FLUX.1-dev), underperforms its own diffusion backbone on some generation tasks, signaling that adding unification components can dilute specific strengths.
- Category patterns: Gaps show up in all four areas. For World Knowledge and Instruction Following, some models read facts or rules correctly but fail to render them; for Numerical Perception, swapping counts exposes whether the model truly reasons numerically rather than copying; for Reasoning, image selection and visual rendering must agree, and they often donât.
- What "87%" means: Think of 87% as getting an A, but if the other modality sits at 60% (a D), the model isn't unified. GAPEVAL highlights these mixed report cards that single-direction tests hide.
Surprising findings:
- Performance vs. unification is decoupled: Some models with higher overall accuracy still have larger gaps. Conversely, a slightly weaker model may show a smaller gap, indicating better internal alignment.
- Lagging effect: As capability increases, understanding often improves first, generation later. This temporarily widens the gap before it narrows at the high end (e.g., GPT-Image-1, Gemini 2.5 Flash Image). It suggests different learning speeds for the two abilities.
- Non-unified baselines can win: On certain tasks, specialized understanding-only or generation-only models beat unified ones, showing that today's "unified" designs can hurt specialized performance.
Knowledge manipulation results (why the gap exists):
- Training only the understanding side sharply boosts understanding but barely helps generation; training only generation helps the other side only a little. In editing experiments, outdated knowledge often lingers on the untouched side.
- Convergence speeds differ: Understanding typically saturates faster; generation improves more slowly, explaining the temporary widening of the gap as models scale.
Takeaway: Today's UMMs often achieve surface-level coupling: two tools in one box rather than one shared brain. GAPEVAL's Gap Score reveals this and correlates with synergy benchmarks, suggesting that closing the gap is essential for strong cross-modal teamwork.
05 Discussion & Limitations
Limitations:
- Benchmark scope: Even with 646 bidirectional items and human curation, no single benchmark covers all real-world tasks or cultural contexts. Hidden biases could remain.
- Judge dependence: LLM-as-a-judge reaches 92% agreement with humans on a held-out set, but automatic judging is not perfect, especially for edge cases or novel styles.
- Metric focus: The Gap Score captures alignment between two abilities, but not every dimension of quality (e.g., style fidelity, safety, or long-horizon planning).
- Model coverage: The study tests many important models, but not all architectures or training recipes; future designs might unify better.
Required resources:
- Paired, high-quality, human-reviewed datasets; access to an LLM judge; compute for MIRT fitting; and multiple model runs to reduce sampling variance.
When not to use:
- If you only need single-direction performance (e.g., only text Q&A or only image drawing) and cross-modal transfer doesn't matter, GAPEVAL's bidirectional setup may be overkill.
- If your domain is outside the benchmark's coverage (e.g., medical imaging with unique constraints), you may need a domain-specific extension.
Open questions:
- Architecture: What designs best share knowledge between understanding and generation (e.g., tighter shared tokenizers or unified latent spaces)?
- Training signals: Which curricula or losses align modalities fastest without sacrificing specialized skill?
- Knowledge editing: How can we guarantee synchronized updates across modalities with minimal forgetting?
- Causality and reasoning: How do we ensure that the modelâs chain of thought drives both text and image in the same, verifiable way?
- Evaluation: Can future judges be more transparent and robust, or include human-in-the-loop verification efficiently?
Bottom line: GAPEVAL doesn't just hand out scores; it diagnoses whether a model's two big brains act as one. That diagnosis enables targeted fixes: architectures, data, and training that aim for genuine, cognitive-level unification rather than taped-together parts.
06 Conclusion & Future Work
Three-sentence summary: This paper introduces GAPEVAL, a bidirectional benchmark where every item can be answered in text and as an image, enabling a fair test of whether understanding and generation truly share knowledge. Using a difficulty-aware testing method (MIRT), it defines a Gap Score that quantifies the split between the two abilities and shows a persistent gap across many unified multimodal models. Knowledge manipulation experiments reveal that updates often fail to transfer across modalities, suggesting current unification is mostly engineering-level rather than cognitive.
Main achievement: A clean, symmetric experimental framework and principled Gap Score that finally measure, not just guess, how well a model's understanding and generation are integrated.
Future directions: Design architectures and training strategies that explicitly synchronize the two modalities' knowledge; create curricula that balance learning speeds; invent editing methods that update both sides together; and expand GAPEVAL to more domains and longer, multi-step tasks.
Why remember this: As AI systems become everyday helpers, we need them to read and draw from the same playbook. GAPEVAL gives us the ruler to measure that unity, and the map to improve it, so future models don't just do two tricks, but do them as one mind.
Practical Applications
- Benchmark multimodal assistants to ensure they can both describe and correctly draw edits to user photos.
- Audit enterprise AI tools for cross-modal consistency before deployment in design, marketing, and e-commerce.
- Design training curricula that balance understanding-first gains with generation catch-up to reduce the lagging gap.
- Evaluate knowledge updates (injection/editing) to confirm new facts transfer from text understanding to image generation.
- Select models for education platforms that must both explain diagrams and generate accurate visualizations.
- Tune robotics or mapping systems to ensure read–draw consistency for plans, paths, and scene reconstructions.
- Compare architectures (LLM+diffusion vs. unified tokenizers) with the Gap Score to guide R&D choices.
- Monitor model upgrades over time to ensure improved performance does not secretly increase the modality gap.
- Extend GAPEVAL with domain-specific items (e.g., medical or scientific diagrams) to check safety-critical alignment.