Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
Key Summary
- This paper shows that many AI models that both read images and write images are not truly unified inside: they often understand well but fail to generate (or the other way around).
- The authors build GAPEVAL, a fair, two-way test where every question can be answered as text or as an image, so the same knowledge must work in both directions.
- They introduce a Gap Score, based on a careful testing method (MIRT), to measure how big the split is between understanding and generation while accounting for question difficulty.
- Across nine unified multimodal models, there is a persistent gap: models frequently get one side right and the other side wrong on the same item.
- Higher overall performance does not always mean better unification; some strong models still show large gaps between understanding and generation.
- There is a lagging effect: as models get better, understanding tends to improve earlier than generation, which temporarily widens the gap before it later shrinks.
- Knowledge injection/editing experiments show knowledge often stays separated across modalities: training one side rarely fixes the other.
- Even state-of-the-art unified models can underperform non-unified baselines on some tasks, suggesting current "unification" is mostly engineering-level, not truly cognitive.
- GAPEVAL's Gap Score correlates with synergy benchmarks, implying that closing the gap is a prerequisite for real cross-modal teamwork inside models.
Why This Research Matters
Real-world assistants must both understand your photos and create accurate edits or illustrations when asked; otherwise they're unreliable. In education, a tutor that explains diagrams but can't draw them (or draws them wrong) fails students who learn visually. For accessibility, tools that convert between words and images must be consistent so people who rely on them get trustworthy information. In robotics, a system that reads a floor plan but can't sketch a correct path risks safety and efficiency. GAPEVAL gives builders a fair, difficulty-aware way to measure and fix these issues, so future models share knowledge across modalities instead of acting like two tools taped together.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine a student who can read a story out loud perfectly (understanding) but struggles to write their own story (generation). Or another student who writes beautiful stories but doesn't really understand the ones they read. Would you say either student has mastered language? Probably not; you'd want both skills to work together.
🥬 Filling (Concept 1: Unified Multimodal Models, UMMs)
- What it is: UMMs are single AI systems that can both understand inputs like images and text and also generate outputs like answers (text) or pictures (images).
- How it works:
- Take in text and/or images.
- Reason about whatâs being asked.
- Output either text (like an answer) or an image (like a drawing).
- Why it matters: Without one brain handling both sides well, the model may act like two loosely connected tools rather than one smart, coordinated thinker. 🍞 Bottom Bread (Anchor): Think of a Swiss Army knife: it's supposed to be one tool that can cut, twist, and open. If the screwdriver works but the knife doesn't, it isn't truly unified.
🍞 Top Bread (Hook): You know how in class, sometimes you need to explain what you see in a picture? That's understanding. And sometimes, you're asked to draw something based on words? That's generation.
🥬 Filling (Concept 2: Understanding)
- What it is: Understanding is when the AI reads text and/or looks at images and answers with text that explains or solves the task.
- How it works:
- Look at the input (image and/or text).
- Figure out what is important.
- Produce a clear, correct text answer.
- Why it matters: Without solid understanding, the AI can't reason about what it sees; it might miss the point and answer incorrectly. 🍞 Bottom Bread (Anchor): If shown a photo of two cats and asked, "How many cats?" understanding means answering "Two."
🍞 Top Bread (Hook): Now flip it: imagine getting a sentence like "Draw a red apple on a plate," and you have to create the image.
🥬 Filling (Concept 3: Generation)
- What it is: Generation is when the AI reads text and produces an image that matches the instruction.
- How it works:
- Read the text instruction.
- Plan which objects, colors, and layouts are needed.
- Create an image that matches the plan.
- Why it matters: Without reliable generation, the AI can't show what it knows visually, which limits creativity and utility. 🍞 Bottom Bread (Anchor): If asked, "Make a picture with three balloons and one cake," generation is drawing that exact scene.
The world before: Researchers made great progress training models to do image understanding (answering questions about pictures) and image generation (drawing from text). But they mostly tested these skills separately. That's like testing reading on Monday and writing on Tuesday: good scores don't prove the skills connect inside one brain. When newer "unified" models arrived, promising both skills in one, they were mostly still judged with one-way tests (only text answers or only images), so no one could tell if the two abilities reinforced each other.
The problem: Do understanding and generation really share one set of smarts inside a unified model? Or are they just two tools taped together? If the model can name "American Gothic" in text but fails to draw a correct image of it, those skills aren't truly in sync.
Failed attempts: Past benchmarks like MMMU and MMBench focused on understanding. Others like GenEval focused on generation. Newer unified tests (e.g., RealUnify, GIR-Bench) mixed tasks but often didn't directly compare the same knowledge in both directions. That made it hard to measure the "gap" between understanding and generation on identical content.
🍞 Top Bread (Hook): You know how a great athlete should dribble and shoot well in the same game, not just be good at one drill on a different day?
🥬 Filling (Concept 4: Cross-modal Consistency)
- What it is: Cross-modal consistency checks that what a model understands in words and pictures matches what it can create back as pictures or words, on the very same idea.
- How it works:
- Ask the same question in two ways: one needs a text answer; the other needs an image.
- Compare whether both answers are correct and agree.
- Track mismatches to find where the model falls apart.
- Why it matters: Without consistency, the model's "left hand" and "right hand" don't coordinate; you can't trust it to transfer knowledge between seeing and drawing. 🍞 Bottom Bread (Anchor): If the model says "This sport is Ssireum" from a photo but cannot generate a plausible image of Ssireum when asked, that's inconsistency.
The gap: What was missing was a fair, symmetric, two-way test that uses the same question to demand both a text answer and an image answer, and a principled score that accounts for question difficulty. That's exactly what this paper creates with GAPEVAL.
Real stakes: In everyday life, assistants that edit photos from instructions, tutoring systems that both explain diagrams and draw them, robots that read a map and sketch a safe path: all need understanding and generation to agree. If your assistant can count apples in a photo but can't draw the right number when asked, you'll waste time, make mistakes, or lose trust. That's why measuring and shrinking this gap matters.
02 Core Idea
🍞 Top Bread (Hook): Imagine a two-sided quiz: you answer a question in words, and then you must draw the same answer. If your drawing doesn't match your words, your teacher spots the mismatch immediately.
🥬 Filling (Concept 5: GAPEVAL)
- What it is: GAPEVAL is a two-way benchmark where every question can be answered in text (understanding) and as an image (generation) with the same meaning.
- How it works:
- Build paired prompts that ask the same thing in both directions (e.g., "Who is this?" vs. "Generate an image of this person").
- Collect correct references for both text and image.
- Have models answer both sides; use a careful judge to mark correct/incorrect.
- Compute a Gap Score that tells how far apart the two abilities are.
- Why it matters: Without a paired, symmetric test, we can't tell if a model's abilities are truly integrated or just coexisting. 🍞 Bottom Bread (Anchor): If a model reads a puzzle and picks option D in text, it should also draw an image circling D when asked to answer visually. GAPEVAL checks both.
The "Aha!" moment in one sentence: Test the same knowledge in both directions and measure the difference with a fairness-aware metric so we can finally see how unified a model truly is.
Multiple analogies for the same idea:
- Language and art class: You must explain a scene and then paint it; the two should match, or you didn't internalize the idea.
- Recipe and dish: You write the recipe (understanding) and then cook the dish (generation); if the dish doesn't taste like the recipe, something's off.
- Map and route: You read a map (understanding) and then draw the route (generation); if you can't draw it, maybe you didn't really get the map.
Before vs. After:
- Before: Models were praised for single skills (great readers or great artists) but not checked for how well those skills aligned inside one brain.
- After: With GAPEVAL and the Gap Score, we can tell whether a model's reading and drawing share the same internal knowledge and where they fail to meet.
🍞 Top Bread (Hook): You know how some test questions are just harder than others? If two students both miss the hardest question, it doesn't mean they're bad at the subject.
🥬 Filling (Concept 6: MIRT (Multidimensional Item Response Theory))
- What it is: MIRT is a testing method that estimates both the model's ability and each item's difficulty to score fairly across different skills.
- How it works:
- Treat understanding and generation as two ability dimensions.
- For each question, consider how tough it is.
- Fit a model that explains which errors come from difficulty vs. limited ability.
- Output ability estimates that are adjusted for question difficulty.
- Why it matters: Without adjusting for difficulty, a model that fails only the hardest items might look unfairly weak. 🍞 Bottom Bread (Anchor): If two math problems are hard, MIRT prevents us from blaming a student for missing only those toughest ones.
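The difficulty adjustment described above can be sketched with a standard item-response curve: the probability of answering an item correctly rises smoothly as the model's ability exceeds the item's difficulty. This is a minimal illustrative sketch, not the paper's exact MIRT parameterization; the function name and default discrimination value are assumptions.

```python
import math

def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Logistic item-response curve (2PL-style, illustrative only).

    Probability of a correct answer approaches 1 as ability exceeds the
    item's difficulty; `discrimination` controls how sharp the transition is.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A model whose ability matches the item's difficulty succeeds half the time;
# the same model on a much harder item succeeds far less often.
print(round(p_correct(0.0, 0.0), 2))  # 0.5
print(round(p_correct(0.0, 2.0), 2))
```

This is why two students missing only the single hardest item end up with similar ability estimates: the curve attributes those misses mostly to difficulty, not ability.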
🍞 Top Bread (Hook): Think of a ruler that measures not height, but the gap between your reading and drawing skills.
🥬 Filling (Concept 7: Gap Score)
- What it is: A single number that shows how far apart a model's understanding and generation abilities are, after fairly accounting for difficulty.
- How it works:
- Use MIRT to estimate the model's ability on understanding and on generation.
- Take the difference between those abilities.
- Normalize it to a 0–100 scale with extra penalties for always failing both and rewards when both succeed.
- Why it matters: Without a clear, comparable score, we can't track progress or compare models on true unification. 🍞 Bottom Bread (Anchor): If a model always answers correctly in text but only sometimes draws the right picture, its Gap Score will reveal that split.
Building blocks (broken down):
- Bidirectional items: Each question has twin prompts (understanding vs. generation) that share the same meaning.
- Four skill areas: Instruction Following, Numerical Perception, World Knowledge, and Reasoning, covering how models follow rules, count/transform numbers, use facts, and think logically.
- Reliable judging: GPT-5-mini is used as an automatic judge with 92% agreement with human ratings on a held-out set.
- Difficulty-aware math: MIRT separates "it was hard" from "the model can't do it," making the Gap Score meaningful.
Why it works (intuition): If two abilities really share one brain, then learning or recalling knowledge on one side should show up on the other. By asking the same thing in both directions and scoring with difficulty in mind, GAPEVAL exposes when the internal bridge between understanding and generation is strong, or when it's wobbly.
03 Methodology
High-level pipeline: Input (paired, two-way question) → Model produces text answer and image answer → Judge marks correct/incorrect → Aggregate results → MIRT estimates abilities and question difficulty → Gap Score summarizes the split.
Step 1: Build paired questions (the heart of symmetry)
- What happens: Each item includes an image (or none) and two prompts with the same meaning: one expects a text answer (understanding), the other expects an image (generation).
- Why this step exists: It forces the model to use the same knowledge in both directions, removing excuses like "that was a different task."
- Example: Numerical Perception: if the image shows 4 cameras and 2 headphones, the understanding prompt asks you to say the swapped counts, and the generation prompt asks you to draw the swapped counts.
🍞 Top Bread (Hook): Like following a recipe in words and also cooking it correctly. 🥬 Filling (Concept 8: Instruction Following)
- What it is: The ability to obey rules or edits consistently in text and images.
- How it works:
- Read the instruction (explicit or implicit rule).
- Apply the change consistently.
- Describe it (text) or show it (image) without contradictions.
- Why it matters: If a model can't follow rules the same way in both modes, edits become unreliable. 🍞 Bottom Bread (Anchor): Add a fox next to a horse: describe it correctly and also draw it correctly.
🍞 Top Bread (Hook): Imagine counting apples in a photo and then drawing exactly that many apples. 🥬 Filling (Concept 9: Numerical Perception)
- What it is: Seeing, understanding, and manipulating quantities across text and images.
- How it works:
- Detect object categories and counts.
- Apply a numerical change (like swapping numbers).
- Report via text or generate an image with correct counts.
- Why it matters: Without precise numbers, models overcount, undercount, or just copy the input instead of reasoning. 🍞 Bottom Bread (Anchor): If there are 3 ducks and 2 birds, the model must produce 2 ducks and 3 birds (text or image).
🍞 Top Bread (Hook): Think of facts you know, like book authors or famous paintings, and matching them to pictures. 🥬 Filling (Concept 10: World Knowledge)
- What it is: Using factual knowledge consistently between reading images/text and drawing images from text.
- How it works:
- Link words to real entities (people, places, culture).
- Recognize them in images; recall them when drawing from text.
- Keep the mapping stable in both directions.
- Why it matters: If the model names "American Gothic" but can't draw it when asked, knowledge isn't unified. 🍞 Bottom Bread (Anchor): Correctly say "American Gothic by Grant Wood" and also generate a compatible scene.
🍞 Top Bread (Hook): Picture solving a puzzle, then sketching the solution. 🥬 Filling (Concept 11: Reasoning)
- What it is: Logical thinking that connects clues to answers in text and images.
- How it works:
- Understand the context (diagrams, choices, physical setups).
- Infer what happens next or which option is correct.
- Express the answer in text and render it visually.
- Why it matters: Without reasoning, models guess or copy instead of truly solving. 🍞 Bottom Bread (Anchor): Choose option D in text and also draw a red circle around D in an image version.
Step 2: Judging correctness fairly
- What happens: An LLM judge (GPT-5-mini) compares the modelâs outputs with reference answers. It marks correct (1) or incorrect (0), with anti-plagiarism checks for images.
- Why this step exists: Many answers are semantic (is the key idea right?), so an automatic but careful judge speeds evaluation while staying reliable (92% agreement with humans on a hold-out set).
- Example: In Instruction Following, the judge checks whether the rule-driven change is present in both text and image outputs, even if style differs.
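The 92% figure is a plain label-agreement rate between the automatic judge and human raters on the hold-out set. A minimal sketch, with names of my own choosing:

```python
def agreement(judge_labels, human_labels):
    """Fraction of items where the automatic judge's 0/1 verdict
    matches the human rater's 0/1 verdict."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Three matches out of four -> 0.75 agreement.
print(agreement([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

A rate like 0.92 on held-out items is what licenses using the LLM judge at scale in place of humans.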
Step 3: Aggregate outcomes into counts
- What happens: For each model, count how often it got: text-only correct, image-only correct, both correct, or both wrong.
- Why this step exists: These patterns reveal whether successes are shared or siloed by modality.
- Example: If a model often gets text right but image wrong on the same items, that's a warning sign for unification.
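The per-item tallies in this step reduce to a simple 2×2 contingency count over the two modalities. A minimal sketch (function and field names are my own, not from the paper):

```python
from collections import Counter

def outcome_counts(text_correct, image_correct):
    """Tally how each paired item was answered across the two modalities.

    text_correct / image_correct: parallel lists of 0/1 judge verdicts,
    one entry per bidirectional item.
    """
    counts = Counter()
    for t, i in zip(text_correct, image_correct):
        if t and i:
            counts["both_correct"] += 1  # shared success
        elif t:
            counts["text_only"] += 1     # understanding without generation
        elif i:
            counts["image_only"] += 1    # generation without understanding
        else:
            counts["both_wrong"] += 1    # shared failure
    return counts

# Many "text_only" items on the same content is the warning sign for unification.
print(outcome_counts([1, 1, 1, 0], [1, 0, 0, 0]))
```

Heavy off-diagonal mass (text_only plus image_only) is exactly the siloed-success pattern the step is meant to expose.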
🍞 Top Bread (Hook): Like a coach who knows which drills are hard and which skills each player has. 🥬 Filling (Concept 12: MIRT in practice)
- What it is: A tool to estimate two abilities (understanding, generation) while separating out question difficulty.
- How it works:
- Fit abilities per model and difficulty per task/modality.
- Use a smooth function to map ability–difficulty differences to success probabilities.
- Apply a prior to tie the two abilities together and avoid overfitting.
- Why it matters: It ensures the Gap Score reflects true ability differences, not just that some items were tougher. 🍞 Bottom Bread (Anchor): Two students missing the same very hard item shouldn't look as far apart as if one missed lots of easy items.
Step 4: Compute the Gap Score
- What happens: Take the difference between the model's estimated understanding and generation abilities, normalize it to 0–100, and adjust using observed co-success (reward) and co-failure (penalty) patterns.
- Why this step exists: A single interpretable number makes it easy to compare models and track progress toward true unification.
- Example: A model that often succeeds on both sides will earn a smaller gap than one that splits successes between modalities.
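The step above can be sketched in code. Since the exact formula is not reproduced here, this is a hypothetical illustration that combines the described ingredients (ability difference, 0–100 normalization, co-success reward, co-failure penalty); every name and coefficient is an assumption, not the paper's definition.

```python
def gap_score(theta_und, theta_gen, co_success, co_failure, theta_range=(-3.0, 3.0)):
    """Hypothetical Gap Score sketch: NOT the paper's exact formula.

    theta_und / theta_gen: MIRT ability estimates for understanding and
    generation. co_success / co_failure: observed rates of both-correct
    and both-wrong items, used as reward and penalty adjustments.
    """
    lo, hi = theta_range
    base = abs(theta_und - theta_gen) / (hi - lo) * 100.0  # normalized ability split
    adjusted = base * (1.0 + co_failure - co_success)      # penalty grows, reward shrinks
    return max(0.0, min(100.0, adjusted))                  # clamp to the 0-100 scale

# Identical abilities yield a zero gap; frequent co-failure widens a given split.
print(gap_score(1.0, 1.0, co_success=0.9, co_failure=0.0))  # 0.0
```

The design intent matches the text: a model that splits its successes between modalities scores a larger gap than one that succeeds (or fails) on both sides together.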
🍞 Top Bread (Hook): Suppose you try to teach the model a new fact; will it show up in both reading and drawing? 🥬 Filling (Concept 13: Knowledge Manipulation (Injection and Editing))
- What it is: Directly adding new facts (injection) or changing old ones (editing) during fine-tuning to see whether updates transfer across modalities.
- How it works:
- Pick knowledge triples (subject, relation, object) with text and images.
- Fine-tune only on understanding or only on generation.
- Test both sides to see if knowledge transfers.
- Why it matters: If updates don't transfer, the model's knowledge pools are disconnected, not unified. 🍞 Bottom Bread (Anchor): Teach "Microwave Oven → Rice Cooker" as an edit. If only the image side changes but the text side doesn't (or vice versa), knowledge isn't shared well.
The secret sauce: Symmetric, bidirectional items plus difficulty-aware scoring (MIRT) make GAPEVAL sensitive to true alignment rather than raw performance. It cleanly separates "can do both in sync" from "can do each separately"; that is the leap from engineering-level coupling to cognitive-level unification.
04 Experiments & Results
The test: GAPEVAL includes 646 carefully built, bidirectional items across four areas: Instruction Following, Numerical Perception, World Knowledge, and Reasoning. Each item can be answered in text and as an image, enabling a direct comparison on the same knowledge. An LLM judge rates correctness; MIRT estimates abilities and difficulty; the Gap Score summarizes the split.
The competition: The authors evaluate nine unified multimodal models (open- and closed-source) and compare them to strong understanding-only and generation-only baselines (e.g., GPT, Gemini, Qwen, FLUX). This spans different architectures, from LLM+diffusion hybrids to token-based unified transformers and mixture-of-experts/transformers designs.
The scoreboard (with context):
- Big picture: A persistent, significant gap appears across many UMMs: models often get one modality right and the other wrong on the same item.
- Understanding vs. generation: Understanding-only models like GPT-5-mini and Gemini-Flash achieve very high text accuracy (around mid-90% or more), often beating unified models on comprehension. On pure image generation, closed-source unified models (e.g., GPT-Image-1, Gemini 2.5 Flash Image) outperform generation-only baselines, but still show noticeable gaps between modalities.
- A striking case: OmniGen2, built on a strong diffusion generator (FLUX.1-dev), underperforms its own diffusion backbone on some generation tasks, signaling that adding unification components can dilute specific strengths.
- Category patterns: Gaps show up in all four areas. For World Knowledge and Instruction Following, some models read facts or rules correctly but fail to render them; for Numerical Perception, swapping counts exposes whether the model truly reasons numerically rather than copying; for Reasoning, image selection and visual rendering must agree, and they often donât.
- What "87%" means: Think of 87% as getting an A, but if the other modality sits at 60% (a D), the model isn't unified. GAPEVAL highlights these mixed report cards that single-direction tests hide.
Surprising findings:
- Performance vs. unification is decoupled: Some models with higher overall accuracy still have larger gaps. Conversely, a slightly weaker model may show a smaller gap, indicating better internal alignment.
- Lagging effect: As capability increases, understanding often improves first, generation later. This temporarily widens the gap before it narrows at the high end (e.g., GPT-Image-1, Gemini 2.5 Flash Image). It suggests different learning speeds for the two abilities.
- Non-unified baselines can win: On certain tasks, specialized understanding-only or generation-only models beat unified ones, showing that today's "unified" designs can hurt specialized performance.
Knowledge manipulation results (why the gap exists):
- Training only the understanding side sharply boosts understanding but barely helps generation; training only generation helps the other side only a little. In editing experiments, outdated knowledge often lingers on the untouched side.
- Convergence speeds differ: Understanding typically saturates faster; generation improves more slowly, explaining the temporary widening of the gap as models scale.
Takeaway: Today's UMMs often achieve surface-level coupling: two tools in one box rather than one shared brain. GAPEVAL's Gap Score reveals this and correlates with synergy benchmarks, suggesting that closing the gap is essential for strong cross-modal teamwork.
05 Discussion & Limitations
Limitations:
- Benchmark scope: Even with 646 bidirectional items and human curation, no single benchmark covers all real-world tasks or cultural contexts. Hidden biases could remain.
- Judge dependence: LLM-as-a-judge reaches 92% agreement with humans on a held-out set, but automatic judging is not perfect, especially for edge cases or novel styles.
- Metric focus: The Gap Score captures alignment between two abilities, but not every dimension of quality (e.g., style fidelity, safety, or long-horizon planning).
- Model coverage: The study tests many important models, but not all architectures or training recipes; future designs might unify better.
Required resources:
- Paired, high-quality, human-reviewed datasets; access to an LLM judge; compute for MIRT fitting; and multiple model runs to reduce sampling variance.
When not to use:
- If you only need single-direction performance (e.g., only text Q&A or only image drawing) and cross-modal transfer doesn't matter, GAPEVAL's bidirectional setup may be overkill.
- If your domain is outside the benchmark's coverage (e.g., medical imaging with unique constraints), you may need a domain-specific extension.
Open questions:
- Architecture: What designs best share knowledge between understanding and generation (e.g., tighter shared tokenizers or unified latent spaces)?
- Training signals: Which curricula or losses align modalities fastest without sacrificing specialized skill?
- Knowledge editing: How can we guarantee synchronized updates across modalities with minimal forgetting?
- Causality and reasoning: How do we ensure that the modelâs chain of thought drives both text and image in the same, verifiable way?
- Evaluation: Can future judges be more transparent and robust, or include human-in-the-loop verification efficiently?
Bottom line: GAPEVAL doesn't just hand out scores; it diagnoses whether a model's two big brains act as one. That diagnosis enables targeted fixes: architectures, data, and training that aim for genuine, cognitive-level unification rather than taped-together parts.
06 Conclusion & Future Work
Three-sentence summary: This paper introduces GAPEVAL, a bidirectional benchmark where every item can be answered in text and as an image, enabling a fair test of whether understanding and generation truly share knowledge. Using a difficulty-aware testing method (MIRT), it defines a Gap Score that quantifies the split between the two abilities and shows a persistent gap across many unified multimodal models. Knowledge manipulation experiments reveal that updates often fail to transfer across modalities, suggesting current unification is mostly engineering-level rather than cognitive.
Main achievement: A clean, symmetric experimental framework and principled Gap Score that finally measure, not just guess, how well a model's understanding and generation are integrated.
Future directions: Design architectures and training strategies that explicitly synchronize the two modalities' knowledge; create curricula that balance learning speeds; invent editing methods that update both sides together; and expand GAPEVAL to more domains and longer, multi-step tasks.
Why remember this: As AI systems become everyday helpers, we need them to read and draw from the same playbook. GAPEVAL gives us the ruler to measure that unity, and the map to improve it, so future models don't just do two tricks, but do them as one mind.
Practical Applications
- Benchmark multimodal assistants to ensure they can both describe and correctly draw edits to user photos.
- Audit enterprise AI tools for cross-modal consistency before deployment in design, marketing, and e-commerce.
- Design training curricula that balance understanding-first gains with generation catch-up to reduce the lagging gap.
- Evaluate knowledge updates (injection/editing) to confirm new facts transfer from text understanding to image generation.
- Select models for education platforms that must both explain diagrams and generate accurate visualizations.
- Tune robotics or mapping systems to ensure read–draw consistency for plans, paths, and scene reconstructions.
- Compare architectures (LLM+diffusion vs. unified tokenizers) with the Gap Score to guide R&D choices.
- Monitor model upgrades over time to ensure improved performance does not secretly increase the modality gap.
- Extend GAPEVAL with domain-specific items (e.g., medical or scientific diagrams) to check safety-critical alignment.