
Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Intermediate
Jinlong Ma, Yu Zhang, Xuefeng Bai et al. Ā· 2/4/2026
arXiv Ā· PDF

Key Summary

  • The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
  • It studies a task called GMNER, where the model must find names in a sentence, tell what type they are (person, organization, etc.), and point to the right spot in the picture—or say None if it isn’t there.
  • The authors identify a big problem called modality bias: models take shortcuts by trusting only one modality (textual bias or visual bias) and skip cross-checking.
  • They propose Modality-aware Consistency Reasoning (MCR), a two-part recipe that forces careful, step-by-step cross-modal thinking.
  • Part 1, Multi-style Reasoning Schema Injection (MRSI), builds many clear, template-like thinking paths so the model explains how it used text and image at each step.
  • Part 2, Constraint-guided Verifiable Optimization (CVO), gives rule-based, checkable rewards (entity count, span, type, visibility, and box overlap) and trains the model with GRPO to follow those rules.
  • Across benchmarks (Twitter-GMNER, MNER-MI, GREC), MCR reduces both visual and textual bias and beats strong baselines, improving GMNER F1 by up to about 11.9 points over the best previous unified method.
  • MCR even makes smaller open models (like Qwen2.5-VL-7B and Mimo-VL-7B) much stronger than simple fine-tuning, and sometimes rivals or passes bigger models without MCR.
  • The method is efficient (LoRA, verifiable rewards) and stable (multi-style reasoning helps explore safely), but it still depends on the model’s built-in knowledge for rare or unseen entities.

Why This Research Matters

Apps that match words to pictures power news, shopping, education, and accessibility; when they guess from only text or only image, people get mislabeled and users lose trust. This work shows how to make models prove their steps and get rewarded only for checks that pass, so the answers are grounded and fair. It reduces hallucinations like inventing an entity or boxing the wrong person, which helps prevent misinformation. It also boosts the usefulness of smaller, open models, lowering costs and widening access. In the long run, this approach makes multimodal AI more reliable for building knowledge graphs, moderating content, and assisting users who rely on accurate, grounded answers.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re playing a matching game: you read a sentence about "Kevin Durant" and look at a photo to see if he’s there. If he is, you point to him; if not, you say, "not in the picture." Easy when you look and read carefully—hard if you only do one.

🄬 The Concept (MLLMs): You know how a superhero can use both eyes and ears? Multimodal Large Language Models (MLLMs) are AI superheroes that read text and look at images together to understand questions and give answers.

  • How it works (simple): 1) Read the words. 2) Look at the picture. 3) Combine clues from both. 4) Answer the question.
  • Why it matters: If an AI only reads or only looks, it can guess wrong—like naming the wrong team because it saw a logo but didn’t read the sentence. šŸž Anchor: When you ask, ā€œIs ā€˜Iggy’ in this sentence and where is he in the photo?ā€, an MLLM should check both the sentence and the image before deciding.

šŸž Hook: You know how a treasure map tells you what the treasure is and where to find it?

🄬 The Concept (GMNER): Grounded Multimodal Named Entity Recognition (GMNER) is a task where the AI finds named entities in text (like people or teams), decides their type, and either points to their location in the image (with a box) or says None if they’re not visible.

  • How it works: 1) Find names in the sentence. 2) Pick the right type (person, organization, location, miscellaneous). 3) Check the image: visible or not? 4) If visible, draw a box; if not, write None.
  • Why it matters: Without this, apps can’t reliably link what’s written to what’s shown (like matching a news caption to the right person in a photo). šŸž Anchor: Sentence: ā€œLouis van Gaal forgets who has won a Premier League in his squad.ā€ Photo: a team picture. GMNER should find ā€œLouis van Gaalā€ (person) and ā€œPremier Leagueā€ (organization), and only box things actually visible; if ā€œPremier Leagueā€ isn’t shown, say None (a data-structure sketch follows below).
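To make the output format concrete, here is a minimal Python sketch of the (entity, type, location) triples GMNER produces; the class and field names are illustrative choices, not the paper’s code, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A bounding box as (x1, y1, x2, y2) pixel coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class GMNERTriple:
    entity: str               # span copied verbatim from the sentence
    entity_type: str          # person / organization / location / miscellaneous
    location: Optional[Box]   # a box if the entity is visible, else None

# Worked example for the anchor sentence above (coordinates invented):
predictions = [
    GMNERTriple("Louis van Gaal", "person", (120.0, 40.0, 310.0, 420.0)),
    GMNERTriple("Premier League", "organization", None),  # not shown in the photo
]
```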

šŸž Hook: Think of a yearbook where you match names to faces.

🄬 The Concept (EEG): Entity Extraction & Grounding (EEG) is the part of GMNER that decides whether each text entity is visible in the image and, if so, where.

  • How it works: 1) For each found name, ask ā€œIs it in the picture?ā€ 2) If yes, place a bounding box; if no, say None.
  • Why it matters: Without EEG, we’d know the names but not their places—or we’d point to the wrong thing. šŸž Anchor: If the sentence says ā€œKevin Durantā€ and he’s in the photo, EEG puts a box around him; if he’s not there, it outputs None.

šŸž Hook: When reading a picture book, you point to the dragon when the words say ā€œdragon.ā€

🄬 The Concept (VG): Visual Grounding (VG) links words to the correct parts of an image.

  • How it works: 1) Read the description. 2) Search the image for the matching object/person. 3) Draw a box if found.
  • Why it matters: Without VG, the AI might talk about the right thing but point to the wrong place. šŸž Anchor: If the text says ā€œthe man in the red hat,ā€ VG should point to the man actually wearing a red hat.

šŸž Hook: Have you ever guessed the answer without checking all the clues—and got it wrong?

🄬 The Concept (Modality Bias): Modality bias is when the AI takes a shortcut and trusts either text alone (textual bias) or image alone (visual bias) instead of checking both together.

  • How it works: 1) The model sees a strong cue in one modality. 2) It jumps to a conclusion. 3) It skips verifying with the other modality. 4) It outputs a confident but wrong answer.
  • Why it matters: Textual bias can make the model box the wrong person for a name in the text just because someone is visible; visual bias can make it invent an entity the sentence never mentions just because of a visible logo. šŸž Anchor: The model sees Kevin Durant in the image and wrongly boxes him for the text entity ā€œIggy,ā€ or it invents ā€œManchester Unitedā€ from a team logo even if the sentence never mentioned it.

The world before: Researchers used MLLMs inside pipelines: one piece to read text, another to find boxes, and a separate piece to match them. That works but can be slow and can pass mistakes from one step to the next. People tried prompting tricks like Chain-of-Thought (CoT) or few-shot examples to nudge models to think more carefully, which helped a bit, but models still slipped into modality bias.

The problem: End-to-end MLLMs for GMNER often hallucinate and choose easy, unimodal clues instead of doing solid cross-checking across text and image.

Failed attempts: 1) Plain prompting—too weak to stop shortcuts. 2) Simple fine-tuning—learns patterns but not reliable verification. 3) Pipelines—reduce some confusion but add cost and error chains.

The gap: We needed a way to make models actually reason across modalities—with clear steps and rules—and to reward them only when they truly follow these steps.

Real stakes: Incorrect grounding can mislabel people in news photos, mismatch products in shopping apps, or build wrong links in knowledge graphs. Getting cross-modal reasoning right makes everyday tools more trustworthy and safe.

02 Core Idea

šŸž Hook: You know how teachers ask you to ā€œshow your work,ā€ not just the final answer?

🄬 The Concept (MCR): Modality-aware Consistency Reasoning (MCR) is a two-part method that makes the model show its cross-modal work and grades it with rules it can’t cheat.

  • How it works: 1) Give the model many clear, structured ways to reason (MRSI). 2) Check the model’s reasoning and answers with rule-based rewards and train it to improve (CVO with GRPO).
  • Why it matters: Without MCR, models keep taking shortcuts (modality bias). With MCR, they must verify text and image step-by-step. šŸž Anchor: Instead of guessing ā€œDurantā€ for any basketball sentence with a player photo, the model lists entities from text, checks visibility in the picture, and only then draws a box—or writes None.

Aha! moment (one sentence): If we force models to produce structured, multi-style reasoning and then reward only answers that pass verifiable cross-modal checks, they stop relying on unimodal shortcuts.

Three analogies:

  1. Detective notebook: Write down each clue from text and image, then only accuse when clues agree.
  2. Cooking with recipes: Multiple step-by-step recipes (schemas) help make the same dish reliably; taste-test rules (rewards) ensure the dish meets standards.
  3. Crossing the street: Look left (text), look right (image), then go (answer); a crossing guard (rewards) only gives a thumbs-up if you followed the steps.

Before vs after:

  • Before: Models guessed from text or image alone; hallucinations were common.
  • After: Models explain entity finding, type choosing, visibility checking, and boxing; they get rewarded for being consistent and precise.

Why it works (intuition without math):

  • Multi-style reasoning gives the mind ā€œpathsā€ to think clearly and diversely, so it doesn’t overfit one pattern.
  • Verifiable rewards act like objective graders: count entities correctly, match spans, pick correct types, agree on visibility, and overlap boxes well. No opinion, just facts.
  • GRPO compares groups of attempts and pushes the whole distribution toward better, rule-following behavior—reducing collapse and discouraging reward gaming.

Building blocks in small pieces:

šŸž Hook: Imagine a toolbox with different checklists.

🄬 The Concept (MRSI): Multi-style Reasoning Schema Injection creates and injects many reasoning styles (templates) that turn abstract rules into concrete step-by-step thinking.

  • How it works: 1) Define constraints for entity finding, typing, visibility (entailment), and boxing. 2) Use templates/LLMs/MLLMs to generate multiple reasoning styles on labeled data. 3) Fine-tune so the model can produce these styles and then answer.
  • Why it matters: Without MRSI, the model has no structured path and reverts to shortcuts. šŸž Anchor: A style might say: ā€œ1) list entities from text, 2) give types, 3) mark visible/invisible, 4) give boxes or None, 5) output final triples.ā€ (A template sketch follows below.)
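As a rough illustration of what ā€œmultiple stylesā€ could look like in practice, here is a hypothetical sketch; the template wording below is invented, not the paper’s actual prompts, but each style walks the same four constraints in a different narrative form.

```python
# Hypothetical reasoning-style templates. Each enforces the same checks
# (find entities, assign types, verify visibility, box or None) in a
# different surface form, so the model learns the procedure, not one phrasing.
SCHEMA_STYLES = {
    "checklist": (
        "1) Entities in text: {entities}\n"
        "2) Types: {types}\n"
        "3) Visibility in image: {visibility}\n"
        "4) Boxes or None: {boxes}\n"
        "5) Final triples: {triples}"
    ),
    "question_answer": (
        "Q: Which names appear in the sentence? A: {entities}\n"
        "Q: What type is each? A: {types}\n"
        "Q: Is each one visible in the image? A: {visibility}\n"
        "Q: Where is each visible one? A: {boxes}\n"
        "Final: {triples}"
    ),
}

def render_style(style_name: str, fields: dict) -> str:
    """Fill one reasoning style with example-specific content."""
    return SCHEMA_STYLES[style_name].format(**fields)

# Example: the NFL sentence rendered in the checklist style.
print(render_style("checklist", {
    "entities": "[NFL]",
    "types": "NFL -> organization",
    "visibility": "NFL not visible (image shows an NBA logo, not NFL)",
    "boxes": "NFL -> None",
    "triples": "(NFL, organization, None)",
}))
```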

šŸž Hook: Think of a game where points only count if a referee can check them.

🄬 The Concept (CVO): Constraint-guided Verifiable Optimization trains the model with checkable rewards that reflect the constraints, nudging it toward reliable cross-modal reasoning.

  • How it works: 1) Generate several answers per input. 2) Score each with rules (counts, spans, types, visibility, IoU). 3) Use GRPO to move the model toward higher-scoring behavior, carefully and stably.
  • Why it matters: Without CVO, the model learns to sound smart but not to be right. šŸž Anchor: If the model says there are 2 entities but labels 3, it loses count reward; if it draws a bad box, it loses IoU reward.

šŸž Hook: Imagine comparing all players in a team to decide how to practice.

🄬 The Concept (GRPO): Group Relative Policy Optimization improves the model by comparing a group of its own answers and favoring the better ones, while clipping updates to stay stable.

  • How it works: 1) Sample multiple responses. 2) Compute group mean and spread. 3) Give each response a relative advantage. 4) Update the model, but keep steps small.
  • Why it matters: Without GRPO, training can be unstable or collapse to one style. šŸž Anchor: If three answers score 0.2, 0.6, and 0.7, GRPO learns more from the higher ones, but doesn’t jump so far that it forgets how to generalize (see the sketch below).
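A minimal sketch of the group-relative advantage at the heart of GRPO, using the three example scores above; the normalization shown here is the standard recipe, while the paper’s exact constants and clipping details are not reproduced.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    The advantage is how far a response's reward sits above or below
    the group mean, measured in units of the group's spread.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Three answers scoring 0.2, 0.6, and 0.7 (the example above):
print(group_advantages([0.2, 0.6, 0.7]))
# -> roughly [-1.39, 0.46, 0.93]: the two better answers are pushed up,
#    the weak one down. The actual policy update then uses a PPO-style
#    clipped ratio so no single step moves the model too far.
```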

šŸž Hook: Think of a teacher’s answer key that you can check line-by-line.

🄬 The Concept (Verifiable Rewards): These are rule-based scores that anyone can compute from outputs—no fuzzy judge needed.

  • How it works: 1) Count reward: match predicted vs true number of entities. 2) Span reward: token-level overlap for entity text. 3) Type reward: correct category. 4) Entailment (visibility) reward: both None or both visible. 5) Grounding reward: IoU overlap for boxes.
  • Why it matters: Without verifiable rewards, models might learn to write pretty explanations that aren’t correct. šŸž Anchor: If ā€œPremier Leagueā€ isn’t shown, the best answer marks its location as None and earns the entailment reward; a random box scores zero there. The sketch below makes these rules concrete.
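Here is a minimal Python sketch of rule-based rewards in the spirit of the five checks above. The exact scoring, weighting, and thresholds in the paper may differ; these are illustrative implementations.

```python
def count_reward(pred_entities, gold_entities) -> float:
    """1.0 if the predicted number of entities matches the gold count."""
    return float(len(pred_entities) == len(gold_entities))

def span_reward(pred_span: str, gold_span: str) -> float:
    """Token-overlap F1 between predicted and gold entity text
    (a set-based approximation of token-level F1)."""
    p, g = set(pred_span.lower().split()), set(gold_span.lower().split())
    common = len(p & g)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def type_reward(pred_type: str, gold_type: str) -> float:
    """1.0 for the correct category, else 0.0."""
    return float(pred_type == gold_type)

def entailment_reward(pred_box, gold_box) -> float:
    """Visibility agreement: both None, or both not None."""
    return float((pred_box is None) == (gold_box is None))

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gold_box, threshold=0.5) -> float:
    """IoU-based box reward; the 0.5 gate is a common convention, assumed here."""
    if pred_box is None or gold_box is None:
        return 0.0
    overlap = iou(pred_box, gold_box)
    return overlap if overlap >= threshold else 0.0
```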

03 Methodology

High-level recipe: Input (sentence + image) → MRSI (learn step-by-step schemas) → Supervised injection (model learns to generate reasoning paths and answers) → CVO (optimize with verifiable rewards using GRPO) → Output (entity, type, location triples).

Step A: Multi-style Reasoning Schema Injection (MRSI)

  • What happens: We turn four core constraints—entity recognition (Cs), type classification (Ct), visual entailment/visibility (Ce), and grounding with boxes (Cu)—into many concrete, readable reasoning styles. Using templates and helper models, we generate multiple ā€œways of thinkingā€ for each labeled example and fine-tune the MLLM to follow them.
  • Why this step exists: Without explicit schemas, the model drifts back to shortcuts (e.g., guessing from an image logo). Schemas make the model show how it checked both text and image.
  • Example: For the sentence ā€œHelps on the way @ NFL.ā€ and an image with an NBA logo, a schema guides: 1) entities in text: {NFL}; 2) types: NFL → organization; 3) visibility: check image—NBA logo ≠ NFL, so NFL is invisible; 4) location: NFL → None; 5) final triple: (NFL, organization, None).
  • What breaks without it: The model might see the NBA logo and wrongly box it for ā€œNFLā€ (textual bias), failing the task.

Step B: Supervised injection objective

  • What happens: We fine-tune the MLLM to first generate a reasoning path z given (text, image), then produce the final answer y from that path. The training nudges the model to keep its internal chain consistent with the output.
  • Why this step exists: It teaches the model that the path matters, not just the end result.
  • Example: The model learns to output: <process> ... entities = 1 (NFL), type = organization, visible = no, box = None ... </process> <answer> (NFL, organization, None) </answer>. (A small builder sketch follows this list.)
  • What breaks without it: The model might output correct answers by luck during training but fail to generalize its checking behavior.
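To show what such a training target could look like when assembled programmatically, here is a small sketch; the helper name and field layout are hypothetical, while the <process>/<answer> tags follow the example above.

```python
def build_target(entities, types, boxes) -> str:
    """Assemble a reasoning-then-answer training string (illustrative format).

    The model is fine-tuned to emit the <process> block first, so the final
    triples are conditioned on its own stated cross-modal checks.
    """
    steps = []
    for ent, typ, box in zip(entities, types, boxes):
        visible = "yes" if box is not None else "no"
        steps.append(f"{ent}: type = {typ}, visible = {visible}, box = {box}")
    process = f"entities = {len(entities)}; " + "; ".join(steps)
    answer = ", ".join(
        f"({ent}, {typ}, {box if box is not None else 'None'})"
        for ent, typ, box in zip(entities, types, boxes)
    )
    return f"<process> {process} </process> <answer> {answer} </answer>"

# The NFL example above:
print(build_target(["NFL"], ["organization"], [None]))
# <process> entities = 1; NFL: type = organization, visible = no, box = None </process>
# <answer> (NFL, organization, None) </answer>
```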

Step C: Constraint-guided Verifiable Optimization (CVO)

  • What happens: We switch to rule-checked training. For each input, the current model produces several candidate answers. We score each with verifiable rewards: entity count (match number), span overlap (token F1), type match, visibility match (None vs not None), and grounding quality (IoU for boxes). Then we apply GRPO to push the model toward higher-scoring behaviors.
  • Why this step exists: It aligns the model’s behavior with objective, checkable criteria and discourages pretty-but-wrong rationales.
  • Example with data: If gold has 2 entities and the model predicts 3, count reward shrinks. If it labels ā€œPremier Leagueā€ as visible when it’s not in the image, the visibility reward is zero. If it draws a tight box on ā€œKevin Durant,ā€ the IoU reward is high.
  • What breaks without it: The model can keep verbose explanations that don’t truly check cross-modal consistency, reintroducing modality bias.

Secret sauce 1: Multi-style vs single-style

  • Multi-style schemas expose the model to many valid thinking routes. Early on, single-style may look stronger (more focused), but later, multi-style catches up and surpasses it by exploring the reasoning space safely. This avoids collapse (overfitting to one path) and improves stability.

Secret sauce 2: GRPO group advantage

  • Comparing a batch of attempts (rather than one) makes the policy updates robust. Normalizing within the group and clipping updates prevents wild swings and keeps learning steady.

Secret sauce 3: Verifiable rewards

  • No learned reward model needed. Simple rules—counts, overlaps, matches—are transparent and tamper-resistant. If a box isn’t overlapping enough (IoU < threshold), it earns little to no reward.

Putting it together in plain steps:

  1. Build reasoning paths: For each labeled example, generate multiple reasoning styles that explicitly follow Cs, Ct, Ce, Cu.
  2. Teach the model to follow paths: Fine-tune so the MLLM writes a process first and then the final answer.
  3. Sample multiple answers: For each input, generate several candidates to encourage exploration.
  4. Score with rules: Compute rewards for entity count, span overlap, type correctness, visibility agreement, and box IoU.
  5. Update with GRPO: Prefer the better candidates, but keep changes clipped and length-normalized for stability (a schematic loop covering steps 3–5 follows this list).
  6. Iterate: Over epochs, the model’s reasoning becomes more concise and consistent, with fewer hallucinations and better grounding.
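Putting steps 3–5 together, the optimization stage can be sketched at a schematic level as below; `model.sample`, `total_reward`, and `grpo_update` are hypothetical stand-ins for the candidate sampler, the combined rule-based rewards, and the clipped policy update, and `group_advantages` is the helper sketched earlier.

```python
def cvo_training_loop(model, dataset, group_size=8, epochs=3):
    """Schematic CVO loop: sample a group, score with rules, update with GRPO."""
    for _ in range(epochs):
        for text, image, gold in dataset:
            # Step 3: sample several candidate reasoning-plus-answer outputs.
            candidates = [model.sample(text, image) for _ in range(group_size)]

            # Step 4: score each candidate with the verifiable rewards
            # (entity count, span overlap, type, visibility, box IoU).
            rewards = [total_reward(candidate, gold) for candidate in candidates]

            # Step 5: group-relative advantages, then a clipped,
            # length-normalized policy update toward the better candidates.
            advantages = group_advantages(rewards)
            grpo_update(model, candidates, advantages)
```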

šŸž Hook: Imagine two chore charts—one tells you what to do (MRSI), the other checks if you did it right (CVO).

🄬 The Concept (Cross-modal Reasoning in Action): The whole pipeline ensures the model actually uses text to list entities and types, and uses the image only to decide visibility and location—never the other way around.

  • How it works: 1) Text anchors entity identity and type. 2) Image verifies presence and place. 3) Rewards and GRPO enforce this separation and cooperation.
  • Why it matters: This division of labor prevents logo-driven hallucinations or text-only name guessing. šŸž Anchor: For ā€œRory Calhoun,ā€ the model classifies ā€œpersonā€ from text/knowledge and then checks the image: if he isn’t visible, location is None—even if a cat is visible and tempting.

04 Experiments & Results

The test: The authors evaluate three things: 1) GMNER full task (entity, type, and correct box/None), 2) MNER-MI (entity and type, weak text-image correlation), and 3) GREC visual grounding (including cases with no target region). They report Precision, Recall, F1, and special no-target measures (N-acc; plus N-Precision/N-Recall/N-F1 for textual bias in GMNER).

The competition: They compare against pipeline methods (like SCANNER), unified generation methods (like MQSPN, TIGER, H-Index), and end-to-end MLLMs with strong prompting baselines (direct prompting, Chain-of-Thought, few-shot) and simple Supervised Fine-Tuning (SFT). They test on strong open MLLMs (Qwen2.5-VL and Mimo-VL) and also reference closed models.

The scoreboard with context:

  • Training-free prompts (CoT, few-shot) help some, like doing extra practice before a test, but they don’t fix modality bias fully.
  • With MCR (MRSI + CVO), end-to-end MLLMs consistently beat prior baselines. On GMNER, the method improves F1 over the best previous unified method MQSPN by about 11.87 points—a big leap, like going from a B to a solid A.
  • Against a strong pipeline (SCANNER), MCR still adds around 2.11 F1 points—showing that structured end-to-end reasoning can surpass even knowledge-enhanced pipelines.
  • For small open models: On Qwen2.5-VL-7B and Mimo-VL-7B, MCR greatly boosts performance over plain SFT (roughly +8.05 F1 and +7.57 F1 respectively), turning them from middling students into top performers.
  • On MNER-MI (where images can mislead), MCR > SFT, and the second stage (CVO) usually beats MRSI alone—evidence that verifiable rewards sharpen the model’s use of text vs image.
  • On GREC (includes no-target cases), MCR raises N-acc and precision, meaning it gets better at saying ā€œNoneā€ when nothing is there and draws boxes correctly when something is.

Surprising findings:

  • Bias almost disappears: Using Qwen2.5-VL-7B or Mimo-VL-7B, the number and rate of invented entities not in the sentence (a visual bias sign) drop to near zero with MCR.
  • Textual bias metrics (N-Precision/N-Recall/N-F1) jump by about 14 points on Qwen2.5-VL-7B versus SFT. The model becomes more conservative about marking things as visible and more accurate about using None.
  • Multi-style schemas win the long game: Single-style looks good early but stalls; multi-style keeps exploring and ends higher, with more stable learning curves and converging reasoning length.
  • Smaller can rival bigger: A 7B model with MCR can outperform a much larger model without it in end-to-end GMNER settings, highlighting that better reasoning beats brute size.

What these numbers mean: Think of grading on three skills—reading names, knowing their type, and pointing to the right place or saying None. MCR improves all three together, especially the ā€œsay None when not visibleā€ part, which is where many models either over-claim (visual bias) or under-check (textual bias). The improvements are not just a little polish; they are strong, repeatable gains across datasets and subtasks.

05 Discussion & Limitations

Limitations:

  • Knowledge ceiling: MCR still depends on what the base MLLM already knows. If an entity is very rare or new, the model can misclassify the type or fail to recognize it.
  • Visual familiarity: If the model hasn’t seen enough of how certain people or logos look, it might be fooled by look-alikes.
  • Span sensitivity: Detecting exact entity text spans can still trip the model, affecting the span reward and downstream matching.

Required resources:

  • Training uses LoRA adapters on 7B-scale MLLMs, 8 GPUs, and a mix of GMNER/MNER/VG data. While lighter than full-model fine-tuning, it’s still a non-trivial setup.
  • You need labeled triples and, for CVO, the ability to compute rule-based rewards (token overlap, IoU), plus a sampler that produces multiple candidates per input.

When not to use:

  • Extremely low-resource, on-device scenarios where even LoRA is too heavy and reward computation is impractical.
  • Domains with highly specialized, rapidly changing entities (e.g., brand-new esports teams) unless you add external knowledge.
  • Cases where images often have multiple valid boxes per text mention (incompatible with the single-box GMNER setting unless adapted).

Open questions:

  • How to integrate external knowledge (like Wikipedia or live APIs) so the model can generalize to unseen entities while keeping the cross-modal checks strict?
  • Can we extend verifiable rewards to richer spatial relations (left of, overlapping, group presence) or multi-instance entities?
  • What’s the best way to auto-generate even more diverse, high-quality reasoning schemas without noise?
  • How to maintain the benefits in streaming or video settings where objects move and appear/disappear over time?

06 Conclusion & Future Work

Three-sentence summary: This paper tackles modality bias in end-to-end grounded multimodal named entity recognition by making MLLMs show and verify their cross-modal reasoning. It introduces MCR, which combines multi-style reasoning schema injection (to structure thinking) with constraint-guided verifiable optimization (to reward only correct, checkable steps using GRPO). The result is significantly better accuracy and much less hallucination across benchmarks.

Main achievement: Turning cross-modal verification from a hope (ā€œmaybe the model will do itā€) into a habit (ā€œthe model must do it and gets rewarded only if it doesā€), thereby outperforming strong pipelines and unified baselines.

Future directions: Plug in external knowledge so the system recognizes new or rare entities; expand rewards to relations and multi-instance grounding; adapt the approach to videos and longer documents; and explore lighter training for edge devices.

Why remember this: It shows that careful reasoning beats guessing—especially in multimodal AI. By enforcing explain-then-grade steps that anyone can verify, we get models that are not just eloquent but correct, making everyday tools like news curation, shopping, and assistive tech more reliable and fair.

Practical Applications

  • News photo captioning that correctly links names in the article to the right faces—or says None when the person isn’t shown.
  • E-commerce listings that match product mentions in reviews to the exact item image region, reducing confusion and returns.
  • Assistive tools that read a sentence and highlight the corresponding object in a scene (or confirm it’s not present), helping low-vision users.
  • Knowledge graph construction that safely extracts who/what from text and grounds it to visual evidence, improving data reliability.
  • Social media analysis that tags people/teams/places mentioned in posts only when truly visible, reducing false positives.
  • Content moderation that verifies if a claimed object actually appears in the image before taking action, minimizing mistakes.
  • Education apps that teach students to connect text descriptions with matching picture regions, reinforcing careful evidence use.
  • Photo library search that finds images where a text-mentioned person really appears, not just similar logos or scenes.
  • Sports highlight tools that pinpoint the named player in a frame instead of guessing from jerseys or colors.
  • Customer support systems that validate item mentions and locations in user-uploaded photos to speed up troubleshooting.
#GMNER Ā· #Multimodal Large Language Models Ā· #Modality Bias Ā· #Cross-modal Reasoning Ā· #Visual Grounding Ā· #Entity Extraction and Grounding Ā· #MRSI Ā· #CVO Ā· #GRPO Ā· #Verifiable Rewards Ā· #IoU Ā· #Chain of Thought Ā· #LoRA Fine-tuning