GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Key Summary
- •Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.
- •The authors built a new test called GroundingME with 1,005 tough, real-world-style questions that require fine detail, tricky spatial logic, tiny/hidden objects, and knowing when there is no correct answer.
- •Across 25 modern models, the best score was only 45.1% and most models got 0% at saying “there is no such object,” which is risky for safety-critical uses.
- •GroundingME checks four dimensions: Discriminative (tell apart lookalikes), Spatial (understand relationships and counting), Limited (occluded or tiny), and Rejection (say ‘null’ when the description doesn’t match).
- •Turning on “thinking mode” improved scores on its own, and selecting the best reasoning path at test time (best-of-N) added up to another 2.9%, especially for Spatial and Rejection cases.
- •Training with a mix of normal and “trick” negative data taught a model to reject wrong descriptions, lifting rejection accuracy from 0% to 27.9%.
- •The data was carefully built: auto-detect boxes, generate descriptions with an MLLM, and then humans refined each to be unique, specific, and fact-checked.
- •Results reveal a visual grounding gap: models often pattern-match keywords instead of truly checking every attribute and relation in the image.
- •This benchmark is a diagnostic tool and a roadmap—showing what to fix and how (better training data and smarter test-time selection).
- •Real-world impact includes safer robots, more reliable AR assistants, and fewer hallucinations in apps that must point to the right thing.
Why This Research Matters
GroundingME exposes when an AI points to the wrong thing—or invents something that isn’t there—so we can fix it before deploying in the real world. This matters for safety: robots should grab the correct tool, cars should not see objects that don’t exist, and AR glasses should highlight the right item. It also improves trust: assistants that can reliably say “no match” are more honest and useful. By pushing models to handle fine detail, complex relations, tiny or hidden items, and safe rejection, apps from shopping to healthcare imaging become more dependable. The benchmark also shows what training or testing strategies actually help, guiding the community toward practical improvements. In short, better grounding equals more accurate, safer, and more trustworthy AI systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how when a friend says, “Pass me the red book on the left of the blue cup,” you don’t just grab any book—you look carefully, compare colors, check positions, and make sure it matches exactly.
🥬 The Concept: Visual Grounding
- What it is: Connecting a natural-language description to the exact location of the right object in an image.
- How it works:
- Read the sentence (e.g., “the small red mug behind the big plate”).
- Scan the image for candidates that might match.
- Check each attribute (color, size, texture) and each relation (left of, behind, first from the right).
- Pick the one box that matches all details—or say “none” if nothing fits.
- Why it matters: Without it, an AI can answer in words but can’t reliably point to the right thing, which breaks tasks like robot picking or highlighting objects.
🍞 Anchor: In a kitchen photo with three mugs, visual grounding lets AI find “the tiny green mug behind the kettle,” not just any mug.
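To make these steps concrete, here is a minimal Python sketch of the matching loop (it also covers the rejection case discussed later in this section). The `Candidate` record and the attribute/relation sets are illustrative assumptions for a toy example, not the paper's implementation, which reasons directly over pixels.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Candidate:
    box: Tuple[int, int, int, int]                 # (x1, y1, x2, y2) in pixels
    attributes: set = field(default_factory=set)   # e.g., {"mug", "red", "small"}
    relations: set = field(default_factory=set)    # e.g., {"behind:kettle"}

def ground(required_attrs: set, required_rels: set,
           candidates: list) -> Optional[Tuple[int, int, int, int]]:
    """Return the box of the single candidate that satisfies every stated
    constraint, or None ("null") if nothing fully matches."""
    matches = [c for c in candidates
               if required_attrs <= c.attributes and required_rels <= c.relations]
    # Exactly one full match -> point to it; otherwise the safe answer is None.
    return matches[0].box if len(matches) == 1 else None

# "the small red mug behind the kettle"
cands = [Candidate((10, 20, 60, 80), {"mug", "red", "small"}, {"behind:kettle"}),
         Candidate((90, 20, 150, 90), {"mug", "green", "tiny"}, set())]
print(ground({"mug", "red", "small"}, {"behind:kettle"}, cands))  # (10, 20, 60, 80)
print(ground({"washing machine"}, set(), cands))                  # None -> rejection
```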
🍞 Hook: Imagine a student who can read stories and also look at pictures—strong at both language and vision.
🥬 The Concept: Multimodal Large Language Models (MLLMs)
- What it is: Big AI systems that process and reason over both text and images.
- How it works:
- A vision part turns pixels into features.
- A language part understands the sentence and uses the vision features.
- Together, they match words to image regions and decide an answer.
- Why it matters: MLLMs power tasks like “point to the dog with the red collar” or “count the cars in the back row.”
🍞 Anchor: When you ask a robot, “Pick the third apple from the left,” an MLLM helps the robot know which apple you mean.
🍞 Hook: Think of old quizzes that are so easy everyone gets an A—do they really show who understands?
🥬 The Concept: Existing Benchmarks Often Had Shortcuts
- What it is: Past tests (like RefCOCO series and even newer ones) often used short phrases or had unique class names that made guessing too easy.
- How it works:
- The model spots a keyword (e.g., “giraffe”) that appears only once.
- It ignores tricky attributes and relations.
- It gets high scores without real reasoning.
- Why it matters: High scores can be misleading if models don’t truly compare details or reason about spatial relations.
🍞 Anchor: If the sentence says “the cup to the left of the plate,” shortcutting might grab any cup, not specifically the one to the left of the plate.
🍞 Hook: Sometimes the right answer is “there is no such thing.” Humans do this easily.
🥬 The Concept: Rejection (Saying “none” when no object matches)
- What it is: The model should output “null” if the description doesn’t match anything.
- How it works:
- Check all stated details (sleeve length, color, text on object).
- If any required detail fails for every candidate, say “none.”
- Don’t “make up” objects just to answer.
- Why it matters: Without rejection, AIs hallucinate—dangerous for safety (e.g., a robot grabbing the wrong tool or a car misidentifying obstacles).
🍞 Anchor: If the request is “the washing machine in the corner,” but there is no washing machine, the safe answer is null.
🍞 Hook: Imagine a tougher obstacle course that tests balancing, sprinting, puzzles, and knowing when to stop.
🥬 The Concept: The Gap and Why This Paper Exists
- What it is: Today’s MLLMs ace easy tests but struggle with real-life complexity.
- How it works:
- Old tests didn’t force careful discrimination, rich spatial reasoning, tiny/occluded object handling, or safe rejection.
- The paper builds a new benchmark that stresses all four.
- Why it matters: We need a realistic scorecard to guide safer, smarter systems.
🍞 Anchor: If a model gets 95% on easy worksheets but only 45% on a realistic final exam, you know what to fix.
02 Core Idea
🍞 Hook: Imagine a referee who doesn’t just count goals, but checks offsides, tiny fouls, team positions, and also calls “no goal” when needed.
🥬 The Concept: GroundingME Benchmark
- What it is: A new, multi-dimensional test for visual grounding that mimics real-world difficulty.
- How it works (four dimensions):
- Discriminative: Tell apart very similar objects by fine details (appearance, parts, text, state).
- Spatial: Understand relations and counting among multiple objects.
- Limited: Handle occlusion and tiny targets in huge images.
- Rejection: Say “none” when the description can’t be grounded.
- Why it matters: Without all four, models may score high by guessing but fail in real-life tasks.
🍞 Anchor: In a crowd of near-identical planes, GroundingME asks for “the one slightly below and left,” not just “any plane.”
🍞 Hook: Think of three easy-to-grasp stories that explain the same idea from different angles.
🥬 The Concept: Multiple Analogies for the Idea
- What it is: The key insight is to test what truly requires careful matching, not just keyword hunting.
- How it works:
- Detective story: The model must match every clue (color, text, relations) and admit “case unsolved” if none fits.
- Gym class: Balance (discriminative), teamwork positions (spatial), weights with gloves (limited), and knowing when to stop (rejection).
- Bakery order: The pastry chef checks every ingredient (attributes), where it sits in the tray (spatial), finds small cookies in a big oven pic (limited), and cancels the order if the recipe is impossible (rejection).
- Why it matters: These dimensions stop shortcut tricks and reveal true grounding skill.
🍞 Anchor: A student who passes all four mini-tests is actually ready for the real world, not just cramming.
🍞 Hook: Before and after pictures make changes obvious.
🥬 The Concept: Before vs After
- What it is: How evaluation changes with GroundingME.
- How it works:
- Before: 90%+ on old datasets using keywords and simple scenes.
- After: the best model reaches only 45.1% on GroundingME, and many models score 0% on Rejection.
- Why it matters: We now see the real ceiling and can target weaknesses.
🍞 Anchor: It’s like switching from basic arithmetic drills to real-life word problems—you find different gaps.
🍞 Hook: Picture a rule-of-thumb instead of long equations.
🥬 The Concept: Why It Works (Intuition)
- What it is: Rich, constraint-heavy descriptions in crowded, high-res scenes force genuine reasoning.
- How it works:
- Many similar distractors remove keyword shortcuts.
- Spatial and counting require stepwise logic.
- Tiny/occluded objects demand precise perception.
- Rejection punishes hallucinations.
- Why it matters: The test aligns with how humans check every clue and say “none” when needed.
🍞 Anchor: A model that truly verifies long sleeves, black pants, and position will reject a child in blue shorts even if a white shirt matches.
🍞 Hook: Big ideas are easier when broken into Lego bricks.
🥬 The Concept: Building Blocks
- What it is: Pieces that make GroundingME effective and fair.
- How it works:
- Diverse images from SA-1B and 8K HR-Bench.
- Careful bounding boxes (RAM++, GroundingDINO, custom NMS; manual for 8K).
- Rich descriptions bootstrapped by an MLLM then human-refined for uniqueness, clarity, specificity, and factuality.
- Balanced subcategories (12 total) for diagnostic insights.
- Why it matters: Each block prevents shortcuts and ensures challenge variety.
🍞 Anchor: The result is a “final exam” that actually checks the skills we want—fine detail, relations, tiny targets, and safe rejection.
03 Methodology
🍞 Hook: Imagine assembling a tough obstacle course: pick the field, place the cones, write the rules, and decide how winners are judged.
🥬 The Concept: High-Level Pipeline
- What it is: Input → Box the objects → Write and refine descriptions → Evaluate models with strict scoring.
- How it works:
- Data source selection (SA-1B, HR-Bench).
- Bounding box annotation (semi-auto for SA-1B; manual for 8K HR-Bench).
- Description generation (MLLM) and human refinement.
- Unified prompts + Accuracy@0.5 scoring.
- Why it matters: Each step removes shortcuts and raises real-world difficulty.
🍞 Anchor: It’s like building a fair, challenging tournament so scores truly mean skill.
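For a bird's-eye view, the same pipeline can be sketched as a few stages. Every function below is a toy stand-in for the real stage described above (RAM++/GroundingDINO detection, MLLM drafting, human refinement) and returns hard-coded placeholder data; it only shows how the pieces fit together.

```python
# Toy, self-contained sketch of the construction pipeline; each stub stands in
# for a real stage and is NOT the actual tooling used by the authors.

def detect_boxes(image):             # stands in for RAM++ tagging + GroundingDINO + custom NMS
    return [(10, 10, 50, 50)]

def draft_description(image, box):   # stands in for MLLM bootstrapping via visual prompts/crops
    return "the small red mug behind the kettle"

def human_refine(image, box, draft): # stands in for manual refinement and fact-checking
    return draft, True               # (refined text, keep this item?)

def build_benchmark(images):
    records = []
    for image in images:
        for box in detect_boxes(image):
            description, keep = human_refine(image, box, draft_description(image, box))
            if keep:
                records.append({"image": image, "box": box, "description": description})
    return records

print(build_benchmark(["sa1b_000123.jpg"]))
```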
🍞 Hook: Think of finding items in a giant Where’s Waldo, sometimes with a magnifying glass.
🥬 The Concept: Data Sources (SA-1B and HR-Bench)
- What it is: Large, complex images with many objects (SA-1B) and ultra-high-res 8K images (HR-Bench) for tiny targets.
- How it works:
- Use raw images only—no existing masks or labels.
- SA-1B supplies crowded, diverse scenes; HR-Bench supplies tiny-object challenges.
- Why it matters: Prevents dataset leakage and ensures variety and difficulty.
🍞 Anchor: For “the person holding a camera, far away,” HR-Bench’s 8K resolution makes it visible but still hard.
🍞 Hook: Imagine robots that first list what’s in the room, then draw boxes around each thing, then remove duplicates.
🥬 The Concept: Bounding Box Annotation
- What it is: How they found and cleaned object boxes.
- How it works (SA-1B):
- RAM++ lists object classes in the image.
- GroundingDINO, prompted by those classes, proposes boxes.
- A custom NMS step keeps boxes from common classes while removing redundant overlapping boxes.
- Manual annotation for HR-Bench due to 8K complexity.
- Why it matters: Good boxes are the foundation; bad boxes break everything else.
🍞 Anchor: If two boxes overlap on the same plane, custom NMS picks the right one to keep the candidate list clean.
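The paper's custom NMS variant is not reproduced here, but a plain greedy, IoU-based suppression like the sketch below conveys the idea of collapsing redundant overlapping boxes into one; the 0.5 threshold and the confidence scores are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box that overlaps it above the threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]

# Two near-duplicate detections of the same plane collapse to one clean box.
print(nms([(0, 0, 100, 100), (5, 5, 105, 105), (200, 0, 300, 100)],
          [0.9, 0.8, 0.7]))
```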
🍞 Hook: Picture a friendly assistant writing first drafts, then a teacher polishing them.
🥬 The Concept: Description Generation and Human Refinement
- What it is: Rich referring expressions created by an MLLM and perfected by people.
- How it works:
- Use visual prompting or crops to describe attributes and relations.
- Humans fix boxes and rewrite text to ensure: Uniqueness, Subject Clarity, Task Specificity, Factual Accuracy.
- For Rejection, intentionally include subtle contradictions.
- Why it matters: This prevents keyword-guessing and forces full-detail matching.
🍞 Anchor: “Tall, weathered stone spire, first from the right” leaves no ambiguity.
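One way to picture the finished annotations is as simple records; the field names below are a hypothetical schema chosen for illustration, not the benchmark's actual file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundingItem:
    """Hypothetical schema for one benchmark item (illustrative only)."""
    image_id: str
    description: str   # human-refined referring expression
    category: str      # "discriminative" | "spatial" | "limited" | "rejection"
    gt_box: Optional[Tuple[int, int, int, int]]  # None for Rejection items

# A Rejection item deliberately contains a subtle contradiction, so gt_box is None.
item = GroundingItem(
    image_id="sa1b_000123",
    description="the child in a white shirt and long black pants near the fence",
    category="rejection",
    gt_box=None,
)
print(item)
```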
🍞 Hook: Tests need clear rules so everyone plays the same game.
🥬 The Concept: Unified Prompting and Scoring
- What it is: A standard prompt template and a strict metric.
- How it works:
- Prompts define viewer perspective, number of outputs (at most one), and exact JSON output format.
- Scoring uses Accuracy@0.5: a prediction counts if the Intersection over Union (IoU) with ground truth ≥ 0.5; for Rejection, output must be null when required.
- Why it matters: Consistent rules make results comparable and fair.
🍞 Anchor: If the box covers at least half the correct region, it’s a pass for that item; otherwise, it’s a miss.
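Here is a minimal sketch of this scoring rule, where a prediction is either a box tuple or `None` (standing in for the JSON `null`); the exact output parsing is an assumption made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.5):
    """Accuracy@0.5: a box counts if IoU >= 0.5; a Rejection item
    (gt_box is None) is correct only if the model answered null."""
    if gt_box is None:            # Rejection: the right answer is null
        return pred_box is None
    if pred_box is None:          # missed a groundable object
        return False
    return iou(pred_box, gt_box) >= threshold

def accuracy(predictions, ground_truths, threshold=0.5):
    correct = sum(is_correct(p, g, threshold) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# One correct box, one hallucinated box on a Rejection item -> 50% accuracy.
print(accuracy([(10, 10, 50, 50), (0, 0, 20, 20)],
               [(12, 12, 52, 52), None]))   # 0.5
```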
🍞 Hook: Sometimes thinking out loud helps—and picking the best thought helps even more.
🥬 The Concept: Test-Time Scaling (Best-of-N by Thinking Trajectory)
- What it is: Generate multiple “thinking” answers and let a judge choose the best.
- How it works:
- Sample N=16 thoughtful responses from a strong MLLM.
- A text-only LLM judge (can’t see the image) compares reasoning quality pairwise and picks the winner.
- Also compare to multimodal judges that only see outputs, not the thinking.
- Why it matters: Better-chosen reasoning paths improve hard cases like Spatial and Rejection.
🍞 Anchor: Two students show their work; a teacher picks the solution with consistent logic and careful checks.
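A sketch of the best-of-N selection idea follows; `sample_with_thinking` and `judge_prefers` are hypothetical stand-ins for the grounding model and the text-only judge, and the running-winner comparison is one simple way to realize pairwise selection.

```python
def best_of_n(question, sample_with_thinking, judge_prefers, n=16):
    """Sample n reasoning trajectories and keep a running winner via pairwise
    comparisons by a text-only judge that reads the reasoning, not the image."""
    candidates = [sample_with_thinking(question) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # judge_prefers(a, b) -> True if trajectory a's reasoning is judged better
        if judge_prefers(challenger, best):
            best = challenger
    return best  # the selected trajectory's final answer is used for grounding

# Toy usage with deterministic stand-ins (a real setup calls an MLLM and an LLM judge).
samples = iter(["box_B with a quick guess", "box_A with careful step-by-step checks",
                "box_B with a quick guess", "box_B with a quick guess"])
pick = best_of_n(
    "the tall spire, first from the right",
    sample_with_thinking=lambda q: next(samples),
    judge_prefers=lambda a, b: "careful" in a and "careful" not in b,
    n=4,
)
print(pick)  # the trajectory with careful checks wins the pairwise comparisons
```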
🍞 Hook: If you never practice saying “no match,” you won’t say it during the test.
🥬 The Concept: Data-Mixture Training for Rejection
- What it is: Mix positive and negative (mismatched) examples in fine-tuning.
- How it works:
- Start from RefCOCOg positives; create matching negatives by altering descriptions.
- Fine-tune with different negative:positive ratios.
- Evaluate both in-domain and on GroundingME.
- Why it matters: Exposure to non-matching cases teaches models to safely reject.
🍞 Anchor: After seeing many “trick” questions, a model learns to answer “null” when details don’t add up.
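A minimal sketch of assembling such a training mixture at a chosen negative:positive ratio is shown below; `make_negative` (which perturbs a description so it no longer matches) is a hypothetical helper with a toy perturbation, and the ratio handling is illustrative rather than the paper's recipe.

```python
import random

def make_negative(example):
    """Hypothetical: perturb the description (e.g., swap an attribute) so it no
    longer matches anything in the image, and set the target to null."""
    desc = example["description"].replace("red", "green")  # toy perturbation
    return {"image": example["image"], "description": desc, "target": None}

def build_mixture(positives, neg_to_pos_ratio=0.5, seed=0):
    """Mix grounding positives with crafted negatives at the requested ratio."""
    rng = random.Random(seed)
    n_neg = int(len(positives) * neg_to_pos_ratio)
    negatives = [make_negative(ex) for ex in rng.sample(positives, n_neg)]
    mixture = positives + negatives
    rng.shuffle(mixture)
    return mixture

# Toy usage: two positives plus one crafted negative at a 0.5 ratio.
pos = [
    {"image": "img1.jpg", "description": "the red mug behind the kettle", "target": (10, 20, 60, 80)},
    {"image": "img2.jpg", "description": "the dog with a red collar", "target": (5, 5, 40, 40)},
]
print(len(build_mixture(pos, neg_to_pos_ratio=0.5)))  # 3
```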
04 Experiments & Results
🍞 Hook: Imagine inviting 25 top athletes to a new obstacle course and timing each on every section.
🥬 The Concept: The Test Setup
- What it is: Evaluate 25 state-of-the-art MLLMs on GroundingME using a unified prompt and Accuracy@0.5.
- How it works:
- Feed the image and the refined description.
- Require at most one box or null.
- Score as correct if IoU ≥ 0.5 (or correct null for Rejection).
- Why it matters: Fair, apples-to-apples comparisons reveal real strengths and weaknesses.
🍞 Anchor: Everyone runs the same course, with the same stopwatches.
🍞 Hook: Who’s the champion—and by how much?
🥬 The Concept: The Scoreboard
- What it is: Overall and per-category results.
- How it works:
- Best overall model: Qwen3-VL-235B-A22B at 45.1%—like getting a middling C when old tests gave easy As.
- Most models scored 10–40%; several under 10%; commercial models didn’t dominate.
- Model size helps: scaling up consistently improved scores across families.
- Why it matters: The course is genuinely hard and reveals a visual grounding gap.
🍞 Anchor: Going from 90% on old tests to ~45% here shows real-world difficulty bites.
🍞 Hook: Which parts of the course were hardest?
🥬 The Concept: Subtask Patterns
- What it is: Category-wise insights.
- How it works:
- Discriminative is usually the strongest category (though still challenging); within it, the State sub-skill remains harder.
- Spatial: models do better on qualitative relations (left/right/behind) than Counting (quantitative).
- Limited: occlusion and tiny objects remain tough.
- Rejection: the weakest—many models scored 0%, often hallucinating boxes instead of saying null.
- Why it matters: We now know exactly where to focus improvements.
🍞 Anchor: Counting “the fifth flag” is trickier than “left of the blue flag,” and saying “none” is the rarest skill.
🍞 Hook: What happens if we let models think out loud?
🥬 The Concept: Thinking Mode and Test-Time Scaling (TTS)
- What it is: Enabling chain-of-thought helped, and TTS (best-of-N) helped more.
- How it works:
- Thinking mode alone gave gains of 4.7–7.4%, depending on the model.
- Adding TTS with a text-only judge boosted performance by up to another 2.9%, especially on Spatial and Rejection.
- Multimodal judges helped mostly on perception-heavy cases; smaller judges helped less.
- Why it matters: Selecting the best reasoning trajectory improves accuracy where logic matters most.
🍞 Anchor: In a case study, one trajectory carefully rejected a mismatched child description (correct); another “forgave” mismatches and guessed (wrong).
🍞 Hook: Can we teach “say none” during training?
🥬 The Concept: Data-Mixture Training for Rejection
- What it is: Mixing negative samples into fine-tuning to build rejection skill.
- How it works:
- Using RefCOCOg positives plus crafted negatives, train Qwen3-VL-8B.
- Rejection accuracy jumped from 0% to 27.9% on GroundingME for the best ratio; in-domain rejection soared above 90%.
- Trade-off: some drop on out-of-domain positives (non-rejection parts), showing generalization limits.
- Why it matters: Negative data is a simple, effective way to add safety-critical behavior, though careful balancing is needed.
🍞 Anchor: Like practicing “trick” riddles so you know when to say, “No valid answer.”
05 Discussion & Limitations
🍞 Hook: Even great students have weak spots they must admit to fix.
🥬 The Concept: Honest Assessment
- What it is: Strengths, limits, and open puzzles revealed by GroundingME.
- How it works:
- Limitations: The benchmark is curated; it can’t cover all real scenes and may inherit some auto-generation noise despite human refinement. Rejection construction is challenging and may include edge cases humans debate.
- Required resources: High-res images, strong detectors (RAM++, GroundingDINO), powerful MLLMs for bootstrapping, and careful human annotation; evaluation needs consistent prompting and IoU tools.
- When not to use: If you need quick, low-res, single-object demos or class-name-only tasks, simpler datasets may suffice.
- Open questions: How to build rejection that generalizes without hurting positive grounding? Can we unify perception (tiny/occluded) with robust reasoning (relations, counting) at scale? What is the best judge for TTS—text-only, multimodal, or hybrid? Can we create curricula that systematically grow these skills without catastrophic trade-offs?
- Why it matters: Knowing the edges of today’s abilities helps researchers invest in the right fixes.
🍞 Anchor: Treat GroundingME like a doctor’s report: it shows where the system is healthy and where more therapy is needed.
06 Conclusion & Future Work
🍞 Hook: Think of a final exam that finally matches real life.
🥬 The Concept: Takeaways
- What it is: Three-sentence summary, key win, next steps, and why it sticks.
- How it works:
- The paper introduces GroundingME, a realistic benchmark that stresses four pillars of visual grounding: Discriminative, Spatial, Limited, and Rejection.
- Testing 25 top models shows a large grounding gap (best 45.1%, many 0% on rejection), but thinking-based test-time selection and negative-sample training offer clear gains.
- The path forward is not just bigger models but better evaluation, smarter inference (TTS), and training that includes saying “none.”
- Why it matters: The main achievement is a diagnostic, multi-dimensional lens that reveals what’s missing and how to fix it—especially safe rejection.
🍞 Anchor: Remember GroundingME as the tough but fair coach that made the team stronger by testing what truly matters and teaching when not to take the shot.
Practical Applications
- •Robotics pick-and-place: Reliably select the exact object (e.g., “the third screw from the left”) and avoid grabbing when no match exists.
- •AR assistants: Highlight the correct product on a shelf based on detailed descriptions and reject when the item isn’t present.
- •Smart cameras: Track a specific person or vehicle in crowded scenes using fine-grained attributes and spatial relations.
- •Industrial inspection: Locate tiny defects or specific components in high-resolution images with precise grounding.
- •Medical imaging assistance: Point to the exact region matching a textual note (with safeguards to say “no match” if absent).
- •Autonomous vehicles: Cross-check textual cues with vision to avoid hallucinating obstacles and confirm spatial relations.
- •Photo organization: Tag and localize objects with precise bounding boxes, handling lookalikes and small targets.
- •UI accessibility: Let users ask for “the small blue share icon under the photo,” then focus or click the correct element.
- •Security monitoring: Identify targets with detailed attributes and count entities accurately in crowded footage.
- •Content editing: Accurately select the described region for editing (e.g., “brighten the small sign behind the tree”).