VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
Key Summary
- VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.
- It uses a two-step think–rethink process: first it guesses on its own, then it checks helper tool results and either agrees or fixes them.
- A special reward system teaches the model to trust good tool outputs and correct bad ones, instead of blindly following tools.
- A fair test protocol called PiTER injects tool results into the prompt so every model is graded the same way on its ability to refine.
- Two new scores, CCR and NSRI, show how often and how well the model fixes wrong tool outputs.
- On the RefCOCO family of benchmarks, VG-Refiner reaches state-of-the-art accuracy and even beats much larger models in refinement ability.
- The approach works with both strong tools (like EVF-SAM) and weak tools (like unfine-tuned Grounding DINO-T), showing strong robustness.
- The model keeps its general visual QA skills while learning to refine, trained with only a small, task-focused dataset.
- This makes AI more reliable for everyday tasks like finding items in photos, guiding robots, and assisting accessibility tools.
Why This Research Matters
In real life, helper tools are rarely perfect, and AI must decide when to trust them and when to double-check. VG-Refiner shows a simple, teachable habit—think, then rethink—that makes AI more dependable in everyday visual tasks. This means better highlighting for accessibility, fewer robotic grasping mistakes, and more accurate product search in stores. By measuring refinement fairly, teams can improve models where it counts: fixing wrong hints without breaking right ones. The approach keeps general skills intact, so systems don’t trade reliability for narrow gains. That balance—robustness with practicality—brings AI a step closer to trustworthy assistants in homes, schools, and workplaces.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're playing I-Spy with a friend. You say, “Find the blue couch with three people on it.” Your friend looks around, and a helpful (but sometimes clumsy) helper whispers the answer. Sometimes the helper is right; sometimes they point to a black chair. If your friend never questions the helper, you’ll lose the game a lot.
🥬 Referring Expression Comprehension (REC): What it is: REC is when an AI finds the exact thing in a picture that matches a sentence like “the boy with the red hat.” How it works:
- Read the sentence.
- Scan the picture for matching clues (color, position, action).
- Draw a box around the best-matching object. Why it matters: Without REC, apps can’t reliably highlight the right object you’re talking about. Anchor: Ask your phone, “Highlight the dog wearing a scarf.” REC decides which dog to box.
🍞 Tool-Integrated Visual Reasoning (TiVR): You know how a chef uses gadgets—like a peeler or a timer—to cook better and faster? 🥬 What it is: TiVR is when an AI uses external helper tools (like object detectors or OCR) to improve its vision-and-language reasoning. How it works:
- The AI reads the question and sees the image.
- It calls a helper tool (e.g., detector) to get candidate boxes.
- It uses these hints to answer better. Why it matters: Without tools, the AI may miss tiny details or struggle with hard tasks. Anchor: If the task is “count apples,” a detector tool lists apple spots so the AI counts accurately.
🍞 Agentic Reinforcement Learning (RL): Picture a dog learning tricks: do a good trick, get a treat. 🥬 What it is: A training style where the AI takes actions (like picking a box or calling a tool) and gets rewards when the result is correct. How it works:
- Try a step (think, call tool, refine).
- Get a reward if closer to the right answer.
- Repeat, learning which choices pay off. Why it matters: Without RL, the model may not learn good habits for tough, multi-step problems. Anchor: The AI learns to ask the right tool at the right time—like a kid learning that asking a librarian beats guessing book locations.
The world before: Large vision-language models (LVLMs) got much better at visual tasks by using tools (detectors, depth estimators, OCR). Research added chain-of-thought (CoT) text reasoning and RL to reduce the need for expensive labels and boost generalization across tasks. But almost everyone treated tools as if they were always trustworthy. In reality, even the best tools sometimes point at the wrong object or draw a too-tight or too-loose box.
The problem: In REC—finding exactly the object a sentence describes—small tool mistakes cause big failures. If the detector proposes a black chair for “blue couch,” many models just explain around the wrong box (a “hallucination”) instead of questioning it. Even advanced agentic RL methods mostly teach models to pick which tool to call, not how to challenge bad tool feedback.
Failed attempts:
- Supervised fine-tuning on big grounding datasets helped but didn’t teach models to distrust wrong tools.
- CoT-only systems explained their choices but often explained the wrong box.
- Tool-using RL agents (like REVPT) learned to call tools but lacked mechanisms to flag or fix tool errors, so they could follow tools off a cliff.
The gap: Models needed a way to explicitly reason about tool outputs—agree when tools are right, and refine when they’re wrong—plus a reward that teaches this behavior. And the community needed a fair test that isolates this specific skill: the ability to refine tool outputs.
Real stakes: This isn’t just benchmark points. It affects:
- Accessibility: correctly highlighting “the sign that says Exit” for low-vision users.
- Home robots: grabbing “the red mug on the left,” not the ketchup bottle.
- Shopping and inventory: finding “the blue size-M jacket under the red one.”
- Education and creation: pointing to “the triangle closest to the circle” in a diagram.
- Safety: in maps, dashboards, or UIs, double-checking tools reduces costly mistakes.
Enter VG-Refiner. It adds a rethink step and a smart reward that together teach a simple but powerful habit: trust—but verify—your tools.
02 Core Idea
🍞 Think–Rethink Mechanism: You know how you first make a guess on a test, then check it against notes and fix it if needed? 🥬 What it is: A two-stage process where the model first thinks independently, then rethinks using tool feedback to accept or correct it. How it works:
- Think: Make an initial guess from the image and phrase.
- Call tool: Get helper boxes as references.
- Rethink: Compare your guess and the tool’s; accept if consistent, refine if not.
- Output: Return a final bounding box in JSON. Why it matters: Without rethinking, the AI can be confidently wrong, especially when tools mislead. Anchor: For “blue couch with three people,” it first guesses, then notices the tool’s box points to a black chair, and corrects it.
Aha! in one sentence: Add a second, explicit check (rethink) plus rewards that celebrate correcting bad tools and safely following good ones.
Three analogies:
- Student + answer key: You solve the problem, then peek at the key; if it disagrees, you rework the steps.
- GPS + your eyes: The GPS says “turn left,” but you see a roadblock; you adapt instead of blindly turning.
- Doctor + lab test: A doctor forms a diagnosis, checks lab results, then confirms or adjusts the plan.
Before vs After:
- Before: Tools were treated like oracles; models explained around wrong boxes; strong tools helped, weak tools sank the ship.
- After: The model treats tools as advisors: it confirms good hints and improves or rejects bad ones. Performance rises with both strong and weak tools.
Why it works (intuition): Two ingredients align behavior with reality. First, structure: the rethink stage forces the model to explicitly weigh visual evidence, the text description, and tool feedback side by side. Second, incentives: the refinement reward gives a bigger treat for fixing wrong tools and a modest treat for safely following correct ones, shaping the model’s instincts.
Building blocks:
- Think–Rethink stages: explicit tags (<think>, <rethink>) keep reasoning organized and teachable.
- Refinement Reward: 1.0 when the tool is wrong and the model gets it right; 0.5 when the tool is right and the model follows correctly.
- Format Reward: ensures clean, structured outputs for reasoning and the final JSON box.
- Agentic RL (with a stable reference model): the model explores rollouts but is gently tethered to that reference via a KL-divergence penalty, so it doesn't drift away from core skills.
- Tools: strong (EVF-SAM) and weak (Grounding DINO-T) expose a spectrum of feedback quality.
- PiTER evaluation: a one-shot, tool-injected prompt that fairly measures pure refinement skill, not fancy multi-turn scaffolding.
🍞 Refinement Reward: Imagine getting a gold star for fixing a wrong hint and a silver star for agreeing with a right hint. 🥬 What it is: A reward scheme that pays more for correcting wrong tools (1.0) and some for confirming right tools (0.5). How it works:
- Compare tool box to ground truth (IoU).
- If tool is wrong (<0.5 IoU) and model is right (≥0.5), reward = 1.0.
- If tool is right and model is also right, reward = 0.5.
- Else, reward = 0. Why it matters: Without this, models either over-trust tools or over-edit them. Anchor: When a weak tool misses the “orange on the left,” the model fixes it and earns the full point.
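Below is a minimal sketch of how this piecewise scheme could be computed, assuming per-sample IoU values against ground truth are already available; the `refinement_reward` name and the code itself are illustrative, not taken from the paper:

```python
def refinement_reward(iou_tool: float, iou_model: float, thr: float = 0.5) -> float:
    """Piecewise refinement reward following the scheme described above."""
    if iou_tool < thr and iou_model >= thr:
        return 1.0  # tool was wrong, model rescued it: full reward
    if iou_tool >= thr and iou_model >= thr:
        return 0.5  # tool was right, model safely followed it
    return 0.0      # the final box is wrong either way
```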
🍞 Intersection-over-Union (IoU): Think of two clear stickers on a window; the more they overlap, the better the match. 🥬 What it is: A number from 0 to 1 that measures how much the predicted box overlaps the true box. How it works:
- Find the overlap area.
- Divide by the total area the two boxes cover together (their union).
- Higher is better; 0.5+ counts as correct here. Why it matters: Without IoU, we can’t score how close a box is to the truth. Anchor: If your box covers half the real couch, IoU ~0.5—barely a pass.
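And a small sketch of box IoU itself, using `[x1, y1, x2, y2]` corners; this is just the standard formula, not code from the paper:

```python
def iou(box_a: list[float], box_b: list[float]) -> float:
    """Intersection-over-Union of two boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes don't intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```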
🍞 PiTER Evaluation Protocol: Picture a fair test where everyone answers the same question with the same hint written at the top. 🥬 What it is: A single-stage prompt that injects the tool’s result and asks for a JSON box, so every model is judged on refinement, not conversation tricks. How it works:
- Put the tool’s box in the prompt.
- The model must output a final JSON box—no extra tool calls.
- Compare to ground truth with consistent metrics. Why it matters: Without PiTER, it’s hard to know which model truly refines versus who just had fancier staging. Anchor: Two students both get the same noisy hint; the one who best corrects it gets the higher score.
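A rough sketch of how such a one-shot prompt could be assembled; apart from the `Tool Results` line (which echoes the snippet quoted in the Methodology section), the wording and the `build_piter_prompt` helper are illustrative guesses, not the paper's exact template:

```python
import json

def build_piter_prompt(phrase: str, tool_box: list[float]) -> str:
    """PiTER-style single-stage prompt: the tool's box is injected up front and
    the model must answer once with a JSON box, with no further tool calls."""
    tool_results = json.dumps({"bbox_2d": tool_box})
    return (
        f"Locate the object described by: '{phrase}'.\n"
        f"Tool Results: {tool_results}\n"
        "The tool output may be wrong. Check it against the image and the description, "
        'then answer with a single JSON box: [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]'
    )
```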
🍞 CCR and NSRI (refinement metrics): Like a coach tracking not only how many shots you make (success rate) but also how much your aim improved. 🥬 Critical Correct Rate (CCR): What it is: Among cases where the tool failed, how often did the model fix it (reach IoU ≥ 0.5)? How it works:
- Filter samples where tool IoU < 0.5.
- Count how many model predictions reach IoU ≥ 0.5.
- Divide to get a percentage. Why it matters: It isolates true “rescue missions.” Anchor: If the tool fails on 100 images and you fix 73, CCR = 73%.
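A sketch of CCR under that definition, assuming lists of per-sample IoUs for the tool and the model (the names are ours):

```python
def critical_correct_rate(ious_tool: list[float], ious_model: list[float], thr: float = 0.5) -> float:
    """CCR: among samples where the tool failed (IoU < thr), the fraction the model fixed (IoU >= thr)."""
    failed = [(t, m) for t, m in zip(ious_tool, ious_model) if t < thr]
    if not failed:
        return 0.0
    return sum(m >= thr for _, m in failed) / len(failed)
```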
🥬 Normalized Signed Relative IoU (NSRI): What it is: A score from -1 to +1 that measures how much the model improved or worsened the tool’s IoU, scaled by how much improvement was possible. How it works:
- Compute the IoU change.
- Normalize by the remaining room to grow (or shrink).
- Average over tool-failed cases (NSRI_w) to see net fix quality. Why it matters: It tells you not just if you fixed it, but how well. Anchor: A small nudge from 0.48 to 0.51 is a save; a big leap from 0.10 to 0.70 is a heroic fix—and NSRI shows the difference.
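The paper's exact NSRI formula isn't reproduced here; the sketch below is one plausible reading of the verbal definition (signed IoU change, scaled by the headroom left when improving or by the IoU at stake when worsening), so treat it as an assumption rather than the official metric:

```python
def nsri(iou_tool: float, iou_model: float, eps: float = 1e-6) -> float:
    """One plausible reading of NSRI in [-1, 1]: improvements are scaled by the room
    left to grow, regressions by the IoU that could be lost."""
    delta = iou_model - iou_tool
    if delta >= 0:
        return delta / max(1.0 - iou_tool, eps)  # e.g. 0.10 -> 0.70 scores about +0.67
    return delta / max(iou_tool, eps)            # e.g. 0.60 -> 0.30 scores -0.5

def nsri_w(ious_tool: list[float], ious_model: list[float], thr: float = 0.5) -> float:
    """NSRI_w: NSRI averaged over tool-failed cases only (tool IoU < thr)."""
    failed = [(t, m) for t, m in zip(ious_tool, ious_model) if t < thr]
    return sum(nsri(t, m) for t, m in failed) / len(failed) if failed else 0.0
```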
03 Methodology
High-level recipe: Image + Phrase → Think (initial guess) → Call tool (get candidate box) → Rethink (accept or refine) → Output final JSON box. Training uses agentic RL with rewards that favor good refinement.
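To make that recipe concrete, here is a minimal sketch of the inference loop. Everything in it is hypothetical glue (the `generate` and `rec_tool` callables, the prompt wording, the tag parsing); it only illustrates the think → tool → rethink → answer flow, not the authors' actual code:

```python
import json
import re
from typing import Callable

def think_rethink_infer(image, phrase: str,
                        generate: Callable[[object, str], str],
                        rec_tool: Callable[[object, str], list[float]]) -> dict:
    """Illustrative think -> tool call -> rethink -> answer loop for one image/phrase pair."""
    # 1) Think: the model forms its own hypothesis before seeing any tool output.
    think_prompt = (f"Find: '{phrase}'. Reason inside <think>...</think>, "
                    "then request the grounding tool.")
    first_pass = generate(image, think_prompt)

    # 2) Tool call: a specialized REC tool (strong or weak) proposes a candidate box.
    tool_box = rec_tool(image, phrase)

    # 3) Rethink: compare the initial guess, the tool's box, the image, and the phrase.
    rethink_prompt = (f"{first_pass}\nTool Results: {json.dumps({'bbox_2d': tool_box})}\n"
                      "Inside <rethink>...</rethink>, accept the tool box if it matches the "
                      "description, otherwise correct it. Then answer with "
                      '<answer>[{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]</answer>.')
    second_pass = generate(image, rethink_prompt)

    # 4) Parse the final JSON box out of the <answer> tag; fall back to the tool box.
    match = re.search(r"<answer>(.*?)</answer>", second_pass, re.DOTALL)
    return json.loads(match.group(1))[0] if match else {"bbox_2d": tool_box, "label": phrase}
```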
Step-by-step details:
- Inputs and initial think:
- What happens: The model sees the image and the sentence (e.g., “a blue couch with three people”) and writes out its reasoning (<think>...</think>) plus an action to call a tool.
- Why it exists: Thinking first prevents the tool from dominating; the model forms its own hypothesis.
- Example: It might say, “I see several sofas; the left one is blue with three people,” and propose a box [30, 247, 218, 348].
- Tool call and feedback:
- What happens: A specialized REC tool (strong EVF-SAM or weak Grounding DINO-T) returns a candidate box or mask (converted to a box), e.g., [240, 249, 395, 360].
- Why it exists: Tools add extra eyes—especially helpful for tiny or ambiguous objects.
- Example: The tool might mistakenly highlight a black chair instead of the blue couch, flagging a potential conflict.
- Rethink with explicit comparison:
- What happens: The model writes a second reasoning pass (<rethink>...</rethink>) that checks consistency between its initial guess, the tool’s box, the image, and the phrase.
- Why it exists: This is the error-catcher. Without it, the model may rationalize the tool’s mistake.
- Example: “The tool’s box looks like a black chair; description says blue couch with three people. I’ll keep my earlier box and adjust slightly to include all three people.”
- Final JSON answer:
- What happens: The model outputs the bounding box in a strict JSON format: [{"bbox_2d":[x1,y1,x2,y2],"label":"..."}].
- Why it exists: Structured outputs are easy to grade and reuse.
- Example: [{"bbox_2d":[30,247,218,348],"label":"blue couch with three people"}].
- Rewards that shape behavior:
- Format Reward: What happens: If the model uses the exact required tags (<think>, <rethink>, <answer>) and JSON, it gets +1; else 0. Why it exists: Clean reasoning traces support learning and reliable evaluation. Example: Miss a tag? Lose the point.
- Refinement Reward: What happens: Compare tool IoU (IoU_t) and model IoU (IoU_f) with ground truth. • If the tool was wrong (IoU_t < 0.5) and the model is right (IoU_f ≥ 0.5): +1.0. • If the tool was right (IoU_t ≥ 0.5) and the model is also right: +0.5. • Else: 0. Why it exists: It encourages big saves on bad tools and safe following on good tools. Example: Tool misses the “orange on the left” (IoU_t=0.0), model nails it (IoU_f=0.74) → +1.0.
- Training loop with stable exploration:
- What happens: The model generates multiple rollouts per sample (different thoughts/boxes), scores them with the rewards above, and is optimized with a method that keeps the policy close to a strong reference model (to avoid drifting or reward hacking); a simplified scoring sketch follows this list.
- Why it exists: Exploration finds better strategies; the reference tether keeps general skills intact.
- Example: Over time, the model learns patterns like “trust this tool when it’s precise; override when it mismatches color/attributes.”
- Tool choices and their roles:
- Strong tool (EVF-SAM/RES): Great, but its mask-to-box conversion can be slightly off (too tight/loose), making room for useful refinements.
- Weak tool (Grounding DINO-T, unfine-tuned): Often wrong or missing; perfect to train and test correction ability.
- Evaluation with PiTER:
- What happens: Everyone gets the same prompt containing the tool result and must answer in one shot with a JSON box.
- Why it exists: It isolates refinement skill from multi-turn tricks.
- Example prompt snippet: Tool Results: {"bbox_2d":[63,238,272,445]}.
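To tie the rewards together, here is a minimal sketch of how one group of rollouts might be scored during training, as referenced in the training-loop step above. The tag/JSON check and the 0/0.5/1.0 refinement values follow the descriptions in this section; the group-normalized advantage is a common choice for this style of agentic RL and, like every name here, is our assumption rather than the paper's published code:

```python
import json
import re
from statistics import mean, pstdev

TRACE_RE = re.compile(r"<think>.*?</think>.*?<rethink>.*?</rethink>.*?<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """+1 if the required tags appear in order and the answer parses as a JSON box list, else 0."""
    match = TRACE_RE.search(response)
    if not match:
        return 0.0
    try:
        boxes = json.loads(match.group(1))
        return 1.0 if boxes and "bbox_2d" in boxes[0] else 0.0
    except (json.JSONDecodeError, TypeError, KeyError):
        return 0.0

def refinement_reward(iou_t: float, iou_f: float, thr: float = 0.5) -> float:
    """Same piecewise scheme as the earlier sketch: 1.0 for fixing a wrong tool, 0.5 for following a right one."""
    if iou_t < thr <= iou_f:
        return 1.0
    return 0.5 if (iou_t >= thr and iou_f >= thr) else 0.0

def score_rollouts(responses: list[str], iou_tool: float, ious_model: list[float]) -> list[float]:
    """Total reward per rollout, then normalized within the group (a GRPO-style choice, assumed here)."""
    totals = [format_reward(r) + refinement_reward(iou_tool, m)
              for r, m in zip(responses, ious_model)]
    mu, sigma = mean(totals), pstdev(totals) or 1.0
    return [(t - mu) / sigma for t in totals]
```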
Secret sauce:
- The rethink stage forces the model to weigh three signals—image, text, tool—explicitly, not implicitly.
- The refinement reward balances bravery (fix bad tools) and humility (follow good tools).
- Mixed tool quality during training teaches when to trust and when to question.
- A gentle regularizer (keeping the policy near a reference model) preserves broad abilities while sharpening refinement.
04 Experiments & Results
The tests and why: Researchers measured three things on classic referring datasets (RefCOCO, RefCOCO+, RefCOCOg):
- Acc@0.5: Did the final box get IoU ≥ 0.5? (Pass/fail accuracy.)
- CCR: On tool-failed cases, how often did the model fix them?
- NSRI_w: On tool-failed cases, how much did the model improve IoU, scaled by room-to-improve? These focus squarely on the skill of refining tool outputs.
The competition: VG-Refiner was compared to strong baselines:
- Base models: Qwen2.5-VL-7B and 32B.
- Tool-only experts: EVF-SAM, Rex-Omni, UNINEXT-H.
- RL and SFT systems: Ground-R1, CogVLM/CogCoM, UniVG-R1, Vitron, UniPixel, REVPT.
Scoreboard (with context):
- Overall accuracy: With EVF-SAM as the tool, VG-Refiner hits around 95.0% on RefCOCO testA and 90.6% on RefCOCOg test, edging past EVF-SAM itself and clearly beating the 7B base model. Think of it as getting an A+ when the strong tool gets an A and the regular model gets a B.
- Robustness with weak tools (PiTER, Grounding DINO-T): VG-Refiner maintains high accuracy and the best CCR/NSRI_w. That’s like steering straight even when your compass is wobbly, while other models veer off.
- Strong tools (PiTER, EVF-SAM): Some models actually made good tool outputs worse (negative NSRI_w). VG-Refiner instead shows positive NSRI_w and high follow-correct behavior—like knowing when not to fix what isn’t broken.
- Cross-tool generalization: Even when tested with other strong tools (Rex-Omni, UNINEXT-H), VG-Refiner adds gains—sometimes surpassing much larger 72B-scale models. That’s like a small, smart team outplaying a giant squad by better strategy.
- Out-of-domain (LISA-grounding): In zero-shot settings requiring more reasoning, EVF-SAM acts as a weak tool; VG-Refiner still improves over the 7B base, showing its rethink skill transfers.
- General QA preserved: On MMbench, OCRBench, ChartQA, RealWorldQA, and MMStar, VG-Refiner matches or slightly improves over the base model, showing refinement training didn’t ruin overall smarts.
Surprising findings:
- Self-improvement: If you feed Qwen2.5-VL-7B’s own boxes back as the “tool,” the base model barely improves itself—but VG-Refiner can still refine those boxes and gain accuracy.
- Two-stage prompting matters: Even without training changes, placing the tool hint in a second stage (after the model’s own think) yields noticeable gains; adding the rethink stage is better still.
- Tiny data, big effect: With only ~9k targeted samples and agentic RL, the model substantially improves refinement behavior without losing general abilities.
- REVPT caveat: Under strong-tool PiTER, REVPT often degraded correct tool outputs, underscoring the need for explicit error-handling and rethink logic.
05 Discussion & Limitations
Limitations:
- Tool dependency: If the tool is extremely off (e.g., detects the wrong region entirely or returns nothing), the model’s job gets very hard.
- Box precision vs masks: The system refines bounding boxes; if a task needs pixel-perfect masks, extra steps are required.
- Latency and cost: Think–rethink plus a tool call adds compute time; not ideal for ultra-low-latency scenarios.
- Reward granularity: The piecewise reward (0, 0.5, 1.0) is stable but coarse; it doesn’t differentiate small vs large improvements within the same bucket.
- Domain shift: Trained on RefCOCOg samples; while it generalizes well, very different domains (e.g., medical images) may need adaptation.
Required resources:
- A solid LVLM backbone (e.g., Qwen2.5-VL-7B).
- Access to tools (strong: EVF-SAM; weak: Grounding DINO-T or similar).
- RL training stack (rollouts, reward computation, reference model regularization).
- Moderate GPU compute (the paper used 4×A100 GPUs) and a small curated dataset (~9k samples).
When not to use:
- Real-time pipelines where extra seconds matter more than a few accuracy points.
- Settings without any ground truth for training/rewarding (you can’t compute the IoU-based rewards).
- Pure segmentation tasks that require mask-level outputs only, unless extended.
- Highly adversarial inputs where tools are consistently misleading; more robust uncertainty modeling may be needed first.
Open questions:
- Beyond two stages: Would multi-round rethink cycles help, or just add latency?
- Better rewards: Can we design smoother, risk-aware rewards that still avoid reward hacking?
- Uncertainty and trust: How can the model quantify tool reliability on the fly and calibrate its confidence?
- Broader tools: How does this approach extend to text-heavy (OCR), chart, or layout tools—and to video grounding?
- Pixel-level refinement: Can the same idea directly refine segmentation masks, not just boxes?
06 Conclusion & Future Work
Three-sentence summary: VG-Refiner teaches an AI to first think for itself, then rethink using tool feedback—accepting good hints and correcting bad ones. A simple, stable reward scheme pays most for fixing wrong tools and a bit for following right ones, while a fair PiTER test shows this true refinement skill. The result is top-tier grounding accuracy and superior correction ability, achieved with modest data and preserved general QA strength.
Main achievement: Turning tools from unquestioned oracles into consultable advisors via an explicit think–rethink routine and targeted reinforcement rewards—and proving it with new, fair refinement metrics (CCR, NSRI) and protocol (PiTER).
Future directions:
- Add uncertainty-aware trust calibration and multi-round refinement.
- Extend from boxes to masks and to video grounding.
- Explore richer, smoother reward shapes and semi/self-supervised signals to reduce need for labels.
- Combine with active perception (zooming, focus) for tiny or cluttered targets.
Why remember this: It shows a practical path to make AI more trustworthy with imperfect tools—a common real-world situation. Instead of blind faith or blind doubt, VG-Refiner learns the grown-up habit: think, then rethink.
Practical Applications
- Photo assistants that reliably highlight “the person in a red jacket on the right” even when detectors confuse colors.
- Home robots that grasp the correct object (“the blue mug behind the bowl”) despite noisy detections.
- Retail inventory systems that find “the size M blue shirt under the red one” with fewer mis-picks.
- Accessibility tools that point to “the Exit sign above the door” more accurately for low-vision users.
- AR apps that anchor labels on the correct items in cluttered scenes by refining tool suggestions.
- Content creation and education tools that box “the triangle closest to the circle” accurately in diagrams.
- Quality control in manufacturing where the model refines imperfect detector hints to locate tiny defects.
- UI agents that select the right on-screen control (e.g., “blue Submit button under the form”) by correcting detector noise.
- Dataset labeling assistants that auto-correct weak tool proposals, speeding up annotation with higher quality.
- Safety monitoring that flags the right indicator on dashboards by questioning and fixing tool mistakes.