V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Key Summary
- This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.
- V-REX turns hard visual problems into a Chain-of-Questions (CoQ), separating two skills: Planning (which sub-questions to ask) and Following (answering them correctly).
- By making each step a multiple-choice question, V-REX fairly measures every part of the thinking path, not just the end result.
- Across 32 models, larger models generally do better, and the well-known scaling trend holds for both Planning and Following.
- Most models improved on final answers when given CoQ hints, showing that exploration helps visual reasoning.
- Smaller models are noticeably better at Following than Planning, while larger models are more balanced between the two.
- Models recover more easily from bad Planning choices than from wrong Following answers, meaning wrong facts hurt more than a so-so plan.
- Retrieval-style tasks (like counting or reading words) benefit less from CoQ, because they rely more on direct lookup than multi-step reasoning.
- V-REX provides fine-grained diagnostics that reveal where a model’s reasoning breaks and how to fix it.
- The benchmark spans 4 reasoning categories and 15 real-world scenarios, totaling 702 samples and 2,504 questions.
Why This Research Matters
Many real tasks—like route planning, accident analysis, and app navigation—require careful step-by-step exploration, not one-shot guesses. V-REX shows exactly how well models can plan their questions and follow through with correct answers, which is crucial for building dependable assistants. By revealing whether failures come from poor planning or from wrong facts, teams can fix the right weakness faster. The benchmark’s multiple-choice framing makes evaluations consistent and fair across different models. Over time, these diagnostics encourage training methods that boost deliberate thinking, leading to AIs that are safer, more interpretable, and better at real-world problem solving.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re solving a mystery picture. If you could only give a final guess (“It’s a train station!”) without checking clues like signs, people’s clothes, or street names, you’d often be wrong. Real detectives ask many small questions first.
🥬 The Concept (Vision-Language Models, VLMs):
- What it is: VLMs are AIs that look at pictures and understand language at the same time.
- How it works (recipe):
- Read the question in words.
- Look at the image for clues.
- Connect what’s asked with what’s seen.
- Produce an answer in words.
- Why it matters: Without strong step-by-step thinking, VLMs can guess from surface clues and miss the real point.
🍞 Anchor: When you ask “What sport is this player doing?”, a good VLM looks at jerseys, the ball, and the court markings before answering “basketball.”
🍞 Hook: You know how big puzzles can’t be solved all at once; you find edge pieces, then corners, then fill in the middle. Hard visual questions are puzzles, too.
🥬 The Concept (Exploratory Visual Reasoning):
- What it is: It’s solving image problems by actively exploring—asking sub-questions and gathering clues step by step.
- How it works:
- Break the main question into smaller ones.
- Look in different regions of the image.
- Check new clues and adjust the plan.
- Combine the clues to answer the big question.
- Why it matters: Without exploration, models try to jump straight to the end and get confused on open-ended or multi-step tasks.
🍞 Anchor: To judge “Who caused the accident?”, you first ask: Is the road wet? What is each car doing? Are there traffic signs? Then you decide responsibility.
🍞 Hook: Think of a teacher who only grades the final answer on a math test and ignores the work shown. You can’t tell if a student really understood.
🥬 The Concept (Intermediate Steps):
- What it is: The small, traceable question-and-answer moves that lead to the final answer.
- How it works:
- Write down each sub-question.
- Record each answer.
- Keep them in order so later steps depend on earlier ones.
- Why it matters: If we don’t measure steps, we can’t see if the model took shortcuts, got lucky, or used a flawed path.
🍞 Anchor: To find a library on a map, your steps might be: locate Main Street, find the park, then the library next to it. Each step can be checked.
🍞 Hook: Choosing what to ask first can make or break a science experiment. Ask the wrong thing, and you waste time.
🥬 The Concept (The Problem Before This Paper):
- What it is: Benchmarks often asked one final, well-defined question and scored only the last answer.
- How it works:
- Show an image and a question.
- Model answers immediately.
- Grade correct/incorrect.
- Why it matters: This misses whether the model can explore, plan sub-questions, and adapt. It hides weaknesses and blocks progress on real-world tasks.
🍞 Anchor: A model that can identify digits in a chart might score well, but still fail at “Plan a route through this airport” because it never learned to plan steps.
🍞 Hook: Imagine giving a friend a huge Where’s Waldo page and just saying “Find the treasure” with no hints. Most people need a plan and successive checks.
🥬 The Concept (The Gap V-REX Fills):
- What it is: A way to measure not only final answers, but the quality of step-by-step exploration in images.
- How it works:
- Turn the problem into a chain of sub-questions.
- Test the model’s planning (choosing the next helpful question).
- Test the model’s following (answering sub-questions correctly).
- Why it matters: It exposes specific strengths/weaknesses and shows how each step affects the final result.
🍞 Anchor: For “What is the book’s subject?”, V-REX guides steps like: read the author, spot the cover drawing, recall known subjects, then answer.
Real stakes (why care):
- Street-view location guessing, accident responsibility, or GUI navigation need adaptive, multi-step exploration.
- In daily life, assistants must plan (“Which screen to open?”) and follow through (“Which button now?”) to be truly helpful.
- Teachers and builders need diagnostics to see where reasoning breaks: planning choice or factual following.
- Products that rely only on final answers can be brittle in the wild; step-aware evaluation leads to sturdier systems.
02 Core Idea
🍞 Hook: You know how detectives solve mysteries by asking a chain of smart questions, each unlocking the next clue? If they ask the wrong question, they chase the wrong person.
🥬 The Concept (Chain-of-Questions, CoQ):
- What it is: CoQ turns hard visual problems into a step-by-step chain of sub-questions and answers that lead to the final answer.
- How it works:
- Start with the original big question.
- Ask an informative sub-question; answer it.
- Use that answer to choose the next sub-question.
- Repeat until you have enough clues to answer the big question.
- Why it matters: Without a chain, the model’s path is a black box. With CoQ, we can measure each step, spot detours, and see why the final answer happened.
🍞 Anchor: In a traffic scene, CoQ might ask: What is the black car doing? What is the silver car doing? Is the road wet? Then decide who’s responsible.
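To make the loop concrete, here is a minimal Python sketch of a CoQ-style loop. The `plan`, `follow`, and `conclude` callables are hypothetical stand-ins for whatever model calls an implementation would use; this illustrates the idea, not the paper's actual code.

```python
from typing import Callable, List, Optional, Tuple

def chain_of_questions(
    image,
    big_question: str,
    plan: Callable,      # hypothetical: picks the next sub-question, or None to stop
    follow: Callable,    # hypothetical: answers a given sub-question from the image
    conclude: Callable,  # hypothetical: answers the big question from gathered evidence
    max_steps: int = 6,
) -> str:
    """Illustrative Chain-of-Questions loop: alternate Planning and Following."""
    evidence: List[Tuple[str, str]] = []              # accumulated (sub-question, answer) pairs
    for _ in range(max_steps):
        sub_q: Optional[str] = plan(image, big_question, evidence)  # Planning step
        if sub_q is None:                             # enough clues gathered
            break
        sub_a = follow(image, sub_q, evidence)        # Following step
        evidence.append((sub_q, sub_a))
    return conclude(image, big_question, evidence)    # combine clues into the final answer
```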
🍞 Hook: Imagine choosing the next move in chess. Picking the right question is like picking the right move—it sets up everything that follows.
🥬 The Concept (Planning in CoQ):
- What it is: Planning is choosing the most helpful next sub-question at each step.
- How it works:
- See the big question and prior Q&A.
- Pick from several candidate questions (some are distractors).
- Get the ground-truth answer for the question you chose (so you can focus on planning, not perception).
- Move to the next step.
- Why it matters: If Planning is weak, the model wastes steps, follows shiny-but-irrelevant details, and misses key clues.
🍞 Anchor: When guessing a photo’s location, a great plan asks about signs, languages, and road markings—not the cloud shapes.
🍞 Hook: Following a recipe means doing each step carefully; mess up an ingredient and the cake flops.
🥬 The Concept (Following in CoQ):
- What it is: Following is correctly answering the given sub-questions along a fixed reasoning path.
- How it works:
- Receive the predefined ground-truth sub-question.
- Choose the correct answer from multiple choices (with distractors).
- Accumulate evidence across steps.
- Use all evidence to answer the final question.
- Why it matters: If Following is wrong (wrong facts), errors snowball and the final answer derails.
🍞 Anchor: If the sub-question is “What is the silver car doing?”, picking “turning right” instead of “backing up” can flip who caused the crash.
Three analogies to lock it in:
- Detective story: Plan which witness to interview next (Planning), then record accurate testimony (Following).
- Treasure hunt: Decide the best next clue to read (Planning), then correctly decode each clue (Following).
- Cooking class: Choose the right prep steps first (Planning), then execute each measurement and mix correctly (Following).
Before vs. After:
- Before: Benchmarks mostly graded only the final answer. Models could guess, brute-force the image, or rely on lucky shortcuts.
- After: V-REX measures the journey. We can pinpoint if a model fails because it asked bad questions (Planning) or gave wrong answers (Following).
Why it works (the intuition, no math):
- Infinite exploration becomes manageable when every step is multiple-choice. This shrinks the space and removes subjective judging.
- Separating Planning from Following isolates strategy (what to ask) from perception/recall (what’s true), giving cleaner, fairer diagnostics.
- Distractors test whether the model resists tempting but unhelpful side paths.
Building blocks:
- Ground-truth QA chain: A human-verified sequence of helpful, correctly ordered sub-questions and answers.
- Planning task: Pick the helpful sub-question at each step from a set that includes distractors.
- Following task: Answer each predefined sub-question with multiple-choice options.
- Distractors: Plausible but unhelpful questions/answers that tug the model off-track.
- Metrics: Step accuracy for Planning and Following; also final-answer accuracy.
🍞 Anchor: With CoQ, the model that once flailed on “Which route best reaches the Sunflower Garden?” learns to ask about gates, signs, and walkway directions first, then follows through to a confident final choice.
03 Methodology
High-level overview: Image and original question → CoQ pipeline → Two tasks (Planning and Following) → Final answer and per-step scores.
🍞 Hook: You know how a coach splits practice into drills (passing vs. shooting) to see exactly where players need help? V-REX drills Planning and Following separately.
🥬 The Concept (Ground-Truth QA Chain):
- What it is: A human-built sequence of helpful, correctly ordered sub-questions with their answers.
- How it works:
- Experts create a chain where each step helps with a later step.
- Ensure no step depends on a future answer.
- Cross-verify and refine across annotators.
- Why it matters: A clean, helpful chain is the backbone; messy chains confuse models and make scores unreliable.
🍞 Anchor: For “Who is mainly responsible for the accident?”, the chain might be: What is the black car doing? What is the silver car doing? Is the ground wet? Then answer responsibility.
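For intuition, a single sample's ground-truth chain could be represented as a small nested structure like the sketch below. The field names and image file name are assumptions for exposition, not the benchmark's actual schema; answers not stated in the text are left as placeholders.

```python
# Hypothetical representation of one V-REX sample's ground-truth QA chain.
# Field names and the image file name are illustrative, not the real schema.
sample = {
    "image": "crash_scene.jpg",
    "final_question": "Who is mainly responsible for the accident?",
    "chain": [  # ordered so that no step depends on a later answer
        {"question": "What is the black car doing?",  "answer": "Backing"},
        {"question": "What is the silver car doing?", "answer": "..."},  # placeholder
        {"question": "Is the ground wet?",            "answer": "..."},  # placeholder
    ],
    "final_answer": "...",  # placeholder: the responsibility judgment
}
```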
Dataset scope:
- 702 samples, 2,504 questions; 2–6 steps per sample (average 3.57).
- 4 categories, 15 scenarios:
- Deduction (Flowchart, Pattern, Property, Relationship)
- Guessing (Responsibility, Intention, Location, Time, Topic)
- Navigation (Map, GUI, Traffic, Trend)
- Retrieval (Counting, Word Puzzle)
🍞 Hook: Imagine a teacher adds tricky-but-wrong choices to test if you really understand. Those are distractors.
🥬 The Concept (Planning Task with Distractors):
- What it is: At each step, the model must select the most helpful sub-question from options that include distractors.
- How it works:
- Step-level distractors: LLM-generated questions that look relevant locally but don’t help solve the final question.
- Chain-level distractors: Entire alternative chains that are self-consistent but lead away from the true solution.
- Integration: Filter, select, and merge the hardest distractors with the ground-truth chain.
- During evaluation, the model gets the ground-truth answer to whichever sub-question it chooses (so we isolate Planning skill).
- Why it matters: This decouples choosing-what-to-ask from answering, providing a clear Planning score.
🍞 Anchor: In a location-guessing task, a distractor might ask about a T-shirt logo color, while the helpful question asks about street signs’ language.
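Below is a minimal sketch of how one Planning step could be scored, assuming a hypothetical `model.choose(...)` interface and an `answer_lookup` table of reference answers; the real evaluation harness may differ.

```python
import random

def run_planning_step(model, image, big_question, evidence,
                      gt_question, candidate_questions, answer_lookup):
    """Score one Planning step (illustrative sketch, not the official harness).

    `candidate_questions` mixes the ground-truth sub-question with distractors.
    Per the description above, the model then receives a ground-truth answer to
    whichever question it picked, so Planning is isolated from perception.
    `model.choose(...)` is a hypothetical interface returning the index of the
    selected option.
    """
    options = list(candidate_questions)
    random.shuffle(options)                            # avoid position bias
    picked = options[model.choose(image, big_question, evidence, options)]
    evidence.append((picked, answer_lookup[picked]))   # answer is revealed, not predicted
    return picked == gt_question                       # counts toward Planning accuracy
```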
🍞 Hook: After picking the right questions, you still need to get the facts right.
🥬 The Concept (Following Task with Distractors):
- What it is: The model follows the predefined chain and must choose the correct answer at each step (multiple-choice).
- How it works:
- Present the ground-truth sub-question.
- Offer one correct answer plus several plausible wrong ones.
- Accumulate these answers as evidence.
- Use all evidence to answer the final question.
- Why it matters: Wrong intermediate facts can derail the final decision; Following tests this factual reliability.
🍞 Anchor: Given “What is the algorithm in the flowchart?” answering “merge sort” correctly sets up the final complexity answer “O(N log N).”
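A matching sketch for one Following step, again with a hypothetical `model.answer_mcq(...)` interface; note that the model's own answer, right or wrong, is what carries forward as evidence.

```python
import random

def run_following_step(model, image, big_question, evidence,
                       gt_question, gt_answer, distractor_answers):
    """Score one Following step (illustrative sketch, not the official harness).

    The sub-question is fixed by the ground-truth chain; the model picks an
    answer from multiple choices, and that prediction (correct or not) is
    carried forward as evidence, which is how wrong facts snowball.
    """
    choices = [gt_answer] + list(distractor_answers)
    random.shuffle(choices)                            # avoid position bias
    predicted = model.answer_mcq(image, big_question, evidence, gt_question, choices)
    evidence.append((gt_question, predicted))          # the model's own answer carries forward
    return predicted == gt_answer                      # counts toward Following accuracy
```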
🍞 Hook: Report cards need clear rubrics, not vibes.
🥬 The Concept (Evaluation Metrics):
- What it is: Step accuracy for Planning and Following; final-answer accuracy with and without CoQ hints.
- How it works:
- Planning accuracy: % of times the model picked the ground-truth next question.
- Following accuracy: % of times it chose the ground-truth answer to each sub-question.
- Final accuracy: % correct on the big question, either directly or after CoQ steps.
- Why it matters: These metrics separate strategy from fact-handling and show how hints change outcomes.
🍞 Anchor: If a model gets 80% on Following but 55% on Planning, you know it tends to answer well when guided, but struggles to choose the right path alone.
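Aggregating the three headline numbers is straightforward; here is a small sketch, assuming an illustrative per-sample record layout rather than the paper's exact schema.

```python
def summarize(results):
    """Aggregate per-step outcomes into the three headline metrics.

    `results` is a list of per-sample records such as
        {"planning_steps": [True, False, True],
         "following_steps": [True, True, True],
         "final_correct": True}
    Field names are illustrative, not the paper's exact schema.
    """
    def acc(flags):
        return sum(flags) / len(flags) if flags else 0.0

    planning  = [ok for r in results for ok in r["planning_steps"]]
    following = [ok for r in results for ok in r["following_steps"]]
    final     = [r["final_correct"] for r in results]
    return {
        "planning_accuracy":  acc(planning),   # % of correct next-question picks
        "following_accuracy": acc(following),  # % of correct sub-question answers
        "final_accuracy":     acc(final),      # % correct on the big question
    }
```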
Concrete recipe with an example (accident responsibility):
- Input: Crash scene image; Big Q: “Who is mainly responsible for the accident?”
- Planning step:
- Options: “What is the black car doing?” (helpful), “How many cars are visible?” (distractor), “Is the sign red?” (distractor), “What is the silver car doing?” (helpful).
- Model must pick the most helpful one; it then receives the true answer to that sub-question.
- Following step:
- Given sub-question “What is the black car doing?”
- Answers: “Backing” (correct), “Parking,” “Turning left,” “Driving forward.”
- Model must pick the correct one.
- Output: Final responsibility decision, with per-step scores.
Secret sauce (why V-REX is clever):
- It shrinks an unbounded exploration space to a controlled, fair multiple-choice game while keeping real-world complexity.
- It disentangles Planning and Following, so model builders can fix the right part.
- Distractors are crafted both locally and globally, exposing subtle weaknesses.
- Hints (CoQ questions or answers) let us see how much guided exploration helps.
04 Experiments & Results
The test: Measure two abilities across many real-world-like tasks.
- Planning: Can the model pick the most helpful next question?
- Following: Can the model answer each sub-question correctly?
- Final accuracy: Can the model get the big question right, with and without CoQ?
- Dataset: 702 samples, 2–6 steps each, in 4 categories and 15 scenarios.
The competition: 32 VLMs, from small (≈1B) to large (>30B), spanning both open-source and proprietary families.
- Examples include InternVL, Qwen, LLaVA-OV, GPT-4o, GPT-5, O1/O3, and Gemini Flash variants.
Scoreboard with context (big-picture takeaways):
- Scaling holds: As models get bigger, both Planning and Following improve. Think of it like taller kids often reaching higher shelves.
- Following is steadier: Among models of the same size, Following accuracy varies less than Planning accuracy. That’s like most kids of the same height grabbing apples equally well, but planning the best ladder path varies a lot.
- CoQ helps final answers: Across most models and categories, final-question accuracy increases when CoQ hints are provided. That’s like getting a map before a hike—results go up.
- Retrieval benefits less: Counting and word-puzzle tasks gain less from CoQ, behaving more like direct lookups where planning adds little.
Illustrative results (select highlights for intuition):
- Proprietary leaders (e.g., GPT-5, O3) top averages across both tasks, but large open-source models (e.g., InternVL3-38B, InternVL2.5-38B) are competitive, especially in Deduction and Navigation.
- Smaller models (<7B) often show a big gap: better at Following than Planning. Larger models (>10B) trend toward balance.
- Strong correlations: Following accuracy correlates very strongly with overall performance (~0.95), and Planning also correlates well (~0.86). In school terms, improving either skill tends to raise the final grade, but raising Following is like boosting the main exam score directly.
Surprising and insightful findings:
- Recovery from mistakes: Models recover better from bad Planning than from bad Following. A so-so choice of question can be fixed later, but a wrong fact poisons the well.
- Category expertise shifts with size: Small models have scattered strengths; bigger ones become more uniformly capable across categories.
- Occasionally, CoQ can distract: A few models do worse with hints in Planning, suggesting that extra steps can add noise if the model can’t filter them well.
Concrete analogy for scores:
- If an A+ is 90%+, top proprietary models often hit A-range on both Planning and Following, while many large open-source models are now in A- to B+ territory, closing the gap. Smaller models land around C to B-, especially on Planning.
05 Discussion & Limitations
Limitations (be specific):
- CoQ can add cognitive noise: Some models get distracted by extra steps in Planning, leading to worse final answers.
- Retrieval mismatch: For tasks that are mostly direct lookup (counting, word reading), CoQ’s multi-step structure may not help and can even get in the way.
- Multiple-choice framing: Constraining steps to finite options makes evaluation fair and reproducible, but it may miss creative, valid alternative paths.
- Human-and-LLM curation: Ground-truth chains and distractors rely on expert design and LLM assistance; biases or oversights there can influence scores.
Required resources:
- Expert annotators to build and verify helpful, correctly ordered QA chains.
- Strong LLMs to generate and filter distractors (step-level and chain-level).
- Compute to run many VLMs fairly and consistently.
When NOT to use V-REX:
- Pure retrieval tasks where a single glance or direct OCR suffices.
- Settings where models should propose completely free-form exploration without any multiple-choice constraints.
- Domains where many different reasoning paths are equally valid and should be credited automatically.
Open questions:
- Can we automatically detect and reward alternative valid chains that reach the right answer?
- How can we design distractors that adapt to each model’s unique failure modes without leaking helpful clues?
- Can we smoothly blend CoQ with free-form exploration so we get both fairness and creativity?
- How should partial credit be assigned when a model picks a decent-but-not-optimal sub-question?
- Can Planning be improved via targeted training (e.g., reinforcement learning on question selection) without hurting Following?
06 Conclusion & Future Work
Three-sentence summary: V-REX is a benchmark that turns complex visual problems into a Chain-of-Questions, so we can separately measure Planning (what to ask next) and Following (answering correctly). By converting each step into multiple-choice, it enables reliable, fine-grained scoring of the entire reasoning path rather than just the final answer. Experiments on 32 models show consistent scaling, strong links between step skills and final performance, and a common imbalance in smaller models that favor Following over Planning.
Main achievement: A practical, interpretable framework that disentangles and quantifies exploratory reasoning in visual tasks, revealing exactly where and how models succeed or fail along the way.
Future directions:
- Train models explicitly on Planning, not just Following, perhaps with reward signals for question choice.
- Design adaptive distractors that evolve with model capability.
- Blend CoQ with free-form reasoning to credit diverse valid paths.
- Extend to multi-image and dynamic video settings where exploration is even more critical.
Why remember this: Real-world AI needs to explore, plan, and verify—not just guess the final answer. V-REX shows how to evaluate that exploration cleanly, making it easier to build trustworthy assistants that think more like careful detectives and less like lucky guessers.
Practical Applications
- Use V-REX to pinpoint whether your VLM needs more training on Planning (question selection) or Following (factual answering).
- Design curriculum learning that first strengthens Following on structured chains, then adds Planning complexity with distractors.
- Integrate CoQ-style prompting in production agents to guide multi-step visual tasks (e.g., GUI navigation) with explicit sub-questions (see the sketch after this list).
- Benchmark new model releases with V-REX to track scaling gains and identify regressions in exploratory reasoning.
- Augment training data with step- and chain-level distractors to harden models against tempting but irrelevant visual cues.
- Build dashboards that visualize a model’s reasoning chain so annotators can correct specific weak links.
- Adopt CoQ hints (planned sub-questions or interim answers) at inference time to improve final accuracy on complex tasks.
- Prioritize Following improvements for applications where factual correctness is critical (e.g., safety decisions).
- Use recovery-from-failure diagnostics to design fallback strategies when a model makes an early mistake.
- Tailor evaluation by scenario (Deduction, Guessing, Navigation, Retrieval) to focus on the skills your product needs most.
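As a concrete example of the CoQ-style prompting mentioned above, here is a minimal prompt-construction sketch for a multi-step GUI task. The prompt wording, the example sub-questions, and the helper name are all assumptions for illustration; plug in your own model client.

```python
def coq_prompt(big_question, evidence, next_question):
    """Build a CoQ-style prompt: main task, clues so far, and the next sub-question."""
    steps = "\n".join(f"Q: {q}\nA: {a}" for q, a in evidence)
    return (
        f"Main task: {big_question}\n"
        f"Clues gathered so far:\n{steps or '(none yet)'}\n"
        f"Next sub-question: {next_question}\n"
        f"Answer the sub-question using only what is visible on the screen."
    )

# Example: guiding one step of a GUI-navigation task (values are made up).
print(coq_prompt(
    "How do I enable dark mode in this app?",
    [("Which screen is currently open?", "The home screen")],
    "Which menu item leads to display settings?",
))
```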