Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Intermediate
Chengwen Liu, Xiaomin Yu, Zhuoyue Chang et al. · 1/11/2026
arXiv · PDF

Key Summary

  • VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
  • It forces models to depend on both the video (for clues) and the web (for facts), so neither alone is enough to solve the problems.
  • Two system styles are compared: Workflow (summarize video cues first, then search) and Agentic (do everything end-to-end in one agent).
  • Results show Agentic is not always better; it helps only when the model can keep the original video clues stable across many search steps.
  • Big wins require avoiding goal drift (losing the target) and keeping long-horizon consistency (staying on track through many steps).
  • Gemini-3-pro-preview performed best overall (up to 76% in the Agentic setting), with GPT-5.2 close behind (69%).
  • Longer videos make the task harder and split model performance: strong models improve with Agentic; weaker ones get worse.
  • Numbers (like IDs and dates) were a common stumbling block across models, even for top performers.
  • Using tools more doesn’t guarantee better results; what matters is turning a few searches into a solid, cross-checked evidence chain.

Why This Research Matters

Many real-life questions start from what we see in videos, but the trustworthy facts live on official web pages. VideoDR tests whether AI can truly bridge that gap, turning visual clues into accurate, verifiable answers. This matters for verifying claims in news clips, planning trips from vlogs, checking product specs from reviews, and learning from educational videos. It also highlights key weaknesses—like drifting off-target and mishandling numbers—so researchers know what to fix next. By comparing two system styles (Workflow vs Agentic), it guides practical design choices for better reliability. Over time, this can make AI assistants safer, more useful fact-finders in the video-first world we all live in.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re watching a short travel video. You see a statue by a museum’s main door. You want to know its official registration number, but the video never says it out loud. What do you do?

🥬 The Concept (World Before, Problem, Failed Attempts, Gap, Stakes):

  • What it is: Before this paper, most tests for AI and video asked questions that could be answered from the video alone, or they started from text-only questions for web research.
  • How it worked (before):
    1. Video benchmarks usually kept everything inside the video, so models never had to search the web.
    2. Deep research benchmarks made AIs plan searches and reason, but almost always began with text queries, not with moving-picture clues.
    3. Visual info, if used, was treated as a side note, not as the main anchor that controls the whole search.
  • Why it matters: Real life mixes both. Videos often hide the key visual clues (like logos, maps, or signs), and the verified facts live on the open web (like museum websites). Without a way to test this mix, we don’t know if AI can truly start from video details and end with a trustworthy, checkable answer.

🍞 Anchor Example: You pause a vlog showing the British Museum entrance, spot a “Don’t miss” sign near the door in a later frame, search the museum’s site, and find the artifact’s accession number WB.67. That’s exactly the type of challenge this paper targets.

🍞 Hook: You know how a treasure hunt needs both the map and the clues hidden in the yard?

🥬 The Concept (The Problem):

  • What it is: The main challenge is open-domain video question answering where vital clues are in the video frames, but the confirmable facts are scattered across the web.
  • How it works (why it’s tough):
    1. You must grab the right visual anchors across multiple frames (not just a single screenshot).
    2. You must turn those anchors into smart web searches.
    3. You must hop between pages and cross-check until one answer clearly fits.
  • Why it matters: If AI can’t keep track of the original video anchors, the web search drifts, leading to wrong results—even if the search tool is powerful.

🍞 Anchor Example: The video shows a company’s unique product label in one frame and a city skyline in another; you need both to search precisely and avoid mixing it up with a different company in a different city.

🍞 Hook: Think of making a sandwich—you can’t skip the bread or the filling.

🥬 The Concept (Failed Attempts):

  • What it is: Earlier tests either stayed stuck inside the video (no web) or started from plain text (no real video anchoring).
  • How it works (what was tried):
    1. Closed-video QA: Great at tracking time within a video, but no skills for finding facts online.
    2. Text-first deep research: Great at web navigation, but ignores the step where video clues must drive the search.
  • Why it matters: Without both halves, an AI cannot do the real-world “watch, reason, and search” loop.

🍞 Anchor Example: Asking “What color is the car?” (video-only) or “Who discovered penicillin?” (text-only) doesn’t test what happens when a video clue must guide the web search to the one true answer.

🍞 Hook: Picture trying to solve a puzzle where some pieces are on the screen and the rest are out there on the internet.

🥬 The Concept (The Gap and Stakes):

  • What it is: The missing piece was a benchmark that forces dependence on both video and web, with multi-hop reasoning that ties the two together.
  • How it works (the new standard):
    1. Questions require multiple frames—no single screenshot can solve them.
    2. Questions require multiple web hops—no one page gives the full answer.
    3. Both the video-only and web-only routes are blocked on purpose.
  • Why it matters: This mirrors daily life: verifying a news clip, understanding a product demo, or planning a trip based on a vlog, where visuals spark the search and the web confirms the facts.

🍞 Anchor Example: From a tech review video, you spot a motherboard revision code on the PCB and a brand logo in another frame, then search the manufacturer site to confirm the exact model and launch date.

02 Core Idea

🍞 Hook: You know how detectives freeze a scene, pick out tiny clues, and then check records to crack the case?

🥬 The Concept (Aha! Moment):

  • What it is: Treat the video as your starting clue generator and the open web as your fact library, then force the AI to hop between them until one verified answer remains—and build a benchmark (VideoDR) that measures this exact skill.
  • How it works:
    1. Extract cross-frame visual anchors from the video (e.g., signs, logos, maps, model numbers).
    2. Turn anchors into targeted web queries and browse live pages.
    3. Chain the evidence across multiple pages and frames (multi-hop reasoning).
    4. Output a single, checkable answer, not a guess.
  • Why it matters: Without this, AIs may forget the video details, search the wrong stuff, and sound confident but be wrong.

🍞 Anchor Example: A city tour video shows a tram line number and a station name across different frames; using both, the AI finds the official schedule page and the exact year the line opened.
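To make the anchor-to-query step concrete, here is a minimal Python sketch (not from the paper) of how cross-frame anchors could be turned into progressively narrower web queries; the anchor fields and the build_queries helper are illustrative assumptions.

```python
# Minimal sketch (not from the paper): turning cross-frame visual anchors
# into targeted web queries. Anchor values and helper names are illustrative.

def build_queries(anchors: dict[str, str]) -> list[str]:
    """Combine visual anchors into progressively narrower search queries."""
    place = anchors.get("place", "")
    landmark = anchors.get("landmark", "")
    sign_text = anchors.get("sign_text", "")
    return [
        f"{place} {sign_text} list",                              # broad: find the referenced list
        f"{place} {landmark} map",                                # locate the anchor spatially
        f"{place} exhibit nearest {landmark} accession number",   # narrow to the target fact
    ]

# Example anchors read off three different frames of the museum vlog.
anchors = {
    "place": "British Museum",
    "landmark": "Main Entrance",
    "sign_text": "Don't miss",
}

for query in build_queries(anchors):
    print(query)
```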

Multiple Analogies:

  • 🍞 Hook: Imagine a scavenger hunt where each clue points to the next hiding spot online. 🥬 The Concept: The video gives the first set of clues; the internet holds the rest. The AI must alternate between what it saw and what it reads until the prize (the answer) is found. Without tracking the clues carefully, it wanders off. 🍞 Anchor: From a concert clip, you spot a banner date and a venue logo, then confirm the setlist on the venue’s archived page.

  • 🍞 Hook: Think of cooking with a mystery ingredient: you see it in the kitchen video but need a recipe site to know how to use it. 🥬 The Concept: The video shows the ingredient label; the web tells you the safe cooking temperature and recipe. The AI must keep the label details accurate during the search. 🍞 Anchor: A frame shows “wild-caught sockeye”; the AI finds a trusted recipe site to get the correct internal temperature.

  • 🍞 Hook: Picture a relay race: the video passes the baton (anchor details) to the web search, which passes it to reasoning, and back again if needed. 🥬 The Concept: If any runner drops the baton (forgets the anchor), the team loses time or the whole race. 🍞 Anchor: The AI notes a product’s exact SKU in the video and uses it to find the official warranty terms online.

Before vs After:

  • Before: Video QA meant staying inside the video; deep research meant starting from text. They lived in separate worlds.
  • After: VideoDR fuses them: video supplies anchors; the web supplies proof. Success means balancing both, not choosing one.

Why It Works (intuition, no equations):

  • Anchors reduce web noise: precise visual details make search queries sharper, shrinking the haystack.
  • Multi-hop reasoning builds reliability: cross-checking multiple sources helps avoid one-page mistakes.
  • Forcing dependence on both sides blocks shortcuts: you can’t guess from the video alone or web alone, so models must truly integrate.

Building Blocks (mini-concepts in sandwich form):

  • 🍞 Hook: You know how a school project needs both your notes and library books? 🥬 Visual Anchors: Key visual details from multiple frames that guide searching; they’re like unique keywords you saw with your eyes. Steps: scan frames, extract distinct clues, write them down. Without anchors, searches become vague and drift. 🍞 Example: A frame shows “Entrance A” on a map and the hall name on a sign; both steer the search.

  • 🍞 Hook: Imagine hopping across stepping stones to cross a stream. 🥬 Multi-hop Reasoning: Linking several facts in order (video → page → another page → answer). Steps: form sub-questions, fetch evidence, verify, move to the next. Without it, you stop too early or assemble the wrong answer. 🍞 Example: From logo → official site → product page → spec sheet → exact release date.

  • 🍞 Hook: Think of doing a science experiment with careful measurements. 🥬 Open-domain QA: Answering questions where the facts could be anywhere on the internet. Steps: search widely, filter sources, verify. Without it, you can’t find reliable, up-to-date info. 🍞 Example: Confirming a museum object ID from the institution’s own database, not a random blog.
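As a mental model for these building blocks (an illustrative assumption, not the paper's code), the sketch below represents visual anchors and the multi-hop evidence chain as small data records, so every hop stays traceable to the frames and pages that support it.

```python
# Sketch of the building blocks as plain data structures (illustrative only).
from dataclasses import dataclass, field

@dataclass
class VisualAnchor:
    frame_index: int          # which frame the clue was seen in
    kind: str                 # "sign", "logo", "map", "model_number", ...
    text: str                 # the clue itself, e.g. "Main Entrance"

@dataclass
class EvidenceHop:
    query: str                # web query issued for this hop
    source_url: str           # page that supplied the fact
    fact: str                 # extracted, cross-checkable statement
    grounded_on: list[VisualAnchor] = field(default_factory=list)

@dataclass
class EvidenceChain:
    hops: list[EvidenceHop] = field(default_factory=list)

    def answer(self) -> str:
        """The final answer is the fact established by the last hop."""
        return self.hops[-1].fact if self.hops else "unresolved"
```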

03 Methodology

High-level Pipeline (for the benchmarked task): Input (Video V + Question Q + Web Tools) → [A] Cross-frame Anchor Extraction → [B] Query Planning → [C] Interactive Web Retrieval → [D] Evidence Reading and Cross-checking → [E] Multi-hop Reasoning over Video+Web → Output (One verifiable answer).
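Read as code, the pipeline is just a composition of the five stages; the skeleton below follows the stage labels [A]–[E] above, but every function body is a stand-in, not the benchmark's actual implementation.

```python
# Hypothetical skeleton of the benchmarked task (stages A-E above).
# Each stage is a placeholder; a real system would back these with a
# multimodal model and a live browser/search tool.

def extract_anchors(video_frames):           # [A] cross-frame anchor extraction
    return ["British Museum", "Main Entrance", "Don't miss sign"]

def plan_queries(anchors, question):         # [B] query planning
    return [f"{' '.join(anchors)} {question}"]

def retrieve(queries):                       # [C] interactive web retrieval
    return [{"url": "https://example.org", "text": "…"} for _ in queries]

def cross_check(pages):                      # [D] evidence reading and cross-checking
    return [page["text"] for page in pages]

def reason(anchors, facts, question):        # [E] multi-hop reasoning over video + web
    return "WB.67"                           # single verifiable answer (placeholder)

def video_deep_research(video_frames, question):
    anchors = extract_anchors(video_frames)
    queries = plan_queries(anchors, question)
    pages = retrieve(queries)
    facts = cross_check(pages)
    return reason(anchors, facts, question)

print(video_deep_research(video_frames=[], question="accession number of the closest exhibit"))
```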

Two System Styles:

  • 🍞 Hook: You know how some people write notes before they study, while others jump right in? 🥬 Workflow Paradigm: A two-stage recipe: first extract and write down the video anchors, then search and reason using that text as an always-available memory. Steps: (1) watch and summarize anchors, (2) search with them, (3) reflect and refine, (4) answer. Why it matters: Without the written anchors, later steps can forget details and drift. 🍞 Example: Write “British Museum, Main Entrance, Don’t miss list” before searching the site.

  • 🍞 Hook: Picture a Swiss Army knife that has every tool in one handle. 🥬 Agentic Paradigm: End-to-end agent that watches, plans searches, retrieves pages, reflects, and answers in one loop. Steps: (1) observe video, (2) decide to search, (3) read/reflect, (4) repeat, (5) answer. Why it matters: It can preserve fine details closely tied to the video, but if it forgets or misreads early anchors, it can’t rewatch to fix them, causing drift. 🍞 Example: The agent sees a map snippet once, then must rely on memory to search “closest exhibit to Main Entrance.”
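One way to see the difference is in control flow: the sketch below (placeholder model and search helpers assumed, not the paper's code) contrasts the Workflow recipe, which writes the anchors down before any searching, with the Agentic loop, which interleaves search and reflection and must carry the anchors in working memory.

```python
# Illustrative contrast between the two paradigms (placeholder helpers assumed).

def workflow_paradigm(video, question, model, search):
    # Stage 1: watch once and write the anchors down as text.
    anchor_notes = model.summarize_anchors(video, question)
    # Stage 2: search and reason with the notes always available in the prompt.
    evidence = []
    for query in model.plan_queries(anchor_notes, question):
        evidence.append(search(query))
    return model.answer(question, anchor_notes, evidence)

def agentic_paradigm(video, question, model, search, max_steps=8):
    # Single end-to-end loop: the agent decides when to search and when to stop,
    # and the anchors live only in its working memory.
    state = model.observe(video, question)
    for _ in range(max_steps):
        action = model.next_action(state)
        if action.kind == "search":
            state = model.reflect(state, search(action.query))
        elif action.kind == "answer":
            return action.answer
    return model.force_answer(state)
```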

Detailed Steps (what, why, example):

  1. Cross-frame Anchor Extraction
  • What happens: The model scans multiple frames to capture distinctive clues (names, numbers, signs, logos, landmarks).
  • Why this step exists: One frame may miss something key; combining frames gives a fuller, more precise query seed. Without it, search terms are too generic.
  • Example: From three frames, it notes “The British Museum,” “Main Entrance,” and a specific statue name on a sign.
  2. Query Planning
  • What happens: Turn anchors into targeted queries and plan sub-questions (e.g., identify museum → find recommended list → find closest item → read its ID).
  • Why: A single vague query won’t land on the right page. Without planning, the model wastes searches.
  • Example: Query 1: “British Museum don’t miss list”; Query 2: “British Museum entrance map”; Query 3: “WB.67 accession number confirmation.”
  3. Interactive Web Retrieval
  • What happens: Use a browser search tool, pick promising links, skim snippets, open pages.
  • Why: Facts live on the open web. Without actual retrieval, the model can’t verify.
  • Example: Open the museum’s official site first; only then consider secondary sources.
  4. Evidence Reading and Cross-checking
  • What happens: Read pages, extract facts, compare across sources, discard conflicts.
  • Why: The web can be noisy or outdated. Without cross-checking, errors slip in.
  • Example: Cross-check the “Don’t miss” list across the museum’s guide, a floor map, and a visitor blog.
  5. Multi-hop Reasoning over Joint Evidence
  • What happens: Link video anchors to web facts step by step until only one answer fits.
  • Why: Answers often need 2–4 hops. Without chaining, details don’t line up.
  • Example: Entrance location (video) → closest exhibit on official guide (web) → its accession number (web) → final answer WB.67.
  6. Final Answering with Traceability
  • What happens: Produce the answer and cite the decisive evidence.
  • Why: Traceability enables judging and debugging. Without it, we can’t tell if the steps were sound.
  • Example: “WB.67 (source: British Museum guide; floor map).”
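A toy version of steps 4 and 6 might look like the sketch below: keep the candidate fact supported by the most independent sources, then report those sources alongside the answer. The candidate facts and source names are made up for illustration.

```python
# Toy cross-checking: keep the candidate fact supported by the most
# independent sources, and report those sources alongside the answer.
from collections import defaultdict

def cross_check(candidates):
    """candidates: list of (fact, source_url) pairs gathered during retrieval."""
    support = defaultdict(set)
    for fact, source in candidates:
        support[fact].add(source)
    best_fact = max(support, key=lambda fact: len(support[fact]))
    return best_fact, sorted(support[best_fact])

# Made-up evidence for the museum example.
candidates = [
    ("WB.67", "britishmuseum.org/guide"),
    ("WB.67", "britishmuseum.org/floor-map"),
    ("WB.76", "random-visitor-blog.example"),   # conflicting, weakly supported
]

answer, sources = cross_check(candidates)
print(f"{answer} (sources: {', '.join(sources)})")
```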

Benchmark Construction (recipe to build VideoDR):

  • Candidate video pool with diverse sources and lengths; strict negative filtering to avoid trivial or unsourceable cases.
  • Initial filtering to keep only videos with strong, multi-frame visual anchors.
  • Question design with two hard rules: multi-frame grounding and multi-hop web dependence; archive key evidence pages.
  • Two dependency tests: (a) Web-only test—if solvable from text alone, discard; (b) Video-only test—if solvable from video alone, discard.
  • Human testing: five independent solvers use the video plus web; success rate per sample defines difficulty (Low/Mid/High) based on how many of the five got it right.
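The two dependency tests and the difficulty labels can be written as a small filter; in the sketch below, the dual-dependence rule mirrors the description above, while the Low/Mid/High cut-offs over the five human solvers are an assumption, since the exact thresholds are not spelled out here.

```python
# Sketch of the dataset filtering and difficulty binning (thresholds assumed).

def keep_sample(solvable_from_web_only: bool, solvable_from_video_only: bool) -> bool:
    """Dual-dependence rule: discard if either single modality suffices."""
    return not solvable_from_web_only and not solvable_from_video_only

def difficulty_label(correct_out_of_five: int) -> str:
    """Difficulty from how many of five human solvers succeeded (cut-offs assumed)."""
    if correct_out_of_five >= 4:
        return "Low"
    if correct_out_of_five >= 2:
        return "Mid"
    return "High"

assert keep_sample(False, False)          # needs both video and web: keep
assert not keep_sample(True, False)       # web text alone suffices: discard
print(difficulty_label(5), difficulty_label(3), difficulty_label(1))
```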

Evaluation Setup:

  • Models: Gemini-3-pro-preview, GPT-5.2, GPT-4o, Qwen3-Omni-30B-A3B, InternVL3.5-14B, MiniCPM-V 4.5.
  • Paradigms: Workflow vs Agentic for each model.
  • Judging: 🍞 Hook: Like a fair referee comparing answers by meaning, not exact wording. 🥬 LLM-as-Judge: Another model checks if the predicted answer matches the gold answer semantically. Why it matters: Avoids penalizing small wording differences; focuses on correctness. Without it, fair scoring is hard. 🍞 Example: “WB67” vs “WB.67” counted as equivalent if they refer to the same accession number.
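A minimal LLM-as-judge call could look like the sketch below; the prompt wording and the ask_judge stand-in are assumptions in place of whatever judge model and interface the authors actually used.

```python
# Minimal LLM-as-judge sketch. `ask_judge` stands in for any chat-model call;
# the prompt wording is an assumption, not the benchmark's exact template.

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Gold answer: {gold}
Predicted answer: {pred}
Do the two answers refer to the same thing? Reply with exactly YES or NO."""

def judge(question: str, gold: str, pred: str, ask_judge) -> bool:
    reply = ask_judge(JUDGE_PROMPT.format(question=question, gold=gold, pred=pred))
    return reply.strip().upper().startswith("YES")

# Example: accept formatting variants of the same accession number.
fake_judge = lambda prompt: "YES"   # stand-in for a real model call
print(judge("Which accession number?", "WB.67", "WB67", fake_judge))
```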

Secret Sauce:

  • Force true dual dependence (video + web) and multi-hop reasoning so shortcuts don’t work.
  • Compare Agentic vs Workflow to reveal when memory of anchors helps or hurts.
  • Stratify by difficulty, video length, and domain to surface where models drift or stay consistent.

04 Experiments & Results

The Test (what and why):

  • What: Accuracy on 100 real-world-style questions where solving requires multi-frame video anchors plus multi-hop web search.
  • Why: Measures if models can watch for clues, search smartly, and reason across several steps to a single verifiable answer.

The Competition (who):

  • Closed-source: Gemini-3-pro-preview, GPT-5.2, GPT-4o.
  • Open-source: Qwen3-Omni-30B-A3B, InternVL3.5-14B, MiniCPM-V 4.5.
  • Two paradigms each: Workflow (notes-first) and Agentic (end-to-end).

Scoreboard with Context:

  • Overall: Gemini-3-pro-preview leads (69% Workflow; 76% Agentic). Think of 76% like scoring a strong A when many others are closer to a C.
  • GPT-5.2: 69% in both settings—top-tier and steady, like consistently getting A- on every test.
  • GPT-4o: 42%/43%, a solid middle tier.
  • Open-source trio: around 16–37% depending on model and setting, more like D to low C on this hard exam.
  • Humans: Average success per sample was about 50% overall, but split strongly by difficulty: about 90% on Low, 51% on Mid, and 11% on High. That means High questions really are tough even for people.

Difficulty Stratification (why it matters):

  • Everyone—humans and models—drops from Low → Mid → High, so the difficulty labels track real complexity.
  • Agentic helps when a model can keep anchors stable. Example: Gemini-3-pro-preview improved on Mid (61% → 69%) and High (56% → 66%).
  • But Agentic can hurt when anchors drift. Example: GPT-4o rose on Low/Mid but fell sharply on High (47% → 28%).

Video Duration Stratification (short vs long):

  • Longer videos spread clues out, testing memory and consistency.
  • Gemini-3-pro-preview gained a lot in Agentic on Medium/Long (66% → 84% and 50% → 70%), showing it can carry anchors forward.
  • Qwen3-Omni-30B-A3B and MiniCPM-V 4.5 dropped on Long in Agentic (50% → 20% and 30% → 10%), showing drift and instability over longer chains.
  • Takeaway: Workflow’s written anchors are like a steady to-do list; Agentic is powerful but needs a strong memory to avoid drifting.

Domain Stratification (what topics):

  • Technology: Biggest Agentic boost for Gemini (64% → 86%), likely because tiny visual details (model numbers, chip revisions) turn into powerful, precise queries.
  • Geography: Agentic sometimes dropped (e.g., Gemini 70% → 50%, GPT-4o 40% → 20%), suggesting geographic searches can be ambiguous unless video anchors remain crystal clear.

Tool-use Insights:

  • More tool calls don’t equal better grades. What counts is converting a few searches into a clean, cross-checked chain.
  • Gemini-3-pro-preview used modest think/search calls (about 2.9/2.5) and still topped accuracy, meaning its extra retrieval and reflection were high quality.
  • Some models searched more but didn’t improve, hinting at low-yield exploration or weak filtering.

Surprising/Notable Findings:

  • Agentic is not automatically superior; it depends on anchor retention.
  • Numerical errors cluster across all models—even top ones—showing numbers (like IDs, dates) remain a tricky weak spot.
  • Reasoning errors (logic mis-steps) were rare; the big issue was perception/anchoring—getting the category or the anchor wrong early and then amplifying the mistake downstream.

Bottom Line: Success hinges less on raw searching and more on keeping the video anchor steady and turning each search into a reliable, multi-hop evidence chain.

05 Discussion & Limitations

Limitations:

  • The benchmark locks in questions with unique, checkable answers, but the exact search paths reflect how expert annotators browsed. Real users might choose different keywords or routes to reach the same truth.
  • Only 100 samples for now; it’s diverse but still small, so extremely rare patterns may be underrepresented.
  • Live web dependency introduces variability if sites change; archiving helps, but the open web evolves.

Required Resources:

  • A multimodal model that can process videos (frames and text) and a browser/search tool interface.
  • Stable internet access and enough compute to run long-horizon searches and reflections.
  • For Agentic especially, memory—or an equivalent mechanism—to keep anchors consistent across many steps.

When NOT to Use:

  • Purely closed-video tasks where the answer is fully inside the footage; simpler video QA benchmarks are better.
  • Purely text tasks where no video context matters; text-only deep research suites suffice.
  • High-latency or offline settings where web search is impossible; this benchmark assumes live or archived web access.

Open Questions:

  • How can models re-ground themselves mid-research if early anchors were wrong (e.g., a safe, limited rewatch or an external memory probe)?
  • What’s the best way to represent anchors so they stay precise but flexible (structured schemas, sketches, or hybrid notes)?
  • Can targeted training on numerical reliability (IDs, dates, quantities) cut the persistent number-related errors?
  • How do we scale the dataset while preserving the dual-dependence property (neither web-only nor video-only should suffice)?
  • Can collaborative agents (one for perception, one for retrieval, one for verification) outperform single-agent systems without increasing drift?

06 Conclusion & Future Work

Three-sentence Summary:

  • This paper introduces VideoDR, a benchmark where AI must watch a video to extract multi-frame visual anchors, search the open web, and chain the evidence into a single verified answer.
  • Testing top multimodal models shows that Agentic systems are powerful but only when they keep anchors stable; otherwise, Workflow’s written anchors offer more reliable guidance.
  • The toughest hurdles are goal drift and long-horizon consistency, especially on longer videos and higher-difficulty questions.

Main Achievement:

  • A rigorous, first-of-its-kind evaluation that forces real dual dependence on video plus web with multi-hop reasoning, revealing the true boundaries of today’s video research agents.

Future Directions:

  • Add mechanisms for mid-course re-grounding (e.g., controlled rewatching, external anchor memories), improve numerical reliability, and scale the dataset with diverse human search logs.
  • Explore team-based agent designs that split perception, search, and verification roles while actively preventing drift.

Why Remember This:

  • VideoDR reframes video as the spark that ignites trustworthy web research. It shows that the next generation of AI must not just see and not just search, but see-and-search—carefully, consistently, and verifiably.

Practical Applications

  • Build video fact-checking assistants that verify claims from news clips using official sources.
  • Create travel helpers that watch vlog snippets and fetch exact museum hours, ticket types, or exhibit IDs.
  • Develop shopping aides that capture model numbers from review videos and confirm specs/warranties on manufacturer sites.
  • Support educators by turning classroom video demonstrations into step-by-step verified references and readings.
  • Assist technical support by extracting hardware revision codes from teardown videos and finding matching manuals.
  • Enable accessibility tools that summarize visual anchors (signs, maps) from videos and provide verified context from the web.
  • Improve broadcast research pipelines that track on-screen details and assemble trusted background info live.
  • Enhance content moderation and provenance checks by linking video logos/watermarks to authentic sources.
  • Power sports analytics that spot jersey numbers and event markers in videos, then pull verified stats from league databases.
#video deep research · #multimodal reasoning · #open-domain question answering · #multi-hop retrieval · #visual anchors · #agentic paradigm · #workflow paradigm · #web search agents · #LLM-as-judge · #goal drift · #long-horizon consistency · #video QA benchmark · #cross-frame cue extraction · #evidence integration · #numerical reliability