Video-Browser: Towards Agentic Open-web Video Browsing
Key Summary
- The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
- Old methods either watched tons of frames (accurate but super expensive) or only read text summaries (cheap but missed tiny visual facts).
- The authors define Agentic Video Browsing as an iterative plan–search–verify loop that depends on video evidence, not just web pages.
- They build Video-BrowseComp, a benchmark with three difficulty levels that force agents to find and verify answers in videos.
- Their agent, Video-Browser, uses Pyramidal Perception: filter with cheap text metadata, localize with sparse scans, then zoom in with high-fidelity vision.
- Compared to direct visual inference, Video-Browser improves accuracy by 37.5% while cutting token usage by 58.3%.
- The system includes a Planner (decides what to search), a Watcher (finds where to look in videos), and an Analyst (confirms fine visual details).
- It performs well across tasks that require visual grounding, like confirming who blocked a shot in an NBA playoff game or spotting a red pen cap in a movie.
- The work highlights a key lesson: reliable web research with videos needs both smart search and precise visual checking.
Why This Research Matters
So much of modern life is on video: sports, news, reviews, and tutorials. If AI can't find and verify facts inside videos, it will miss what people really care about. This work shows how to search widely yet look closely, making answers both efficient and trustworthy. It can help verify viral claims, check product features, or confirm exactly what happened in a key play. By allocating compute only where it counts, it reduces cost while improving reliability. As video volume grows, this approach is a blueprint for building agents that research, not just regurgitate.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're trying to find the exact moment in a two-hour movie when a tiny red pen cap peeks out of a pocket. You could watch the whole thing, or you could skim summaries, but summaries probably won't mention a pen cap.
The Concept (Direct visual inference): It's when an AI "watches" a lot of video frames directly to answer a question. How it works:
- Download long videos
- Sample many frames (sometimes a huge number)
- Feed frames into a vision-language model
- Generate an answer from what it "saw"
Why it matters: Without it, you can't confirm fine visual facts (like a pen's color). But it can cause a "context explosion": too many frames, too many tokens, too slow.
Anchor: The AI tries to confirm Walter Mitty's pen color by scanning frames, but it spends a fortune in compute just to find a tiny red cap.
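A minimal sketch of this brute-force baseline, assuming hypothetical inputs (a URL-to-duration map) and leaving out the actual frame decoding and model call; the frame counts are illustrative, not the paper's settings:

```python
def uniform_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Evenly spaced sample times across a video; no notion of 'where to look'."""
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 2) for i in range(n_frames)]

def direct_visual_inference(question: str, videos: dict[str, float],
                            frames_per_video: int = 256) -> list[tuple[str, float]]:
    """Naive baseline: sample many frames from every candidate and send them all at once.

    `videos` maps a video URL to its duration in seconds. The returned plan is just
    (url, timestamp) pairs; a real system would decode each frame and pass the whole
    batch to a vision-language model in a single prompt.
    """
    frame_plan = [
        (url, t)
        for url, duration in videos.items()
        for t in uniform_timestamps(duration, frames_per_video)
    ]
    # Every sampled frame lands in the prompt, so cost grows linearly with
    # len(videos) * frames_per_video: the "context explosion".
    return frame_plan

plan = direct_visual_inference("What color is Walter Mitty's pen?",
                               {"https://example.com/mitty_clip": 7200.0})
print(len(plan), "frames would be decoded into a single prompt")
```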
Hook: You know how reading a movie's plot summary won't tell you the color of someone's socks?
The Concept (Text-centric summarization): It's when an AI mostly reads titles, transcripts, and summaries instead of watching many frames. How it works:
- Grab video titles, descriptions, transcripts
- Compress them into short text notes
- Answer from the notes
Why it matters: It's cheap and fast, but it misses small visual clues; this is the "modality gap" between text and vision.
Anchor: The summary says Walter daydreams often, but never mentions the red pen cap, so the AI answers "not specified."
Hook: Think of a librarian who doesn't just read book jackets; they pull the exact page you need.
The Concept (Agentic Video Browsing): It's an AI that plans, searches, opens videos, checks timestamps, and verifies visual evidence step by step. How it works:
- Break the question into subtasks
- Search the web and pick candidate videos
- Peek at key moments
- Confirm the answer with visual proof
Why it matters: Many real answers live inside videos (sports plays, tutorials, product demos). Without agentic browsing, AI guesses or over-trusts text.
Anchor: To answer "Who blocked the shot after an alley-oop?", the agent searches game clips, jumps to likely timestamps, and verifies the player's name on-screen.
The world before: AI agents got good at reading web pages and looking at images. Benchmarks like GAIA and BrowseComp tested browsing on text and static pictures. But the web's most dynamic medium, video, was treated like a passive file: feed in a clip, ask a question, get an answer. That's not how people research. In real life, you search, compare multiple videos, and verify details.
The problem: Open-web video research needs both scale (searching many candidates) and precision (zooming into seconds that matter). Direct visual inference gives precision but is expensive; text summaries give scale but miss fine details.
Hook: Like trying to find one special seashell by either combing every grain of sand (too slow) or only reading a beach brochure (too vague).
The Concept (Perception–Context trade-off): It's the tug-of-war between seeing a lot (perception) and keeping the prompt short (context). How it works:
- More frames = better visual recall
- But more frames = more tokens in context
- Big context slows and crowds out reasoning
- You must choose where to spend tokens
Why it matters: If you push everything into context, you can't reason deeply; if you cut too much, you miss the answer.
Anchor: Streaming thousands of frames from five long videos leaves no room to think; summarizing to 20 lines hides the red pen.
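Rough, back-of-the-envelope arithmetic makes the trade-off concrete; all the numbers below (tokens per frame, context size, sampling rates) are illustrative assumptions, not figures from the paper:

```python
TOKENS_PER_FRAME = 85        # assumption: rough visual-token cost of one low-res frame
CONTEXT_WINDOW = 128_000     # assumption: a typical model context budget

def visual_token_cost(n_videos: int, frames_per_video: int) -> int:
    """Tokens consumed by frames alone, before any text or reasoning."""
    return n_videos * frames_per_video * TOKENS_PER_FRAME

# "Watch everything": 5 one-hour videos sampled at one frame every 5 seconds (720 frames each).
dense = visual_token_cost(n_videos=5, frames_per_video=720)
# Pyramid-style: only 3 short windows, densely decoded at ~30 frames each.
sparse = visual_token_cost(n_videos=3, frames_per_video=30)

print(f"dense scan: {dense:,} tokens (~{dense / CONTEXT_WINDOW:.1f}x the context window)")
print(f"zoomed-in:  {sparse:,} tokens (~{sparse / CONTEXT_WINDOW:.1%} of the context window)")
```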
Failed attempts:
- Watch everything (accurate, but unscalable across multiple long videos).
- Read only text (efficient, but misses visual truth like colors, counts, or who touched the ball).
The gap: We need a way to cheaply narrow down candidates, roughly locate promising moments, then carefully zoom into tiny windows for visual truth.
Real stakes:
- Sports: Identifying who blocked a shot in a specific 2024 playoff moment.
- Movies: Confirming a tiny prop detail that carries symbolism.
- Tutorials/Shopping: Verifying if a tool or feature is actually shown working.
- News literacy: Checking if a claim matches what's visible in the original footage.
In all of these, answers are short and objective (a name, a color, a count), but finding them requires smart, visual, timestamped verification across multiple videos.
02 Core Idea
The "aha!" moment in one sentence: Treat the web of videos like a pyramid (first filter with cheap text, then localize with sparse scans, and only at the end zoom in with expensive, high-fidelity vision) while an agent plans and loops until it has visual proof.
Hook: Imagine birdwatching with binoculars: you first pick the right tree (filter), then scan branches (localize), then zoom on the exact feather pattern (zoom-in).
The Concept (Pyramidal Perception): It is a three-stage way to spend compute only where it counts. How it works:
- Stage I – Semantic Filter: Use titles/snippets to toss out irrelevant videos (no frames yet)
- Stage II – Sparse Localization: Read transcripts and sample a few frames to propose time windows
- Stage III – Zoom-in: Densely decode only those windows to resolve tiny visual details
Why it matters: You avoid context explosion but still see what you need to answer correctly.
Anchor: Searching for Walter's pen, you ignore irrelevant uploads by title, glance sparsely to find moments with shirt close-ups, then zoom on those seconds to spot the red cap.
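A minimal sketch of how the three stages could be chained so each stage handles fewer items at a higher cost per item; the `stage*` callables are hypothetical hooks that a real system would back with LLM/VLM calls, and the default budgets are illustrative:

```python
from typing import Callable

def pyramidal_perception(
    question: str,
    candidates: list[dict],
    stage1_score: Callable[[str, dict], float],                        # metadata-only relevance
    stage2_windows: Callable[[str, dict], list[tuple[float, float]]],  # sparse [start, end] proposals
    stage3_zoom: Callable[[str, dict, tuple[float, float]], str],      # dense look at one window
    keep_after_filter: int = 5,
    windows_per_video: int = 2,
) -> list[str]:
    """Stage I -> Stage II -> Stage III: more compute per item at each level, on fewer items."""
    # Stage I: semantic filter over titles/snippets; no frames are decoded here.
    ranked = sorted(candidates, key=lambda v: stage1_score(question, v), reverse=True)
    retained = ranked[:keep_after_filter]

    # Stage II: sparse localization proposes a few short time windows per retained video.
    proposals = [(video, window)
                 for video in retained
                 for window in stage2_windows(question, video)[:windows_per_video]]

    # Stage III: zoom-in densely decodes only those windows and reports what it sees.
    return [stage3_zoom(question, video, window) for video, window in proposals]
```

The only expensive step, Stage III, never touches more than keep_after_filter × windows_per_video short windows, which is what keeps the context from exploding.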
Three analogies:
- Detective: Skim alibis (filter), scan CCTV at key hours (localize), watch the exact minute in HD (zoom-in).
- Museum guide: Check the floor map (filter), stroll past relevant rooms (localize), step close to read the tiny plaque (zoom-in).
- Treasure hunt: Read the clue (filter), check likely spots (localize), dig only where X marks the spot (zoom-in).
Before vs. After:
- Before: Either stream tons of frames (accurate but slow and costly) or rely on text (fast but visually blind).
- After: Use the pyramid to prune early, find where to look, and verify with dense vision in small slices; you get higher accuracy with far fewer tokens.
Hook: You know how a class group project works best when roles are clear: one plans, one gathers, one checks?
The Concept (Planner–Watcher–Analyst loop): It's an agent with three roles and a memory that iterates until the answer is proven. How it works:
- Planner: Breaks the query into subtasks, crafts search keywords, and reads Watcher feedback
- Watcher: Applies the pyramid to select videos and propose timestamp windows
- Analyst: Zooms into the proposed windows to extract fine-grained visual facts
- Memory + Loop: If unsure, the Planner searches again and refines focus until confident
Why it matters: Without clear roles and feedback, the agent either wastes compute or misses the crucial frame.
Anchor: For "Who blocked that 2024 3-point attempt?", the Planner targets Finals clips, the Watcher proposes a 10-second window, and the Analyst confirms from frames and commentary that it was Jayson Tatum's shot.
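A minimal sketch of the iterative loop with a shared memory; the `planner`, `watcher`, and `analyst` arguments are hypothetical callables standing in for the roles described above, not the paper's actual interfaces:

```python
def browse(question, planner, watcher, analyst,
           max_rounds: int = 4, confidence_threshold: float = 0.8):
    """Plan -> watch -> analyze, looping until the Analyst is confident or the budget runs out."""
    memory = {"queries": [], "observations": []}
    answer, confidence = None, 0.0
    for _ in range(max_rounds):
        # Planner reads everything tried so far and proposes the next search query.
        query = planner(question, memory)
        memory["queries"].append(query)

        # Watcher runs the pyramid (filter -> localize -> zoom) and returns observations.
        memory["observations"].extend(watcher(question, query))

        # Analyst synthesizes all evidence gathered so far into a short answer + confidence.
        answer, confidence = analyst(question, memory["observations"])
        if confidence >= confidence_threshold:
            break  # enough visual proof: stop searching
    return answer, confidence
```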
Why it works (intuition, no equations):
- Early cheap filters cut 80–90% of dead ends before any heavy vision.
- Sparse scans are like "blips on radar": you don't need every frame to guess likely moments.
- Zoom-in reserves expensive vision for tiny windows (e.g., 6–20 seconds), so you see jersey numbers, object colors, and micro-actions clearly.
- The loop keeps the agent honest: if evidence is weak, it searches differently rather than hallucinating.
Building blocks:
- Global memory of searches tried, candidates seen, and observations
- Planner that decomposes the task and adapts queries
- Watcher that runs Stage I (metadata filter), Stage II (sparse transcript + frames), Stage III (dense window decoding)
- Analyst that synthesizes visual evidence into a short, verifiable answer
Together, these pieces transform open-web video browsing from guesswork-prone text search into verifiable visual research.
03 Methodology
At a high level: Natural-language question → Planner (make smart searches) → Watcher (Stage I filter → Stage II localize → Stage III zoom-in) → Analyst (confirm tiny visual facts) → Short answer.
Step 0: Inputs and outputs
- What happens: The system receives a userâs question and access to the open web. It should return a short, objective answer (e.g., a name, color, or count).
- Why it exists: Short answers are easy to auto-check and keep the goal crisp.
- Example: "In the movie directed by Ben Stiller, what color is Walter Mitty's pen?" → "Red."
Hook: You know how you first plan a grocery run before walking the aisles?
The Concept (Planner): It's the brain that decides what to search next. How it works:
- Break the big question into subtasks
- Generate search queries for each subtask
- Read Watcher feedback about which videos/times look promising
- Repeat with refined queries until confident
Why it matters: Without planning, the agent wastes time on broad, unfocused searches.
Anchor: For the Barkley/Yao bet, the Planner pivots from "NBA bet donkey butt" to "Charles Barkley Yao Ming 19 points Inside the NBA," which instantly finds the right clips.
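A sketch of the Planner's refinement step, assuming a hypothetical `llm` callable that maps a prompt string to a completion; the prompt wording is illustrative, not the paper's:

```python
def plan_next_query(question: str, memory: dict, llm) -> str:
    """Ask an LLM for the next search query, conditioned on what has already been tried."""
    tried = "\n".join(memory.get("queries", [])) or "(nothing yet)"
    evidence = "\n".join(str(o) for o in memory.get("observations", [])) or "(no evidence yet)"
    prompt = (
        "You are planning web searches to answer a question using video evidence.\n"
        f"Question: {question}\n"
        f"Queries already tried:\n{tried}\n"
        f"Evidence gathered so far:\n{evidence}\n"
        "Propose ONE new, more specific search query (add entities, year, event, or show name)."
    )
    return llm(prompt).strip()
```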
Hook: Think of a movie night: first you toss out movies that clearly don't fit before picking scenes to watch closely.
The Concept (Watcher, Stage I – Semantic Filter): It's a cheap text sieve that prunes irrelevant videos by reading only titles and snippets. How it works:
- Collect candidate videos from search
- Read metadata only (no video frames yet)
- Score relevance to the subtask
- Keep the best few
Why it matters: Saves tons of compute by not decoding irrelevant videos.
Anchor: If the question is about a 2024 NBA Finals moment, Stage I drops 2016 highlight compilations by their titles.
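A toy version of the Stage I filter, with simple keyword overlap standing in for the LLM relevance judgment the real Watcher would make; the point is that only titles and snippets are read, never frames:

```python
def stage1_filter(question: str, candidates: list[dict], keep: int = 5) -> list[dict]:
    """Score candidates by cheap metadata overlap and keep the best few.

    Each candidate only needs 'title' and 'snippet'; no video bytes are touched."""
    query_terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}

    def score(video: dict) -> int:
        text = f"{video.get('title', '')} {video.get('snippet', '')}".lower()
        return sum(term in text for term in query_terms)

    ranked = sorted(candidates, key=score, reverse=True)
    return [v for v in ranked[:keep] if score(v) > 0]

candidates = [
    {"title": "2024 NBA Finals Game 2 full highlights", "snippet": "Mavericks vs Celtics"},
    {"title": "Top 10 blocks of 2016", "snippet": "classic highlight compilation"},
]
print(stage1_filter("Who blocked the 3-pointer in the 2024 NBA Finals?", candidates))
```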
Hook: Like flipping through a book by skimming chapter headings and a few pictures.
The Concept (Watcher, Stage II – Sparse Localization): It finds "where" in a retained video the answer might be. How it works:
- Read the whole transcript to find likely mentions and times
- Sample a small set of frames across the video
- Use both to propose short time windows [start, end]
- Return a list of candidate windows as "glimpses"
Why it matters: It shrinks hour-long videos into manageable slices for close inspection.
Anchor: For "Who blocked the 3-pointer?", Stage II spots transcript lines near "blocked by…" and proposes a 12–20 second window.
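A toy Stage II localizer that scans a timestamped transcript for question terms and pads each hit into a short [start, end] window; the real Watcher would also glance at a handful of sparsely sampled frames, which is omitted here:

```python
def stage2_propose_windows(question: str, transcript: list[tuple[float, str]],
                           pad_s: float = 5.0, max_windows: int = 3) -> list[tuple[float, float]]:
    """Turn transcript lines that mention question terms into short time windows.

    `transcript` is a list of (timestamp_seconds, text) pairs, e.g. from captions."""
    terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    hits = [t for t, line in transcript if any(term in line.lower() for term in terms)]
    # Pad each hit into a small window and keep only the first few proposals.
    return [(max(0.0, t - pad_s), t + pad_s) for t in hits[:max_windows]]

transcript = [(612.0, "alley-oop finished at the rim"),
              (618.5, "Tatum steps back... blocked by Gafford!")]
print(stage2_propose_windows("Who had his 3-pointer blocked after the alley-oop?", transcript))
```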
Hook: When you see a suspicious blur in a replay, you pause and step through frame by frame.
The Concept (Watcher, Stage III – Zoom-in): It decodes only the chosen windows at high fidelity. How it works:
- Densely sample frames inside each window
- Feed them to a vision-language model
- Extract precise visual facts (jersey number, color, count)
- Provide a crisp visual summary to the Analyst
Why it matters: This is where tiny details are finally visible and reliable.
Anchor: Zooming into Walter's shirt-pocket frames reveals the pen cap is clearly red.
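A sketch of Stage III, densely sampling timestamps only inside a proposed window; `decode_frame` and `ask_vlm` are hypothetical hooks for a frame decoder and a vision-language model call, not real APIs:

```python
from typing import Callable

def stage3_zoom_in(
    question: str,
    video_url: str,
    window: tuple[float, float],
    decode_frame: Callable[[str, float], bytes],  # hypothetical frame-decoder hook
    ask_vlm: Callable[[str, list[bytes]], str],   # hypothetical vision-language model hook
    fps: float = 2.0,
) -> str:
    """Densely inspect one short [start, end] window instead of the whole video."""
    start, end = window
    n_frames = max(1, int((end - start) * fps))
    times = [start + i / fps for i in range(n_frames)]   # dense, but only inside the window
    frames = [decode_frame(video_url, t) for t in times]
    # A ten-second window at 2 fps is ~20 frames: cheap enough to read a jersey number
    # or a pen cap's color without blowing up the context.
    return ask_vlm(f"{question} (answer only from these frames)", frames)
```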
Hook: Think of a science fair judge who reads your notes, then looks closely at your experiment to confirm.
The Concept (Analyst): It's the final reasoner that uses all windows to produce the short answer. How it works:
- Read the zoomed-in clips and transcripts
- Cross-check across multiple videos if needed
- Resolve conflicts and choose the best-supported answer
- Output a short, verifiable string plus confidence
Why it matters: Without this careful synthesis, the system might over-trust a single clip.
Anchor: Multiple clips confirm: "Sunday night Yao Ming went 9-for-9… He scored 20 points." The Analyst outputs "20 points."
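A toy Analyst that cross-checks the per-window answers by simple agreement voting; the real Analyst reasons over the evidence with a multimodal model, but the voting logic captures the "best-supported answer plus confidence" idea:

```python
from collections import Counter

def analyst(per_window_answers: list[str]) -> tuple[str, float]:
    """Pick the answer best supported across windows and report agreement as confidence."""
    if not per_window_answers:
        return "unknown", 0.0
    normalized = [a.strip().lower() for a in per_window_answers]
    best, votes = Counter(normalized).most_common(1)[0]
    confidence = votes / len(normalized)   # crude proxy: fraction of windows that agree
    return best, confidence

print(analyst(["20 points", "20 points", "19 points"]))   # -> ('20 points', 0.67)
```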
Secret sauce (why this recipe is clever):
- Spend cheap tokens first (metadata), save heavy vision for the final inches.
- Keep a feedback loop so the Planner can adjust queries when evidence is weak.
- Use both text (transcripts) and vision (frames) so the agent isn't blind to either modality.
- Constrain outputs to short, judgeable strings to avoid format mismatches.
What breaks without each step:
- No Planner: The system keeps searching randomly and wastes budget.
- No Stage I: Compute explodes on irrelevant videos.
- No Stage II: You don't know where to look; context balloons with long clips.
- No Stage III: You miss fine details like colors, jersey numbers, logos.
- No Analyst: Conflicting snippets never resolve into one trustworthy answer.
Concrete mini-walkthrough:
- Input: "In a 2024 playoff game, who had his 3-pointer blocked after an alley-oop sequence?"
- Planner: Searches "Mavericks Celtics 2024 alley-oop block who 3-pointer."
- Watcher Stage I: Selects Finals videos by titles.
- Stage II: Finds transcript: "Tatum step back … blocked by Gafford." Proposes a 10–20s window.
- Stage III: Zooms in, sees the action and hears commentary.
- Analyst: Confirms "Jayson Tatum."
- Output: "Jayson Tatum."
04 Experiments & Results
Hook: Think of a spelling bee where the judge only accepts exact, provable answers.
The Concept (Video-BrowseComp benchmark): It's a test set that forces agents to rely on videos, not just text shortcuts. How it works:
- Questions need video evidence (mandatory video dependency)
- Answers are short and objective (names, colors, counts)
- Three levels: explicit retrieval (Level 1), implicit retrieval (Level 2), multi-source reasoning (Level 3)
Why it matters: It mirrors real video research: find, verify, and cross-check across sources.
Anchor: "What color is Walter Mitty's pen?" can only be answered by seeing frames, not by reading a wiki.
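To make the setup concrete, here is what one benchmark item could look like as a plain record; the field names, the level assignment, and the toy exact-match judge are illustrative assumptions, not actual Video-BrowseComp data or its judging code:

```python
example_item = {
    "level": 1,              # assumed: 1 = explicit retrieval, 2 = implicit, 3 = multi-source reasoning
    "question": "In the movie directed by Ben Stiller, what color is Walter Mitty's pen?",
    "answer": "red",         # short, objective string that a judge can check
    "requires_video": True,  # mandatory video dependency: text sources alone are not enough
}

def exact_match_judge(prediction: str, item: dict) -> bool:
    """Toy check; the benchmark itself uses an LLM judge for more flexible matching."""
    return prediction.strip().lower() == item["answer"].strip().lower()

print(exact_match_judge("Red", example_item))  # True
```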
Hook: You know how a teacher can grade both your answers and how confident you were in them?
The Concept (Overall Accuracy and Calibration Error): Overall Accuracy (OA) measures how often answers match the ground truth; Calibration Error (CE) checks whether reported confidence matches reality. How it works:
- Use an LLM judge to compare the predicted short answer with the ground truth
- Ask the model to report a confidence score
- Compute how well confidence tracks actual correctness (lower CE is better)
Why it matters: We need both correct answers and honest uncertainty.
Anchor: If the agent says "Red" with 95% confidence and it's right, that's good calibration; if it's wrong but still 95% confident, that's bad.
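A sketch of how the two metrics could be computed over judged records; the `llm_judge` callable is a placeholder for the LLM-based answer check, and the calibration error shown is a standard binned formulation that may differ in detail from the paper's definition:

```python
def overall_accuracy(records: list[dict], llm_judge) -> float:
    """Fraction of questions where the judge accepts the predicted short answer."""
    verdicts = [llm_judge(r["prediction"], r["ground_truth"]) for r in records]
    return sum(verdicts) / len(verdicts)

def calibration_error(records: list[dict], llm_judge, n_bins: int = 10) -> float:
    """Binned calibration error: |mean confidence - accuracy|, weighted over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        correct = float(llm_judge(r["prediction"], r["ground_truth"]))
        b = min(int(r["confidence"] * n_bins), n_bins - 1)   # which confidence bin this falls into
        bins[b].append((r["confidence"], correct))
    error = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        avg_acc = sum(a for _, a in items) / len(items)
        error += (len(items) / len(records)) * abs(avg_conf - avg_acc)
    return error

records = [{"prediction": "Red",  "ground_truth": "red", "confidence": 0.95},
           {"prediction": "Blue", "ground_truth": "red", "confidence": 0.90}]
judge = lambda pred, gold: pred.strip().lower() == gold.strip().lower()
print(overall_accuracy(records, judge), round(calibration_error(records, judge), 3))
```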
The competition:
- Tool-free MLLMs (no web tools) tried to answer from internal knowledge and small context.
- Search-augmented models (big vendor tools) could browse the web but often stayed text-centric.
- Baselines included direct visual inference (watch many frames) and text-centric summarization (compress to text).
Scoreboard with context:
- Tool-free models topped out under ~20% accuracy; parametric memory can't replace real video evidence.
- Search-augmented systems improved Level 1 (explicit clues) but stumbled on Level 2/3 where visual proof is key.
- Video-Browser, using Pyramidal Perception, achieved a 37.5% relative accuracy improvement over direct visual inference, while reducing tokens by 58.3%. That's like getting a strong B when others were stuck around a C, but paying half the cost.
Efficiency findings:
- Direct visual inference: decent accuracy but massive token/context usage (context explosion).
- Summarization: tiny context but lost key visuals; accuracy dipped and calibration worsened.
- Pyramidal Perception: best of both; highest accuracy with far fewer tokens by zooming only where needed.
Surprising observations:
- Models that "browse" the web still often behave like text search engines; when text is silent (e.g., specific sports moments), they fail unless they truly see the video.
- Test-time scaling helps: more candidates, deeper loops, and denser sparse scans all steadily improve accuracy, evidence that the loop and the pyramid effectively use extra compute.
- Using transcripts alone severely underperforms; explicit visual perception is mandatory to bridge the modality gap.
Takeaway: Verifiable open-web video research isn't about watching everything or reading everything; it's about watching just the right few seconds, reliably.
05 Discussion & Limitations
Limitations:
- Benchmark size: 210 carefully curated questions keep costs reasonable but limit broad coverage; it's a "golden set" rather than a giant leaderboard.
- Compute: Even with savings, multi-round search plus decoding windows still costs tokens; scaling to thousands of queries requires budget planning.
- Tiny-object recognition: Very small logos/low-res text can still fool current models, even after zoom-in.
- Semantic distractors: Videos with similar themes (e.g., Icelandic foods) can mislead selection/localization unless entity checks are strict.
- Information deficits: If neither frames nor transcripts contain the needed stat, the agent must fetch external structured data or abstain.
Required resources:
- Access to web search for video retrieval, transcripts (when available), and video downloading/decoding.
- An MLLM capable of multimodal reasoning with efficient token usage.
- Modest storage/cache for candidate clips and transcripts, plus a scheduler for multi-iteration runs.
When not to use:
- Questions answerable from text alone (use simple web search to save cost).
- Ultra-fine OCR tasks on tiny, blurry frames (specialized OCR may be better).
- Domains with restricted or paywalled video access (the agent can't fetch the needed evidence).
Open questions:
- Can we learn better selection/localization policies that further cut tokens without hurting accuracy?
- How to robustly verify identities (players, actors) under occlusion, fast motion, or look-alikes?
- Can retrieval incorporate cross-video alignment (e.g., track the same event across multiple broadcasts) automatically?
- How to calibrate uncertainty under multi-source disagreement so the agent knows when to abstain?
- What training or RL signals best encourage "verify before trust" behavior in video research agents?
06 Conclusion & Future Work
Three-sentence summary: This paper defines Agentic Video Browsing and introduces Video-BrowseComp to test it with mandatory video evidence. It proposes Video-Browser, which uses Pyramidal Perception (filter by cheap text, localize with sparse signals, then zoom in with dense vision), coordinated by a Planner–Watcher–Analyst loop. The system improves accuracy by 37.5% over direct visual inference while cutting token usage by 58.3%, showing that precise, efficient video verification is possible on the open web.
Main achievement: Turning open-web video research into a practical, verifiable process by allocating compute only where the truth lives: within short, carefully chosen windows.
Future directions:
- Smarter policies for selection and localization; tighter entity verification to resist distractors.
- Better handling of tiny text/logos and long-range cross-video reasoning.
- Scaling the benchmark while keeping costs accessible and answers unambiguous.
Why remember this: It reframes video browsing from "watch everything" or "read summaries" into "search smart, look briefly, verify visually," which is exactly what reliable, efficient AI agents must do to understand a world increasingly told in video.
Practical Applications
- Sports analytics assistants that identify who made a block, foul, or assist in specific game moments.
- Fact-checking tools that verify claims by locating and inspecting the exact video timestamps.
- Shopping helpers that confirm whether a product feature is visibly demonstrated in review videos.
- Education search that jumps to the precise lecture moment explaining a concept (e.g., a "semantic gap" slide).
- Customer support bots that find and highlight the exact step in a tutorial video that solves a user's issue.
- News literacy apps that cross-check multiple video sources to resolve conflicting reports.
- Video archive search that localizes where a person, object, or color appears across long footage.
- Film analysis tools that verify props, continuity, and scene details for editors and researchers.
- Safety compliance auditors that confirm if procedures were visibly followed in workplace videos.
- Esports assistants that find and verify critical plays across multiple match VODs.