Video-Browser: Towards Agentic Open-web Video Browsing
Key Summary
- The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
- Old methods either watched tons of frames (accurate but super expensive) or only read text summaries (cheap but missed tiny visual facts).
- The authors define Agentic Video Browsing as an iterative plan–search–verify loop that depends on video evidence, not just web pages.
- They build Video-BrowseComp, a benchmark with three difficulty levels that force agents to find and verify answers in videos.
- Their agent, Video-Browser, uses Pyramidal Perception: filter with cheap text metadata, localize with sparse scans, then zoom in with high-fidelity vision.
- Compared to direct visual inference, Video-Browser improves accuracy by 37.5% while cutting token usage by 58.3%.
- The system includes a Planner (decides what to search), a Watcher (finds where to look in videos), and an Analyst (confirms fine visual details).
- It performs well across tasks that require visual grounding, like confirming who blocked a shot in an NBA playoff game or spotting a red pen cap in a movie.
- The work highlights a key lesson: reliable web research with videos needs both smart search and precise visual checking.
Why This Research Matters
So much of modern life is on video: sports, news, reviews, and tutorials. If AI can't find and verify facts inside videos, it will miss what people really care about. This work shows how to search widely yet look closely, making answers both efficient and trustworthy. It can help verify viral claims, check product features, or confirm exactly what happened in a key play. By allocating compute only where it counts, it reduces cost while improving reliability. As video volume grows, this approach is a blueprint for building agents that research, not just regurgitate.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're trying to find the exact moment in a two-hour movie when a tiny red pen cap peeks out of a pocket. You could watch the whole thing, or you could skim summaries, but summaries probably won't mention a pen cap.
The Concept (Direct visual inference): It's when an AI "watches" a lot of video frames directly to answer a question. How it works:
- Download long videos
- Sample many frames (sometimes a huge number)
- Feed frames into a vision-language model
- Generate an answer from what it "saw"
Why it matters: Without it, you can't confirm fine visual facts (like a pen's color). But it can cause a "context explosion": too many frames, too many tokens, too slow.
Anchor: The AI tries to confirm Walter Mitty's pen color by scanning frames, but it spends a fortune in compute just to find a tiny red cap.
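A minimal sketch of this brute-force baseline, assuming hypothetical inputs (a URL-to-duration map) and leaving out the actual frame decoding and model call; the frame counts are illustrative, not the paper's settings:

```python
def uniform_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Evenly spaced sample times across a video; no notion of 'where to look'."""
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 2) for i in range(n_frames)]

def direct_visual_inference(question: str, videos: dict[str, float],
                            frames_per_video: int = 256) -> list[tuple[str, float]]:
    """Naive baseline: sample many frames from every candidate and send them all at once.

    `videos` maps a video URL to its duration in seconds. The returned plan is just
    (url, timestamp) pairs; a real system would decode each frame and pass the whole
    batch to a vision-language model in a single prompt.
    """
    frame_plan = [
        (url, t)
        for url, duration in videos.items()
        for t in uniform_timestamps(duration, frames_per_video)
    ]
    # Every sampled frame lands in the prompt, so cost grows linearly with
    # len(videos) * frames_per_video: the "context explosion".
    return frame_plan

plan = direct_visual_inference("What color is Walter Mitty's pen?",
                               {"https://example.com/mitty_clip": 7200.0})
print(len(plan), "frames would be decoded into a single prompt")
```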
Hook: You know how reading a movie's plot summary won't tell you the color of someone's socks?
The Concept (Text-centric summarization): It's when an AI mostly reads titles, transcripts, and summaries instead of watching many frames. How it works:
- Grab video titles, descriptions, transcripts
- Compress them into short text notes
- Answer from the notes
Why it matters: It's cheap and fast, but it misses small visual clues; this is the "modality gap" between text and vision.
Anchor: The summary says Walter daydreams often, but never mentions the red pen cap, so the AI answers "not specified."
Hook: Think of a librarian who doesn't just read book jackets; they pull the exact page you need.
The Concept (Agentic Video Browsing): It's an AI that plans, searches, opens videos, checks timestamps, and verifies visual evidence step by step. How it works:
- Break the question into subtasks
- Search the web and pick candidate videos
- Peek at key moments
- Confirm the answer with visual proof
Why it matters: Many real answers live inside videos (sports plays, tutorials, product demos). Without agentic browsing, AI guesses or over-trusts text.
Anchor: To answer "Who blocked the shot after an alley-oop?", the agent searches game clips, jumps to likely timestamps, and verifies the player's name on-screen.
The world before: AI agents got good at reading web pages and looking at images. Benchmarks like GAIA and BrowseComp tested browsing on text and static pictures. But the web's most dynamic medium, video, was treated like a passive file: feed in a clip, ask a question, get an answer. That's not how people research. In real life, you search, compare multiple videos, and verify details.
The problem: Open-web video research needs both scale (searching many candidates) and precision (zooming into seconds that matter). Direct visual inference gives precision but is expensive; text summaries give scale but miss fine details.
Hook: Like trying to find one special seashell by either combing every grain of sand (too slow) or only reading a beach brochure (too vague).
The Concept (Perception–Context trade-off): It's the tug-of-war between seeing a lot (perception) and keeping the prompt short (context). How it works:
- More frames = better visual recall
- But more frames = more tokens in context
- Big context slows and crowds out reasoning
- You must choose where to spend tokens
Why it matters: If you push everything into context, you can't reason deeply; if you cut too much, you miss the answer.
Anchor: Streaming thousands of frames from five long videos leaves no room to think; summarizing to 20 lines hides the red pen.
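Rough, back-of-the-envelope arithmetic makes the trade-off concrete; all the numbers below (tokens per frame, context size, sampling rates) are illustrative assumptions, not figures from the paper:

```python
TOKENS_PER_FRAME = 85        # assumption: rough visual-token cost of one low-res frame
CONTEXT_WINDOW = 128_000     # assumption: a typical model context budget

def visual_token_cost(n_videos: int, frames_per_video: int) -> int:
    """Tokens consumed by frames alone, before any text or reasoning."""
    return n_videos * frames_per_video * TOKENS_PER_FRAME

# "Watch everything": 5 one-hour videos sampled at one frame every 5 seconds (720 frames each).
dense = visual_token_cost(n_videos=5, frames_per_video=720)
# Pyramid-style: only 3 short windows, densely decoded at ~30 frames each.
sparse = visual_token_cost(n_videos=3, frames_per_video=30)

print(f"dense scan: {dense:,} tokens (~{dense / CONTEXT_WINDOW:.1f}x the context window)")
print(f"zoomed-in:  {sparse:,} tokens (~{sparse / CONTEXT_WINDOW:.1%} of the context window)")
```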
Failed attempts:
- Watch everything (accurate, but unscalable across multiple long videos).
- Read only text (efficient, but misses visual truth like colors, counts, or who touched the ball).
The gap: We need a way to cheaply narrow down candidates, roughly locate promising moments, then carefully zoom into tiny windows for visual truth.
Real stakes:
- Sports: Identifying who blocked a shot in a specific 2024 playoff moment.
- Movies: Confirming a tiny prop detail that carries symbolism.
- Tutorials/Shopping: Verifying if a tool or feature is actually shown working.
- News literacy: Checking if a claim matches what's visible in the original footage.
In all of these, answers are short and objective (a name, a color, a count), but finding them requires smart, visual, timestamped verification across multiple videos.
02 Core Idea
The "aha!" moment in one sentence: Treat the web of videos like a pyramid (first filter with cheap text, then localize with sparse scans, and only at the end zoom in with expensive, high-fidelity vision) while an agent plans and loops until it has visual proof.
Hook: Imagine birdwatching with binoculars: you first pick the right tree (filter), then scan branches (localize), then zoom on the exact feather pattern (zoom-in).
The Concept (Pyramidal Perception): It is a three-stage way to spend compute only where it counts. How it works:
- Stage I – Semantic Filter: Use titles/snippets to toss out irrelevant videos (no frames yet)
- Stage II – Sparse Localization: Read transcripts and sample a few frames to propose time windows
- Stage III – Zoom-in: Densely decode only those windows to resolve tiny visual details
Why it matters: You avoid context explosion but still see what you need to answer correctly.
Anchor: Searching for Walter's pen, you ignore irrelevant uploads by title, glance sparsely to find moments with shirt close-ups, then zoom on those seconds to spot the red cap.
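A minimal sketch of how the three stages could be chained so each stage handles fewer items at a higher cost per item; the `stage*` callables are hypothetical hooks that a real system would back with LLM/VLM calls, and the default budgets are illustrative:

```python
from typing import Callable

def pyramidal_perception(
    question: str,
    candidates: list[dict],
    stage1_score: Callable[[str, dict], float],                        # metadata-only relevance
    stage2_windows: Callable[[str, dict], list[tuple[float, float]]],  # sparse [start, end] proposals
    stage3_zoom: Callable[[str, dict, tuple[float, float]], str],      # dense look at one window
    keep_after_filter: int = 5,
    windows_per_video: int = 2,
) -> list[str]:
    """Stage I -> Stage II -> Stage III: more compute per item at each level, on fewer items."""
    # Stage I: semantic filter over titles/snippets; no frames are decoded here.
    ranked = sorted(candidates, key=lambda v: stage1_score(question, v), reverse=True)
    retained = ranked[:keep_after_filter]

    # Stage II: sparse localization proposes a few short time windows per retained video.
    proposals = [(video, window)
                 for video in retained
                 for window in stage2_windows(question, video)[:windows_per_video]]

    # Stage III: zoom-in densely decodes only those windows and reports what it sees.
    return [stage3_zoom(question, video, window) for video, window in proposals]
```

The only expensive step, Stage III, never touches more than keep_after_filter × windows_per_video short windows, which is what keeps the context from exploding.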
Three analogies:
- Detective: Skim alibis (filter), scan CCTV at key hours (localize), watch the exact minute in HD (zoom-in).
- Museum guide: Check the floor map (filter), stroll past relevant rooms (localize), step close to read the tiny plaque (zoom-in).
- Treasure hunt: Read the clue (filter), check likely spots (localize), dig only where X marks the spot (zoom-in).
Before vs. After:
- Before: Either stream tons of frames (accurate but slow and costly) or rely on text (fast but visually blind).
- After: Use the pyramid to prune early, find where to look, and verify with dense vision in small slices; you get higher accuracy with far fewer tokens.
Hook: You know how a class group project works best when roles are clear: one plans, one gathers, one checks?
The Concept (Planner–Watcher–Analyst loop): It's an agent with three roles and a memory that iterates until the answer is proven. How it works:
- Planner: Breaks the query into subtasks, crafts search keywords, and reads Watcher feedback
- Watcher: Applies the pyramid to select videos and propose timestamp windows
- Analyst: Zooms into the proposed windows to extract fine-grained visual facts
- Memory + Loop: If unsure, the Planner searches again and refines focus until confident
Why it matters: Without clear roles and feedback, the agent either wastes compute or misses the crucial frame.
Anchor: For "Who blocked that 2024 3-point attempt?", the Planner targets Finals clips, the Watcher proposes a 10-second window, and the Analyst confirms from frames and commentary that it was Jayson Tatum's shot.
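A minimal sketch of the iterative loop with a shared memory; the `planner`, `watcher`, and `analyst` arguments are hypothetical callables standing in for the roles described above, not the paper's actual interfaces:

```python
def browse(question, planner, watcher, analyst,
           max_rounds: int = 4, confidence_threshold: float = 0.8):
    """Plan -> watch -> analyze, looping until the Analyst is confident or the budget runs out."""
    memory = {"queries": [], "observations": []}
    answer, confidence = None, 0.0
    for _ in range(max_rounds):
        # Planner reads everything tried so far and proposes the next search query.
        query = planner(question, memory)
        memory["queries"].append(query)

        # Watcher runs the pyramid (filter -> localize -> zoom) and returns observations.
        memory["observations"].extend(watcher(question, query))

        # Analyst synthesizes all evidence gathered so far into a short answer + confidence.
        answer, confidence = analyst(question, memory["observations"])
        if confidence >= confidence_threshold:
            break  # enough visual proof: stop searching
    return answer, confidence
```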
Why it works (intuition, no equations):
- Early cheap filters cut 80–90% of dead ends before any heavy vision.
- Sparse scans are like "blips on radar": you don't need every frame to guess likely moments.
- Zoom-in reserves expensive vision for tiny windows (e.g., 6–20 seconds), so you see jersey numbers, object colors, and micro-actions clearly.
- The loop keeps the agent honest: if evidence is weak, it searches differently rather than hallucinating.
Building blocks:
- Global memory of searches tried, candidates seen, and observations
- Planner that decomposes the task and adapts queries
- Watcher that runs Stage I (metadata filter), Stage II (sparse transcript + frames), Stage III (dense window decoding)
- Analyst that synthesizes visual evidence into a short, verifiable answer
Together, these pieces transform open-web video browsing from guesswork-prone text search into verifiable visual research.
03 Methodology
At a high level: Natural-language question → Planner (make smart searches) → Watcher (Stage I filter → Stage II localize → Stage III zoom-in) → Analyst (confirm tiny visual facts) → Short answer.
Step 0: Inputs and outputs
- What happens: The system receives a userâs question and access to the open web. It should return a short, objective answer (e.g., a name, color, or count).
- Why it exists: Short answers are easy to auto-check and keep the goal crisp.
- Example: "In the movie directed by Ben Stiller, what color is Walter Mitty's pen?" → "Red."
Hook: You know how you first plan a grocery run before walking the aisles?
The Concept (Planner): It's the brain that decides what to search next. How it works:
- Break the big question into subtasks
- Generate search queries for each subtask
- Read Watcher feedback about which videos/times look promising
- Repeat with refined queries until confident
Why it matters: Without planning, the agent wastes time on broad, unfocused searches.
Anchor: For the Barkley/Yao bet, the Planner pivots from "NBA bet donkey butt" to "Charles Barkley Yao Ming 19 points Inside the NBA," which instantly finds the right clips.
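A sketch of the Planner's refinement step, assuming a hypothetical `llm` callable that maps a prompt string to a completion; the prompt wording is illustrative, not the paper's:

```python
def plan_next_query(question: str, memory: dict, llm) -> str:
    """Ask an LLM for the next search query, conditioned on what has already been tried."""
    tried = "\n".join(memory.get("queries", [])) or "(nothing yet)"
    evidence = "\n".join(str(o) for o in memory.get("observations", [])) or "(no evidence yet)"
    prompt = (
        "You are planning web searches to answer a question using video evidence.\n"
        f"Question: {question}\n"
        f"Queries already tried:\n{tried}\n"
        f"Evidence gathered so far:\n{evidence}\n"
        "Propose ONE new, more specific search query (add entities, year, event, or show name)."
    )
    return llm(prompt).strip()
```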
Hook: Think of a movie night: first you toss out movies that clearly don't fit before picking scenes to watch closely.
The Concept (Watcher, Stage I – Semantic Filter): It's a cheap text sieve that prunes irrelevant videos by reading only titles and snippets. How it works:
- Collect candidate videos from search
- Read metadata only (no video frames yet)
- Score relevance to the subtask
- Keep the best few
Why it matters: Saves tons of compute by not decoding irrelevant videos.
Anchor: If the question is about a 2024 NBA Finals moment, Stage I drops 2016 highlight compilations by their titles.
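A toy version of the Stage I filter, with simple keyword overlap standing in for the LLM relevance judgment the real Watcher would make; the point is that only titles and snippets are read, never frames:

```python
def stage1_filter(question: str, candidates: list[dict], keep: int = 5) -> list[dict]:
    """Score candidates by cheap metadata overlap and keep the best few.

    Each candidate only needs 'title' and 'snippet'; no video bytes are touched."""
    query_terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}

    def score(video: dict) -> int:
        text = f"{video.get('title', '')} {video.get('snippet', '')}".lower()
        return sum(term in text for term in query_terms)

    ranked = sorted(candidates, key=score, reverse=True)
    return [v for v in ranked[:keep] if score(v) > 0]

candidates = [
    {"title": "2024 NBA Finals Game 2 full highlights", "snippet": "Mavericks vs Celtics"},
    {"title": "Top 10 blocks of 2016", "snippet": "classic highlight compilation"},
]
print(stage1_filter("Who blocked the 3-pointer in the 2024 NBA Finals?", candidates))
```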
Hook: Like flipping through a book by skimming chapter headings and a few pictures.
The Concept (Watcher, Stage II – Sparse Localization): It finds "where" in a retained video the answer might be. How it works:
- Read the whole transcript to find likely mentions and times
- Sample a small set of frames across the video
- Use both to propose short time windows [start, end]
- Return a list of candidate windows as "glimpses"
Why it matters: It shrinks hour-long videos into manageable slices for close inspection.
Anchor: For "Who blocked the 3-pointer?", Stage II spots transcript lines near "blocked by…" and proposes a 12–20 second window.
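A toy Stage II localizer that scans a timestamped transcript for question terms and pads each hit into a short [start, end] window; the real Watcher would also glance at a handful of sparsely sampled frames, which is omitted here:

```python
def stage2_propose_windows(question: str, transcript: list[tuple[float, str]],
                           pad_s: float = 5.0, max_windows: int = 3) -> list[tuple[float, float]]:
    """Turn transcript lines that mention question terms into short time windows.

    `transcript` is a list of (timestamp_seconds, text) pairs, e.g. from captions."""
    terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    hits = [t for t, line in transcript if any(term in line.lower() for term in terms)]
    # Pad each hit into a small window and keep only the first few proposals.
    return [(max(0.0, t - pad_s), t + pad_s) for t in hits[:max_windows]]

transcript = [(612.0, "alley-oop finished at the rim"),
              (618.5, "Tatum steps back... blocked by Gafford!")]
print(stage2_propose_windows("Who had his 3-pointer blocked after the alley-oop?", transcript))
```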
Hook: When you see a suspicious blur in a replay, you pause and step through frame by frame.
The Concept (Watcher, Stage III – Zoom-in): It decodes only the chosen windows at high fidelity. How it works:
- Densely sample frames inside each window
- Feed them to a vision-language model
- Extract precise visual facts (jersey number, color, count)
- Provide a crisp visual summary to the Analyst
Why it matters: This is where tiny details are finally visible and reliable.
Anchor: Zooming into Walter's shirt-pocket frames reveals the pen cap is clearly red.
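A sketch of Stage III, densely sampling timestamps only inside a proposed window; `decode_frame` and `ask_vlm` are hypothetical hooks for a frame decoder and a vision-language model call, not real APIs:

```python
from typing import Callable

def stage3_zoom_in(
    question: str,
    video_url: str,
    window: tuple[float, float],
    decode_frame: Callable[[str, float], bytes],  # hypothetical frame-decoder hook
    ask_vlm: Callable[[str, list[bytes]], str],   # hypothetical vision-language model hook
    fps: float = 2.0,
) -> str:
    """Densely inspect one short [start, end] window instead of the whole video."""
    start, end = window
    n_frames = max(1, int((end - start) * fps))
    times = [start + i / fps for i in range(n_frames)]   # dense, but only inside the window
    frames = [decode_frame(video_url, t) for t in times]
    # A ten-second window at 2 fps is ~20 frames: cheap enough to read a jersey number
    # or a pen cap's color without blowing up the context.
    return ask_vlm(f"{question} (answer only from these frames)", frames)
```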
Hook: Think of a science fair judge who reads your notes, then looks closely at your experiment to confirm.
The Concept (Analyst): It's the final reasoner that uses all windows to produce the short answer. How it works:
- Read the zoomed-in clips and transcripts
- Cross-check across multiple videos if needed
- Resolve conflicts and choose the best-supported answer
- Output a short, verifiable string plus confidence
Why it matters: Without this careful synthesis, the system might over-trust a single clip.
Anchor: Multiple clips confirm: "Sunday night Yao Ming went 9-for-9… He scored 20 points." The Analyst outputs "20 points."
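A toy Analyst that cross-checks the per-window answers by simple agreement voting; the real Analyst reasons over the evidence with a multimodal model, but the voting logic captures the "best-supported answer plus confidence" idea:

```python
from collections import Counter

def analyst(per_window_answers: list[str]) -> tuple[str, float]:
    """Pick the answer best supported across windows and report agreement as confidence."""
    if not per_window_answers:
        return "unknown", 0.0
    normalized = [a.strip().lower() for a in per_window_answers]
    best, votes = Counter(normalized).most_common(1)[0]
    confidence = votes / len(normalized)   # crude proxy: fraction of windows that agree
    return best, confidence

print(analyst(["20 points", "20 points", "19 points"]))   # -> ('20 points', 0.67)
```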
Secret sauce (why this recipe is clever):
- Spend cheap tokens first (metadata), save heavy vision for the final inches.
- Keep a feedback loop so the Planner can adjust queries when evidence is weak.
- Use both text (transcripts) and vision (frames) so the agent isn't blind to either modality.
- Constrain outputs to short, judgeable strings to avoid format mismatches.
What breaks without each step:
- No Planner: The system keeps searching randomly and wastes budget.
- No Stage I: Compute explodes on irrelevant videos.
- No Stage II: You don't know where to look; context balloons with long clips.
- No Stage III: You miss fine details like colors, jersey numbers, logos.
- No Analyst: Conflicting snippets never resolve into one trustworthy answer.
Concrete mini-walkthrough:
- Input: "In a 2024 playoff game, who had his 3-pointer blocked after an alley-oop sequence?"
- Planner: Searches "Mavericks Celtics 2024 alley-oop block who 3-pointer."
- Watcher Stage I: Selects Finals videos by titles.
- Stage II: Finds transcript: "Tatum step back … blocked by Gafford." Proposes a 10–20s window.
- Stage III: Zooms in, sees the action and hears commentary.
- Analyst: Confirms "Jayson Tatum."
- Output: "Jayson Tatum."
04 Experiments & Results
Hook: Think of a spelling bee where the judge only accepts exact, provable answers.
The Concept (Video-BrowseComp benchmark): It's a test set that forces agents to rely on videos, not just text shortcuts. How it works:
- Questions need video evidence (mandatory video dependency)
- Answers are short and objective (names, colors, counts)
- Three levels: explicit retrieval (Level 1), implicit retrieval (Level 2), multi-source reasoning (Level 3)
Why it matters: It mirrors real video research: find, verify, and cross-check across sources.
Anchor: "What color is Walter Mitty's pen?" can only be answered by seeing frames, not by reading a wiki.
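To make the setup concrete, here is what one benchmark item could look like as a plain record; the field names, the level assignment, and the toy exact-match judge are illustrative assumptions, not actual Video-BrowseComp data or its judging code:

```python
example_item = {
    "level": 1,              # assumed: 1 = explicit retrieval, 2 = implicit, 3 = multi-source reasoning
    "question": "In the movie directed by Ben Stiller, what color is Walter Mitty's pen?",
    "answer": "red",         # short, objective string that a judge can check
    "requires_video": True,  # mandatory video dependency: text sources alone are not enough
}

def exact_match_judge(prediction: str, item: dict) -> bool:
    """Toy check; the benchmark itself uses an LLM judge for more flexible matching."""
    return prediction.strip().lower() == item["answer"].strip().lower()

print(exact_match_judge("Red", example_item))  # True
```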
Hook: You know how a teacher can grade both your answers and how confident you were in them?
The Concept (Overall Accuracy and Calibration Error): Overall Accuracy (OA) measures how often answers match the ground truth; Calibration Error (CE) checks whether reported confidence matches reality. How it works:
- Use an LLM judge to compare the predicted short answer with the ground truth
- Ask the model to report a confidence score
- Compute how well confidence tracks actual correctness (lower CE is better)
Why it matters: We need both correct answers and honest uncertainty.
Anchor: If the agent says "Red" with 95% confidence and it's right, that's good calibration; if it's wrong but still 95% confident, that's bad.
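A sketch of how the two metrics could be computed over judged records; the `llm_judge` callable is a placeholder for the LLM-based answer check, and the calibration error shown is a standard binned formulation that may differ in detail from the paper's definition:

```python
def overall_accuracy(records: list[dict], llm_judge) -> float:
    """Fraction of questions where the judge accepts the predicted short answer."""
    verdicts = [llm_judge(r["prediction"], r["ground_truth"]) for r in records]
    return sum(verdicts) / len(verdicts)

def calibration_error(records: list[dict], llm_judge, n_bins: int = 10) -> float:
    """Binned calibration error: |mean confidence - accuracy|, weighted over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        correct = float(llm_judge(r["prediction"], r["ground_truth"]))
        b = min(int(r["confidence"] * n_bins), n_bins - 1)   # which confidence bin this falls into
        bins[b].append((r["confidence"], correct))
    error = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        avg_acc = sum(a for _, a in items) / len(items)
        error += (len(items) / len(records)) * abs(avg_conf - avg_acc)
    return error

records = [{"prediction": "Red",  "ground_truth": "red", "confidence": 0.95},
           {"prediction": "Blue", "ground_truth": "red", "confidence": 0.90}]
judge = lambda pred, gold: pred.strip().lower() == gold.strip().lower()
print(overall_accuracy(records, judge), round(calibration_error(records, judge), 3))
```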
The competition:
- Tool-free MLLMs (no web tools) tried to answer from internal knowledge and small context.
- Search-augmented models (big vendor tools) could browse the web but often stayed text-centric.
- Baselines included direct visual inference (watch many frames) and text-centric summarization (compress to text).
Scoreboard with context:
- Tool-free models topped out under ~20% accuracy; parametric memory can't replace real video evidence.
- Search-augmented systems improved Level 1 (explicit clues) but stumbled on Level 2/3 where visual proof is key.
- Video-Browser, using Pyramidal Perception, achieved a 37.5% relative accuracy improvement over direct visual inference, while reducing tokens by 58.3%. That's like getting a strong B when others were stuck around a C, but paying half the cost.
Efficiency findings:
- Direct visual inference: decent accuracy but massive token/context usage (context explosion).
- Summarization: tiny context but lost key visuals; accuracy dipped and calibration worsened.
- Pyramidal Perception: best of both; highest accuracy with far fewer tokens by zooming only where needed.
Surprising observations:
- Models that "browse" the web still often behave like text search engines; when text is silent (e.g., specific sports moments), they fail unless they truly see the video.
- Test-time scaling helps: more candidates, deeper loops, and denser sparse scans all steadily improve accuracy, evidence that the loop and the pyramid effectively use extra compute.
- Using transcripts alone severely underperforms; explicit visual perception is mandatory to bridge the modality gap.
Takeaway: Verifiable open-web video research isn't about watching everything or reading everything; it's about watching just the right few seconds, reliably.
05 Discussion & Limitations
Limitations:
- Benchmark size: 210 carefully curated questions keep costs reasonable but limit broad coverage; it's a "golden set" rather than a giant leaderboard.
- Compute: Even with savings, multi-round search plus decoding windows still costs tokens; scaling to thousands of queries requires budget planning.
- Tiny-object recognition: Very small logos/low-res text can still fool current models, even after zoom-in.
- Semantic distractors: Videos with similar themes (e.g., Icelandic foods) can mislead selection/localization unless entity checks are strict.
- Information deficits: If neither frames nor transcripts contain the needed stat, the agent must fetch external structured data or abstain.
Required resources:
- Access to web search for video retrieval, transcripts (when available), and video downloading/decoding.
- An MLLM capable of multimodal reasoning with efficient token usage.
- Modest storage/cache for candidate clips and transcripts, plus a scheduler for multi-iteration runs.
When not to use:
- Questions answerable from text alone (use simple web search to save cost).
- Ultra-fine OCR tasks on tiny, blurry frames (specialized OCR may be better).
- Domains with restricted or paywalled video access (the agent can't fetch the needed evidence).
Open questions:
- Can we learn better selection/localization policies that further cut tokens without hurting accuracy?
- How to robustly verify identities (players, actors) under occlusion, fast motion, or look-alikes?
- Can retrieval incorporate cross-video alignment (e.g., track the same event across multiple broadcasts) automatically?
- How to calibrate uncertainty under multi-source disagreement so the agent knows when to abstain?
- What training or RL signals best encourage "verify before trust" behavior in video research agents?
06 Conclusion & Future Work
Three-sentence summary: This paper defines Agentic Video Browsing and introduces Video-BrowseComp to test it with mandatory video evidence. It proposes Video-Browser, which uses Pyramidal Perception (filter by cheap text, localize with sparse signals, then zoom in with dense vision), coordinated by a Planner–Watcher–Analyst loop. The system improves accuracy by 37.5% over direct visual inference while cutting token usage by 58.3%, showing that precise, efficient video verification is possible on the open web.
Main achievement: Turning open-web video research into a practical, verifiable process by allocating compute only where the truth lives: within short, carefully chosen windows.
Future directions:
- Smarter policies for selection and localization; tighter entity verification to resist distractors.
- Better handling of tiny text/logos and long-range cross-video reasoning.
- Scaling the benchmark while keeping costs accessible and answers unambiguous.
Why remember this: It reframes video browsing from "watch everything" or "read summaries" into "search smart, look briefly, verify visually," which is exactly what reliable, efficient AI agents must do to understand a world increasingly told in video.
Practical Applications
- Sports analytics assistants that identify who made a block, foul, or assist in specific game moments.
- Fact-checking tools that verify claims by locating and inspecting the exact video timestamps.
- Shopping helpers that confirm whether a product feature is visibly demonstrated in review videos.
- Education search that jumps to the precise lecture moment explaining a concept (e.g., a "semantic gap" slide).
- Customer support bots that find and highlight the exact step in a tutorial video that solves a user's issue.
- News literacy apps that cross-check multiple video sources to resolve conflicting reports.
- Video archive search that localizes where a person, object, or color appears across long footage.
- Film analysis tools that verify props, continuity, and scene details for editors and researchers.
- Safety compliance auditors that confirm if procedures were visibly followed in workplace videos.
- Esports assistants that find and verify critical plays across multiple match VODs.