Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Beginner
Ziyang Wang, Honglu Zhou, Shijie Wang et al. · 12/5/2025
arXiv · PDF

Key Summary

  • Long Video Understanding (LVU) is hard because the important clues are tiny, far apart in time, and buried in hours of mostly unimportant footage.
  • Most older systems first caption every video chunk, then search those captions, which wastes time and can miss fine details.
  • This paper introduces Active Video Perception (AVP), where the AI actively decides what to look for, where to look, and how carefully to look based on the question.
  • AVP works in a loop: plan what to watch, observe that part of the video, then reflect to see if there’s enough evidence or if it should look again.
  • Instead of dumping everything into text, the observer records compact, time-stamped evidence tied to the question, keeping the facts precise and grounded in the video.
  • A reflector checks if the evidence is sufficient and confident; if not, it guides the next, more targeted observation.
  • Across five tough benchmarks, AVP beats previous agent-style methods by 5.7% accuracy while using only about 18.4% of the time and 12.4% of the input tokens.
  • Ablations show the planner and reflector both matter a lot, and only a few rounds (about three) are usually enough.
  • AVP preserves fine-grained temporal and spatial details, making it better at complex, multi-step questions that need exact moments.
  • This approach points toward smarter, more efficient video agents that watch only what they need, like careful detectives instead of binge-watchers.

Why This Research Matters

Videos are everywhere—classes, sports, news, security—and most of the time we only need a few key moments to answer our questions. AVP finds those moments fast by planning where to look and how carefully to look, instead of wasting time on everything. This saves people hours of scrubbing, lowers computing bills, and keeps answers tied to exact timestamps. It also makes AI explanations clearer, because the evidence list points to precise video moments. In safety-critical settings like surveillance or incident review, that grounding and efficiency can be crucial. And as video libraries keep growing, an agent that watches smartly, not endlessly, becomes essential.

Detailed Explanation

01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re looking for a single scene in a two-hour movie where a character secretly passes a note. You wouldn’t watch every second; you’d jump to suspicious moments, rewind key parts, and zoom in on subtle hand movements.

🥬 Filling (The Actual Concept): Long Video Understanding (LVU) is the challenge of answering questions about very long videos by finding small, scattered clues across time.

  • How it works (before this paper): Many systems slice the whole video into many chunks, auto-caption all of them (no matter the question), and then use those captions to answer.
  • Why it matters: Without better focus, the real evidence gets drowned in a sea of irrelevant frames; answers become slow, expensive, and sometimes wrong.

🍞 Bottom Bread (Anchor): If you ask, “When does the coach step onto the court for the first time?”, you need a few precise seconds, not every huddle or crowd shot.

🍞 Top Bread (Hook): You know how taking notes on every single page of a textbook wastes time if your quiz is only on Chapter 7?

🥬 Filling (The Actual Concept): Query-agnostic captioning is when a system captions everything in a video before it even knows your question.

  • How it works: It runs a captioner over all segments, storing a giant text database.
  • Why it matters: This burns lots of compute on irrelevant moments and can blur fine details like exact timing or small objects.

🍞 Bottom Bread (Anchor): It’s like summarizing an entire 500-page book before answering, “What color was the kite in Chapter 7?”

🍞 Top Bread (Hook): Think of how you actually watch videos when you care about a specific answer—you skim, then zoom in.

🥬 Filling (The Actual Concept): Active perception theory says a smart agent should decide why it’s looking, then choose what, when, and where to look to get only the needed information.

  • How it works: Set a goal (your question), plan a targeted peek, check if it’s enough, and repeat until confident.
  • Why it matters: Without active perception, systems waste time, miss fine-grained cues, and struggle with multi-step reasoning.

🍞 Bottom Bread (Anchor): If the question hints at “opening scene,” you first scan the start; if you see a clue at 1:02, you replay that exact moment at higher resolution to catch small details.

🍞 Top Bread (Hook): Imagine a detective who plans where to search first, gathers evidence, and then decides the next step based on what they found.

🥬 Filling (The Actual Concept): Agentic frameworks for LVU break the task into planning, looking, and reasoning, often using large language models to coordinate these steps.

  • How it works: An LLM plans actions, calls tools (like viewers), collects notes, and reasons to an answer.
  • Why it matters: Without smart coordination, long videos become unmanageable, and reasoning loses solid grounding.

🍞 Bottom Bread (Anchor): Instead of watching the whole game, the agent plans to inspect the last two minutes for a critical foul, then rechecks replays if needed.

The world before this paper: Strong multimodal LLMs could recognize many visual things, but long videos overwhelmed them. Caption-first pipelines helped LLMs reason in text but were slow and often missed fine-grained temporal or spatial details needed for tough questions. People tried extending context windows, compressing videos, or selecting keyframes, but those approaches either still processed too much irrelevant content or used fixed, question-agnostic settings (like fixed FPS). Others added reflection in text space or built caption graphs for retrieval—but everything still depended on captions that might smooth over the very clues that matter.

The gap this paper fills: a question-driven, pixel-grounded way to actively seek only the necessary evidence—choosing what to look for, when to look, where to look, and how finely to sample—iteratively, until the system is confident. This turns LVU from passive binge-watching into purposeful clue-hunting.

Why you should care: We all create and consume long videos—classes, sports, news, vlogs. Faster, more accurate video answers save time (no more scrubbing endlessly), improve safety (find the exact incident in security footage), and enhance learning (pinpoint the critical step in a lab demo). It’s the difference between “searching a haystack” and “using a magnet to find the needle.”

02 Core Idea

🍞 Top Bread (Hook): You know how you don’t read a whole encyclopedia to answer “What year did the moon landing happen?” You jump straight to the relevant page.

🥬 Filling (The Actual Concept): The key insight of AVP in one sentence: Let the AI actively plan what, when, and where to watch in the video, gather compact time-stamped evidence, and iteratively reflect on whether that evidence is enough to answer the question.

  • How it works: Plan → Observe → Reflect, and loop until confident.
  • Why it matters: This avoids watching everything, preserves precise timing and details, and focuses compute on what the question actually needs.

🍞 Bottom Bread (Anchor): For “Who hands over the box?”, the system first finds the likely scene, then zooms in at higher FPS to catch the exact handoff, and stops once sure.

Three analogies for the same idea:

  1. Detective analogy: Instead of reading every report, the detective requests only the necessary camera pulls, rewinds the moment of interest, and concludes once the evidence is solid.
  2. Student analogy: For a test about Chapter 9, you skim Chapter 9, then re-read the hard paragraph carefully. You don’t reread the whole book.
  3. Treasure hunt analogy: Use the map (question hints) to search specific areas, dig a small test hole, and keep digging only where you find clues.

🍞 Top Bread (Hook): Imagine switching from a vacuum that sucks up everything to a hand tool that only picks what you need.

🥬 Filling (The Actual Concept): Query-conditioned planning means the observation plan (what/where/how) depends on the specific question.

  • How it works: The planner picks target time ranges, FPS, and resolution based on hints like “opening scene,” timestamps, or vague cues.
  • Why it matters: Without question-driven planning, the system spends time on irrelevant parts and risks missing tiny but crucial details.

🍞 Bottom Bread (Anchor): If the question mentions “final minutes,” the planner focuses on the end and raises FPS to catch quick actions.

🍞 Top Bread (Hook): You know how pausing and replaying a sports clip helps you see who touched the ball first?

🥬 Filling (The Actual Concept): Targeted video observation is watching only selected regions at chosen granularity and writing down structured, time-stamped evidence.

  • How it works: The observer records concise entries like [start, end, description], accumulating a list across rounds.
  • Why it matters: Without structured, time-aligned notes, reasoning becomes fuzzy and can’t reliably point to exact moments.

🍞 Bottom Bread (Anchor): “[1:04–1:09] A small, conical monument appears in the upper-left background.” That precise note lets the system pick the right multiple-choice answer.

🍞 Top Bread (Hook): After you peek at a scene, you ask yourself, “Do I know enough to answer?” If not, you plan what to check next.

🥬 Filling (The Actual Concept): The iterative plan–observe–reflect loop repeats until the reflector is confident the evidence is sufficient.

  • How it works: The reflector rates confidence, explains what’s missing, and guides the next plan.
  • Why it matters: Without reflection, the system might stop too early or keep watching pointlessly.

🍞 Bottom Bread (Anchor): Round 1 narrows to [1:00–1:10] but lacks detail; Round 2 ups FPS/resolution there and confirms the correct answer.

🍞 Top Bread (Hook): Think of asking a friend who can understand both words and pictures at the same time.

🥬 Filling (The Actual Concept): Multimodal Large Language Models (MLLMs) power the planner, observer, and reflector so they can handle text queries and video frames together.

  • How it works: The planner interprets the question, the observer reads visual evidence, and the reflector reasons about sufficiency.
  • Why it matters: Without MLLMs, coordinating text instructions with visual details would fall apart.

🍞 Bottom Bread (Anchor): The model reads “Where is the monument first seen?” (text) and inspects the frames (video) to produce the answer.

Before vs. after: Before AVP, systems passively captioned everything, paying too much and sometimes blurring timing cues. After AVP, the system watches like a skilled editor—jumping to likely scenes, increasing detail only when needed, and stopping once certain—leading to higher accuracy and much better efficiency.

Why it works (intuition): Questions usually point to only a few vital moments. If the agent follows those signals (what/where/how), it saves effort and avoids noise. Iteration adds safety: if the first pass isn’t enough, the next pass gets sharper and closer, making the final answer both faster and more trustworthy.

Building blocks:

  • Planner: turns the question into an actionable plan (what/where/how).
  • Observer: looks exactly there with the chosen granularity and logs structured, time-stamped evidence.
  • Reflector: checks sufficiency and either answers or guides the next observation.

These parts form a tight feedback loop that converges quickly on the needed clues.

03 Methodology

At a high level: Input (video + question) → Plan (what/where/how) → Observe (extract time-stamped evidence) → Reflect (enough yet?) → Output (answer) or Re-plan (repeat up to a few rounds).

🍞 Top Bread (Hook): You know how you plan a quick skim first, then a careful rewatch only if needed?

🥬 Filling (The Actual Concept): The planner converts the question into a concrete observation plan: what to look for, where to look (time range), and how to sample (FPS/resolution).

  • How it works step by step:
    1. Parse the question for hints (timestamps, phrases like “opening scene,” or task type like factual vs. reasoning).
    2. Choose where: entire video for a cheap scan or a focused segment if hints exist.
    3. Choose how: coarse settings (low FPS/res) for exploration, fine settings (higher FPS/res) for detail.
    4. Define what: a brief instruction tailored to the question (e.g., “Find when the coach first steps on court.”).
  • Why it matters: Without a plan, the system wastes time everywhere and risks missing the tiny moment that matters.

🍞 Bottom Bread (Anchor): “What: verify who hands over the box; Where: [4:55–5:15]; How: 2 FPS, medium res.”
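
To make this concrete, here is a minimal Python sketch of what a planner’s output could look like. The dataclass name and fields (instruction, time window, FPS, resolution) are illustrative assumptions based on the description above, not the paper’s exact schema.

```python
# A minimal sketch of a query-conditioned observation plan
# (field names are illustrative assumptions, not the paper's exact schema).
from dataclasses import dataclass

@dataclass
class ObservationPlan:
    what: str          # brief instruction tailored to the question
    start_s: float     # start of the time window to watch, in seconds
    end_s: float       # end of the time window, in seconds
    fps: float         # sampling rate: low for scanning, higher for detail
    resolution: str    # e.g., "low", "medium", "high"

# Matches the anchor example above:
# "What: verify who hands over the box; Where: [4:55-5:15]; How: 2 FPS, medium res."
plan = ObservationPlan(
    what="Verify who hands over the box",
    start_s=295.0,
    end_s=315.0,
    fps=2.0,
    resolution="medium",
)
```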

🍞 Top Bread (Hook): Imagine watching just the slice of video your plan picked and writing a crisp note with the exact times and what happened.

🥬 Filling (The Actual Concept): The observer executes the plan and outputs structured time-stamped evidence entries: [start, end, description].

  • How it works step by step:
    1. Load the chosen region(s) with the specified FPS/resolution.
    2. Scan for moments relevant to the “what.”
    3. For each salient moment, log [start, end, description].
    4. Append entries to a cumulative evidence list (the agent’s memory).
  • Why it matters: Free-form text can drift; structured, time-aligned notes keep facts precise and make later reasoning reliable.

🍞 Bottom Bread (Anchor): “[1:04–1:09] The Tombstone monument appears in the upper-left background; the couple stands midground.”
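
As a rough picture of the evidence format, here is a small sketch of one time-stamped entry and the cumulative list it feeds; the exact representation used in the paper may differ.

```python
# A minimal sketch of structured, time-stamped evidence entries
# ([start, end, description]); the paper's exact format may differ.
from dataclasses import dataclass

@dataclass
class Evidence:
    start_s: float
    end_s: float
    description: str

evidence_log: list[Evidence] = []

# The anchor example above, [1:04-1:09], as one entry in the cumulative list.
evidence_log.append(Evidence(
    start_s=64.0,
    end_s=69.0,
    description="The Tombstone monument appears in the upper-left background; "
                "the couple stands midground.",
))
```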

🍞 Top Bread (Hook): After taking notes, you ask, “Is this enough to answer confidently?” If not, you decide what to check next.

🥬 Filling (The Actual Concept): The reflector evaluates sufficiency and confidence, extracting an answer if confident, or explaining what is missing if not.

  • How it works step by step:
    1. Read the cumulative evidence list.
    2. Judge if it supports a clear answer (e.g., pick an MCQ option) with confidence.
    3. If confident, output answer and stop; if not, write what to look for next (e.g., “Need higher FPS from 1:00–1:10 to see the monument clearly.”).
  • Why it matters: Without reflection, the system might stop early and be wrong, or keep searching forever.

🍞 Bottom Bread (Anchor): Round 1: confidence 0.3 (“not enough”); Round 2: confidence 0.7 (“answer D”).
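
Here is a small sketch of the kind of record the reflector might return: a confidence score, an answer when it is sure, and a note about what is still missing when it is not. The structure, field names, and 0-to-1 confidence scale are assumptions for illustration.

```python
# A minimal sketch of the reflector's output; field names and the 0-1
# confidence scale are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reflection:
    confidence: float         # how sure the reflector is, e.g. 0.0-1.0
    answer: Optional[str]     # filled in when the evidence suffices
    missing: Optional[str]    # guidance for the next round, if any

# The two rounds from the anchor example above.
round_1 = Reflection(confidence=0.3, answer=None,
                     missing="Need higher FPS from 1:00-1:10 to see the monument clearly.")
round_2 = Reflection(confidence=0.7, answer="D", missing=None)
```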

Algorithm flow (friend-style explanation):

  • Start with an initial plan from the question. Observe as instructed. Add your findings to a neat, time-stamped list. Ask your inner critic if it’s enough. If yes, answer. If not, let the critic say what’s missing, add that to your history, and make a sharper plan. Repeat up to 3 rounds or until you’re sure.
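
Putting the three roles together, here is a minimal sketch of the plan–observe–reflect loop just described. The `planner`, `observer`, and `reflector` arguments stand in for MLLM-backed components (returning objects like the sketched ObservationPlan, Evidence, and Reflection), and the 0.5 confidence threshold is an illustrative assumption consistent with the worked example below, not a value from the paper.

```python
# A minimal sketch of the plan-observe-reflect loop. The planner, observer,
# and reflector are assumed to be MLLM-backed callables supplied by the caller;
# max_rounds=3 follows the paper's "a few rounds" finding, threshold=0.5 is
# an illustrative assumption.
def answer_question(video, question, planner, observer, reflector,
                    max_rounds=3, threshold=0.5):
    evidence_log = []   # cumulative time-stamped evidence across rounds
    history = []        # past plans, findings, and reflector feedback

    for _ in range(max_rounds):
        plan = planner(question, history)        # decide what / where / how
        new_evidence = observer(video, plan)     # [start, end, description] entries
        evidence_log.extend(new_evidence)

        reflection = reflector(question, evidence_log)
        history.append({"plan": plan,
                        "evidence": new_evidence,
                        "feedback": reflection.missing})

        if reflection.answer is not None and reflection.confidence >= threshold:
            return reflection.answer, evidence_log   # confident: answer and stop

    # Out of rounds: return the best-supported answer from the final reflection.
    return reflection.answer, evidence_log
```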

Concrete example with data:

  • Query: “Where is the Tombstone monument first seen on screen?”
  • Round 1 plan: What = find introduction of the German woman and monument location; Where = entire video; How = 0.5 FPS, low res.
  • Round 1 observe: “[1:00–1:10] narrator introduces a German couple; a wide shot shows ranch and a distant monument.”
  • Round 1 reflect: Confidence 0.3 (not enough to pick an option; need precise location quadrant).
  • Round 2 plan: What = determine exact location vs options; Where = [1:00–1:10]; How = 2 FPS, medium res.
  • Round 2 observe: “[1:04–1:09] small, conical monument on a hill in upper-left background.”
  • Round 2 reflect: Confidence 0.7, answer D.

🍞 Top Bread (Hook): Think of using binoculars: you first scan the horizon (wide, coarse), then zoom in on the tiny moving shape (narrow, fine).

🥬 Filling (The Actual Concept): Coarse-to-fine observation means starting cheap and broad, then sharpening where evidence suggests.

  • How it works: Begin with low FPS/res to localize; then increase FPS/res in the suspicious window.
  • Why it matters: Without this, you either waste compute everywhere or miss small, fast events.

🍞 Bottom Bread (Anchor): From 0.5 FPS scan to 2 FPS zoom-in on [1:00–1:10] to spot a small monument.
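
As a tiny illustration of this coarse-to-fine policy, the sketch below escalates sampling settings from a cheap scan to a sharper look. The exact FPS, resolution, and window-length cutoff are illustrative choices, not values prescribed by the paper (the worked example above goes from 0.5 FPS to 2 FPS).

```python
# A minimal sketch of coarse-to-fine sampling; the cutoffs and settings are
# illustrative assumptions, not the paper's prescribed values.
COARSE = {"fps": 0.5, "resolution": "low"}     # broad, cheap localization scan
FINE   = {"fps": 2.0, "resolution": "medium"}  # detailed look at a short window

def choose_sampling(round_idx, window_seconds):
    # First round, or any still-long window: stay coarse to localize cheaply.
    if round_idx == 0 or window_seconds > 300:
        return COARSE
    # Later rounds over a short, promising window: sharpen FPS and resolution.
    return FINE
```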

🍞 Top Bread (Hook): Imagine keeping a tidy scrapbook of every clue you’ve seen so far to avoid repeating work.

🥬 Filling (The Actual Concept): Evidence history and re-planning reuse what’s already found and steer the next plan.

  • How it works: After each round, the plan, new evidence, and reflector’s notes are saved; the planner reads this to refine the next step.
  • Why it matters: Without history, the system forgets progress, repeats searches, and wastes time.

🍞 Bottom Bread (Anchor): History shows you already checked the intro scene, so the next plan targets the close-up later on.
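
Building on the sketches above, here is one way the accumulated history could be serialized into context for the next planning round; the exact prompt format is an assumption, not taken from the paper.

```python
# A minimal sketch of turning the round-by-round history (plans, evidence,
# reflector feedback) into text the planner can read; the wording is an
# illustrative assumption.
def history_to_context(history):
    lines = []
    for i, entry in enumerate(history, start=1):
        plan = entry["plan"]
        lines.append(f"Round {i} plan: {plan.what} "
                     f"[{plan.start_s:.0f}s-{plan.end_s:.0f}s, {plan.fps} FPS]")
        for ev in entry["evidence"]:
            lines.append(f"  evidence [{ev.start_s:.0f}s-{ev.end_s:.0f}s]: {ev.description}")
        if entry["feedback"]:
            lines.append(f"  reflector note: {entry['feedback']}")
    return "\n".join(lines)
```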

Secret sauce (what makes it clever):

  • Question-driven planning that tightly controls what/where/how to look.
  • Structured, time-stamped evidence that preserves grounding.
  • Iterative reflection that prevents both over-searching and under-searching.
  • Adaptive granularity (coarse-to-fine) that spends compute only where it pays off.

Together, these make AVP accurate, efficient, and trustworthy on long, tricky videos.

04 Experiments & Results

🍞 Top Bread (Hook): Picture a championship where many teams try to answer tough questions about hours-long videos—who wins, and how fast do they play?

🥬 Filling (The Actual Concept): The tests measure how accurately and efficiently systems answer long-video questions across several benchmarks.

  • How it works: Datasets include MINERVA, LVBench, MLVU, Video-MME, and LongVideoBench, with videos often many minutes to hours long.
  • Why it matters: Real-world use needs both correctness (accuracy) and practicality (time and tokens).

🍞 Bottom Bread (Anchor): On LVBench, questions span entity facts, events, temporal grounding, reasoning, and summarization over hour-long videos.

🍞 Top Bread (Hook): Imagine running races against both general-purpose sprinters and specialized marathoners.

🥬 Filling (The Actual Concept): AVP is compared to strong general MLLMs, video-specific models, and agentic frameworks that plan and reason.

  • How it works: Everyone gets the same questions; scores are accuracy plus efficiency stats like inference time and input tokens.
  • Why it matters: Beating caption-first agents shows that actively selecting video views is a crucial shift.

🍞 Bottom Bread (Anchor): AVP vs DeepVideoDiscovery (DVD): AVP is both more accurate and dramatically faster with far fewer tokens.

Scoreboard with context:

  • Accuracy: AVP achieves an average +5.7% accuracy over the best prior agentic method (DVD) across benchmarks.
  • Efficiency: On LVBench, AVP needs only about 18.4% of the inference time and 12.4% of the input tokens compared to DVD. That’s like getting an A while studying for one evening, when others crammed all week.
  • Backbone robustness: Using different MLLMs (from lighter to stronger), AVP consistently improves over the backbone itself—showing the framework’s generality.
  • Time-stamped structure wins: Replacing AVP’s structured evidence with unstructured notes hurts accuracy, confirming that precise temporal grounding is key.

🍞 Top Bread (Hook): Think of a coach fine-tuning a play—“What if we add a planner? What if we add a reviewer?”

🥬 Filling (The Actual Concept): Ablation studies test the value of each AVP component and design choice.

  • How it works: Start with observer-only, then add planner, then add reflector; also vary max rounds and model strengths.
  • Why it matters: This reveals which parts truly drive performance.

🍞 Bottom Bread (Anchor): Observer-only < +Planner < +Reflector (full AVP) on both MINERVA and LVBench; best results typically at 3 rounds.

Surprising findings:

  • Only a few rounds needed: Performance usually saturates by three plan–observe–reflect cycles, suggesting the loop converges quickly.
  • Component sensitivities differ by dataset: Extremely long videos benefit especially from a strong observer; complex multi-hop questions gain more from a stronger planner and reflector.
  • Active beats passive at scale: Eliminating the huge captioning stage both speeds things up and raises accuracy, because the agent zeros in on the real clues instead of drowning in text.

What these numbers mean in plain terms:

  • If everyone else watched almost everything to be safe, AVP instead asks, “What’s the minimum I must see to be sure?” That change is why it answers better while spending much less time and memory.
  • On tests that need exact timing (like “first seen,” “before the second clip”), AVP’s time-stamped evidence and iterative zoom-in shine.
  • On massive videos, AVP’s coarse-to-fine scanning avoids wasting effort—like using a map before you hike.

Bottom line: Across five respected benchmarks, AVP sets a new standard for both accuracy and efficiency, showing that active, query-driven perception is the right tool for long video reasoning.

05 Discussion & Limitations

🍞 Top Bread (Hook): Even great detectives can miss a quick glance in a crowded room.

🥬 Filling (The Actual Concept): Limitations describe where AVP might struggle and what it needs.

  • How it works: We identify tricky cases (very brief events), resource needs (MLLM access, GPU/TPU decode), and scenarios not fully tested (live streaming, robotics).
  • Why it matters: Knowing limits helps you use the tool wisely and improve it next.

🍞 Bottom Bread (Anchor): A 0.5 FPS skim might miss a single fast three-pointer if it happens between sampled frames.

Specific limitations:

  • Fine, blink-and-you-miss-it events: If the initial coarse scan is too sparse, a very brief action can be missed. The loop can recover by re-planning with higher FPS, but only if hints surface.
  • Offline focus: Experiments mainly use full videos available upfront; true streaming or embodied settings (deciding while the world moves) are future work.
  • Prompted control: Current planning/reflecting is prompt-based; learned policies (e.g., reinforcement learning) might do even better at long-horizon efficiency.

Required resources:

  • A capable MLLM (for planning, observing, reflecting), video decoding tools, and enough memory to hold sampled frames within a token budget.
  • For best results, moderate GPU/TPU compute to handle multiple passes at varying FPS/resolution.

When not to use:

  • Ultra-brief event detection with no question hints, where exhaustive high-FPS scanning is mandatory (e.g., millisecond anomalies).
  • Tasks that are already local and short, where a simple single-pass model suffices and AVP’s planning loop adds overhead.

Open questions:

  • How to learn the observation policy: Can we train the planner/reflector to pick the perfect FPS, region, and stopping point automatically under strict time budgets?
  • Streaming adaptation: How to decide what to watch next in real time as video arrives?
  • Multi-agent coordination: Can specialized observers (faces, text, motion) cooperate under the same planner for even stronger results?
  • Robustness: How to keep confidence calibration solid across wildly different video genres and qualities?

Takeaway: AVP moves the field toward smarter, leaner video reasoning, but the hardest edges—tiny, rapid events and real-time operation—invite the next round of innovation.

06 Conclusion & Future Work

Three-sentence summary: This paper introduces Active Video Perception (AVP), which treats long video understanding as an iterative, query-driven evidence hunt that plans what/where/how to look, writes down time-stamped clues, and reflects until confident. By avoiding question-agnostic captioning and using a tight plan–observe–reflect loop, AVP preserves precise temporal and spatial grounding while massively improving efficiency. Across five benchmarks, AVP achieves state-of-the-art accuracy among agentic methods while using a fraction of the time and tokens.

Main achievement: Proving that active, pixel-grounded observation—guided by the question and verified through reflection—beats passive caption-first pipelines on both accuracy and efficiency for long, complex videos.

Future directions: Learn the observation policy (beyond prompting) to optimize rounds, FPS, and halting; bring AVP into streaming and embodied agents that must perceive and act in real time; add specialized observers (e.g., OCR, pose) under a unified planner. Improve confidence calibration and dynamic risk control for high-stakes deployments.

Why remember this: AVP shifts the mental model of video AI from binge-watching to clue-hunting. It shows that deciding what, when, and where to look—then checking if that was enough—is the key to mastering long videos without drowning in data. That idea will likely power the next generation of practical, trustworthy video assistants.

Practical Applications

  • Sports review: Jump to the exact plays to verify who scored and when, producing time-stamped clips for coaches and highlights.
  • Education: Answer student queries about a lecture by scanning only relevant minutes and giving time-stamped references.
  • Customer support and training: Pinpoint the step where a user or trainee made a mistake in a long screen-recording video.
  • Security and compliance: Locate the first appearance of an object or person in hours of CCTV with precise timestamps.
  • Media production: Rapidly assemble a rough cut by finding and extracting only scenes that match a storyboard cue.
  • Content moderation: Efficiently check flagged segments in long streams without scanning entire broadcasts.
  • Meeting summarization: Pull out who decided what and when from multi-hour recordings, linking notes to timestamps.
  • Medical and lab videos: Identify the exact moment a procedure step occurs for auditing or teaching purposes.
  • Wildlife and research footage: Detect specific behaviors or events (e.g., first feeding) without labeling every frame.
  • Video search engines: Offer question-answering over long videos with grounded, time-stamped evidence links.
Tags: Active Video Perception · Long Video Understanding · Plan-Observe-Reflect · Query-Conditioned Planning · Targeted Observation · Temporal Grounding · Structured Evidence · Agentic Framework · Multimodal Large Language Models · Coarse-to-Fine Sampling · Efficiency in Video QA · Confidence Thresholding · Evidence Sufficiency · Video Reasoning · Benchmark Evaluation