Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Key Summary
- Zoom-Zero helps AI answer questions about videos by first finding the right moment and then zooming in to double-check tiny details.
- It fixes a weakness in big video-language models that often miss when things happen or make up answers not grounded in the video.
- The method uses reinforcement learning with a new zoom-in accuracy reward that only gives full credit if the answer is correct after zooming in on the predicted time span.
- It also introduces token-selective credit assignment, so credit goes to the exact parts of the output that localized the moment or produced the answer.
- Compared to strong baselines, Zoom-Zero improves temporal grounding (finding the right moment) by 5.2% on NExT-GQA and 4.6% on ReXTime.
- On long videos, the coarse-to-fine zooming strategy boosts performance by an average of 6.4% without losing the big-picture context.
- The approach reduces hallucinations by verifying that the predicted time span actually contains the visual evidence needed.
- It works in a single stage at normal speed, and can optionally run a two-stage zoom for even better accuracy at a modest extra cost.
- The idea is simple but powerful: find first, then verify closely, and pay the right tokens for the right job.
- This makes AI more trustworthy for tasks like sports highlights, surveillance review, education videos, and documentaries.
Why This Research Matters
Videos are how we learn, remember, and share stories, from school lessons and science demos to sports, news, and security footage. Zoom-Zero makes AI more reliable by tying answers to the exact moments where the evidence appears, reducing mistakes and hallucinations. This is especially important in long videos, where small details can get lost unless you zoom in at the right time. Better grounding means teachers, analysts, and everyday users can trust the answers they get and find what matters faster. It also keeps computation efficient by focusing detail only where it counts. In short, it turns AI into a more careful video watcher that first finds, then proves.
Detailed Explanation
01 Background & Problem Definition
You know how when you watch a long movie, you first get the general plot, then you rewind a few seconds to catch a tiny clue you missed? That is exactly the kind of skill AI needs to answer questions about videos honestly and precisely.
Top Bread (Hook): Imagine you ask, "When does the player score?" If the AI guesses the wrong minute or makes up an answer, you won't trust it. Filling (The Actual Concept):
- What it is: Grounded Video Question Answering (GVQA) asks an AI to both find the right time span in a video and answer a question based only on what's truly in that span.
- How it works: The model watches the video, figures out which time interval matches the question, and generates an answer that should be backed by visible evidence in that interval.
- Why it matters: Without true grounding, the model can say the right words for the wrong reasons (hallucinations), or point to the wrong moment. Bottom Bread (Anchor): If you ask, "What number is on the jersey?" the AI must first jump to the goal scene, then read the number on the close-up frame.
The world before: Large video-language models (LVLMs) were good at general descriptions but struggled with time. They often mixed up when things happened, especially in long videos, and sometimes produced answers that didn't match any real frames, like saying "29%" when that text never appeared in the grounded segment it chose.
Top Bread (Hook): You know how teachers want you to show your work, not just give an answer? AI needs that too. Filling (The Actual Concept):
- What it is: Temporal Grounding is the skill of pointing to the exact start and end times in the video that prove your answer.
- How it works: The model predicts a time span [start, end] that matches the query.
- Why it matters: If you can't show the moment, the answer isn't trustworthy. Bottom Bread (Anchor): When asked "When does the fireworks show begin?", the AI should highlight, say, 91.0-163.0 seconds, not the wrong minute.
Researchers tried reinforcement learning (RL) to improve this. RL can sharpen specific abilities without ruining what the model already knows.
Top Bread (Hook): Training a puppy with treats helps it learn tricks by trial and error. Filling (The Actual Concept):
- What it is: Reinforcement Learning (RL) teaches models by giving rewards for good behavior and less reward for bad behavior.
- How it works: The model tries different answers; a reward function scores them; the model updates to get higher scores next time.
- Why it matters: RL can target weaknesses, like temporal precision, without needing perfect labels everywhere. Bottom Bread (Anchor): If the model correctly spots the goal time and answers the color of the jersey, it gets a bigger "treat."
A popular RL tool here is GRPO (Group Relative Policy Optimization), which compares several sampled answers to decide which ones are better.
Top Bread (Hook): Think of a coach ranking multiple practice runs from a team and encouraging the better ones. Filling (The Actual Concept):
- What it is: GRPO improves a model by sampling a group of responses, scoring them, and pushing the model toward the relatively better ones.
- How it works: For a question, the model generates several outputs, each gets a reward; the scores are normalized across the group; the model learns to prefer higher-scoring samples.
- Why it matters: It avoids the cost of training a separate "critic" network and focuses on relative improvements. Bottom Bread (Anchor): If eight attempts are made, and two clearly align with the right moment and answer, the model nudges itself toward those two.
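To make the group comparison concrete, here is a minimal Python sketch of group-relative advantages: each sampled response is scored against its siblings, not against an absolute baseline. The function name and the toy reward values are illustrative, not the paper's code.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Turn a group of scalar rewards into zero-mean, unit-scale advantages."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: 8 rollouts, two of which found the right moment and answer.
rewards = [0.2, 0.1, 1.8, 0.3, 0.2, 1.7, 0.4, 0.3]
for i, a in enumerate(group_relative_advantages(rewards)):
    print(f"rollout {i}: reward={rewards[i]:.1f}, advantage={a:+.2f}")
```

Because only the relative ordering within the group matters, no separate critic network is needed.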
But early RL attempts used simple rewards, like "format is correct" and "IoU is high." That left two big problems:
- They often forgot to check if the predicted time span truly contains the visual evidence for the answer.
- They gave one reward to the whole output, so all tokens (including unhelpful ones) got the same credit.
Top Bread (Hook): You wouldn't give the entire group an A just because one teammate did the hard part. Filling (The Actual Concept):
- What it is: Mean Intersection over Union (mIoU) measures how much the predicted time span overlaps with the ground truth across many examples.
- How it works: For each example, overlap is computed (intersection divided by union), then averaged.
- Why it matters: It tells you if your localization is usually close, but it doesn't prove the key frames for the answer are truly inside. Bottom Bread (Anchor): Predicting [85-170s] when the truth is [91-163s] may have decent IoU, but if the frame with the tiny "29%" text isn't actually seen clearly, you still get the answer wrong.
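For intuition, here is a tiny Python sketch of temporal IoU for a single predicted span, using the [85-170s] versus [91-163s] example above. The helper name is ours, not the paper's.

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((85.0, 170.0), (91.0, 163.0)))  # ~0.85: good overlap...
# ...yet the frame with the tiny "29%" text could still be unreadable.
```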
The gap: Models needed a way to (1) first find likely intervals, (2) then zoom in to verify tiny clues, and (3) reward the exact parts of the output responsible for finding times versus stating answers. The stakes are real: From summarizing lectures and sports to reviewing surveillance or documentaries, people need time-accurate, evidence-backed answers, not good-sounding guesses.
02 Core Idea
Top Bread (Hook): You know how photographers first frame the scene, then twist the lens to zoom in and focus on the detail that matters? That's the big idea here.
The "Aha!" moment in one sentence: First localize the moment coarsely, then zoom in on those frames to verify the tiny details, and give rewards only to the tokens that did the job: some for finding the time, some for answering.
Three analogies:
- Binoculars on a hike: scan the mountain range (coarse), then zoom in on the cave entrance (fine) to confirm what you saw.
- Library search: find the right shelf (coarse), then read the exact paragraph (fine) to answer the question.
- Sports replay: spot the play (coarse), then slow-mo zoom to check if the ball crossed the line (fine).
Top Bread (Hook): You know how you build a LEGO set: big blocks first, then small details. Filling (The Actual Concept):
- What it is: Coarse-to-Fine Framework means the model first picks the likely time span, then allocates more visual tokens to those frames for a close inspection.
- How it works: Step 1: roll out several predictions of [start, end] and a preliminary answer with low per-frame detail; Step 2: crop to the best spans and re-run with higher per-frame detail to confirm the final answer.
- Why it matters: Long videos exceed the "attention budget." This keeps global context while still seeing crisp details where it counts. Bottom Bread (Anchor): The coarse pass might miss the tiny "29%" on a sign; the fine pass zooms in so the text becomes readable.
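A small sketch of that budget trade-off in Python. The total budget of 16,384 tokens and the 32 fine-pass frames are made-up numbers for illustration; only the 256-frame coarse pass matches the example later in the text.

```python
TOTAL_VISUAL_TOKENS = 16_384  # assumed fixed budget shared by all sampled frames

def tokens_per_frame(num_frames, budget=TOTAL_VISUAL_TOKENS):
    """More frames -> wider global coverage but coarser detail per frame."""
    return budget // num_frames

# Coarse pass: cover a long video with many low-detail frames.
print("coarse:", tokens_per_frame(256), "tokens/frame over 256 frames")

# Fine pass: the same budget concentrated on the predicted span's few frames,
# so each frame gets far more tokens ("zooming in").
print("fine:  ", tokens_per_frame(32), "tokens/frame over 32 frames")
```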
Top Bread (Hook): Teachers give extra points when you show your work. Filling (The Actual Concept):
- What it is: Zoom-in Accuracy Reward gives full credit only if, after zooming into the predicted time span, the model's final answer is correct.
- How it works: Use the coarse-predicted span to select frames; increase per-frame resolution (more tokens per frame); answer again; reward = 1 if correct.
- Why it matters: This ties correctness to the evidence inside the predicted span, reducing hallucinations. Bottom Bread (Anchor): If the model says "B: 29%" but, after zooming in, can't find 29% in those frames, it doesn't get the reward.
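Here is a minimal sketch of how such a zoom-in accuracy reward could be scored. The `frames` list of (timestamp, image) pairs and the `answer_fn` callable are hypothetical stand-ins for the video input and the LVLM, not real APIs from the paper.

```python
def zoom_in_reward(frames, question, pred_span, gt_answer, answer_fn,
                   tokens_per_frame_fine=512):
    """Give full credit only if the answer is still correct after zooming
    into the predicted span with a higher per-frame token budget."""
    start, end = pred_span
    cropped = [(t, img) for t, img in frames if start <= t <= end]
    if not cropped:
        return 0.0  # an empty span cannot contain the needed evidence
    # Re-answer using only the cropped frames, at higher per-frame detail.
    zoomed_answer = answer_fn(cropped, question,
                              tokens_per_frame=tokens_per_frame_fine)
    return 1.0 if zoomed_answer == gt_answer else 0.0
```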
Top Bread (Hook): On a team, the goal-scorer and the playmaker both deserve credit, but not for the same thing. Filling (The Actual Concept):
- What it is: Token-Selective Credit Assignment (TokenAdv) gives different parts of the output different credit based on their role: time-localizing tokens versus answer tokens.
- How it works: Compute separate advantages for each reward type (format, IoU, answer, zoom). Assign grounding-related advantages to tokens inside the <glue> span and answer-related advantages to tokens inside <answer>.
- Why it matters: It fixes GRPO's "everyone gets the same score" problem, so the right tokens learn from the right signals. Bottom Bread (Anchor): The numbers in <glue>[301.6s, 325.8s]</glue> get credit for good IoU; the letter in <answer>D</answer> gets credit for correct answering.
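A minimal Python sketch of the routing idea. The rule that <glue> tokens receive format + IoU + zoom advantages and <answer> tokens receive format + answer + zoom advantages follows the description above; giving every other token format-only credit is our simplifying assumption, as are the toy token strings.

```python
def route_advantages(tokens, adv):
    """adv: per-reward advantages for ONE sampled response, e.g.
    {'format': 0.4, 'iou': 1.1, 'acc': -0.3, 'zoom': 0.9}.
    Returns one scalar credit per output token."""
    credits, region = [], None
    for tok in tokens:
        if tok == "<glue>":
            region = "glue"
        elif tok == "<answer>":
            region = "answer"
        elif tok in ("</glue>", "</answer>"):
            region = None

        if region == "glue":      # localization tokens: format + IoU + zoom
            credits.append(adv["format"] + adv["iou"] + adv["zoom"])
        elif region == "answer":  # answer tokens: format + answer + zoom
            credits.append(adv["format"] + adv["acc"] + adv["zoom"])
        else:                     # remaining tokens: format-only (assumption)
            credits.append(adv["format"])
    return credits

tokens = ["<think>", "...", "</think>", "<glue>", "[301.6,", "325.8]",
          "</glue>", "<answer>", "D", "</answer>"]
adv = {"format": 0.4, "iou": 1.1, "acc": -0.3, "zoom": 0.9}
print(route_advantages(tokens, adv))
```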
Before vs After:
- Before: Models often guessed spans, ignored tiny clues, and rewarded every token equally.
- After: Models find spans more precisely, verify details by zooming in, and reward only the tokens responsible for each skill.
Why it works (intuition):
- Splitting the job reduces confusion: first "where," then "what."
- Verifying inside the chosen span discourages making up answers.
- Paying tokens by role makes learning targeted and efficient.
Building blocks:
- A base LVLM (e.g., Qwen2.5-VL) with dynamic token budgets.
- GRPO for group-wise sampling and relative comparison.
- Four rewards: format, IoU (temporal grounding), answer accuracy (coarse), and zoom-in accuracy (fine verification).
- TokenAdv to route each reward to the right tokens.
- Optional test-time scaling: simple one-stage (fast), or two-stage coarse-to-fine / divide-and-conquer (stronger on very long videos).
03 Methodology
At a high level: Video + Question → Coarse Pass (sample spans + preliminary answers) → Pick spotlight span(s) → Fine Pass (zoomed frames with higher per-frame detail) → Final answer and rewards → Update policy with token-selective credit.
Step-by-step like a recipe:
- Input preparation
- What happens: The model receives a video (sampled at low fps to cover long time) and a question. A fixed total token budget forces a tradeoff between seeing many frames (global view) and seeing each frame in detail (local view).
- Why it exists: Long videos can't fit fully at high resolution into the model context.
- Example: A 40-minute documentary is sampled to 256 frames globally in the coarse pass.
- Coarse-grained pass (find likely moments)
- What happens: The policy samples G responses. Each response includes: (a) a predicted time span in <glue>[(s,e), ...]</glue>, and (b) a preliminary multiple-choice answer in <answer> </answer>, with the required format inside <think>, <time>, etc.
- Why it exists: The coarse pass preserves global context so the model won't miss far-apart events.
- Example: For "When do we first see Fairmont Chateau Lake?" several spans are proposed, like [91.0,163.0] or [70.0,79.0], each with a tentative answer.
- Score the coarse pass (multi-faceted rewards)
- What happens: Each sampled response gets four verifiable rewards: a) Format reward R_format = 1 if the output matches the template (correct tags and structure). b) Temporal grounding reward R_IoU = overlap quality between predicted span and ground truth. c) Coarse answer reward R_Acc = 1 if the preliminary answer is correct using the full coarse input. d) Zoom-in accuracy reward R_Zoom (see next step) after fine verification.
- Why it exists: Different skills need different feedback signals; otherwise learning gets muddled.
- Example: A response with neat tags, decent IoU, and a wrong answer might get 1 (format), 0.6 (IoU), 0 (answer), and pending (zoom) until verification.
- Fine-grained pass (temporal zoom-in)
- What happens: The model crops the video to the predicted span(s) and reallocates the same total visual token budget onto fewer frames. This raises tokens per frame (like increasing resolution) so small text, logos, or objects become readable. The model then answers again, producing a final answer.
- Why it exists: Many failures happen because tiny cues are invisible at coarse resolution. Zooming makes them visible while keeping focus on the relevant time.
- Example: The tiny "29%" in a store ad was unreadable before; after zooming, it's crisp, and the model picks option D: 29%.
- Zoom-in accuracy reward (fine verification)
- What happens: If the final answer after zooming is correct, R_Zoom = 1; otherwise 0.
- Why it exists: It ensures the predicted timespan truly contains the needed evidence. No more lucky guesses.
- Example: If a span yields the correct final answer only after zooming, that response wins among rollouts.
- Token-selective credit assignment (TokenAdv)
- What happens: Instead of summing rewards into one number, compute advantages separately per reward type across the G samples. Then assign the right advantages to the right tokens:
- Tokens inside <glue> get format + IoU + zoom advantages (localization-focused).
- Tokens inside <answer> get format + zoom + answer advantages (answer-focused).
- Why it exists: GRPO's uniform credit treats all tokens equally. TokenAdv pays the right tokens for the right job, speeding and stabilizing learning.
- Example: If the answer is wrong but the span is good, the <glue> tokens still get positive credit (IoU), while the <answer> token does not.
- Policy update (GRPO with decoupled advantages)
- What happens: With advantages assigned, the policy is nudged to increase probability of better tokens and decrease worse ones, with a KL regularizer to stay near a reference model (avoiding wild jumps).
- Why it exists: Stable learning that focuses on relative improvements.
- Example: Among eight samples, the ones that had the best combination of grounded spans and correct final answers after zooming shape the next iteration.
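For readers who want the math, here is a small sketch of a per-token GRPO-style objective that consumes the token-selective credits. The clipped ratio, the specific KL estimator, and the `clip_eps` / `beta` values are common conventions assumed for illustration, not details confirmed by the text.

```python
import math

def token_objective(logp_new, logp_old, logp_ref, credits,
                    clip_eps=0.2, beta=0.04):
    """logp_*: per-token log-probabilities under the current policy, the
    rollout (old) policy, and the frozen reference model.
    credits: per-token advantages produced by TokenAdv routing."""
    total = 0.0
    for lp_new, lp_old, lp_ref, a in zip(logp_new, logp_old, logp_ref, credits):
        ratio = math.exp(lp_new - lp_old)                  # importance ratio
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        policy_term = min(ratio * a, clipped * a)          # clipped surrogate
        # A standard nonnegative KL estimate toward the reference (assumed).
        kl = math.exp(lp_ref - lp_new) - (lp_ref - lp_new) - 1.0
        total += policy_term - beta * kl
    return total / max(len(credits), 1)                    # maximize this

# Toy usage: three output tokens with their routed credits.
print(token_objective(
    logp_new=[-1.1, -0.4, -2.0],
    logp_old=[-1.2, -0.5, -1.9],
    logp_ref=[-1.0, -0.6, -2.1],
    credits=[2.4, 2.4, 1.0],
))
```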
Concrete data walk-through:
- Input: "What color is the jacket of the man opening wine? (A) green (B) black (C) red (D) blue."
- Coarse pass samples spans: [159.1,170.3], [301.6,325.8], [343.5,359.8], with draft answers.
- Fine pass zooms into each candidate span; only one span contains clear frames of the jacket. After zooming, the final answer "C" (red) is correct, earning R_Zoom = 1. That sampled response becomes the strongest teacher for the update.
The secret sauce:
- Split the task into "where then what," so the model doesn't confuse the two.
- Verify after zooming to tie correctness to evidence.
- Route credit to the exact tokens responsible for each skill (TokenAdv), fixing GRPO's uniform-credit flaw.
- Keep inference flexible: one-stage (fast), or two-stage coarse-to-fine / divide-and-conquer (stronger) depending on time budget.
04 Experiments & Results
The test: Does Zoom-Zero better find the right time and give the right answer across short and long videos? Researchers measured:
- Temporal grounding quality: mIoU (average overlap), plus R@0.3 and R@0.5 (percent of predictions with overlap above 0.3 or 0.5).
- Answer accuracy: Multiple-choice correctness.
- Coverage measures on long videos: IoG/IoP variants to check if key clues are truly inside the predicted span.
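As a quick reference, here is a small sketch of how the first two grounding metrics are typically computed over a set of predicted and ground-truth spans; the spans below are toy values only.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, thresholds=(0.3, 0.5)):
    """mIoU averages the overlap; R@t is the fraction of IoUs at or above t."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = sum(ious) / len(ious)
    recall = {t: sum(i >= t for i in ious) / len(ious) for t in thresholds}
    return miou, recall

preds = [(85.0, 170.0), (10.0, 20.0), (300.0, 330.0)]
gts   = [(91.0, 163.0), (40.0, 55.0), (301.6, 325.8)]
print(grounding_metrics(preds, gts))
```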
The competition: Strong LVLMs and RL baselines, including Qwen2.5-VL, TimeChat, VTimeLLM, Grounded-VideoLLM, VideoChat-TPO, TVG-R1, and VideoChat-R1. All models were of similar size (7B/8B) for fair comparison.
The scoreboard with context:
- NExT-GQA (short-form GVQA): Zoom-Zero achieves state-of-the-art across all grounding metrics, improving mIoU by 5.2% over the runner-up, and increasing R@0.3 and R@0.5 by 5.4% and 6.1%. Think: like moving from a B to an A in the "find the moment" class while also keeping or improving the "answer it right" grade.
- ReXTime (temporal reasoning): Zoom-Zero brings an average 4.6% lift across all grounding metrics and boosts answer accuracy too. That's like outperforming the second-best team by scoring several more points in a timed relay.
- CG-Bench (very long videos with tiny clue portions): Zoom-Zero lifts mIoG (coverage of the ground-truth clue) by 7.7% over the next best model, proving zoom-in verification helps capture the exact frames with key evidence.
- Long-video understanding (VideoMME, MLVU, LVBench, CG-Bench): The coarse-to-fine approach improves average accuracy by 6.4%, showing you can keep the global story while zooming to important details.
Surprising findings:
- Targeted rewards donât just help grounding; they also improve answering. Giving credit to the right tokens reduces the old trade-off where better grounding sometimes hurt overall accuracy.
- Even in short videos (where context fits better), a single zoom-in can still add a small but consistent accuracy bump (~0.7-0.8%).
- Inference can stay one-stage (near baseline speed) and still be better thanks to the training method; optional two-stage zoom adds more gains at modest extra time.
Ablations (what made the difference):
- TokenAdv alone improves grounding over plain GRPO, proving selective credits matter.
- Adding the zoom-in reward further boosts both answer quality and grounding.
- Combined, they deliver the best results across metrics on both NExT-GQA and ReXTime.
Takeaway: Tying correctness to the zoomed evidence and paying the right tokens for localization vs. answering is a big, general boost across datasets and video lengths.
05 Discussion & Limitations
Limitations:
- Single zoom stage: The method performs one round of coarse-to-fine zoom. Multiple iterative zooms could refine spans even more but cost more compute.
- Enforced zooming: The system doesn't yet decide adaptively when to zoom or how many times. A learned zoom policy could be smarter and faster.
- Token budget constraints: Even with zoom, extremely tiny details in noisy or low-light frames may remain unreadable.
- Annotation dependence: The strongest grounding reward (IoU) needs labeled spans. Although self-verification ideas exist, they're not the main path here.
Required resources:
- A 7B/8B LVLM backbone (e.g., Qwen2.5-VL), GPUs with enough memory (training used 8×A100 80GB), and datasets with QA and (ideally) temporal spans.
- Time for RL training with multiple sampled rollouts per prompt (e.g., G = 8).
When NOT to use it:
- If your videos are ultra-short and perfectly clear, standard models might already be enough.
- If you need instant answers with strict latency and cannot afford even optional two-stage inference.
- If your questions don't depend on precise timing (e.g., purely general descriptions).
Open questions:
- Can the model learn an adaptive, multi-step zoom policy that balances speed and accuracy on the fly?
- How far can self-supervised or self-verifying rewards go without ground-truth spans?
- Can we combine zoom-in with retrieval or memory modules for hours-long videos?
- How should we best schedule token budgets across time and space under real-time constraints?
06 Conclusion & Future Work
Three-sentence summary: Zoom-Zero teaches video AIs to first find the likely moment for a question and then zoom in to verify the tiny details that prove the answer. It introduces a zoom-in accuracy reward that ties correctness to the predicted time span and a token-selective credit assignment so the exact parts of the output get the right learning signal. This cuts hallucinations, sharpens temporal grounding, and raises accuracy on both short and long videos.
Main achievement: Turning "find then verify" into a reinforced, token-aware training recipe that reliably links answers to actual evidence in the video.
Future directions: Learn when and how often to zoom (adaptive multi-stage zooming), scale to even longer videos via memory/retrieval, and explore self-verification to reduce reliance on span annotations. Also, optimize token budgeting policies that switch fluidly between global coverage and local detail.
Why remember this: Zoom-Zero shows that a simple, human-like workflow (scan first, zoom later, and reward the right steps) can make video question answering more trustworthy, accurate, and efficient across many kinds of videos.
Practical Applications
- Sports analysis: Automatically locate scoring moments and verify jersey numbers or scoreboard details.
- Education: Find and confirm the exact clip where a teacher explains a formula or shows a demo.
- Security review: Highlight the precise time of an event and verify the object or person involved.
- Customer research: Pinpoint the moment a product feature is shown and read on-screen text accurately.
- News summarization: Extract and verify key moments from long press briefings or debates.
- Healthcare training: Identify the step in a procedure video and confirm instrument labels or readouts.
- Compliance audits: Find when a safety step occurred and verify warning signs are visible.
- Media production: Quickly retrieve scenes and confirm fine details (e.g., prop labels, signs) for editing.
- E-commerce video QA: Find the moment a size or price is displayed and verify the text exactly.
- Esports/VOD platforms: Jump to the exact highlight and confirm the final call via zoomed replay.