
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Intermediate
Jitesh Jain, Jialuo Li, Zixian Ma et al. · 12/15/2025
arXiv | PDF

Key Summary

  • SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, just like how people skim or rewind long videos.
  • It uses tools (like web search, speech-to-text, event finding, and frame extraction) to gather clues before answering, instead of staring at every frame.
  • A special ā€œorchestratorā€ model (SAGE-MM) chooses between single-turn and multi-turn reasoning so the system works well on short and long videos (any-horizon).
  • The team created a fast, low-cost way to make training questions for long videos using Gemini-2.5-Flash, saving about 100Ɨ in cost and 10Ɨ in time vs. humans.
  • They trained SAGE-MM with reinforcement learning and an LLM-as-a-judge reward so it learns to use tools wisely and handle open-ended questions.
  • They built SAGE-Bench, a new test set of long, real YouTube videos with many open-ended questions that feel like real life, not just multiple-choice quizzes.
  • SAGE improves accuracy by up to 6.1% on open-ended tasks and by up to 14.6% on 10–20 minute videos compared to strong baselines.
  • SAGE runs faster than many previous agent systems because it searches in segments and only calls tools when needed.
  • Ablations show speech transcription and visual frame extraction are especially important, while current temporal grounding tools remain a bottleneck.
  • This work suggests agent-style, tool-using video reasoning with RL is a better fit for long, real-world videos than single-turn models.

Why This Research Matters

People watch long, real videos and ask open-ended questions, not just multiple-choice ones. SAGE shows how an AI can act like a smart viewer—skim, search, and zoom in—to answer faster and better. It reduces wasted compute by focusing only where needed, which saves time and money. It also uses fair, semantic grading so natural-language answers can be judged well, not just by exact wording. This makes AI video assistants more practical for entertainment, education, and news. In short, SAGE is a concrete step toward everyday, useful, video-savvy AI.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re watching a two-hour sports video but only care about one goal. You’d skim, search, jump around, and only watch closely at the right time. You wouldn’t watch every second in one go.

🄬 The Concept (Any-Horizon Video Reasoning, the world before): Before this paper, most video AI tried to answer in a single shot: the model sees a big pile of frames and blurts out an answer in one turn. That’s like watching the whole movie to answer one small question. It works for short clips, but it’s slow, costly, and often misses details in long videos. How it worked:

  1. Sample many frames from a long video.
  2. Feed them into a big model.
  3. Predict the final answer in one step.
Why it mattered but broke down: It needed lots of compute, struggled with really long videos, and didn’t adapt—simple questions wasted time, and hard questions lacked investigation.

šŸž Anchor: Think of asking, ā€œWhat color is the new Ferrari livery?ā€ A one-turn model tries to digest the entire event video, even though the answer appears in a small segment.

šŸž Hook: You know how people solve tricky puzzles by asking smaller questions (ā€œWho won last year?ā€), rewinding, and zooming in on the right part of the video?

🄬 The Concept (The Problem): Researchers needed agents that flexibly choose between quick answers (for easy, short tasks) and multi-step searches (for long, tricky tasks). But three blockers stood in the way:

  1. Data: Getting good question–answer (QnA) pairs for long videos is expensive and slow.
  2. System design: Most agents over-relied on ā€œtemporal groundingā€ (finding the exact moment) across the whole video, which is hard and inaccurate when videos are long.
  3. Training: Reinforcement learning (RL) rewards for open-ended answers are tricky because you can’t just do string matching like multiple-choice.
Why it matters: Without solving these, agents can’t behave like humans—sometimes skimming, sometimes diving deep.

šŸž Anchor: It’s like training a librarian to find a sentence in a series of books. You need good practice questions, a plan to search sensibly (not read every page), and a fair way to grade their answers.

šŸž Hook: Imagine trying different hacks that still don’t feel right—like cutting a 1-hour video into 120 tiny clips and processing them one by one.

🄬 The Concept (Failed Attempts):

  • Bottom-up synthetic data: Generating QnA by chopping videos into many short subclips and stitching results. It’s cheaper than humans but slow and resource-heavy.
  • Over-engineered agents: Heavy temporal grounders scanning entire videos, or systems optimized for multiple-choice but weak on open-ended questions.
  • RL for single-turn models: Using string overlap rewards helps MCQ, but fails to grade free-form answers.
Why it breaks: These methods are slow, not robust on long videos, and don’t handle real-world, open-ended questions.

šŸž Anchor: It’s like searching a 600-page book by scanning every paragraph with a magnifying glass and then grading only if you picked the exact same words.

šŸž Hook: What would a human do instead? Use outside knowledge, scan just the likely parts, and check more carefully only when needed.

🄬 The Concept (The Gap this paper fills):

  • A tool-using agent that decides between single-turn and multi-turn reasoning (any-horizon), guided by an orchestrator model.
  • A fast, long-context synthetic QnA pipeline using Gemini-2.5-Flash that covers the whole video and is cheap to produce.
  • An RL recipe with step-wise and final rewards using an LLM-as-a-judge so open-ended answers are fairly graded.
  • A new benchmark, SAGE-Bench, with long YouTube videos and many open-ended questions.
Why it matters: Now the agent can skim or dive, use tools like web search and speech transcription, and get trained on realistic data with fair rewards.

šŸž Anchor: Like a smart detective with internet access, a notepad, timestamps, and a fair referee to check if the explanation and answer make sense.

šŸž Hook: Why should you care?

🄬 The Concept (Real Stakes):

  • Faster answers for short clips and better accuracy on long videos.
  • Lower compute bills (don’t process the whole video every time).
  • Realistic Q&A like what people ask while watching: ā€œWho else appeared besides Will Smith?ā€
  • More useful assistants for entertainment, education, and news.
Why it matters: This is a stepping stone from lab demos toward everyday video helpers.

šŸž Anchor: Imagine asking your TV, ā€œShow me when the chef adds garlic, and what brand of mayo is that?ā€ The system jumps to the right moment, reads the label, and answers in seconds.

02 Core Idea

šŸž Hook: You know how you choose between skimming a short TikTok or carefully scrubbing through a long movie to find the best scene?

🄬 The Concept (Aha! in one sentence): Teach a tool-using video agent to pick the right horizon—answer now if it’s easy, or take multiple steps with tools if it’s hard—and train this habit with reinforcement learning and fair judging. How it works (big picture):

  1. Use an orchestrator (SAGE-MM) that first understands the context and intent.
  2. If needed, it calls tools (web search, speech transcription, event grounding, frame extraction, analysis) in multiple steps.
  3. A special RL recipe rewards good tool use, valid JSON actions, and correct final answers (judged semantically by an LLM), so it learns any-horizon habits.
Why it matters: Without this, systems either overthink easy questions or underthink hard ones.

šŸž Anchor: Like a coach who trains you when to sprint and when to pace yourself in a marathon.

Three analogies:

  1. Librarian: Skim the table of contents (Stage-1), then flip to likely chapters and read a few pages (Stage-2 with tools) before answering.
  2. Treasure hunt: Check the map (context), then pick the right tools—compass (web), shovel (frames), and notes (transcripts)—to find the treasure (answer).
  3. Road trip: Short trip? Go straight there. Long journey? Plan fuel stops (timestamps), check directions (search), and listen to traffic updates (speech transcription).

Before vs After:

  • Before: One-shot answers on piles of frames; slow, rigid, poor on open-ended questions.
  • After: Tool-using agent that adapts per question and per video length; better accuracy on long videos and open-ended questions with less wasted compute.

Why it works (intuition):

  • Decomposition: Hard video questions become a sequence of smaller tool calls (find the part, extract frames, transcribe speech, then reason).
  • Fair rewards: An LLM judge checks meaning, not just exact words, so open-ended answers can be graded.
  • Any-horizon habit: RL balances single-turn vs multi-turn behavior so the agent doesn’t overcall tools or rush to answer.

Building Blocks (with mini sandwiches):

šŸž Hook: Imagine a conductor deciding which instruments play when. 🄬 The Concept: Orchestrator (SAGE-MM) is the model that decides whether to answer now or call a tool next. How it works: Stage-1 forms video context and intent; Stage-2 iteratively decides on tools and final answer in JSON. Why it matters: Without a smart orchestrator, the system either freezes or flails. šŸž Anchor: Like a director who says, ā€œFirst, search. Next, jump to 05:35. Now, extract frames. Finally, answer.ā€

šŸž Hook: You don’t read an entire 2-hour book to find one sentence—you search the right chapter. 🄬 The Concept: Segment-level temporal grounding finds events within a shorter window (≤10 minutes) instead of the whole video. How it works: Predict coarse bounds, then pinpoint in that slice. Why it matters: Full-video grounding is error-prone and slow; segments are faster and more accurate. šŸž Anchor: Jump to 05:34–05:38 for ā€œrace start,ā€ then analyze frames there.

šŸž Hook: Sometimes you need to read the captions or Google a name you see on screen. 🄬 The Concept: Tool-using reasoning combines web search, speech transcription, frame extraction, and analysis. How it works: The orchestrator picks the tool sequence; each tool returns evidence; the agent updates its plan. Why it matters: Vision alone can miss speech clues or external facts. šŸž Anchor: ā€œAny celebs besides Will Smith?ā€ → transcribe speech, search names, confirm visually.

šŸž Hook: Grading essays by exact word match is unfair; you grade meaning. 🄬 The Concept: LLM-as-a-judge rewards check if the final answer is semantically correct. How it works: A strong LLM gives True/False on meaning plus step-wise format/tool-use rewards. Why it matters: Open-ended answers need semantic grading. šŸž Anchor: ā€œPurple superconducting ring at Mars L1ā€ matches ā€œmagnetic umbrella ring that deflects solar wind.ā€

šŸž Hook: Practice tests should cover the whole book. 🄬 The Concept: Long-context synthetic QnA covers full videos quickly and cheaply using Gemini-2.5-Flash. How it works: Generate 10–20 QnA per video with a percent_video_parsed field to enforce coverage. Why it matters: High-quality, affordable data for both SFT and RL. šŸž Anchor: For a 40-minute science video, questions span from minute 3 visuals to minute 35 narration.

šŸž Hook: Real driving tests use real roads, not only parking lots. 🄬 The Concept: SAGE-Bench evaluates on long, real YouTube videos with many open-ended questions. How it works: Manually verified QnA, durations often >10 minutes, judged by LLM for fairness. Why it matters: Tests what users actually ask while watching. šŸž Anchor: ā€œDescribe the black hole disk color shift before it breaks openā€ beats ā€œPick A/B/C/D.ā€

03Methodology

At a high level: Inputs (sampled frames, metadata, tools, question) → Stage-1 (context and initial decision) → Stage-2 (iterative tool-using steps) → Output (final answer or next tool call).

Step-by-step (with sandwiches for key components):

Inputs:

  • 128 evenly sampled frames from the video.
  • Metadata: path and duration.
  • Tool definitions: web-search, parse-website, transcribe-speech, ground-event, extract-video-parts, analyze.
  • User question.
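
The 128-frame input can be produced with simple even sampling. The paper does not name a decoding library, so this sketch uses decord as one plausible option.

```python
import numpy as np
from decord import VideoReader  # one possible decoder; not specified by the paper

def sample_frames(video_path, num_frames=128):
    """Evenly sample `num_frames` frames across the whole video, plus timestamps."""
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(idx).asnumpy()      # shape (128, H, W, 3)
    timestamps_s = idx / vr.get_avg_fps()     # rough seconds, handy as metadata
    return frames, timestamps_s
```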

šŸž Hook: Like opening a book to the title page and summary before reading. 🄬 The Concept: Stage-1 (Context VLM role) builds video_context, clarifies query_intent, and decides whether to answer directly or call a tool. How it works:

  1. Read inputs: frames, metadata, tools, question.
  2. Produce a strict JSON with video_context, query_intent, recommended_tools or final_answer.
  3. If simple and obvious, answer now; else schedule the first tool call.
Why it matters: Without a clean first decision, later steps waste time.
šŸž Anchor: ā€œThis is an F1 recap; user asks about 2002 GP first-turn smoke; recommend web-search to locate event, then ground-event in first 10 minutes.ā€
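
Here is an illustrative Stage-1 action for the F1 example. The field names follow this article, but the exact schema and the tool arguments are assumptions, not taken from the paper.

```python
stage1_action = {
    "video_context": "Formula 1 season recap with commentary and on-screen graphics.",
    "query_intent": "Identify the car smoking at the first turn of the 2002 Australian GP start.",
    "recommended_tools": [
        {"name": "web-search",
         "args": {"query": "2002 Australian Grand Prix first corner incident"}},
        {"name": "ground-event",
         "args": {"query": "race start", "start": "00:00", "end": "10:00"}},
    ],
    # A trivially easy question would instead produce: {"final_answer": "..."}
}
```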

šŸž Hook: Think of a detective taking notes after each clue. 🄬 The Concept: Stage-2 (Iterative Reasoner) repeats: check if answerable; if not, call the next tool with arguments. How it works:

  1. Maintain history of tool results plus context.
  2. Each step outputs JSON with answerable verdict, recommended_tools, or final_answer.
  3. Stop after max 10 steps or when answerable.
Why it matters: Structured steps keep the search focused and efficient.
šŸž Anchor: Step 1: ground-event → Step 2: extract-video-parts (05:34–05:38) → Step 3: analyze frames → Step 4: finalize answer.
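
A minimal sketch of the any-horizon control loop, assuming hypothetical stand-ins `call_sage_mm` (the orchestrator) and `run_tool` (the tool backends); the real system emits and parses strict JSON actions at every step.

```python
import json

def sage_loop(question, frames, metadata, call_sage_mm, run_tool, max_steps=10):
    """Stage-1 sets context and makes the first decision; Stage-2 iterates."""
    history = []
    # Stage-1: answer now, or schedule the first tool call.
    action = json.loads(call_sage_mm(question, frames, metadata, history))
    for _ in range(max_steps):
        if "final_answer" in action:              # single-turn or early exit
            return action["final_answer"], history
        tool = action["recommended_tools"][0]     # Stage-2: gather more evidence
        evidence = run_tool(tool["name"], tool.get("args", {}))
        history.append({"tool": tool["name"], "result": evidence})
        action = json.loads(call_sage_mm(question, frames, metadata, history))
    # Forced stop at the step budget; answer with whatever has been gathered.
    return action.get("final_answer", ""), history
```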

Supported tools (how and why they’re used):

  • web-search: find external info quickly (e.g., event dates, names). Without it, the agent may miss context not visible in frames.
  • parse-website: read a result page deeply. Without it, snippets might be too shallow.
  • transcribe-speech: get quotes and names from audio. Without it, verbal details are lost.
  • ground-event: find timestamps for an event in a segment (≤10 minutes). Without it, the agent wastes time scanning the whole video.
  • extract-video-parts: pull frames or subclips from a range to inspect closely. Without it, the agent can’t zoom into the right spot.
  • analyze: visual Q&A over selected media. Without it, extracted frames aren’t turned into answers.
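
One way to wire these six tools together is a small dispatch registry, sketched below. The `backends` dict and its argument names are hypothetical; in practice it would hold a search API, a page parser, an ASR model (e.g., Whisper), a temporal grounder, a frame extractor, and a visual QA model.

```python
def make_tool_registry(backends):
    """Map the six tool names to host-supplied callables (all hypothetical here)."""
    return {
        "web-search":          lambda a: backends["search"](a["query"]),
        "parse-website":       lambda a: backends["parse"](a["url"]),
        "transcribe-speech":   lambda a: backends["asr"](a["video"], a.get("start"), a.get("end")),
        "ground-event":        lambda a: backends["ground"](a["query"], a["start"], a["end"]),
        "extract-video-parts": lambda a: backends["extract"](a["video"], a["start"], a["end"]),
        "analyze":             lambda a: backends["vqa"](a["media"], a["question"]),
    }

def run_tool(registry, name, args):
    """Dispatch one orchestrator-requested tool call and return its evidence."""
    return registry[name](args)
```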

Example with actual data (F1 smoke at first turn):

  • Stage-1: ā€œUser asks about 2002 Australian GP start; recommend web-search for confirmation; then ground-event in 00:00–10:00.ā€
  • Stage-2: ground-event → returns 05:34–05:38; extract-video-parts for those seconds; analyze frames → ā€œblue/teal car visibly smokingā€; final_answer.

šŸž Hook: A student doesn’t learn good habits by accident; they need feedback. 🄬 The Concept: RL Post-Training with multi-reward (GRPO). How it works:

  1. Cold-start SFT: Fine-tune SAGE-MM on synthetic tool-call trajectories so it learns basic JSON formatting and tool use.
  2. RL with GRPO: Generate multiple full trajectories; compute one reward per trajectory; assign it to all steps.
  3. Step-level rewards: format (+0.05/āˆ’0.10), reasonable-tool (+0.10/āˆ’0.10), args-repeat (āˆ’0.05 Ɨ sqrt(repetitions)), args-valid (āˆ’0.10 if invalid).
  4. Final accuracy reward: Use an LLM judge (GPT-4o) to mark answer semantically correct/incorrect, with small bonuses if visual tools were used correctly (+1.25) and penalties for wrong answers with tool calls (āˆ’0.5) or invalid JSON (āˆ’2.0).
  5. Training stability: Limit max steps early (Nmax=6 for first 100 steps), then allow up to 11.
Why it matters: Without multi-part rewards and semantic judging, the agent either spams tools or guesses too soon.
šŸž Anchor: The agent gets tiny rewards for tidy JSON and smart tool choices, but the big prize only if the final answer is semantically right.
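
The following sketch combines the reward terms as this article summarizes them. The base value for a correct answer (1.0 here) and the exact trigger conditions are assumptions; the paper's reward is more detailed.

```python
import math

def trajectory_reward(steps, final_json_valid, answer_correct,
                      used_visual_tools, made_tool_calls):
    """Multi-reward sketch: step-level shaping plus a judge-graded final reward.
    Each `step` is a dict of booleans/counters recorded during the rollout."""
    r = 0.0
    for s in steps:                                    # step-level shaping
        r += 0.05 if s["format_ok"] else -0.10         # tidy JSON action
        r += 0.10 if s["tool_reasonable"] else -0.10   # sensible tool choice
        r -= 0.05 * math.sqrt(s["arg_repetitions"])    # repeated arguments
        if not s["args_valid"]:
            r -= 0.10                                  # malformed tool arguments
    if not final_json_valid:
        return r - 2.0                                 # invalid final JSON
    if answer_correct:                                 # graded by the LLM judge
        r += 1.25 if used_visual_tools else 1.0        # bonus for correct visual tool use
    elif made_tool_calls:
        r -= 0.5                                       # wrong answer despite tool calls
    return r                                           # one reward per GRPO trajectory
```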

šŸž Hook: Practice questions should cover the whole video, not just the beginning. 🄬 The Concept: Synthetic QnA generation with long-context model (Gemini-2.5-Flash). How it works:

  1. For each video, generate 10–20 QnA that include visual-only, verbal-only, and both.
  2. Include a percent_video_parsed field in each QnA to push coverage across the timeline.
  3. Manually spot-check a subset; found only ~5% errors.
Why it matters: Cheap, fast, and covers full videos—critical for SFT and RL.
šŸž Anchor: A 40-minute video might get questions at 03:10 (visual), 12:45 (verbal), 28:05 (both), and 35:20 (visual), ensuring wide coverage.
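
For illustration, synthetic records for a long video might look like the sketch below. The field names follow this article (percent_video_parsed enforces timeline coverage), but the questions, answers, and exact schema are made up, not taken from the paper.

```python
synthetic_qna = [
    {"question": "What diagram appears when the narrator introduces orbital mechanics?",
     "answer": "A two-body orbit sketch with the transfer ellipse highlighted.",
     "question_type": "visual", "percent_video_parsed": 8},    # early in the video
    {"question": "What reason does the narrator give for choosing a polar orbit?",
     "answer": "It lets the probe image the entire surface as the planet rotates.",
     "question_type": "verbal", "percent_video_parsed": 32},   # middle of the video
    {"question": "What happens on screen while the narrator explains aerobraking?",
     "answer": "An animation shows the spacecraft skimming the upper atmosphere.",
     "question_type": "both", "percent_video_parsed": 70},     # late in the video
]
```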

ā€œSecret Sauceā€ summary:

  • Any-horizon control: the orchestrator learns when to stop early and when to continue.
  • Segment-first grounding: search ≤10-minute windows to avoid full-video scans.
  • Multi-reward RL + LLM judge: teaches both good process and correct meaning.
  • Realistic benchmark: test on long, real YouTube content with many open-ended questions.

04 Experiments & Results

šŸž Hook: If a class takes both multiple-choice and open-ended tests, it’s not enough to ace only the bubbles—you must explain your thinking too.

🄬 The Concept (The Test): The authors evaluate SAGE on SAGE-Bench, a new set of 1,744 manually verified QnA spanning long YouTube videos (average ~12 minutes). Many questions are open-ended, not just multiple-choice. They also test on MINERVA, a separate long-video benchmark. How it works:

  • Inputs: 128 frames for fairness.
  • Evaluation: LLM-as-judge (GPT-4o) checks semantic correctness for both MCQ and open-ended.
  • Durations: Results are broken down by video length (0–60s up to 2400+s) to see long-horizon benefits.
Why it matters: Real users ask open-ended questions on long videos; the test should match that.

šŸž Anchor: Instead of only ā€œPick A/B/C,ā€ questions include ā€œDescribe the color changes before the black hole ā€˜breaks open’,ā€ which needs real observation.

Competition (Baselines):

  • DIRECT single-turn models (Gemini-2.5, Qwen3-VL, Video-R1, VideoRFT, Video-Thinker).
  • AGENT systems (VideoMind, VideoExplorer, LVAgent, VideoChat-R1.5, VideoAgent).
  • Variants of SAGE using different orchestrators (open-source Qwen/Molmo, API models Gemini-2.5-Flash, GPT-4o) and tool backends.

Scoreboard with context:

  • Overall: SAGE with Qwen3-VL-8B + SFT + RL reaches 68.0% (like boosting from a B- to a solid B+/A-), and SAGE-Flash (using Gemini tools) hits 71.8% overall—beating their corresponding base models.
  • Open-ended: SAGE improves by up to 6.1% over baselines on open-ended questions.
  • Long videos: Gains grow with length—up to +8.2% for 10–20 minute videos, and up to +14.6% when using Gemini tools (SAGE-Flash) in that range.
  • MINERVA: SAGE outperforms other reasoning models and improves on longer videos (>600s) by ~2.6% against comparable baselines.
Interpretation: The agent’s ability to choose any-horizon paths and use tools selectively pays off most on long, realistic content.

Surprising findings:

  • Many existing AGENT systems did well on MCQ but lagged on open-ended questions, suggesting they were over-optimized for narrow formats.
  • Temporal grounding tools remain a weak link; removing ground-event only slightly hurts, but removing extract-video-parts or transcribe-speech hurts a lot more—evidence that selective seeing and hearing matter most today.
  • SFT alone makes the model overcall tools (it hesitates to answer directly), while RL balances this, bringing the ratio of single-turn to multi-turn reasoning closer to a strong expert (Gemini-2.5-Flash).

Efficiency and runtime:

  • SAGE’s per-sample runtime is comparable to that of single-turn models processing ~512 frames, with much higher accuracy, and it is far faster than many AGENT baselines that pre-process every subclip.
  • Reason: SAGE searches by segments and calls tools only when needed, avoiding heavy, repeated passes over entire videos.

šŸž Anchor: Like a student who knows when to scan chapters versus reading every page, SAGE finishes the test faster and still writes better essays on the long questions.

Ablations (what matters most):

  • Drop transcribe-speech: big drop on verbal/both questions.
  • Drop extract-video-parts or analyze: big drop on visual questions.
  • Drop ground-event: minor drop today (tool accuracy bottleneck), pointing to an area for future improvement.
  • Eval mode: Using SAGE as an AGENT at inference outperforms forcing it into DIRECT single-turn mode.
  • Nmax: Limiting to 11 steps is enough; RL already nudges the agent to answer within this horizon.

Takeaway: The combination—smart orchestrator, segment-first search, any-horizon RL training, and semantic judging—delivers steady, significant gains, especially where it counts: long, open-ended, real-world videos.

05 Discussion & Limitations

Limitations:

  • Temporal grounding quality: The current ground-event tool is a bottleneck; SAGE helps by searching in segments, but better grounders would likely boost performance further.
  • JSON robustness: Occasionally the orchestrator outputs malformed JSON; the system retries with temperature, which can add slight nondeterminism.
  • Reward dependence on LLM-as-judge: Semantic judging is powerful but relies on a strong, external LLM; mismatches or biases in the judge could affect training and evaluation.
  • Domain focus: Training and evaluation lean toward entertainment YouTube videos; broader domains (medical, surveillance, education at scale) need more data.
  • Resource needs: Training used multiple H100 GPUs; although inference is efficient for an agent, production-scale deployment still needs careful engineering.

Required resources to use SAGE:

  • A capable multimodal LLM with function calling as SAGE-MM (e.g., Qwen3-VL variants).
  • Access to tools (web search API, ASR like Whisper, a visual analyzer, and a temporal grounder) and storage for extracted frames/clips.
  • Optional: An LLM judge (e.g., GPT-4o) for semantic evaluation or RL training.

When not to use:

  • Strictly deterministic, audited pipelines where retries or external APIs aren’t permitted.
  • Closed-network environments without web access or ASR where crucial tools can’t be used.
  • Tasks requiring precise, legally verifiable quotes where LLM judging may be insufficient—you’d need verifiable rewards.

Open questions:

  • Can we create stronger, long-video temporal grounding that works reliably at full-video scale?
  • Can we reduce reliance on closed-source judges by training robust open-judge models or building verifiable rewards for open-ended tasks?
  • How far can agent-centric RL algorithms push any-horizon efficiency and accuracy?
  • Can the system learn to choose or even build new tools on the fly when needed?
  • How well does SAGE generalize to domains beyond entertainment, like science lectures or sports analytics at broadcast scale?

06 Conclusion & Future Work

Three-sentence summary: SAGE is a tool-using, any-horizon video agent that decides when to answer immediately and when to take multiple steps, trained with a multi-reward RL recipe and an LLM-as-judge. A fast, long-context synthetic QnA pipeline trains the orchestrator and keeps costs low while ensuring full-video coverage. On SAGE-Bench and MINERVA, SAGE outperforms strong baselines, especially on long and open-ended questions, with better efficiency than prior agents.

Main achievement: Showing that an orchestrated, tool-using agent trained with semantic rewards is a better fit for long, real-world video reasoning than single-turn models—and that this can be trained practically with synthetic data and RL.

Future directions:

  • Stronger temporal grounding and broader domains.
  • More advanced agent-centric RL algorithms.
  • Automatic tool selection and tool invention.
  • Open, robust LLM-judge alternatives or verifiable reward schemes for open-ended tasks.

Why remember this: It’s a blueprint for how video AIs can think more like people—skim when it’s easy, investigate when it’s hard, use the right tools at the right time—and it shows a practical path (data + system + RL) to make that behavior real.

Practical Applications

  • Second-screen sports assistant: Ask about key moments (goals, pit stops) and get instant, timestamped answers.
  • Cooking helper: ā€œWhen does the chef add garlic?ā€ SAGE jumps to the right segment and shows the frame.
  • News verification: Transcribe precise quotes and cross-check names via web search for context.
  • Lecture summarizer: Find and explain key visuals and spoken definitions across a long class video.
  • Entertainment Q&A: Identify cameos, outfits, or props by extracting frames and reading labels.
  • Event recap finder: Locate intros, finales, or highlights in long livestreams without scanning everything.
  • Customer support: Triage long tutorial videos by finding the relevant troubleshooting segment fast.
  • Content tagging: Auto-tag long videos with timestamps for chapters based on detected events and speech.
  • Education tools: Let students ask open questions about science videos and get explanations grounded in frames and transcripts.
  • Creator analytics: Quickly surface where audience-interesting moments (e.g., jokes, reveals) occur to improve editing.
Tags: any-horizon reasoning, video agents, temporal grounding, tool use in VLMs, reinforcement learning, GRPO, LLM-as-a-judge, synthetic QnA generation, long video understanding, open-ended evaluation, SAGE-Bench, multimodal orchestration, speech transcription, frame extraction, web-search integration