LongVideoAgent: Multi-Agent Reasoning with Long Videos

Intermediate
Runtao Liu, Ziyi Liu, Jiaqi Tang et al. · 12/23/2025
arXiv · PDF

Key Summary

  ‱ LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
  ‱ A master planner LLM decides what to do next, a grounding agent finds the right moment in the video, and a vision agent reads fine details from those frames.
  ‱ Instead of squeezing the whole video into a short summary, the system looks only where it needs to, when it needs to.
  ‱ The master agent is trained with reinforcement learning to take neat, well‑formed steps and stop using tools when it already has enough evidence.
  ‱ On the new LongTVQA and LongTVQA+ datasets (made from full episodes), the method beats strong single‑model baselines by clear margins.
  ‱ Adding grounding and then targeted vision yields big gains: in a controlled study, accuracy rose from 64.3% (no agents) to 74.8% (grounding + vision).
  ‱ Reinforcement learning further boosts smaller open models; for example, Qwen2.5‑7B improves from 46.1% to 60.2% on one benchmark.
  ‱ The system’s step‑by‑step traces are easy to read, so you can see exactly which clip was checked and what visual facts were used.
  ‱ A small step budget (about five actions) and a narrow window (one clip) already work well, with larger windows helping a bit more.
  ‱ This approach shows that smart planning plus the right tools can handle long, messy videos better than one big model trying to read everything at once.

Why This Research Matters

Long videos are everywhere—classes, meetings, TV, sports, and security—and most of us don’t have time to rewatch everything to answer one question. LongVideoAgent shows a practical way to jump straight to the moment that matters and read just the needed details, saving time and improving accuracy. Its step-by-step traces make decisions transparent, so users can trust how answers were found. Smaller open models benefit a lot from this method, making strong long-video reasoning more accessible. With future additions like audio and knowledge, this approach could power reliable assistants for studying, media analysis, compliance checks, and more.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you have to answer a question about a whole TV episode that’s an hour long. Would you watch every second again, or skip straight to the parts that matter?

đŸ„Ź The Concept (Long‑Video QA): It’s answering questions about very long videos where the clues are scattered across many scenes and times. How it works:

  1. You search the entire episode for the right moment.
  2. You check both what people say (subtitles) and what you see (frames).
  3. You put the clues together to answer.

Why it matters: Without careful searching, you’ll miss tiny cues (like a small object or a quick glance) and answer wrong.

🍞 Anchor: A question like “Where is Sheldon sitting when he’s with a man?” might need one specific nighttime scene on a bench; scanning the whole episode blindly is slow and error‑prone.
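
To make the setup concrete, here is a minimal Python sketch of how one episode‑level question might be represented. The field names (`SubtitleLine`, `EpisodeQuestion`, and so on) are illustrative assumptions, not the paper’s actual data schema.

```python
from dataclasses import dataclass

@dataclass
class SubtitleLine:
    start_s: float   # when the line starts, in seconds
    end_s: float     # when the line ends, in seconds
    text: str        # what is said

@dataclass
class EpisodeQuestion:
    episode_id: str                 # e.g., "s05e06" (hypothetical ID format)
    subtitles: list[SubtitleLine]   # full-episode transcript with timestamps
    question: str                   # e.g., "Where is Sheldon sitting when he is with a man?"
    options: list[str]              # multiple-choice candidates
    answer_idx: int                 # index of the correct option (used for training/evaluation)
```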

🍞 Hook: You know how teachers pause a documentary at the exact moment something important happens?

đŸ„Ź The Concept (Temporal Localization): It’s finding when in the video the needed moment happens. How it works:

  1. Read the question and subtitles.
  2. Predict the most relevant time span.
  3. Jump there and check details.

Why it matters: If you look at the wrong minute, even a perfect vision system won’t help.

🍞 Anchor: To answer “What does the sign say behind the cashier?”, you must skip to the checkout scene before trying to read any sign.
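
As a toy illustration of temporal localization, the sketch below scores each clip’s subtitles by word overlap with the question and jumps to the best match. This is a deliberately naive stand‑in for the paper’s learned grounding agent; the function names and example data are made up.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))

def localize(question: str, clips: list[tuple[str, str]]) -> str:
    """Return the ID of the clip whose subtitles share the most words with the question."""
    q = tokens(question)
    best_id, best_score = clips[0][0], -1
    for clip_id, subs in clips:
        score = len(q & tokens(subs))
        if score > best_score:
            best_id, best_score = clip_id, score
    return best_id

clips = [
    ("clip_14", "I can't believe you ate my sandwich."),
    ("clip_15", "It's late. Why are you sitting out here alone, Sheldon?"),
]
print(localize("Where is Sheldon sitting when he is with a man?", clips))  # -> clip_15
```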

🍞 Hook: When you solve a puzzle, you use eyes (seeing), ears (hearing), and brain (thinking) together.

đŸ„Ź The Concept (Multimodal Reasoning): It’s combining text (subtitles), images (frames), and sometimes audio to make sense of a story. How it works:

  1. Gather text and visual hints.
  2. Decide which hint matters most for the question.
  3. Use both to reason step by step.

Why it matters: Subtitles miss visual facts; visuals miss spoken facts. You need both for tough questions.

🍞 Anchor: If someone says “over there” (subtitle), you need the frame to see where “there” is.

🍞 Hook: Think of a soccer team: the striker, the goalie, and the coach each have a job.

đŸ„Ź The Concept (Multi‑Agent Framework): It’s a team of specialized AI helpers coordinated by a master planner. How it works:

  1. The master agent reads the question and plans a step.
  2. It asks a grounding agent to find the right clip.
  3. It asks a vision agent to read fine visual details.
  4. It repeats until there’s enough evidence, then answers.

Why it matters: One model doing everything often compresses or overlooks details in hour‑long videos.

🍞 Anchor: Instead of one student cramming a whole book, a team splits tasks—one finds the page, another reads the chart, a third writes the answer.
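
A bare‑bones sketch of that division of labour, with each teammate reduced to a single stubbed method. In the real system each role is a model call (an LLM for the master, a grounding model, a vision‑language model), not the hard‑coded returns shown here; all names are illustrative.

```python
class GroundingAgent:
    """Finds the clip most relevant to the question (stubbed)."""
    def ground(self, question: str, subtitles: str) -> str:
        return "<clip_15>"  # a real agent would predict this from the subtitles

class VisionAgent:
    """Reads fine-grained visual facts from frames of a grounded clip (stubbed)."""
    def inspect(self, clip_tag: str, query: str) -> str:
        return "Sheldon sits on a bench at night; a trash can and lit windows are visible."

class MasterAgent:
    """Plans the steps and writes the final answer (stubbed)."""
    def __init__(self) -> None:
        self.grounder, self.vision = GroundingAgent(), VisionAgent()

    def answer(self, question: str, subtitles: str) -> str:
        clip = self.grounder.ground(question, subtitles)           # step 1: find the moment
        facts = self.vision.inspect(clip, "Where is he sitting?")  # step 2: look closely
        return f"{clip}: {facts} -> likely a bus stop."            # step 3: answer

print(MasterAgent().answer("Where is Sheldon sitting when he is with a man?", "..."))
```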

🍞 Hook: When you bake, you follow a recipe step by step, not all at once.

đŸ„Ź The Concept (Step‑Wise Reasoning): It breaks a big task into small, ordered actions. How it works:

  1. Plan the next best move.
  2. Execute it (ground or read visuals).
  3. Check what you learned.
  4. Stop when you have enough to answer.

Why it matters: Without steps, the model wastes time and misses the goal.

🍞 Anchor: First find the right clip, then check objects in that clip, then answer.

Before this work, many systems tried to stuff an entire video into a short, lossy summary for a single pass. That often erased fine‑grained evidence (like a tiny badge on a shirt) and stretched the model’s memory. Other agent systems started to plan but used limited tools (mostly captions or basic retrieval), which couldn’t capture subtle objects, faces, or timing. The missing piece was a planner that could coordinate stronger, specialized tools, take only a few smart steps, and know when to stop. This paper fills that gap with a multi‑agent framework plus reinforcement learning to shape neat, efficient planning.

🍞 Hook: You know how stickers or points can train you to do homework neatly and on time?

đŸ„Ź The Concept (Reinforcement Learning): It teaches the master agent good habits by rewarding correct, well‑formed steps and right answers. How it works:

  1. The agent tries a sequence of actions (ground, read, answer).
  2. It gets small rewards for clean step format and a final reward for a correct answer.
  3. Over time, it learns to plan better and avoid wasteful tool calls.

Why it matters: Without feedback, the agent may ramble, over‑query, or answer too soon.

🍞 Anchor: The system learns that “Ground → Visual read → Answer” often beats “Visual read over and over → Guess.”

Real stakes: Long videos are everywhere—classes, meetings, sports, TV, and security footage. A system that can jump to the right spot and pull precise facts saves you from endless scrolling and guessing, making video knowledge searchable, checkable, and useful in daily life.

02Core Idea

The “aha!” moment: Don’t summarize everything—teach a master planner to smartly coordinate a finder (grounding) and a look‑closer (vision) and reward it for short, correct plans.

Three analogies:

  1. Treasure hunt: The master is the captain, the grounding agent is the map that marks “X,” and the vision agent is the magnifying glass that reads tiny clues on the treasure chest.
  2. Librarian team: One librarian pinpoints the exact shelf (grounding), another checks the paragraph for a keyword (vision), while the head librarian writes the final answer.
  3. Sports play: The coach (master) calls a play, the scout finds the right opponent footage (grounding), and the analyst freeze‑frames key moments (vision) before the coach makes the call (answer).

Before vs. After:

  • Before: Single big models took a single gulp of a long video—often downsampled or compressed—losing small but crucial evidence. Agent systems existed but leaned on weaker tools and had little training on how to plan.
  • After: The master coordinates two stronger specialists, operates in short, clean steps, and is trained with rewards to avoid unnecessary tool calls and to answer decisively once evidence is enough. Fine details are retrieved on demand, not squashed into a summary.

Why it works (intuition):

  • Narrowing the search (grounding) reduces distraction and memory load.
  • Targeted inspection (vision) restores the precise details that summaries blur out.
  • Step‑wise planning with rewards tunes the master to balance speed and accuracy: explore only as needed, then exploit to answer.
  • Clear action tags structure the conversation, keeping the agent from rambling and forcing one crisp decision per step.

Building blocks (with simple sandwiches for the two specialists):

🍞 Hook: When you’re lost in a huge mall, the info desk points you to the exact store. đŸ„Ź The Concept (Grounding Agent): It finds the time segment where the answer likely lives. How it works: (1) Read question + subtitles; (2) Propose the best episode clip; (3) Hand back a tag like <clip_15>. Why it matters: Without it, the master might comb the wrong minutes. 🍞 Anchor: For “When did she hand over the keys?”, grounding jumps to the clip with that exchange.

🍞 Hook: Detectives zoom into photos to spot fingerprints. đŸ„Ź The Concept (Vision Agent): It reads fine‑grained visual facts (objects, faces, actions, on‑screen text) inside the grounded clip. How it works: (1) Receive the clip tag and a focused query; (2) Inspect selected frames; (3) Return concise textual observations. Why it matters: Subtitles won’t tell you what’s written on a sign or which hand holds a cup. 🍞 Anchor: “What side of the bed is closer to the window?” is solved by vision reading the room layout.

Finally, the master agent (with reinforcement learning) learns a tidy habit: If there’s no clip yet—ground. If the clip lacks visuals—ask vision. If evidence is enough—answer.
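
That habit can be caricatured as a tiny decision rule. This is a hedged sketch: the real policy is a trained LLM, not an if/else, and the three flags are assumptions about what the master tracks in its context.

```python
def next_action(has_clip: bool, has_visual_notes: bool, evidence_sufficient: bool) -> str:
    """Caricature of the trained master's habit: ground, then look, then answer."""
    if evidence_sufficient:
        return "answer"
    if not has_clip:
        return "request_grounding"
    if not has_visual_notes:
        return "visual_query"
    return "answer"  # evidence gathered but inconclusive: commit rather than loop forever

print(next_action(False, False, False))  # request_grounding
print(next_action(True, False, False))   # visual_query
print(next_action(True, True, True))     # answer
```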

🍞 Hook: Like using a checklist so you don’t overdo steps while cooking. đŸ„Ź The Concept (Step Budget): A small limit (like five steps) keeps the plan efficient. How it works: (1) Make a move; (2) Review evidence; (3) Repeat until limit or answer. Why it matters: Unlimited steps can waste time without adding accuracy. 🍞 Anchor: Most questions resolve in 2–5 actions: ground → read → answer.

03Methodology

High‑level recipe: Input (full episode subtitles + question) → Master decides next action → Grounding agent returns a clip tag → Optional vision agent reads fine details from that clip → Evidence accumulates → Master answers.
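
In code, that recipe might look like the loop below. It is a sketch under assumptions: `master`, `ground`, and `vision` are placeholder callables standing in for the three models, and the tag strings mirror the formats described in this section rather than the paper’s exact implementation.

```python
import re
from typing import Callable

def run_episode_qa(
    question: str,
    subtitles: str,
    master: Callable[[str], str],                   # context -> one tagged action
    ground: Callable[[str, str], tuple[str, str]],  # (question, subtitles) -> (clip_tag, clip_subs)
    vision: Callable[[str, str], str],              # (clip_tag, query) -> textual observations
    max_steps: int = 5,                             # the small step budget K
) -> str:
    """Master-in-the-loop QA over one episode; model calls are injected as callables."""
    context = f"Subtitles:\n{subtitles}\n\nQuestion: {question}"
    clip_tag = ""
    for _ in range(max_steps):
        action = master(context)                    # the master emits exactly one action per turn
        if "<request_grounding>" in action:
            clip_tag, clip_subs = ground(question, subtitles)
            context += f"\nGrounded to {clip_tag}:\n{clip_subs}"
        elif "<visual_query>" in action:
            query = re.sub(r"</?visual_query>", " ", action).strip()
            context += f"\nVision notes: {vision(clip_tag, query)}"
        else:                                        # e.g., "<answer>B</answer>"
            return action
    return master(context + "\nStep budget reached; answer now.")
```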

Step 1. Initialize context

  • What happens: The master agent receives all episode subtitles and the user’s question. Context is empty except for these.
  • Why this exists: It gives the master the story outline but not the visuals yet, so it must decide what to fetch.
  • Example: Question: “Where is Sheldon sitting when he is accompanied by a man?” Subtitles alone don’t reveal the location.

🍞 Hook: You don’t read a whole book to find one quote—you jump to the right page first. đŸ„Ź The Concept (Grounding Pass): The master asks the grounding agent to localize the relevant segment. How it works:

  1. Emit <request_grounding>.
  2. Grounding returns a tag like <s05e06_seg02_clip_15> and the local subtitles.
  3. The tag pins the timeline.

Why it matters: It cuts the haystack down to a handful of needles.

🍞 Anchor: For the bench question, grounding jumps to the nighttime scene.
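
A small sketch of how such a tag could be parsed back into season, episode, segment, and clip numbers. The regex assumes the `<sXXeYY_segNN_clip_MM>` pattern shown above; if the real tags differ, the pattern would need adjusting.

```python
import re

TAG_RE = re.compile(r"<s(\d+)e(\d+)_seg(\d+)_clip_(\d+)>")

def parse_clip_tag(tag: str) -> dict[str, int]:
    """Split a grounding tag like <s05e06_seg02_clip_15> into its parts."""
    m = TAG_RE.fullmatch(tag.strip())
    if m is None:
        raise ValueError(f"unrecognized grounding tag: {tag!r}")
    season, episode, segment, clip = map(int, m.groups())
    return {"season": season, "episode": episode, "segment": segment, "clip": clip}

print(parse_clip_tag("<s05e06_seg02_clip_15>"))
# {'season': 5, 'episode': 6, 'segment': 2, 'clip': 15}
```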

Step 2. Decide if visuals are needed

  • What happens: The master checks if subtitles in the grounded clip already answer the question. If not, it asks vision.
  • Why this exists: Prevents unnecessary visual reads and saves time.
  • Example: If the subtitle says, “Meet me at the bus stop,” no need to ask vision.

🍞 Hook: If you can’t read tiny print, you grab a magnifying glass. đŸ„Ź The Concept (Targeted Vision Query): The master crafts a focused prompt for the vision agent (e.g., “Which side of the bed is near the window?”). How it works:

  1. Emit <visual_query> ... </visual_query> tied to the current clip tag.
  2. Vision inspects key frames for objects, text, or spatial relations.
  3. It returns short, factual observations.

Why it matters: Generic captions might miss the exact fact you need.

🍞 Anchor: “Window is on the left; Sheldon sits near it” → answer: left side.

Step 3. Accumulate and loop with a small step budget (K)

  • What happens: The master appends grounding tags and vision notes to its running context, then decides the next move—reground, re‑query vision more precisely, or answer.
  • Why this exists: Complex questions may need a refine‑and‑check cycle, but K (like 5) stops overthinking.
  • Example: First vision read is vague; the master asks a sharper second query focused on window position.

🍞 Hook: Practice makes neat habits. đŸ„Ź The Concept (Reinforcement Learning for Planning): The master is fine‑tuned to make crisp, valid actions and get answers right. How it works:

  1. Structural reward: +1 when an action is well‑formed (one clean tag, no extra text).
  2. Answer reward: +1 (scaled) for the correct final choice.
  3. Optimize with GRPO so the policy prefers sequences that are tidy and correct.

Why it matters: It discourages messy outputs and pointless tool calls.

🍞 Anchor: The trained master more often chooses “ground → one precise visual read → answer,” instead of multiple unfocused reads.
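
A hedged sketch of how those two reward terms could be computed for one sampled trajectory, plus the group‑normalized advantages that GRPO‑style training uses. The tag grammar, answer‑letter format, and weights here are assumptions in the spirit of the description, not the paper’s exact code.

```python
import re

# Assumed grammar for a well-formed single action (one clean tag, no extra text).
ACTION_RE = re.compile(
    r"^\s*(<request_grounding>|<visual_query>.+</visual_query>|<answer>[A-E]</answer>)\s*$",
    re.DOTALL,
)

def trajectory_reward(steps: list[str], predicted: str, gold: str,
                      w_format: float = 1.0, w_answer: float = 1.0) -> float:
    """Structural reward per well-formed step + a scaled reward for the correct final choice."""
    format_reward = sum(w_format for s in steps if ACTION_RE.match(s))
    answer_reward = w_answer if predicted == gold else 0.0
    return format_reward + answer_reward

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style: normalize each trajectory's reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

rewards = [trajectory_reward(["<request_grounding>", "<answer>B</answer>"], "B", "B"),
           trajectory_reward(["let me think...", "<answer>C</answer>"], "C", "B")]
print(rewards, group_advantages(rewards))  # [3.0, 1.0] and [1.0, -1.0]
```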

Practical details that make it work:

  • Single action per turn: The master must choose exactly one of {visual_query, request_grounding, answer}. This keeps reasoning organized.
  • Text‑only master: The master never sees raw images—only subtitles, clip tags, and the vision agent’s textual notes. This simplifies integration across different LLM backbones.
  • Window size: By default, the grounding returns one clip; larger windows (2–3 clips) can add context across rapid cuts when needed.
  • Tools can be re‑invoked: The master may reground if evidence conflicts, or issue a sharper vision query if the first was too broad.
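
A small sketch of the “exactly one action per turn” check. The allowed tags follow the set listed above; counting them with a regex is an assumption about how such validation could be done, not the paper’s implementation.

```python
import re

ACTION_TAGS = r"(<request_grounding>|<visual_query>.*?</visual_query>|<answer>.*?</answer>)"

def is_single_action(turn: str) -> bool:
    """True only if the master's turn contains exactly one recognized action tag."""
    return len(re.findall(ACTION_TAGS, turn, flags=re.DOTALL)) == 1

print(is_single_action("<request_grounding>"))                       # True
print(is_single_action("<visual_query>Which side?</visual_query>"))  # True
print(is_single_action("<request_grounding> <answer>A</answer>"))    # False: two actions
```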

Secret sauce:

  • The combo of (1) precise temporal narrowing (grounding), (2) surgical visual reads (vision), and (3) a reward‑trained planner (master) aligns efficiency with accuracy.
  • Small step limits and tidy action tags reduce drift and make the entire trajectory interpretable: you can audit which clip was checked and why.

Concrete walk‑throughs:

  • Bench example: Ground to clip_15 → Vision says “Sheldon on bench at night, trash can, windows” → Master infers bus stop → Answer.
  • Bed‑window example: Ground to bedroom clip_09 → Vision read #1 too generic → Vision read #2: “Window on left, Sheldon near it” → Answer: left side.

04Experiments & Results

The test: Can the agent team answer multiple‑choice questions about full TV episodes (LongTVQA / LongTVQA+) better than strong single‑model baselines? Metrics are Answer Accuracy (did you pick the right option?) and Grounding Accuracy (did you find the right segment?).

Competition: They compare against powerful non‑agent multimodal models (e.g., GPT‑4o, Gemini 2.5 Pro) that process many frames directly, plus the same LLM backbones used either in non‑agent mode (subtitles only) or as the master in their agentic system. This makes gains attributable to agentic planning, not just a bigger backbone.

Scoreboard with context:

  • Multi‑agent beats non‑agent on the same backbone. In a controlled ablation, adding grounding boosts accuracy from 64.3% to 69.0%, and adding the vision agent lifts it further to 74.8%—like moving from a C to a solid B+/A‑.
  • Reinforcement learning helps especially for smaller open models. For example, an open 7B model (Qwen2.5‑7B) jumps from 46.1% to 60.2% on LongTVQA and from 60.3% to 70.8% on LongTVQA+ after RL—like raising your test grade by more than a full letter.
  • Closed‑source models remain strong, but the agentic approach narrows the gap. Even compact closed models (e.g., GPT‑5‑mini) see big gains when run agentically with frames compared to their non‑agent subtitle‑only variants (up to +12.20 points in one setting).

Ablations that tell a story:

  • Step budget K: Raising K from 2 to 5 increases grounding accuracy (67.0% → 71.0%) and answer accuracy (68.30% → 73.67%), but going to 10 brings little extra. Translation: a handful of smart moves beats many meandering ones.
  • Window size: Adding adjacent clips helps disambiguate quick cuts; answer accuracy rises from 70.33% (1 clip) to 75.00% (2 clips) and 77.26% (3 clips). Bigger windows add small gains but can cost more queries and latency.
  • Vision backbone: A stronger vision agent (e.g., GPT‑4o) improves both grounding (73.30%) and answers (78.00%) over a lighter model, showing that better fine‑detail reading pays off.

Surprising findings:

  • Tidy formatting rewards matter. Rewarding each well‑formed action tag reduces chatter and keeps the plan compact, which indirectly improves final accuracy.
  • One or two targeted vision reads are often enough. Over‑querying rarely helps—the best trajectories look like “Ground once, read once or twice, answer.”
  • Master never sees raw images, yet the system still benefits hugely from vision—proving that decoupling planning (text‑only) from perception (vision tool) is practical and powerful.

In short, the agent team consistently wins by looking exactly where needed and reading exactly what matters, rather than trying to remember everything at once.

05Discussion & Limitations

Limitations:

  • Audio blind spot: The system depends on provided subtitles and doesn’t transcribe raw audio, so whispered lines or tone cues can be missed.
  • Frozen specialists: Grounding and vision agents stay fixed during RL. Joint training might improve robustness—especially on tricky visual attributes or ambiguous dialogue.
  • Simple rewards: Only step formatting and final correctness are rewarded. Richer rewards (e.g., penalizing contradictory evidence or rewarding fewer tool hops) might further refine planning.

Required resources:

  • A capable master LLM (open or closed), plus a grounding tool and a strong vision model.
  • GPU time if you fine‑tune with RL (e.g., hours on several GPUs for 3B–7B open models).
  • Access to episode‑level subtitles and a way to sample frames from grounded clips.

When not to use:

  • Ultra‑short clips where a single pass already captures everything.
  • Domains with no subtitles and very weak visual signal, unless you add ASR or other modalities.
  • Real‑time constraints that forbid even a few grounding/vision calls.

Open questions:

  • How to fuse raw audio (ASR) and external knowledge (e.g., show lore) without bloating steps?
  • Can we co‑train grounding + vision with the master for end‑to‑end improvements while keeping interpretability?
  • What curated rewards best capture “good reasoning,” like consistency checks or evidence citations?
  • How to scale to day‑long streams (sports tournaments, security feeds) with hierarchical plans and memory?

06Conclusion & Future Work

Three‑sentence summary: LongVideoAgent turns long‑video QA into a team sport: a master planner grounds the right moments, asks a vision specialist for precise details, and stops once it has enough evidence. Reinforcement learning teaches the master to keep steps tidy and answers correct, avoiding wasteful tool calls. On full‑episode datasets (LongTVQA/LongTVQA+), this approach beats strong non‑agent baselines and produces interpretable traces.

Main achievement: Showing that coordinated, reward‑trained multi‑agent planning reliably outperforms single‑pass or weak‑tool approaches on hour‑long videos by fetching just‑in‑time evidence instead of compressing everything.

Future directions: Add audio (ASR) and world knowledge, refine grounding to be finer and faster, jointly train the specialists, and explore richer rewards that capture reasoning quality and evidence faithfulness.

Why remember this: It marks a shift from “summarize the whole video” to “plan, localize, and read what matters,” proving that smart coordination plus minimal, targeted perception can unlock accurate, efficient understanding of long, messy videos.

Practical Applications

  ‱ Lecture assistants that jump to the exact minute a concept was explained and extract the key diagram or equation.
  ‱ TV and movie search that finds the precise scene where a character reveals a clue and reads on-screen text.
  ‱ Sports analysis that locates a play and identifies player positions or scoreboard numbers on demand.
  ‱ Meeting summarization that pinpoints decisions and reads slides to capture figures or action items.
  ‱ Customer support QA that finds when a product was shown or a feature was mentioned in demo videos.
  ‱ Compliance auditing that checks exact segments for required disclosures or safety labels in long broadcasts.
  ‱ Education tools that link homework questions to the right lecture snippet and annotate visuals from that snippet.
  ‱ News verification that grounds controversial claims to the specific aired segment and extracts supporting visuals.
  ‱ Video cataloging that auto-tags episodes by grounded events (e.g., ‘key handoff’, ‘window-left bedroom’) for faster retrieval.
  ‱ Safety monitoring that finds the moment a rule was broken and reads signage or badge IDs from that clip.
#long video question answering · #multi-agent reasoning · #temporal grounding · #vision-language models · #reinforcement learning · #tool-augmented LLMs · #multimodal reasoning · #video retrieval · #episode-level QA · #grounding agent · #vision agent · #GRPO · #step-wise planning · #long-context understanding · #video RAG