LongVideoAgent: Multi-Agent Reasoning with Long Videos
Key Summary
- LongVideoAgent is a team of three AIs that work together to answer questions about hour-long TV episodes without missing small details.
- A master planner LLM decides what to do next, a grounding agent finds the right moment in the video, and a vision agent reads fine details from those frames.
- Instead of squeezing the whole video into a short summary, the system looks only where it needs to, when it needs to.
- The master agent is trained with reinforcement learning to take neat, well-formed steps and stop using tools when it already has enough evidence.
- On the new LongTVQA and LongTVQA+ datasets (made from full episodes), the method beats strong single-model baselines by clear margins.
- Adding grounding and then targeted vision yields big gains: in a controlled study, accuracy rose from 64.3% (no agents) to 74.8% (grounding + vision).
- Reinforcement learning further boosts smaller open models; for example, Qwen2.5-7B improves from 46.1% to 60.2% on one benchmark.
- The system's step-by-step traces are easy to read, so you can see exactly which clip was checked and what visual facts were used.
- A small step budget (about five actions) and a narrow window (one clip) already work well, with larger windows helping a bit more.
- This approach shows that smart planning plus the right tools can handle long, messy videos better than one big model trying to read everything at once.
Why This Research Matters
Long videos are everywhere (classes, meetings, TV, sports, and security), and most of us don't have time to rewatch everything to answer one question. LongVideoAgent shows a practical way to jump straight to the moment that matters and read just the needed details, saving time and improving accuracy. Its step-by-step traces make decisions transparent, so users can trust how answers were found. Smaller open models benefit a lot from this method, making strong long-video reasoning more accessible. With future additions like audio and knowledge, this approach could power reliable assistants for studying, media analysis, compliance checks, and more.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you have to answer a question about a whole TV episode that's an hour long. Would you watch every second again, or skip straight to the parts that matter?
The Concept (Long Video QA): It's answering questions about very long videos where the clues are scattered across many scenes and times. How it works:
- You search the entire episode for the right moment.
- You check both what people say (subtitles) and what you see (frames).
- You put the clues together to answer. Why it matters: Without careful searching, you'll miss tiny cues (like a small object or a quick glance) and answer wrong. Anchor: A question like "Where is Sheldon sitting when he's with a man?" might need one specific nighttime scene on a bench; scanning the whole episode blindly is slow and error-prone.
Hook: You know how teachers pause a documentary at the exact moment something important happens?
The Concept (Temporal Localization): It's finding when in the video the needed moment happens. How it works:
- Read the question and subtitles.
- Predict the most relevant time span.
- Jump there and check details. Why it matters: If you look at the wrong minute, even a perfect vision system won't help. Anchor: To answer "What does the sign say behind the cashier?", you must skip to the checkout scene before trying to read any sign.
Hook: When you solve a puzzle, you use eyes (seeing), ears (hearing), and brain (thinking) together.
The Concept (Multimodal Reasoning): It's combining text (subtitles), images (frames), and sometimes audio to make sense of a story. How it works:
- Gather text and visual hints.
- Decide which hint matters most for the question.
- Use both to reason step by step. Why it matters: Subtitles miss visual facts; visuals miss spoken facts. You need both for tough questions. Anchor: If someone says "over there" (subtitle), you need the frame to see where "there" is.
Hook: Think of a soccer team: the striker, the goalie, and the coach each have a job.
The Concept (Multi-Agent Framework): It's a team of specialized AI helpers coordinated by a master planner. How it works:
- The master agent reads the question and plans a step.
- It asks a grounding agent to find the right clip.
- It asks a vision agent to read fine visual details.
- It repeats until there's enough evidence, then answers. Why it matters: One model doing everything often compresses or overlooks details in hour-long videos. Anchor: Instead of one student cramming a whole book, a team splits tasks: one finds the page, another reads the chart, a third writes the answer.
Hook: When you bake, you follow a recipe step by step, not all at once.
The Concept (Step-Wise Reasoning): It breaks a big task into small, ordered actions. How it works:
- Plan the next best move.
- Execute it (ground or read visuals).
- Check what you learned.
- Stop when you have enough to answer. Why it matters: Without steps, the model wastes time and misses the goal. Anchor: First find the right clip, then check objects in that clip, then answer.
Before this work, many systems tried to stuff an entire video into a short, lossy summary for a single pass. That often erased fine-grained evidence (like a tiny badge on a shirt) and stretched the model's memory. Other agent systems started to plan but used limited tools (mostly captions or basic retrieval), which couldn't capture subtle objects, faces, or timing. The missing piece was a planner that could coordinate stronger, specialized tools, take only a few smart steps, and know when to stop. This paper fills that gap with a multi-agent framework plus reinforcement learning to shape neat, efficient planning.
Hook: You know how stickers or points can train you to do homework neatly and on time?
The Concept (Reinforcement Learning): It teaches the master agent good habits by rewarding correct, well-formed steps and right answers. How it works:
- The agent tries a sequence of actions (ground, read, answer).
- It gets small rewards for clean step format and a final reward for a correct answer.
- Over time, it learns to plan better and avoid wasteful tool calls. Why it matters: Without feedback, the agent may ramble, over-query, or answer too soon. Anchor: The system learns that "Ground → Visual read → Answer" often beats "Visual read over and over → Guess."
Real stakes: Long videos are everywhere: classes, meetings, sports, TV, and security footage. A system that can jump to the right spot and pull precise facts saves you from endless scrolling and guessing, making video knowledge searchable, checkable, and useful in daily life.
02 Core Idea
The "aha!" moment: Don't summarize everything; teach a master planner to smartly coordinate a finder (grounding) and a look-closer (vision), and reward it for short, correct plans.
Three analogies:
- Treasure hunt: The master is the captain, the grounding agent is the map that marks "X," and the vision agent is the magnifying glass that reads tiny clues on the treasure chest.
- Librarian team: One librarian pinpoints the exact shelf (grounding), another checks the paragraph for a keyword (vision), while the head librarian writes the final answer.
- Sports play: The coach (master) calls a play, the scout finds the right opponent footage (grounding), and the analyst freeze-frames key moments (vision) before the coach makes the call (answer).
Before vs. After:
- Before: Single big models took a single gulp of a long video, often downsampled or compressed, losing small but crucial evidence. Agent systems existed but leaned on weaker tools and had little training on how to plan.
- After: The master coordinates two stronger specialists, operates in short, clean steps, and is trained with rewards to avoid unnecessary tool calls and to answer decisively once evidence is enough. Fine details are retrieved on demand, not squashed into a summary.
Why it works (intuition):
- Narrowing the search (grounding) reduces distraction and memory load.
- Targeted inspection (vision) restores the precise details that summaries blur out.
- Step-wise planning with rewards tunes the master to balance speed and accuracy: explore only as needed, then exploit to answer.
- Clear action tags structure the conversation, keeping the agent from rambling and forcing one crisp decision per step.
Building blocks (quick explainers for the two specialists; a code sketch follows below):
Hook: When you're lost in a huge mall, the info desk points you to the exact store. The Concept (Grounding Agent): It finds the time segment where the answer likely lives. How it works: (1) Read question + subtitles; (2) Propose the best episode clip; (3) Hand back a tag like <clip_15>. Why it matters: Without it, the master might comb the wrong minutes. Anchor: For "When did she hand over the keys?", grounding jumps to the clip with that exchange.
Hook: Detectives zoom into photos to spot fingerprints. The Concept (Vision Agent): It reads fine-grained visual facts (objects, faces, actions, on-screen text) inside the grounded clip. How it works: (1) Receive the clip tag and a focused query; (2) Inspect selected frames; (3) Return concise textual observations. Why it matters: Subtitles won't tell you what's written on a sign or which hand holds a cup. Anchor: "What side of the bed is closer to the window?" is solved by vision reading the room layout.
Finally, the master agent (with reinforcement learning) learns a tidy habit: if there's no clip yet, ground; if the clip lacks visuals, ask vision; if evidence is enough, answer.
Hook: Like using a checklist so you don't overdo steps while cooking. The Concept (Step Budget): A small limit (like five steps) keeps the plan efficient. How it works: (1) Make a move; (2) Review evidence; (3) Repeat until limit or answer. Why it matters: Unlimited steps can waste time without adding accuracy. Anchor: Most questions resolve in 2-5 actions: ground → read → answer.
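To make the division of labor concrete, here is a minimal Python sketch of the two specialists' interfaces and the master's tidy habit. The names (GroundingResult, ground, inspect, next_action) and signatures are illustrative assumptions, not the paper's actual code; only the action tags follow the examples above.

```python
# Illustrative sketch only; names and signatures are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class GroundingResult:
    clip_tag: str     # e.g. "<clip_15>"
    subtitles: str    # subtitles local to the grounded clip

def ground(question: str, episode_subtitles: str) -> GroundingResult:
    """Grounding agent: propose the clip most likely to contain the answer."""
    raise NotImplementedError  # in practice, an LLM reading the question + all subtitles

def inspect(clip_tag: str, visual_query: str) -> str:
    """Vision agent: return short textual observations about frames in that clip."""
    raise NotImplementedError  # in practice, a multimodal model reading sampled frames

def next_action(has_clip: bool, has_visual_notes: bool, evidence_is_enough: bool) -> str:
    """The master's tidy habit: ground first, then look closer, then answer."""
    if not has_clip:
        return "<request_grounding>"               # no clip yet -> ground
    if not has_visual_notes and not evidence_is_enough:
        return "<visual_query>...</visual_query>"  # clip lacks visuals -> ask vision
    return "<answer>...</answer>"                  # enough evidence -> answer
```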
03 Methodology
High-level recipe: Input (full episode subtitles + question) → Master decides next action → Grounding agent returns a clip tag → Optional vision agent reads fine details from that clip → Evidence accumulates → Master answers.
Step 1. Initialize context
- What happens: The master agent receives all episode subtitles and the user's question. Context is empty except for these.
- Why this exists: It gives the master the story outline but not the visuals yet, so it must decide what to fetch.
- Example: Question: "Where is Sheldon sitting when he is accompanied by a man?" Subtitles alone don't reveal the location.
Hook: You don't read a whole book to find one quote; you jump to the right page first. The Concept (Grounding Pass): The master asks the grounding agent to localize the relevant segment. How it works:
- Emit <request_grounding>.
- Grounding returns a tag like <s05e06_seg02_clip_15> and the local subtitles.
- The tag pins the timeline. Why it matters: It cuts the haystack down to a handful of needles. Anchor: For the bench question, grounding jumps to the nighttime scene.
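As a rough illustration of this pass, the snippet below shows how a controller might detect the master's grounding request and fold the returned clip tag and local subtitles into the running context. The tag pattern mirrors the example above (<s05e06_seg02_clip_15>); the regex and helper names are assumptions, not the paper's implementation.

```python
# Sketch of one grounding pass; the regex and helper names are illustrative assumptions.
import re

CLIP_TAG = re.compile(r"<s\d{2}e\d{2}_seg\d{2}_clip_\d+>")  # e.g. <s05e06_seg02_clip_15>

def handle_grounding(master_output, question, episode_subtitles, grounding_agent, context):
    """If the master asked for grounding, call the grounding agent and log its evidence."""
    if "<request_grounding>" not in master_output:
        return None
    reply = grounding_agent(question, episode_subtitles)  # returns clip tag + local subtitles
    match = CLIP_TAG.search(reply.clip_tag)
    if match is None:
        context.append("Grounding returned no usable tag; consider re-grounding.")
        return None
    context.append(f"Grounded segment: {match.group(0)}")
    context.append(f"Local subtitles: {reply.subtitles}")
    return match.group(0)  # this tag pins the timeline for later vision queries
```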
Step 2. Decide if visuals are needed
- What happens: The master checks if subtitles in the grounded clip already answer the question. If not, it asks vision.
- Why this exists: Prevents unnecessary visual reads and saves time.
- Example: If the subtitle says, "Meet me at the bus stop," there's no need to ask vision.
Hook: If you can't read tiny print, you grab a magnifying glass. The Concept (Targeted Vision Query): The master crafts a focused prompt for the vision agent (e.g., "Which side of the bed is near the window?"). How it works:
- Emit <visual_query> ... </visual_query> tied to the current clip tag.
- Vision inspects key frames for objects, text, or spatial relations.
- It returns short, factual observations. Why it matters: Generic captions might miss the exact fact you need. Anchor: "Window is on the left; Sheldon sits near it" → answer: left side.
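Here is a hedged sketch of how a targeted vision query could be extracted and its observations added to the evidence. The <visual_query> tag syntax follows the example above; the function and parameter names are assumptions.

```python
# Sketch of a targeted vision query; prompt handling and names are illustrative assumptions.
def run_visual_query(master_output, clip_tag, vision_agent, context):
    """Pull the focused question out of <visual_query> tags and ask the vision agent."""
    start, end = "<visual_query>", "</visual_query>"
    if start not in master_output or clip_tag is None:
        return  # nothing to ask, or no grounded clip to inspect yet
    query = master_output.split(start, 1)[1].split(end, 1)[0].strip()
    # The vision agent inspects sampled frames from the grounded clip and returns
    # short, factual text (objects, on-screen text, spatial relations).
    observations = vision_agent(clip_tag=clip_tag, query=query)
    context.append(f"Vision ({clip_tag}): {observations}")
```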
Step 3. Accumulate and loop with a small step budget (K)
- What happens: The master appends grounding tags and vision notes to its running context, then decides the next move: reground, re-query vision more precisely, or answer.
- Why this exists: Complex questions may need a refine-and-check cycle, but K (like 5) stops overthinking.
- Example: First vision read is vague; the master asks a sharper second query focused on window position.
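Putting the pieces together, a minimal control loop under a step budget K might look like the sketch below. It reuses the helper sketches from the previous steps; the loop structure and names are assumptions rather than the paper's implementation.

```python
# Minimal control loop under a step budget K; structure and helper names are assumptions
# (handle_grounding and run_visual_query are the sketches from the previous steps).
def answer_question(question, episode_subtitles, master, grounding_agent, vision_agent, K=5):
    context = [f"Question: {question}", f"Episode subtitles: {episode_subtitles}"]
    clip_tag = None
    for _ in range(K):                        # a small budget stops overthinking
        action = master("\n".join(context))   # the master emits exactly one action per turn
        if "<request_grounding>" in action:
            clip_tag = handle_grounding(action, question, episode_subtitles,
                                        grounding_agent, context) or clip_tag
        elif "<visual_query>" in action:
            run_visual_query(action, clip_tag, vision_agent, context)
        elif "<answer>" in action:
            return action.split("<answer>", 1)[1].split("</answer>", 1)[0].strip()
    return None  # budget exhausted without a confident answer
```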
Hook: Practice makes neat habits. The Concept (Reinforcement Learning for Planning): The master is fine-tuned to make crisp, valid actions and get answers right. How it works:
- Structural reward: +1 when an action is well-formed (one clean tag, no extra text).
- Answer reward: +1 (scaled) for the correct final choice.
- Optimize with GRPO so the policy prefers sequences that are tidy and correct. Why it matters: It discourages messy outputs and pointless tool calls. Anchor: The trained master more often chooses "ground → one precise visual read → answer" instead of multiple unfocused reads.
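To show how the two reward signals could combine into a trajectory score for GRPO-style training, here is a simplified sketch. The regex, weights, and the group-relative advantage step are assumptions based on the description above, not the exact training code.

```python
# Sketch of the reward shaping described above; exact weights and scaling are assumptions.
import re

WELL_FORMED = re.compile(
    r"^\s*(?:<request_grounding>|<visual_query>.*?</visual_query>|<answer>.*?</answer>)\s*$",
    re.DOTALL,
)

def trajectory_reward(actions, predicted_option, gold_option, answer_weight=1.0):
    structural = sum(1.0 for a in actions if WELL_FORMED.match(a))      # +1 per clean step
    answer = answer_weight if predicted_option == gold_option else 0.0  # scaled final reward
    return structural + answer

def group_relative_advantages(rewards):
    """GRPO-style (simplified): score each sampled trajectory against the group mean."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```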
Practical details that make it work:
- Single action per turn: The master must choose exactly one of {visual_query, request_grounding, answer}. This keeps reasoning organized (a small validator sketch follows this list).
- Text-only master: The master never sees raw images, only subtitles, clip tags, and the vision agent's textual notes. This simplifies integration across different LLM backbones.
- Window size: By default, the grounding returns one clip; larger windows (2-3 clips) can add context across rapid cuts when needed.
- Tools can be re-invoked: The master may reground if evidence conflicts, or issue a sharper vision query if the first was too broad.
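A tiny validator in the spirit of the "single action per turn" rule might look like the sketch below; the allowed tag set follows the list above, while the check itself is an illustrative assumption.

```python
# Sketch of the "exactly one action per turn" check; the function itself is an assumption.
import re

ACTION_PATTERNS = [
    r"<request_grounding>",
    r"<visual_query>.*?</visual_query>",
    r"<answer>.*?</answer>",
]

def is_single_clean_action(master_output: str) -> bool:
    """True only if the turn contains exactly one well-formed action and nothing else."""
    matches = [m for pat in ACTION_PATTERNS
               for m in re.finditer(pat, master_output, re.DOTALL)]
    if len(matches) != 1:
        return False  # zero actions, or more than one, is a malformed turn
    m = matches[0]
    leftover = (master_output[:m.start()] + master_output[m.end():]).strip()
    return leftover == ""  # no extra chatter around the single tag
```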
Secret sauce:
- The combo of (1) precise temporal narrowing (grounding), (2) surgical visual reads (vision), and (3) a reward-trained planner (master) aligns efficiency with accuracy.
- Small step limits and tidy action tags reduce drift and make the entire trajectory interpretable: you can audit which clip was checked and why.
Concrete walk-throughs:
- Bench example: Ground to clip_15 → Vision says "Sheldon on bench at night, trash can, windows" → Master infers bus stop → Answer.
- Bed-window example: Ground to bedroom clip_09 → Vision read #1 too generic → Vision read #2: "Window on left, Sheldon near it" → Answer: left side.
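Because every step is a tagged action plus a textual observation, a trajectory can be stored and audited as plain data. The structure below shows one possible way to record the bench walk-through; the field names and exact wording are assumptions.

```python
# One way to log an auditable trace for the bench example; field names are assumptions.
bench_trace = [
    {"step": 1, "action": "<request_grounding>",
     "evidence": "<clip_15>: nighttime street scene, local subtitles attached"},
    {"step": 2, "action": "<visual_query>Where is Sheldon sitting, and what is around him?</visual_query>",
     "evidence": "Sheldon on a bench at night; trash can and lit windows nearby"},
    {"step": 3, "action": "<answer>At a bus stop (on a bench)</answer>",
     "evidence": "final answer"},
]

for step in bench_trace:  # a reviewer can replay exactly which clip was checked and why
    print(step["step"], step["action"], "->", step["evidence"])
```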
04 Experiments & Results
The test: Can the agent team answer multiple-choice questions about full TV episodes (LongTVQA / LongTVQA+) better than strong single-model baselines? Metrics are Answer Accuracy (did you pick the right option?) and Grounding Accuracy (did you find the right segment?).
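Both metrics reduce to simple fractions over the evaluation set; a minimal sketch of how they might be computed is below, with the record fields as assumptions.

```python
# Sketch of the two metrics over an evaluation set; record fields are assumptions.
def answer_accuracy(records):
    """Fraction of questions where the chosen option matches the gold option."""
    return sum(r["predicted_option"] == r["gold_option"] for r in records) / len(records)

def grounding_accuracy(records):
    """Fraction of questions where the returned clip is among the annotated gold clips."""
    return sum(r["grounded_clip"] in r["gold_clips"] for r in records) / len(records)
```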
Competition: They compare against powerful non-agent multimodal models (e.g., GPT-4o, Gemini 2.5 Pro) that process many frames directly, plus the same LLM backbones used either in non-agent mode (subtitles only) or as the master in their agentic system. This makes gains attributable to agentic planning, not just a bigger backbone.
Scoreboard with context:
- Multi-agent beats non-agent on the same backbone. In a controlled ablation, adding grounding boosts accuracy from 64.3% to 69.0%, and adding the vision agent lifts it further to 74.8%, like moving from a C to a solid B+/A-.
- Reinforcement learning helps especially for smaller open models. For example, an open 7B model (Qwen2.5-7B) jumps from 46.1% to 60.2% on LongTVQA and from 60.3% to 70.8% on LongTVQA+ after RL, like raising your test grade by more than a full letter.
- Closed-source models remain strong, but the agentic approach narrows the gap. Even compact closed models (e.g., GPT-5-mini) see big gains when run agentically with frames compared to their non-agent subtitle-only variants (up to +12.20 points in one setting).
Ablations that tell a story:
- Step budget K: Raising K from 2 to 5 increases grounding accuracy (67.0% → 71.0%) and answer accuracy (68.30% → 73.67%), but going to 10 brings little extra. Translation: a handful of smart moves beats many meandering ones.
- Window size: Adding adjacent clips helps disambiguate quick cuts; answer accuracy rises from 70.33% (1 clip) to 75.00% (2 clips) and 77.26% (3 clips). Bigger windows add small gains but can cost more queries and latency.
- Vision backbone: A stronger vision agent (e.g., GPT-4o) improves both grounding (73.30%) and answers (78.00%) over a lighter model, showing that better fine-detail reading pays off.
Surprising findings:
- Tidy formatting rewards matter. Rewarding each well-formed action tag reduces chatter and keeps the plan compact, which indirectly improves final accuracy.
- One or two targeted vision reads are often enough. Over-querying rarely helps; the best trajectories look like "Ground once, read once or twice, answer."
- The master never sees raw images, yet the system still benefits hugely from vision, proving that decoupling planning (text-only) from perception (vision tool) is practical and powerful.
In short, the agent team consistently wins by looking exactly where needed and reading exactly what matters, rather than trying to remember everything at once.
05 Discussion & Limitations
Limitations:
- Audio blind spot: The system depends on provided subtitles and doesn't transcribe raw audio, so whispered lines or tone cues can be missed.
- Frozen specialists: Grounding and vision agents stay fixed during RL. Joint training might improve robustness, especially on tricky visual attributes or ambiguous dialogue.
- Simple rewards: Only step formatting and final correctness are rewarded. Richer rewards (e.g., penalizing contradictory evidence or rewarding fewer tool hops) might further refine planning.
Required resources:
- A capable master LLM (open or closed), plus a grounding tool and a strong vision model.
- GPU time if you fine-tune with RL (e.g., hours on several GPUs for 3B-7B open models).
- Access to episode-level subtitles and a way to sample frames from grounded clips.
When not to use:
- Ultra-short clips where a single pass already captures everything.
- Domains with no subtitles and very weak visual signal, unless you add ASR or other modalities.
- Real-time constraints that forbid even a few grounding/vision calls.
Open questions:
- How to fuse raw audio (ASR) and external knowledge (e.g., show lore) without bloating steps?
- Can we co-train grounding + vision with the master for end-to-end improvements while keeping interpretability?
- What curated rewards best capture "good reasoning," like consistency checks or evidence citations?
- How to scale to day-long streams (sports tournaments, security feeds) with hierarchical plans and memory?
06 Conclusion & Future Work
Three-sentence summary: LongVideoAgent turns long-video QA into a team sport: a master planner grounds the right moments, asks a vision specialist for precise details, and stops once it has enough evidence. Reinforcement learning teaches the master to keep steps tidy and answers correct, avoiding wasteful tool calls. On full-episode datasets (LongTVQA/LongTVQA+), this approach beats strong non-agent baselines and produces interpretable traces.
Main achievement: Showing that coordinated, reward-trained multi-agent planning reliably outperforms single-pass or weak-tool approaches on hour-long videos by fetching just-in-time evidence instead of compressing everything.
Future directions: Add audio (ASR) and world knowledge, refine grounding to be finer and faster, jointly train the specialists, and explore richer rewards that capture reasoning quality and evidence faithfulness.
Why remember this: It marks a shift from "summarize the whole video" to "plan, localize, and read what matters," proving that smart coordination plus minimal, targeted perception can unlock accurate, efficient understanding of long, messy videos.
Practical Applications
- Lecture assistants that jump to the exact minute a concept was explained and extract the key diagram or equation.
- TV and movie search that finds the precise scene where a character reveals a clue and reads on-screen text.
- Sports analysis that locates a play and identifies player positions or scoreboard numbers on demand.
- Meeting summarization that pinpoints decisions and reads slides to capture figures or action items.
- Customer support QA that finds when a product was shown or a feature was mentioned in demo videos.
- Compliance auditing that checks exact segments for required disclosures or safety labels in long broadcasts.
- Education tools that link homework questions to the right lecture snippet and annotate visuals from that snippet.
- News verification that grounds controversial claims to the specific aired segment and extracts supporting visuals.
- Video cataloging that auto-tags episodes by grounded events (e.g., "key handoff", "window-left bedroom") for faster retrieval.
- Safety monitoring that finds the moment a rule was broken and reads signage or badge IDs from that clip.