
Agentic Very Long Video Understanding

Intermediate
Aniket Rege, Arka Sadhu, Yuliang Li et al. · 1/26/2026
arXiv · PDF

Key Summary

  • The paper tackles understanding super long, first‑person videos (days to a week) by giving an AI a smarter memory and better tools.
  • It builds a time-stamped “map” of people, places, and objects, called an entity scene graph, so the AI can remember who did what, where, and when.
  • A planning agent breaks a big question into smaller steps and searches three places: pictures (frames), speech (transcripts), and the graph (relationships).
  • The searches follow a strict-to-relaxed strategy to find exact matches first, then gently widen the rules if nothing turns up.
  • Evidence from all searches is stored in working memory, and a final VQA model uses it to answer the question clearly and consistently.
  • On the week-long EgoLifeQA benchmark, the method sets a new state of the art (57.5% average MCQ accuracy with Gemini 2.5 Pro).
  • It especially shines on questions needing multi-hop reasoning about relationships (RelationMap and TaskMaster), with big gains over prior systems.
  • On Video-MME (Long), it reaches 74.1% with many fewer frames than some competitors, showing efficiency as well as accuracy.
  • Ablations show adding captions to transcripts improves the graph, and LLM-based transcript search boosts quality but costs more time.
  • Limitations include reliance on accurate speech transcripts and detection, and performance can dip with noisy diarization or perception errors.

Why This Research Matters

Always-on assistants need memory, not just eyesight. This work gives AI a structured, time-aware memory of people, places, and objects so it can answer real-life questions that stretch across days. It helps with everyday tasks like finding lost items, recalling conversations, and noticing healthy (or unhealthy) habits. It saves time by searching precisely instead of scanning everything, which also reduces cost. It also points toward safer, more transparent AI since the graph can show why an answer was given. As wearables spread, this approach turns overwhelming video streams into practical, trustworthy help.

Detailed Explanation


01Background & Problem Definition

You know how it’s easy to remember what you did this morning, but much harder to recall what you did every day last week? Computers feel the same way about videos. Short clips are manageable, but a whole week of video is like a never-ending story. Before this research, most AI systems were pretty good at short, standalone videos—like TikToks or YouTube snippets. They could recognize objects, describe scenes, and answer simple questions. But when videos stretched to many minutes—or even an hour—researchers had to squeeze or summarize the video to fit the AI’s memory limits. That helped a bit, yet it still missed a lot of connections over time.

🍞 Hook: Imagine trying to summarize a whole week at summer camp using only a sticky note. You’ll miss who you met, which cabin you slept in, and what happened on different days.

🥬 The Concept: Longitudinal Video Understanding

  • What it is: Understanding videos that show someone’s life over long stretches—days or even weeks—so the AI can remember and connect far-apart moments.
  • How it works: 1) Keep track of many events across time; 2) Notice repeats like habits; 3) Link people, places, and objects across days; 4) Answer questions that require looking back and forth in time.
  • Why it matters: Without it, an AI assistant on smart glasses can’t answer everyday questions like “Where did I leave my keys yesterday?” or “Who sat next to me on all my taxi rides to the store?”

🍞 Anchor: A kid asks, “How many times did I practice piano this week, and when?” Longitudinal understanding lets the AI search the whole week, not just one afternoon.

The problem: Modern language-and-vision AIs have limited “context windows” (how much input they can consider at once). For a week-long video, you can’t just stuff every frame and every spoken word into the model. People tried clever tricks: pick a few important frames, compress visual tokens, summarize in sliding windows, and retrieve small text chunks from captions. Helpful, yes—but these methods lose a key ingredient: who is related to whom, when those interactions happened, and how different pieces of evidence connect across days.

🍞 Hook: You know how your brain files people, places, and things into mental folders? If you only keep loose notes, you’ll forget which note goes with which person.

🥬 The Concept: Temporal Localization

  • What it is: Pinpointing exactly when something happened in a long video.
  • How it works: 1) Use clues from sound (words, names, times); 2) Use visuals (objects, scenes); 3) Match those clues to precise timestamps; 4) Zoom in on the right slices of time.
  • Why it matters: Without good timing, the AI either searches too much and gets lost, or searches too little and misses the answer.

🍞 Anchor: “When did we talk about the science fair?” Temporal localization finds the exact minutes those words were spoken.

Failed attempts: 1) Only retrieving captions loses visual detail and relationships. 2) Only retrieving frames misses names and spoken hints. 3) Treating each part separately doesn’t help the AI tie people, objects, and places together over time. 4) Prior “agent” systems often used unstructured notes, so they forgot connections as time stretched.

The gap: We need a memory that is structured around entities (people, places, objects), knows their relationships (talks-to, uses, interacts-with, mentions), and is time-aware. Plus, we need a planner that can ask the right questions of the right tools.

Real stakes: With smart glasses or always-on assistants, questions are personal and practical: “Where did I put my wallet?” “Who joined me for Tuesday’s lab?” “How often did I drink water this week?” In safety, health, and productivity, getting these answers right and fast truly matters.

🍞 Hook: Imagine your notebook turns into a tidy map showing who met whom, where, and when.

🥬 The Concept: Cross-Modal Reasoning

  • What it is: Combining different kinds of clues—images, speech, and structured relationships—to reach better answers.
  • How it works: 1) Look at pictures for scene and objects; 2) Read transcripts for names and events; 3) Consult a relationship map to connect dots across time; 4) Mix them to confirm or correct each other.
  • Why it matters: Without mixing modalities, the AI might miss that “the person in the blue shirt” and “Alex” are the same person, or that “the store” in the video is the same place mentioned in speech.

🍞 Anchor: To answer “Who sat next to me on my shopping taxi rides?”, audio confirms “shopping,” visuals show the taxi seats, and the relationship map links repeated rides to the same friend.

02Core Idea

Aha! Moment in one sentence: Put a time-stamped map of people, places, and objects (an entity scene graph) at the center, and let an agent plan careful searches across audio, visuals, and that map, then stitch the evidence together.

🍞 Hook: Think of a giant scrapbook where each photo has sticky notes: who’s in it, where, what they did, and the exact time.

🥬 The Concept: Entity Scene Graphs

  • What it is: A structured map with nodes (people, places, objects) and edges (relationships like talks-to, uses, interacts-with, mentions), each edge labeled with when it happened.
  • How it works: 1) Extract names and items from transcripts and captions; 2) Link them with relationship types; 3) Add start and end times; 4) Store everything in a database you can query.
  • Why it matters: Without a graph, information is scattered. The AI can’t reliably follow who did what with whom across many days.

🍞 Anchor: If you ask, “Before we went to see the dog, who went with me to the second floor to find Tasha?”, the graph helps trace those linked steps in order.
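To make the edge structure concrete, here is a minimal in-memory sketch in Python. The relationship labels and field names follow the description above; the specific people, objects, times, and the `who_did` helper are illustrative, and the actual system stores these edges as SQLite rows (see the Methodology section).

```python
# Minimal in-memory sketch of time-stamped entity-graph edges.
# Names, times, and the helper below are illustrative examples only.

edges = [
    {"source": "Jake", "target": "Alice", "relationship": "TALKS_TO",
     "day": 2, "start_t": "15:50:21", "end_t": "15:50:22",
     "text": "Got it."},
    {"source": "Jake", "target": "kitchen island", "relationship": "INTERACTS_WITH",
     "day": 2, "start_t": "15:48:00", "end_t": "15:55:00",
     "text": "Jake chops vegetables at the kitchen island."},
]

def who_did(person, relationship, day=None):
    """Return (target, start, end) for a person's edges, optionally filtered by day."""
    return [(e["target"], e["start_t"], e["end_t"])
            for e in edges
            if e["source"] == person
            and e["relationship"] == relationship
            and (day is None or e["day"] == day)]

print(who_did("Jake", "TALKS_TO", day=2))   # [('Alice', '15:50:21', '15:50:22')]
```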

🍞 Hook: Imagine a librarian who knows which shelf, which book, and which page holds your answer.

🥬 The Concept: Multi-hop Reasoning

  • What it is: Solving a problem by making several logical jumps—A leads to B, which leads to C.
  • How it works: 1) Break a big question into smaller parts; 2) Answer the first part; 3) Use that to ask the next part; 4) Keep going until the final answer appears.
  • Why it matters: Without multi-hop, the AI gets stuck on simple, single-step facts and can’t solve multi-day puzzles.

🍞 Anchor: To learn “who sat next to me on shopping taxi rides,” the AI: finds all shopping trips (hop 1), finds taxi rides (hop 2), then the neighbor in each ride (hop 3), and finally counts who appears most (hop 4).
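As a toy illustration of those four hops, the snippet below chains simple filters over a hand-made edge list; the names, days, and the final count are invented to mirror the running taxi example, not taken from the dataset.

```python
# Toy multi-hop chain: shopping mentions -> taxi rides on those days ->
# seat neighbors -> most frequent neighbor. All entries are illustrative.
from collections import Counter

edges = [
    {"source": "Jake", "target": "shopping", "relationship": "MENTIONS",       "day": 2},
    {"source": "Jake", "target": "shopping", "relationship": "MENTIONS",       "day": 4},
    {"source": "Jake", "target": "taxi",     "relationship": "INTERACTS_WITH", "day": 2},
    {"source": "Jake", "target": "taxi",     "relationship": "INTERACTS_WITH", "day": 4},
    {"source": "Jake", "target": "Sam",      "relationship": "TALKS_TO",       "day": 2},
    {"source": "Jake", "target": "Sam",      "relationship": "TALKS_TO",       "day": 4},
    {"source": "Jake", "target": "Alice",    "relationship": "TALKS_TO",       "day": 2},
]

def hop(**conditions):
    """Keep edges whose fields match every keyword condition (one reasoning hop)."""
    return [e for e in edges if all(e.get(k) == v for k, v in conditions.items())]

shopping_days = {e["day"] for e in hop(relationship="MENTIONS", target="shopping")}       # hop 1
taxi_days     = {e["day"] for e in hop(relationship="INTERACTS_WITH", target="taxi")
                 if e["day"] in shopping_days}                                            # hop 2
neighbors     = [e["target"] for d in taxi_days
                 for e in hop(relationship="TALKS_TO", day=d)]                            # hop 3
print(Counter(neighbors).most_common(1))   # hop 4 -> [('Sam', 2)]
```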

Three analogies for the core idea:

  1. Detective board: Photos (frames), call logs (transcripts), and strings linking people/places (graph) help solve a long case.
  2. Library catalog: You don’t read every book; you query the catalog (graph) to find exact shelves (times), then peek into the right pages (frames/transcripts).
  3. Recipe finder: Instead of tasting every dish, you search the index (graph), then check key ingredients (transcripts) and photos (frames).

Before vs. After:

  • Before: Long videos felt like a messy attic—too many items, no labels, hard to connect events across days.
  • After: The attic is labeled and time-stamped. A planner asks focused questions, finds precise times, and cross-checks across audio and visuals.

Why it works (intuition):

  • Graphs compress the who-did-what-with-whom-when information so queries are surgical, not sweeping.
  • Visual embeddings fetch look-alike scenes fast; transcripts fetch name/date clues; the graph ties them together for consistency.
  • An agent’s strict-to-relaxed querying avoids false hits first, then broadens carefully for recall.
  • A working memory keeps only the best evidence to fit model limits.

🍞 Hook: You know how a coach assigns drills before the big game?

🥬 The Concept: Planning Agent

  • What it is: A controller that breaks a big question into mini-steps and picks the right tool for each step.
  • How it works: 1) Read the question; 2) Make a plan with up to five steps; 3) Route steps to audio, visual, or graph search; 4) Gather and refine evidence.
  • Why it matters: Without planning, the AI wastes time searching everything and still misses connections.

🍞 Anchor: For “When did I chat with Alex at the kitchen island?”, the planner first finds “Alex” mentions (audio), then finds “kitchen island” (visual), then checks the graph for talks-to edges overlapping those times.

🍞 Hook: Think of your backpack pocket where you keep only the essentials for a hike.

🥬 The Concept: Working Memory

  • What it is: A compact notebook of the best cross-modal evidence found so far.
  • How it works: 1) Add analyzed snippets; 2) De-duplicate; 3) Keep timestamps and sources; 4) Hand it to the final answerer.
  • Why it matters: Without it, the AI forgets what it already proved and repeats itself.

🍞 Anchor: “Danced between 15:50–16:07 on Day 2” and “Shure said ‘Got it’ to Alice at 15:50:21–15:50:22” live together in memory to support the answer.
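A minimal sketch of that behavior, assuming a simple de-duplicate-then-sort policy; the exact fields and de-duplication rule are assumptions rather than the paper's implementation.

```python
# Sketch of a working memory that de-duplicates evidence notes and renders
# them, time-ordered, for the final answerer. Fields are assumed, not exact.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Evidence:
    note: str      # concise analyzer summary
    source: str    # "visual", "audio", or "graph"
    day: int
    start_t: str
    end_t: str

@dataclass
class WorkingMemory:
    items: set = field(default_factory=set)

    def add(self, ev: Evidence) -> None:
        self.items.add(ev)   # set membership drops exact duplicates

    def render(self) -> str:
        """Format the evidence for the final VQA prompt, ordered by day and time."""
        ordered = sorted(self.items, key=lambda e: (e.day, e.start_t))
        return "\n".join(f"[{e.source} | Day {e.day} {e.start_t}-{e.end_t}] {e.note}"
                         for e in ordered)

wm = WorkingMemory()
wm.add(Evidence("Danced in the living room", "visual", 2, "15:50", "16:07"))
wm.add(Evidence("Shure said 'Got it' to Alice", "audio", 2, "15:50:21", "15:50:22"))
print(wm.render())
```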

🍞 Hook: Start with a tight filter, then loosen gently if needed—like looking for your lost pencil first on your desk, then your room, then the house.

🥬 The Concept: Strict-to-Relaxed SQL Search

  • What it is: A careful strategy for querying the graph—begin exact, then widen by time, day, names, and relationship type if necessary.
  • How it works: 1) Exact day/time/entities/relationship; 2) Relax time; 3) Relax day; 4) Partial name matches; 5) Finally drop relationship constraint.
  • Why it matters: Without this, you either miss the right rows (too strict) or drown in noise (too relaxed).

🍞 Anchor: If “Alex” wasn’t spelled consistently, relaxing to LIKE ‘Alex%’ recovers matches that exact search would miss.
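Here is a sketch of how that ladder could look as SQL over the edge table described in the Methodology section. The relaxation order mirrors the list above, but the concrete SQL and the toy row are an illustrative reconstruction, not the paper's exact queries.

```python
# Strict-to-relaxed query ladder over an entity-graph table in SQLite.
# Table columns follow the paper's description; the row and queries are toy examples.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE edges
               (source TEXT, target TEXT, relationship TEXT,
                start_t TEXT, end_t TEXT, day INTEGER, text TEXT)""")
con.execute("INSERT INTO edges VALUES "
            "('Alexandra', 'Jake', 'TALKS_TO', '15:50:21', '15:50:22', 2, 'Got it.')")

def query_ladder(source, relationship, day, t0, t1):
    """Yield progressively relaxed (sql, params) pairs, strictest first."""
    yield ("SELECT * FROM edges WHERE source=? AND relationship=? AND day=? "
           "AND start_t>=? AND end_t<=?",
           (source, relationship, day, t0, t1))            # 1) exact day/time/entity/relationship
    yield ("SELECT * FROM edges WHERE source=? AND relationship=? AND day=?",
           (source, relationship, day))                    # 2) relax time
    yield ("SELECT * FROM edges WHERE source=? AND relationship=?",
           (source, relationship))                         # 3) relax day
    yield ("SELECT * FROM edges WHERE source LIKE ? AND relationship=?",
           (source + "%", relationship))                   # 4) partial name match
    yield ("SELECT * FROM edges WHERE source LIKE ?",
           (source + "%",))                                # 5) drop relationship constraint

def graph_search(source, relationship, day, t0, t1):
    for sql, params in query_ladder(source, relationship, day, t0, t1):
        rows = con.execute(sql, params).fetchall()
        if rows:                                           # stop at the first non-empty stage
            return rows
    return []

print(graph_search("Alex", "TALKS_TO", 2, "15:00:00", "16:00:00"))  # found at stage 4 via LIKE 'Alex%'
```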

03Methodology

At a high level: Input (week-long video with audio) → Build resources (frames, transcripts, entity scene graph) → Agent plans and runs searches (visual, audio, graph) → Analyzer trims and fuses evidence into working memory → VQA model answers.

Step 0: Prepare the multimodal resources

  • What happens: Sample video at 1 FPS to get frames; run ASR for transcripts (speaker diarization if available); generate visual captions every 30s; fuse captions with transcripts into richer text. Extract entities and relationships with an LLM, add timestamps from captions/transcripts, and store the entity scene graph as rows in a SQLite database: source, target, relationship, start_t, end_t, day, and supporting text.
  • Why it exists: These resources give three complementary views: images (what it looks like), words (what was said), and structure (who-with-whom-when). Without them, later searches would be blind or inconsistent.
  • Example: A 7-day video yields thousands of frames, many transcript snippets, and graph edges like Jake (Person) TALKS_TO Alice (Person) at 15:50:21–15:50:22 Day 2.
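As a sketch of the graph-building part of this step, the snippet below assumes a hypothetical `call_llm` helper for entity and relationship extraction; the prompt wording and JSON field names are assumptions, while the table columns follow the description above.

```python
# Sketch of Step 0's graph construction: turn fused caption+transcript chunks
# into time-stamped edges via an LLM, then store them as SQLite rows.
# `call_llm` is a hypothetical stand-in for whatever LLM client is used.

import json
import sqlite3

EXTRACTION_PROMPT = """Extract entities (people, places, objects) and their
relationships (TALKS_TO, USES, INTERACTS_WITH, MENTIONS) from the text below.
Return a JSON list of objects with fields:
source, target, relationship, start_t, end_t, day, text.

Text (Day {day}, {start_t}-{end_t}):
{chunk}
"""

def build_entity_graph(db_path, fused_chunks, call_llm):
    """fused_chunks: iterable of dicts with 'day', 'start_t', 'end_t', 'chunk' keys."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS edges
                   (source TEXT, target TEXT, relationship TEXT,
                    start_t TEXT, end_t TEXT, day INTEGER, text TEXT)""")
    for c in fused_chunks:
        rows = json.loads(call_llm(EXTRACTION_PROMPT.format(**c)))
        con.executemany(
            "INSERT INTO edges VALUES (:source, :target, :relationship, :start_t, :end_t, :day, :text)",
            rows)
    con.commit()
    return con
```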

🍞 Hook: It’s like building your study kit before an exam—notes, flashcards, and an index.

🥬 The Concept: Audio-Visual Search

  • What it is: Tools that find relevant frames and transcript snippets for a sub-question.
  • How it works: Visual search uses SigLIP2 embeddings to retrieve look-alike frames, optionally filtered by time or place. Transcript search can be LLM-based (smarter, slower) or BM25 (faster, simpler) to find matching words/phrases.
  • Why it matters: Without quick, precise retrieval, the agent wastes compute and still misses the right moments.

🍞 Anchor: Query “dancing” on Day 2 afternoon finds frames with dancing; transcript search for “music” or “dance” lines up the timing.
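A minimal sketch of the visual half, assuming the text query has already been embedded; the paper uses SigLIP2, but any dual image-text encoder producing comparable vectors would fit this shape.

```python
# Embedding-based frame retrieval: cosine similarity between a text-query
# vector and precomputed frame vectors, optionally restricted to a time window.

import numpy as np

def visual_search(query_vec, frame_vecs, frame_meta, k=5, day=None, t0=None, t1=None):
    """frame_vecs: (N, D) array; frame_meta: list of dicts with 'day' and 'time' keys."""
    keep = [i for i, m in enumerate(frame_meta)
            if (day is None or m["day"] == day)
            and (t0 is None or m["time"] >= t0)
            and (t1 is None or m["time"] <= t1)]
    if not keep:
        return []
    q = query_vec / np.linalg.norm(query_vec)
    f = frame_vecs[keep]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    sims = f @ q                                   # cosine similarity per kept frame
    order = np.argsort(-sims)[:k]
    return [(frame_meta[keep[i]], float(sims[i])) for i in order]
```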

Step 1: Planning and subtask creation

  • What happens: The Planning Agent decomposes the user’s question into up to five crisp steps and assigns each to visual, audio, or graph search, providing any time filters.
  • Why it exists: Tackling a giant question all at once is inefficient; breaking it down improves precision and reduces token use.
  • Example: For “Who often sits next to me during taxi rides to go shopping?”, steps might be: (a) find shopping mentions (audio), (b) find taxi scenes (visual), (c) intersect times, (d) use graph to see who interacts with me then, (e) tally the most frequent neighbor.
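To make this concrete, below is one way the emitted plan could look for the taxi example, together with a simple dispatch loop; the step schema and tool names are assumptions for illustration, since in the system the plan itself comes from an LLM.

```python
# Illustrative plan (at most five steps) and a dispatcher that routes each
# step to its tool. The schema and tool names are assumed, not the paper's.

plan = [
    {"step": 1, "tool": "audio",  "query": "shopping"},
    {"step": 2, "tool": "visual", "query": "taxi interior"},
    {"step": 3, "tool": "graph",  "query": {"source": "Jake",
                                            "relationship": "INTERACTS_WITH"}},
    {"step": 4, "tool": "analyze",
     "query": "which person co-occurs with Jake in the taxi evidence most often?"},
]

def run_plan(plan, tools, working_memory):
    """tools maps tool names to callables; working_memory is a plain list here."""
    for step in plan:
        evidence = tools[step["tool"]](step)   # each tool accepts a step dict
        working_memory.extend(evidence)        # collect evidence for the analyzer/VQA stage
    return working_memory
```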

Step 2: Run the retriever tools

  • Visual Search Tool: Stores frame embeddings and metadata (timestamps, locations). Given a short text query like “taxi interior,” it returns top-k nearest frames for the window of interest.
  • Audio Transcript Search Tool: Either feeds per-day transcripts to an LLM to select relevant snippets with timestamps, or runs BM25 over all transcripts to get candidate lines.
  • Entity Graph Search Tool: Issues SQL over the graph. It starts strict (exact day, time, entities, relationship) and relaxes constraints if needed. This prioritizes accuracy but still finds matches under real-world messiness (misspellings, timing drift).
  • Why these exist: Each tool sees a different facet: visuals excel at appearance and scene type; audio captures names and intents; the graph maintains identity and relationships over time. Without one of them, crucial clues would be lost.
  • Mini example: Visual returns frames labeled “car interior” around 10:10–10:40; audio finds “shopping list” at 10:20; graph shows TALKS_TO between “Jake” and “Sam” overlapping those minutes.
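For the BM25 path specifically, here is one possible implementation using the rank_bm25 package; the paper describes BM25 retrieval but does not name a library, and the transcript lines below are invented.

```python
# BM25 transcript search over diarized lines, using the rank_bm25 package.

from rank_bm25 import BM25Okapi

transcript = [
    {"day": 4, "time": "10:20:03", "text": "Let's go to the store now, I made a shopping list."},
    {"day": 2, "time": "15:50:21", "text": "Got it."},
    {"day": 4, "time": "10:25:40", "text": "The taxi is waiting outside."},
]

bm25 = BM25Okapi([line["text"].lower().split() for line in transcript])

def transcript_search(query, k=2):
    """Return the top-k transcript lines for a whitespace-tokenized query."""
    scores = bm25.get_scores(query.lower().split())
    order = sorted(range(len(transcript)), key=lambda i: -scores[i])[:k]
    return [transcript[i] for i in order]

print(transcript_search("shopping taxi"))
```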

Step 3: Analyze and refine (Analyzer Tool)

  • What happens: An LLM examines retrieved items, discards off-target ones, extracts the most relevant evidence, and writes concise notes with timestamps into working memory.
  • Why it exists: Retrieval can be noisy. The analyzer keeps memory tidy, preventing confusion later.
  • Example: From 50 frames, it keeps just 4 that clearly show two people in a taxi, and links them to the transcript line “Let’s go to the store now,” Day 4 at 10:22.
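A sketch of the analyzer as a single LLM call; `call_llm` is a hypothetical stand-in, and the prompt wording is an assumption about, not a copy of, the system's actual prompt.

```python
# Analyzer sketch: ask an LLM to keep only on-target evidence and rewrite it
# as concise, time-stamped notes, one per line, for working memory.

ANALYZER_PROMPT = """Question being investigated: {question}

Retrieved items (with timestamps and sources):
{items}

Keep only the items that help answer the question. For each kept item,
write one concise note in the form:
[source | Day D HH:MM:SS-HH:MM:SS] what the evidence shows
"""

def analyze(question, retrieved, call_llm):
    items = "\n".join(
        f"[{r['source']} | Day {r['day']} {r['start_t']}-{r['end_t']}] {r['text']}"
        for r in retrieved)
    notes = call_llm(ANALYZER_PROMPT.format(question=question, items=items))
    return [n for n in notes.splitlines() if n.strip()]   # one note per non-empty line
```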

Step 4: Synthesize the final answer (VQA Agent)

  • What happens: The VQA model reads the original question plus the compact working memory and produces the final answer, citing the aligned evidence across modalities.
  • Why it exists: This last step assembles the jigsaw pieces into a single, confident response.
  • Example: “Sam sits next to you most often during taxi rides to the store,” supported by repeated TALKS_TO and co-occurring taxi frames on Days 2, 4, and 6.
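A sketch of this final synthesis, reusing the `WorkingMemory.render()` helper from the earlier working-memory sketch; `call_vqa_model` is a hypothetical stand-in for the backbone model (e.g., Gemini 2.5 Pro).

```python
# Final answering sketch: the MCQ plus the rendered working memory is the
# entire context handed to the VQA model.

VQA_PROMPT = """Answer the multiple-choice question using only the evidence below.
Cite the timestamps that support your choice.

Question:
{question}

Evidence (from working memory):
{evidence}

Answer with the option letter and a one-sentence justification.
"""

def answer_question(question, working_memory, call_vqa_model):
    return call_vqa_model(VQA_PROMPT.format(question=question,
                                            evidence=working_memory.render()))
```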

Secret sauce highlights

  • Time-aware entity graph: Edges have start and end times, so reasoning stays aligned to real moments, not vague summaries.
  • Strict-to-relaxed SQL: Lowers false positives by starting exact; increases recall smoothly when data is messy.
  • Cross-modal triangulation: A claim must pass multiple checks—visual look, spoken words, and relationship consistency.
  • Efficient context budgeting: The system retrieves only what’s needed into working memory, avoiding context overflow.

What breaks without each step

  • No graph: The system forgets identities across days; relation-heavy questions collapse.
  • No transcripts: Names and intents vanish; habits are hard to count.
  • No visual embeddings: Scene types and subtle actions get missed.
  • No analyzer: Noise floods memory; the final answer waffles.
  • No planner: The system thrashes, burning time and tokens on guesses.

Concrete walk-through

  • Question: “Across the week, who consistently sat next to me during taxi rides to the store?”
  1. Planner creates steps (shopping mentions → taxi frames → overlap times → graph who’s beside me → tally).
  2. Audio tool finds shopping utterances Tuesday 10:22 and Thursday 18:05.
  3. Visual tool retrieves “taxi interior” frames near those times.
  4. Graph tool finds TALKS_TO and INTERACTS_WITH edges between Jake and Sam overlapping the same minutes.
  5. Analyzer keeps the clearest overlaps and writes them to memory.
  6. VQA agent answers “Sam,” referencing multiple days and matching timestamps.

04Experiments & Results

The test: Can the system answer multiple-choice questions about very long videos, especially ones requiring multi-hop, cross-modal reasoning? Two benchmarks are used.

  1. EgoLifeQA (week-long, egocentric): 500 MCQs from about 50 hours of video of one participant (“Jake”) over 7 days. Categories include EntityLog (visual identity), EventRecall (what happened when), HabitInsight (repeated behaviors), RelationMap (who talks/uses/interacts with whom), and TaskMaster (multi-step tasks across time).
  2. Video-MME (Long): 300 videos, each 30–60 minutes, with 2700 MCQs total in the full benchmark; the long split tests broad video comprehension.

The competition: Baselines include uniform-sampling MLLMs (like Gemini 2.5 Pro and GPT-4.1), RAG-style systems retrieving frames/captions, and agentic methods like EgoButler, VideoAgent, DrVideo, and others.

Scoreboard with context:

  • EgoLifeQA: EGAgent with Gemini 2.5 Pro reaches 57.5% average accuracy. Think of class grades: many systems hovered around mid-30s to mid-40s (like C’s). EGAgent jumps to the high 50s (a strong B), and on the toughest relational categories it rockets ahead.
  • Category wins: RelationMap and TaskMaster see the biggest boosts—over 20 percentage points above strong non-agentic baselines and over 30 points above earlier SOTA in some comparisons—showing the graph plus planning truly helps multi-hop, entity-centric problems.
  • Video-MME (Long): EGAgent with Gemini 2.5 Pro achieves 74.1% while processing far fewer frames than some methods. With the same lighter backbone (e.g., Qwen2.5-VL-7B), EGAgent exceeds prior video-RAG baselines and matches more complex graph-RAG systems while using over 10× fewer frames in some setups.

Meaningful numbers:

  • On EgoLifeQA, adding the entity graph on top of frames+transcripts improves accuracy in 4 of 5 categories. RelationMap and TaskMaster—where cross-modal, multi-hop reasoning is essential—benefit the most.
  • Using transcript-fused captions to build the graph (captions + transcripts) improves accuracy by around 2–3 percentage points compared to transcripts alone.
  • LLM-based transcript selection outperforms BM25 by roughly 6–9 percentage points on average, at the cost of higher latency and tokens.
  • Oracle tests (perfect time localization) show that even with exact timestamps, current VQA models top out below 70%, so better retrieval matters, but final reasoning models also limit ultimate scores.

Surprising findings:

  • Visual retrieval shows very high recall at tight time windows (e.g., 10 seconds), yet overall MCQ accuracy remains bounded, proving that “seeing the right frames” isn’t enough; you must connect them with audio and relationships.
  • The entity graph has weaker ultra-fine timing than frames/transcripts at tiny windows but is superb at broad temporal coverage and identity consistency, making it great for shortlist-then-refine.
  • Combining all three tools (graph + frames + transcripts) yields strong 10-second recall and the best overall accuracy, validating true cross-modal triangulation.

Efficiency and latency:

  • EGAgent typically answers an EgoLifeQA question in about 2–3 minutes wall-clock. The analyzer (MM-LM) dominates time; BM25 speeds things up but lowers accuracy.
  • Token budgeting is improved by narrowing to 50 retrieved frames and focused transcript snippets, rather than streaming everything into the VQA model.

Bottom line: When videos are “very long,” systems that simply sample uniformly or retrieve only one modality stumble. Planning over a time-aware entity graph, then cross-checking with frames and transcripts, wins on hard, relational questions and scales better in practice.

05Discussion & Limitations

Limitations

  • Upstream dependence: If speech recognition, diarization (who spoke when), or visual captioning misfires, the graph gets noisy names or wrong timestamps, hurting reasoning.
  • Data gaps: Not every relationship is spoken or seen clearly; the strict-to-relaxed strategy helps, but some links remain hidden.
  • Latency and cost: LLM-based transcript selection and analysis improve accuracy but add time and tokens. On-device or low-latency settings may prefer BM25, accepting a performance drop.
  • Final-model ceiling: Even with perfect retrieval windows, VQA accuracy caps below 70% on some tasks, so better reasoning models are still needed.

Required resources

  • A vision encoder (e.g., SigLIP2) for frame embeddings and a capable LLM/VLM for analysis and answering.
  • ASR (and ideally diarization) for reliable transcripts at scale.
  • Storage for a per-week graph (thousands of edges) and a vector DB for frames.
  • An orchestration framework (e.g., LangGraph) to run planning, tool calls, and memory updates.

When not to use

  • Very short videos or single-shot questions: A simple captioning/VQA pass might be faster and just as accurate.
  • Extremely privacy-sensitive settings without robust safeguards: Building a detailed, time-aware memory demands careful consent and protection.
  • Environments with poor audio or heavy occlusions: If transcripts and visuals are too weak, the graph will be sparse.

Open questions

  • Better identity persistence: Can we more robustly match “Alex,” “he,” and “blue-shirt person” across days with minimal errors?
  • Adaptive retrieval: When should the agent skip modalities to save time without hurting accuracy?
  • Richer relations: Beyond talks-to/uses/interacts-with/mentions, can we learn finer-grained roles (teaches, helps, hands-over) reliably?
  • Faster, cheaper reasoning: Can small, specialized VLMs approximate large models’ analysis for on-device use?
  • Trust and privacy: How do we allow users to inspect, correct, or forget parts of their personal graph safely?

06Conclusion & Future Work

Three-sentence summary: This paper introduces EGAgent, an agent that understands week-long, first-person videos by planning searches across frames, transcripts, and a time-aware entity scene graph. It answers complex, multi-hop questions by starting with strict, precise queries, relaxing only as needed, and stitching cross-modal evidence in a compact working memory. The result is state-of-the-art accuracy on hard relationship questions and efficient performance on very long video tasks.

Main achievement: Centering reasoning on a temporally annotated entity scene graph—then orchestrating cross-modal retrieval with an agentic planner—greatly improves long-horizon, relational video understanding.

Future directions: Strengthen identity tracking across days, expand relationship types, speed up transcript selection, and design lighter VLMs for on-device use. Explore user controls for privacy (view/edit/forget) and better explainability of answers. Tighten temporal localization even further to close the gap to oracle performance.

Why remember this: As smart glasses and always-on assistants become normal, being able to recall “who, where, when, and how” across a week is transformative. EGAgent shows that a structured, time-aware memory plus careful planning turns overwhelming video streams into helpful, human-scale answers.

Practical Applications

  • Smart glasses that remember where you left items across the week and tell you when you last saw them.
  • Personal wellness tracking that spots habits like water breaks, walks, or screen time patterns over days.
  • Household coordination, recalling who helped with chores and when, to fairly split tasks.
  • Study or training review, finding all moments you practiced a skill and summarizing progress.
  • Workplace safety audits that locate repeated risky behaviors over a shift or week.
  • Retail or logistics analysis that traces who interacted with which items and where, to improve workflows.
  • Meeting assistants that recall multi-day project updates, who discussed what, and action-item follow-through.
  • Travel diaries that auto-organize by people, places, and activities with accurate timestamps.
  • Sports film study that links plays to specific players and situations across entire tournaments.
  • Elder care support, recalling routines (medication, meals) and alerting caregivers to changes.
#entity scene graph · #agentic planning · #long-horizon video understanding · #egocentric video · #multi-hop reasoning · #cross-modal retrieval · #temporal localization · #video RAG · #SQLite graph querying · #SigLIP2 embeddings · #BM25 vs LLM transcript search · #EgoLifeQA · #Video-MME (Long) · #working memory · #strict-to-relaxed search
Version: 1