Agentic Very Long Video Understanding
Key Summary
- The paper tackles understanding super long, first-person videos (days to a week) by giving an AI a smarter memory and better tools.
- It builds a time-stamped "map" of people, places, and objects, called an entity scene graph, so the AI can remember who did what, where, and when.
- A planning agent breaks a big question into smaller steps and searches three places: pictures (frames), speech (transcripts), and the graph (relationships).
- The searches follow a strict-to-relaxed strategy to find exact matches first, then gently widen the rules if nothing turns up.
- Evidence from all searches is stored in working memory, and a final VQA model uses it to answer the question clearly and consistently.
- On the week-long EgoLifeQA benchmark, the method sets a new state of the art (57.5% average MCQ accuracy with Gemini 2.5 Pro).
- It especially shines on questions needing multi-hop reasoning about relationships (RelationMap and TaskMaster), with big gains over prior systems.
- On Video-MME (Long), it reaches 74.1% with many fewer frames than some competitors, showing efficiency as well as accuracy.
- Ablations show adding captions to transcripts improves the graph, and LLM-based transcript search boosts quality but costs more time.
- Limitations include reliance on accurate speech transcripts and detection, and performance can dip with noisy diarization or perception errors.
Why This Research Matters
Always-on assistants need memory, not just eyesight. This work gives AI a structured, time-aware memory of people, places, and objects so it can answer real-life questions that stretch across days. It helps with everyday tasks like finding lost items, recalling conversations, and noticing healthy (or unhealthy) habits. It saves time by searching precisely instead of scanning everything, which also reduces cost. It also points toward safer, more transparent AI since the graph can show why an answer was given. As wearables spread, this approach turns overwhelming video streams into practical, trustworthy help.
Detailed Explanation
01 Background & Problem Definition
You know how it's easy to remember what you did this morning, but much harder to recall what you did every day last week? Computers feel the same way about videos. Short clips are manageable, but a whole week of video is like a never-ending story. Before this research, most AI systems were pretty good at short, standalone videos, like TikToks or YouTube snippets. They could recognize objects, describe scenes, and answer simple questions. But when videos stretched to many minutes, or even an hour, researchers had to squeeze or summarize the video to fit the AI's memory limits. That helped a bit, yet it still missed a lot of connections over time.
Hook: Imagine trying to summarize a whole week at summer camp using only a sticky note. You'll miss who you met, which cabin you slept in, and what happened on different days.
The Concept: Longitudinal Video Understanding
- What it is: Understanding videos that show someone's life over long stretches, days or even weeks, so the AI can remember and connect far-apart moments.
- How it works: 1) Keep track of many events across time; 2) Notice repeats like habits; 3) Link people, places, and objects across days; 4) Answer questions that require looking back and forth in time.
- Why it matters: Without it, an AI assistant on smart glasses can't answer everyday questions like "Where did I leave my keys yesterday?" or "Who sat next to me on all my taxi rides to the store?"
Anchor: A kid asks, "How many times did I practice piano this week, and when?" Longitudinal understanding lets the AI search the whole week, not just one afternoon.
The problem: Modern language-and-vision AIs have limited "context windows" (how much input they can consider at once). For a week-long video, you can't just stuff every frame and every spoken word into the model. People tried clever tricks: pick a few important frames, compress visual tokens, summarize in sliding windows, and retrieve small text chunks from captions. Helpful, yes, but these methods lose a key ingredient: who is related to whom, when those interactions happened, and how different pieces of evidence connect across days.
Hook: You know how your brain files people, places, and things into mental folders? If you only keep loose notes, you'll forget which note goes with which person.
The Concept: Temporal Localization
- What it is: Pinpointing exactly when something happened in a long video.
- How it works: 1) Use clues from sound (words, names, times); 2) Use visuals (objects, scenes); 3) Match those clues to precise timestamps; 4) Zoom in on the right slices of time.
- Why it matters: Without good timing, the AI either searches too much and gets lost, or searches too little and misses the answer.
Anchor: "When did we talk about the science fair?" Temporal localization finds the exact minutes those words were spoken.
Failed attempts: 1) Only retrieving captions loses visual detail and relationships. 2) Only retrieving frames misses names and spoken hints. 3) Treating each part separately doesn't help the AI tie people, objects, and places together over time. 4) Prior "agent" systems often used unstructured notes, so they forgot connections as time stretched.
The gap: We need a memory that is structured around entities (people, places, objects), knows their relationships (talks-to, uses, interacts-with, mentions), and is time-aware. Plus, we need a planner that can ask the right questions of the right tools.
Real stakes: With smart glasses or always-on assistants, questions are personal and practical: "Where did I put my wallet?" "Who joined me for Tuesday's lab?" "How often did I drink water this week?" In safety, health, and productivity, getting these answers right and fast truly matters.
Hook: Imagine your notebook turns into a tidy map showing who met whom, where, and when.
The Concept: Cross-Modal Reasoning
- What it is: Combining different kinds of clues (images, speech, and structured relationships) to reach better answers.
- How it works: 1) Look at pictures for scene and objects; 2) Read transcripts for names and events; 3) Consult a relationship map to connect dots across time; 4) Mix them to confirm or correct each other.
- Why it matters: Without mixing modalities, the AI might miss that "the person in the blue shirt" and "Alex" are the same person, or that "the store" in the video is the same place mentioned in speech.
Anchor: To answer "Who sat next to me on my shopping taxi rides?", audio confirms "shopping," visuals show the taxi seats, and the relationship map links repeated rides to the same friend.
02 Core Idea
Aha! Moment in one sentence: Put a time-stamped map of people, places, and objects (an entity scene graph) at the center, and let an agent plan careful searches across audio, visuals, and that map, then stitch the evidence together.
Hook: Think of a giant scrapbook where each photo has sticky notes: who's in it, where, what they did, and the exact time.
The Concept: Entity Scene Graphs
- What it is: A structured map with nodes (people, places, objects) and edges (relationships like talks-to, uses, interacts-with, mentions), each edge labeled with when it happened.
- How it works: 1) Extract names and items from transcripts and captions; 2) Link them with relationship types; 3) Add start and end times; 4) Store everything in a database you can query.
- Why it matters: Without a graph, information is scattered. The AI can't reliably follow who did what with whom across many days.
Anchor: If you ask, "Before we went to see the dog, who went with me to the second floor to find Tasha?", the graph helps trace those linked steps in order.
Hook: Imagine a librarian who knows which shelf, which book, and which page holds your answer.
The Concept: Multi-hop Reasoning
- What it is: Solving a problem by making several logical jumps: A leads to B, which leads to C.
- How it works: 1) Break a big question into smaller parts; 2) Answer the first part; 3) Use that to ask the next part; 4) Keep going until the final answer appears.
- Why it matters: Without multi-hop, the AI gets stuck on simple, single-step facts and can't solve multi-day puzzles.
Anchor: To learn "who sat next to me on shopping taxi rides," the AI: finds all shopping trips (hop 1), finds taxi rides (hop 2), then the neighbor in each ride (hop 3), and finally counts who appears most (hop 4).
Three analogies for the core idea:
- Detective board: Photos (frames), call logs (transcripts), and strings linking people/places (graph) help solve a long case.
- Library catalog: You don't read every book; you query the catalog (graph) to find exact shelves (times), then peek into the right pages (frames/transcripts).
- Recipe finder: Instead of tasting every dish, you search the index (graph), then check key ingredients (transcripts) and photos (frames).
Before vs. After:
- Before: Long videos felt like a messy attic: too many items, no labels, hard to connect events across days.
- After: The attic is labeled and time-stamped. A planner asks focused questions, finds precise times, and cross-checks across audio and visuals.
Why it works (intuition):
- Graphs compress the who-did-what-with-whom-when information so queries are surgical, not sweeping.
- Visual embeddings fetch look-alike scenes fast; transcripts fetch name/date clues; the graph ties them together for consistency.
- An agent's strict-to-relaxed querying avoids false hits first, then broadens carefully for recall.
- A working memory keeps only the best evidence to fit model limits.
Hook: You know how a coach assigns drills before the big game?
The Concept: Planning Agent
- What it is: A controller that breaks a big question into mini-steps and picks the right tool for each step.
- How it works: 1) Read the question; 2) Make a plan with up to five steps; 3) Route steps to audio, visual, or graph search; 4) Gather and refine evidence.
- Why it matters: Without planning, the AI wastes time searching everything and still misses connections.
Anchor: For "When did I chat with Alex at the kitchen island?", the planner first finds "Alex" mentions (audio), then finds "kitchen island" (visual), then checks the graph for talks-to edges overlapping those times.
Hook: Think of your backpack pocket where you keep only the essentials for a hike.
The Concept: Working Memory
- What it is: A compact notebook of the best cross-modal evidence found so far.
- How it works: 1) Add analyzed snippets; 2) De-duplicate; 3) Keep timestamps and sources; 4) Hand it to the final answerer.
- Why it matters: Without it, the AI forgets what it already proved and repeats itself.
Anchor: "Danced between 15:50–16:07 on Day 2" and "Shure said 'Got it' to Alice at 15:50:21–15:50:22" live together in memory to support the answer; a minimal sketch of such entries follows below.
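A working memory of this kind can be represented as a small, de-duplicated list of timestamped notes. Below is a minimal sketch; the class and field names are illustrative assumptions, not the paper's exact data structure.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvidenceNote:
    """One compact, timestamped piece of cross-modal evidence."""
    day: int      # which day of the week-long recording
    start_t: str  # e.g., "15:50:21"
    end_t: str    # e.g., "15:50:22"
    source: str   # "visual", "audio", or "graph"
    note: str     # short, analyzer-written summary

@dataclass
class WorkingMemory:
    notes: list[EvidenceNote] = field(default_factory=list)

    def add(self, note: EvidenceNote) -> None:
        # De-duplicate: skip notes that are already recorded.
        if note not in self.notes:
            self.notes.append(note)

    def render(self) -> str:
        # Compact text handed to the final VQA model.
        return "\n".join(
            f"[Day {n.day} {n.start_t}-{n.end_t}] ({n.source}) {n.note}"
            for n in sorted(self.notes, key=lambda n: (n.day, n.start_t))
        )

# Example entries from the anchor above
memory = WorkingMemory()
memory.add(EvidenceNote(2, "15:50:00", "16:07:00", "visual", "Danced"))
memory.add(EvidenceNote(2, "15:50:21", "15:50:22", "audio", 'Shure said "Got it" to Alice'))
print(memory.render())
```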
Hook: Start with a tight filter, then loosen gently if needed, like looking for your lost pencil first on your desk, then your room, then the house.
The Concept: Strict-to-Relaxed SQL Search
- What it is: A careful strategy for querying the graph: begin exact, then widen by time, day, names, and relationship type if necessary.
- How it works: 1) Exact day/time/entities/relationship; 2) Relax time; 3) Relax day; 4) Partial name matches; 5) Finally drop relationship constraint.
- Why it matters: Without this, you either miss the right rows (too strict) or drown in noise (too relaxed).
Anchor: If "Alex" wasn't spelled consistently, relaxing to LIKE 'Alex%' recovers matches that exact search would miss; the sketch below walks through this relaxation ladder.
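Here is a minimal sketch of the relaxation ladder in Python over a hypothetical edges table with the columns described later in the Methodology (source, target, relationship, start_t, end_t, day). The query templates are illustrative, not the paper's implementation.

```python
import sqlite3

def graph_search(conn: sqlite3.Connection, entity: str, relationship: str,
                 day: int, start_t: str, end_t: str) -> list[tuple]:
    """Try progressively relaxed SQL queries; return the first non-empty result."""
    queries = [
        # 1) Exact day, time window, entity, and relationship
        ("SELECT * FROM edges WHERE day = ? AND relationship = ? AND source = ? "
         "AND start_t >= ? AND end_t <= ?",
         (day, relationship, entity, start_t, end_t)),
        # 2) Relax the time window (keep day, entity, relationship)
        ("SELECT * FROM edges WHERE day = ? AND relationship = ? AND source = ?",
         (day, relationship, entity)),
        # 3) Relax the day
        ("SELECT * FROM edges WHERE relationship = ? AND source = ?",
         (relationship, entity)),
        # 4) Partial name match (handles inconsistent spellings of 'Alex')
        ("SELECT * FROM edges WHERE relationship = ? AND source LIKE ?",
         (relationship, entity + "%")),
        # 5) Finally drop the relationship constraint
        ("SELECT * FROM edges WHERE source LIKE ?", (entity + "%",)),
    ]
    for sql, params in queries:
        rows = conn.execute(sql, params).fetchall()
        if rows:  # stop at the strictest query that finds evidence
            return rows
    return []
```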
03 Methodology
At a high level: Input (week-long video with audio) → Build resources (frames, transcripts, entity scene graph) → Agent plans and runs searches (visual, audio, graph) → Analyzer trims and fuses evidence into working memory → VQA model answers.
Step 0: Prepare the multimodal resources
- What happens: Sample video at 1 FPS to get frames; run ASR for transcripts (speaker diarization if available); generate visual captions every 30s; fuse captions with transcripts into richer text. Extract entities and relationships with an LLM, add timestamps from captions/transcripts, and store the entity scene graph as rows in a SQLite database: source, target, relationship, start_t, end_t, day, and supporting text.
- Why it exists: These resources give three complementary views: images (what it looks like), words (what was said), and structure (who-with-whom-when). Without them, later searches would be blind or inconsistent.
- Example: A 7-day video yields thousands of frames, many transcript snippets, and graph edges like Jake (Person) TALKS_TO Alice (Person) at 15:50:21–15:50:22 on Day 2 (a schema sketch follows below).
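A sketch of what that storage could look like; the table and column names follow the list above, but the exact schema is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect("entity_scene_graph.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS edges (
    source        TEXT,     -- entity name, e.g. 'Jake'
    source_type   TEXT,     -- 'Person', 'Place', or 'Object'
    target        TEXT,
    target_type   TEXT,
    relationship  TEXT,     -- TALKS_TO, USES, INTERACTS_WITH, MENTIONS
    start_t       TEXT,     -- e.g. '15:50:21'
    end_t         TEXT,     -- e.g. '15:50:22'
    day           INTEGER,
    text          TEXT      -- supporting caption/transcript snippet
)""")

# The example edge from above
conn.execute(
    "INSERT INTO edges VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("Jake", "Person", "Alice", "Person", "TALKS_TO",
     "15:50:21", "15:50:22", 2, "supporting transcript/caption snippet"),
)
conn.commit()
```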
Hook: It's like building your study kit before an exam: notes, flashcards, and an index.
The Concept: Audio-Visual Search
- What it is: Tools that find relevant frames and transcript snippets for a sub-question.
- How it works: Visual search uses SigLIP2 embeddings to retrieve look-alike frames, optionally filtered by time or place. Transcript search can be LLM-based (smarter, slower) or BM25 (faster, simpler) to find matching words/phrases.
- Why it matters: Without quick, precise retrieval, the agent wastes compute and still misses the right moments.
Anchor: Query "dancing" on Day 2 afternoon finds frames with dancing; transcript search for "music" or "dance" lines up the timing. A minimal visual-retrieval sketch follows below.
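A minimal sketch of the visual retrieval step, assuming frame and text embeddings come from the same SigLIP-style encoder; the embedding model itself is left abstract here.

```python
import numpy as np

def top_k_frames(query_emb: np.ndarray, frame_embs: np.ndarray,
                 timestamps: list[str], k: int = 5,
                 time_filter=None) -> list[tuple[str, float]]:
    """Return the k frames most similar to the text query, optionally time-filtered."""
    # Cosine similarity between the query and every frame embedding.
    q = query_emb / np.linalg.norm(query_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = f @ q
    results = []
    for i in np.argsort(-sims):
        ts = timestamps[i]
        if time_filter is None or time_filter(ts):  # e.g., restrict to Day 2 afternoon
            results.append((ts, float(sims[i])))
        if len(results) == k:
            break
    return results
```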
Step 1: Planning and subtask creation
- What happens: The Planning Agent decomposes the user's question into up to five crisp steps and assigns each to visual, audio, or graph search, providing any time filters.
- Why it exists: Tackling a giant question all at once is inefficient; breaking it down improves precision and reduces token use.
- Example: For "Who often sits next to me during taxi rides to go shopping?", steps might be: (a) find shopping mentions (audio), (b) find taxi scenes (visual), (c) intersect times, (d) use graph to see who interacts with me then, (e) tally the most frequent neighbor. A sketch of such a plan appears below.
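Such a plan can be expressed as a short, structured list that routes each sub-question to a tool. The field names below are illustrative assumptions, not the paper's exact planner format.

```python
plan = [
    {"step": 1, "tool": "audio",  "query": "mentions of shopping or going to the store"},
    {"step": 2, "tool": "visual", "query": "taxi interior",
     "time_filter": "around step-1 times"},
    {"step": 3, "tool": "graph",  "query": "TALKS_TO or INTERACTS_WITH edges involving Jake",
     "time_filter": "overlap of steps 1 and 2"},
    {"step": 4, "tool": "analyzer", "query": "tally the most frequent seat neighbor"},
]

for step in plan:
    # Each step is routed to exactly one retrieval or analysis tool.
    print(f"Step {step['step']}: send '{step['query']}' to the {step['tool']} tool")
```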
Step 2: Run the retriever tools
- Visual Search Tool: Stores frame embeddings and metadata (timestamps, locations). Given a short text query like "taxi interior," it returns top-k nearest frames for the window of interest.
- Audio Transcript Search Tool: Either feeds per-day transcripts to an LLM to select relevant snippets with timestamps, or runs BM25 over all transcripts to get candidate lines.
- Entity Graph Search Tool: Issues SQL over the graph. It starts strict (exact day, time, entities, relationship) and relaxes constraints if needed. This prioritizes accuracy but still finds matches under real-world messiness (misspellings, timing drift).
- Why these exist: Each tool sees a different facet: visuals excel at appearance and scene type; audio captures names and intents; the graph maintains identity and relationships over time. Without one of them, crucial clues would be lost.
- Mini example: Visual returns frames labeled "car interior" around 10:10–10:40; audio finds "shopping list" at 10:20; graph shows TALKS_TO between "Jake" and "Sam" overlapping those minutes. A small BM25 transcript-search sketch follows below.
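For the faster transcript path, here is a BM25 sketch using the rank_bm25 package (one common BM25 implementation; the paper does not prescribe a specific library). Timestamps and lines are adapted from the running example.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Each transcript line keeps its timestamp so hits can be localized in time.
transcript = [
    ("Day 4 10:20", "Let's make a shopping list before the taxi arrives"),
    ("Day 4 10:22", "Let's go to the store now"),
    ("Day 4 18:05", "Dinner is ready"),
]
corpus = [line for _, line in transcript]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "shopping store".split()
scores = bm25.get_scores(query)
best = sorted(zip(transcript, scores), key=lambda x: -x[1])[:2]
for (ts, line), score in best:
    print(f"{ts}  {line}  (score={score:.2f})")
```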
Step 3: Analyze and refine (Analyzer Tool)
- What happens: An LLM examines retrieved items, discards off-target ones, extracts the most relevant evidence, and writes concise notes with timestamps into working memory.
- Why it exists: Retrieval can be noisy. The analyzer keeps memory tidy, preventing confusion later.
- Example: From 50 frames, it keeps just 4 that clearly show two people in a taxi, and links them to the transcript line "Let's go to the store now," Day 4 at 10:22.
Step 4: Synthesize the final answer (VQA Agent)
- What happens: The VQA model reads the original question plus the compact working memory and produces the final answer, citing the aligned evidence across modalities.
- Why it exists: This last step assembles the jigsaw pieces into a single, confident response.
- Example: "Sam sits next to you most often during taxi rides to the store," supported by repeated TALKS_TO and co-occurring taxi frames on Days 2, 4, and 6. A sketch of assembling the final prompt appears below.
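A minimal sketch of how the question, answer options, and working-memory notes might be assembled into the final prompt; the wording and helper name are assumptions, not the paper's template.

```python
def build_vqa_prompt(question: str, options: list[str], memory_text: str) -> str:
    """Combine the question, answer options, and compact evidence into one prompt."""
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "Answer the multiple-choice question using only the evidence below.\n\n"
        f"Question: {question}\n{option_block}\n\n"
        f"Evidence (timestamped, cross-modal):\n{memory_text}\n\n"
        "Reply with the letter of the best-supported option and cite the timestamps used."
    )

prompt = build_vqa_prompt(
    "Who consistently sat next to me during taxi rides to the store?",
    ["Alice", "Sam", "Tasha", "Shure"],
    "[Day 2 10:15-10:40] (graph) Jake TALKS_TO Sam in taxi\n"
    "[Day 4 10:22-10:40] (visual) Sam beside Jake in car-interior frames",
)
print(prompt)
```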
Secret sauce highlights
- Time-aware entity graph: Edges have start and end times, so reasoning stays aligned to real moments, not vague summaries.
- Strict-to-relaxed SQL: Lowers false positives by starting exact; increases recall smoothly when data is messy.
- Cross-modal triangulation: A claim must pass multiple checks: visual look, spoken words, and relationship consistency.
- Efficient context budgeting: The system retrieves only what's needed into working memory, avoiding context overflow.
What breaks without each step
- No graph: The system forgets identities across days; relation-heavy questions collapse.
- No transcripts: Names and intents vanish; habits are hard to count.
- No visual embeddings: Scene types and subtle actions get missed.
- No analyzer: Noise floods memory; the final answer waffles.
- No planner: The system thrashes, burning time and tokens on guesses.
Concrete walk-through
- Question: "Across the week, who consistently sat next to me during taxi rides to the store?"
- Planner creates steps (shopping mentions → taxi frames → overlap times → graph who's beside me → tally).
- Audio tool finds shopping utterances Tuesday 10:22 and Thursday 18:05.
- Visual tool retrieves "taxi interior" frames near those times.
- Graph tool finds TALKS_TO and INTERACTS_WITH edges between Jake and Sam overlapping the same minutes.
- Analyzer keeps the clearest overlaps and writes them to memory.
- VQA agent answers "Sam," referencing multiple days and matching timestamps.
04 Experiments & Results
The test: Can the system answer multiple-choice questions about very long videos, especially ones requiring multi-hop, cross-modal reasoning? Two benchmarks are used.
- EgoLifeQA (week-long, egocentric): 500 MCQs from about 50 hours of video of one participant ("Jake") over 7 days. Categories include EntityLog (visual identity), EventRecall (what happened when), HabitInsight (repeated behaviors), RelationMap (who talks/uses/interacts with whom), and TaskMaster (multi-step tasks across time).
- Video-MME (Long): 300 videos, each 30–60 minutes, with 2,700 MCQs total in the full benchmark; the long split tests broad video comprehension.
The competition: Baselines include uniform-sampling MLLMs (like Gemini 2.5 Pro and GPT-4.1), RAG-style systems retrieving frames/captions, and agentic methods like EgoButler, VideoAgent, DrVideo, and others.
Scoreboard with context:
- EgoLifeQA: EGAgent with Gemini 2.5 Pro reaches 57.5% average accuracy. Think of class grades: many systems hovered around mid-30s to mid-40s (like C's). EGAgent jumps to the high 50s (a strong B), and on the toughest relational categories it rockets ahead.
- Category wins: RelationMap and TaskMaster see the biggest boosts, over 20 percentage points above strong non-agentic baselines and over 30 points above earlier SOTA in some comparisons, showing the graph plus planning truly helps multi-hop, entity-centric problems.
- Video-MME (Long): EGAgent with Gemini 2.5 Pro achieves 74.1% while processing far fewer frames than some methods. Against methods with the same lighter backbone (e.g., Qwen2.5-VL-7B), EGAgent exceeds prior video-RAG baselines and matches more complex graph-RAG systems with over 10× fewer frames on some setups.
Meaningful numbers:
- On EgoLifeQA, adding the entity graph on top of frames+transcripts improves accuracy in 4 of 5 categories. RelationMap and TaskMaster, where cross-modal, multi-hop reasoning is essential, benefit the most.
- Using transcript-fused captions to build the graph (captions + transcripts) improves accuracy by around 2–3 percentage points compared to transcripts alone.
- LLM-based transcript selection outperforms BM25 by roughly 6–9 percentage points on average, at the cost of higher latency and tokens.
- Oracle tests (perfect time localization) show that even with exact timestamps, current VQA models top out below 70%, so better retrieval matters, but final reasoning models also limit ultimate scores.
Surprising findings:
- Visual retrieval shows very high recall at tight time windows (e.g., 10 seconds), yet overall MCQ accuracy remains bounded, proving that "seeing the right frames" isn't enough; you must connect them with audio and relationships.
- The entity graph has weaker ultra-fine timing than frames/transcripts at tiny windows but is superb at broad temporal coverage and identity consistency, making it great for shortlist-then-refine.
- Combining all three tools (graph + frames + transcripts) yields strong 10-second recall and the best overall accuracy, validating true cross-modal triangulation.
Efficiency and latency:
- EGAgent typically answers an EgoLifeQA question in about 2–3 minutes wall-clock. The analyzer (MM-LM) dominates time; BM25 speeds things up but lowers accuracy.
- Token budgeting is improved by narrowing to 50 retrieved frames and focused transcript snippets, rather than streaming everything into the VQA model.
Bottom line: When videos are "very long," systems that simply sample uniformly or retrieve only one modality stumble. Planning over a time-aware entity graph, then cross-checking with frames and transcripts, wins on hard, relational questions and scales better in practice.
05 Discussion & Limitations
Limitations
- Upstream dependence: If speech recognition, diarization (who spoke when), or visual captioning misfires, the graph gets noisy names or wrong timestamps, hurting reasoning.
- Data gaps: Not every relationship is spoken or seen clearly; the strict-to-relaxed strategy helps, but some links remain hidden.
- Latency and cost: LLM-based transcript selection and analysis improve accuracy but add time and tokens. On-device or low-latency settings may prefer BM25, accepting a performance drop.
- Final-model ceiling: Even with perfect retrieval windows, VQA accuracy caps below 70% on some tasks, so better reasoning models are still needed.
Required resources
- A vision encoder (e.g., SigLIP2) for frame embeddings and a capable LLM/VLM for analysis and answering.
- ASR (and ideally diarization) for reliable transcripts at scale.
- Storage for a per-week graph (thousands of edges) and a vector DB for frames.
- An orchestration framework (e.g., LangGraph) to run planning, tool calls, and memory updates.
When not to use
- Very short videos or single-shot questions: A simple captioning/VQA pass might be faster and just as accurate.
- Extremely privacy-sensitive settings without robust safeguards: Building a detailed, time-aware memory demands careful consent and protection.
- Environments with poor audio or heavy occlusions: If transcripts and visuals are too weak, the graph will be sparse.
Open questions
- Better identity persistence: Can we more robustly match "Alex," "he," and "blue-shirt person" across days with minimal errors?
- Adaptive retrieval: When should the agent skip modalities to save time without hurting accuracy?
- Richer relations: Beyond talks-to/uses/interacts-with/mentions, can we learn finer-grained roles (teaches, helps, hands-over) reliably?
- Faster, cheaper reasoning: Can small, specialized VLMs approximate large modelsâ analysis for on-device use?
- Trust and privacy: How do we allow users to inspect, correct, or forget parts of their personal graph safely?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces EGAgent, an agent that understands week-long, first-person videos by planning searches across frames, transcripts, and a time-aware entity scene graph. It answers complex, multi-hop questions by starting with strict, precise queries, relaxing only as needed, and stitching cross-modal evidence in a compact working memory. The result is state-of-the-art accuracy on hard relationship questions and efficient performance on very long video tasks.
Main achievement: Centering reasoning on a temporally annotated entity scene graph, then orchestrating cross-modal retrieval with an agentic planner, greatly improves long-horizon, relational video understanding.
Future directions: Strengthen identity tracking across days, expand relationship types, speed up transcript selection, and design lighter VLMs for on-device use. Explore user controls for privacy (view/edit/forget) and better explainability of answers. Tighten temporal localization even further to close the gap to oracle performance.
Why remember this: As smart glasses and always-on assistants become normal, being able to recall "who, where, when, and how" across a week is transformative. EGAgent shows that a structured, time-aware memory plus careful planning turns overwhelming video streams into helpful, human-scale answers.
Practical Applications
- Smart glasses that remember where you left items across the week and tell you when you last saw them.
- Personal wellness tracking that spots habits like water breaks, walks, or screen time patterns over days.
- Household coordination, recalling who helped with chores and when, to fairly split tasks.
- Study or training review, finding all moments you practiced a skill and summarizing progress.
- Workplace safety audits that locate repeated risky behaviors over a shift or week.
- Retail or logistics analysis that traces who interacted with which items and where, to improve workflows.
- Meeting assistants that recall multi-day project updates, who discussed what, and action-item follow-through.
- Travel diaries that auto-organize by people, places, and activities with accurate timestamps.
- Sports film study that links plays to specific players and situations across entire tournaments.
- Elder care support, recalling routines (medication, meals) and alerting caregivers to changes.