The paper tackles understanding super long, first‑person videos (days to a week) by giving an AI a smarter memory and better tools.
This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.
This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.