Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).