The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.
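To make the idea concrete, here is a rough, illustrative sketch (not the paper's actual method): it scores a caption with GPT-2 and flags the top ~20% highest-entropy positions as candidate "fork" words. The model choice, the entropy criterion, and the 20% cutoff are all assumptions for demonstration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Minimal sketch: treat the positions where the language model was most
# uncertain about the next token (highest entropy) as candidate "forks".
# GPT-2 and the 20% cutoff are stand-ins, not the paper's actual setup.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

caption = "A man in a red jacket is riding a bicycle down a busy street."
ids = tok(caption, return_tensors="pt").input_ids  # (1, T)

with torch.no_grad():
    logits = model(ids).logits  # (1, T, vocab)

# Entropy of the distribution that predicted each token: position t is
# predicted from the logits at position t-1, so skip the first token.
entropy = torch.distributions.Categorical(logits=logits[0, :-1]).entropy()  # (T-1,)
tokens = tok.convert_ids_to_tokens(ids[0].tolist())[1:]  # tokens being predicted

# Keep the top ~20% most uncertain positions as candidate "fork" words.
k = max(1, int(0.2 * len(entropy)))
fork_positions = entropy.topk(k).indices.tolist()
for i in sorted(fork_positions):
    print(f"{tokens[i]!r:>12}  entropy={entropy[i]:.2f}")
```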
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).
Big vision-language models are super smart but too large to fit on phones and small devices.
Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.
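One plausible way to picture this (an assumption about the mechanism, not the paper's exact recipe) is to sample a small latent vector, project it into the model's embedding space, and prepend it to the prompt as a soft prefix before decoding. The toy sketch below does exactly that with GPT-2 and an untrained projection layer, so the output only shows where the latent code plugs in, not what a trained system would say.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Toy sketch: condition decoding on a latent "mood" code by mapping it to a
# soft-prefix embedding prepended to the prompt. GPT-2 and the untrained
# projection layer are illustrative stand-ins; a real method would learn these.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Question: If Tom has 3 apples and buys 2 more, how many does he have? Answer:"
ids = tok(prompt, return_tensors="pt").input_ids
token_embeds = model.transformer.wte(ids)              # (1, T, n_embd)

latent_dim = 16
z = torch.randn(1, latent_dim)                          # sampled latent "mood"
to_prefix = torch.nn.Linear(latent_dim, model.config.n_embd)
prefix = to_prefix(z).unsqueeze(1)                      # (1, 1, n_embd)

# Greedy decoding with the latent prefix prepended to the input embeddings.
inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
generated = []
with torch.no_grad():
    for _ in range(20):
        logits = model(inputs_embeds=inputs_embeds).logits
        next_id = logits[0, -1].argmax()
        generated.append(next_id.item())
        next_embed = model.transformer.wte(next_id.view(1, 1))
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)

print(tok.decode(generated))
```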
This paper teaches a vision-language model to first find objects in real 3D space (not just 2D pictures) and then reason about where things are.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.
FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.
The paper defines Microscopic Spatial Intelligence (MiSI) as the skill AI needs to understand tiny 3D things like molecules from 2D pictures and text, just like scientists do.
The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.
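Cross-lingual retrieval setups like this typically embed queries and document images into one shared vector space and rank by similarity. The sketch below shows only that final ranking step, with random vectors standing in for a real multilingual encoder; it is not M3DR's actual model.

```python
import numpy as np

# Sketch of the retrieval step only: queries (in any language) and document
# images are assumed to already live in one shared embedding space. A real
# system would produce these vectors with a multilingual text/image encoder.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 512))   # 1000 document images
query_embedding = rng.normal(size=(512,))       # one query, in any of the 22 languages

def top_k(query, docs, k=5):
    """Rank document images by cosine similarity to the query."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = docs_n @ query_n
    best = np.argsort(-scores)[:k]
    return list(zip(best.tolist(), scores[best].tolist()))

for doc_id, score in top_k(query_embedding, doc_embeddings):
    print(f"doc {doc_id}: cosine={score:.3f}")
```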