This paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools at answering time.
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.