This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.
This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI “deep research agents” write long, citation-rich reports using both text and images.
ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.
Agentic-R is a new way to teach a search retriever to find not just similar text, but the text that truly helps an AI get the final answer right.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
UniX is a new medical AI that both understands chest X-rays, writing accurate reports, and generates chest X-ray images of high visual quality, without making the two jobs fight each other.
ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.
The paper shows that simply adding a new AI model to the menu, without anyone actually using it, can push a fairness-focused regulator to change the market rules, shifting money from one side to the other.
Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.
Big language models can learn new facts with simple tutoring (supervised fine-tuning, or SFT), but that doesn’t automatically teach them how to use those facts well.
The paper shows that changing the language a model “thinks in” (its language of thought) can make its English answers more varied without making them much worse in quality.
Chroma 1.0 is a real-time, end-to-end speech-to-speech system that can talk back in your own cloned voice with sub-second delay.