Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.
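To picture what "reusing the same layer" means, here is a minimal PyTorch sketch of a weight-tied (recursive) encoder: one layer's parameters are applied at every depth step, so the weights take a fraction of the memory of a normal stack. This only illustrates the general idea, not this paper's actual architecture or sizes.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Toy weight-tied encoder: a single layer reused at every depth step."""
    def __init__(self, d_model=256, n_heads=4, depth=12):
        super().__init__()
        # One layer holds all the transformer parameters...
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.depth = depth

    def forward(self, x):
        # ...and is applied `depth` times: weight memory stays constant with depth
        # (the saving), but every step must reuse the same weights (the
        # expressiveness cost the summary above refers to).
        for _ in range(self.depth):
            x = self.shared_layer(x)
        return x

x = torch.randn(2, 16, 256)           # (batch, sequence length, hidden size)
print(RecursiveEncoder()(x).shape)    # torch.Size([2, 16, 256])
```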
DrivePI is a single, small (0.5B-parameter) multimodal language model that sees with cameras and LiDAR, talks in natural language, and plans driving actions all at once.
Reasoning tokens (the words a model writes before its final answer) help the model think better, but they are not a trustworthy diary of how it really thought.
NL2Repo-Bench is a new benchmark that tests if coding agents can build a whole Python library from just one long natural-language document and an empty folder.
WebOperator is a method that lets an AI agent plan over a map of choices (a search tree) so it can navigate websites safely and reach its goals.
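To make "a map of choices" concrete, here is a tiny, generic best-first tree search over candidate web actions (clicks, form fills, and so on). The names and scoring are hypothetical placeholders; this shows the planning pattern, not WebOperator's actual algorithm or safety checks.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                                   # heapq pops the smallest, so store -score
    actions: list = field(compare=False, default_factory=list)

def tree_search(initial_actions, expand, score, is_goal, budget=50):
    """Explore sequences of web actions, always extending the most promising one."""
    frontier = [Node(-score([a]), [a]) for a in initial_actions]
    heapq.heapify(frontier)
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)
        if is_goal(node.actions):
            return node.actions                        # a plan that reaches the goal
        for action in expand(node.actions):            # candidate next clicks / typing
            plan = node.actions + [action]
            heapq.heappush(frontier, Node(-score(plan), plan))
    return None                                        # budget exhausted without a plan
```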
Standard attention is slow for long texts because it compares every word with every other word, which takes quadratic time.
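A tiny worked example of why that is quadratic: the score matrix below has one entry for every pair of positions, so doubling the text length quadruples it (4,096² ≈ 16.8M entries vs. 8,192² ≈ 67M).

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Every query position is scored against every key position,
    # so `scores` is an (n x n) matrix: compute and memory grow as n^2.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

n, d = 4096, 64
q = k = v = torch.randn(n, d)
out = naive_attention(q, k, v)        # the intermediate score matrix is 4096 x 4096
```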
AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.
VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.
This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.
V-RGBX is a new video editing system that lets you change the true building blocks of a scene, like base color (albedo), surface bumps (normals), material, and lighting, rather than just painting over pixels.
The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
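For intuition, here is a toy sketch of that recipe: freeze a vision encoder, treat its feature map as the "latent" that a diffusion model corrupts and learns to clean up, and never touch a VAE. Every module and loss below is a simplified placeholder (a real setup would use an actual DINOv3 encoder, a large denoiser conditioned on the text prompt and the timestep, and a proper diffusion or flow-matching objective), so read it as the shape of the idea rather than the paper's method.

```python
import torch
import torch.nn as nn

class FrozenEncoderStandIn(nn.Module):
    """Placeholder for a frozen vision foundation model (a DINOv3-style backbone)."""
    def __init__(self, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        for p in self.parameters():
            p.requires_grad = False                    # features are targets, never trained

    def forward(self, img):
        return self.patch_embed(img)                   # (B, dim, H/16, W/16) feature map

encoder = FrozenEncoderStandIn()
denoiser = nn.Conv2d(768, 768, 3, padding=1)           # stand-in for the generative backbone

img = torch.randn(2, 3, 224, 224)
feats = encoder(img)                                   # "latents" = foundation-model features
t = torch.rand(2, 1, 1, 1)                             # per-sample corruption level in [0, 1]
noisy = (1 - t) * feats + t * torch.randn_like(feats)  # blend features toward pure noise
loss = ((denoiser(noisy) - feats) ** 2).mean()         # learn to recover the clean features
loss.backward()                                        # only the denoiser gets gradients
```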