Recursive Language Models (RLMs) let an AI read and work with prompts that are much longer than its normal memory by treating the prompt like a big external document it can open, search, and study with code.
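The core trick is easy to picture in code. Here is a minimal sketch, not the paper's implementation: the huge prompt stays in an ordinary Python string, the model asks for searches over it, and only small excerpts ever enter a real model call (the `llm()` stub below stands in for one).

```python
# Minimal RLM-style loop: the long prompt lives outside the model as
# a plain string; the model only ever sees small excerpts of it.

def llm(prompt: str) -> str:
    return "stub reply"  # hypothetical stand-in for a real model API call

def grep(document: str, keyword: str, window: int = 200) -> list[str]:
    """Return small excerpts around each occurrence of keyword."""
    hits, start = [], 0
    while (i := document.find(keyword, start)) != -1:
        hits.append(document[max(0, i - window): i + window])
        start = i + len(keyword)
    return hits

def answer(question: str, document: str) -> str:
    # 1. Ask the model what to search for; the giant document itself
    #    never enters its context window.
    keywords = llm(f"Question: {question}\nGive 3 search keywords, comma-separated.")
    # 2. Pull out only the matching excerpts, using ordinary code.
    excerpts = []
    for kw in keywords.split(","):
        excerpts += grep(document, kw.strip())
    # 3. Recursively answer over each small excerpt, then merge.
    partials = [llm(f"Excerpt:\n{e}\n\nQ: {question}") for e in excerpts[:5]]
    return llm("Merge these partial answers:\n" + "\n".join(partials))
```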
Robots such as self-driving cars and drones sense the world through many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper lays out a clear roadmap for teaching them to understand space by learning from all of these signals together.
This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.
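For context, the usual way diffusion models are guided today is classifier-free guidance, which needs a second, unconditioned pass through the model; the paper's point is to pull the guiding signal from inside the network instead. A minimal numpy sketch of that baseline, with a stub denoiser standing in for a real diffusion Transformer:

```python
import numpy as np

def denoiser(x: np.ndarray, t: int, cond: str | None) -> np.ndarray:
    """Stub noise predictor; a real diffusion Transformer goes here."""
    rng = np.random.default_rng(t if cond is None else t + 1)
    return rng.standard_normal(x.shape)

def guided_noise(x: np.ndarray, t: int, cond: str, scale: float = 5.0) -> np.ndarray:
    # Classifier-free guidance: run the model twice (with and without
    # the text condition) and push the prediction toward the condition.
    eps_uncond = denoiser(x, t, None)
    eps_cond = denoiser(x, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

x = np.zeros((8, 8, 3))
eps = guided_noise(x, t=10, cond="a red bicycle")
```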
DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-generation task.
This paper introduces RISE, a new way to find and control how AI models think without needing any human-made labels.
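The paper's exact recipe is not reproduced here, but the family of ideas is easy to sketch: find a direction in the model's hidden activations without any labels (below, via a simple PCA over cached activations) and add it back at inference time to steer behavior. All names and the strength value are illustrative assumptions.

```python
import numpy as np

def find_direction(activations: np.ndarray) -> np.ndarray:
    """Label-free direction: top principal component of hidden states
    cached from many prompts (shape: [n_samples, hidden_dim])."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit-length direction in activation space

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float = 4.0) -> np.ndarray:
    """Nudge a layer's hidden state along the direction at inference."""
    return hidden + strength * direction

# Toy usage: 100 cached hidden states of size 64.
acts = np.random.default_rng(0).standard_normal((100, 64))
direction = find_direction(acts)
steered = steer(acts[0], direction)
```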
This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
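"Browsing like people do" boils down to a loop: look at the page, pick an action, act, look again. A minimal sketch of that loop, where the `Browser` class and `llm()` stub are hypothetical stand-ins for a real browser driver and model:

```python
from dataclasses import dataclass, field

@dataclass
class Browser:
    """Hypothetical browser driver; a real one would render the DOM."""
    page: str = "search box: [    ]"
    history: list = field(default_factory=list)

    def act(self, action: str) -> str:
        self.history.append(action)      # e.g. "click result 2", "scroll down"
        return f"page after '{action}'"  # stub; real page text would go here

def llm(prompt: str) -> str:
    return "scroll down"  # stub; a real model chooses the next action

def browse(task: str, max_steps: int = 5) -> str:
    browser = Browser()
    observation = browser.page
    for _ in range(max_steps):
        action = llm(f"Task: {task}\nPage: {observation}\nNext action (or 'done')?")
        if action == "done":
            break
        observation = browser.act(action)  # interact, then re-read the page
    return observation
```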
LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.
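One standard way to collapse dozens of denoising steps into four is consistency-style distillation; the sketch below shows only the shape of that 4-step sampling loop, with a stub model. LiveTalk's actual distillation recipe may differ, and every name here is an assumption.

```python
import numpy as np

def distilled_denoiser(x: np.ndarray, t: float, audio: np.ndarray) -> np.ndarray:
    """Stub: a distilled video model would predict the clean frame here,
    conditioned on the audio features."""
    return x * (1.0 - t)

def sample_frame(audio: np.ndarray, shape=(64, 64, 3)) -> np.ndarray:
    rng = np.random.default_rng(0)
    x = rng.standard_normal(shape)            # start from pure noise
    levels = [1.0, 0.75, 0.5, 0.25]           # 4 steps instead of 25-50
    for i, t in enumerate(levels):
        x0 = distilled_denoiser(x, t, audio)  # jump straight to a clean frame
        t_next = levels[i + 1] if i + 1 < len(levels) else 0.0
        x = x0 + t_next * rng.standard_normal(shape)  # re-noise to next level
    return x

frame = sample_frame(audio=np.zeros(16))
```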
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).
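A record in such a dataset, with field names assumed for illustration, plus the natural evaluation metric for year guessing (absolute error in years); the Taj Mahal values are real, the page-view count is made up:

```python
from dataclasses import dataclass

@dataclass
class Building:
    """One hypothetical YearGuessr-style record (field names assumed)."""
    image_path: str
    description: str
    year: int        # construction year, in 1001-2024
    lat: float
    lon: float
    page_views: int  # popularity proxy

def year_error(predicted: int, actual: int) -> int:
    """How many years off a guess is."""
    return abs(predicted - actual)

sample = Building("taj_mahal.jpg", "white marble mausoleum",
                  year=1653, lat=27.175, lon=78.042, page_views=1_000_000)
print(year_error(1700, sample.year))  # -> 47
```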
C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.
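Whatever the model's internals, a code-embedding model is used the same way: turn the query and every snippet into vectors, then rank by cosine similarity. A self-contained sketch with a random stub in place of the real embedder:

```python
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Random stub embedder; a real code-embedding model goes here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def search(query: str, snippets: list[str], k: int = 2) -> list[str]:
    # Embed the query and every snippet once, then rank by cosine
    # similarity (a dot product, since the vectors are unit length).
    q = embed(query)
    scores = [float(q @ embed(s)) for s in snippets]
    order = np.argsort(scores)[::-1][:k]
    return [snippets[i] for i in order]

corpus = ["def quicksort(xs): ...", "def read_csv(path): ...", "class LRUCache: ..."]
print(search("sort a list quickly", corpus))
```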
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
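Unified grading means scoring one generated clip along several axes and combining them. The axes and the plain average below are illustrative assumptions, not T2AV-Compass's actual formula:

```python
from statistics import mean

def score_clip(text_video: float, text_audio: float, av_sync: float) -> float:
    """Combine per-axis scores in [0, 1]: does the video match the text,
    does the audio match the text, and do audio and video line up?"""
    return mean([text_video, text_audio, av_sync])

print(score_clip(text_video=0.8, text_audio=0.7, av_sync=0.9))  # ≈ 0.8
```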