This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.
LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.
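To make the "4-step" idea concrete, here is a minimal sketch of few-step sampling in the style of multi-step consistency sampling: instead of hundreds of small denoising steps, the model takes four large jumps from noise to a clean latent. This is not LiveTalk's actual distillation recipe; the toy denoiser, time grid, and latent shape below are assumptions for illustration only.

```python
import torch

def denoise(x_noisy: torch.Tensor, t: float) -> torch.Tensor:
    """Hypothetical stand-in for a distilled video denoiser.
    A real model would be a neural network predicting the clean latent."""
    return x_noisy * (1.0 - t)

def few_step_sample(shape=(1, 4, 64, 64), num_steps=4):
    """Generic few-step sampler: start from pure noise and take
    `num_steps` large denoising jumps instead of hundreds of small ones."""
    x = torch.randn(shape)                          # start from Gaussian noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)    # coarse noise-level grid
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = denoise(x, t_cur.item())          # model's guess of the clean latent
        # Re-noise the prediction down to the next (lower) noise level.
        x = x0_pred + t_next * torch.randn_like(x0_pred)
    return x

frame_latent = few_step_sample()
print(frame_latent.shape)  # torch.Size([1, 4, 64, 64])
```

With only four iterations of the loop, latency per frame drops enough that real-time, interactive video becomes plausible; that trade-off is the core of few-step distillation.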
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).
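To picture what one entry might contain, here is a hedged sketch of a record schema. The field names, types, and the sample values are guesses for illustration, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class BuildingRecord:
    """Hypothetical schema for one YearGuessr entry (field names are assumptions)."""
    image_path: str          # photo of the building
    description: str         # accompanying text
    construction_year: int   # label in the range 1001-2024
    latitude: float          # GPS coordinate
    longitude: float         # GPS coordinate
    page_views: int          # popularity signal

sample = BuildingRecord(
    image_path="images/000001.jpg",
    description="A medieval stone church on a town square.",
    construction_year=1248,
    latitude=50.94,
    longitude=6.96,
    page_views=120_000,
)
print(sample.construction_year)
```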
C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.
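The retrieval mechanics behind any code embedding model come down to nearest-neighbour search in vector space: embed the query, embed the candidate snippets, rank by similarity. The sketch below shows that loop; `embed_code` is a hypothetical stand-in, not C2LLM's real API.

```python
import numpy as np

def embed_code(snippet: str) -> np.ndarray:
    """Hypothetical stand-in for a code embedding model such as C2LLM.
    A real model would return a learned dense vector for the snippet."""
    rng = np.random.default_rng(abs(hash(snippet)) % (2**32))
    v = rng.standard_normal(768)
    return v / np.linalg.norm(v)           # unit-normalize for cosine similarity

def search(query: str, corpus: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank code snippets by cosine similarity to the query embedding."""
    q = embed_code(query)
    vectors = np.stack([embed_code(s) for s in corpus])   # (N, 768), unit-norm
    scores = vectors @ q                                   # cosine similarities
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), corpus[i]) for i in best]

corpus = [
    "def quicksort(xs): ...",
    "def binary_search(xs, target): ...",
    "class LRUCache: ...",
]
for score, snippet in search("fast sorting function", corpus):
    print(f"{score:.3f}  {snippet}")
```

Better embeddings change only the quality of `embed_code`; the search machinery around it stays the same.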
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
This paper introduces NExT-Vid, a way to train a video model by asking it to guess the next frame while parts of the earlier frames are hidden.
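The training objective can be sketched in a few lines: hide random portions of the past frames, then ask the model to reconstruct the next frame. The tiny model, per-pixel masking, and 50% mask ratio below are assumptions for illustration, not NExT-Vid's architecture.

```python
import torch
import torch.nn as nn

class TinyNextFramePredictor(nn.Module):
    """Toy stand-in for a video model: flattens the masked past frames
    and regresses the next frame (the real model is far larger)."""
    def __init__(self, frames=4, h=16, w=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(frames * h * w, 512), nn.ReLU(),
            nn.Linear(512, h * w),
        )
        self.h, self.w = h, w

    def forward(self, past):                       # past: (B, frames, H, W)
        return self.net(past).view(-1, self.h, self.w)

def masked_next_frame_loss(model, clip, mask_ratio=0.5):
    """clip: (B, frames+1, H, W). Hide parts of the history, predict the last frame."""
    past, target = clip[:, :-1], clip[:, -1]       # split history / next frame
    mask = (torch.rand_like(past) > mask_ratio).float()
    pred = model(past * mask)                      # model only sees the unmasked parts
    return nn.functional.mse_loss(pred, target)

model = TinyNextFramePredictor()
clip = torch.rand(2, 5, 16, 16)                    # 2 clips: 4 past frames + 1 target each
loss = masked_next_frame_loss(model, clip)
loss.backward()
print(float(loss))
```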
TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.
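The control-everything-but-the-tokenizer idea is easy to picture by splitting the same sentence with different tokenizers and comparing the results. The sketch below uses off-the-shelf Hugging Face tokenizers purely as stand-ins; the 14 tokenizers TokSuite actually compares are not listed in this summary.

```python
from transformers import AutoTokenizer

sentence = "Tokenization quietly shapes what a language model can learn."

# Off-the-shelf tokenizers as stand-ins for the TokSuite variants.
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sentence)
    print(f"{name:18s} {len(pieces):2d} tokens: {pieces[:8]} ...")
```

Because the downstream models are otherwise identical, any difference in their behavior can be traced back to differences like these in how the text was split.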
WorldWarp is a new method that turns a single photo plus a planned camera path into a long, steady, 3D-consistent video.
Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions, which the paper calls internal policies.
Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.