This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame while parts of the earlier frames are hidden.
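To make the idea concrete, here is a minimal sketch of masked next-frame prediction: hide a random subset of past frames and train the model to predict the frame that comes next. The tiny GRU model, the masking rate, and the MSE loss are assumptions for illustration, not NExT-Vid's actual architecture or objective.

```python
# Illustrative sketch only; model, masking scheme, and loss are placeholders.
import torch
import torch.nn as nn

class ToyNextFramePredictor(nn.Module):
    def __init__(self, frame_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, mask):
        # frames: (batch, time, frame_dim) flattened frame features
        # mask:   (batch, time) with 1 = visible past frame, 0 = hidden
        visible = frames * mask.unsqueeze(-1)      # zero out the hidden past frames
        hidden_states, _ = self.encoder(visible)
        return self.head(hidden_states[:, -1])     # predict the next frame

batch, time, frame_dim = 4, 8, 256
frames = torch.randn(batch, time + 1, frame_dim)   # last frame is the prediction target
mask = (torch.rand(batch, time) > 0.3).float()     # hide roughly 30% of the past

model = ToyNextFramePredictor(frame_dim)
pred = model(frames[:, :time], mask)
loss = nn.functional.mse_loss(pred, frames[:, time])
loss.backward()
print(f"reconstruction loss: {loss.item():.4f}")
```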
TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.
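The key design choice is the controlled experiment: everything is held fixed except the tokenizer, so differences in results can be attributed to tokenization. The sketch below illustrates that setup; the tokenizer names, model size, and other hyperparameters are placeholders, not TokSuite's actual 14-way configuration.

```python
# Sketch of a controlled tokenizer comparison: only the tokenizer varies.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    tokenizer: str
    model_size: str = "350M"       # illustrative values, not the paper's settings
    seq_len: int = 2048
    tokens_seen: int = 50_000_000_000
    lr: float = 3e-4
    seed: int = 0

base = TrainConfig(tokenizer="placeholder")
tokenizers = ["bpe-32k", "bpe-64k", "unigram-32k", "byte-level"]  # illustrative list

# One run per tokenizer; everything else stays constant, so any downstream
# difference can be traced back to how text was split into tokens.
runs = [replace(base, tokenizer=t) for t in tokenizers]
for cfg in runs:
    print(cfg)
```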
WorldWarp is a new method that turns a single photo plus a planned camera path into a long, steady, 3D-consistent video.
Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions, which this paper calls internal policies.
Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.
MemEvolve teaches AI agents not only to remember past experiences but also to improve the way they remember, like a student who upgrades their study habits over time.
This paper shows that great image understanding features alone are not enough for making great images; you also need strong pixel-level detail.
Robust-R1 teaches vision-language models to notice how a picture is damaged, think through what that damage hides, and then answer as if the picture were clear.
This paper introduces NEPA, a very simple way to teach vision models by having them predict the embedding of the next patch in a sequence of image patches, just like language models predict the next word.
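A minimal sketch of that idea follows: a causal Transformer reads patch embeddings in order and, at each position, tries to predict the embedding of the next patch. The layer sizes, the causal encoder, and the cosine loss are assumptions for illustration, not NEPA's exact recipe.

```python
# Illustrative next-patch embedding prediction; details are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyNextPatchPredictor(nn.Module):
    def __init__(self, patch_dim=768, num_patches=196):
        super().__init__()
        layer = nn.TransformerEncoderLayer(patch_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Causal mask: each position may only look at itself and earlier patches.
        self.register_buffer(
            "causal_mask",
            torch.triu(torch.full((num_patches, num_patches), float("-inf")), diagonal=1),
        )

    def forward(self, patch_embeds):
        # patch_embeds: (batch, num_patches, patch_dim), in raster-scan order
        t = patch_embeds.size(1)
        out = self.backbone(patch_embeds, mask=self.causal_mask[:t, :t])
        return out  # position i carries the prediction for patch i + 1

batch, num_patches, patch_dim = 2, 196, 768
patches = torch.randn(batch, num_patches, patch_dim)

model = ToyNextPatchPredictor(patch_dim, num_patches)
pred = model(patches[:, :-1])      # predict from all but the last patch
target = patches[:, 1:]            # the "next word" here is the next patch embedding
loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
loss.backward()
print(f"next-patch loss: {loss.item():.4f}")
```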
This paper introduces Log-linear Sparse Attention (LLSA), a new way for Diffusion Transformers to focus only on the most useful information using a smart, layered search.
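To illustrate what "focusing only on the most useful information" means mechanically, here is a plain top-k sparse attention sketch, where each query keeps only its highest-scoring keys. This is a generic top-k selection for intuition, not LLSA's actual log-linear, hierarchical search.

```python
# Generic top-k sparse attention, for intuition only (not LLSA's algorithm).
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=16):
    # q, k, v: (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, H, S, S)
    topk = scores.topk(keep, dim=-1).indices               # keys kept per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk, 0.0)                           # 0 where kept, -inf elsewhere
    attn = F.softmax(scores + mask, dim=-1)                # dropped keys get zero weight
    return attn @ v

q = k = v = torch.randn(1, 4, 128, 64)
out = topk_sparse_attention(q, k, v, keep=16)
print(out.shape)  # torch.Size([1, 4, 128, 64])
```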
This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.
This paper speeds up diffusion language models (dLLMs) by changing the order in which they fill in missing words.
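One common way to change that fill-in order is to unmask the positions the model is most confident about first. The sketch below shows that confidence-ordered decoding loop with a toy stand-in for the model; it illustrates the general "decoding order" knob rather than the paper's specific scheduling rule.

```python
# Confidence-ordered unmasking for a masked diffusion LM (generic illustration).
import torch

def decode_by_confidence(logits_fn, length, mask_id, steps=4):
    """Fill in a fully masked sequence, most confident positions first."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    per_step = max(1, -(-length // steps))             # ceil(length / steps)
    for _ in range(steps):
        still_masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if len(still_masked) == 0:
            break
        logits = logits_fn(tokens).clone()             # (length, vocab_size)
        logits[:, mask_id] = float("-inf")             # never predict the mask token
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Unmask the most confident still-masked positions this step.
        order = still_masked[conf[still_masked].argsort(descending=True)]
        chosen = order[:per_step]
        tokens[chosen] = pred[chosen]
    return tokens

# Toy stand-in for a real diffusion LM: random logits over a small vocabulary.
vocab_size, mask_id, length = 100, 0, 12
fake_model = lambda toks: torch.randn(len(toks), vocab_size)
print(decode_by_confidence(fake_model, length, mask_id))
```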