VideoSSM is a new way to make long, stable, and lively videos by giving the model two kinds of memory: a short-term window and a long-term state-space memory.
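To make the two-memory idea concrete, here is a tiny Python sketch: a short deque of recent frames plays the role of the short-term window, and a decaying recurrent vector plays the role of the long-term state-space memory. The sizes, decay rates, and mixing weights are made-up stand-ins, not VideoSSM's real architecture.

```python
# Toy sketch of the two-memory idea; a simplified diagonal linear SSM,
# not VideoSSM's actual design.
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
D, WINDOW = 16, 4                      # feature size, short-term window length

A = np.full(D, 0.95)                   # how slowly the long-term state forgets
B = rng.normal(0, 0.1, size=D)         # how much of each frame enters the state
W_mix = rng.normal(0, 0.1, size=(D, 2 * D))  # mixes the two memories

state = np.zeros(D)                    # long-term memory: one recurrent vector
window = deque(maxlen=WINDOW)          # short-term memory: the last few frames

def step(frame_feat):
    """Process one frame using both memories."""
    global state
    state = A * state + B * frame_feat       # long-term: cheap recurrent update
    window.append(frame_feat)
    short = np.mean(window, axis=0)          # short-term: summary of recent frames
    return W_mix @ np.concatenate([short, state])

for _ in range(8):                           # fake a stream of frame features
    out = step(rng.normal(size=D))
print(out.shape)                             # (16,): one feature vector per frame
```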
TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.
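Why "one step instead of 40–100" matters: sampling a flow model means integrating an ODE dx/dt = v(x, t), usually with many small steps. The toy velocity field below stands in for a learned network and says nothing about TwinFlow's actual training recipe; it only shows that a single naive step misses badly, which is why one-step models must be trained to predict the whole jump at once.

```python
def velocity(x, t):
    # Toy time-dependent velocity field; in practice this is a learned net.
    return -2.0 * t * x

def sample(x0, n_steps):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

x0 = 1.0
print(sample(x0, 100))  # ~0.37, close to the true endpoint exp(-1)
print(sample(x0, 1))    # 1.0: a single naive Euler step misses badly,
                        # so one-step models are trained to predict the
                        # whole jump directly instead.
```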
The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.
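The usual recipe behind this kind of retrieval is a shared embedding space: queries and document images both become vectors, and search is a nearest-neighbor lookup, so the query language stops mattering. The random vectors below stand in for a trained multilingual encoder; M3DR's actual model may differ.

```python
# Generic embedding-retrieval sketch; random vectors stand in for a
# trained multilingual encoder.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))                     # 5 document-image embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = docs[3] + 0.1 * rng.normal(size=8)         # a query landing near doc 3
query /= np.linalg.norm(query)

scores = docs @ query                              # cosine similarity to each doc
print("best match: doc", int(scores.argmax()))     # -> doc 3
```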
SPARK teaches AI to grade its own reasoning steps without needing reference answers written down anywhere.
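One common way to grade without reference answers is self-consistency: score each answer by how often independent samples agree with it. This is a generic technique sketched for intuition, not necessarily SPARK's actual mechanism.

```python
# Minimal label-free grading via agreement among samples (self-consistency);
# a generic technique, not confirmed to be SPARK's method.
from collections import Counter

def consistency_scores(solutions):
    """Grade each distinct answer by the fraction of samples that agree."""
    counts = Counter(solutions)
    total = len(solutions)
    return {ans: n / total for ans, n in counts.items()}

# Fake samples for one question: no ground-truth answer is needed anywhere.
samples = ["42", "42", "41", "42", "40"]
print(consistency_scores(samples))  # {'42': 0.6, '41': 0.2, '40': 0.2}
```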
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.
Fairy2i turns any pre-trained real-valued Transformer layer into an exactly equivalent complex form, so the model's outputs are unchanged before any quantization is applied.
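For intuition, one standard way to get an exact real-to-complex rewrite is the "widely linear" identity: any real linear map on paired features equals w = Pz + Q·conj(z) for a complex vector z. Whether Fairy2i uses this particular construction is an assumption; the sketch below only verifies that such a lossless rewrite exists.

```python
# Exact real-to-complex rewrite via the widely linear identity; a standard
# construction shown for illustration, not taken from the Fairy2i paper.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, B, C, D = (rng.normal(size=(n, n)) for _ in range(4))
W = np.block([[A, B], [C, D]])         # real weight acting on stacked [x1; x2]

# Complex coefficients recovered from the four real blocks.
P = (A + D) / 2 + 1j * (C - B) / 2
Q = (A - D) / 2 + 1j * (C + B) / 2

x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = W @ np.concatenate([x1, x2])       # original real-valued computation

z = x1 + 1j * x2                       # pack the feature pair into complex form
w = P @ z + Q @ np.conj(z)             # equivalent complex computation

assert np.allclose(y[:n], w.real) and np.allclose(y[n:], w.imag)
print("exact match: the complex form reproduces the real layer")
```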
Diffusion language models (dLLMs) can write all parts of an answer in parallel, but they usually take many tiny cleanup steps, which makes them slow.
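A toy decoding loop makes the cost visible: every position starts masked, and each cleanup step runs the model once to fill in only the few most confident positions. The `predict` function below is a random stand-in for a real dLLM forward pass; the point is only that latency scales with the number of steps.

```python
# Toy diffusion-style decoding loop; `predict` is a random stand-in for a
# real dLLM, so only the step-count behavior is meaningful here.
import random

MASK, LENGTH, PER_STEP = "<m>", 12, 3
random.seed(0)

def predict(tokens):
    """Stand-in model: propose (position, word, confidence) for each mask."""
    return [(i, f"w{i}", random.random())
            for i, t in enumerate(tokens) if t == MASK]

tokens, steps = [MASK] * LENGTH, 0
while MASK in tokens:
    proposals = sorted(predict(tokens), key=lambda p: -p[2])
    for i, word, _ in proposals[:PER_STEP]:   # keep only the most confident
        tokens[i] = word
    steps += 1                                # one full model call per step

print(steps, "model calls to fill", LENGTH, "tokens:", " ".join(tokens))
# 4 calls here; fewer tokens finalized per step means more calls,
# which is why the step count matters for speed.
```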
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.
This paper teaches image models to keep things consistent across multiple pictures—like the same character, art style, and story logic—using reinforcement learning (RL).
This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.
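A bandit-sized sketch of that two-stage recipe, assuming a toy softmax "policy" over three candidate answers: stage 1 imitates only the good example, and stage 2 samples answers so that wrong ones carry negative reward. The real paper trains an LLM; the losses and scale here are made up.

```python
# Toy two-stage training: imitate good examples first, then plain REINFORCE
# so mistakes also teach; an illustration, not the paper's exact losses.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                     # toy policy over 3 candidate answers
CORRECT = 1

def probs():
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Stage 1: copy only good examples (push up the known-correct answer).
for _ in range(20):
    logits[CORRECT] += 0.5 * (1 - probs()[CORRECT])

# Stage 2: sample answers and learn from mistakes too (negative reward).
for _ in range(200):
    a = rng.choice(3, p=probs())
    reward = 1.0 if a == CORRECT else -1.0   # mistakes now carry a signal
    grad = -probs()
    grad[a] += 1.0                           # REINFORCE: gradient of log p(a)
    logits += 0.1 * reward * grad

print(probs().round(3))                      # probability mass settles on answer 1
```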
This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.
Clinical conversations are special because they mix caring feelings with precise medical facts, and earlier AI systems struggled to do both at once.