AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
This paper teaches AI to turn simple dialogue into full movie scenes by first writing a detailed script and then filming it step by step.
Fast KVzip is a new way to shrink an LLM’s memory (the KV cache) while keeping answers just as accurate.
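For intuition, here is a minimal sketch of the general idea behind KV-cache compression: score each cached (key, value) entry's importance and evict the least important ones. The scoring scheme and function names are illustrative assumptions, not KVzip's actual algorithm.

```python
# Toy KV-cache eviction (illustrative only; not KVzip's method).
# Assumes each cached entry already has an importance score,
# e.g. how much attention it has received so far.
import numpy as np

def compress_kv_cache(keys, values, importance, keep_ratio=0.5):
    """Keep only the top `keep_ratio` fraction of cached (key, value) pairs."""
    n_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[-n_keep:]  # indices of most important entries
    keep.sort()                              # preserve original token order
    return keys[keep], values[keep]

# toy cache: 8 cached tokens, head dimension 4
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
scores = rng.random(8)                       # stand-in importance scores
K2, V2 = compress_kv_cache(K, V, scores, keep_ratio=0.5)
print(K2.shape)                              # (4, 4) -- cache halved
```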
AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.
This paper builds one smart system that listens to child–adult conversations and writes what was said, who said it, and exactly when each person spoke.
Transformers slow down on very long inputs because standard attention compares every token pair, so the cost grows quadratically with input length.
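A toy illustration of why that is expensive: the attention score matrix holds one entry per token pair, so doubling the input quadruples the work.

```python
# Standard attention builds an n-by-n score matrix: quadratic in input length.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n): every token pair
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                    # softmax over keys
    return w @ V

rng = np.random.default_rng(0)
n, d = 6, 4
Q = K = V = rng.normal(size=(n, d))
print(attention(Q, K, V).shape)                      # (6, 4)

for n in (1_000, 4_000, 16_000):
    print(f"n={n:>6}: score matrix holds {n * n:>13,} entries")
# 16x more tokens -> 256x more score entries
```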
SkyReels-V3 is a single AI model that can make videos in three ways: generating from reference images, extending an existing video, and creating talking avatars from audio.
Most people on Earth speak more than one language and often switch languages within the same chat, but AI tools are rarely tested on this everyday behavior, known as code-switching.
C-RADIOv4 is a single vision model that learns from several expert models at once and keeps their best skills while staying fast.
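The general recipe behind such models is multi-teacher distillation: one student network mimics several frozen experts through per-teacher projection heads. The sketch below shows that idea under simple assumptions (MSE matching, linear heads); it is not C-RADIOv4's exact objective.

```python
# Hedged sketch of multi-teacher feature distillation (illustrative, not
# C-RADIOv4's actual recipe): project student features to each teacher's
# dimension and penalize the mismatch.
import numpy as np

def distill_loss(student_feats, teacher_feats_list, heads):
    """Average MSE between projected student features and each teacher."""
    loss = 0.0
    for W, t in zip(heads, teacher_feats_list):
        projected = student_feats @ W          # adapt student dim to teacher dim
        loss += ((projected - t) ** 2).mean()  # simple per-teacher MSE
    return loss / len(teacher_feats_list)

rng = np.random.default_rng(0)
s = rng.normal(size=(16, 256))                            # student features
teachers = [rng.normal(size=(16, d)) for d in (512, 768)] # two frozen experts
heads = [rng.normal(size=(256, d)) * 0.01 for d in (512, 768)]
print(distill_loss(s, teachers, heads))
```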
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
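For context, FSQ (Finite Scalar Quantization) quantizes each latent channel by squashing it with a bounded activation (typically tanh) and rounding to a small fixed grid of levels. The sketch below shows that baseline scheme; the paper's one-line fix to the activation is not reproduced here.

```python
# Baseline FSQ quantizer (the standard scheme the paper modifies; the
# actual one-line fix is not shown). Each channel is bounded by tanh,
# then snapped to one of `levels` integer codes.
import numpy as np

def fsq(z, levels=7):
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half   # squash to (-half, half)
    return np.round(bounded)      # snap to the finite grid

z = np.linspace(-3, 3, 7)
print(fsq(z))                     # [-3. -3. -2.  0.  2.  3.  3.] for 7 levels
```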
VisGym is a playground of 17 diverse visual tasks for testing and training Vision–Language Models (AI that sees and talks) to act over many steps.
Mixture-of-Experts (MoE) models often send far more tokens to a few “favorite” experts, which overloads some GPUs while others sit idle.
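A toy demo of that imbalance, paired with the common Switch-Transformer-style load-balancing loss (an assumption here, not necessarily this paper's remedy): the loss equals 1 when tokens spread evenly and rises as routing skews.

```python
# Top-1 MoE routing with a biased router: a few experts get most tokens.
# The auxiliary loss is the standard Switch-Transformer-style formulation.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts = 1024, 8
logits = rng.normal(size=(n_tokens, n_experts)) + np.array([2, 1, 0, 0, 0, 0, 0, 0])
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
choice = probs.argmax(-1)                                # top-1 expert per token

frac_tokens = np.bincount(choice, minlength=n_experts) / n_tokens
frac_probs = probs.mean(0)                               # mean router probability
print("tokens per expert:", np.round(frac_tokens, 2))    # heavily skewed
aux_loss = n_experts * (frac_tokens * frac_probs).sum()  # = 1 when balanced
print("load-balancing loss:", round(float(aux_loss), 2)) # > 1 here
```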