Pixels are the raw stuff of images, and this paper shows you can learn great vision skills by predicting pixels directly, not by comparing fancy hidden features.
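To make "predicting pixels directly" concrete, here is a minimal, generic sketch of a pixel-regression objective: the network is trained to reconstruct raw pixel values of hidden patches rather than to match features from another encoder. The toy backbone, patch size, and masking ratio are illustrative assumptions, not the paper's actual recipe.

```python
# Generic pixel-prediction objective: the supervision target is the image
# itself, not a learned feature vector. Shapes and the toy backbone are
# illustrative assumptions.
import torch
import torch.nn as nn

patch = 16                       # patch side length (assumption)
dim = 256                        # hidden width (assumption)

def patchify(imgs: torch.Tensor) -> torch.Tensor:
    """(B, 3, H, W) -> (B, num_patches, patch*patch*3) non-overlapping patches."""
    B, C, H, W = imgs.shape
    p = patch
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

# Stand-in for any vision backbone; only the training target matters here.
encoder = nn.Sequential(nn.Linear(patch * patch * 3, dim), nn.GELU(), nn.Linear(dim, dim))
pixel_head = nn.Linear(dim, patch * patch * 3)   # regresses raw pixels per patch

imgs = torch.rand(8, 3, 224, 224)                # toy batch of images
targets = patchify(imgs)                         # ground truth = the pixels themselves
mask = (torch.rand(targets.shape[:2]).unsqueeze(-1) < 0.75).float()  # hide 75% of patches
preds = pixel_head(encoder(targets * (1 - mask)))                    # predict from the corrupted view
loss = (((preds - targets) ** 2) * mask).mean()                      # per-pixel MSE on hidden patches
loss.backward()
```

The only point of the sketch is the loss: the model is graded on raw pixels it must reconstruct, not on agreement with someone else's hidden features.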
This paper shows a simple way to turn any strong autoregressive (step-by-step) model into a diffusion vision-language model (parallel, block-by-block) without changing the architecture.
This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.
Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) in videos to tell whether they are real or AI-generated, and it explains its decision by pointing to when and where in the video the artifacts appear.
This paper teaches large language models (LLMs) to explore smarter by listening to their own gradients—the directions they would update—rather than chasing random variety.
Long texts are expensive for AI to read: attention compute grows roughly with the square of the text length, and the memory needed to remember earlier tokens grows with every extra one.
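As a rough illustration of those costs, here is a back-of-the-envelope sketch; the model dimensions are assumptions in the ballpark of a 7B-parameter transformer, not numbers from the paper.

```python
# Why long inputs get expensive: KV-cache memory grows linearly per token,
# and attention work grows roughly quadratically. Dimensions are assumptions.
layers = 32          # transformer layers (assumption)
heads = 32           # attention heads (assumption)
head_dim = 128       # per-head dimension (assumption)
bytes_per_value = 2  # fp16/bf16 storage

def kv_cache_bytes(tokens: int) -> int:
    """Memory for cached keys and values: a fixed cost for every token kept."""
    return 2 * layers * heads * head_dim * bytes_per_value * tokens  # K and V

def attention_pairs(tokens: int) -> int:
    """Each new token attends to all previous ones, so total work is ~n^2/2."""
    return tokens * (tokens + 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: KV cache ~ {kv_cache_bytes(n) / 1e9:.1f} GB, "
          f"attention pairs ~ {attention_pairs(n):.2e}")
```

Under these assumptions a 100,000-token context alone holds about 50 GB of cached keys and values, which is why long-context efficiency is an active research target.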
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.
The paper turns one flat picture into a neat stack of see-through layers, so you can edit one thing without messing up the rest.
This paper is about making text-to-image and text-to-video generators turn the words you type into the right pictures and videos more reliably.
Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions, written in three thinking styles and generated both with and without Python help.
This paper builds Step-GUI, a pair of small but strong GUI agent models (4B and 8B parameters) that can operate phones and computers by looking at screenshots and following instructions.
SCOPE lets AI agents rewrite their own instructions while they are working, so they can fix mistakes and get smarter on the next step, not just the next task.
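To show the flavor of step-level self-revision (fixing instructions between steps rather than only between tasks), here is a generic sketch; the prompts, the `call_llm` stand-in, and the loop structure are assumptions, not SCOPE's actual procedure.

```python
# Generic sketch of an agent that rewrites its own working instructions
# between steps. `call_llm` and the prompt texts are placeholders, not
# SCOPE's real interface.
from typing import Callable, List

def run_with_self_revision(task: str,
                           call_llm: Callable[[str], str],
                           max_steps: int = 5) -> List[str]:
    instructions = "Solve the task step by step. Be concise."
    transcript: List[str] = []
    for step in range(max_steps):
        # 1) Act under the current self-written instructions.
        action = call_llm(f"Instructions:\n{instructions}\n\nTask: {task}\n"
                          f"History so far:\n{chr(10).join(transcript)}\n"
                          f"Next step:")
        transcript.append(f"[step {step}] {action}")
        # 2) Reflect and rewrite the instructions before the *next* step,
        #    so a mistake can be corrected mid-task instead of after it.
        instructions = call_llm(f"You just took this step:\n{action}\n"
                                f"Current instructions:\n{instructions}\n"
                                f"Rewrite the instructions to avoid any mistake you noticed:")
    return transcript

# Example with a dummy model so the sketch runs standalone:
steps = run_with_self_revision("Add 17 and 25.", call_llm=lambda prompt: "(model reply)", max_steps=2)
```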