Text-to-image models can make pretty pictures but still miss details in complex prompts, like counts, positions, or exact text.
DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.
Short videos are easy for AI to make sharp and lively, but long videos need stories and memory, and there isn’t much training data for that.
CUDA Agent is a training system that teaches an AI to write super-fast GPU code (CUDA kernels) by practicing, testing, and getting rewards for correct and speedy results.
The paper asks a simple question: what must a vision model’s internal representations (embeddings) look like if it can recognize new mixes of things it already knows?
This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.
SenCache speeds up video diffusion models by reusing cached outputs from earlier steps only when the model’s output is predicted to change very little.
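The general idea of gating a cache on predicted change can be sketched in a few lines. This is only an illustration of the caching pattern, not SenCache's actual method: `run_with_cache`, `predict_change`, the toy model, and the threshold are all hypothetical stand-ins, and the real system would use its own predictor on a diffusion network.

```python
from typing import Callable, List, Optional, Tuple


def run_with_cache(
    model: Callable[[float], float],
    predict_change: Callable[[float, float], float],
    inputs: List[float],
    threshold: float,
) -> Tuple[List[float], int]:
    """Run `model` over `inputs`, reusing the cached output whenever the
    predicted change since the last real model call is below `threshold`.

    Hypothetical sketch: `predict_change` stands in for whatever cheap
    estimate a real system would use.
    """
    outputs: List[float] = []
    cached: Optional[float] = None      # output of the last real model call
    cached_input: Optional[float] = None  # input used for that call
    model_calls = 0

    for x in inputs:
        if cached is not None and predict_change(cached_input, x) < threshold:
            # Predicted change is small: skip the expensive call.
            outputs.append(cached)
        else:
            # Predicted change is large (or nothing cached yet): recompute.
            cached = model(x)
            cached_input = x
            model_calls += 1
            outputs.append(cached)
    return outputs, model_calls


# Toy stand-ins (assumptions for illustration only).
toy_model = lambda x: x * x
toy_predictor = lambda old, new: abs(new - old)

outputs, calls = run_with_cache(
    toy_model, toy_predictor, [1.0, 1.01, 1.02, 2.0, 2.005], threshold=0.05
)
```

The predicted change is measured against the input of the last real model call rather than the previous step, so many small changes in a row cannot drift past the threshold unnoticed.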
The paper shows that when we train with the popular InfoNCE contrastive loss, the learned features start to behave like they come from a Gaussian (bell-shaped) distribution.
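For readers who have not seen the loss written out, here is a minimal sketch of InfoNCE in its standard form: the negative log of the softmax probability that the query picks its positive key out of all keys, using temperature-scaled cosine similarity. The function names, the toy vectors, and the temperature of 0.1 are assumptions for illustration, and the paper's exact setup may differ.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    positive_index: int,
    temperature: float = 0.1,
) -> float:
    """Standard-form InfoNCE loss for one query.

    loss = -log( exp(s_pos / t) / sum_j exp(s_j / t) )
    where s_j is the cosine similarity between the query and key j.
    """
    logits = [cosine(query, k) / temperature for k in keys]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Toy example: the query matches key 0 and is orthogonal to key 1.
toy_query = [1.0, 0.0]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_correct = info_nce(toy_query, toy_keys, positive_index=0)
loss_wrong = info_nce(toy_query, toy_keys, positive_index=1)
```

The loss is small when the positive key is the most similar one and large otherwise, which is what pulls matching pairs together and pushes the rest apart during training.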
Masked Image Generation Models (MIGMs) make pictures by filling in many blank spots step by step, but each step is slow and repeats a lot of work.
Similarity-based image–text models like CLIP can be fooled by “half-truths,” where adding one plausible but wrong detail makes a caption look more similar to an image instead of less similar.
This paper builds a new test called Ref-Adv to check if AI can truly match tricky sentences to the right object in a picture.
Speculative decoding speeds up big language models by letting a small helper model guess several next words and having the big model check them all at once.
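The guess-and-check loop can be sketched with toy models. This is a simplified greedy variant under stated assumptions: both "models" are hypothetical functions that map a token sequence to one next token, and a drafted token is accepted only if it exactly matches the big model's own choice. Real systems typically verify all drafted tokens in a single batched forward pass and often use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the big model generates one token at a time."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding sketch: draft k tokens, keep the verified prefix."""
    seq = list(prompt)
    goal = len(prompt) + max_new
    while len(seq) < goal:
        # 1) The small helper model guesses k tokens in a row.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) The big model checks each guess (one batched pass in practice).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the big model's token and stop.
                accepted.append(expected)
                break
        else:
            # Every guess was right: the check also yields one bonus token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]


# Toy stand-ins (assumptions): count upward modulo 10; the draft errs after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


fast = speculative_decode(toy_target, toy_draft, [0], max_new=12, k=4)
slow = greedy_decode(toy_target, [0], max_new=12)
```

In this greedy variant the output is identical to what the big model would have produced alone; the speedup comes from accepting several tokens per verification round whenever the helper guesses well.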