Scone is a new AI method that generates images from instructions while correctly picking out the intended subject even when many similar-looking candidates are present.
Standard attention is slow for long texts because it compares every word with every other word, which takes quadratic time.
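To make that quadratic cost concrete, here is a minimal NumPy sketch of standard scaled dot-product attention; the function and variable names are illustrative. The score matrix has one entry per pair of tokens, so doubling the text length quadruples the work.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention over n tokens.

    Q, K, V: (n, d) arrays, one row per token. The scores matrix
    is (n, n): every token is compared with every other token,
    which is why time and memory grow quadratically with n.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) output

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # the scores matrix alone holds n * n = 1,048,576 entries
```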
AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the melody, the beat, and the lyrics.
VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.
This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.
V-RGBX is a new video editing system that lets you change the true building blocks of a scene—like base color, surface bumps, material, and lighting—rather than just painting over pixels.
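As a toy illustration of why editing those layers differs from painting pixels, the sketch below assumes a simple Lambertian model where a frame is roughly albedo times shading; the layer names and the recomposition are illustrative, not V-RGBX's actual pipeline.

```python
import numpy as np

# Toy intrinsic decomposition of one frame (Lambertian assumption):
# pixels ~= albedo * shading. Real systems like V-RGBX use richer layers
# (normals, materials, lighting); this only illustrates the idea.
h, w = 4, 4
albedo  = np.full((h, w, 3), 0.8)                         # base-color layer
shading = np.linspace(0.2, 1.0, h * w).reshape(h, w, 1)   # lighting layer

frame = albedo * shading                  # recompose layers into pixels

# Edit the *base color* layer (paint the object red), then re-render:
albedo_edit = albedo.copy()
albedo_edit[..., 1:] = 0.1                # drop green/blue channels
frame_edit = albedo_edit * shading        # lighting stays physically consistent
```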
The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
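The paper's exact recipe isn't reproduced here, but one common way to borrow such knowledge is a distillation loss that pulls the generator's predicted point trajectories toward those of a frozen pretrained tracker. The sketch below is only that generic loss, with hypothetical `student_tracks` and `teacher_tracks` tensors.

```python
import torch
import torch.nn.functional as F

def motion_distill_loss(student_tracks: torch.Tensor,
                        teacher_tracks: torch.Tensor) -> torch.Tensor:
    """Generic motion-distillation loss (illustrative, not the paper's).

    Both tensors hold point trajectories of shape
    (batch, num_points, num_frames, 2) in pixel coordinates.
    The teacher comes from a frozen pretrained tracker, so it is
    detached and gradients only flow into the student/generator side.
    """
    return F.l1_loss(student_tracks, teacher_tracks.detach())

# Hypothetical training-step wiring:
# frames = generator(noise, prompt)              # generated video
# teacher_tracks = frozen_tracker(frames)        # strong video tracker
# student_tracks = motion_head(generator_feats)  # lightweight student head
# loss = motion_distill_loss(student_tracks, teacher_tracks)
```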
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
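A minimal sketch of what such a training step could look like, assuming a frozen `feature_extractor` standing in for a DINOv3-style encoder and a placeholder text-conditioned `denoiser`; the rectified-flow-style objective here is a common choice, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def train_step(denoiser, feature_extractor, images, text_emb, opt):
    """One diffusion training step directly in feature space.

    feature_extractor: frozen DINOv3-style encoder (placeholder).
    denoiser: text-conditioned diffusion network (placeholder).
    No VAE anywhere: x0 is a (B, tokens, dim) feature map, not a latent.
    """
    with torch.no_grad():
        x0 = feature_extractor(images)                    # clean features
    t = torch.rand(x0.shape[0], 1, 1, device=x0.device)   # noise level in [0, 1)
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise                         # rectified-flow mix
    pred = denoiser(xt, t.flatten(), text_emb)
    loss = F.mse_loss(pred, noise - x0)                   # velocity target
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```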
DentalGPT is a specialized AI that looks at dental images and text together and explains what it sees the way a junior dentist would.
The paper asks how to best use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.
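For context, the most common baseline for "using" expert trajectories is plain supervised fine-tuning on their tokens, with the loss masked to the solution steps. The sketch below shows only that baseline (assuming a Hugging-Face-style model whose output has a `.logits` field), not the paper's findings.

```python
import torch
import torch.nn.functional as F

def sft_loss_on_trajectory(model, prompt_ids, trajectory_ids):
    """Supervised fine-tuning on an expert trajectory (the baseline).

    Next-token loss is applied only to the expert's step-by-step
    solution; prompt positions are masked out with ignore_index.
    """
    input_ids = torch.cat([prompt_ids, trajectory_ids], dim=1)
    logits = model(input_ids).logits[:, :-1]       # predict each next token
    labels = input_ids[:, 1:].clone()
    labels[:, : prompt_ids.shape[1] - 1] = -100    # no loss on the prompt
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```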
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
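Purely as an illustration of per-patch conditioning, not MetaCanvas's actual design, the sketch below shows a planner writing one embedding per patch into a grid that is then added to the diffusion model's patch tokens.

```python
import torch
import torch.nn as nn

class LatentCanvas(nn.Module):
    """Illustrative 'hidden canvas': one plan embedding per image patch.

    The planner emits plan_tokens with one vector per patch; a projection
    maps them into the diffusion model's token space, where they are
    added so each patch is steered by its own slot of the plan.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens, plan_tokens):
        # patch_tokens, plan_tokens: (B, grid*grid, dim), aligned per patch
        return patch_tokens + self.proj(plan_tokens)

canvas = LatentCanvas()
x = torch.randn(1, 256, 768)     # 16x16 grid of diffusion patch tokens
plan = torch.randn(1, 256, 768)  # plan from the MLLM, one vector per patch
x = canvas(x, plan)              # the plan now steers each patch
```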
Vision-Language-Action (VLA) models are robots’ “see–think–do” brains that connect cameras (vision), words (language), and motors (action).
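A toy sketch of that see-think-do wiring, with stand-in encoders and a made-up 7-dimensional arm command; no real VLA model is this small, but the data flow is the same: vision and language tokens are fused by a transformer that outputs an action.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy vision-language-action policy (illustrative only)."""
    def __init__(self, dim=256, n_actions=7):      # e.g. a 7-DoF arm command
        super().__init__()
        self.see   = nn.Linear(3 * 32 * 32, dim)   # stand-in vision encoder
        self.read  = nn.Embedding(1000, dim)       # stand-in text encoder
        self.think = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.act   = nn.Linear(dim, n_actions)     # motor-command head

    def forward(self, image, instruction_ids):
        vis = self.see(image.flatten(1)).unsqueeze(1)  # (B, 1, dim)
        txt = self.read(instruction_ids)               # (B, T, dim)
        h = self.think(torch.cat([vis, txt], dim=1))   # fuse seeing + reading
        return self.act(h.mean(dim=1))                 # (B, n_actions)

policy = TinyVLA()
action = policy(torch.randn(1, 3, 32, 32), torch.randint(0, 1000, (1, 8)))
```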