This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
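A minimal sketch of the idea (not the paper's code): the denoiser works on frozen foundation-model patch features instead of VAE latents. `vision_encoder` stands in for a DINOv3-style backbone, `denoiser` for any text-conditioned transformer, and the flow-matching-style objective is just one illustrative choice.

```python
# Sketch, assuming a frozen feature backbone and a generic text-conditioned denoiser.
import torch

def training_step(denoiser, vision_encoder, images, text_emb):
    with torch.no_grad():
        feats = vision_encoder(images)            # (B, N_tokens, D) patch features, no VAE
    noise = torch.randn_like(feats)
    t = torch.rand(feats.shape[0], device=feats.device)   # random timestep per sample
    t_ = t.view(-1, 1, 1)
    noisy = (1 - t_) * feats + t_ * noise         # linear (flow-matching style) interpolation
    pred = denoiser(noisy, t, text_emb)           # predict the velocity: noise - feats
    loss = torch.nn.functional.mse_loss(pred, noise - feats)
    return loss
```

At sampling time you would still need some lightweight decoder to map the generated features back to pixels; the point of the sketch is only that training never touches a VAE.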
Omni-Attribute is a new image encoder that captures only the attributes you ask for (like hairstyle or lighting) and ignores everything else in the picture.
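One plausible way to picture "encode only what you ask for" (purely my illustration, not the paper's architecture): a text query for the requested attribute cross-attends over the image's patch tokens, so only that attribute ends up in the pooled embedding. All names here are made up.

```python
# Illustrative sketch of attribute-selective pooling; not Omni-Attribute's actual design.
import torch
import torch.nn as nn

class AttributeEncoder(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens, attribute_query):
        # patch_tokens: (B, N, D) from a vision backbone
        # attribute_query: (B, 1, D) embedding of the requested attribute text, e.g. "hairstyle"
        pooled, _ = self.attn(attribute_query, patch_tokens, patch_tokens)
        return self.proj(pooled.squeeze(1))       # (B, D) embedding of that attribute only
```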
UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.
OmniPSD is a new AI that can both make layered Photoshop (PSD) files from words and take apart a flat image into clean, editable layers.
TreeGRPO teaches image generators with reinforcement learning by branching each rollout into a tree, so one generation produces many useful learning signals instead of just one.
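A toy sketch of the branching idea (not the paper's algorithm): instead of one chain of denoising steps per sample, the rollout branches at a few chosen steps, so a single tree yields many scored leaves. Here the advantage is simply group-relative over all leaves, GRPO-style; the real method may assign credit along the tree differently. `denoise_step` and `reward_fn` are placeholders.

```python
# Hedged sketch of tree-structured rollouts for RL on a diffusion-style sampler.
import torch

def tree_rollout(denoise_step, reward_fn, x_T, branch_steps={10, 20}, k=2, n_steps=30):
    nodes = [(x_T, [])]                               # (current state, log-prob history)
    for t in reversed(range(n_steps)):
        branches = k if t in branch_steps else 1      # branch only at selected steps
        next_nodes = []
        for x, hist in nodes:
            for _ in range(branches):
                x_next, logprob = denoise_step(x, t)  # stochastic denoising step
                next_nodes.append((x_next, hist + [logprob]))
        nodes = next_nodes
    rewards = torch.tensor([reward_fn(x) for x, _ in nodes])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return list(zip(nodes, advantages))               # many learning signals from one tree
```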
Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.
This paper teaches a computer to turn a single picture into a moving 3D scene that stays consistent from every camera angle.
EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.
TwinFlow is a new way to make big image models draw great pictures in a single step instead of the usual 40–100.
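A sketch of the payoff, not of TwinFlow itself: the usual iterative sampler versus a distilled one-step generator. `model`, `scheduler_step`, and `distilled_model` are placeholders.

```python
# Contrast sketch: many sampling steps vs. one forward pass.
import torch

@torch.no_grad()
def sample_many_steps(model, scheduler_step, text_emb, shape, n_steps=50):
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):        # 40-100 forward passes through the model
        x = scheduler_step(model(x, t, text_emb), x, t)
    return x

@torch.no_grad()
def sample_one_step(distilled_model, text_emb, shape):
    noise = torch.randn(shape)
    return distilled_model(noise, text_emb)   # single forward pass
```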
This paper teaches image models to keep things consistent across multiple pictures—like the same character, art style, and story logic—using reinforcement learning (RL).
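A hedged sketch of what such an RL signal could look like: score how similar the recurring character looks across the images in one generated set, then turn a group of such rewards into GRPO-style advantages. The identity encoder `embed_subject` and the reward design are illustrative assumptions, not the paper's exact recipe.

```python
# Example consistency reward plus group-relative advantages for an RL update.
import torch
import torch.nn.functional as F

def consistency_reward(images, embed_subject):
    embs = F.normalize(embed_subject(images), dim=-1)   # (N_images, D), one per picture
    sims = embs @ embs.T                                 # pairwise cosine similarity
    n = sims.shape[0]
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean()                               # high when the character stays consistent

def group_advantages(rewards):
    r = torch.stack(rewards)                             # one reward per generated image set
    return (r - r.mean()) / (r.std() + 1e-6)             # GRPO-style group-relative advantage
```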