SeeThrough3D teaches image generators to understand what should be visible and what should be hidden when objects overlap, just like in real life.
BBQ is a text-to-image model that lets you place objects exactly where you want using numeric bounding boxes and color them with exact RGB values.
This paper tackles a big problem in fast image generators: they became quick, but they lost variety and kept producing similar-looking pictures.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.
Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
This paper is about getting the words you type into a generator to turn into the right pictures and videos more reliably.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.