This paper tackles a big problem in fast image generators: making them quick cost them variety, so they keep producing similar-looking pictures.
AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.
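A rough sketch of that core idea: one decoder-only transformer doing next-token prediction over a shared vocabulary that concatenates text, image, and speech token ranges. Everything here (sizes, names like `UnifiedAR`) is illustrative, not AR-Omni's actual architecture.

```python
# Hypothetical sketch: one transformer over a flat token space covering
# text, image, and speech. Offsets keep the vocabularies disjoint, so
# the same softmax can emit tokens of any modality.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, SPEECH_VOCAB = 32000, 8192, 4096
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + SPEECH_VOCAB  # one shared vocabulary

class UnifiedAR(nn.Module):
    def __init__(self, dim=512, layers=6, heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)  # next token, any modality

    def forward(self, tokens):  # tokens: (batch, seq)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)

# Shift image/speech token ids into their reserved ranges.
def image_token(i):  return TEXT_VOCAB + i
def speech_token(i): return TEXT_VOCAB + IMAGE_VOCAB + i
```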
This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
Unified Thinker separates “thinking” (planning) from “drawing” (image generation), so complex instructions are broken into clear, doable steps before any pixels are painted.
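In spirit, the pipeline looks something like the hypothetical sketch below: a planner turns the instruction into explicit steps, and the generator only ever sees the finished plan. `call_planner` and `generate_image` are placeholders, not Unified Thinker's real components.

```python
# Hypothetical plan-then-render pipeline. The planner ("thinking")
# decomposes the instruction; the renderer ("drawing") only sees the
# finished plan. Both functions are stand-ins.
def call_planner(instruction: str) -> list[str]:
    # In practice: an LLM prompted to emit subject / layout / style steps.
    return [f"step: {part.strip()}" for part in instruction.split(",")]

def generate_image(prompt: str) -> str:
    # Stand-in: plug in any text-to-image backend here.
    return f"<image rendered from: {prompt}>"

def think_then_draw(instruction: str) -> str:
    plan = call_planner(instruction)      # planning: no pixels yet
    return generate_image(" ".join(plan)) # drawing sees clear steps

print(think_then_draw("a red fox, reading a map, at dawn, watercolor"))
```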
Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
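Alchemist's actual selection rule isn't spelled out in this one-liner, so here is a generic data-valuation sketch in its place: score each candidate by how well its training gradient aligns with the gradient on a trusted validation batch, then keep the top scorers. The tiny linear model is a stand-in for a real text-to-image model.

```python
# Hypothetical data-value sketch (not Alchemist's actual algorithm):
# positive gradient alignment with a validation batch suggests the
# example pushes the model in a helpful direction.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)            # stand-in for a T2I model
loss_fn = torch.nn.functional.mse_loss

def grad_vector(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

val_x, val_y = torch.randn(32, 16), torch.randn(32, 1)
val_grad = grad_vector(val_x, val_y)      # direction that helps validation

candidates = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(100)]
scores = [torch.dot(grad_vector(x, y), val_grad).item() for x, y in candidates]
keep = sorted(range(len(candidates)), key=lambda i: -scores[i])[:20]
```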
This paper is about making the words you type into a generator reliably turn into the right pictures and videos.
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
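The cost is easy to see in a toy sampling loop: every denoising step is one full forward pass of the network, so the step count multiplies the latency. `denoise` below is a trivial stand-in for the real noise predictor.

```python
# Why many-step sampling is slow, in miniature: the sampler calls the
# denoising network once per step, so 1000 steps = 1000 forward passes.
import torch

def denoise(x, t):                     # stand-in for a large U-Net / DiT
    return x                           # toy: pretend the residual noise is x

x = torch.randn(1, 3, 64, 64)          # start from pure noise
steps = 1000
for i in reversed(range(steps)):       # one network call per step
    t = torch.full((1,), i / steps)    # timestep (unused by the toy)
    x = x - denoise(x, t) / steps      # tiny denoising update
# after `steps` forward passes, x is the final image (toy dynamics here)
```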
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
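A minimal sketch of that recipe, assuming a frozen patch encoder standing in for DINOv3 and a flow-matching style objective (the paper's exact loss and conditioning are not shown here): noise the frozen features directly and train a denoiser to predict the velocity back to them, with no VAE anywhere.

```python
# Sketch: freeze a vision encoder, treat its patch features as the
# "latent" space (no VAE), and train a denoiser on noised features.
# FrozenEncoder is a stand-in for DINOv3; time conditioning is omitted.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):          # stand-in for DINOv3
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, 384, kernel_size=14, stride=14)
        for p in self.parameters():
            p.requires_grad_(False)
    def forward(self, img):              # (B,3,H,W) -> (B, patches, 384)
        return self.proj(img).flatten(2).transpose(1, 2)

encoder = FrozenEncoder().eval()
denoiser = nn.Sequential(nn.Linear(384, 512), nn.GELU(), nn.Linear(512, 384))

img = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    feats = encoder(img)                 # clean features: the target space
t = torch.rand(4, 1, 1)                  # random time per sample
noise = torch.randn_like(feats)
x_t = (1 - t) * noise + t * feats        # linear noise-to-feature path
velocity = feats - noise                 # flow-matching target
loss = (denoiser(x_t) - velocity).pow(2).mean()
loss.backward()
```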
Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.
LongCat-Image is a small (6B) but mighty bilingual image generator that turns text into high-quality, realistic pictures and can also edit images very well.
TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.
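To make the contrast concrete, here is a generic endpoint-distillation sketch, not TwinFlow's actual objective: the teacher needs ~50 Euler steps (50 forward passes) to map noise to data, while the student learns to jump there in a single call.

```python
# Illustrative one-step vs. many-step contrast. The teacher integrates
# a flow with many network calls; the student is trained to match the
# teacher's endpoint in one call. Both nets are toy stand-ins.
import torch
import torch.nn as nn

teacher = nn.Linear(64, 64)              # stand-in velocity field v(x)
student = nn.Linear(64, 64)              # one-step generator G(noise)

def teacher_sample(z, steps=50):         # slow: `steps` forward passes
    x = z.clone()
    for _ in range(steps):
        with torch.no_grad():
            x = x + teacher(x) / steps   # Euler step along the flow
    return x

z = torch.randn(8, 64)
target = teacher_sample(z)               # teacher's multi-step endpoint
pred = student(z)                        # one forward pass
loss = (pred - target).pow(2).mean()     # endpoint distillation loss
loss.backward()
```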