Papers14

#text-to-image

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Vaibhav Agrawal, Rishubh Parihar et al.Feb 26arXiv

SeeThrough3D teaches image generators to understand what should be visible and what should be hidden when objects overlap, just like in real life.

#occlusion-aware generation#3D layout control#text-to-image

BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Beginner

Eliran Kachlon, Alexander Visheratin et al.Feb 24arXiv

BBQ is a text-to-image model that lets you place objects exactly where you want using numeric bounding boxes and color them with exact RGB values.

#text-to-image#bounding boxes#RGB control

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Intermediate

Tianhe Wu, Ruibin Li et al.Feb 3arXiv

The paper solves a big problem in fast image generators: they got quick, but they lost variety and kept making similar pictures.

#diffusion distillation#distribution matching distillation#mode collapse

Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner

Zengbin Wang, Xuecai Hu et al.Jan 28arXiv

Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.

#text-to-image#spatial intelligence#occlusion

AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Intermediate

Dongjie Cheng, Ruifeng Yuan et al.Jan 25arXiv

AR-Omni is a single autoregressive model that can take in and produce text, images, and speech without extra expert decoders.

#autoregressive modeling#multimodal large language model#any-to-any generation

CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Intermediate

Chengzhuo Tong, Mingkun Chang et al.Jan 15arXiv

This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.

#Chain-of-Frame#visual reasoning#text-to-image

Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate

Sashuai Zhou, Qiang Zhou et al.Jan 6arXiv

Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.

#reasoning-aware image generation#structured planning#edit-only prompt

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Intermediate

Kaixin Ding, Yang Zhou et al.Dec 18arXiv

Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.

#meta-gradient#data selection#text-to-image

GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Intermediate

Bozhou Li, Sihan Yang et al.Dec 17arXiv

This paper is about making the words you type into a generator turn into the right pictures and videos more reliably.

#diffusion models#text encoder#multimodal large language model

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Intermediate

Yifan Pu, Yizeng Han et al.Dec 15arXiv

Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.

#text-to-image#diffusion models#few-step generation

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate

Minglei Shi, Haolin Wang et al.Dec 12arXiv

This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.

#text-to-image#diffusion transformer#flow matching

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Intermediate

Tong Zhang, Carlos Hinojosa et al.Dec 11arXiv

Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.

#diffusion models#memorization mitigation#latent feature injection

1 2