How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (14)

#GenEval

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Intermediate
Zehong Ma, Ruihan Xu et al. · Feb 2 · arXiv

PixelGen is a new image generator that works directly with pixels and uses a perceptual loss (a training signal that tracks what looks good to people) to improve quality.

#pixel diffusion · #perceptual loss · #LPIPS

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Intermediate
Fu-Yun Wang, Han Zhang et al. · Feb 1 · arXiv

PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.

#PromptRL · #flow matching · #reinforcement learning

DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment

Intermediate
Haoyou Deng, Keyu Yan et al. · Jan 28 · arXiv

DenseGRPO teaches image models using lots of small, timely rewards instead of one final score at the end.

#DenseGRPO · #flow matching · #GRPO

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate
Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models used VAEs (which compress images into small latent codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder · #RAE · #Variational Autoencoder

HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

Intermediate
Xin Xie, Jiaxian Guo et al. · Jan 22 · arXiv

Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.

#diffusion models · #rectified flow · #hypernetwork

CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Intermediate
Chengzhuo Tong, Mingkun Chang et al. · Jan 15 · arXiv

This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.

#Chain-of-Frame · #visual reasoning · #text-to-image

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Beginner
Shilong Zhang, He Zhang et al. · Dec 19 · arXiv

This paper shows that strong image-understanding features alone are not enough for making great images; you also need strong pixel-level detail.

#Pixel–Semantic VAE · #Semantic Regularization · #Off-Manifold Generation

StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models

Intermediate
Senmao Li, Kai Wang et al. · Dec 18 · arXiv

StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.

#Visual Autoregressive Modeling · #Next-Scale Prediction · #Stage-Aware Acceleration

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Beginner
Shufan Li, Jiuxiang Gu et al. · Dec 16 · arXiv

Sparse-LaViDa makes diffusion-style AI models much faster by skipping unhelpful masked tokens during generation while keeping quality the same.

#Masked Discrete Diffusion · #Sparse Parameterization · #Register Tokens

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Intermediate
Yifan Pu, Yizeng Han et al. · Dec 15 · arXiv

Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.

#text-to-image · #diffusion models · #few-step generation

Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Intermediate
Han Lin, Xichen Pan et al. · Dec 12 · arXiv

MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.

#MetaCanvas · #MLLM · #Diffusion Transformer

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Intermediate
Xin He, Longhui Wei et al. · Dec 4 · arXiv

EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images, all within one unified architecture.

#EMMA · #unified multimodal architecture · #32x autoencoder