How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (11)

#VAE

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Intermediate
Shengbang Tong, Boyang Zheng et al. · Jan 22 · arXiv

Before this work, most text-to-image models relied on VAEs, which compress images into small latent codes, and struggled with slow training and overfitting on high-quality fine-tuning sets.

#Representation Autoencoder, #RAE, #Variational Autoencoder

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Intermediate
Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv

OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).

#Unified Visual Encoder, #VAE, #Vision Transformer

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Intermediate
Ruiheng Zhang, Jingfeng Yao et al. · Jan 16 · arXiv

UniX is a new medical AI that both understands chest X-rays (writes accurate reports) and generates chest X-ray images (high visual quality) without making the two jobs fight each other.

#UniX, #autoregressive branch, #diffusion branch

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Beginner
Zefeng He, Xiaoye Qu et al. · Dec 30 · arXiv

DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-generation task.

#DiffThinker, #Generative Multimodal Reasoning, #Diffusion Models

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

Intermediate
Rujiao Long, Yang Li et al. · Dec 19 · arXiv

Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.

#Reasoning Palette, #latent contextualization, #VAE

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Intermediate
Shuyuan Tu, Yueming Pan et al. · Dec 18 · arXiv

FlashPortrait makes talking-portrait videos that keep a person’s identity steady for as long as you want—minutes or even hours.

#FlashPortrait, #portrait animation, #identity consistency

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Intermediate
Shengming Yin, Zekai Zhang et al. · Dec 17 · arXiv

The paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.

#image decomposition, #RGBA layers, #alpha blending

SS4D: Native 4D Generative Model via Structured Spacetime Latents

Intermediate
Zhibing Li, Mengchen Zhang et al. · Dec 16 · arXiv

SS4D is a new AI model that turns a short single-camera video into a full 3D object that moves over time (that’s 4D), and it does this in about 2 minutes.

#4D generation, #structured spacetime latents, #temporal attention

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Intermediate
Hao Lu, Ziyang Liu et al. · Dec 10 · arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP, #vision-language-action, #world model

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Intermediate
Yuan Gao, Chen Chen et al. · Dec 8 · arXiv

This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.

#Feature Auto-Encoder, #FAE, #Self-Supervised Learning

COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Beginner
Zefeng Zhang, Xiangzhao Hao et al. · Dec 4 · arXiv

COOPER is a single AI model that both “looks better” (perceives depth and object boundaries) and “thinks smarter” (reasons step by step) to answer spatial questions about images.

#COOPER, #multimodal large language model, #unified model