Vision-Language-Action (VLA) models are powerful, but they are too big and slow to run on many real-world devices.
This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each denoising step.
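To make the patch idea concrete, here is a minimal sketch, not the paper's actual code: the patch sizes and the step schedule below are illustrative assumptions. Bigger patches mean fewer tokens, so attention runs faster during the noisy early steps; smaller patches near the end recover detail.

```python
# Illustrative sketch: coarse patches early in denoising, fine patches at the end.
import torch

def patchify(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Split a (B, C, H, W) image into a (B, N, C*patch*patch) token sequence."""
    b, c, h, w = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    return x

def patch_size_for_step(t: float) -> int:
    """Hypothetical schedule: large patches while noise dominates, small at the end."""
    return 8 if t > 0.5 else 4 if t > 0.2 else 2

latent = torch.randn(1, 4, 32, 32)           # e.g. a VAE latent
for t in (0.9, 0.4, 0.1):                    # denoising progress, noisy -> clean
    p = patch_size_for_step(t)
    tokens = patchify(latent, p)
    print(f"t={t}: patch={p}, tokens={tokens.shape[1]}")  # fewer tokens => faster attention
```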
FastVMT is a faster way to copy motion from one video to another without training a new model for each video.
The paper shows that using information from many layers of a language model (not just the final one) helps text-to-image diffusion transformers follow prompts much better.
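A minimal sketch of the general idea, with the class name, the mixing scheme, and the shapes as my assumptions rather than the paper's exact method: instead of conditioning the diffusion transformer on only the text encoder's last hidden state, mix the hidden states from several layers using learned weights.

```python
# Illustrative sketch: blend per-layer text-encoder features with learned weights.
import torch
import torch.nn as nn

class MultiLayerTextConditioner(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned mixing weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: per-layer (B, T, D) outputs from the language model
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, hidden_states))
        return self.proj(mixed)  # conditioning tokens fed to cross-attention

layers = [torch.randn(2, 77, 768) for _ in range(12)]  # e.g. 12 encoder layers
cond = MultiLayerTextConditioner(12, 768)(layers)
print(cond.shape)  # torch.Size([2, 77, 768])
```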
Before this work, most text-to-image models used VAEs (which squish images into small codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
Transformers are powerful but slow because regular self-attention compares every token with every other token, a cost that grows quadratically as sequences get longer.
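A small illustration of that quadratic cost (the shapes below are illustrative): the attention score matrix holds one entry per token pair, so doubling the sequence length quadruples the work.

```python
# Illustrative sketch: plain self-attention builds an n x n score matrix.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head attention over (B, n, d) tokens; Q, K, V are x itself for brevity."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d**0.5      # (B, n, n): one score per token pair
    return F.softmax(scores, dim=-1) @ x

for n in (1_000, 2_000, 4_000):
    out = self_attention(torch.randn(1, n, 64))
    print(f"n={n}: output {tuple(out.shape)}, score matrix held {n*n:,} entries")
```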
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models with explicit task hints, so the right “mini-experts” handle the right parts of an image-generation job.
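A minimal sketch of task-guided routing, as my illustrative take on the idea rather than TAG-MoE's actual architecture: the gate sees an explicit task embedding alongside each token, so the task hint steers which experts fire.

```python
# Illustrative sketch: an MoE gate conditioned on a task embedding.
import torch
import torch.nn as nn

class TaskGuidedMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int, num_tasks: int, top_k: int = 2):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, dim)
        self.gate = nn.Linear(2 * dim, num_experts)          # token + task hint -> expert scores
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) tokens; task_id: (B,) integer task labels
        hint = self.task_emb(task_id)[:, None, :].expand_as(x)
        logits = self.gate(torch.cat([x, hint], dim=-1))     # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # route each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out

moe = TaskGuidedMoE(dim=64, num_experts=4, num_tasks=3)
y = moe(torch.randn(2, 16, 64), torch.tensor([0, 2]))
print(y.shape)  # torch.Size([2, 16, 64])
```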
Image-to-Video models often stay faithful to the starting picture but ignore parts of the text instructions.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
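A toy illustration of the propagation idea, under a strong simplifying assumption (a static scene, so pixels line up across frames): compute the edit as a delta on frame 0 and broadcast it to every frame. Real methods track motion so edits follow moving content; this only shows the "spread from the first frame" notion.

```python
# Illustrative sketch: propagate a first-frame edit to all frames as a delta.
import numpy as np

def propagate_first_frame_edit(video: np.ndarray, edited_frame0: np.ndarray) -> np.ndarray:
    """video: (T, H, W, 3) floats in [0, 1]; edited_frame0: (H, W, 3) after the user's edit."""
    delta = edited_frame0 - video[0]
    return np.clip(video + delta, 0.0, 1.0)   # broadcast the edit delta to all T frames

video = np.random.rand(8, 32, 32, 3)
edited0 = video[0].copy()
edited0[8:16, 8:16] = 1.0                     # hypothetical local edit on frame 0
out = propagate_first_frame_edit(video, edited0)
print(out.shape)  # (8, 32, 32, 3)
```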
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of re-generating the whole picture.
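A minimal sketch of the selective-editing idea (illustrative, not SpotEdit's actual pipeline): find where the edited image should differ, then blend so unmasked pixels are copied verbatim from the original and only the masked region comes from the generator.

```python
# Illustrative sketch: mask-blend a generated edit into the original image.
import numpy as np

def spot_blend(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """original/edited: (H, W, 3) float images; mask: (H, W) in [0, 1], 1 = region to edit."""
    m = mask[..., None]
    return m * edited + (1.0 - m) * original  # untouched pixels stay pixel-identical

h, w = 64, 64
original = np.random.rand(h, w, 3)
edited = np.random.rand(h, w, 3)              # stand-in for a full generator output
mask = np.zeros((h, w))
mask[16:48, 16:48] = 1.0                      # hypothetical detected edit region
out = spot_blend(original, edited, mask)
assert np.allclose(out[:16], original[:16])   # outside the mask, nothing changed
```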