This paper shows that many AI models built to both understand and generate images are not truly unified inside: they often understand well but fail to generate, or the other way around.
This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image task.
Image-to-Video models often keep the picture looking right but ignore parts of the text instructions.
This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
Traditional self-driving systems used separate boxes for seeing, thinking, and acting, but tiny mistakes in the early boxes could snowball into big problems later.