How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)


VLS: Steering Pretrained Robot Policies via Vision-Language Models

Intermediate
Shuo Liu, Ishneet Sukhvinder Singh et al. · Feb 3 · arXiv

Robots often learn good hand motions during training but get confused when the scene or the instructions change at test time, even a little bit. VLS steers a frozen, pretrained policy at inference time instead of retraining it (a rough sketch of that loop follows this entry).

#Vision–Language Steering · #Inference-time control · #Diffusion policy
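The tags suggest nudging a diffusion policy's sampling loop with a vision-language signal at test time. Below is a minimal, hypothetical sketch of that general pattern (classifier-guidance style); `DummyPolicy`, `vlm_score`, and every parameter are invented stand-ins, not the paper's actual method:

```python
import torch

# Stand-in diffusion policy: one denoising step over an action chunk.
# Shapes: (batch, horizon, action_dim). Purely illustrative.
class DummyPolicy:
    def denoise_step(self, actions, t, num_steps):
        return actions * (1.0 - 1.0 / num_steps)  # placeholder drift toward zero

def vlm_score(actions):
    """Stand-in for a VLM scoring how well actions match the instruction.
    Here: prefer small actions, just so the gradient is well-defined."""
    return -actions.pow(2).sum()

def steered_sampling(policy, num_steps=20, guidance_weight=0.1):
    actions = torch.randn(1, 16, 7)  # start the action chunk from noise
    for t in reversed(range(num_steps)):
        actions = policy.denoise_step(actions, t, num_steps)   # base diffusion update
        actions = actions.detach().requires_grad_(True)
        grad = torch.autograd.grad(vlm_score(actions), actions)[0]
        actions = (actions + guidance_weight * grad).detach()  # nudge toward higher score
    return actions

print(steered_sampling(DummyPolicy()).shape)  # torch.Size([1, 16, 7])
```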

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Intermediate
Haozhe Xie, Beichen Wen et al. · Jan 29 · arXiv

DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving (a toy control-loop sketch follows this entry).

#Dynamic object manipulation · #Vision-Language-Action · #Continuous inference
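One way to read "continuous inference" is a receding-horizon loop: keep replanning short action chunks from fresh observations instead of executing one long open-loop plan, so a moving object never invalidates the whole plan. A toy sketch under that assumption; `DummyEnv` and `DummyVLA` are invented stand-ins, not DynamicVLA's API:

```python
import random

# Invented stand-ins: a drifting target and a model that outputs action chunks.
class DummyEnv:
    def __init__(self):
        self.target = 0.0
    def observe(self):
        self.target += random.uniform(-0.1, 0.1)  # the object keeps moving
        return self.target
    def step(self, action):
        pass  # apply one low-level action

class DummyVLA:
    def plan(self, obs, instruction, horizon=8):
        return [obs] * horizon  # pretend chunk of actions toward the target

def control_loop(env, model, instruction, ticks=100, replan_every=2):
    """Execute only the first few actions of each chunk, then replan
    from a fresh observation."""
    for _ in range(ticks):
        chunk = model.plan(env.observe(), instruction)
        for action in chunk[:replan_every]:
            env.step(action)

control_loop(DummyEnv(), DummyVLA(), "pick up the rolling ball")
```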

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Intermediate
Hao Luo, Ye Wang et al. · Jan 19 · arXiv

Being-H0.5 is a robot brain that learns from huge amounts of human videos and robot demos so it can work on many different robots, not just one (see the unified-action-space sketch after the tags).

#Vision-Language-Action model · #Unified Action Space · #Human-centric learning
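A "unified action space" is commonly built as one shared policy trunk plus per-robot heads that translate between a common action representation and each embodiment's native action format. This is a guessed-at sketch of that generic pattern only; none of these classes or dimensions come from the paper:

```python
import torch
import torch.nn as nn

class UnifiedPolicy(nn.Module):
    """Shared trunk acting in a common space, with per-embodiment heads
    that map to each robot's native action dimensionality."""
    def __init__(self, unified_dim=32, embodiments=None):
        super().__init__()
        embodiments = embodiments or {"arm7": 7, "gripper13": 13}  # hypothetical robots
        self.trunk = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                   nn.Linear(128, unified_dim))
        self.heads = nn.ModuleDict(
            {name: nn.Linear(unified_dim, dim) for name, dim in embodiments.items()})

    def forward(self, features, embodiment):
        shared_action = self.trunk(features)          # embodiment-agnostic action
        return self.heads[embodiment](shared_action)  # native action for this robot

policy = UnifiedPolicy()
obs_features = torch.randn(1, 64)          # stand-in fused vision+language features
print(policy(obs_features, "arm7").shape)       # torch.Size([1, 7])
print(policy(obs_features, "gripper13").shape)  # torch.Size([1, 13])
```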

Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Intermediate
Xin Yu, Xiaojuan Qi et al. · Dec 26 · arXiv

This paper introduces Self-E, a text-to-image model that learns from scratch and can generate good pictures in any number of steps, from just a few to many (a generic any-step sampler is sketched below).

#Self-Evaluating Model · #Any-step inference · #Text-to-image generation
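"Any-step inference" means one trained network that can spend 1 step or 50 on the same image. A generic flow/ODE-style sampler makes the idea concrete; the toy velocity field and plain Euler integration here are illustrative assumptions, not Self-E's actual training objective or sampler:

```python
import torch

def toy_velocity(x, t):
    # Stand-in for a trained network: a velocity field that drifts
    # samples toward zero as t decreases from 1 (noise) to 0 (data).
    return x

def sample(model, shape, num_steps):
    """Integrate from noise at t=1 to data at t=0; only num_steps changes."""
    x = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * model(x, t0)  # one Euler update
    return x

for steps in (1, 4, 50):  # the same "model", three different step budgets
    print(steps, sample(toy_velocity, (1, 3, 8, 8), steps).abs().mean().item())
```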

SpotEdit: Selective Region Editing in Diffusion Transformers

Intermediate
Zhibin Qin, Zhenxiong Tan et al. · Dec 26 · arXiv

SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of regenerating the whole picture (the masked-blending sketch below shows the general idea).

#Diffusion Transformer · #Selective image editing · #Region-aware editing
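Training-free region editing is often done by splicing latents at every denoising step: regenerate inside an edit mask, keep re-noised originals outside it. A minimal sketch of that generic blended-diffusion pattern, with invented stand-ins, not SpotEdit's exact procedure:

```python
import torch

# Invented stand-ins for a latent denoiser and the forward noising process.
def denoise_step(z, t, num_steps):
    return z * (1.0 - 1.0 / num_steps)  # placeholder denoising update

def add_noise(z, t, num_steps):
    return z + (t / num_steps) * torch.randn_like(z)  # originals at noise level t

def masked_edit(z_orig, mask, num_steps=50):
    """mask == 1 marks the region to regenerate; everywhere else the original
    image's latents are kept, so unedited areas stay faithful."""
    z = torch.randn_like(z_orig)
    for t in reversed(range(num_steps)):
        z = denoise_step(z, t, num_steps)         # generate the edited content
        z_keep = add_noise(z_orig, t, num_steps)  # match original to noise level t
        z = mask * z + (1 - mask) * z_keep        # splice inside/outside the mask
    return z

z_orig = torch.randn(1, 4, 32, 32)                    # stand-in image latents
mask = torch.zeros_like(z_orig)
mask[..., 8:24, 8:24] = 1.0                           # edit only the center region
print(masked_edit(z_orig, mask).shape)
```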

Visual Generation Tuning

Intermediate
Jiahao Guo, Sinan Du et al. · Nov 28 · arXiv

Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.

#Visual Generation Tuning · #VGT-AE · #Vision-Language Models