How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (6)


VLS: Steering Pretrained Robot Policies via Vision-Language Models

Intermediate
Shuo Liu, Ishneet Sukhvinder Singh et al. · Feb 3 · arXiv

Robots often learn good hand motions during training but get confused when the scene or the instructions change at test time, even a little bit. VLS steers a frozen, pretrained policy at inference time instead of retraining it (a rough sketch of that loop follows this entry).

#Vision–Language Steering · #Inference-time control · #Diffusion policy
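The tags suggest nudging a diffusion policy's sampling loop with a vision-language signal at test time. Below is a minimal, hypothetical sketch of that general pattern (classifier-guidance style); `DummyPolicy`, `vlm_score`, and every parameter are invented stand-ins, not the paper's actual method:

```python
import torch

# Stand-in diffusion policy: one denoising step over an action chunk.
# Shapes: (batch, horizon, action_dim). Purely illustrative.
class DummyPolicy:
    def denoise_step(self, actions, t, num_steps):
        return actions * (1.0 - 1.0 / num_steps)  # placeholder drift toward zero

def vlm_score(actions):
    """Stand-in for a VLM scoring how well actions match the instruction.
    Here: prefer small actions, just so the gradient is well-defined."""
    return -actions.pow(2).sum()

def steered_sampling(policy, num_steps=20, guidance_weight=0.1):
    actions = torch.randn(1, 16, 7)  # start the action chunk from noise
    for t in reversed(range(num_steps)):
        actions = policy.denoise_step(actions, t, num_steps)   # base diffusion update
        actions = actions.detach().requires_grad_(True)
        grad = torch.autograd.grad(vlm_score(actions), actions)[0]
        actions = (actions + guidance_weight * grad).detach()  # nudge toward higher score
    return actions

print(steered_sampling(DummyPolicy()).shape)  # torch.Size([1, 16, 7])
```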

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Intermediate
Haozhe Xie, Beichen Wen et al. · Jan 29 · arXiv

DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving (a toy control-loop sketch follows this entry).

#Dynamic object manipulation · #Vision-Language-Action · #Continuous inference
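One way to read "continuous inference" is a receding-horizon loop: keep replanning short action chunks from fresh observations instead of executing one long open-loop plan, so a moving object never invalidates the whole plan. A toy sketch under that assumption; `DummyEnv` and `DummyVLA` are invented stand-ins, not DynamicVLA's API:

```python
import random

# Invented stand-ins: a drifting target and a model that outputs action chunks.
class DummyEnv:
    def __init__(self):
        self.target = 0.0
    def observe(self):
        self.target += random.uniform(-0.1, 0.1)  # the object keeps moving
        return self.target
    def step(self, action):
        pass  # apply one low-level action

class DummyVLA:
    def plan(self, obs, instruction, horizon=8):
        return [obs] * horizon  # pretend chunk of actions toward the target

def control_loop(env, model, instruction, ticks=100, replan_every=2):
    """Execute only the first few actions of each chunk, then replan
    from a fresh observation."""
    for _ in range(ticks):
        chunk = model.plan(env.observe(), instruction)
        for action in chunk[:replan_every]:
            env.step(action)

control_loop(DummyEnv(), DummyVLA(), "pick up the rolling ball")
```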

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Intermediate
Hao Luo, Ye Wang et al. · Jan 19 · arXiv

Being-H0.5 is a robot brain that learns from huge amounts of human videos and robot demos so it can work on many different robots, not just one (see the unified-action-space sketch after the tags).

#Vision-Language-Action model · #Unified Action Space · #Human-centric learning
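A "unified action space" is commonly built as one shared policy trunk plus per-robot heads that translate between a common action representation and each embodiment's native action format. This is a guessed-at sketch of that generic pattern only; none of these classes or dimensions come from the paper:

```python
import torch
import torch.nn as nn

class UnifiedPolicy(nn.Module):
    """Shared trunk acting in a common space, with per-embodiment heads
    that map to each robot's native action dimensionality."""
    def __init__(self, unified_dim=32, embodiments=None):
        super().__init__()
        embodiments = embodiments or {"arm7": 7, "gripper13": 13}  # hypothetical robots
        self.trunk = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                   nn.Linear(128, unified_dim))
        self.heads = nn.ModuleDict(
            {name: nn.Linear(unified_dim, dim) for name, dim in embodiments.items()})

    def forward(self, features, embodiment):
        shared_action = self.trunk(features)          # embodiment-agnostic action
        return self.heads[embodiment](shared_action)  # native action for this robot

policy = UnifiedPolicy()
obs_features = torch.randn(1, 64)          # stand-in fused vision+language features
print(policy(obs_features, "arm7").shape)       # torch.Size([1, 7])
print(policy(obs_features, "gripper13").shape)  # torch.Size([1, 13])
```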

Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Intermediate
Xin Yu, Xiaojuan Qi et al. · Dec 26 · arXiv

This paper introduces Self-E, a text-to-image model that learns from scratch and can generate good pictures in any number of steps, from just a few to many (a generic any-step sampler is sketched below).

#Self-Evaluating Model · #Any-step inference · #Text-to-image generation
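"Any-step inference" means one trained network that can spend 1 step or 50 on the same image. A generic flow/ODE-style sampler makes the idea concrete; the toy velocity field and plain Euler integration here are illustrative assumptions, not Self-E's actual training objective or sampler:

```python
import torch

def toy_velocity(x, t):
    # Stand-in for a trained network: a velocity field that drifts
    # samples toward zero as t decreases from 1 (noise) to 0 (data).
    return x

def sample(model, shape, num_steps):
    """Integrate from noise at t=1 to data at t=0; only num_steps changes."""
    x = torch.randn(shape)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * model(x, t0)  # one Euler update
    return x

for steps in (1, 4, 50):  # the same "model", three different step budgets
    print(steps, sample(toy_velocity, (1, 3, 8, 8), steps).abs().mean().item())
```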

SpotEdit: Selective Region Editing in Diffusion Transformers

Intermediate
Zhibin Qin, Zhenxiong Tan et al. · Dec 26 · arXiv

SpotEdit is a training-free way to edit only the parts of an image that actually change, instead of regenerating the whole picture (the masked-blending sketch below shows the general idea).

#Diffusion Transformer · #Selective image editing · #Region-aware editing
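Training-free region editing is often done by splicing latents at every denoising step: regenerate inside an edit mask, keep re-noised originals outside it. A minimal sketch of that generic blended-diffusion pattern, with invented stand-ins, not SpotEdit's exact procedure:

```python
import torch

# Invented stand-ins for a latent denoiser and the forward noising process.
def denoise_step(z, t, num_steps):
    return z * (1.0 - 1.0 / num_steps)  # placeholder denoising update

def add_noise(z, t, num_steps):
    return z + (t / num_steps) * torch.randn_like(z)  # originals at noise level t

def masked_edit(z_orig, mask, num_steps=50):
    """mask == 1 marks the region to regenerate; everywhere else the original
    image's latents are kept, so unedited areas stay faithful."""
    z = torch.randn_like(z_orig)
    for t in reversed(range(num_steps)):
        z = denoise_step(z, t, num_steps)         # generate the edited content
        z_keep = add_noise(z_orig, t, num_steps)  # match original to noise level t
        z = mask * z + (1 - mask) * z_keep        # splice inside/outside the mask
    return z

z_orig = torch.randn(1, 4, 32, 32)                    # stand-in image latents
mask = torch.zeros_like(z_orig)
mask[..., 8:24, 8:24] = 1.0                           # edit only the center region
print(masked_edit(z_orig, mask).shape)
```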

Visual Generation Tuning

Intermediate
Jiahao Guo, Sinan Du et al. · Nov 28 · arXiv

Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.

#Visual Generation Tuning · #VGT-AE · #Vision-Language Models