Papers (3)

#Long-Horizon Planning

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

Intermediate

GigaBrain Team, Boyuan Wang et al. · Feb 12 · arXiv

GigaBrain-0.5M* is a robot brain that sees, reads, and acts, and it gets smarter by imagining the future before moving.

#Vision-Language-Action #World Model #Reinforcement Learning

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Intermediate

Zirui Wang, Junyi Zhang et al. · Jan 23 · arXiv

VisGym is a playground of 17 very different visual tasks that test and train AI models that see and talk (Vision–Language Models) to act over many steps.

#VisGym #Vision–Language Models #Multimodal Agents

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Intermediate

Jing Zuo, Lingzhou Mu et al. · Jan 20 · arXiv

FantasyVLN teaches a robot to follow language instructions while looking around, using a smart, step-by-step thinking style during training but not at test time.

#Vision-and-Language Navigation #Chain-of-Thought #Multimodal CoT
