Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.
Fast-ThinkAct teaches a robot to plan with a few tiny hidden "thought tokens" instead of long paragraphs, making it much faster while staying smart.
VLingNav is a robot navigation system that sees, reads instructions, and acts, while deciding when to think hard and when to just move.
Traditional self-driving systems used separate boxes for seeing, thinking, and acting, but tiny mistakes in early boxes could snowball into big problems later.
Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.
DrivePI is a single, small (0.5B-parameter) multimodal language model that sees with cameras and LiDAR, talks in natural language, and plans driving actions all at once.
Vision-Language-Action (VLA) models are robots’ “see–think–do” brains that connect cameras (vision), words (language), and motors (action).
This paper shows how to make home-helper robots better at long, multi-step chores by training on diverse tasks and then polishing the model after training using its own best attempts.
Robots often act like goldfish with short memories; HiF-VLA fixes this by letting them use motion to remember the past and predict the future.
Robots that follow pictures and words (using VLA models) can do many tasks, but they often bump into things because safety isn’t guaranteed.