The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.
WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.
This paper studies Vision–Language–Action (VLA) models for robots under one fair setup to find which design choices truly matter.
Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.
This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
Action100M is a gigantic video dataset with about 100 million labeled action moments, built automatically from 1.2 million instructional videos.
This paper introduces Web World Models (WWMs), a way to build huge, explorable worlds by putting strict rules in code and letting AI write the fun details.
MMGR is a new benchmark that checks whether AI image and video generators follow real-world rules, not just whether their outputs look pretty.