This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.
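To make the render step concrete, here is a minimal sketch (not the paper's actual pipeline) of turning model-generated markup into pixels with a headless browser. The choice of Playwright, the phone-sized viewport, and the HTML snippet standing in for real model output are all assumptions for illustration.

```python
# Sketch: rasterize generated web code into a screenshot of the
# predicted screen. Assumes `pip install playwright` followed by
# `playwright install chromium`.
from playwright.sync_api import sync_playwright

# Stand-in for HTML/CSS/SVG that the model would generate.
generated_html = """
<html><body style="margin:0">
  <svg width="390" height="100">
    <rect width="390" height="100" fill="#4a90d9"/>
    <text x="20" y="60" font-size="28" fill="white">Settings</text>
  </svg>
</body></html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Phone-sized viewport, since the target is a phone screen.
    page = browser.new_page(viewport={"width": 390, "height": 844})
    page.set_content(generated_html)
    png_bytes = page.screenshot()  # pixels of the predicted screen
    browser.close()
```

The appeal of code-then-render is that hard visual details (fonts, layout, anti-aliasing) come for free from the browser engine instead of having to be synthesized pixel by pixel.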
LingBot-World is an open-source world model that turns video generation into an interactive, real-time simulator.
Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.
This paper shows how to give AI a steady “mental map” of the world that keeps updating even when the camera looks away.
Capitalization tie-out checks whether a company's ownership table (its cap table) actually matches what its legal documents say.
LongVie 2 is a video world model that can generate controllable videos for 3–5 minutes while keeping the look and motion steady over time.
UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.
Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.