Advancing Open-source World Models
Key Summary
- LingBot-World is an open-source world model that turns video generation into an interactive, real-time simulator.
- It keeps scenes consistent for minutes, remembers objects even when they go off-screen, and reacts to user actions like W, A, S, D and camera turns.
- A layered data engine mixes real videos, game recordings with controls, and Unreal Engine renders, then adds three kinds of captions to teach both scenes and motion.
- Training happens in three stages: build a strong video base, add action control and long memory, then make it fast and causal for live use.
- A Mixture-of-Experts design splits the work between a 'big picture' expert and a 'fine details' expert for high fidelity without extra runtime cost.
- Causal attention and KV caching let the model generate frames step by step, so it can respond in under a second at around 16 fps.
- Few-step distillation plus self-rollouts and a light adversarial head keep quality high and reduce long-horizon drift.
- On VBench, LingBot-World scores top imaging and aesthetic quality and a much higher dynamic degree than strong baselines.
- It supports promptable world events (like switching to night or spawning birds), and its videos can be used to train action agents or reconstruct 3D scenes.
- Limits remain: heavy compute needs, a limited action set, occasional drift over very long runs, and no multi-agent support yet.
Why This Research Matters
Interactive, consistent virtual worlds can change how we learn, play, and train machines. Students could explore historical sites, switch to night or winter, and see how the world changes instantly. Game creators can prototype new levels with live controls and stable long-term memory, speeding creative workflows. Robots can safely practice navigation and planning in worlds that react properly to actions, before entering homes or factories. Safer driving systems can test rare weather and lighting scenarios on demand. Because LingBot-World is open-source, researchers and developers everywhere can adapt it for their needs instead of waiting on closed tools.
Detailed Explanation
01 Background & Problem Definition
You know how a great video can look amazing for a few seconds, but a game needs to stay logical and responsive for minutes? That’s the big difference between pretty clips and living worlds. Before this work, most AI video models were like talented daydreamers: they could paint gorgeous short scenes, but they didn’t really know the rules of the world—like what happens when you turn left, where a car goes when it leaves the frame, or how the same statue should look when you see it again later.
The problem was threefold. First, the right data was scarce. Web videos show what happened, but they don’t tell you which buttons were pressed, which way the camera turned, or why a scene changed. Without action labels and camera information, a model can only guess cause and effect. Second, long-term coherence was shaky. Standard diffusion video models often forget details as time passes, so buildings shift, styles wobble, and objects reappear differently after being out of view. Third, they were slow. Iterative diffusion sampling is costly, so most systems could not respond live to user input; they had to render offline.
People tried to scale data and models to get past this, but that alone didn’t fix interaction or memory. Training on more passive videos improved looks, not logic. Adding a little control helped for a few seconds, but drift and inconsistency returned on longer runs. Some teams built impressive closed systems, but without open models and code, the wider community couldn’t build on them.
What was missing was a whole pipeline that treats videos not as passive movies but as cause-and-effect worlds: a data engine that ties visuals to actions and geometry; a training path that first learns to paint well, then learns long memory and action-following, and finally runs fast enough for live play; and an architecture that respects time’s arrow so the future can’t peek back at the past.
Here’s why everyday life should care. Interactive worlds power games, robot training, planning for autonomous cars, and educational content that responds to students. Imagine touring a historic site and being able to turn down any hallway, switch to night to see lighting effects, or add rain to learn how weather changes reflections—instantly. Or training a home robot safely in a simulated kitchen where opening a cupboard actually reveals what’s behind it, and the contents stay the same when you look back minutes later.
🍞 Hook: Imagine playing a mystery game where clues only matter if they happen before the solution. 🥬 The Concept (Causal Attention): Causal attention is a way for the model to only use information from now and the past when deciding the next frame, never from the future.
- How it works:
- Split the video into time chunks.
- Inside each chunk, allow short-range two-way checking to keep nearby frames smooth.
- Across chunks, only look backward in time, never forward.
- Reuse saved keys and values (KV caching) so the next step is fast.
- Why it matters: Without causal attention, the model might “cheat” by using future frames, which breaks live interaction and makes real-time response impossible. 🍞 Anchor: When you press W to walk forward, the model updates the next frames based only on what you’ve seen and done, not on frames it hasn’t generated yet.
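To make the chunked masking above concrete, here is a minimal PyTorch sketch (an illustration, not LingBot-World's actual code) of a block-causal attention mask; the chunk size and names are assumptions.

```python
import torch

def block_causal_mask(num_frames: int, chunk: int) -> torch.Tensor:
    """Boolean mask (True = may attend): two-way inside a chunk, past-only across chunks."""
    chunk_id = torch.arange(num_frames) // chunk   # which time chunk each frame belongs to
    q = chunk_id[:, None]                          # chunk of the query frame, shape (T, 1)
    k = chunk_id[None, :]                          # chunk of the key frame,   shape (1, T)
    return k <= q                                  # same chunk or any earlier chunk

mask = block_causal_mask(num_frames=8, chunk=4)
# Frame 2 (chunk 0) may attend to frames 0-3 but never to frames 4-7;
# frame 5 (chunk 1) may attend to frames 0-7 (its own chunk plus the past).
# Because past chunks never see the future, their keys/values never change,
# which is exactly what makes KV caching safe. The boolean mask can be passed as
# `attn_mask` to torch.nn.functional.scaled_dot_product_attention (True = allowed).
```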
🍞 Hook: Think of a game controller that makes your character speed up more on a downhill than flat ground. 🥬 The Concept (Action Control Modulation): Action control modulation lets the model change what it generates based on your inputs (like W, A, S, D and camera turns) so the world reacts correctly.
- How it works:
- Turn inputs into numeric signals (keys become on/off, camera turns become smooth values).
- Encode them (including camera motions) into special vectors.
- Gently adjust the model’s internal features (scale and shift) using these vectors.
- Generate the next frames that match your action.
- Why it matters: Without it, a scene would move randomly instead of following your controls. 🍞 Anchor: Press A to strafe left, and the hallway slides appropriately to your right, staying consistent when you look back later.
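As a rough illustration of this idea, the sketch below (PyTorch, with hypothetical names like `ActionAdaLN`; not the released implementation) turns key presses into a multi-hot vector and camera turns into two continuous values, then uses them to scale and shift normalized features in AdaLN style. The real model encodes camera motion more richly than two numbers.

```python
import torch
import torch.nn as nn

KEYS = ["W", "A", "S", "D"]  # hypothetical discrete action set

class ActionAdaLN(nn.Module):
    """Map user inputs to a (scale, shift) pair and modulate features, AdaLN-style."""

    def __init__(self, hidden_dim: int, cam_dim: int = 2):
        super().__init__()
        in_dim = len(KEYS) + cam_dim  # multi-hot keys + camera yaw/pitch deltas
        self.to_scale_shift = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, feats, pressed, cam_delta):
        # feats: (B, tokens, hidden); pressed: (B, 4) multi-hot; cam_delta: (B, 2) continuous
        cond = torch.cat([pressed, cam_delta], dim=-1)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # (1 + scale) keeps the layer close to identity when the conditioning is near zero,
        # so the control gently steers features instead of overwriting the visual prior.
        return self.norm(feats) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Holding W while turning the camera slightly right:
feats = torch.randn(1, 16, 256)
pressed = torch.tensor([[1.0, 0.0, 0.0, 0.0]])   # W on, A/S/D off
cam = torch.tensor([[0.1, 0.0]])                 # small yaw, no pitch
out = ActionAdaLN(256)(feats, pressed, cam)
```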
🍞 Hook: Picture a theme park builder that’s not just pretty but also consistent when you revisit the same rides. 🥬 The Concept (LingBot-World): LingBot-World is an open-source system that learns how worlds change over time and renders them in real time based on your actions and prompts.
- How it works:
- Learn general video skills from huge datasets (textures, motion basics).
- Train on long, action-labeled sequences to gain memory and controllability.
- Convert to a fast, causal generator that responds live.
- Why it matters: Without all three steps, you’d get short, pretty clips, not interactive, persistent worlds. 🍞 Anchor: You walk through a temple, turn left, look back after a minute, and the same statue is still there in the right place and style.
02 Core Idea
The one-sentence aha: Turn a beautiful daydreamer into a rule-following simulator by teaching it with layered, action-aware data, then reshaping it to think forward in time fast.
Three analogies to lock it in:
- Movie vs. sandbox: Old models filmed a short scene; LingBot-World builds a playground where your moves change what happens next.
- GPS vs. map: A static map looks nice; a GPS updates with your turns and remembers where landmarks are when you circle back.
- Chef vs. short-order cook: A master chef can cook anything slowly; a short-order cook delivers fast. LingBot-World first learns gourmet skills, then distills them to serve hot and quick.
Before vs. after:
- Before: Short, stunning clips with drift over time; limited or no reaction to controls; offline rendering.
- After: Minute-long consistency; precise action following; live response around 16 fps with sub-second latency; open-source so others can build on it.
Why it works (intuition, not equations):
- Learning to see and paint: Pre-training on massive videos teaches textures, shapes, and common motions, so the model starts with strong visual instincts.
- Teaching cause and memory: Middle-training on long, action-labeled sequences shows the model what follows from which control and encourages it to keep a stable internal picture for much longer.
- Respecting time: Causal attention ensures the model can’t peek at future frames, so it behaves like a proper simulator.
- Going fast without forgetting: Few-step distillation compresses the slow multi-step process into a handful of smart steps while self-rollouts train the model to recover from its own small mistakes.
Building blocks (with simple, stand-alone explanations):
🍞 Hook: You know how a teacher gives chapter, scene, and beat-by-beat summaries? 🥬 The Concept (Hierarchical Captioning): Hierarchical captioning describes each video at three levels—overall story, static scene details, and time-stamped events.
- How it works:
- Narrative caption ties the whole scene and camera path together.
- Scene-static caption lists objects and style without motion.
- Dense temporal captions mark what happens every few seconds.
- Why it matters: Without it, the model mixes up what should stay the same (the room) and what should move (the camera), hurting control and memory. 🍞 Anchor: “A bright temple with red doors” (static), then “turn left” (temporal), all wrapped in a smooth tour (narrative).
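A simple way to picture the three caption levels is as one record per clip; the schema below is only illustrative (field names like `dense_temporal` are assumptions, not the released data format).

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HierarchicalCaption:
    """Three caption levels for one clip (illustrative schema, not the released format)."""
    narrative: str                                   # whole-clip story, including the camera path
    scene_static: str                                # objects and style, no motion
    dense_temporal: List[Tuple[float, float, str]]   # (start_s, end_s, event)

sample = HierarchicalCaption(
    narrative="A slow walking tour through a bright temple, turning left into a lantern-lit hall.",
    scene_static="A bright temple with red doors, a stone floor, and hanging lanterns.",
    dense_temporal=[
        (0.0, 4.0, "camera moves forward down the main aisle"),
        (4.0, 7.5, "camera turns left toward the side hall"),
    ],
)
```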
🍞 Hook: Imagine a hospital where each doctor handles a different part—one for diagnosis, one for delicate surgery. 🥬 The Concept (Mixture-of-Experts): Mixture-of-Experts lets different sub-models specialize—one handles rough global layout; another polishes details—while only one works at a time.
- How it works:
- Split the job into stages (coarse vs. fine).
- Route each timestep to the right expert.
- Keep runtime similar to a single model since only one expert is active.
- Why it matters: Without it, you’d pay extra compute for similar quality or lose quality to go fast. 🍞 Anchor: Early frames lock in room shape; later frames add crisp lantern textures.
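The routing idea fits in a few lines of PyTorch; the threshold and module names below are assumptions for illustration, not the paper's configuration.

```python
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    """Route each denoising step to one of two experts by noise level (illustrative sketch).

    high-noise expert: early, noisy steps  -> global layout and motion logic
    low-noise expert:  late, cleaner steps -> fine textures and small movements
    Only one expert runs per step, so per-step cost stays close to a single model.
    """

    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 switch_sigma: float = 0.5):  # hypothetical threshold, not the paper's value
        super().__init__()
        self.high = high_noise_expert
        self.low = low_noise_expert
        self.switch_sigma = switch_sigma

    def forward(self, x, sigma, cond):
        expert = self.high if sigma >= self.switch_sigma else self.low
        return expert(x, sigma, cond)
```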
🍞 Hook: Think of building a library before writing your own book. 🥬 The Concept (Data Acquisition Engine): A data engine gathers and organizes videos, actions, and camera poses from real footage, games, and Unreal Engine renders.
- How it works:
- Collect varied videos and game recordings with W/A/S/D and camera data.
- Render synthetic scenes with known camera paths.
- Filter for quality and add hierarchical captions.
- Why it matters: Without matched actions and geometry, the model can’t learn true cause and effect. 🍞 Anchor: The system learns that pressing W in a hallway should make the doorway grow larger over time.
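One way to imagine a single aligned training record from this engine (field names and types are assumptions; the actual pipeline may differ):

```python
from typing import List, TypedDict

class WorldSample(TypedDict):
    """One aligned training record (illustrative schema; field names are assumptions)."""
    source: str                      # "web_video" | "game_capture" | "unreal_render"
    frames: List[str]                # paths to extracted frames
    keys: List[List[int]]            # per-frame multi-hot over [W, A, S, D]; empty for web videos
    camera_pose: List[List[float]]   # per-frame extrinsics when known (always known for UE renders)
    captions: dict                   # narrative / scene-static / dense-temporal, as above
```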
🍞 Hook: Reducing a big recipe to a quick weekday version without losing taste. 🥬 The Concept (Few-Step Distillation): Few-step distillation trains a fast student model to mimic a slow, careful teacher so it can generate high-quality frames in just a few steps.
- How it works:
- Use the teacher’s guidance signals.
- Train the student to make similar choices quickly.
- Practice on the student’s own rollouts so it can fix its small mistakes.
- Why it matters: Without distillation, real-time response would be too slow. 🍞 Anchor: The fast model keeps the temple’s look and correct walk-forward motion while running live.
🍞 Hook: A conductor keeps the orchestra in time—they can’t play future notes yet! 🥬 The Concept (Causal Attention): Causal attention enforces that each new frame only depends on now and the past, not the future.
- How it works:
- Split time into chunks.
- Within a chunk, smooth bidirectional attention keeps local consistency.
- Across chunks, only past influences the future.
- Cache previous info to go fast.
- Why it matters: Without it, you can’t safely stream or interact in real-time. 🍞 Anchor: You press D to sidestep right; the next frame updates instantly based on your past view, not a hidden future frame.
03 Methodology
At a high level: Inputs (image or video + actions + text) → Stage I Pre-training (learn to paint) → Stage II Middle-training (learn memory and control) → Stage III Post-training (go causal and fast) → Output (interactive, consistent video).
Stage I: Pre-training (establish the canvas)
- What happens: Start from a powerful image-to-video diffusion foundation (Wan2.2, ~14B params) trained on diverse videos so the model already knows textures, motion basics, and object permanence.
- Why it exists: If the model can’t paint beautifully and coherently, everything later will look wobbly.
- Example: Feed a single outdoor photo; it produces a short, realistic pan with smooth lighting changes.
Stage II: Middle-training (add long memory and action control)
- Build a fundamental world model with long context
- What happens:
- Use longer clips (from ~5 s to ~60 s) so the model sees events reappear and must stay consistent.
- Increase weight on early (high-noise) steps that set global structure to reduce long-term drift.
- Train on both image-to-video (predict after a single frame) and video continuation (extend a clip) so it can start from anywhere.
- Why it exists: To keep the temple doors, statue, and lighting stable as you roam around for minutes.
- Example: Walk away from a bridge, come back 45 seconds later; it appears in the right spot and size.
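The idea of up-weighting high-noise steps can be sketched as a weighted denoising loss; the linear ramp and boost factor below are illustrative assumptions, since the paper does not publish its exact weighting schedule.

```python
import torch

def weighted_denoising_loss(pred, target, sigma, high_noise_boost: float = 2.0):
    """Denoising loss that up-weights high-noise (early) steps.

    sigma is the per-sample noise level in [0, 1]: 1 = almost pure noise (these steps
    set the global layout), 0 = nearly clean frames. The linear ramp and the boost
    factor are illustrative, not the paper's schedule.
    """
    per_sample = ((pred - target) ** 2).flatten(1).mean(dim=1)   # MSE per clip
    weight = 1.0 + (high_noise_boost - 1.0) * sigma              # heavier weight near sigma = 1
    return (weight * per_sample).mean()
```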
- Add controllability via action injection
- What happens: Actions include discrete keys (W/A/S/D) and continuous camera turns. Camera motion is encoded with Plücker-inspired embeddings; keys become multi-hot vectors. These are fused and injected into the model with Adaptive LayerNorm (AdaLN) to scale/shift internal features.
- Why it exists: Without action modulation, motion would be random rather than following your inputs.
- Example: Hold W and turn the mouse right; the hallway expands forward and rotates smoothly.
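The paper calls the camera encoding "Plücker-inspired"; the standard Plücker ray construction it likely builds on looks roughly like this (a minimal PyTorch sketch assuming a camera-to-world pose matrix and pinhole intrinsics).

```python
import torch
import torch.nn.functional as F

def plucker_rays(K: torch.Tensor, cam2world: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker embeddings (direction, origin x direction) for one camera.

    K: (3, 3) pinhole intrinsics; cam2world: (4, 4) pose. Returns (H, W, 6).
    Encoding each viewing ray this way captures both rotation and translation,
    which is one common way to feed camera motion to a video model.
    """
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3) pixel centers
    dirs_cam = pix @ torch.linalg.inv(K).T                             # back-project to camera space
    R, t = cam2world[:3, :3], cam2world[:3, 3]
    dirs = F.normalize(dirs_cam @ R.T, dim=-1)                         # world-space ray directions
    origin = t.expand_as(dirs)                                         # camera center for every pixel
    moment = torch.linalg.cross(origin, dirs, dim=-1)                  # o x d
    return torch.cat([dirs, moment], dim=-1)                           # (H, W, 6)
```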
- Keep training feasible
- What happens: Training a ~28B parameter Mixture-of-Experts (two ~14B experts, one active per step) on minute-long videos requires memory tricks: Fully Sharded Data Parallel (FSDP2) splits parameters/states across GPUs; Ulysses context parallel slices long sequences so attention fits.
- Why it exists: Otherwise, GPU memory would run out before learning long horizons.
- Example: A 60-second, 16 fps clip becomes sequence shards across machines to train in parallel.
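A very rough sketch of the two memory tricks (not the actual training code; the paper uses FSDP2 and Ulysses, while the snippet below uses the classic FSDP wrapper and plain tensor chunking to illustrate the idea):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# 1) Shard parameters, gradients, and optimizer state across GPUs.
#    (Assumes the process group has already been initialized.)
def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    return FSDP(model)  # each rank holds only a slice of the weights

# 2) Context parallelism: each rank keeps only its slice of the long token sequence
#    so minute-long videos fit in memory. Ulysses-style attention then uses
#    all-to-all exchanges so every rank still attends over the full sequence
#    (that communication step is omitted here).
def shard_sequence(tokens: torch.Tensor) -> torch.Tensor:
    rank, world = dist.get_rank(), dist.get_world_size()
    return tokens.chunk(world, dim=1)[rank]  # (B, T, D) -> (B, T / world, D)
```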
🍞 Hook: Two teammates—one sketches the scene, the other inks fine lines. 🥬 The Concept (Mixture-of-Experts): Two experts share the load: a high-noise expert locks in layout and motion logic; a low-noise expert polishes textures and small movements, but only one runs per step, keeping speed similar to a single model.
- Why it matters: You get world-scale stability without paying double at runtime. 🍞 Anchor: The room shape stays correct first; then lantern tassels flutter realistically.
🍞 Hook: A volume knob that changes the music based on your moves. 🥬 The Concept (Action Control Modulation): The model converts your inputs into feature tweaks via AdaLN, steering generation frame by frame.
- Why it matters: Without it, pressing A wouldn’t consistently move you left. 🍞 Anchor: Tap S; the scene backs up with correct parallax.
Stage III: Post-training (go causal, few-step, and real-time)
- Causal architecture adaptation
- What happens: Replace full bidirectional temporal attention with block causal attention: local two-way attention inside small chunks for smoothness; strict past→future across chunks for autoregressive rollout. Initialize from the high-noise expert (better for dynamics). Add KV caching to avoid recomputing history.
- Why it exists: To support live, stepwise generation that can respond under a second.
- Example: As you stream forward, only the newest chunk computes fresh attention; older chunks are cached.
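Streaming generation with a KV cache can be pictured as the loop below; `init_cache` and `generate_chunk` are hypothetical interfaces used only to illustrate the control flow, not the released API.

```python
import torch

@torch.no_grad()
def stream_rollout(model, first_frame, get_user_action, num_chunks: int, frames_per_chunk: int = 4):
    """Chunk-by-chunk causal generation with a KV cache (control-flow sketch only).

    `model.init_cache` and `model.generate_chunk` are hypothetical: generate_chunk
    denoises the next few frames in a handful of steps, conditioned on the latest
    user action, attending only to cached keys/values from past chunks.
    """
    kv_cache = model.init_cache(first_frame)        # encode the starting frame once
    video = [first_frame]
    for _ in range(num_chunks):
        action = get_user_action()                  # e.g. {"keys": ["W"], "cam": (dx, dy)}, read live
        chunk, kv_cache = model.generate_chunk(
            action=action,
            kv_cache=kv_cache,                      # older chunks are reused, never recomputed
            num_frames=frames_per_chunk,
        )
        video.append(chunk)                         # each chunk can be streamed out as it is made
    return torch.cat(video, dim=0)
```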
- Few-step distillation with long-horizon self-rollouts
- What happens:
- Distribution Matching Distillation (DMD) teaches the fast student to follow the slow teacher’s guidance with just a few denoising steps.
- Self-rollout: the student practices on its own outputs and learns to recover from small mistakes that accumulate over time.
- A light adversarial head (on the fake score network) brings in real-data supervision to boost realism beyond what the teacher alone provides.
- Why it exists: To keep quality high while making the system responsive and stable over minutes.
- Example: Ten-minute tours that keep style and layout steady, with minor drift only late.
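A heavily simplified picture of one post-training update, combining the DMD-style term, self-rollout, and the light adversarial term; all module interfaces here are hypothetical, and guidance scales, noise schedules, and the exact adversarial head are omitted.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, fake_score, discriminator, start_frame, actions,
                 adv_weight: float = 0.1):
    """One simplified post-training update combining the three ingredients above."""
    # 1) Few-step generation from the fast student being trained.
    fake = student.few_step_generate(start_frame, actions)

    # 2) DMD-style term: move the student's samples toward the teacher's score
    #    and away from the current "fake" score estimate.
    with torch.no_grad():
        direction = teacher.score(fake, actions) - fake_score.score(fake, actions)
    dmd_loss = F.mse_loss(fake, (fake + direction).detach())

    # 3) Self-rollout: continue from the student's own (imperfect) last frame so the
    #    errors it sees in training match the errors it will make at inference time.
    rollout = student.few_step_generate(fake[:, -1:].detach(), actions)

    # 4) Light adversarial term: real-data supervision beyond what the teacher provides.
    adv_loss = -discriminator(rollout).mean()

    return dmd_loss + adv_weight * adv_loss
```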
🍞 Hook: Only count moves you’ve already made. 🥬 The Concept (Causal Attention): The model forbids peeking at future frames, enabling real-time interactivity and KV caching.
- Why it matters: Without it, you can’t safely map inputs to immediate changes. 🍞 Anchor: Press W and see forward motion update now, not after rendering a whole future clip.
🍞 Hook: Turning a slow, careful recipe into a fast weeknight classic. 🥬 The Concept (Few-Step Distillation): Train a speedy student to imitate the slow teacher’s choices, and toughen it with self-rollouts and a gentle discriminator so it holds up over long runs.
- Why it matters: Without it, either quality drops or latency is too high. 🍞 Anchor: The student keeps the temple’s ornate doors crisp while running at ~16 fps.
Secret sauce (why this recipe is clever)
- Layered data + captions decouple what should stay (static scene) from what should change (motion), so the model learns control cleanly.
- MoE gives maximum quality for roughly single-expert runtime.
- Action injection via AdaLN is precise yet safe: it guides without wrecking the pre-trained visual prior.
- Block causal attention + KV caching hit the sweet spot between coherence and speed.
- DMD + self-rollout + adversarial head improves long-horizon stability and visual fidelity beyond plain distillation.
04 Experiments & Results
The tests focused on what matters for an interactive world: does it look great, move richly, stay smooth, remain consistent over time, and react to controls? The team used VBench, a broad video benchmark, and generated 100+ videos, each longer than 30 seconds, to check long-horizon behavior rather than just quick bursts.
Competition included strong recent world/video models such as Yume-1.5 and HY-World 1.5 (with other notable closed models discussed qualitatively). The bar was high: these systems already make good-looking videos and some support interactivity.
Scoreboard with meaning:
- Imaging Quality and Aesthetic Quality: LingBot-World tops both (e.g., 0.6683 and 0.5660), which is like getting an A when others get high Bs. This means textures, lighting, and composition look cleaner and more appealing.
- Dynamic Degree: LingBot-World achieves 0.8857 versus ~0.72–0.76 for baselines—like scoring a 9/10 on action richness when others score around 7.5. This shows the model responds with varied, meaningful motion instead of staying too still or repetitive.
- Motion Smoothness and Temporal Flicker: It matches or stays competitive with the best, meaning movements flow without jitter and frames don’t flash inconsistently.
- Overall Consistency: It maintains top-tier coherence across time, so the story and style don’t break when you pan away and return minutes later.
Qualitative highlights make the numbers feel real:
- Emergent memory: Leave a statue off-screen for up to a minute; when you look back, it’s the same statue in the same spot with the same style. That’s like remembering where you put your backpack and finding it there later.
- Reasoning about unobserved motion: Walk forward while looking elsewhere, then face forward again; the bridge is now closer in a physically correct way. A car that left the frame keeps driving; it reappears where it should—not frozen or teleported.
- Ultra-long tours: The system can sustain coherent scenes up to ten minutes, with only modest drift toward the end. That’s like keeping a classroom’s story consistent across a whole lesson, not just the first paragraph.
- Real-time variant: LingBot-World-Fast runs around 16 fps with sub-second latency (e.g., 480p on a single node, with 720p supported in the stack), while keeping perceptual quality close to the slower teacher.
Surprising findings:
- Long memory without explicit 3D: Even without storing a full 3D mesh, the model’s learned video-space memory kept landmarks consistent, suggesting strong implicit spatial understanding.
- Style shifts with stable geometry: Prompts like “night,” “winter,” or “pixel art” re-style the world while preserving layout and motion. Geometry stays put even as the look changes.
- Distill-and-advise works: Adding a light adversarial head on top of DMD nudged the student to surpass the teacher’s purely distilled quality in some cases, especially in long-horizon crispness.
Takeaway: The results show a rare combo—high visual quality, real interactivity, long memory, and live speed—in a fully open-source package the community can extend.
05 Discussion & Limitations
Limitations (honest and concrete):
- Memory stability: The model’s memory lives inside its context window and emergent representations, not in a dedicated long-term memory module. Very long runs can still drift.
- Compute heavy: To get the best quality and speed, you need strong GPUs and multi-GPU setups. This puts it out of reach for many laptops today.
- Action space: Controls mainly cover navigation and view changes; precise object interactions (e.g., “pick up the blue mug behind the red book”) are not yet reliable.
- Interaction precision: Without fine-grained object grounding, targeted manipulations can be hit-or-miss.
- Generation length & drift: After very long sequences, layouts can slowly morph, especially in highly detailed scenes.
- Single-agent: Multi-agent logic (e.g., coordinating several characters with different goals) isn’t implemented yet.
Resources required:
- Multi-GPU servers for middle-training (long sequences, MoE) using FSDP2 and context parallel.
- Strong single or multi-GPU nodes for the real-time student with causal attention and KV caching.
- Data pipelines for game capture and Unreal Engine renders plus VLM-based captioning.
When not to use:
- If you need precise, object-level manipulation with guarantees (e.g., industrial pick-and-place with millimeter accuracy), a dedicated robotics stack with explicit 3D and controllers may be better.
- If you must deploy on edge devices with tiny compute budgets, current latency/resolution targets may be too heavy.
- If multi-agent social dynamics are essential (e.g., crowd simulations with policies), this version won’t cover that fully.
Open questions:
- Can we add an explicit, compact world memory that persists beyond the context window to eliminate long-horizon drift?
- How do we expand the action space to rich interactions (open, place, pour, write) while keeping quality and speed?
- What’s the best mix of implicit video memory and lightweight 3D structure to get the best of both worlds?
- How can we bring compute costs down so consumer GPUs can run real-time at higher resolutions?
- What standardized benchmarks should measure interactive consistency, causality, and agent performance over 10+ minutes?
06 Conclusion & Future Work
In three sentences: LingBot-World upgrades video generation into an open, interactive world simulator by pairing a layered data engine with a three-stage training path—first learn to paint, then to remember and follow actions, then to respond live with causal attention. It sustains minute-scale consistency, follows controls like W/A/S/D and camera turns, and runs in near real time, while offering promptable world edits and strong scores against state-of-the-art baselines. By releasing code and weights, it invites the community to build richer worlds, better agents, and more accessible tools.
Main achievement: Combining high fidelity, long-horizon consistency, and real-time controllability in a single open-source system—powered by hierarchical captions, MoE specialization, action modulation, block causal attention, and few-step distillation.
Future directions: Add explicit, durable memory to reduce drift; broaden the action space to precise object interactions; support multi-agent dynamics; and optimize for consumer GPUs. Richer evaluation suites for true interactivity and causality over very long horizons will sharpen progress.
Why remember this: It shows how to turn a beautiful daydream into a dependable simulator—by teaching cause and effect, respecting time’s arrow, and slimming a slow genius into a fast, practical partner—so anyone can build and explore persistent, playable virtual worlds.
Practical Applications
- Interactive museum tours where learners can change lighting or weather and navigate freely.
- Rapid game prototyping with live, controllable scenes and promptable world events.
- Robot navigation training (e.g., room touring, doorway finding) in safe simulated spaces.
- Autonomous driving scenario generation with controllable weather, lighting, and traffic events.
- Cinematic previsualization that holds layout and style over long shots.
- Education demos showing physics-like consistency (parallax, occlusion) when moving the camera.
- Data augmentation for 3D reconstruction, turning generated videos into training point clouds.
- Human-in-the-loop agent training where users guide an agent's exploration, then let it continue.
- Accessibility tools that let creators restyle scenes (night, winter, pixel art) while preserving geometry.
- Virtual field trips that respond in real time to student questions and camera moves.