
Yume-1.5: A Text-Controlled Interactive World Generation Model

Intermediate
Xiaofeng Mao, Zhen Li, Chuanhao Li et al. Ā· 12/26/2025
arXiv Ā· PDF

Key Summary

  • Yume1.5 is a model that turns text or a single image into a living, explorable video world you can move through with keyboard keys.
  • It stays fast in real time by squeezing old video memory in three smart ways: over time, across space, and across channels (TSCM).
  • It learns to handle its own small mistakes during long generations by training with its own past outputs (Self-Forcing with TSCM).
  • You can trigger events with text, like ā€œA ghost appears,ā€ and the world reacts while you keep moving the camera.
  • It separates text into two parts—what is happening (event) and how you move (action)—to cut computation and follow controls better.
  • Compared to strong baselines, Yume1.5 follows keyboard instructions much better (a score of 0.836) while running with only 4 diffusion steps.
  • It keeps quality steady over long videos and reaches about 12 frames per second at 540p on a single A100 GPU.
  • A new linear-attention fusion plus adaptive memory downsampling lets it scale without slowing down as videos get longer.
  • The dataset mix (real, synthetic, and curated event data) helps it generalize beyond games to realistic city scenes.
  • Limitations remain (odd motion artifacts, crowded scenes), but the framework points toward richer, controllable virtual worlds.

Why This Research Matters

Interactive, controllable video worlds can change how we design games, films, training, and education: you can describe a place and walk through it instantly. Because Yume1.5 runs fast with long memory, creators can iterate live instead of waiting minutes per change. Text-triggered events let non-experts direct complex scenes—just type what should happen next. This lowers the barrier for small studios, teachers, and hobbyists to build rich experiences. Stable long-video generation also helps robotics and autonomy teams prototype environments safely. As these systems mature, they can become creative sandboxes where anyone can explore ideas and stories in motion.

Detailed Explanation


01Background & Problem Definition

You know how video games let you walk around a city, turn the camera, and see new things as you go? For AI video, doing that on the fly—while staying sharp, smooth, and under your control—has been very hard.

šŸž Hook: Imagine filming a school play. You pay more attention to the actors than to the curtains. 🄬 The Concept: Attention Mechanism

  • What it is: A way for AI to focus on the most important parts of data when making a decision.
  • How it works: (1) Look at everything, (2) score what seems important, (3) give bigger weight to high scores, (4) use that focus to decide the next step.
  • Why it matters: Without attention, the model treats every pixel or word the same and gets confused. šŸž Anchor: When asked ā€œWhat’s the capital of France?ā€, attention focuses the model on ā€œcapitalā€ and ā€œFrance,ā€ leading to ā€œParis.ā€
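
To make the idea concrete, here is a minimal sketch of scaled dot-product attention in plain PyTorch. It is the generic textbook formulation, not Yume-1.5's actual layers, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, tokens, dim). Score every query token against every key token.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)   # important tokens get bigger weights
    return weights @ v                    # output is a weighted mix of the values

q = k = v = torch.randn(1, 8, 64)         # 8 tokens with 64-dim features
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                          # torch.Size([1, 8, 64])
```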

šŸž Hook: Think of painting a picture layer by layer, each layer making the picture clearer. 🄬 The Concept: Video Diffusion Models

  • What it is: A method that starts from noisy frames and gradually denoises them into a coherent video.
  • How it works: (1) Add noise to training videos, (2) teach a model to remove that noise step by step, (3) during generation, begin with pure noise and remove it over a few steps, (4) produce frames that look real and flow over time.
  • Why it matters: This approach creates high-quality, smooth videos, but it can be slow if it takes many steps. šŸž Anchor: It’s like turning a fuzzy picture into a clear one, one careful eraser pass at a time.
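
A toy version of that denoising loop is sketched below. The `denoiser` is a stand-in and the update rule is a simplified Euler-style step; the real model is a large trained diffusion transformer with a proper noise schedule.

```python
import torch

@torch.no_grad()
def generate(denoiser, shape, steps=4):
    """Start from pure noise and peel it away over a few steps."""
    x = torch.randn(shape)                 # (frames, channels, height, width) of latent noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)   # the model guesses what the noise looks like
        x = x - predicted_noise / steps    # remove a fraction of it (simplified update rule)
    return x                               # cleaner latent frames, ready to decode into video

fake_denoiser = lambda x, t: 0.1 * x       # stand-in so the sketch runs; a real one is trained
latents = generate(fake_denoiser, shape=(16, 4, 68, 120), steps=4)
print(latents.shape)
```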

The world before: Early text-to-video models could make pretty clips but struggled to keep scenes consistent for long stretches, to respond in real time, or to follow your controls like a game camera. Many models were trained on game footage, so they didn’t generalize well to real streets, people, and weather. Others needed dozens of slow steps for each new frame, making live exploration feel laggy.

The problem: Make a video world that (1) you can explore with simple keyboard keys, (2) stays coherent for long walks, and (3) responds to text events like ā€œIt starts to rain,ā€ without grinding to a halt.

šŸž Hook: Think of WASD keys in a game: W forward, A left, S back, D right. 🄬 The Concept: Keyboard-Based Camera Control

  • What it is: Using a small set of keys to move through the generated world and turn the camera.
  • How it works: (1) Read key presses (e.g., W+A), (2) map them to a few discrete camera actions (forward-left), (3) feed these actions to the model as part of its instructions, (4) generate the next frames accordingly.
  • Why it matters: Without simple controls, you can’t explore; the model would just pick a path on its own. šŸž Anchor: Pressing W+D makes you drift forward and to the right down a neon-lit street.
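
Below is a tiny illustration of folding key presses into a small, discrete action vocabulary; the specific key combos and IDs are made up for this sketch.

```python
# A finite action vocabulary: each key combination maps to one discrete action ID,
# so the model only ever sees a handful of possible movement commands.
ACTIONS = {
    frozenset(): 0,        # no keys: hold position
    frozenset("w"): 1,     # forward
    frozenset("s"): 2,     # backward
    frozenset("a"): 3,     # left
    frozenset("d"): 4,     # right
    frozenset("wa"): 5,    # forward-left
    frozenset("wd"): 6,    # forward-right
}

def keys_to_action(pressed):
    """Map the currently pressed keys to a discrete action ID for the model."""
    return ACTIONS.get(frozenset(pressed), 0)

print(keys_to_action({"w", "d"}))  # 6 -> drift forward and to the right
```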

Earlier attempts and why they failed:

  • Sliding windows: only remember the last few frames. Result: forgets distant details, breaks story continuity.
  • Heavy history: keep everything. Result: memory and time explode as the video grows.
  • Pose-path control: requires exact camera trajectories at every step. Result: not intuitive and brittle when viewpoints change.

šŸž Hook: Imagine copying a homework answer, then copying your own copy, then copying that copy—smudges pile up. 🄬 The Concept: Error Accumulation Mitigation

  • What it is: Strategies to keep small mistakes from snowballing during long video generation.
  • How it works: (1) Let the model practice using its own outputs as history, (2) teach it to recover from slightly wrong inputs, (3) distill many slow steps into a few fast ones without drifting.
  • Why it matters: Without this, long videos blur, wobble, or go off-track. šŸž Anchor: A 30-second walk stays steady instead of turning into a jittery mess by the end.
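
The toy simulation below only illustrates the snowballing idea: each step inherits the previous step's error, and a model that has learned to damp its own drift stays bounded instead of wandering off.

```python
import random

def rollout(steps=30, damp_drift=False):
    """Toy error model: every step adds a small mistake on top of the previous error."""
    error = 0.0
    for _ in range(steps):
        error += random.gauss(0.0, 0.1)    # a small new mistake each step
        if damp_drift:
            error *= 0.5                   # a model trained on its own outputs learns to self-correct
    return abs(error)

random.seed(0)
print(f"unchecked drift: {rollout():.2f}, damped drift: {rollout(damp_drift=True):.2f}")
```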

šŸž Hook: Filming a whole movie is harder than a 3-second clip. 🄬 The Concept: Long-Video Generation

  • What it is: Creating long, continuous videos that stay consistent in look and motion.
  • How it works: (1) Generate new frames chunk by chunk, (2) summarize old history so it’s useful but light, (3) keep camera actions and events tied to what you’ve seen, (4) prevent quality from decaying over time.
  • Why it matters: Without long-video strength, exploration stalls or the world falls apart. šŸž Anchor: You can walk 30+ seconds through a city and it still feels like the same place.

The gap Yume1.5 fills: a framework that (1) compresses history smartly so speed doesn’t crumble, (2) trains the model to handle its own imperfect past, and (3) adds a clean text channel for world events, all while giving you simple keyboard control.

Real stakes: Faster, controllable video worlds help with virtual tourism, film previsualization, simulation for robotics, training scenarios, education, and creative storytelling. If this works in real time on a single GPU, many more people can create and explore rich, living scenes from plain words.

02Core Idea

The aha: Treat video memory like a packed suitcase—compress it in time, space, and color channels (TSCM), teach the model to practice with its own past (Self-Forcing), and split text into what happens (event) and how you move (action) so worlds stay fast, coherent, and controllable.

Three analogies:

  1. Librarian: Old chapters are summarized into notes (TSCM), you keep reading using your notes (Self-Forcing), and you add sticky tabs for ā€œplot eventsā€ vs ā€œpage-turn actionsā€ (event vs action text).
  2. Backpacking: You carry light versions of old gear (compressed history), train hiking with your own pace (Self-Forcing), and separate the map (events) from footwork (actions) so you don’t waste energy.
  3. Cooking show: Prep bowls condense ingredients (TSCM), the chef rehearses using yesterday’s leftovers (Self-Forcing), and the script marks scene changes (events) apart from camera moves (actions).

šŸž Hook: Think of shrinking old photos so your phone stays fast, but keeping today’s photo full size. 🄬 The Concept: Joint Temporal–Spatial–Channel Modeling (TSCM)

  • What it is: A way to shrink older video history over time, across image space, and across feature channels to keep memory useful yet cheap.
  • How it works: (1) Sample fewer old frames (temporal), (2) downsample old frames more aggressively the farther back you go (spatial), (3) compress feature channels of old tokens and fuse them via linear attention, (4) keep the current frame richly detailed.
  • Why it matters: Without TSCM, memory or speed collapses as videos get longer. šŸž Anchor: The last few seconds stay crisp; a minute ago becomes bite-sized notes that still guide what you see next.
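
Here is a simplified sketch of the temporal and spatial side of TSCM: older latent frames are kept more sparsely and downscaled more aggressively than recent ones. The schedule below is invented for illustration and is not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def compress_history(frames):
    """frames: list of (C, H, W) latents, oldest first. Returns a lightweight memory."""
    memory, n = [], len(frames)
    for i, frame in enumerate(frames):
        age = n - 1 - i                 # 0 = most recent frame
        if age < 2:
            small = frame.unsqueeze(0)  # recent frames stay at full detail
        elif age < 8:
            small = F.interpolate(frame.unsqueeze(0), scale_factor=0.5,
                                  mode="bilinear", align_corners=False)
        elif age % 2 == 0:              # temporal subsampling: keep every other old frame
            small = F.interpolate(frame.unsqueeze(0), scale_factor=0.25,
                                  mode="bilinear", align_corners=False)
        else:
            continue                    # drop the rest of the distant history
        memory.append(small.flatten(2).squeeze(0))  # (C, tokens)
    return memory

history = [torch.randn(4, 64, 64) for _ in range(12)]
print([m.shape for m in compress_history(history)])
```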

šŸž Hook: Practicing piano with your own recordings helps you fix your real mistakes. 🄬 The Concept: Self-Forcing with TSCM

  • What it is: Training the model to generate long videos using its own previously generated frames (not perfect ground truth), while distilling many slow steps into a few fast ones.
  • How it works: (1) Autoregressively generate chunks using compressed history (TSCM), (2) treat those generated frames as the context next time, (3) use a teacher–student distillation to match high-quality trajectories with fewer steps, (4) reduce error buildup.
  • Why it matters: Without this, a 4-step fast sampler drifts badly over time. šŸž Anchor: With Self-Forcing, a 30-second city walk keeps its style and sharpness instead of melting by the final blocks.
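
The sketch below shows the Self-Forcing training idea at a very high level: roll out chunks conditioned on the model's own earlier frames and pull the fast student toward a slow teacher. Everything here is a stub, and the simple MSE-to-teacher loss stands in for the distribution-matching distillation actually used.

```python
import torch

class StubWorldModel:
    """Stand-in for a diffusion transformer; a real model is a trained network."""
    def generate(self, prompt, memory, frames=8, steps=4):
        return torch.randn(frames, 4, 64, 64)   # fake latent chunk

def self_forcing_step(student, teacher, prompt, n_chunks=4):
    history, total_loss = [], torch.tensor(0.0)
    for _ in range(n_chunks):
        memory = history[-4:]                    # stand-in for TSCM-compressed context
        chunk = student.generate(prompt, memory, steps=4)           # fast few-step student
        with torch.no_grad():
            target = teacher.generate(prompt, memory, steps=50)     # slow many-step teacher
        total_loss = total_loss + torch.mean((chunk - target) ** 2) # stand-in distillation loss
        history.extend(chunk.detach().unbind(0)) # the student's OWN frames become context
    return total_loss / n_chunks

print(self_forcing_step(StubWorldModel(), StubWorldModel(), "a neon Tokyo street"))
```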

šŸž Hook: In a game, you can press a key to make something happen, like ā€œspawn fireworks.ā€ 🄬 The Concept: Text-Controlled Event Generation

  • What it is: Letting you type an event (e.g., ā€œA ghost appearsā€) that the model weaves into the ongoing world.
  • How it works: (1) Split text into Event (what happens) and Action (how you move), (2) pre-compute the small, finite Action embeddings, (3) encode the Event just once at the start or when it changes, (4) the model blends both to update the next frames.
  • Why it matters: Without event text, the world can’t react to your story ideas. šŸž Anchor: ā€œIt starts rainingā€ makes puddles appear and people open umbrellas while you still steer with WASD.
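
A sketch of the event/action split with cached action embeddings follows; the encoder here is a stub (the paper uses a T5-style text encoder), and all names are illustrative.

```python
import torch

def encode_text(text):
    """Stub text encoder: deterministic fake embedding per string (real system: T5-style)."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(1, 512, generator=g)

# The action vocabulary is small and finite, so its embeddings are computed once and cached.
ACTION_CACHE = {a: encode_text(f"camera action: {a}")
                for a in ["stay", "W", "A", "S", "D", "W+A", "W+D"]}

def condition(event_text, action, cached_event=None):
    """Re-encode the event only when it changes; actions come straight from the cache."""
    event_emb = cached_event if cached_event is not None else encode_text(event_text)
    return event_emb, ACTION_CACHE[action]

event_emb, action_emb = condition("A ghost appears in the alley.", "W+D")
# Subsequent chunks reuse event_emb until the user types a new event.
```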

Before vs after:

  • Before: Real-time exploration meant cutting history (losing context) or running too slowly; models couldn’t easily follow keyboard actions and had weak text-driven events.
  • After: Compress history smartly, train on your own outputs, and separate event/action text. Now you get fast, steady, controllable worlds that you can also edit on the fly.

Why it works (intuition):

  • Compressing the right parts (old, faraway, wide-channel info) reduces cost without throwing away what matters for the next frame.
  • Practicing on your own outputs closes the gap between training and reality, so small mistakes don’t snowball.
  • Decoupling event vs action text keeps the movement signals cheap and snappy while keeping scene semantics rich.

Building blocks:

  • A diffusion transformer backbone for video.
  • Two text streams: Event (semantic scene) and Action (discrete controls).
  • Temporal–spatial downsampling for old frames and channel-compressed history fused by linear attention.
  • Autoregressive chunking that always conditions on compressed past.
  • Distillation that turns many slow steps into a few fast, robust ones.

šŸž Hook: A flashlight spreads light widely, but some flashlights focus the beam for efficiency. 🄬 The Concept: Linear Attention (variant)

  • What it is: A faster attention trick that scales with sequence length more gently than standard attention.
  • How it works: (1) Map keys/queries through a simple function, (2) compute attention via dot products that avoid the heavy softmax over all pairs, (3) fuse compressed history with current tokens efficiently.
  • Why it matters: Without it, fusing long histories would be too slow. šŸž Anchor: It’s like switching to a focused beam so you can see far ahead without draining the battery.
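
A minimal linear-attention sketch using the common kernel-feature trick (elu(x) + 1) is shown below. It is a generic variant for intuition, not necessarily the exact formulation Yume-1.5 uses to fuse compressed history.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q: (batch, m, d) current tokens; k, v: (batch, n, d) history tokens.
    q, k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature map
    kv = torch.einsum("bnd,bne->bde", k, v)           # summarize keys*values once
    z = 1.0 / (torch.einsum("bmd,bd->bm", q, k.sum(dim=1)) + eps)
    return torch.einsum("bmd,bde,bm->bme", q, kv, z)  # no m x n attention matrix is ever built

q = torch.randn(1, 1024, 64)        # e.g. tokens of the current frame
k = v = torch.randn(1, 4096, 64)    # e.g. compressed history tokens
print(linear_attention(q, k, v).shape)  # torch.Size([1, 1024, 64])
```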

03Methodology

At a high level: Input (text and/or an image) → Encode Event text + precomputed Action controls → Pack and compress old frames (TSCM) → Autoregressively denoise a short chunk of new frames → Update memory and repeat to extend the world.

Step-by-step recipe:

  1. Inputs and control
  • What happens: You provide a prompt (e.g., ā€œA stylish woman walks down a neon Tokyo streetā€) and, optionally, a starting image. You press keys (W/A/S/D, arrows) to move.
  • Why it exists: Without a clear event description and simple actions, the model can’t align motion and scene.
  • Example: Event = ā€œA ghost appears.ā€ Action at this moment = ā€œW + left arrow.ā€ The model should move you forward while turning left, and also add the ghost.
  2. Dual-text encoding (Event vs Action)
  • What happens: Split your text into Event Description (what to generate) and Action Description (keyboard/camera moves). Feed both to a text encoder (like T5). Cache the small set of Action embeddings since they come from a finite vocabulary (e.g., W, A, S, D, combos).
  • Why it exists: This cuts expensive text encoding during ongoing inference and makes action following more reliable.
  • Example: Event: ā€œRain begins softly, street reflections grow.ā€ Action: ā€œW+D.ā€ The model encodes the event once and mixes in cached W+D each frame.
  3. Conditioned latent setup (Image-to-Video or Text-to-Video)
  • What happens: If you gave a starting image, it becomes a condition in the latent space with a mask that says which parts are fixed or to be generated. If only text, the model starts from noise.
  • Why it exists: A first image anchors style, layout, and identity; text-only allows pure imagination.
  • Example: Use a street photo as the first frame so buildings and colors persist while new motion is generated.
  4. Temporal–spatial compression of history (TSCM part 1)
  • What happens: The model keeps many past frames, but samples fewer of the oldest ones and downscales them more (e.g., last 2 frames lightly downsampled; 7–23 frames ago, downsampled more). Recent frames stay detailed.
  • Why it exists: Without this, old context becomes too heavy; with naive truncation, you forget useful older clues.
  • Example: The cafĆ© you passed 10 seconds ago is a small but still-recognizable memory token, so the street layout remains coherent.
  5. Channel compression + linear attention fusion (TSCM part 2)
  • What happens: In parallel, old-frame features are squished in the channel dimension and merged with current-frame tokens using linear attention, then projected back to the normal size.
  • Why it exists: Reduces compute cost where attention is most expensive (lots of tokens), while keeping the gist of old content.
  • Example: The glows from distant neon signs are captured as compact features that still influence lighting continuity.
  6. Autoregressive chunk generation
  • What happens: The model generates a short block of new frames (e.g., a few tenths of a second), conditioned on compressed history plus current Event+Action. Then it appends the new block to history and repeats.
  • Why it exists: Chunking makes the process stable and responsive to controls.
  • Example: Hold W for 2 seconds; the camera advances smoothly over four chunks, maintaining scene geometry.
  7. Self-Forcing with distillation for speed
  • What happens: During training, the model uses its own generated frames as history (not ground-truth frames) to learn to recover from small errors. A teacher–student distillation then compresses many denoising steps into just a few (e.g., 4 steps) while matching high-quality outputs.
  • Why it exists: Few-step samplers are fast but tend to drift; training on your own outputs plus distillation makes few steps robust.
  • Example: After 20 seconds, the scene is still steady, not washed out, even at 4 steps per chunk.
  8. Memory management with adaptive downsampling
  • What happens: As history grows, the system keeps applying stronger compression to older frames and lighter compression to newer ones. Timing stays nearly constant per step once beyond a few chunks.
  • Why it exists: Prevents slowdowns as the video gets long.
  • Example: Whether you’re at 10 or 25 seconds, inference time per step is nearly the same.
  9. Training data mix
  • What happens: Alternate batches from (a) a real walking dataset (Sekai-Real-HQ with derived keyboard/camera labels), (b) a high-quality synthetic set (OpenVid filtered with VBench scores), and (c) a curated event dataset (e.g., rain, UFO, dragons) built from I2V synthesis and human screening.
  • Why it exists: Real data improves controllability and realism, synthetic data preserves general video skill, and event data teaches text-driven happenings.
  • Example: The model learns real street motion, keeps broad generative talent, and obeys ā€œpeople move aside to avoid sprinkler.ā€
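
Putting the recipe above together, here is a simplified interactive generation loop. Every component is a stub standing in for the real diffusion transformer, text encoder, and TSCM modules, so this only shows how the pieces connect.

```python
import torch
import torch.nn.functional as F

def encode_text(text):                       # stub for a T5-style text encoder
    return torch.randn(1, 512)

def compress(history, keep_recent=4):        # crude stand-in for TSCM compression
    old = [F.avg_pool2d(f.unsqueeze(0), 4).squeeze(0) for f in history[:-keep_recent]]
    return old[::2] + history[-keep_recent:] # shrink old frames, drop every other one

class StubWorldModel:                        # stand-in for the diffusion transformer
    def denoise_chunk(self, memory, event_emb, action_emb, frames=8, steps=4):
        return torch.randn(frames, 4, 68, 120)

def explore(model, event_text, key_stream, n_chunks=4):
    event_emb = encode_text(event_text)      # the event is encoded once and reused
    action_embs = {a: encode_text(a) for a in set(key_stream)}  # tiny cached action vocab
    history = []
    for i in range(n_chunks):
        action = key_stream[min(i, len(key_stream) - 1)]
        memory = compress(history)
        chunk = model.denoise_chunk(memory, event_emb, action_embs[action])
        history.extend(chunk.unbind(0))      # generated frames become the next chunk's context
        yield chunk                          # a real system would decode and display these frames

for chunk in explore(StubWorldModel(), "Rain begins softly.", ["W", "W+D", "D", "A"]):
    print(chunk.shape)
```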

The secret sauce:

  • TSCM: A two-pronged compression (temporal–spatial + channel) that is tailored to where attention is most costly and what matters most for the next frames.
  • Self-Forcing + distillation: Close the train–test gap and make a few steps act like many.
  • Dual-text stream: Precompute action signals and encode events sparingly, delivering both control fidelity and speed.

Concrete data example:

  • Input text: ā€œA bright summer day in a European city. People are walking. New event: a street sprinkler turns on; people step aside.ā€
  • Actions over 2 seconds: W → W+D → D → S (stop) → A.
  • Output: The camera moves forward and slightly right, then right, then slows and shifts left. Meanwhile, the sprinkler starts; nearby pedestrians react by stepping aside—consistent with your event text.

04Experiments & Results

The test: Can Yume1.5 both follow keyboard-like instructions and keep video quality high over long stretches, while running fast? The team used Yume-Bench and VBench-style metrics to measure instruction following, subject/background consistency, motion smoothness, aesthetics, and image quality.

The competition: Baselines included Wan-2.1 (a strong, general video model) and MatrixGame (an interactive world model), plus the earlier Yume. These are good yardsticks because they either do high-quality video or enable interactivity, but not always both in realistic scenes.

Scoreboard with context:

  • Instruction Following (how well the model obeys movement): Yume1.5 hits 0.836. Think of that like getting an A when others are closer to C+/Bāˆ’. The earlier Yume trails at 0.657; Wan-2.1 and MatrixGame are much lower in this real-world test.
  • Quality metrics (subject/background consistency, smoothness, aesthetics, image quality): Yume1.5 stays competitive with the best open models and improves steadiness over time, especially with Self-Forcing + TSCM.
  • Speed: Yume1.5 uses only 4 denoising steps and still reaches about 12 fps at 540p on a single A100 GPU, while others often need 20–50 steps per frame and slow down.

Long-video stability:

  • Using 30-second sequences split into 5-second segments, Yume1.5 trained with Self-Forcing + TSCM keeps aesthetics and image quality more stable in later segments (4th–6th). For example, the final segment’s aesthetic score rises from ~0.442 (without the method) to ~0.523 (with it), and image quality from ~0.542 to ~0.601.
  • Inference-time stability: As more history accumulates, TSCM maintains nearly flat per-step timing beyond a moderate number of chunks. Full-context methods slow down sharply; naive spatial compression fluctuates more.

Ablations that matter:

  • Removing TSCM and reverting to older spatial-only packing lowers instruction following (0.767 vs 0.836). TSCM likely reduces bias from old motion directions and provides richer but efficient history.
  • Keeping Self-Forcing + TSCM boosts late-segment quality consistency versus training without it.

Surprising findings:

  • Four steps can be enough: With the right distillation and training on self-generated history, Yume1.5 maintains quality that typically requires many more steps.
  • Event control works with minimal extra data: By splitting event/action text and mixing a small curated event dataset, the model gains text-driven happenings without huge specialized corpora.

Takeaway from numbers: Yume1.5 meaningfully upgrades controllability and runtime without giving up image quality. It’s like switching from a fancy camera that takes great stills but lags in video, to a camcorder that’s both sharp and responsive for live action.

05Discussion & Limitations

Limitations:

  • Motion oddities appear: vehicles may move backward, or people may seem to walk strangely, especially in dense crowds.
  • High-resolution stress: Going from 540p to 720p helps, but artifacts can persist due to model capacity limits (5B parameters).
  • Domain extremes: Highly unusual or cluttered scenes can still trip the model’s long-range consistency.

Required resources:

  • A single A100 GPU can run real-time-ish at 540p with 4 steps. Higher resolutions or longer contexts benefit from more memory and compute.
  • Precomputed action embeddings and efficient text encoders help keep CPU/GPU overhead down; caching is important.

When not to use:

  • If you need pixel-accurate physics or strict 3D geometry constraints (e.g., engineering simulation), this generative approach won’t be precise enough.
  • For extremely crowded, fast-changing scenes where tiny details must be right, artifacts may be noticeable.
  • If absolute reproducibility is required across runs, stochastic generation may be unsuitable.

Open questions:

  • Can a Mixture-of-Experts backbone keep latency low while boosting capacity to reduce artifacts?
  • How far can event complexity go—multi-step, conditional chains like ā€œit rains, then the power goes out, then traffic formsā€ā€”without new data or modules?
  • Can we tighten geometry and physics with lightweight constraints or differentiable scene priors while preserving speed?
  • How best to blend memory beyond vision (e.g., audio cues or symbolic world state) without slowing inference?
  • What’s the right balance of synthetic vs real data to generalize globally (different cities, weather, crowds) without drift?

06Conclusion & Future Work

In three sentences: Yume1.5 generates interactive video worlds from text or a single image, letting you move with simple keys while new events unfold on command. It stays fast and steady by compressing old history in time, space, and channels (TSCM), and by training on its own outputs with distillation so four denoising steps suffice. Splitting text into event and action gives controllability without bogging down the system.

Main achievement: A practical recipe for long, controllable, real-time-ish world generation—TSCM + Self-Forcing + dual-text encoding—delivering strong instruction following and stable quality.

Future directions:

  • Scale with smarter backbones (e.g., Mixture-of-Experts) to reduce artifacts without raising latency.
  • Richer multi-event narratives and bidirectional edits (add, remove, or transform events on the fly).
  • Integrations with simulators, audio, and lightweight physics for deeper realism.

Why remember this: It shows that you don’t have to pick between speed, length, and control—by compressing history wisely and training with your own past, you can explore living worlds in real time and still tell them what should happen next.

Practical Applications

  • Film previsualization: Directors can describe scenes and move cameras live to block shots before real filming.
  • Game prototyping: Designers sketch worlds with text and test movement and pacing using WASD control.
  • Virtual tourism and education: Teachers and students explore historical streets or science settings by typing prompts and walking around.
  • Training simulations: Emergency response or driving scenarios with text-triggered events like rain, crowds, or obstacles.
  • Storyboarding with motion: Authors create moving scenes by adding events (e.g., ā€œlanterns light upā€) while steering the viewpoint.
  • Robotics sim setup: Quickly generate diverse, realistic urban walks to test navigation strategies.
  • Advertising and social content: Rapidly produce on-theme city walks or storefront tours guided by brand prompts.
  • UX research: Prototype AR/VR navigation flows in synthetic cities without full engine builds.
  • Architectural walkthroughs: Turn design descriptions into explorable paths for early feedback.
  • Accessibility tools: Give voice or text commands to explore visual scenes for users who can’t easily control complex interfaces.
Tags: interactive world generation, video diffusion, temporal-spatial-channel modeling, linear attention, autoregressive video, self-forcing, distribution matching distillation, text-controlled events, keyboard camera control, context compression, long-video generation, real-time inference, DiT (diffusion transformer), VBench, Yume-Bench