The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Key Summary
- WorldCanvas lets you make videos where things happen exactly how you ask by combining three inputs: text (what happens), drawn paths called trajectories (when and where it happens), and reference images (who it is).
- Text alone is too vague for complex events, so this system ties each short motion sentence to a specific path you draw, fixing confusion in multi-character scenes.
- Trajectories aren't just lines; they also encode speed, timing, and visibility (on-screen or hidden), which the model reads to control motion precisely.
- Reference images keep identities stable, so the same dog, person, or car looks consistent even if it leaves the frame and returns later.
- A special focusing trick, Spatial-Aware Weighted Cross-Attention, helps each caption pay attention to the exact area of its matching trajectory.
- A curated dataset of 280k triplets (trajectory, video, and motion-centric text) teaches the model to follow paths, respect visibility, and handle entries/exits.
- Compared to strong baselines (Wan 2.2 I2V, ATI, Frame In-N-Out), WorldCanvas better follows paths, matches text to the right mover, and preserves subject identity.
- The system shows emergent "visual memory": it remembers who is who and what the scene looks like, even across occlusions or off-screen moments.
- This makes world models feel interactive: instead of just predicting the future, they let you direct it like a movie maker.
- The method is practical for education, sports analysis, filmmaking, game prototyping, and robotics, where precise and controllable events matter.
Why This Research Matters
WorldCanvas lets people direct videos with precision, which is vital for safe simulations in driving, robotics, and training. Teachers can build clear visual stories for science or history by drawing paths and attaching short captions to show cause and effect. Filmmakers and game designers can quickly prototype complex scenes with specific characters while keeping visual continuity. Sports analysts can recreate plays accurately by tracing player movement and attaching intentions like "cuts left" or "sets screen." Accessibility improves because creators can control visuals more directly without deep technical skills. Research on world models benefits from a practical interface that turns prediction into interaction. Overall, this bridges imagination and execution by making complex events easy to author and faithful to the plan.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you plan a school play, it's not enough to say "Make it exciting"? You need to say who walks where, when they enter, and exactly what they do.
🥬 Filling (The Actual Concept):
- What it is: Before this paper, many video AIs were like weather reporters: they described or predicted what might happen, but they didn't let you precisely direct who does what, where, and when.
- How it works (the old world):
- Text-only video models let you write a prompt like "a dog chases a ball," and they try to make a matching clip.
- Some models let you draw a rough path (a trajectory), but they ignored important details like when the object appears or disappears, or how fast it moves.
- Keeping a character's look consistent (the same dog staying the same dog) was hard, especially if it left the frame and came back.
- Why it matters: Without precise control, videos can be vague, characters can get mixed up, and complex multi-character scenes fall apart.
🍞 Bottom Bread (Anchor): Imagine telling an AI, "Two kids meet at the center; the girl on the left waves, the boy on the right spins." A text-only system might swap the kids, miss the wave, or make both spin. It needs more than words to follow your plan.
🍞 Top Bread (Hook): Picture giving a friend directions: "Walk along this path, speed up here, hide behind that tree, then pop back out." If they only hear the words but never see the map, things go wrong.
🥬 Filling (The Problem):
- What it is: Text is too fuzzy for complex, timed, and spatial actions. It doesn't say exactly where, when, or how fast things happen.
- How it works (what people tried and why it failed):
- Global prompts: One big sentence for the whole video. This can't assign one instruction to each actor.
- Bare-bones trajectories: Just dots on a screen, losing timing, speed changes, and visibility (entry/exit or occlusion).
- Weak reference control: Even with a photo of your character, models struggled to keep that exact identity during motion.
- Why it matters: When several agents interact (say, a car brakes so a pedestrian can pass), it's easy for models to misunderstand who should slow down and where the stopping happens.
🍞 Bottom Bread (Anchor): If you want "the old man steps back as the car brakes," but the model makes the man chase the car or forgets to brake, you don't get your story.
🍞 Top Bread (Hook): Imagine your world as a canvas where you can paint not just pictures, but events: drawing lines for movement, pinning photos for characters, and writing short notes for actions.
🥬 Filling (The Gap and the New Direction):
- What it is: The missing piece was a simple, unified way to specify who, what, when, and where, all together.
- How it works: Combine three inputs (trajectories for when/where, text for what, and reference images for who) so the model gets an unambiguous plan.
- Why it matters: With all three, the AI can follow precise motion, know which actor is which, and keep identity and scene consistent over time.
🍞 Bottom Bread (Anchor): You draw two paths (one fast, one slow), attach a tiny caption to each ("girl waves," "boy spins"), and pin their photos. Now the model knows exactly who does what, where, and when.
🍞 Top Bread (Hook): Think of a video like a relay race. If the baton (identity and motion plan) is dropped even once, the whole race gets messy.
🥬 Filling (Real Stakes):
- What it is: We need controllable, consistent video events for education, movies, games, robotics, and research.
- How it works: Precision shaping of scenes means you can simulate traffic safely, storyboard films quickly, and prototype robot behaviors before real tests.
- What breaks without it: Misunderstood actions, swapped roles, off-timed events, and identity drift ruin realism and usefulness.
🍞 Bottom Bread (Anchor): A driving school can test "a cyclist enters from the right while a car brakes" without ever risking real people, provided the simulation does exactly what's asked.
02 Core Idea
🍞 Top Bread (Hook): Imagine directing a cartoon by drawing paths for characters, sticking their photos on the canvas, and writing short sticky notes for each: the world follows your plan.
🥬 Filling (The Aha! in One Sentence): The key insight is to fuse three inputs (text for what, trajectories for when/where, and reference images for who) and bind each mini-text to its matching path using a spatially aware attention trick, so complex events become unambiguous and controllable.
Multiple Analogies (3 ways):
- Orchestra analogy: Trajectories are the sheet music timing and notes (when/where), text is the conductor's markings (what expression), and reference images are the instrument types (who). The attention trick keeps each musician following their own part.
- Treasure map analogy: Paths are the dotted lines, captions are the clues, and reference images are snapshots of the treasure. Spatial attention is the compass ensuring the right clue matches the right spot on the map.
- Comic book analogy: Panels show positions (trajectories), speech bubbles tell actions (text), and character designs define who's speaking (reference image). Spatial attention makes sure each bubble belongs to the right character.
Before vs. After:
- Before: One big prompt tried to steer a whole crowd; roles got swapped, timing was fuzzy, identities drifted.
- After: Each actor has a personal path and caption; the model locks the right words to the right mover and preserves appearance, even across exits and re-entries.
Why It Works (intuition, no equations):
- The model sees where each path is on the screen and boosts attention there for the matching caption. This reduces confusion when multiple agents look similar.
- The path points aren't just dots: they hint at speed (point spacing), timing (when a segment starts), and visibility (on/off-screen). The model reads these hints to choreograph motion.
- Reference images anchor identity visually, so the same subject remains the same subject later.
Building Blocks:
- 🍞 Hook: You know how traffic lights tell cars when to go, slow, or stop?
- 🥬 The Concept (Multimodal Triplet):
- What it is: A trio of inputs (trajectory, text snippet, and reference image) for each subject.
- How it works: For each subject, pair a short motion caption with one path and one image; repeat for all subjects; the model fuses all trios together.
- Why it matters: No more guessing; each actor gets clear, local instructions.
- 🍞 Anchor: Two kids, two paths, two tiny captions. No mix-ups.
- 🍞 Hook: Think of a spotlight on stage that follows the right actor.
- 🥬 The Concept (Spatial-Aware Weighted Cross-Attention):
- What it is: A focusing mechanism that makes the caption pay extra attention to the video tokens near its path.
- How it works: For each path, we give a gentle boost to attention inside its moving box region, linking local words to local pixels over time.
- Why it matters: Prevents action swapping in multi-actor scenes.
- 🍞 Anchor: The caption "girl waves" lights up the girl's region along her path, not the boy's.
- 🍞 Hook: Imagine pinning a photo of your hero onto the first frame.
- 🥬 The Concept (Reference Images):
- What it is: A picture that defines who the subject is.
- How it works: The user can place and scale the image; the model keeps that identity consistent through motion, occlusion, and re-entry.
- Why it matters: No more random face or outfit changes.
- 🍞 Anchor: Your specific puppy photo stays your puppy, even if it runs off-screen and returns.
- 🍞 Hook: Like drawing a race track with tighter dots in slow sections and wider dots in fast sections.
- 🥬 The Concept (Trajectory Control):
- What it is: A user-drawn sequence of points encoding when/where and how fast something moves, plus when it is visible.
- How it works: Equal time steps between points; spacing controls speed; visibility flags model occlusion and entry/exit.
- Why it matters: Exact choreography replaces guesswork.
- 🍞 Anchor: A car slows for a corner (dense points), then speeds up on the straight (sparse points). (A small code sketch of these building blocks follows.)
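To make these building blocks concrete, here is a minimal Python sketch of how the per-subject inputs could be organized and how point spacing implies speed. The class and field names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrajectoryPoint:
    """One sample of a user-drawn path at a fixed time step."""
    x: float               # horizontal position in the frame
    y: float               # vertical position in the frame
    visible: bool = True   # False while the subject is occluded or off-screen

@dataclass
class SubjectTriplet:
    """Hypothetical container for one subject's (trajectory, text, reference) trio."""
    caption: str                           # short, motion-centric text ("girl waves")
    trajectory: List[TrajectoryPoint]      # equal time steps; spacing encodes speed
    reference_image: Optional[str] = None  # path to the identity-defining image

def average_speed(traj: List[TrajectoryPoint]) -> float:
    """Speed falls out of point spacing because the time steps are uniform."""
    dists = [((b.x - a.x) ** 2 + (b.y - a.y) ** 2) ** 0.5
             for a, b in zip(traj, traj[1:])]
    return sum(dists) / len(dists) if dists else 0.0

# Two subjects, two paths, two tiny captions: no mix-ups.
girl = SubjectTriplet(
    caption="girl waves",
    trajectory=[TrajectoryPoint(0.20, 0.80), TrajectoryPoint(0.40, 0.60), TrajectoryPoint(0.50, 0.50)],
    reference_image="girl.png",
)
boy = SubjectTriplet(
    caption="boy spins",
    trajectory=[TrajectoryPoint(0.80, 0.80), TrajectoryPoint(0.60, 0.60), TrajectoryPoint(0.50, 0.50)],
    reference_image="boy.png",
)
print(round(average_speed(girl.trajectory), 3))  # larger value = faster subject
```

Because the time steps are uniform, denser points automatically mean slower motion, which is exactly the race-track-dots intuition above.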
03 Methodology
High-Level Recipe: Input → [Draw trajectories + attach captions + place reference images] → [Encode paths and images into model-friendly signals] → [Guide the video generator with spatially aware attention] → Output video that follows your plan.
Step A: Build the Multimodal Triplets (Data Curation Pipeline) 🍞 Hook: Imagine sorting your LEGO by color and size before building a castle: it makes building fast and accurate. 🥬 The Concept (Data Curation Pipeline):
- What it is: A process to prepare clean examples of (trajectory, reference image, motion text) for training.
- How it works:
- Split raw videos into clean shots.
- Detect foreground objects and mask them.
- Pick a few keypoints per object and track them across frames to form trajectories, including visibility scores.
- Randomly crop to simulate objects entering/exiting the frame.
- Generate motion-focused captions using trajectory-visualized clips so text matches movement, not just looks.
- Extract reference images from the first frame and lightly transform them (move/scale/rotate) so the model learns flexible placement.
- Why it matters: The model learns exactly how text, paths, and pictures relate. 🍞 Anchor: You get pairs like "blue path = woman walks in and waves" plus her photo, not a vague global sentence. (The random-crop step is sketched in code below.)
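One part of this pipeline, the random crop that simulates entries and exits, can be sketched directly. This is a hypothetical helper under assumed conventions (pixel coordinates, a fixed crop size), not the authors' code:

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels

def crop_and_flag_visibility(track: List[Point], frame_w: int, frame_h: int,
                             crop_w: int, crop_h: int, seed: int = 0) -> Tuple[List[Point], List[bool]]:
    """Sketch of the random-crop augmentation: points that fall outside the
    crop window are flagged invisible, simulating a subject entering or
    leaving the (cropped) frame."""
    rng = random.Random(seed)
    x0 = rng.randint(0, frame_w - crop_w)   # top-left corner of the crop
    y0 = rng.randint(0, frame_h - crop_h)
    cropped, visible = [], []
    for x, y in track:
        inside = (x0 <= x < x0 + crop_w) and (y0 <= y < y0 + crop_h)
        cropped.append((x - x0, y - y0))    # coordinates relative to the crop
        visible.append(inside)
    return cropped, visible

# A subject walking left to right: after cropping, some points end up "off-screen".
track = [(20.0 + 30 * t, 100.0) for t in range(10)]
pts, vis = crop_and_flag_visibility(track, frame_w=320, frame_h=240, crop_w=160, crop_h=160)
print(vis)
```

Points falling outside the crop are kept (so the path stays defined) but flagged invisible, which is how the training data can teach the model about off-screen moments.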
Step B: Learn to Read and Follow Paths (Trajectory Control inside the model) 🍞 Hook: Think of drawing a glowing trail that a character should follow. 🥬 The Concept (Trajectory Control):
- What it is: Turn user-drawn paths into signals the generator can follow.
- How it works:
- Convert each pathās points into smooth, blurry dot maps that light up where the subject should be over time.
- Copy visual hints from the first-frame reference area along the path to help the model keep identity while moving.
- Feed these extra channels alongside the usual video-generation inputs.
- Why it matters: Without these signals, the model might ignore your paths or drift off course. 🍞 Anchor: Your "dog path" becomes a glowing lane the dog follows frame by frame. (A minimal heatmap-rendering sketch follows.)
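As a rough illustration of the "smooth, blurry dot maps," here is a minimal NumPy sketch that renders one Gaussian blob per visible time step. The exact encoding (resolution, blur width, extra channels) is an assumption, not the paper's implementation:

```python
import numpy as np

def render_trajectory_heatmaps(points, visible, height, width, sigma=4.0):
    """Each visible trajectory point becomes a blurry Gaussian blob on its
    frame's map; invisible steps stay blank so the model learns 'hidden'."""
    maps = np.zeros((len(points), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for t, ((px, py), vis) in enumerate(zip(points, visible)):
        if not vis:
            continue  # off-screen or occluded: leave this frame's map empty
        maps[t] = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return maps

# A dot that moves right, disappears for two frames, then reappears.
points = [(10, 32), (20, 32), (30, 32), (40, 32), (50, 32)]
visible = [True, True, False, False, True]
heat = render_trajectory_heatmaps(points, visible, height=64, width=64)
print(heat.shape, heat[2].max())  # (5, 64, 64) 0.0 -> the hidden frame is blank
```

These per-frame maps are the kind of extra channels a generator can consume alongside its usual inputs, turning a drawn path into a glowing lane through time.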
Step C: Make Words Focus on the Right Movers (Cross-Attention → Spatial-Aware Weighted Cross-Attention) 🍞 Hook: You know how a teacher calls on the right student by looking at their desk? Location matters. 🥬 The Concept (Cross-Attention):
- What it is: A way for the model to match words to visual regions.
- How it works: The model compares text tokens with video tokens to decide which visuals each word should influence.
- Why it matters: Without it, words affect the wrong places. 🍞 Anchor: The word "wave" affects the arm region, not the feet.
🍞 Hook: It's like giving a gentle nudge so each caption pays more attention to its own path's area. 🥬 The Concept (Spatial-Aware Weighted Cross-Attention):
- What it is: Cross-attention with a boost near each trajectory's moving box.
- How it works:
- For each actor, define a moving region around its path (using its reference box size).
- Increase attention weights between that regionās video tokens and that actorās caption tokens.
- Still allow some attention elsewhere so context isnāt lost.
- Why it matters: Prevents action swaps and keeps multi-actor scenes coherent. 🍞 Anchor: "Girl on the left waves" stays with the left path, even if both kids look similar. (A toy attention sketch follows.)
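The boosting idea can be illustrated with a toy NumPy example. This is a simplified sketch of the mechanism as described above (an additive bias on the attention logits inside each subject's box), not the paper's exact formulation; the names and the boost value are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_weighted_cross_attention(video_q, text_k, text_v,
                                     token_subject, region_masks, boost=2.0):
    """Toy cross-attention where logits between a subject's caption tokens and
    the video tokens inside that subject's moving box get an additive boost
    before the softmax, so local words bind to local pixels while every video
    token still attends softly to the full prompt."""
    d = video_q.shape[-1]
    logits = video_q @ text_k.T / np.sqrt(d)        # (N_video, N_text)
    # bias[v, t] = boost iff video token v lies in the box of text token t's subject
    bias = boost * region_masks[token_subject].T    # (N_video, N_text)
    attn = softmax(logits + bias, axis=-1)          # attention over text tokens
    return attn @ text_v                            # (N_video, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))            # video tokens of a 4x4 patch grid
k = rng.standard_normal((6, 8))             # 3 tokens "girl waves" + 3 tokens "boy spins"
v = rng.standard_normal((6, 8))
token_subject = np.array([0, 0, 0, 1, 1, 1])  # which caption each text token belongs to
region_masks = np.zeros((2, 16))
region_masks[0, [0, 1, 4, 5]] = 1.0         # girl's box covers the top-left patches
region_masks[1, [10, 11, 14, 15]] = 1.0     # boy's box covers the bottom-right patches
out = spatial_weighted_cross_attention(q, k, v, token_subject, region_masks)
print(out.shape)  # (16, 8)
```

Because the bias only links a subject's caption tokens to the video tokens inside that subject's box, "girl waves" wins extra attention on the girl's patches and "boy spins" on the boy's, while attention everywhere else stays ordinary so context isn't lost.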
Step D: Respect Timing and Visibility 🍞 Hook: Like a magic show: sometimes the magician is on stage, sometimes hidden behind the curtain. 🥬 The Concept (Motion Dynamics + Occlusion Detection):
- What it is: Motion dynamics describe speed and timing; occlusion/visibility mark when a subject is seen or hidden.
- How it works:
- Equal time steps between path points set the rhythm.
- Point spacing sets speed.
- Visibility flags tell the model to show or hide the subject (simulate entry/exit or being behind objects).
- Why it matters: Without this, entries/exits look wrong and speeds feel unnatural. 🍞 Anchor: The puppy disappears behind a couch (invisible segment) and reappears on the other side right on time.
Step E: Keep Identities Stable (Reference Images in action) 🍞 Hook: Like pinning a name tag on your character so the audience always knows who they are. 🥬 The Concept (Reference Images):
- What it is: A user-provided picture that defines the subject's look.
- How it works: Place and scale the image on the first frame; the model keeps that appearance consistent throughout motion and re-entries.
- Why it matters: No surprise face swaps. 🍞 Anchor: The same golden dragon stays golden and scaly even when it leaves the sky and returns. (A small compositing sketch follows.)
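The authoring step of placing and scaling a reference on the first frame might look like the following Pillow sketch. The function, placement convention, and dummy images are illustrative assumptions; the hard part (keeping that identity consistent in later frames) is what the generator itself has to do:

```python
from PIL import Image

def place_reference(first_frame: Image.Image, ref: Image.Image,
                    center_xy: tuple, scale: float) -> Image.Image:
    """Paste a scaled reference image onto the first frame at a user-chosen spot."""
    w, h = ref.size
    ref_scaled = ref.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    frame = first_frame.copy()
    cx, cy = center_xy
    top_left = (int(cx - ref_scaled.width / 2), int(cy - ref_scaled.height / 2))
    frame.paste(ref_scaled, top_left)
    return frame

# Dummy images stand in for a real scene photo and a subject crop.
scene = Image.new("RGB", (320, 240), "lightgray")
puppy = Image.new("RGB", (64, 64), "saddlebrown")
first_frame = place_reference(scene, puppy, center_xy=(80, 180), scale=0.75)
print(first_frame.size)
```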
Step F: User Interface for Creation
- Define start/end times per trajectory (when the action begins/ends).
- Draw point sequences (spacing controls speed).
- Mark visible/invisible segments (occlusion and off-screen moments).
- Attach a short, motion-focused caption to each path.
- Drag-and-drop reference images onto the canvas.
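Putting these five inputs together, the UI might hand the generator a payload roughly like the one below. The field names and units are illustrative assumptions, not the system's real schema:

```python
import json

# Hypothetical authoring payload (illustrative field names only).
scene_spec = {
    "duration_s": 4.0,
    "resolution": [1280, 720],
    "subjects": [
        {
            "caption": "puppy runs behind the couch and pops back out",  # short, motion-focused text
            "reference_image": "my_puppy.png",                           # drag-and-dropped identity
            "start_s": 0.5, "end_s": 4.0,                                # when the action begins/ends
            "points": [[80, 400], [240, 390], [400, 385], [560, 390], [720, 400]],  # spacing = speed
            "hidden_segments": [[1.8, 2.6]],                             # occluded / off-screen time ranges
        }
    ],
}
print(json.dumps(scene_spec, indent=2))
```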
The Secret Sauce:
- The spatial-aware attention gently binds each caption to its pathās region, while the trajectory signals inject clear motion timing and speed. Together, they turn a messy crowd into a choreographed scene where each actor follows the right script.
04 Experiments & Results
The Test: What did they measure and why?
- Trajectory following: Do generated motions match the user's drawn paths (distance error and visibility accuracy)?
- Semantic alignment: Do videos reflect the motion captions both globally and locally (CLIP-based scores)?
- Consistency: Do subjects and backgrounds stay stable across time (temporal consistency metrics)?
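In spirit, the first of these measurements reduces to simple comparisons between the requested and generated tracks. The snippet below is a simplified stand-in for scores like trajectory error and appearance rate, not the benchmark's actual implementation:

```python
import numpy as np

def mean_trajectory_error(pred_xy: np.ndarray, target_xy: np.ndarray) -> float:
    """Simplified ObjMC-style score: average pixel distance between the
    generated subject's track and the user-drawn path (lower is better)."""
    return float(np.linalg.norm(pred_xy - target_xy, axis=-1).mean())

def appearance_rate(pred_visible: np.ndarray, target_visible: np.ndarray) -> float:
    """Fraction of frames where the subject is shown/hidden as requested."""
    return float((pred_visible == target_visible).mean())

target = np.array([[10, 10], [20, 12], [30, 14], [40, 16]], dtype=float)
pred = target + np.array([[1, 0], [2, 1], [1, -1], [0, 2]], dtype=float)
print(mean_trajectory_error(pred, target))                                # small = tight path following
print(appearance_rate(np.array([1, 1, 0, 1]), np.array([1, 1, 1, 1])))    # 0.75 = one visibility miss
```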
The Competition: Strong baselines
- Wan 2.2 I2V (powerful general image-to-video model)
- ATI (Any Trajectory Instruction; strong open-source trajectory control)
- Frame In-N-Out (supports reference-guided creation)
The Scoreboard (with context):
- Trajectory Accuracy (ObjMC): WorldCanvas scored lower error than others (91.06 vs much higher), like hitting the bullseye while others land farther from the target.
- Appearance Rate (visibility correctness): 85.17%. Think of it as correctly showing or hiding the actor in more than 8 out of 10 relevant frames, better than the baselines.
- Subject/Background Consistency: Higher scores (about 0.90+ for subject, 0.93+ for background) mean fewer flickers and better identity memory, like a steady movie instead of a jittery one.
- CLIP-T Global/Local: Slightly higher than baselines, meaning the overall story and local actions line up better with the captions, like getting an A when others get B's.
Qualitative Highlights:
- Complex multi-agent events: When two kids must do different actions on different paths, baselines often swap actions or ignore one actor. WorldCanvas binds each caption to its correct path, so the right child does the right move.
- Reference fidelity: Compared to Frame In-N-Out, WorldCanvas preserves the look of provided reference images more faithfully through motion and re-entries, like recognizing your friend even after they leave and return.
- Event correctness: For prompts like āman steps back as a car brakes,ā baselines misinterpret roles or timing, while WorldCanvas coordinates both actors properly.
Surprising Findings:
- Emergent consistency: The model often "remembers" objects and scenes across off-screen intervals, suggesting a kind of visual memory.
- Physical and causal hints: With only the cause drawn (like tipping a bottle), the model often generates plausible effects (spilling liquid), indicating basic physical commonsense.
- Counterfactual control: It can follow imaginative prompts (like a shark in sand) while keeping motion and effects coherent, showing strong controllability.
User Study:
- Across trajectory following, prompt adherence, text-trajectory alignment, reference fidelity, and overall quality, participants strongly preferred WorldCanvasālike a landslide win in five categories.
Bottom line: WorldCanvas consistently beats top baselines in path accuracy, identity stability, and local semantic grounding, especially in multi-actor and reference-driven scenes.
05 Discussion & Limitations
Limitations:
- Challenging camera moves: Under extreme rotations or rapid viewpoint changes, fine details can blur or warp.
- Hidden-time reasoning: If the camera looks away while a process should continue (like filling a cup), the model sometimes underestimates progress.
- Very complex logic chains: Long multi-step cause-and-effect stories can still trip it up.
- Similar-looking agents: If two actors look nearly identical and paths overlap tightly, confusion is still possible without very clear captions and paths.
Required Resources:
- Training used a large dataset (≈280k triplets) and significant compute (64 H800 GPUs). Running the trained model needs a capable GPU and the authoring UI for drawing paths, attaching captions, and placing references.
When NOT to Use:
- You need pixel-perfect physics (e.g., engineering-grade simulations).
- Scenes demand exact 3D geometry or precise camera tracking beyond what trajectories encode.
- You cannot provide clear, separate paths and short, motion-focused captions for each actor.
Open Questions:
- Stronger memory: How can models reliably reason about events that occur while off-screen or during long occlusions?
- Richer 3D: Can we extend path control to full 3D trajectories and dynamic cameras with solid geometric consistency?
- Safety and fairness: How do we ensure reference-based generation avoids bias and respects identity rights?
- Interactive feedback: Can the model accept mid-generation edits (pause, tweak paths, resume) for real-time directing?
- Physics grounding: How far can gentle physics priors go before we need full physical simulators under the hood?
06 Conclusion & Future Work
3-Sentence Summary: WorldCanvas turns video generation into an interactive canvas by combining three inputs, text (what), trajectories (when/where), and reference images (who), and tightly binding each caption to its matching path with spatially aware attention. This makes complex, multi-actor events controllable and consistent, with better motion following, identity stability, and local semantic accuracy than strong baselines. The system even shows emergent visual memory and basic causal sense, moving world models from passive predictors toward user-driven simulators.
Main Achievement: A practical, scalable framework and dataset pipeline that lets everyday users choreograph rich, precise "promptable world events" by drawing paths, writing short motion notes, and pinning character images, all fused by a clever spatial attention mechanism.
Future Directions:
- Extend to 3D paths, robust camera control, and longer scenes with stronger memory.
- Add interactive editing during generation and better physics-aware behavior.
- Broaden datasets to cover more edge cases, complex group interactions, and safety-critical domains.
Why Remember This: It shows that giving the AI a clear plan (who, what, when, and where) unlocks a leap in control and coherence. With WorldCanvas, the world isn't just predicted; it's directed, opening doors for filmmaking, education, robotics, and safer simulations where precision matters.
Practical Applications
- Storyboard complex film shots by drawing actor paths, attaching brief action notes, and pinning reference costumes for continuity.
- Prototype game levels where NPCs follow precise patrol routes and scripted interactions without writing code.
- Simulate traffic scenarios (merging, braking, pedestrian crossings) for driver-assistance testing with exact timing and visibility.
- Create educational science demos showing motion principles (acceleration, occlusion) with simple drawn trajectories.
- Reconstruct and analyze sports plays by tracing players' paths and annotating tactics like screens or cuts.
- Design robotics rehearsals by sketching robot and human paths to test safe interactions before real-world trials.
- Produce advertising mockups where brand mascots follow specific paths while staying visually consistent across shots.
- Generate accessibility-friendly learning materials that demonstrate sequences step by step with clear, controlled motion.
- Visualize logistics flows (e.g., warehouse robots crossing aisles) to spot bottlenecks via controlled multi-agent simulations.
- Create counterfactual or imaginative scenes (dragons, flying dogs) while maintaining coherent timing and identity.