Active Intelligence in Video Avatars via Closed-loop World Modeling
Key Summary
- The paper turns video avatars from passive puppets into active doers that can plan, act, check their own work, and fix mistakes over many steps.
- It introduces L-IVA, a new benchmark where an avatar must finish multi-step goals (like repotting a plant) in a noisy, unpredictable video world.
- The core method, ORCA, runs a closed loop called OTAR: it Observes a clip, Thinks about the next step, Acts by generating a new clip, then Reflects to verify outcomes.
- ORCA uses a dual-system "two-brain" design: System 2 plans strategically and predicts what should happen; System 1 writes very detailed action captions that video models can follow precisely.
- By keeping a belief state (the avatar's memory of the world) and verifying outcomes before updating it, ORCA avoids getting confused by random video generation errors.
- On the L-IVA benchmark (100 tasks, 5 scenarios), ORCA achieves the best average task success rate (71%) and the highest physical plausibility and identity consistency.
- Human judges preferred ORCA's videos over baselines, showing that closed-loop world modeling matters more than single-clip prettiness.
- Open-loop planners can look good on simple tasks but fall apart on complex ones because they never notice and fix early mistakes.
- Ablations show each part (belief state, reflection, and the two-system split) contributes meaningfully to both success and preference.
- Limitations remain (depth perception, temporal sampling, and generative model control), but the framework scales as foundation models improve.
Why This Research Matters
Active, reliable video avatars can host livestreams, teach skills, and guide users through complex tasks without constant human micromanagement. By checking their own work and correcting mistakes, they produce more trustworthy, coherent content over many steps. This closed-loop approach makes long instructional videos and demonstrations feel consistent and purposeful rather than random. It also sets the stage for safer, more dependable AI agents as video generation models continue to improve. Ultimately, it moves digital characters closer to helpful teammates who can think ahead and adapt on the fly.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a puppet only moves when someone pulls the strings? Now imagine that puppet learning to plan its own moves to finish a chore, like making tea. That's the leap this paper aims for.
🥬 The World Before:
- What it is: Video avatars used to be "passive." They copied motions from speech, poses, or scripts and made pretty clips, but they didn't truly think or plan.
- How it worked: A model took a prompt (like text or audio) and generated a clip. For longer videos, it chained clips by feeding the end of one into the start of the next.
- Why it mattered: These systems preserved identity and matched motions to inputs, which is great for short, reactive content, but they couldn't handle open-ended goals over many steps.
🍞 Anchor: Like following a dance routine exactly, but never deciding how to clean a messy room on your own.
🍞 Hook: Imagine a student assigned "Host a product demo." They must plan, adapt, and fix mistakes along the way. Video avatars need the same kind of agency.
🥬 The Problem:
- What it is: Avatars cannot autonomously pursue long-term goals in unpredictable (stochastic) generative video worlds.
- How it works: The same action caption can lead to different video outcomes; the avatar only "sees" what it generated, not the true state. Small errors early on can snowball.
- Why it matters: Without adaptive planning and self-checking, avatars fail real multi-step jobs like livestream hosting or step-by-step tutorials.
🍞 Anchor: If you bake a cake and never taste the batter or peek in the oven, you won't notice it's going wrong until it's burnt.
🍞 Hook: Imagine doing a scavenger hunt while wearing foggy glasses; you can't see everything at once.
🥬 POMDP (Partially Observable Markov Decision Process):
- What it is: A way to model decision-making when you only see part of the world and must choose actions anyway.
- How it works: Keep a "belief" (best guess) of the full state, update it with each new observation, pick actions to reach the goal, and repeat (a toy version is sketched after this block).
- Why it matters: Video avatars only observe their own generated frames, which are incomplete and noisy; they must reason under uncertainty.
🍞 Anchor: Like guessing where your lost sock is by checking a few drawers and updating your hunch each time.
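To make the belief update concrete, here is a toy Bayes filter in Python. It is a minimal sketch of the POMDP idea, not code from the paper: the states, actions, and probability numbers are invented for illustration.

```python
# Toy Bayes filter: the agent never sees the true state, only noisy
# observations, and keeps a probability distribution over states.
# States, actions, and all numbers here are invented for illustration.
STATES = ["soil_empty", "soil_added"]

# P(next_state | state, action): the "scoop" action usually succeeds.
TRANSITION = {
    ("soil_empty", "scoop"): {"soil_empty": 0.2, "soil_added": 0.8},
    ("soil_added", "scoop"): {"soil_empty": 0.0, "soil_added": 1.0},
}

# P(observation | state): generated frames are a noisy view of the state.
OBSERVATION = {
    "soil_empty": {"sees_soil": 0.10, "sees_no_soil": 0.90},
    "soil_added": {"sees_soil": 0.85, "sees_no_soil": 0.15},
}

def update_belief(belief, action, observation):
    """One predict-then-correct step of the belief over STATES."""
    # Predict: push the current belief through the transition model.
    predicted = {s: 0.0 for s in STATES}
    for s, p in belief.items():
        for s_next, p_t in TRANSITION[(s, action)].items():
            predicted[s_next] += p * p_t
    # Correct: reweight by how well each state explains the observation.
    unnormalized = {s: predicted[s] * OBSERVATION[s][observation] for s in STATES}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

belief = {"soil_empty": 1.0, "soil_added": 0.0}
belief = update_belief(belief, "scoop", "sees_soil")
print(belief)  # most of the belief mass shifts to "soil_added"
```

The pattern is always predict (push the belief through the action model), then correct (reweight by the observation), which is the same discipline ORCA applies at the level of whole video clips.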
🍞 Hook: Picture a mental map you hold while blindfolded: turns, doors, and objects you can't see right now but remember.
🥬 Internal World Model (IWM):
- What it is: A memory-and-prediction engine that estimates what the world is now and what will happen next if you act.
- How it works: Combine history with new observations, predict outcomes of actions, and plan accordingly.
- Why it matters: Without an IWM, the avatar's plans drift because it can't reliably track progress or foresee consequences.
🍞 Anchor: When cleaning your room, you remember which toys are already in the bin so you don't keep picking them up again.
🍞 Hook: Suppose the weather can change suddenly; your plan needs to be flexible.
🥬 Stochastic Generative Environments:
- What it is: Video generators are unpredictable; the same prompt can yield different visuals.
- How it works: Randomness inside the model means outcomes vary clip to clip.
- Why it matters: If the avatar assumes a stable world, its memory (belief) gets corrupted by surprises it never checked.
🍞 Anchor: It's like rolling dice every time you ask for "pick up the red cup": sometimes it's the wrong hand or cup unless you verify and retry.
🍞 Hook: Imagine a team where one person makes the plan and another writes the exact recipe steps.
🥬 Gap and Stakes:
- What it is: Past systems either planned once (open-loop), reacted without memory, or assumed a predictable world. None closed the loop to verify outcomes before updating memory.
- How it works: The missing piece is a closed-loop architecture that (1) maintains a belief state, (2) plans, (3) grounds actions in precise captions, and (4) reflects to verify and correct.
- Why it matters: This enables real tasks, such as product demos, cooking lessons, or office workflows, where long-horizon correctness and visual consistency are essential.
🍞 Anchor: Like a careful chef who tastes the soup, adjusts seasoning, and only then records the recipe for the next batch.
02 Core Idea
🍞 Hook: You know how a GPS doesn't just tell you one route at the start, but keeps checking where you are and reroutes if you miss a turn?
🥬 The "Aha!" Moment:
- What it is: Treat the video generator as a noisy world and control it with a closed loop (Observe, Think, Act, Reflect) plus a two-brain controller that plans strategically and writes precise action captions.
- How it works: Keep a belief state, predict the next state, generate a clip with a detailed caption, then verify the outcome. If it's off, revise or replan before updating memory.
- Why it matters: Without this loop, early randomness breaks long tasks. With it, the avatar stays on track.
🍞 Anchor: Like cooking with constant taste tests: you don't serve until the soup actually matches the plan.
🍞 Hook: Imagine a coach and a star player working together: the coach designs the play, the player executes it exactly.
🥬 Hierarchical Dual-System Architecture:
- What it is: Two specialized "brains." System 2 (coach) reasons and predicts; System 1 (player) turns plans into precise, model-specific action captions.
- How it works: System 2 maintains goals and belief, chooses subgoals, and predicts outcomes; System 1 crafts detailed captions that video models can follow.
- Why it matters: High-level reasoning and low-level control need different skills; separating them improves both strategy and execution.
🍞 Anchor: The coach draws the playbook; the player uses exact footwork to run it.
🍞 Hook: Think of your inner voice while doing homework: you notice what's done, decide the next step, do it, then check your work.
🥬 OTAR (Observe-Think-Act-Reflect):
- What it is: A closed-loop control cycle for generative video.
- How it works: Observe the new clip and update belief; Think to pick the next subgoal and predict the next state; Act by generating a precisely captioned clip; Reflect to verify and accept or reject before memory updates (a code skeleton follows this block).
- Why it matters: It catches errors early, avoiding belief corruption and snowballing mistakes.
🍞 Anchor: Like building LEGO step by step and checking each layer is sturdy before adding the next.
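The cycle is easiest to see as code. Below is a hedged skeleton of an OTAR-style loop; `system1`, `system2`, `i2v_model`, and the `belief` object are hypothetical stand-ins, since the paper's actual interfaces are not reproduced here.

```python
def run_otar(belief, goal, i2v_model, system1, system2, max_retries=3):
    """One OTAR episode: loop until the belief says the goal is complete."""
    while not belief.goal_complete(goal):
        # Think: System 2 reads the belief, picks the next subgoal, and
        # predicts what the scene should look like if the subgoal succeeds.
        subgoal, predicted_outcome = system2.plan(belief, goal)
        for _ in range(max_retries):
            # Act: System 1 grounds the subgoal into a precise caption;
            # the I2V model turns the last frame + caption into a new clip.
            caption = system1.ground(subgoal, belief)
            clip = i2v_model.generate(belief.last_frame, caption)
            # Reflect: compare the clip against the prediction first.
            if system2.verify(clip, predicted_outcome):
                # Observe: only a verified clip is allowed to update memory.
                belief.update(clip, subgoal)
                break
            # On mismatch, revise the subgoal/caption and retry.
            subgoal = system2.revise(subgoal, clip)
        else:
            raise RuntimeError(f"subgoal failed after {max_retries} tries: {subgoal}")
    return belief
```

The key design point is that `belief.update` sits behind the `verify` check: an unverified clip never touches memory.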
🍞 Hook: Imagine keeping a mental checklist while assembling furniture.
🥬 Belief State:
- What it is: The avatar's running memory of objects, their states, what's completed, and what's next.
- How it works: Start from the initial scene and goal; update after each verified clip; use it to choose actions (one possible shape is sketched below).
- Why it matters: Without it, the avatar repeats steps or acts out of order.
🍞 Anchor: You tick off "attach leg A" before moving to "attach leg B."
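One plausible shape for that memory, as a minimal sketch; the field names and schema are assumptions for illustration, not the paper's exact format.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """Structured memory: object states plus an ordered subgoal checklist."""
    objects: dict = field(default_factory=dict)   # e.g. {"seedling": "in_small_pot"}
    subgoals: list = field(default_factory=list)  # ordered checklist
    completed: set = field(default_factory=set)   # verified subgoals only

    def next_subgoal(self):
        """First unfinished item in the checklist, or None when done."""
        for sg in self.subgoals:
            if sg not in self.completed:
                return sg
        return None

    def mark_done(self, subgoal, object_updates):
        """Called only after Reflect accepts the clip for this subgoal."""
        self.completed.add(subgoal)
        self.objects.update(object_updates)

state = BeliefState(
    objects={"seedling": "in_small_pot", "large_pot": "empty"},
    subgoals=["add_base_soil", "remove_seedling", "replant", "fill_soil"],
)
print(state.next_subgoal())  # -> "add_base_soil"
```

Keeping `completed` separate from `subgoals` is what lets the agent resume in order instead of repeating steps it has already verified.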
Three Analogies (same idea, different lenses):
- Navigation: The GPS (System 2) plans; the carās control (System 1) follows precise lane instructions; the map updates if you drift (Reflect).
- Kitchen: The head chef plans the menu; the line cook executes exact steps; constant tasting decides whether to adjust seasoning.
- Classroom: The teacher sets the learning plan; the student performs each exercise exactly; self-checking corrects misunderstandings before the next topic.
Before vs After:
- Before: Avatars reacted to prompts, often drifting on long tasks; errors went unnoticed and compounded.
- After: Avatars plan, act, verify, and adapt; they maintain consistent identity and higher task success over many steps.
Why It Works (intuition):
- Predict-then-verify blocks randomness from corrupting memory.
- Separation of concerns gives each "brain" a focused job: clear plans and precise prompts.
- Continual belief updates let the agent reason under uncertainty like a POMDP.
Building Blocks:
- Belief State (memory), System 2 (strategic planner), System 1 (action grounder), OTAR (the loop), Outcome Verification (quality gate), and I2V model as the stochastic world.
03 Methodology
At a high level: Initial Scene + Intention → Initialize Belief and Plan → OTAR Loop: [Observe → Think → Act → Reflect] → Video Sequence and Completed Task.
🍞 Hook: Imagine following a treasure map: you note where you are, plan the next waypoint, take a step, then check you're still on track.
🥬 Overview of Steps:
- Initialize belief state from the starting image and the high-level intention; break the goal into subgoals using only visible objects.
- Observe: After each generated clip, update what changed: object positions, hands, completed subgoals.
- Think (System 2): Choose the next subgoal and predict what the world should look like after success.
- Act (System 1): Translate the subgoal into a highly specific action caption tailored to the video model.
- Reflect: Compare the new clip with the prediction; accept if aligned, else revise the caption or replan; only then update belief.
🍞 Anchor: Like marking your path on the map, taking the next step, and checking your compass before you proceed.
Concept Sandwiches within the Recipe:
🍞 Hook: Picture remembering where every puzzle piece is while building. 🥬 Belief State:
- What it is: A structured memory of the scene, objects, avatar pose, and checklist progress.
- How it works: Start from scene parsing; after each accepted clip, update object states and subgoal status.
- Why it matters: Prevents repeating steps or acting on the wrong object. 🍞 Anchor: You don't try to place a piece you already used.
🍞 Hook: Think of a principal and a PE coach: different jobs, same mission. 🥬 System 2 (Strategic Planner):
- What it is: The reasoning brain that analyzes progress, selects subgoals, and predicts the next state.
- How it works: Reads belief + last observation, chooses the next command, and writes a predicted outcome description (a prompt sketch follows).
- Why it matters: Keeps long-horizon coherence and avoids random wandering. 🍞 Anchor: The principal sets the schedule and milestones for the day.
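In practice System 2 is a prompted VLM. Here is a hedged sketch of what such a planning prompt might look like; the template wording is our assumption, not the paper's actual prompt.

```python
# Hypothetical System 2 planning prompt; wording is illustrative only.
PLANNER_TEMPLATE = """You are the strategic planner for a video avatar.
Goal: {goal}
Belief state (verified so far): {belief}
Latest observation: {observation}

1. Choose the single next subgoal that advances the goal.
2. In one sentence, describe what the scene should look like if the
   subgoal succeeds. This prediction will be checked in the Reflect step.
"""

prompt = PLANNER_TEMPLATE.format(
    goal="Transfer the plant to the bigger pot",
    belief="base soil added; seedling still in small pot",
    observation="right hand at rest; thin soil layer visible in large pot",
)
print(prompt)
```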
🍞 Hook: Now think of the PE coach teaching exact moves. 🥬 System 1 (Action Grounder):
- What it is: The execution brain that converts a subgoal into a precise, model-specific caption.
- How it works: Expands vague commands into detailed, hand-by-hand, object-by-object instructions for the I2V model (sketched below).
- Why it matters: Video models need precise guidance to reduce randomness. 🍞 Anchor: The coach says "left foot here, right foot there" so you nail the routine.
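A minimal sketch of the grounding step: composing a terse subgoal into the kind of hand-by-hand caption the paper says I2V models need. The helper and its parameters are illustrative assumptions, not the paper's actual prompt format.

```python
def ground_subgoal(action, target, hand="right hand",
                   other_hand="remains still on the table", extra=""):
    """Compose a terse subgoal into a precise, model-ready caption."""
    return (f"Using the {hand}, {action} the {target}; "
            f"the other hand {other_hand}. The camera stays fixed. "
            f"{extra}").strip()

caption = ground_subgoal(
    action="gently lift",
    target="green seedling out of the small terracotta pot nearest the camera",
    extra="Move slowly so the roots stay visible.",
)
print(caption)
```

Pinning down the hand, the exact object, and what everything else does is what shrinks the space of clips the stochastic generator can produce.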
🍞 Hook: Imagine the world is a bit dice-rolly; you need to check. 🥬 Outcome Verification (Reflect):
- What it is: A gate that compares what was predicted with what actually appeared in the clip.
- How it works: If matched, accept and update belief; if not, revise the caption or replan and retry (up to N times; see the sketch below).
- Why it matters: Stops bad generations from poisoning memory. 🍞 Anchor: Like proofreading a paragraph before putting it in your final essay.
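Here is a hedged sketch of that gate with a bounded retry budget; `generate`, `verify`, and `commit` are stand-ins for the I2V call, the VLM comparison, and the belief update.

```python
def reflect_and_commit(generate, verify, commit, caption, prediction, max_retries=3):
    """Generate up to max_retries clips; commit only a verified one."""
    for _ in range(max_retries):
        clip = generate(caption)
        if verify(clip, prediction):
            commit(clip)      # the belief update happens only here
            return clip
        # Simplified revision: append clarifiers before retrying.
        caption += " (clarifier: the large gray pot closest to the camera)"
    return None               # caller must replan this subgoal

# Tiny demo with canned outcomes standing in for the stochastic I2V model.
clips = iter(["soil lands in the small pot", "thin soil layer in the large gray pot"])
reflect_and_commit(
    generate=lambda cap: next(clips),
    verify=lambda clip, pred: pred in clip,
    commit=lambda clip: print("belief updated with:", clip),
    caption="pour soil into the large gray pot",
    prediction="large gray pot",
)
```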
🍞 Hook: Pretend your TV is also the playground where actions happen. 🥬 I2V Model (Image-to-Video):
- What it is: The generative engine that turns a still image + action caption into a new video clip.
- How it works: Receives the last frames and a detailed caption; outputs a stochastic (somewhat random) clip (a minimal interface is sketched below).
- Why it matters: It's the "world" the avatar lives in: powerful but unpredictable. 🍞 Anchor: You press "play" after typing the move you want to see.
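From the loop's point of view, the generator can be treated as a black box behind a tiny interface. A sketch, with the method name and signature as assumptions:

```python
from typing import Any, Protocol

class I2VModel(Protocol):
    """The only contract the loop relies on: frame + caption in, clip out.
    Any concrete image-to-video model could sit behind this."""
    def generate(self, last_frame: Any, caption: str) -> Any:
        ...
```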
Detailed Walkthrough (Transfer Plant example):
- Input: Initial image (table, pots, seedling, soil), Intention: "Transfer the plant to the bigger pot."
- Initialize: System 2 lists interactive objects and decomposes goals: (a) add soil to pot, (b) remove seedling, (c) place seedling in big pot, (d) fill remaining soil.
- Observe: After each clip, it notes where the soil level is, which pot the seedling is in, and what the hands are doing.
- Think: If (a) is incomplete, next command: "Scoop soil into large pot until bottom covered"; predicted outcome: "visible soil layer in large pot; scoop returns to resting."
- Act: System 1 writes the precise caption: "Using the right hand from the left side of the image, scoop soil from the brown bag and pour into the large gray pot, creating a visible thin layer; the left hand remains still on the table."
- Reflect: If the clip instead shows the wrong pot or missing soil, mark reject, add clarifiers (e.g., "gray pot closest to the camera; pour slowly"), and retry. Only accept when the visual matches the prediction, then mark subgoal (a) as done.
- Repeat for (b)–(d) with the same loop (the plan is restated as data below).
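The same plan can be written down as data: each subgoal paired with the outcome System 2 expects Reflect to verify. The wording below paraphrases the example above and is not copied from the paper's prompts.

```python
# The repotting plan as data: each subgoal with the outcome Reflect must
# confirm before the belief is updated.
TRANSFER_PLANT = {
    "intention": "Transfer the plant to the bigger pot",
    "subgoals": [
        ("a_add_base_soil",   "visible soil layer covering the large pot's bottom"),
        ("b_remove_seedling", "seedling lifted clear of the small pot, roots visible"),
        ("c_replant",         "seedling upright in the center of the large pot"),
        ("d_fill_soil",       "soil filled in around the seedling in the large pot"),
    ],
}

for name, predicted in TRANSFER_PLANT["subgoals"]:
    print(f"{name}: accept the clip only if -> {predicted}")
```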
Secret Sauce:
- Predict-then-verify shields the belief from generative noise.
- The two-brain split matches how different skills are needed: broad reasoning vs. precision prompting.
- Continuous belief updates make POMDP-style planning practical in a stochastic video world.
04 Experiments & Results
🍞 Hook: When you try a new study method, you don't just hope; it gets tested on quizzes and judged by teachers.
🥬 The Test:
- What it is: L-IVA, a 100-task benchmark across Kitchen, Livestream, Workshop, Garden, and Office scenes, each needing 3–8 interaction steps with multiple objects.
- How it works: Agents must finish high-level goals through coherent multi-clip generations. Metrics check goal completion (TSR), physics realism (PPS), action-video match (AFS), identity consistency, overall aesthetics, and human preference (BWS); two of these are sketched below.
- Why it matters: Success means real, long-horizon competence in noisy generative environments, not just pretty single clips.
🍞 Anchor: Like grading both your homework answers and how clearly you showed your work.
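Two of the scores are simple enough to sketch. TSR is plain task accounting; for BWS we assume the standard count-based Best-Worst Scaling formula, since the paper's exact variant isn't given here.

```python
def task_success_rate(outcomes):
    """TSR: fraction of tasks fully completed (e.g., 71/100 -> 0.71)."""
    return sum(outcomes) / len(outcomes)

def bws_score(times_best, times_worst, times_shown):
    """Count-based Best-Worst Scaling: (best - worst) / appearances, in [-1, 1]."""
    return (times_best - times_worst) / times_shown

print(task_success_rate([True] * 71 + [False] * 29))            # 0.71
print(bws_score(times_best=30, times_worst=5, times_shown=50))  # 0.5
```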
The Competition:
- Open-Loop Planner: Plans all steps once, never looks back.
- Reactive Agent: Looks and acts turn-by-turn but keeps no belief and no reflection.
- VAGEN-style CoT: Has world-model-style reasoning but assumes predictable outcomes (vulnerable to video randomness).
Scoreboard with Context:
- Task Success Rate (TSR): ORCA averages 71.0%; imagine getting an A when others hover around B-/C+ levels. It shines especially in complex, high-dependency scenes (Garden, Workshop) where unnoticed early mistakes break later steps.
- Physical Plausibility (PPS): ORCA tops the chart (≈3.72/5). Think of it as "fewer floating objects and more sensible hand-object contact."
- Action Fidelity (AFS): ORCA is best or tied across scenes (≈0.64), meaning the video actually shows what the caption asked for.
- Identity Consistency and Aesthetics: ORCA remains competitive and is often best in subject consistency, because the Reflect stage filters out bad generations that would cause identity drift.
- Human Preference (BWS): ORCA wins clearly; people preferred its coherent, goal-focused videos even if a baseline sometimes looked slightly prettier in a single clip.
🍞 Hook: Sometimes guessing everything in advance "kind of works" for easy chores, but not for tricky ones.
Surprising Findings:
- Open-Loop can look strong in simple, low-dependency scenes (Kitchen, Livestream) because it "attempts" all steps regardless of what actually happened. But it crashes on complex tasks when early slips go uncorrected.
- ORCA may spend some budget on retries (Reflect), which can slightly hurt speed on easy tasks, but this investment pays off big in hard, dependency-heavy tasks.
- Reactive baselines lack a belief state, so they repeat actions (e.g., keep adding soil) and violate object permanence.
Ablations (Workshop focus):
- Remove System 1 (no precise grounding): TSR and human preference drop; abstract commands alone don't control I2V reliably.
- Remove Reflect (no verification): Consistency and preference fall; bad clips sneak in and poison later steps.
- Remove Belief State (no memory): TSR plunges; without memory, the agent loses track of subgoals and order.
🍞 Anchor: It's like seeing that planners who check their work and write precise instructions do better on long projects than those who just rush through.
05 Discussion & Limitations
🍞 Hook: Even the best team can be held back if the tools are wobbly.
🥬 Honest Assessment:
- Limitations:
- Vision-Language Model (VLM) perception can miss subtle temporal glitches when only a few frames are sampled, causing false accepts.
- Weak depth/spatial understanding can propose impossible moves (e.g., grabbing a far background object).
- Image-to-Video (I2V) generation may ignore fine-grained instructions, hallucinate or drop objects, and struggle with precise physics.
- Retry loops cost time and compute; on very simple tasks, this can be overkill.
- Required Resources:
- A strong VLM for System 2 and Reflect; a capable I2V model; prompt engineering; and enough GPU/TPU to run multiple retries per step.
- When NOT to Use:
- Single-shot aesthetic demos with no interaction; ultra-precise physical simulations (e.g., lighting a tiny wick) where current I2V models often fail; or tasks with hard real-time limits and tiny compute budgets.
- Open Questions:
- How to add better 3D spatial grounding so Reflect understands depth and contact more robustly?
- Can we learn model-specific controllers that reduce prompt brittleness and retries?
- How to adaptively choose frame sampling for Reflect to catch fleeting artifacts without blowing context length?
- Could lightweight learned critics replace or assist VLM reflection to cut costs?
🍞 Anchor: Think of ORCA as a smart driver whose car has slippery tires; better tires (future I2V/VLM) will make the same driver even safer and faster.
06 Conclusion & Future Work
🍞 Hook: Picture turning a puppet into a careful problem-solver that checks each move before building the next.
🥬 Takeaway:
- 3-Sentence Summary: This work upgrades video avatars from passive reactors to active agents by introducing L-IVA (a benchmark) and ORCA (a closed-loop planning-and-acting framework). ORCA runs an Observe-Think-Act-Reflect cycle with a dual-system "two-brain" design, maintaining a belief state, predicting outcomes, writing precise action captions, and verifying results. This prevents random video generations from corrupting memory, enabling reliable multi-step task completion.
- Main Achievement: Showing that closed-loop world modeling with outcome verification and a hierarchical planner-grounder split dramatically improves long-horizon success and human preference in stochastic generative environments.
- Future Directions: Stronger 3D spatial reasoning, better temporal analysis, learned grounding policies to reduce retries, and tighter coupling with more controllable video generators.
- Why Remember This: The big idea (predict, act, verify, then update) turns "pretty clips" into purposeful behaviors, bringing virtual hosts, tutors, and assistants closer to real agency.
🍞 Anchor: Like a chef who tastes, adjusts, and perfects each step, ORCA keeps avatars on recipe even when the kitchen is noisy.
Practical Applications
- Autonomous product hosting in livestreams that plans demos, adapts to audience prompts, and keeps identity consistent.
- Step-by-step educational videos (cooking, crafts, science labs) that verify each step before moving on.
- A virtual concierge or tour guide that plans multi-stop explanations and checks visual cues to stay on track.
- Office task walk-throughs (e.g., assembling equipment) with verified progress at each subgoal.
- Game or simulation characters that complete multi-stage quests reliably in generative cutscenes.
- Marketing storyboard generation where actions align tightly with brand/product interactions.
- Human-in-the-loop content creation where the Reflect stage flags bad takes for easy reshoots or edits.
- Accessibility-friendly tutorials that ensure visible, verifiable changes happen in each step for clarity.
- Training data generation for downstream models with high action-caption alignment and physical plausibility.
- Interactive social media challenges where avatars adapt plans to user comments while preserving coherence.