
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Beginner
Jiehui Huang, Yuechen Zhang, Xu He et al. · 12/8/2025
arXiv · PDF

Key Summary

  • UnityVideo is a single, unified model that learns from many kinds of video information at once—like colors (RGB), depth, motion (optical flow), body pose, skeletons, and segmentation—to make smarter, more realistic videos.
  • It uses a clever training trick called dynamic noise scheduling to switch between three jobs: generate videos from conditions, estimate conditions from videos, and generate both together.
  • A modality-adaptive switcher helps the model know which type of input it is using (like depth vs. flow) so it can process each one correctly without getting confused.
  • An in-context learner uses short text hints (like 'depth map' or 'skeleton') so the model understands the kind of information it’s working with and generalizes better.
  • Training on all modalities together speeds up learning and improves zero-shot generalization—handling new objects and scenes it has never seen before.
  • A new 1.3M-sample dataset (OpenUni) and a 30K-sample benchmark (UniBench) help the model learn and be fairly tested.
  • UnityVideo beats or matches strong baselines in text-to-video, controllable generation, depth estimation, and video segmentation.
  • It shows better 'world awareness,' like obeying physics (refraction, motion), keeping backgrounds consistent, and understanding object boundaries.
  • The framework is scalable: adding more modalities keeps improving performance without chaos.
  • Limitations include occasional autoencoder artifacts and the need for large compute, but the path forward is clear: better encoders and more modalities.

Why This Research Matters

World-aware video generation can make synthetic videos feel natural, consistent, and physically believable, improving everything from movies to education. A single model that understands depth, motion, and objects can help creators direct scenes precisely while keeping backgrounds stable and actions realistic. Robots and AR/VR systems benefit because better depth and motion understanding can make navigation safer and interactions smoother. Medical, scientific, and engineering simulations can look more accurate while being easier to control. Finally, this unified approach reduces the need for many separate tools, lowering friction and cost for real-world production pipelines.

Detailed Explanation


01Background & Problem Definition

You know how when you watch a movie, you don’t just see colors—you also feel the scene’s depth, follow how things move, recognize where people are, and understand what each object is? Your brain mixes many clues to make sense of the world. Video AIs used to focus mostly on just the colorful pictures (called RGB). That’s like trying to understand a busy street using only a single eye and ignoring clues like how far things are, how they move, or which pixels belong to which object.

Before this research, video generation models could make impressive clips, but they often missed the deeper 'world rules.' They might mess up physics like refraction through glass, make backgrounds flicker, or confuse moving parts. Other systems tried to fix this by adding just one helper signal at a time—like using depth maps to guide video generation or using skeletons to control human motion. That helped, but each method was like learning only one instrument in an orchestra. You get better at that instrument, but you still don’t have the full song.

The problem was bigger: different 'modalities' (kinds of video information) weren’t being learned together. Modalities like depth (how far), optical flow (how things move), segmentation (which pixels belong to what), skeletons (stick-figure motion), and DensePose (detailed body surfaces) each provide complementary clues. But most models either (1) took one of these as a control to generate RGB, or (2) predicted one modality from RGB. Rarely did they do both directions together, and almost never did they learn several modalities and tasks in one unified framework. That meant weaker cross-modal interaction and limited 'world awareness.'

People tried single-modality control (e.g., depth-to-video) and even some two-way systems, but they often trained separate models for each task or stuck to just one or two modalities. This isolated training prevented the model from learning the rich, shared structure of the world. Also, sequential stage-by-stage training could cause forgetting—when you learn a new task, you degrade performance on the old one.

The gap: we needed a unified way to train across many modalities and many tasks, all at once, inside a single model that can switch objectives smoothly. With that, the model could share knowledge between tasks (like generating and estimating), and between modalities (like depth strengthening motion understanding, and segmentation clarifying object boundaries). If done right, this could accelerate learning, improve quality, and unlock zero-shot generalization—performing well on new stuff without direct training.

Why this matters: Your videos could look and feel more real—no odd hand distortions, no background shimmer, and better motion. In practical life, that means safer robots that understand depth and motion, better filmmaking tools that follow physics, and more accurate educational or medical animations. It can also help creators precisely control video generation using different inputs (like a stick figure pose or a segmentation mask) without losing visual quality. In short, unifying modalities helps AI 'see' more like we do—by combining many clues of the world into one coherent understanding.

02Core Idea

Aha! Moment in one sentence: Teach one big video model to learn many kinds of world clues (depth, motion, body pose, segmentation, and more) and many tasks together, and it becomes better at making and understanding videos—even ones it never saw before.

Three analogies:

  1. Orchestra: Imagine training a musician to play violin, piano, and drums together. They start to feel rhythm, harmony, and melody in a deeper way. UnityVideo learns many 'instruments' (modalities) to compose better 'music' (videos).
  2. Sports Team: A team with runners, defenders, and goalies communicates better and wins more. Depth, flow, and segmentation 'talk' to each other in the model and improve overall play.
  3. Recipe Book: Baking, grilling, and steaming teach different skills. Learning them together helps a chef choose the best method for each ingredient. The model learns which 'cooking method' (task) to use for each input.

Before vs After:

  • Before: Separate models or single-modality controls; good visuals but shaky physics, flicker, and brittle control.
  • After: One model trained on many modalities and tasks together; stronger physics, steadier backgrounds, finer edges, and better zero-shot generalization.

Why it works (intuition without equations):

  • Different modalities give complementary clues: depth nails 3D structure, optical flow explains motion, segmentation separates objects, skeletons capture pose, and DensePose refines body parts.
  • Training across multiple tasks (generate from conditions, estimate conditions from video, and jointly generate both) lets knowledge flow both ways.
  • Switching goals dynamically during training (by adding or removing noise differently for each task) prevents forgetting and encourages a shared 'world model' inside the network.
  • A modality switcher (like a dial) and in-context prompts (tiny text hints like 'this is a depth map') tell the model exactly what kind of data it’s handling, reducing confusion and boosting generalization.

Building blocks explained with the Sandwich pattern:

🍞 Hook: Imagine learning to read a comic by seeing the pictures, the speech bubbles, the motion lines, and the scene panels together. 🥬 Unified Multi-Modal Joint Training: It is a way to teach one model from many video clues (RGB, depth, flow, segmentation, skeleton, DensePose) at the same time. How it works: (1) Put all modalities into a shared model, (2) let them interact through attention so clues reinforce each other, (3) train on several tasks so knowledge flows both ways. Why it matters: Without it, the model only learns part of the story and struggles with physics and consistency. 🍞 Anchor: Using both depth and flow together makes the model better at tracking a runner’s motion while keeping the background steady.

🍞 Hook: Think of a student doing math, science, and art in one school day, noticing patterns that connect them. 🥬 Multi-Task Learning: It means the model learns different jobs together (control generation, estimation, joint generation). How it works: (1) Randomly pick a task each step, (2) use the right noise schedule, (3) share the same backbone so skills transfer. Why it matters: Without it, learning one job can harm another, and the model forgets. 🍞 Anchor: Training to both generate depth from text and to estimate depth from a video helps the model understand depth better in both directions.

🍞 Hook: You know how friends from different clubs share tips—soccer moves can help in basketball defense. 🥬 Cross-Modal Interaction: It is different modalities helping each other through shared representations. How it works: (1) Concatenate tokens, (2) let attention mix information across RGB and others, (3) learn shared patterns (like edges or motion). Why it matters: Without it, the model can’t combine clues and misses global understanding. 🍞 Anchor: Segmentation shows boundaries, which helps depth estimation avoid 'bleeding' edges.
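
To make "concatenate tokens, then let attention mix them" concrete, here is a minimal PyTorch sketch. The tensor shapes, embedding size, and single attention layer are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not UnityVideo's real configuration).
batch, rgb_len, depth_len, dim = 2, 256, 256, 64

rgb_latents = torch.randn(batch, rgb_len, dim)      # RGB video latent tokens
depth_latents = torch.randn(batch, depth_len, dim)  # depth latent tokens

# Concatenate modalities along the token axis so self-attention can mix clues
# across RGB and depth inside one shared layer.
tokens = torch.cat([rgb_latents, depth_latents], dim=1)          # (2, 512, 64)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
mixed, _ = attn(tokens, tokens, tokens)  # every token can attend to both modalities

# Split back into per-modality streams after the cross-modal mixing.
rgb_mixed, depth_mixed = mixed.split([rgb_len, depth_len], dim=1)
print(rgb_mixed.shape, depth_mixed.shape)  # torch.Size([2, 256, 64]) each
```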

🍞 Hook: When you enter a library, signs like 'Fiction' or 'Science' guide you to the right shelves. 🥬 In-Context Learner: It uses short text hints (like 'depth map' or 'skeleton') to tell the model what kind of data it’s seeing. How it works: (1) Feed modality-type text into cross-attention, (2) process modality tokens with their hint, (3) recombine with RGB. Why it matters: Without hints, the model may confuse modalities or overfit to specific content. 🍞 Anchor: A prompt 'two persons' during training helps it later segment 'two objects' in zero-shot settings.
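
The in-context hints can be pictured as two small cross-attention calls, one per branch. The sketch below assumes the text hints are already embedded (a real system would run them through a text encoder); names and sizes are illustrative, not UnityVideo's API.

```python
import torch
import torch.nn as nn

dim = 64  # illustrative embedding size (assumption)

# Stand-ins for encoded prompts: a content prompt ("a dog running in a park")
# and a modality-type hint ("depth map").
content_hint = torch.randn(1, 8, dim)
modality_hint = torch.randn(1, 2, dim)

rgb_tokens = torch.randn(1, 256, dim)       # RGB latent tokens
modality_tokens = torch.randn(1, 256, dim)  # e.g., depth latent tokens

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

# The content hint conditions the RGB branch; the modality-type hint conditions
# the modality branch, telling it "this is a depth map" rather than a skeleton.
rgb_cond, _ = cross_attn(rgb_tokens, content_hint, content_hint)
mod_cond, _ = cross_attn(modality_tokens, modality_hint, modality_hint)

# Recombine the conditioned streams before the shared backbone.
fused = torch.cat([rgb_tokens + rgb_cond, modality_tokens + mod_cond], dim=1)
print(fused.shape)  # torch.Size([1, 512, 64])
```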

🍞 Hook: Like switching a camera mode from 'portrait' to 'sports' for the right settings. 🥬 Modality-Adaptive Switcher: It is a learned 'dial' that adjusts the network for each modality. How it works: (1) Each modality has an embedding, (2) it generates scale/shift/gates for layers, (3) optional expert heads at input/output reduce confusion. Why it matters: Without it, the model blends modalities and mis-predicts outputs. 🍞 Anchor: Asking for a segmentation mask will not accidentally produce a skeleton because the switcher selects the right head.
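
One way to picture the switcher is as AdaLN-style modulation driven by a learned modality embedding. This is a hedged sketch under assumed shapes; the class name, layer layout, and sizes are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ModalitySwitcher(nn.Module):
    """Toy AdaLN-style 'dial': per-modality scale, shift, and gate (illustrative)."""
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.modality_embed = nn.Embedding(num_modalities, dim)
        self.to_mod = nn.Linear(dim, 3 * dim)  # produces scale, shift, gate
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, x, modality_id, t_embed):
        cond = self.modality_embed(modality_id) + t_embed          # (B, dim)
        scale, shift, gate = self.to_mod(cond).chunk(3, dim=-1)    # 3 x (B, dim)
        # Each modality gets its own modulation of the normalized tokens.
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * h                           # gated residual

# Usage: depth tokens (id 1) are modulated differently than, say, flow (id 2).
dim = 64
switcher = ModalitySwitcher(num_modalities=6, dim=dim)
tokens = torch.randn(2, 256, dim)
t_embed = torch.randn(2, dim)  # timestep embedding, assumed given
out = switcher(tokens, torch.tensor([1, 1]), t_embed)
print(out.shape)  # torch.Size([2, 256, 64])
```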

🍞 Hook: When practicing music, sometimes you mute the band and focus on your instrument. 🥬 Dynamic Noise Scheduling: It controls which parts are noised or clean to switch tasks. How it works: (1) For conditional generation, keep condition clean and denoise RGB; (2) for estimation, keep RGB clean and denoise modality; (3) for joint generation, noise both. Why it matters: Without it, the model can’t smoothly learn all tasks together and may forget. 🍞 Anchor: To generate video from depth, the depth stays clean while the RGB is denoised.
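
The "mute some instruments" idea can be written down in a few lines. This is a minimal sketch of the assumed behavior (which streams stay clean for each task), not the paper's actual scheduler.

```python
import torch

def noisy_mix(x0, noise, t):
    """Interpolate from pure noise (t = 0) toward clean data (t = 1)."""
    return (1 - t) * noise + t * x0

def apply_task_noising(rgb, modality, task, t):
    """Decide which streams are noised for each of the three training tasks."""
    rgb_noise, mod_noise = torch.randn_like(rgb), torch.randn_like(modality)
    if task == "conditional_generation":   # condition stays clean, RGB is denoised
        return noisy_mix(rgb, rgb_noise, t), modality
    if task == "estimation":               # RGB stays clean, modality is denoised
        return rgb, noisy_mix(modality, mod_noise, t)
    if task == "joint_generation":         # both are denoised, conditioned on text
        return noisy_mix(rgb, rgb_noise, t), noisy_mix(modality, mod_noise, t)
    raise ValueError(f"unknown task: {task}")

# Example: a depth-to-video step keeps the depth latents clean.
rgb_latents, depth_latents = torch.randn(1, 16, 64), torch.randn(1, 16, 64)
t = torch.rand(1, 1, 1)  # random diffusion time per sample
noisy_rgb, clean_depth = apply_task_noising(rgb_latents, depth_latents,
                                            "conditional_generation", t)
```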

🍞 Hook: Matching socks by color and size under different lighting conditions. 🥬 Conditional Flow Matching: It teaches the model to predict the 'velocity' to move from noise to data under a condition. How it works: (1) Mix clean data with noise over time t, (2) learn to point toward clean data, (3) condition on text or modalities. Why it matters: Without it, the model learns slower and aligns worse with conditions. 🍞 Anchor: Given 'a dog running' plus a skeleton stick figure, the model learns to move from noise toward the correct dog-running video.
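
A compact way to see the objective is "predict the velocity from noise toward the clean data." The sketch below uses a toy MLP in place of the shared DiT and one common flow-matching convention; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

dim = 64
# Toy stand-in for the shared Diffusion Transformer (assumption for illustration).
model = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

def flow_matching_loss(x0):
    """Minimal flow-matching style loss: regress the velocity along the path."""
    noise = torch.randn_like(x0)            # starting point of the path
    t = torch.rand(x0.shape[0], 1)          # random time in [0, 1]
    xt = (1 - t) * noise + t * x0           # point on the straight noise-to-data path
    target_velocity = x0 - noise            # direction pointing toward clean data
    pred = model(torch.cat([xt, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

x0 = torch.randn(8, dim)                    # clean latents (e.g., RGB or depth)
loss = flow_matching_loss(x0)
loss.backward()
print(float(loss))
```

In the conditional setting, the model would additionally see the text embedding and any clean condition (such as a skeleton), but the velocity-regression idea stays the same.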

🍞 Hook: If you can ride a bike, you can often try a scooter without lessons. 🥬 Zero-shot Generalization: It is handling new cases not seen in training. How it works: (1) Learn shared principles across modalities, (2) use in-context hints to adapt, (3) leverage cross-modal patterns that transfer. Why it matters: Without it, the model fails on new objects or styles. 🍞 Anchor: Trained mostly on human data, it can still segment two non-human objects when prompted right.

03Methodology

At a high level: Inputs (text + any subset of modalities like depth/flow/segmentation/skeleton/DensePose) → Dynamic task routing with noise scheduling → Shared Diffusion Transformer with in-context learner and modality-adaptive switcher → Outputs (RGB video, estimated modalities, or both).

Step-by-step, like a recipe:

  1. Data preparation (OpenUni 1.3M multimodal samples)
  • What happens: Gather real and synthetic videos, extract five modalities (depth with Depth Anything, flow with RAFT, segmentation with SAM, skeletons with DWPose, and DensePose from Meta), and filter for quality (duration, resolution, aesthetics, and removing clips with OCR-detected text overlays). Build UniBench (30K+ synthetic samples with precise ground truth, plus curated real data) for fair testing.
  • Why it exists: The model needs rich, synchronized signals across modalities to learn shared world rules. Without this, it overfits or misses key cues.
  • Example: A 5-second clip of a person walking gets a depth map (near/far), optical flow (motion vectors), a skeleton (joint positions), DensePose (body surface), and segmentation (which pixels are person vs. background).
  2. Tokenization and shared backbone
  • What happens: Encode RGB and other modalities into latent tokens via a VAE and then process them with a single Diffusion Transformer (DiT). Concatenate modality tokens (e.g., along width) so attention layers can mix clues.
  • Why it exists: A shared brain lets modalities talk to each other. Without shared layers, you lose cross-modal reinforcement.
  • Example: Depth edges align with RGB edges; attention layers learn that a sharp depth jump often means an object boundary.
  3. Dynamic task routing via noise scheduling (the three modes)
  • What happens: Each training step, randomly choose one task with probabilities tuned by difficulty:
    • Conditional generation: Keep the condition modality clean, denoise RGB from noise.
    • Estimation: Keep RGB clean, denoise the target modality.
    • Joint generation: Denoise both from noise, conditioned on text.
  • Why it exists: This lets the model practice all jobs evenly, preventing forgetting and encouraging shared understanding.
  • Example: If the step is 'flow estimation,' the RGB frames are clean, and the model learns to denoise flow latents toward the correct flow.
  4. In-Context Learner (modality-aware prompts)
  • What happens: Provide two short text prompts—one for content (e.g., 'a dog running in a park') and one for modality type (e.g., 'depth map'). Use cross-attention to fuse content with RGB tokens and modality-type with modality tokens.
  • Why it exists: The model needs to know 'what' it’s seeing and 'which kind' of data it is. Without this, it may overfit to content and fail to generalize across types.
  • Example: The text 'two persons' during training helps the model later generalize to 'two objects' for text-driven segmentation.
  5. Modality-Adaptive Switcher (architecture-level modulation)
  • What happens: Each modality gets a learned embedding. This embedding, combined with timestep info, produces scale/shift/gate parameters for layers (AdaLN-style). Also, lightweight expert heads at the input/output help keep modalities separate when needed.
  • Why it exists: Even with text hints, the network benefits from a dedicated 'mode' per modality, reducing confusion and stabilizing outputs.
  • Example: When asked for a segmentation output, the segmentation head is used—preventing accidental skeleton-like outputs.
  6. Curriculum training (stage-wise modality grouping)
  • What happens: Train first on pixel-aligned modalities (depth, flow, DensePose) on single-person data to learn strong alignment. Then expand to more diverse data and add pixel-unaligned modalities (segmentation, skeleton).
  • Why it exists: Jumping into all modalities at once can be noisy and slow. A curriculum builds a sturdy foundation, then adds complexity.
  • Example: After learning to reconstruct depth precisely, the model uses that structure to improve segmentation boundaries.
  7. Conditional Flow Matching objective (unified loss view)
  • What happens: For each mode, mix clean data with noise by a factor t, and train the model to predict the velocity toward the clean target(s), under the right condition(s).
  • Why it exists: It stabilizes training and aligns learning with the denoising path, making conditional generation and estimation more consistent.
  • Example: In joint generation, both RGB and depth are noised; the model learns to push both toward their clean versions while following the text prompt.
  8. Inference (plug-and-play control; see the routing sketch after this list)
  • What happens: At test time, choose which inputs to provide (text only, text + depth, text + skeleton, etc.) and which outputs to request (RGB only, depth only, or both). The switcher picks the right mode; the in-context prompts clarify data type.
  • Why it exists: Real users want flexibility—sometimes control a video from a pose, sometimes get a depth map from a video, sometimes do both.
  • Example: Give a reference skeleton sequence to animate a dancer while keeping the background steady.
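
As promised in step 8, here is a small Python sketch of the plug-and-play routing idea: map the streams the user provides and requests to one of the task modes. The function and mode names are illustrative assumptions, not UnityVideo's actual interface.

```python
from typing import Sequence

def route_request(provided: Sequence[str], requested: Sequence[str]) -> str:
    """Pick a mode from the streams the user supplies and the streams they want.

    provided:  inputs given, e.g. ("text",), ("text", "depth"), ("rgb",)
    requested: outputs wanted, e.g. ("rgb",), ("depth",), ("rgb", "depth")
    """
    has_condition = any(m not in ("text", "rgb") for m in provided)  # depth/skeleton/...
    has_rgb_input = "rgb" in provided
    wants_rgb = "rgb" in requested
    wants_modality = any(m != "rgb" for m in requested)

    if has_rgb_input and wants_modality and not wants_rgb:
        return "estimation"               # RGB clean, denoise the target modality
    if has_condition and wants_rgb and not wants_modality:
        return "conditional_generation"   # condition clean, denoise RGB
    if wants_rgb and wants_modality:
        return "joint_generation"         # denoise both, conditioned on text
    return "text_to_video"                # plain text-conditioned generation

print(route_request(("text", "skeleton"), ("rgb",)))    # conditional_generation
print(route_request(("rgb",), ("depth",)))              # estimation
print(route_request(("text",), ("rgb", "depth")))       # joint_generation
```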

Secret sauce (why this is clever):

  • The dynamic noising trick cleanly unifies three tasks into one training loop. The switcher and in-context prompts prevent modality confusion and spark generalization. Curriculum training speeds convergence and avoids early chaos. Altogether, the model learns a single, richer 'world model' that makes videos more physically correct, stable, and controllable.

Concrete walkthrough example (depth → video):

  • Input: Text 'a person jumping over a puddle' + a depth map sequence.
  • Routing: Conditional generation mode—depth clean, RGB noised.
  • Processing: In-context prompt 'depth map' guides modality branch; switcher sets depth mode; attention layers blend depth edges with text intent.
  • Output: A stable, realistic jump with correct puddle depth cues and reflections.

Concrete walkthrough example (video → depth):

  • Input: An RGB clip of a dog running.
  • Routing: Estimation mode—RGB clean, depth noised.
  • Processing: In-context 'depth map' hint; switcher selects depth head; cross-modal attention links motion edges to depth discontinuities.
  • Output: A smooth, accurate depth sequence capturing the dog and background separation.

Concrete walkthrough example (text → video + depth):

  • Input: Text 'a glass of milk poured on a table, side view.'
  • Routing: Joint generation—both RGB and depth noised.
  • Processing: Content prompt drives RGB; modality prompt drives depth; the two co-evolve with shared attention.
  • Output: A video that respects refraction in glass, steady background, and a depth map consistent with the scene.

04Experiments & Results

The test: Does unifying many modalities and tasks in one model actually improve video quality, physics, and understanding? The team evaluated UnityVideo on three fronts: (1) text-to-video generation, (2) controllable generation (e.g., depth-to-video), and (3) modality estimation from RGB (like depth and segmentation). They measured visual quality, consistency, and motion (via VBench metrics), depth accuracy (AbsRel and δ < 1.25), and segmentation accuracy (mIoU, mAP). They also ran a user study to see if humans preferred the outputs.

The competition: UnityVideo was compared against strong open and commercial models for video generation (e.g., OpenSora, Hunyuan, Wan, Kling), controllable generation (e.g., Full-DiT, VACE), and estimation systems (e.g., DepthCrafter, Geo4D, Aether), plus video segmentation baselines (e.g., SAMWISE, SeC). Testing used public VBench and the new UniBench, which includes highly accurate synthetic ground truth from Unreal Engine for fair scoring.

The scoreboard with context:

  • Text-to-video: UnityVideo achieved top scores across background consistency, overall consistency, and dynamic degree while remaining competitive in aesthetics. Think of it as getting an A+ in 'following physics and staying stable' and an A in 'looking pretty,' where others might get B’s in stability.
  • Controllable generation: It followed control signals (like depth and skeleton) more faithfully and kept backgrounds steadier, reducing flicker and odd distortions. That’s like perfectly following a dance coach while also keeping the stage lights steady.
  • Depth estimation: AbsRel around 0.022 with δ < 1.25 near 99%—that’s exceptional, like guessing nearly every distance in a scene with very tiny error.
  • Video segmentation: Higher mIoU and mAP than strong baselines, meaning cleaner object masks and better detection—like neatly coloring inside every line while spotting small details others miss. (A sketch of how these depth and segmentation metrics are computed follows this list.)
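
For readers unfamiliar with these scores, the sketch below shows how AbsRel, δ < 1.25, and mIoU are commonly computed, using tiny made-up arrays (not the paper's evaluation code or data).

```python
import numpy as np

# Toy ground-truth and predicted depth maps (illustrative values only).
gt_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
pred_depth = np.array([[1.1, 1.9], [3.2, 4.1]])

# AbsRel: mean absolute relative depth error; lower is better.
abs_rel = np.mean(np.abs(pred_depth - gt_depth) / gt_depth)

# delta < 1.25: share of pixels whose ratio max(pred/gt, gt/pred) is under 1.25;
# higher is better (near 99% means almost every pixel's depth is close).
ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
delta_125 = np.mean(ratio < 1.25)

# mIoU for segmentation: intersection-over-union averaged over classes.
gt_mask = np.array([[0, 1], [1, 1]])    # toy masks (0 = background, 1 = object)
pred_mask = np.array([[0, 1], [0, 1]])
ious = []
for cls in (0, 1):
    inter = np.logical_and(gt_mask == cls, pred_mask == cls).sum()
    union = np.logical_or(gt_mask == cls, pred_mask == cls).sum()
    ious.append(inter / union)

print(f"AbsRel={abs_rel:.3f}  delta<1.25={delta_125:.0%}  mIoU={np.mean(ious):.2f}")
```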

Surprising (and cool) findings:

  • Zero-shot generalization: Even when trained mainly on humans, the model could handle new objects and multi-object scenarios for segmentation just by changing the in-context prompt. That’s like a student trained on 'two people' successfully segmenting 'two chairs' without extra lessons.
  • Physics awareness: The model followed physical rules better—like realistic refraction in glass and consistent motion damping—thanks to complementary supervision from depth and flow.
  • Faster convergence: Joint multimodal training reached lower video generation loss than single-modality or RGB-only setups, meaning it learned faster and better.
  • No negative interference: Adding more modalities didn’t cause chaos; performance improved steadily as more modalities were included.

Ablations that explain why it works:

  • Single vs multiple modalities: Training with just depth or just flow helped; training with both helped more, especially in imaging quality and consistency, showing that modalities truly complement each other.
  • Single-task vs multi-task: Doing only one task sometimes hurt performance; unifying tasks recovered and surpassed it, proving that tasks help each other.
  • Architecture choices: Both the in-context learner and the modality switcher contributed gains individually; together they worked best, reducing confusion and boosting stability.
  • Modality-specific heads: Lightweight dedicated output layers further prevented the model from mixing up outputs (e.g., not confusing segmentation with skeleton), improving reliability at scale.

Human preference: In a user study, people preferred UnityVideo’s outputs for physical realism and overall quality more often than competing models, aligning with the automatic metrics.

Bottom line: Across many tests and angles, UnityVideo didn’t just match others—it often set a new bar for stability, control, depth accuracy, and segmentation, all while being a single, flexible model.

05Discussion & Limitations

Limitations:

  • VAE artifacts: The autoencoder that turns videos into latents can introduce slight blurriness or reconstruction quirks. This sometimes limits the sharpest details.
  • Heavy compute: A unified 10B-parameter DiT with many modalities needs significant training resources and careful curriculum scheduling.
  • Modality expansion: While the framework scales well, adding many more modalities (e.g., audio, events, LiDAR) will require continued care to avoid subtle confusion.
  • Extreme edge cases: Very long videos, extremely fast motions, or rare physical effects may still challenge consistency.

Required resources:

  • Large-scale multimodal data (OpenUni) and a capable compute stack for training.
  • Good pre-processing (quality filters, OCR cleaning) and expert models for modality extraction if creating new data.
  • For best results, a tuned inference pipeline (number of denoising steps, classifier-free guidance scale) and, optionally, task-specific prompts.

When not to use:

  • If you only need a tiny, specialized tool for one very narrow task (e.g., ultra-fast single-frame depth on mobile), a small specialist may be more efficient.
  • If you must generate ultra-high-resolution photorealism with strict latency limits, simpler pipelines may be preferable for now.
  • If your domain needs modalities UnityVideo wasn’t trained on and you can’t provide matching in-context hints, results may degrade.

Open questions:

  • How far can zero-shot generalization go with even more modalities and scale? Do we see emergent reasoning like in large language models?
  • Can better autoencoders remove reconstruction limits and bring crisper details?
  • How to incorporate 3D scene representations directly (e.g., NeRFs or 4D priors) while keeping the unified simplicity?
  • What is the best curriculum ordering for new modalities, and can the model self-schedule training difficulty?
  • How to blend audio or tactile signals into the same unified scheme for richer world models?

06Conclusion & Future Work

Three-sentence summary: UnityVideo trains a single video model on many kinds of world clues (depth, flow, segmentation, skeleton, DensePose) and many tasks (conditional generation, estimation, and joint generation) at once. With dynamic noise scheduling, a modality-adaptive switcher, and an in-context learner, it learns a shared 'world model' that improves physics, stability, and control, while generalizing to new situations. It outperforms or matches strong baselines across video quality, depth estimation, and segmentation, supported by the large OpenUni dataset and the UniBench benchmark.

Main achievement: Showing that unifying multiple modalities and tasks in one diffusion-transformer framework produces a more world-aware video model that learns faster, generalizes better, and achieves stronger scores across diverse benchmarks.

Future directions: Improve the autoencoder for sharper detail, scale up parameters and modalities (e.g., add audio or 3D priors), refine the curriculum, and explore even richer in-context capabilities. Also, extend the unified approach to longer videos and more complex interactions.

Why remember this: It’s a clear step toward 'world models' for video—systems that don’t just draw pretty pixels but understand how the world works. By learning many clues together and many jobs at once, UnityVideo makes videos that look better, move more realistically, and follow your controls more faithfully.

Practical Applications

  • Film and animation tools that follow physics (refractions, realistic motion) while letting creators control scenes with depth maps, skeletons, or masks.
  • AR/VR content generation with consistent backgrounds and accurate depth for more comfortable and immersive experiences.
  • Robotics perception: estimate depth and motion robustly from video feeds to support navigation and manipulation.
  • Sports and dance visualization: guide motions with skeletons to generate training or choreography videos.
  • Education demos: physics-friendly clips (pendulums, fluids, refraction) that match textbook rules.
  • Medical or scientific animations: stable, physically plausible videos that can be controlled by segmentation or depth inputs.
  • Video editing: replace backgrounds and objects cleanly using segmentation while preserving motion smoothness.
  • Game prototyping: generate scene clips with consistent geometry and motion for level design references.
  • Surveillance and traffic analysis: improved motion and object understanding for safer city planning simulations.
  • Content accessibility: generate clear, simplified modality outputs (like depth or segmentation) to assist downstream tools.
Tags: multimodal video generation, multi-task learning, dynamic noise scheduling, conditional flow matching, modality-adaptive switcher, in-context learning, optical flow, depth estimation, segmentation, DensePose, skeleton-based control, Diffusion Transformer (DiT), OpenUni dataset, UniBench benchmark, zero-shot generalization