
VINO: A Unified Visual Generator with Interleaved OmniModal Context

Beginner
Junyi Chen, Tong He, Zhoujie Fu et al. · 1/5/2026
arXiv · PDF

Key Summary

  • VINO is a single AI model that can make and edit both images and videos by listening to text and looking at reference pictures and clips at the same time.
  • It connects a Vision-Language Model (VLM) to a Multimodal Diffusion Transformer (MMDiT) so all instructions and references become a shared set of tokens the generator can understand.
  • Special learnable query tokens act like helpful notes that improve instruction following, make training steadier, and fix many multimodal conflicts.
  • A token-boundary mechanism reuses the VLM’s visual start/end markers inside the diffusion model’s latent stream, keeping identities and attributes consistent across images and videos.
  • VINO uses both high-level VLM features and low-level VAE latents, so it can follow directions while preserving fine details for precise edits.
  • A progressive three-stage training plan teaches the model step by step: start from text-to-video, handle long and short prompts, then learn multi-task image/video generation and editing.
  • On image and video benchmarks, VINO keeps strong base-generation quality while adding powerful editing, identity preservation, and multi-reference control.
  • Ablation studies show learnable tokens stabilize training, image-guidance scales balance fidelity vs. motion, and special latent-separating tokens prevent early-frame artifacts.
  • The approach shows a practical path to truly unified visual creation where text, images, and videos can be mixed and matched in-context.
  • This unified design reduces the need for separate models, makes creative workflows simpler, and points toward general-purpose multimodal generators.

Why This Research Matters

Creative people won’t need to juggle many separate tools to go from an idea to an edited video—one model can do it all. Teams can keep a character’s identity the same across posters, trailers, and social clips without manual fixes. Teachers and students can mix text with example images or short clips to create visual stories quickly and consistently. Companies can streamline production pipelines, cutting costs and reducing handoffs between incompatible systems. Researchers gain a clean testbed to explore richer multimodal reasoning because inputs are interleaved and grounded in one place. As stronger VLMs appear, the same framework can immediately gain better instruction following and semantic control. Overall, this points toward everyday, general-purpose visual assistants that understand what you say and what you show—and create exactly what you mean.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how your school backpack holds books for math, science, and art, so you don’t need a different bag for each class? Imagine if making pictures and videos with AI could be like using one backpack for everything.

🥬 The World Before: For a long time, AI artists used a bunch of separate "bags"—one model for turning text into pictures, another for turning text into videos, and others just for editing. These worked well on their own, but they didn’t talk to each other. If you wanted to create a video based on a photo style, then edit it, you often had to hop between different tools, convert formats, and hope everything stayed consistent. Meanwhile, multimodal assistants could understand both text and images, but when it came to actually generating high-quality visuals, they still handed the job off to separate image/video generators.

🥬 The Problem: Two big roadblocks stood in the way of one model doing it all. First, different tasks speak different “languages”: generation tasks use rich, descriptive captions ("a red robot riding a bike at sunset"), while editing tasks use short, to-the-point instructions ("make the robot blue, keep the background"). Second, when you give a model multiple signals at once—text, reference images, even reference videos—most models get confused about what to prioritize. They might mix up identities, change the wrong object, or follow the text but ignore the reference picture’s exact look.

🥬 Failed Attempts: Earlier tries glued extra control modules onto an image model (like adding new arms to a robot), or built separate pathways per task or per modality. These helped for specific jobs (like pose control or depth maps) but didn’t scale to everything at once. Others tried connecting a perception model (understands text+images) to a generator, but the connection was too weak: high-level meaning came through, yet fine details (textures, local edits, identity) got lost. And when inputs arrived all mixed together, the model could not keep sources separate.

🥬 The Gap: What was missing was a single generator with a shared brain that could read all inputs together in one interleaved stream—and a neat way to keep different sources apart while still letting them work as a team. The generator needed to understand long and short instructions, hold onto identity details, and decide what to copy exactly and what to creatively fill in, across both images and videos.

🍞 Anchor: Think of a movie director who can read a script, study reference photos, watch example clips, and then shoot both stills and scenes—without hiring separate teams each time. That’s the dream the paper aims to achieve.

🍞 Hook: Imagine a bilingual friend who looks at a picture and reads a sentence, then explains both at once in simple words—that friend is like the next concept.

🥬 The Concept (Vision-Language Model, VLM): A VLM is an AI that understands images/videos and text together.

  • How it works:
    1. It reads text instructions and views reference images/videos.
    2. It turns them into a shared set of tokens that capture meaning.
    3. These tokens become guidance for the visual generator.
  • Why it matters: Without a VLM, the generator misunderstands mixed instructions (like which person to edit) or misses subtle attributes.

🍞 Anchor: If you say, “Replace the man’s glasses with the ones in Image 1,” the VLM links the words “glasses” with the exact glasses in the picture labeled Image 1.

🍞 Hook: You know how a Swiss Army knife folds many tools into one handle? That’s our next idea.

🥬 The Concept (Unified Visual Generator): A unified visual generator is one model that can make and edit both images and videos, guided by interleaved text and visuals.

  • How it works:
    1. Accepts text, images, and videos as one combined context.
    2. Uses one shared diffusion backbone to generate or edit.
    3. Keeps identities and styles consistent across tasks.
  • Why it matters: Without unification, users juggle multiple models and lose consistency between steps.

🍞 Anchor: You can feed it a pet’s photo and say “turn this into a superhero, then animate it jumping”—in one place, without switching tools.

02 Core Idea

🍞 Hook: Imagine building a Lego city where streets, cars, and people click together perfectly—even if they come from different sets. The big trick is the connectors that make every piece fit.

🥬 The Aha! Moment: Treat all control signals (text, images, videos, and even special learnable tokens) as one interleaved sequence of conditioning tokens, then drive a single diffusion backbone with them—while marking clear boundaries so the model knows what came from where.

🍞 Anchor: Like giving the chef one recipe card that mixes ingredients and notes from multiple dishes but with colored dividers, so the chef knows which part belongs to which dish.

🍞 Hook: Picture a librarian who shelves books, maps, and DVDs on one smart timeline so you can find and use them together.

🥬 The Concept (Multimodal Diffusion Transformer, MMDiT): MMDiT is the generator that denoises image/video latents by attending to the full interleaved context.

  • How it works:
    1. Start from noisy latents (images/videos in a compressed form).
    2. Attend to interleaved tokens from the VLM and VAE latents.
    3. Iteratively denoise to produce crisp frames.
  • Why it matters: Without this shared attention, different modalities would fight each other or get ignored.

🍞 Anchor: When you ask for “a rainy city scene, match Image 2’s coat style,” MMDiT looks at both the rain description and the coat reference while forming each frame.

🍞 Hook: Think of color-coded folders in a binder—blue for text notes, red for photo references, green for video clips—all sorted in one timeline.

🥬 The Concept (OmniModal Context): An interleaved, in-order stream that mixes text, images, videos, and learnable tokens, all with clear separators.

  • How it works:
    1. The VLM encodes everything into tokens.
    2. Special start/end markers wrap each visual source.
    3. The generator attends across the whole sequence.
  • Why it matters: Without interleaving and boundaries, sources blur together, causing identity swaps and wrong edits.

🍞 Anchor: If you include Image 1 (red hat) and Image 2 (blue scarf), the model knows which features to copy from which item when you say "give her the hat from Image 1 and the scarf from Image 2."
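
To make the idea concrete, here is a minimal sketch of how such an interleaved stream could be assembled. The helper function, the integer marker IDs, and the toy token values are illustrative assumptions; the paper reuses the VLM’s own <vision_start>/<vision_end> tokens rather than defining new ones.

```python
import torch

# Hypothetical integer IDs standing in for the VLM's <vision_start>/<vision_end> markers.
VISION_START = 0
VISION_END = 1

def build_omnimodal_context(text_tokens, visual_token_groups):
    """Interleave visual token groups and text into one ordered sequence,
    wrapping each visual source in start/end markers so sources stay separable."""
    pieces = []
    for group in visual_token_groups:
        pieces.append(torch.tensor([VISION_START]))
        pieces.append(group)                      # tokens for Image k / Video k
        pieces.append(torch.tensor([VISION_END]))
    pieces.append(text_tokens)                    # the instruction follows the references
    return torch.cat(pieces)

# Toy usage: two labeled reference images followed by a short instruction.
img1 = torch.arange(10, 14)      # stand-in for "Image 1" visual tokens
img2 = torch.arange(20, 24)      # stand-in for "Image 2" visual tokens
instruction = torch.arange(100, 105)
context = build_omnimodal_context(instruction, [img1, img2])
print(context.tolist())          # one flat, ordered stream the generator can attend to
```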

🍞 Hook: Imagine slipping a few sticky notes into your notebook that help you remember tricky parts during a test.

🥬 The Concept (Learnable Query Tokens): Extra trainable tokens inserted into the VLM input that learn to bridge high-level instructions and low-level generation details.

  • How it works:
    1. Add a small set of trainable tokens to the prompt.
    2. They learn to refine short/ambiguous instructions.
    3. Their features are passed to the generator alongside other tokens.
  • Why it matters: Without them, training gets noisier and the model is worse at precise edits and following short instructions.

🍞 Anchor: When the instruction is “make it vintage,” these tokens help the model reliably pick the right textures and tones—like film grain and faded colors—every time.
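
A minimal PyTorch sketch of the idea, assuming a hypothetical query count and embedding width (the paper’s exact numbers are not reproduced here):

```python
import torch
import torch.nn as nn

class LearnableQueries(nn.Module):
    """A small bank of trainable embeddings appended to the prompt embeddings;
    during training they learn to bridge short instructions and generation details."""
    def __init__(self, num_queries: int = 64, dim: int = 1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq_len, dim) coming from the VLM
        batch = prompt_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt_embeds, q], dim=1)   # queries ride along at the end

# Toy usage
embeds = torch.randn(2, 77, 1024)
extended = LearnableQueries()(embeds)
print(extended.shape)   # torch.Size([2, 141, 1024])
```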

🍞 Hook: Think of packing both a world map (big picture) and a magnifying glass (fine detail) for a treasure hunt.

🥬 The Concept (VAE Latents + Token Boundaries): Use VAE latents for detailed visual fidelity, and reuse the VLM’s start/end markers to wrap those latents so the same source stays grouped.

  • How it works:
    1. Encode reference images/videos into VAE latents (fine-grain detail).
    2. Place them in the sequence with matching start/end markers used by the VLM tokens.
    3. The generator learns both the meaning (VLM) and the exact look (VAE) from the same source.
  • Why it matters: Without this, the model forgets tiny details or mixes sources, causing attribute leakage.

🍞 Anchor: Asking “match the stripes from Image 1” keeps the exact stripe pattern and color because the semantic and latent tokens are bound together.

🍞 Hook: Imagine a coach who first teaches long plays, then short drills, then full games.

🥬 The Concept (Progressive Multi-Stage Training): A curriculum that starts from text-to-video, adapts from long to short prompts, and then trains all tasks together.

  • How it works:
    1. Align the VLM output to the base video model with a simple connector.
    2. Mix long and short prompts so the model handles both styles.
    3. Train multi-task generation/editing with mixed data.
  • Why it matters: Without this progression, the model either forgets its generative strengths or fails to learn concise, practical edits.

🍞 Anchor: After this training, a short edit like “make the shirt green, keep the logo” works just as well as a long, flowery prompt.
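
One way to picture the curriculum is as a staged training configuration. The stage names, trainable-module lists, and data ratios below are illustrative assumptions, not the paper’s exact recipe:

```python
# Illustrative staging only; exact step counts, modules, and data ratios are assumptions.
TRAINING_STAGES = [
    {   # Stage 1: align the VLM's outputs to the frozen video backbone
        "name": "connector_alignment",
        "trainable": ["mlp_connector"],
        "data": {"long_prompt_t2v": 1.0},
    },
    {   # Stage 2: adapt from long, descriptive prompts to short instructions
        "name": "prompt_adaptation",
        "trainable": ["mlp_connector", "learnable_queries"],
        "data": {"long_prompt_t2v": 0.5, "short_prompt_t2v": 0.5},
    },
    {   # Stage 3: joint multi-task generation and editing
        "name": "multi_task",
        "trainable": ["mlp_connector", "learnable_queries", "mmdit"],
        "data": {"t2i": 0.2, "t2v": 0.3, "image_edit": 0.25, "video_edit": 0.25},
    },
]

for stage in TRAINING_STAGES:
    print(stage["name"], "->", list(stage["data"]))
```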

03 Methodology

At a high level: Inputs (text + optional reference images/videos) → VLM encodes all into interleaved tokens (plus learnable tokens) → MMDiT reads those tokens and VAE latents with clear boundaries → Diffusion denoises latents into the final image or video.

Step 1: Collect and format the conditions

  • What happens: The system receives a system prompt, your instruction, and any reference images/videos. Visual inputs are labeled (Image 1, Image 2, Video 1) and placed before the instruction. Learnable query tokens are appended.
  • Why this step exists: Clear ordering and labels let the model refer to the right source (“use glasses from Image 1”). Without this, cross-references are fuzzy and edits target the wrong thing.
  • Example: “Replace the sunglasses worn by the person in the video with the strange-shaped ones in Image 1.” The prompt includes Video 1 and Image 1, in that order, then the instruction.
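
A small sketch of what this conditioning format might look like as plain text, before the visual placeholders are swapped for actual tokens. The tag names and labeling scheme are assumptions for illustration:

```python
def format_conditions(instruction, images=(), videos=(), system_prompt=""):
    """Label references (Video 1, Image 1, ...) and place them before the
    instruction. Placeholder tags mark where visual tokens would be spliced in."""
    parts = [system_prompt] if system_prompt else []
    for i, _ in enumerate(videos, start=1):
        parts.append(f"Video {i}: <video_placeholder_{i}>")
    for i, _ in enumerate(images, start=1):
        parts.append(f"Image {i}: <image_placeholder_{i}>")
    parts.append(instruction)
    return "\n".join(parts)

print(format_conditions(
    "Replace the sunglasses worn by the person in the video "
    "with the strange-shaped ones in Image 1.",
    images=["img1.png"],
    videos=["clip1.mp4"],
))
```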

🍞 Hook: Imagine a smart librarian filing books, photos, and movies into one index.

🥬 The Concept (VLM encoding): A frozen VLM reads the formatted prompt, producing multimodal token embeddings; a small MLP projects them to the generator’s space.

  • How it works:
    1. Tokenize text; tokenize visuals into VLM visual tokens.
    2. Add learnable query tokens at the end.
    3. Apply causal masking so each token attends only to the tokens before it, keeping the ordering consistent.
  • Why it matters: Without a strong encoder, the generator can’t correctly line up words with visuals.

🍞 Anchor: The phrase “Image 2’s scarf” is tied to the exact visual tokens for Image 2’s scarf.
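
A minimal sketch of the projection step, assuming hypothetical hidden sizes for the VLM and the generator; the real connector’s width and depth are not reproduced here:

```python
import torch
import torch.nn as nn

class ConnectorMLP(nn.Module):
    """Small projector that maps frozen-VLM hidden states into the diffusion
    backbone's conditioning space. Dimensions are placeholders."""
    def __init__(self, vlm_dim: int = 2048, gen_dim: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, gen_dim),
            nn.GELU(),
            nn.Linear(gen_dim, gen_dim),
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, vlm_dim) — last hidden states of the frozen VLM
        return self.net(vlm_hidden)

# Toy usage: project a fake VLM output so the generator can attend to it.
fake_vlm_out = torch.randn(1, 256, 2048)
cond_tokens = ConnectorMLP()(fake_vlm_out)
print(cond_tokens.shape)   # torch.Size([1, 256, 3072])
```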

Step 2: Add fine-detail visual latents with clear boundaries

  • What happens: References are also encoded by a VAE into latents (compact, detail-rich). These are inserted into the generator’s sequence, using the same start/end markers that wrapped the VLM’s tokens for that source.
  • Why this step exists: VLM tokens carry meaning; VAE latents carry textures and local structure. Both are needed for precise, identity-true edits.
  • Example: If Image 1 shows a yellow raincoat, the VAE latents preserve its exact shade and fabric texture, not just the idea of “yellow coat.”

🍞 Hook: Picture putting colored dividers before and after each section in a binder so pages don’t get mixed.

🥬 The Concept (Token-Boundary Mechanism): Reuse the VLM’s visual start/end markers to wrap the matching VAE latents.

  • How it works:
    1. For each visual source, place <vision_start> before and <vision_end> after its VLM tokens.
    2. Use the same markers (after projection) to wrap its VAE latents in MMDiT.
    3. Attention learns “these belong together,” avoiding cross-talk.
  • Why it matters: Without shared boundaries, the model mixes sources, causing identity swaps.

🍞 Anchor: “Give her Image 1’s hat” won’t accidentally copy Image 2’s scarf.
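
The boundary trick can be sketched as re-wrapping each source’s latent tokens with the same projected marker embeddings that wrapped its VLM tokens. The function and shapes below are illustrative assumptions:

```python
import torch

def wrap_latents_with_boundaries(latent_groups, start_embed, end_embed):
    """Wrap each reference's VAE latent tokens with the same (projected)
    start/end marker embeddings used for its VLM tokens, so semantic and
    latent tokens from one source stay grouped.

    latent_groups: list of (num_tokens, dim) tensors, one per reference
    start_embed, end_embed: (dim,) marker embeddings after projection
    """
    wrapped = []
    for latents in latent_groups:
        wrapped.append(start_embed.unsqueeze(0))   # <vision_start>
        wrapped.append(latents)                    # this source's latent tokens
        wrapped.append(end_embed.unsqueeze(0))     # <vision_end>
    return torch.cat(wrapped, dim=0)

# Toy usage: two reference images' latent tokens, kept clearly separated.
dim = 64
ref1, ref2 = torch.randn(16, dim), torch.randn(16, dim)
start, end = torch.randn(dim), torch.randn(dim)
seq = wrap_latents_with_boundaries([ref1, ref2], start, end)
print(seq.shape)   # torch.Size([36, 64]) = 2 * (16 tokens + 2 markers)
```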

Step 3: Arrange everything on a shared timeline with 3D RoPE

  • What happens: In the VAE branch, image and video latents (plus the noised target latents) are positioned with a unified 3D RoPE schedule along the time axis, separated by special tokens.
  • Why this step exists: The model must tell static references apart from moving targets and handle different lengths cleanly.
  • Example: A single image reference is a short blip on the timeline; a video reference is a longer segment; the target sequence comes last.

🍞 Hook: Think of placing snapshots and a short movie clip on one storyboard, with sticky notes marking start and end.

🥬 The Concept (3D RoPE Timeline): A shared positional layout that interleaves modalities in time.

  • How it works:
    1. Assign temporal positions to each modality segment.
    2. Insert special tokens between segments.
    3. Let attention use these cues to avoid misreading static images as part of a video’s motion.
  • Why it matters: Without it, first frames can warp or borrow structure from references.

🍞 Anchor: The model won’t mistakenly animate the static reference photo as if it were the first frames of the target video.
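
A toy illustration of the timeline layout (not the actual RoPE math): static references get one slot, video references span several, the noised target comes last, and gaps are left for separator tokens. Names and lengths are assumptions.

```python
def assign_temporal_positions(segments):
    """Give every segment a span on one shared time axis; the real model
    feeds these positions into 3D RoPE.

    segments: list of (name, num_frames) in the order they appear
    """
    positions, t = {}, 0
    for name, num_frames in segments:
        positions[name] = list(range(t, t + num_frames))
        t += num_frames + 1   # +1 leaves room for a separator token
    return positions

layout = assign_temporal_positions([
    ("Image 1", 1),    # a single-frame blip on the timeline
    ("Video 1", 8),    # a longer reference segment
    ("target", 16),    # the noised latents being generated
])
print(layout)
```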

Step 4: Denoise with the Multimodal Diffusion Transformer (MMDiT)

  • What happens: The MMDiT attends across the full interleaved context—VLM tokens, learnable tokens, VAE latents with boundaries—and iteratively denoises noisy latents into the output frames.
  • Why this step exists: Shared attention lets the model balance text intent and visual fidelity in one pass.
  • Example: “Turn this dog into a watercolor hero like Image 1, then make a 5s clip of it running through puddles.”
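
A schematic sampling loop, assuming a hypothetical mmdit callable and a plain Euler update; real samplers (flow matching, DDIM, and similar) use more careful schedules and update rules than this sketch:

```python
import torch

@torch.no_grad()
def denoise(mmdit, context_tokens, latent_shape, num_steps=30):
    """Schematic sampling loop: start from noise and repeatedly let the model
    predict an update while attending to the full interleaved context."""
    latents = torch.randn(latent_shape)                  # noisy image/video latents
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        velocity = mmdit(latents, t, context_tokens)     # attends over VLM + VAE tokens
        latents = latents + (t_next - t) * velocity      # plain Euler update
    return latents

# Toy usage with a dummy "model" that predicts zeros.
dummy = lambda x, t, ctx: torch.zeros_like(x)
out = denoise(dummy, context_tokens=None, latent_shape=(1, 16, 4, 32, 32))
print(out.shape)
```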

Step 5: Train progressively from easy to hard

  • What happens: Start with a strong text-to-video base model. First, train a small connector to align VLM outputs with the backbone’s expected space. Next, mix long and short prompts to handle both. Finally, train all generation and editing tasks together with a staged data mixture.
  • Why this step exists: Jumping straight into everything causes forgetting and instability. The curriculum preserves base strengths and adds new skills smoothly.
  • Example: After just 1k steps in the final stage, VINO already beats many open-source image editors on ImgEdit.

🍞 Hook: Like tuning a radio dial between two stations—too much one way and you lose clarity.

🥬 The Concept (Classifier-Free Guidance, CFG, for images/videos): A knob at inference that balances fidelity to references vs. allowing motion and creativity.

  • How it works:
    1. Use text CFG for text strength; use image CFG to enforce identity.
    2. Higher image CFG = stronger identity but less motion.
    3. Pick moderate values for the best trade-off.
  • Why it matters: Without this control, outputs are either too loose (lose identity) or too stiff (frozen motion).

🍞 Anchor: With image CFG around 1.5, a person keeps their face and outfit while still moving naturally in the video.
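
One common way two guidance knobs can be combined at inference. The formulation and the text-CFG value are assumptions for illustration, not the paper’s published equation; only the image-CFG value around 1.5 comes from the text above.

```python
import torch

def multi_guidance(pred_uncond, pred_img, pred_full, cfg_text=6.0, cfg_image=1.5):
    """Combine two guidance knobs: image CFG pulls toward the reference-conditioned
    prediction (identity), text CFG pulls further toward the fully conditioned one."""
    return (
        pred_uncond
        + cfg_image * (pred_img - pred_uncond)
        + cfg_text * (pred_full - pred_img)
    )

# Toy usage with random tensors standing in for the three model predictions.
shape = (1, 16, 4, 32, 32)
u, i, f = torch.randn(shape), torch.randn(shape), torch.randn(shape)
print(multi_guidance(u, i, f).shape)
```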

The Secret Sauce:

  • Interleaved omni-modal context avoids building separate heads or modules per task.
  • Shared boundary tokens tie meaning (VLM) and detail (VAE) from the same source, preventing identity leaks.
  • Learnable query tokens calm training, sharpen instruction following (especially short prompts), and reduce gradient noise.
  • A progressive curriculum keeps the base model’s generative power while layering new multi-task skills.

04 Experiments & Results

The Test: The team checked if VINO can keep its strong image/video generation skills while gaining editing, reference following, and instruction obedience. They measured visual quality, semantic alignment, identity preservation, motion dynamics, and edit accuracy on standard benchmarks.

The Competition: VINO was compared to powerful systems—both open and closed—like HunyuanVideo (its base), SDXL, DALL·E 3, OmniGen2, VACE-14B, Lucy-Edit, and others. Many competitors still use separate models or pipelines, while VINO aims to unify everything.

The Scoreboard (with context):

  • Text-to-Video (VBench): VINO scores about 82.8 overall, roughly matching strong open-source leaders. That’s like getting a solid A after adding two new clubs (editing and references) without dropping your original grades. With an LLM prompt rewriter, it ticks up to 83.17 and improves semantic alignment notably.
  • Text-to-Image (Geneval): Base HunyuanVideo scores around 0.61 overall; VINO holds its own near 0.59 without rewriter and jumps to 0.75 with rewriter—like moving from a B to an A average among strong classmates when you let a tutor polish your prompts.
  • Subject-to-Video (OpenS2V): VINO’s total score is 57.85, edging or rivaling strong open and closed systems, with balanced motion and identity metrics—think of winning an all-around medal by being good at both dance (motion) and costume accuracy (identity).
  • Image Editing (ImgEdit): Even after just 1k steps in the final stage, VINO beats many open-source editors; after full training it reaches an average of 4.18 (on a 5-point scale), rivaling popular pipelines. That’s like joining mid-season and still making varsity, then becoming a starter by season’s end.
  • Image Editing (GEdit): VINO achieves balanced semantic consistency and perceptual quality (averages ~7.26 and ~7.71) across many edit types, competitive with strong open models and within touching distance of top closed systems where text-rendering isn’t required.
  • Video Editing (OpenVE-Bench): Judged by Gemini 2.5 Pro and Qwen3VL, VINO achieves the highest overall scores among open baselines across global style, background change, local edits, creative edits, and camera edits. That’s like topping the leaderboard on routines that require both smooth moves and rule-perfect form.

Surprising Findings:

  • Rapid Editing Gain: Despite receiving image-edit training late (Stage 3), VINO adapts unusually fast—only 1k steps in and it outperforms many seasoned editors. This suggests the unified architecture reuses its video priors for precise image edits.
  • Semantic Boost from Stronger VLMs: Replacing a weaker text encoder with a modern VLM improved semantic scores noticeably while preserving core quality—confirming that better “understanding” uplifts “making.”
  • Guidance Sweet Spot: Image CFG acts like a thermostat—too low and identity drifts; too high and motion freezes. A moderate middle produces the best balance.

Ablation Highlights (Why design choices matter):

  • Learnable Query Tokens: Training curves become smoother, gradients calmer, and edits more faithful. Removing them makes optimization jumpy and final results less precise.
  • Special Tokens Between Latents: Without separators, the model entangles static references with the video’s early frames, causing warps. With separators, first frames stay clean and stable.
  • Image CFG: Raises identity fidelity but can lock motion if too high; moderate values keep motion alive.

Takeaway: VINO retains the generative strength of its base model while adding robust unified editing and reference-following. The numbers say it’s not just a Swiss Army knife—it cuts well, too.

05 Discussion & Limitations

Limitations:

  • No Text Rendering: Because the base generator lacks text-drawing skills, tasks that need changing printed words or adding signs are disadvantaged.
  • Editing Data Quality: Instruction-edit datasets are smaller and simpler than big generation datasets. Mixing them in can slightly trim peak visual fidelity or motion richness.
  • Attention Cost: Full attention over long, interleaved sequences is quadratic. Many references (plus a video) raise memory and latency.
  • Modality Scope: VINO currently supports what the VLM handles best (text, images, video). Audio or 3D would require either conversion or stronger multimodal encoders.

Required Resources:

  • A strong VLM (e.g., Qwen3VL-4B-Instruction) and a capable video diffusion backbone (e.g., HunyuanVideo).
  • Multi-GPU training with memory optimizations (ZeRO-2, gradient checkpointing) to fit long sequences and video-heavy batches.
  • Curated, mixed datasets spanning long/short prompts, image/video generation, and diverse editing instructions.

When Not to Use:

  • If you need high-fidelity text rendering, logos, or OCR-heavy edits.
  • If you must edit extremely long videos with many references under strict latency constraints.
  • If your task is single-modality and ultra-specialized (a smaller, task-specific model might be cheaper and faster).

Open Questions:

  • Can we make attention more efficient (e.g., sparse or linear attention) without hurting multimodal grounding?
  • How can we build richer, higher-quality instruction-edit datasets that don’t trade off motion or fidelity?
  • Can the token-boundary trick expand to more modalities (audio, depth, 3D) while staying simple and robust?
  • How far can learnable query tokens go—can they learn roles (e.g., region pointers, temporal anchors) or adapt on-the-fly per task?
  • Can better in-context reasoning (e.g., longer chains of references and rules) improve multi-identity edits even further?

06 Conclusion & Future Work

Three-Sentence Summary: VINO is a single visual generator that unifies image and video creation and editing by feeding interleaved text and visual tokens into one diffusion backbone. It binds high-level meaning (VLM tokens) and fine detail (VAE latents) from the same sources with shared boundary markers, and steadies training using learnable query tokens. A progressive curriculum preserves base generation strengths while adding powerful, precise multimodal editing and reference-following.

Main Achievement: Showing that interleaved, in-context computation—plus simple but clever hooks like shared boundary tokens and learnable queries—can turn one backbone into a capable all-in-one visual creator without bolting on modality-specific modules.

Future Directions: Make attention more efficient for longer, heavier contexts; expand to more modalities (audio, 3D, depth, physics cues); collect richer edit datasets; and explore smarter learnable tokens that act like dynamic tools (e.g., region selectors, motion stylers). Adding text-rendering capability would also unlock full-score performance on edit suites that include signage or typography.

Why Remember This: VINO turns the messy toolbox of separate models into one well-organized workbench. It proves that carefully interleaving signals—and marking what belongs together—lets a single system follow complex instructions, preserve identities, and work across images and videos. That’s a practical step toward general-purpose visual creation where you simply say what you want, show a few examples, and the model does the rest.

Practical Applications

  • Brand asset creation: Generate a mascot image, then animate it into short ads while preserving identity and style.
  • Film pre-visualization: Combine a script snippet with reference stills to produce storyboard images and quick motion previews.
  • E-commerce: Edit product photos (color, background, materials) and make short showcase videos with consistent looks.
  • Education: Turn lesson prompts and example pictures into visuals and short explainer clips for classrooms.
  • Social media content: Remix a selfie with a style reference and create a matching video reel in one pass.
  • Game development: Prototype character skins and motion teasers from concept art and short text notes.
  • Marketing localization: Keep the same spokesperson’s identity while changing outfits, backgrounds, or styles per region.
  • News and documentaries: Produce consistent B-roll visuals aligned with text scripts and archival photo references.
  • Design iteration: Rapidly explore edits (replace, add, remove, restyle) before committing to final assets.
  • Virtual influencers: Maintain face and outfit consistency across image posts and video stories with minimal manual touch-up.
#VINO #unified visual generator #multimodal diffusion transformer #vision-language model #interleaved omnimodal context #learnable query tokens #VAE latents #token boundary mechanism #3D RoPE #classifier-free guidance #video editing #image editing #identity preservation #instruction following #reference-guided generation