Composing Concepts from Images and Videos via Concept-prompt Binding
Key Summary
- This paper introduces BiCo, a one-shot way to mix ideas from images and videos by tightly tying each visual idea to the exact words in a prompt.
- It uses small plug-in binders attached to a Text-to-Video diffusion transformer so specific prompt tokens carry the look, style, or motion they represent.
- A Diversify-and-Absorb Mechanism creates many versions of the training prompt while a special 'absorbent' token soaks up extra details that don't match the chosen concepts.
- A Temporal Disentanglement Strategy learns video concepts in two stages, first as still frames (spatial), then as full videos (temporal), so image and video concepts combine smoothly.
- The binder is hierarchical: a global binder learns broad connections, and per-block binders fine-tune token-concept links at each transformer block.
- On 40 test cases, BiCo beats past methods in prompt following, concept preservation, and motion quality, with a large lead in human-rated overall quality.
- It can compose not just objects but also styles, lighting, and motions, and even multiple concepts from a single input without requiring masks.
- BiCo enables creative tasks like mixing a dog from one image, a city at night from another, and a volcano's motion from a video, all guided by a single composed prompt.
- The method is lightweight to train (one-shot per source) and works with modern T2V models using cross-attention.
- Limitations include treating all prompt tokens equally and struggling with common-sense reasoning in tricky scenes.
Why This Research Matters
BiCo makes video creation more like assembling trusted parts than guessing with prompts, which saves time and preserves the exact look, style, and motion you want. It empowers solo creators to remix assets from photos and videos into new scenes without complex tools like masks. For studios, it improves continuity across shots by letting teams consistently reuse subjects, styles, and motion patterns. Educators can build engaging, accurate demonstrations by composing styles and motions that match their lessons. Marketers can prototype concepts quickly while keeping brand identity intact. Because it works in one-shot and at the token level, it lowers the barrier to high-quality, flexible video generation. This shift could accelerate visual storytelling across social media, advertising, film previsualization, and education.
Detailed Explanation
01 Background & Problem Definition
You know how movie makers cut scenes from different clips (a hero from one shot, a stormy sky from another, and the music from a third) and mix them into one powerful trailer? They pick the best bits and blend them so everything feels like one story. AI artists want to do that too: take a dog from an image, a glowing neon style from another, and a gentle camera pan from a video, then create one smooth, new video.
The Concept (The world before): Text-to-Video (T2V) diffusion models got really good at making videos from text prompts. But when creators tried to personalize these models (say, keep a specific dog's look while adding a new motion), they ran into trouble. Older tricks, like training small adapters (LoRA) or adding learnable text embeddings, could learn a single subject or a style, but they struggled to pick apart complex, overlapping ideas without help like masks. And most methods were locked into simple combos like "subject from image + motion from video," not more flexible mixes like "style from one video + lighting from another + subject from an image."
Why it matters: Without good concept composition, creators either compromise on the final look or spend hours hand-editing, and still risk losing the exact subject identity, style, or motion they want.
Example: Imagine wanting "a beagle bartender (from a photo) shaking a drink (from a video) in Minecraft style (from another clip)." Before, you could sort of move the beagle, but keeping the bartender tools, the Minecraft blocky look, and the shaking rhythm all at once was messy.
You know how a recipe tells you which ingredients are salt, flour, or butter, and you can swap in gluten-free flour but keep the same butter? That's clear labeling.
The Problem: AI prompts are just words, but the model has to guess which word should carry which visual idea. If the model can't tightly link "beagle" to the exact fur pattern or "Minecraft" to the blocky visuals, ideas get tangled: the dog's face melts, the style leaks, or the motion becomes jerky. Existing methods either (1) require masks (hard for styles and overlaps), (2) learn a few big adapters that can't be mixed freely, or (3) only support a narrow form of composition.
Why it matters: Without precise word-to-visual binding, changing one part (like adding "at night") accidentally scrambles others (like the subject's identity).
Example: If you say "A wooden boat on a beach from twilight to nightfall," the model should treat "boat" as the object, "beach" as setting, and "twilight to nightfall" as motion-time. If those get mixed, you might end up with a beach turning into a boat.
Imagine making a collage without scissors: you just smear pictures together and hope it looks right.
Failed Attempts: Prior works used LoRA stacks, fused multiple adapters, or predicted subjects and motions jointly. These often: (a) couldn't specify which exact concept to extract, (b) drifted from the original subject details, (c) needed multiple examples (not one-shot), or (d) didn't support flexible token-level mixing. Image-only methods that did allow some mixing didn't generalize well to videos (time adds a whole new challenge).
Why it matters: Creators want plug-and-play flexibility: pick any concept from any source (image or video), attach it to the right word, then compose a fresh prompt that just works.
Example: You want "line-art style" from one image and "slow zoom-in" from a video, but older tools can't pull a style cleanly without masks, and can't combine timing well.
Think of a backpack with labeled pockets. If pockets are unlabeled, small important items get lost.
The Gap: What was missing was a universal, token-level way to bind visual concepts directly to the exact prompt words, so later you could just assemble a new sentence using these enriched tokens and the model would know exactly what to draw and animate.
Why it matters: With solid bindings, you can edit or mix concepts like Lego bricks: no masks, no joint training across all assets, and no guessing.
Example: Bind "butterfly" to the exact blue-purple wing pattern from a photo, bind "eruption" to a specific lava plume motion from a video, then compose: "A butterfly on a yellow flower, volcano erupting in the background," and get a cohesive scene.
You know how cartoons separate background art from character cels so characters can move without redrawing the whole scene?
Real Stakes: For solo creators, this saves time and preserves identity, style, and motion accurately. For studios, it means reusing assets consistently across shots: the same hero, the same mood, the same camera moves, mixed in new ways, fast.
Why it matters: It can power advertising mockups, film previsualization, education content, and social media storytelling, unlocking reliable, flexible creativity.
Example: A teacher could compose a science video: "A pixel-art comet streaks across the sky above a realistic cityscape at dusk," pulling style from one image, the city from another, and motion from a time-lapse clip.
02 Core Idea
Imagine you have magnet labels for each word in your sentence ("dog," "volcano," "Minecraft"), and for each label you stick on the exact visual idea you want from a picture or video. Now when you write a new sentence, those labels bring their visuals with them.
The Aha! Moment (one sentence): Bind visual concepts directly into the prompt tokens using small "binder" modules, then compose a new prompt by selecting and mixing those bound tokens from different sources, so the model knows exactly what to draw and animate for each word.
Three Analogies:
- Lego Bricks: Each token becomes a Lego brick that already contains the correct color and shape (concept). You can snap bricks from different sets together to build something new without repainting.
- Recipe Cards: Each ingredient card ("vanilla," "chocolate," "whipped cream") carries its exact flavor from specific brands; when you mix cards, you get predictable taste.
- Labeled Playlist: Each song tag ("drums," "vocals," "bass") holds the right sound sample; rearranging tags makes a new track without losing quality.
Why It Works (intuition):
- Diffusion Transformers use cross-attention to let visual tokens look at text tokens. If we upgrade text tokens so they already "carry" the desired visuals (appearance, style, motion), the model doesn't have to guess.
- A hierarchical binder tunes tokens at two levels: a global pass for broad alignment and per-block passes for fine details at different denoising stages.
- Prompt diversification trains the binders to stay loyal to the key concept words under many phrasings, while an absorbent token soaks up extra details that don't belong to any chosen concept.
- Video learning is split: first learn spatial looks from frames (like images), then add a temporal branch for motion, keeping image and video concepts compatible.
Before vs After:
- Before: "Dog" might look different each time; "volcano" style leaks into the sky; motion looks jittery; image and video assets don't mix easily.
- After: The "dog" token reproduces your exact beagle; the "volcano" token triggers the specific lava plume; styles and lighting travel with their tokens; image and video concepts combine smoothly.
Building Blocks (each with the Sandwich pattern):
You know how you first pick a toolbox before fixing a bike? Diffusion Transformers (DiT): DiT is a transformer-based engine that denoises noisy video frames step-by-step to make a clean video. How it works: (1) Start with noise, (2) pass through many transformer blocks, (3) at each block, refine the guess of the clean video, (4) use text via cross-attention to guide what to draw. Why it matters: Without DiT, there's no strong, scalable video generator to follow our token instructions. Anchor: Generating "a boat at sunset" from random noise becomes possible as DiT repeatedly cleans the image while listening to the prompt.
Imagine a spotlight on the main actor while the background dims. Attention Mechanism: Attention lets the model focus on the most relevant parts of the input, here focusing the visual features on the right prompt tokens. How it works: (1) Compare each visual piece to each word, (2) give higher weights to better matches, (3) blend information accordingly. Why it matters: Without attention, the model treats "the" like "volcano," causing confusion. Anchor: Asking for "a red car" makes the model attend more to "red" and "car" than to filler words.
Think of telling a friend a story and they animate it for you. Text-to-Video (T2V) Models: T2V models turn text prompts into short videos by guiding denoising with your words. How it works: (1) Encode text, (2) inject it via cross-attention into the denoiser, (3) produce frames over time. Why it matters: Without T2V, we can't create videos from descriptions or compose concepts in motion. Anchor: "A butterfly flapping on a flower" becomes a moving scene.
Picture naming your backpack pockets: "dog," "style," "motion." You tuck the right items into each pocket. BiCo (Bind & Compose): BiCo binds visual concepts to their exact prompt tokens so later you can compose a new prompt by selecting tokens from different sources. How it works: (1) Train small binders per source to attach visuals to tokens, (2) pick which bound tokens you need, (3) assemble them into your final prompt, (4) generate. Why it matters: Without binding, concepts leak and mix incorrectly. Anchor: Use the "dog" token from a photo, the "Minecraft" token from another image, and the "eruption" token from a video to make a new scene.
Think of a coach giving a team plan (global) and per-player tips (per-block). Hierarchical Binder Structure: Small MLP modules update prompt tokens globally and per transformer block so tokens carry the right visuals at every denoising stage. How it works: (1) Global binder adjusts tokens, (2) per-block binders fine-tune them for each block, (3) for videos, add a temporal branch. Why it matters: Without this, tokens won't preserve details consistently across the whole generation process. Anchor: The "bird" token keeps its exact colors and shape from start to finish.
Imagine asking the same question in many ways to be sure your friend understands the key words. Diversify-and-Absorb Mechanism (DAM): Create many prompt variations keeping key concept words fixed, and add an "absorbent" token that soaks up extra details not tied to those concepts. How it works: (1) A VLM extracts key spatial/temporal concepts, (2) it composes diverse prompts, (3) train with an extra absorbent token to capture leftovers, (4) drop that token at inference. Why it matters: Without it, stray details stick to the wrong words. Anchor: "dog" stays the dog; leaves rustling or random background clutter get absorbed instead of polluting "dog."
Think of learning a dance: first poses (still frames), then timing (motion). Temporal Disentanglement Strategy (TDS): Learn video concepts in two stages, first frames-only for looks, then full videos with a spatial+temporal dual-branch binder and a gate. How it works: (1) Stage 1 learns spatial looks like images, (2) Stage 2 adds a temporal MLP, (3) a gate mixes spatial/temporal paths, starting from zero for stable learning. Why it matters: Without TDS, image and video concepts clash. Anchor: A city's look (buildings, colors) is learned first; then a smooth day-to-night time-lapse is added cleanly.
03 Methodology
High-Level Recipe: Input (images/videos + their prompts) → Concept Binding (train binders to attach visuals to exact tokens) → Concept Composition (choose which bound tokens to use and assemble a new prompt) → T2V Generation (Diffusion Transformer makes the video).
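Before diving into the individual steps, here is a pseudocode-level sketch of that recipe. The function names, signatures, and media paths below are illustrative placeholders rather than BiCo's actual API, and the bodies are stubs.

```python
# Hypothetical end-to-end flow. Names and paths are placeholders, not BiCo's real API.

def bind(source_media, source_prompt):
    """One-shot: train a small binder that ties this source's visuals to the
    concept words of its prompt, and return that binder."""
    ...

def compose(target_prompt, binders):
    """Route each concept word of the target prompt through the binder learned
    from its chosen source, then merge everything into one token sequence."""
    ...

def generate(composed_tokens, t2v_model):
    """Run the frozen T2V diffusion transformer conditioned on the composed,
    concept-carrying prompt tokens."""
    ...

# Usage sketch: subject from one image, style from another image, motion from a video.
binders = {
    "dog": bind("dog_photo.jpg", "a beagle dog sitting"),
    "Minecraft": bind("minecraft_frame.png", "a scene in Minecraft style"),
    "eruption": bind("volcano_clip.mp4", "a volcano eruption with a rising plume"),
}
tokens = compose("A Minecraft-style dog watches a volcano eruption", binders)
video = generate(tokens, t2v_model="Wan2.1-T2V-1.3B")
```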
Step 1: Text Conditioning Refresher
- What happens: The base T2V model (Wan2.1-T2V-1.3B) uses cross-attention so latent video tokens look up the text prompt tokens as keys/values at each transformer block.
- Why it exists: It's how the model listens to the prompt when denoising.
- Example: In "A wooden boat on a beach from twilight to nightfall," "boat," "beach," and "twilight to nightfall" are the tokens that should guide object, setting, and motion (a minimal cross-attention sketch follows).
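A minimal single-head PyTorch sketch of this text conditioning: video latents form the queries, and the prompt tokens (later, the binder-updated tokens) provide the keys and values. Shapes and weights are toy values for illustration; the real Wan2.1 blocks are multi-head and much larger.

```python
import torch

def text_cross_attention(video_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention sketch: video latents query the prompt.

    video_tokens:  (B, N_vid, D) noisy latent tokens inside a DiT block
    prompt_tokens: (B, N_txt, D) text embeddings (later: binder-updated tokens)
    """
    q = video_tokens @ w_q                       # queries come from video latents
    k = prompt_tokens @ w_k                      # keys come from prompt tokens
    v = prompt_tokens @ w_v                      # values come from prompt tokens
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                              # each video token mixes in text info

# Toy shapes; D stands in for the model width (illustrative, not Wan2.1's real size).
B, N_vid, N_txt, D = 1, 16, 8, 64
out = text_cross_attention(
    torch.randn(B, N_vid, D), torch.randn(B, N_txt, D),
    torch.randn(D, D), torch.randn(D, D), torch.randn(D, D),
)
print(out.shape)  # torch.Size([1, 16, 64])
```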
Step 2: Hierarchical Binder Structure (Concept Binding). Hook: Think of a school system: a principal sets overall rules (global), teachers tailor lessons for each class (per-block). The mechanism: Attach tiny MLP "binders" to the prompt tokens (a code sketch follows this list).
- Global Binder: Updates all prompt tokens once (residual MLP with learnable scale γ starting at 0) to lay down broad concept-word alignment.
- Per-Block Binders: For each DiT block i, another MLP further refines tokens before that block's cross-attention, preserving details at the right stage of denoising.
- Video Dual-Branch: For videos, each binder's MLP splits into spatial and temporal branches in Stage 2, with a learnable gate g(p) that starts at zero (so we begin purely spatial and smoothly introduce temporal learning). Why it matters: Details stick better across the denoising cascade; motion concepts don't wash out texture, and texture doesn't ruin motion. Anchor: The "bird" token keeps its feather pattern (spatial) while gaining smooth flapping timing (temporal).
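A minimal PyTorch sketch of the binder structure just described: a residual MLP (LayerNorm, two linear layers, GELU) with a zero-initialized scale γ, instantiated once globally and once per DiT block. The dimensions and the exact wiring into the frozen backbone are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BinderMLP(nn.Module):
    """Residual token update: p <- p + gamma * MLP(LayerNorm(p)); gamma starts at 0,
    so the binder is an identity map at the beginning of training."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.gamma = nn.Parameter(torch.zeros(1))      # learnable scale, zero-initialized

    def forward(self, tokens):                          # tokens: (B, N_txt, dim)
        return tokens + self.gamma * self.mlp(self.norm(tokens))

class HierarchicalBinder(nn.Module):
    """One global binder plus one binder per DiT block; the frozen backbone that
    consumes each block's refined tokens is not shown."""
    def __init__(self, dim, hidden, num_blocks):
        super().__init__()
        self.global_binder = BinderMLP(dim, hidden)
        self.block_binders = nn.ModuleList([BinderMLP(dim, hidden) for _ in range(num_blocks)])

    def forward(self, prompt_tokens):
        p = self.global_binder(prompt_tokens)           # broad concept-word alignment
        # One refined copy per block, fed to that block's cross-attention.
        return [binder(p) for binder in self.block_binders]

tokens_per_block = HierarchicalBinder(dim=64, hidden=256, num_blocks=4)(torch.randn(1, 8, 64))
print(len(tokens_per_block), tokens_per_block[0].shape)
```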
Step 3: Two-Stage Inverted Training Strategy (for binders). Hook: When cleaning a very messy room, you first tackle the big piles before the tiny crumbs. The mechanism: Train binders in two stages and bias early training toward higher noise levels (a sampling sketch follows this list).
- Stage A: Train only the global binder; sample more high-noise steps (threshold α=0.875) so the binder learns strong, robust alignments first.
- Stage B: Train both global and per-block binders over all noise levels normally for fine detail. Why it matters: Anchoring good global behavior early makes later fine-tuning stable and effective. Anchor: Starting with blurrier frames, the binder learns "what is dog vs background" before refining fur strands.
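One way the inverted noise emphasis could be implemented is as a biased timestep sampler. Only the threshold α=0.875 comes from the text; the 0.8 mixing probability and the 1000-step schedule below are illustrative assumptions.

```python
import torch

def sample_timesteps(batch, num_steps=1000, stage="A", alpha=0.875, high_noise_prob=0.8):
    """Illustrative timestep sampler for the two-stage inverted training.

    Stage A: draw most timesteps from the high-noise band [alpha * T, T) so the
    global binder first learns coarse concept-word alignment.
    Stage B: uniform over all noise levels for fine detail.
    Only alpha = 0.875 comes from the text; the rest is assumed.
    """
    if stage == "A":
        use_high = torch.rand(batch) < high_noise_prob
        high = torch.randint(int(alpha * num_steps), num_steps, (batch,))
        low = torch.randint(0, num_steps, (batch,))
        return torch.where(use_high, high, low)
    return torch.randint(0, num_steps, (batch,))        # Stage B: uniform sampling

print(sample_timesteps(8, stage="A"))
print(sample_timesteps(8, stage="B"))
```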
Step 4: Diversify-and-Absorb Mechanism (DAM). Hook: If you want someone to remember a word, you practice saying it in many sentences, but keep the word itself the same. The mechanism: Use a VLM (e.g., Qwen2.5-VL) to (1) extract key spatial and temporal concepts and (2) compose many diverse prompts that always keep those key words. Train with an extra learnable absorbent token appended to the prompt sequence (sketched in code below).
- What happens: The absorbent token "drinks up" any visual detail not represented by the fixed key words, stopping leakage into the concept tokens.
- Why it exists: One-shot training is fragile; this stabilizes which token owns which visual.
- Example: For a dog in a park video, "dog," "grass," "sunny," and "walking" remain fixed across phrasings like "A happy dog walks on sunny grass" vs "On sunny grass, a dog strolls." The absorbent token captures stray leaves or bystanders.
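A minimal sketch of the absorbent token: one learnable embedding appended to the prompt sequence during binder training and simply dropped at inference. The initialization scale and exact placement are assumptions.

```python
import torch
import torch.nn as nn

class AbsorbentToken(nn.Module):
    """One learnable token appended to the prompt sequence during binder training;
    it soaks up visual details not owned by the fixed concept words and is dropped
    at inference. The 0.02 init scale is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)

    def forward(self, prompt_tokens, training=True):      # prompt_tokens: (B, N_txt, dim)
        if not training:
            return prompt_tokens                           # inference: no absorbent token
        extra = self.token.expand(prompt_tokens.shape[0], -1, -1)
        return torch.cat([prompt_tokens, extra], dim=1)    # (B, N_txt + 1, dim)

absorb = AbsorbentToken(dim=64)
print(absorb(torch.randn(2, 8, 64), training=True).shape)   # 9 tokens during training
print(absorb(torch.randn(2, 8, 64), training=False).shape)  # 8 tokens at inference
```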
Step 5: Temporal Disentanglement Strategy (TDS). Hook: Learn the look of a dance move from photos, then learn the rhythm with music. The mechanism: Make image and video concept learning compatible (a gating sketch follows this list).
- Stage 1 (video-as-images): Train binders on individual frames with only spatial prompts, just like image training.
- Stage 2 (full videos): Add temporal MLP branch for motion, fuse with spatial branch using a learnable gate initialized at zero.
- Why it exists: Avoid clashes between static looks and moving dynamics; let motion layer on top of a solid appearance base.
- Example: First learn "cityscape, harbor, mountains" (spatial); then "twilight to nightfall timelapse" (temporal) without corrupting city details.
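A sketch of the Stage 2 dual-branch fusion: the temporal branch is blended in through a zero-initialized gate so training starts purely spatial. The scalar tanh gate here is an assumption; the paper describes a learnable gate g(p) starting at zero.

```python
import torch
import torch.nn as nn

class DualBranchBinder(nn.Module):
    """Stage-2 video binder sketch: a spatial branch (carried over from Stage 1)
    plus a temporal branch blended in through a zero-initialized gate, so motion
    is introduced gradually on top of a stable appearance base."""
    def __init__(self, dim, hidden):
        super().__init__()
        def make_mlp():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, dim))
        self.spatial = make_mlp()        # appearance, learned in Stage 1
        self.temporal = make_mlp()       # motion, added in Stage 2
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tokens):           # tokens: (B, N_txt, dim)
        update = self.spatial(tokens) + torch.tanh(self.gate) * self.temporal(tokens)
        return tokens + update           # at init the gate is 0, so output is spatial-only

print(DualBranchBinder(dim=64, hidden=256)(torch.randn(1, 8, 64)).shape)
```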
Step 6: Concept Composition at Inference. Hook: Now you pick which labeled pockets (tokens) you want to wear today. The mechanism: Take your target prompt (e.g., "A wooden boat on a beach lies from twilight to nightfall"). For each concept word or phrase, route that token through the binder trained on the source you want. Merge the updated tokens into one composed prompt and feed that to the T2V model (a routing sketch follows below).
- Why it exists: Lets you flexibly mix concepts (subject from image A, style from image B, motion from video C) without masks or joint training.
- Example: "boat" → binder from a reference image with the exact wood texture; "twilight to nightfall" → binder from a timelapse video.
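An illustrative sketch of the routing step: each concept phrase's token positions are passed through the binder trained on its chosen source, and the updated spans are merged back into one composed prompt. The span bookkeeping, the identity stand-in binders, and the single-sequence view (rather than per-block copies) are simplifications and assumptions.

```python
import torch

def compose_prompt_tokens(prompt_tokens, token_spans, binders):
    """Illustrative token routing for composition at inference.

    prompt_tokens: (1, N_txt, D) embeddings of the target prompt
    token_spans:   which token positions each concept phrase occupies (hypothetical)
    binders:       the binder trained on the source chosen for each phrase

    Each concept span is replaced by its tokens after passing through that source's
    binder; the rest of the prompt is left untouched. (The real method also keeps
    per-block refined copies, which this single-sequence view omits.)
    """
    composed = prompt_tokens.clone()
    for phrase, (start, end) in token_spans.items():
        bound = binders[phrase](prompt_tokens)          # binder updates the sequence
        composed[:, start:end] = bound[:, start:end]    # keep only that concept's span
    return composed

# Usage sketch with identity functions standing in for trained binders.
identity = lambda x: x
tokens = compose_prompt_tokens(
    torch.randn(1, 12, 64),
    {"boat": (2, 3), "twilight to nightfall": (6, 10)},
    {"boat": identity, "twilight to nightfall": identity},
)
print(tokens.shape)  # torch.Size([1, 12, 64])
```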
Secret Sauce Summary:
- Token-Level Binding: Precise control comes from teaching the model exactly which word owns which visual.
- Hierarchy Across Blocks: Different denoising phases need different kinds of token info; per-block binders maintain fidelity.
- DAM (Diverse Prompts + Absorbent Token): Locks concept ownership and prevents contamination.
- TDS: Aligns image-like learning with video motion so concepts from both media types mix naturally.
Practical Training Details (gathered into a config sketch below):
- Base model: Wan2.1-T2V-1.3B.
- Binder MLPs: two linear layers with LayerNorm + GELU, residual with learnable scale γ (zero-initialized).
- Iterations: 2400 per stage; two stages.
- Noise emphasis: α=0.875 in Stage A.
- Inference length: 81 frames; other T2V hyperparameters as in Wan2.1.
- Hardware: NVIDIA RTX 4090 GPUs.
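The stated training details, gathered into one illustrative configuration dictionary. Field names are made up for readability; values not given in the text (learning rate, batch size, optimizer) are deliberately omitted rather than guessed.

```python
# Illustrative summary of the stated hyperparameters; keys are hypothetical.
BICO_TRAIN_CONFIG = {
    "base_model": "Wan2.1-T2V-1.3B",
    "binder_mlp": {"linear_layers": 2, "norm": "LayerNorm", "activation": "GELU",
                   "residual_scale_gamma_init": 0.0},
    "stages": {
        "A": {"iterations": 2400, "trains": ["global_binder"],
              "high_noise_threshold_alpha": 0.875},
        "B": {"iterations": 2400, "trains": ["global_binder", "per_block_binders"],
              "timestep_sampling": "all noise levels"},
    },
    "inference": {"num_frames": 81, "other_hyperparameters": "as in Wan2.1"},
    "hardware": "NVIDIA RTX 4090",
}
print(BICO_TRAIN_CONFIG["stages"]["A"])
```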
04 Experiments & Results
The Test: The authors built 40 test cases mixing one image and one video (e.g., from DAVIS and the Internet). They measured: (1) how well the output matched the prompt (CLIP-T), (2) how well it preserved visual concepts from both sources (DINO-I harmonic mean), and (3) human ratings for Concept Preservation, Prompt Fidelity, Motion Quality, plus Overall Quality.
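A sketch of how a DINO-I-style score with a harmonic mean over the two sources could be computed. The feature extraction, frame averaging, and exact pairing are assumptions; only the idea of taking a harmonic mean across both sources comes from the text.

```python
import torch
import torch.nn.functional as F

def harmonic_mean(a, b, eps=1e-8):
    return 2 * a * b / (a + b + eps)

def dino_i_score(gen_feats, image_source_feats, video_source_feats):
    """Harmonic-mean concept-preservation score: similarity of the generated
    video's features to the image source and to the video source, combined so
    that neglecting either source drags the score down."""
    sim_img = F.cosine_similarity(gen_feats, image_source_feats, dim=-1).mean()
    sim_vid = F.cosine_similarity(gen_feats, video_source_feats, dim=-1).mean()
    return harmonic_mean(sim_img, sim_vid)

gen = torch.randn(16, 384)   # e.g., per-frame DINO features (shapes illustrative)
print(dino_i_score(gen, torch.randn(16, 384), torch.randn(16, 384)))
```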
The Competition: BiCo was compared to four representative methods: Textual Inversion, DreamBooth-LoRA (DB-LoRA), DreamVideo, and DualReal. For fairness, some baselines were adapted to the same T2V backbone.
The Scoreboard (with context):
- CLIP-T (text-video alignment): BiCo 32.66 vs DualReal 31.60. That's like getting a slightly higher essay score when both are good, but it shows BiCo keeps better focus on the requested story.
- DINO-I (visual concept preservation across sources): BiCo 38.04 vs DualReal 32.78. That's a strong bump, like turning a B into a solid A for keeping the subject/style right.
- Human Ratings (out of 5):
  - Concept Preservation: BiCo 4.71 vs DualReal 3.10 (a big jump, like from a C+ to an A).
  - Prompt Fidelity: BiCo 4.76 vs DualReal 3.11 (sticking to instructions very well).
  - Motion Quality: BiCo 4.46 vs DualReal 2.78 (smoother, more natural motion).
  - Overall Quality: BiCo 4.64 vs DualReal 3.00 (about a 54.7% improvement, like leaping from an average to a near-excellent score).
Surprising/Notable Findings:
- Style and Non-Object Concepts: BiCo can extract and reuse styles (e.g., line art, pixel art) and lighting/time-of-day, not just objects. Many prior methods struggle here, especially without masks.
- One-Shot Works: Even with just one example per source, the bindings are accurate enough to compose scenes reliably, thanks to DAM and the hierarchical binders.
- Image+Video Compatibility: TDS pays off; concepts from images and videos combine naturally, avoiding the usual conflict where motion learning ruins appearance (or vice versa).
- Visual Examples: In motion transfer and creative style mixing, baselines either froze the video, drifted from the subject, leaked background details, or failed to follow the prompt tightly. BiCo held identity, style, and motion together more consistently.
Ablations (what components matter):
- Hierarchical binders greatly improved both concept fidelity and motion quality over a simple global-only baseline.
- Prompt diversification alone improved concept binding, and adding the absorbent token on top of it further reduced unwanted details and improved motion.
- TDS (two-stage, spatial then temporal) was crucial for high overall quality when mixing image and video concepts.
- Two-stage inverted training (focus on high-noise early) stabilized training; removing it hurt results noticeably.
Takeaway: Across automatic metrics and human judgments, BiCo wasn't just a tiny step better; it was often a full letter grade ahead, especially in the human sense of "does this look like what I asked for, with the right subject and smooth motion?"
05 Discussion & Limitations
Limitations (be specific):
- Equal Token Treatment: All tokens get the same binder treatment, but in reality some (subjects, actions) matter more than function words. This can cause concept drift when visually complex items (like a wild, whimsical hat) don't fit well into a single token's capacity.
- Common-Sense Reasoning: BiCo can mishandle logic-heavy edits (e.g., adding an extra leg to hold a prop instead of reusing a limb). It binds what you say, but doesn't deeply "reason" about anatomy or physics.
- One-Shot Sensitivity: While designed for one-shot, poor or ambiguous references (occlusions, low resolution) can still hurt binding accuracy.
- Motion Extremes: Highly non-standard or very fast, complex motions might be under-modeled in the temporal branch without more examples.
Required Resources:
- A modern DiT-based T2V model with cross-attention (e.g., Wan2.1-T2V-1.3B) and a single GPU like an RTX 4090 for short binder training (≈2400 steps per stage).
- A capable VLM (e.g., Qwen2.5-VL) to extract key concepts and write diversified prompts.
- Clean reference media (ideally one clear image or a short, representative video) per concept source.
When NOT to Use:
- If you require strict physical plausibility or complex multi-step reasoning (e.g., intricate interactions with tools, crowds, or anatomy), BiCo may create visually plausible but logically flawed outputs.
- If you need long-form narrative consistency across minutes of footage; BiCo targets shorter clips (e.g., ~81 frames) and local composition.
- If your concepts are extremely abstract (e.g., "freedom" with no visual anchor), binding to a token will be unstable.
Open Questions:
- Adaptive Token Weighting: Can we automatically detect and boost critical tokens (subjects, verbs) so they carry more representational capacity without drifting?
- Richer Temporal Modeling: How far can the dual-branch temporal path scale to complex multi-actor or camera motions without extra supervision?
- Better Reasoning: Could we loop VLM reasoning into composition time to catch anatomy/physics errors before generation?
- Robustness to Noisy Inputs: How well do bindings transfer when references are partially occluded or stylized beyond typical distributions?
- Beyond One-Shot: What minimal extra supervision (e.g., a few frames or a quick mask) provides the biggest quality jump without hurting BiCo's simplicity?
06 Conclusion & Future Work
Three-Sentence Summary: BiCo binds visual concepts to exact prompt tokens using small hierarchical binders, then composes new prompts by selecting bound tokens from multiple sources (images and videos) to generate coherent videos. With prompt diversification and an absorbent token, it locks each concept to the right word, and with temporal disentanglement, it learns motion on top of appearance so image and video concepts play nicely together. Experiments show large gains in concept preservation, prompt fidelity, and motion quality over prior methods, even in one-shot settings.
Main Achievement: Turning text tokens into reliable, reusable "concept carriers" that can be mixed and matched across sources, unlocking flexible, mask-free composition of objects, styles, lighting, and motions.
Future Directions: Add adaptive token importance so key words (subjects, actions) get more representational power; integrate stronger VLM reasoning at composition time to avoid logic errors; scale temporal modeling for complex, longer scenes; explore lightweight multi-shot tuning that preserves BiCo's flexibility.
Why Remember This: BiCo reframes composition as prompt-token engineering: once a word is truly bound to a look or motion, you can write new sentences to create new videos with confidence. That makes creative video building feel less like guesswork and more like assembling trusted parts, bringing fast, precise visual storytelling closer to everyone.
Practical Applications
- Personalized character reuse: Bind a specific character's identity once, then place them in new scenes with new motions and styles.
- Style transfer for video: Extract a sketch, pixel-art, or painterly style and apply it to different subjects and motions.
- Lighting and time-of-day control: Bind "twilight-to-nightfall" or "golden hour" and reuse it across many scenes for mood consistency.
- Motion borrowing: Take a dance, drift, or eruption motion from a video and apply it to new subjects.
- Asset library building: Create a token library (subjects, styles, motions) that teammates can compose into new prompts on demand.
- Text-guided editing: Keep parts of a scene (subject, background) via their bound tokens while swapping others via fresh text.
- Decomposition for cleanup: Extract and keep only certain concepts (e.g., dogs but not cats) from crowded inputs.
- Previsualization for film: Rapidly mix reference looks and camera moves to explore scene ideas before shooting.
- Education demos: Combine scientific styles (e.g., diagram look) with motions (e.g., orbit paths) for clear explanations.
- Marketing mockups: Quickly try brand assets with different environments and motions while preserving identity.