DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Key Summary
- DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.
- It teaches a Diffusion Transformer (DiT) to accept "arbitrary" keyframes anywhere on the timeline, not just the first and last frames.
- A smart "intermediate-conditioning" design and Adaptive Tuning fix a tricky mismatch caused by the video VAE so the model follows mid-video keyframes accurately.
- A Supervised Fine-Tuning (SFT) stage on a small but high-quality set boosts cinematic motion and improves prompt following.
- Tailored Direct Preference Optimization (DPO) uses positive/negative pairs to reduce abrupt cuts and make character motion more physically believable.
- A Shared-RoPE trick in the super-resolution model removes flicker and color shifts when raising resolution.
- Segment-wise Auto-Regressive (SAR) generation builds very long videos piece by piece while keeping memory use low and transitions smooth.
- Across tests, DreaMontage matches or beats strong commercial systems in motion and prompt-following, and clearly wins in multi-keyframe control.
- Ablations show each piece (SFT, DPO, Shared-RoPE) delivers clear, measurable gains, especially for smooth transitions and motion realism.
Why This Research Matters
This work makes it far easier for anyone to turn scattered images and short clips into one continuous, movie-like shot. Indie creators and students can prototype scenes that used to need big crews, fancy cameras, and lots of takes. Marketing teams can smoothly connect product shots and real footage for polished ads with less manual editing. Teachers can transform storyboards into engaging explainer videos that hit precise learning moments on a timeline. Game studios can blend posters, concept art, and gameplay into unified cutscenes. And because the method scales to long videos without breaking the flow, it opens doors to trailers, documentaries, and music videos that feel professionally stitched, without the stitching.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how some movies look like they were filmed in one continuous take, with no obvious cuts? That style feels super immersive, like you're walking through the scene yourself.
The Concept (One-Shot Video): A one-shot (long-take) video is a single continuous scene without visible breaks.
- How it works: (1) The camera keeps rolling, (2) the story keeps flowing, (3) and every moment connects smoothly to the next.
- Why it matters: If the flow breaks, the magic breaks too; the audience can feel lost or jolted out of the story. Anchor: Think of riding a bike down a long path without stopping: every pedal connects to the next. That's the feeling of a one-shot.
Hook: Imagine flipping through a photo album so fast that it becomes a mini-movie. If the pages don't line up, it looks jittery.
The Concept (Temporal Coherence): Temporal coherence means each frame connects naturally to the next so motion looks smooth and believable.
- How it works: (1) Track what's on screen, (2) keep shapes and colors consistent, (3) make movements evolve logically.
- Why it matters: Without it, videos flicker, objects morph weirdly, and scenes feel jumpy. Anchor: Like connecting puzzle pieces in order: when pieces fit, the whole picture feels right.
Hook: Imagine a super-talented art robot that can turn a story into a moving picture.
The Concept (Video Generation Models): These AIs create videos from text, images, or both.
- How it works: (1) Encode inputs (text/images) into a hidden language, (2) start from noisy video guesses, (3) refine the noise step-by-step into a crisp video.
- Why it matters: They let anyone make complex videos without cameras, actors, or big budgets. Anchor: Like a digital movie studio in your laptop that can act out your storyboard.
Hook: You know how a teacher can give you the first and last sentences of a story and ask you to fill in the middle?
The Concept (First–Last Frame Conditioning): Many models only use the starting and ending images to guide a video.
- How it works: (1) Fix the first frame, (2) fix the last frame, (3) the AI invents the middle.
- Why it matters: It's simple, but it can miss important moments in the middle and cause awkward jumps. Anchor: If you only know the beginning and the end of a magic trick, the middle might feel confusing.
Hook: Imagine you could pin pictures or short clips at any point in time and say, "Make the video pass through these spots."
The Concept (Arbitrary Frame Guidance): You can place any number of images or mini-clips at precise times to steer the whole video.
- How it works: (1) You choose timestamps, (2) supply frames/clips, (3) the AI weaves smooth motion that hits each target on time.
- Why it matters: It gives creators fine-grained control and avoids clumsy clip-stitching. Anchor: Like planning a road trip with exact stops; the car's path is smooth, but it still reaches each checkpoint.
Hook: Imagine shrinking a giant painting into a neat postcard without losing the important parts.
The Concept (Variational Autoencoder, VAE): A VAE compresses videos into a tiny code (latent space) and can expand them back later.
- How it works: (1) Encoder squeezes frames into latents, (2) the generator works in this compact space, (3) Decoder turns latents back into pixels.
- Why it matters: Working in latents makes generation faster and lighter on memory. Anchor: Like zipping a big file so your computer can handle it easily, then unzipping when you're done. (A shape-level sketch follows below.)
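To make the encode, work-in-latents, decode flow concrete, here is a toy sketch. It is not DreaMontage's actual VideoVAE (which is far larger and causal in time, and predicts a latent distribution rather than a single code); the layer sizes and the roughly 2x-temporal / 4x-spatial compression are assumptions chosen only to make the latent bottleneck visible.

```python
import torch
import torch.nn as nn

# Toy illustration of the encode -> latents -> decode shape flow.
# A real VideoVAE is much deeper, causal in time, and variational
# (it predicts a latent distribution); this miniature autoencoder
# only shows how a video shrinks into latents and expands back.

class TinyVideoVAE(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: compress time by 2x and space by 4x.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=(1, 2, 2), padding=1),
        )
        # Decoder: mirror the compression back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1),
        )

    def encode(self, video):    # video: (B, 3, T, H, W)
        return self.encoder(video)

    def decode(self, latents):  # latents: (B, C, T', H', W')
        return self.decoder(latents)

vae = TinyVideoVAE()
video = torch.randn(1, 3, 16, 64, 64)   # 16 frames of 64x64 RGB
latents = vae.encode(video)             # much smaller tensor the DiT works on
print(latents.shape)                    # torch.Size([1, 8, 8, 16, 16])
recon = vae.decode(latents)
print(recon.shape)                      # torch.Size([1, 3, 16, 64, 64])
```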
Hook: Picture a super-planner that edits noise into art by looking at everything at once.
The Concept (Diffusion Transformer, DiT): DiT is a model that removes noise step-by-step to form a video, using a Transformer to coordinate details across space and time.
- How it works: (1) Start with noisy latents, (2) repeatedly predict and remove noise, (3) use attention to keep scenes consistent.
- Why it matters: It creates sharp, coherent videos that respect both the prompt and the timeline. Anchor: Like sculpting a statue by sanding noise away until the figure appears. (A toy denoising loop is sketched below.)
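Below is a minimal sketch of the "predict and remove noise, step by step" loop, assuming a simple Euler-style update from pure noise toward a clean latent. A small MLP stands in for the much larger DiT, and the schedule and update rule are illustrative assumptions, not the model's actual solver.

```python
import torch
import torch.nn as nn

# Minimal iterative-denoising sketch: start from noise, repeatedly ask the
# network how to move toward clean data, take a small step, repeat.

class TinyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        # x: (B, dim) noisy latent, t: (B, 1) current time in [0, 1]
        return self.net(torch.cat([x, t], dim=-1))   # predicted direction toward clean data

@torch.no_grad()
def sample(model, dim=64, steps=20):
    x = torch.randn(1, dim)                      # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), 1.0 - i * dt)     # time runs from 1 (noise) to 0 (clean)
        v = model(x, t)                           # model predicts how to move x
        x = x + dt * v                            # Euler step toward the clean latent
    return x

clean_latent = sample(TinyDenoiser())
print(clean_latent.shape)  # torch.Size([1, 64])
```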
The World Before: Early video generators could make short clips from text or a single image but struggled with long, smooth one-shots. People tried gluing separate clips together, but transitions often felt like hard cuts, not fluid progress.
The Problem: We needed a model that could follow many mid-video hints (frames or clips) at exact times and still keep the motion smooth and believable.
Failed Attempts: First–last-only control missed the middle. Multi-keyframes helped but still produced flicker and abrupt scene changes, especially at higher resolution.
The Gap: Three big blockers stood in the way: (1) the VAE's temporal downsampling made mid-frame alignment imprecise, (2) big style or scene changes between conditions caused hard cuts, and (3) long videos blew up memory.
Real Stakes: This matters for filmmakers prototyping scenes, teachers turning storyboards into learning videos, game studios crafting cutscenes from mixed assets, and small teams making ads without giant budgets. A tool that stays smooth, follows directions, and runs efficiently can unlock a lot of creativity for everyone.
02 Core Idea
Hook: Imagine you're directing a school play and you place actors at certain marks on stage; then the play flows through each mark perfectly, without stopping the show.
The Concept (DreaMontage, the Aha!): DreaMontage is an AI recipe that lets you drop key images or short clips anywhere in time and still get one continuous, smooth, cinematic video.
- How it works: (1) Teach the DiT to accept mid-video conditions cleanly, (2) polish motion and style with high-quality examples (SFT), (3) align behavior with preferences to avoid cuts and weird motion (DPO), (4) generate long videos in memory-friendly chunks (SAR), (5) upsample with a flicker-free trick (Shared-RoPE).
- Why it matters: Without this, you either lose control of the middle or you get jerky, stitched-together results. Anchor: Like connecting comic panels you chose along a timeline, while the AI draws silky transitions in between.
Three Analogies:
- Train Tracks: Your keyframes are stations; DreaMontage lays smooth tracks so the train glides station to station without bumps.
- Choreography: You pin signature poses; the dancer (the AI) flows gracefully through each pose on beat.
- GPS with Waypoints: You set multiple stops; the route is scenic, continuous, and always hits each waypoint.
Before vs. After:
- Before: Models mostly respected the first and last frames; the middle could wobble, flicker, or hard-cut.
- After: You can plant many guide frames/clips anywhere; transitions are smooth, motion feels real, and very long shots are practical.
Why It Works (Intuition, no equations):
- Intermediate-Conditioning fixes the "speak the same language" problem between mid-frames and the DiT by concatenating condition latents directly with noise latents and adapting training so the model learns the correct alignment.
- Shared-RoPE in super-resolution makes the high-res model "agree" on where a condition belongs in time, so it boosts detail without boosting flicker.
- Visual Expression SFT feeds the model curated, cinematic moves so it learns stylish but stable motion.
- Tailored DPO shows the model pairs of "good vs. bad" examples, teaching it to dislike abrupt cuts and silly motions.
- SAR chops long videos into overlapping chunks that hand off context, so memory stays low and continuity stays high.
Building Blocks (new concepts with sandwiches):
Hook: You know how you can tape a reminder note onto a page so you see it right when you flip there?
The Concept (Intermediate-Conditioning via Channel Concatenation): The model sticks the condition's latent next to the noisy latent so the DiT can read both at once at that time.
- How it works: (1) Encode the condition image/clip to a latent, (2) concatenate along channels with the noise latent, (3) DiT learns to follow the condition precisely at that timestamp.
- Why it matters: Without it, mid-video instructions feel fuzzy or off-timed. Anchor: Like placing a sticky note exactly where you need it in a book so you never miss it. (See the tensor-level sketch below.)
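Here is a tensor-level sketch of the channel-concatenation idea, assuming a zero-padded condition track plus a 0/1 presence mask stacked with the noisy latents along the channel dimension. The exact channel layout in DreaMontage may differ; the shapes here are illustrative.

```python
import torch

# Minimal sketch of intermediate-conditioning by channel concatenation.
B, C, T, H, W = 1, 8, 8, 16, 16              # latent video: 8 latent frames
noisy_latents = torch.randn(B, C, T, H, W)    # current denoising state

# Condition latents: zeros everywhere except at the guided latent frames.
cond_latents = torch.zeros(B, C, T, H, W)
mask = torch.zeros(B, 1, T, H, W)             # 1 where a condition is present

guided_frames = {0: torch.randn(B, C, H, W),  # e.g. first-frame guide
                 5: torch.randn(B, C, H, W)}  # e.g. a mid-video keyframe
for t_idx, latent in guided_frames.items():
    cond_latents[:, :, t_idx] = latent
    mask[:, :, t_idx] = 1.0

# The DiT input simply stacks noise, condition, and mask along channels,
# so the model "sees" each guide exactly at its timestamp.
dit_input = torch.cat([noisy_latents, cond_latents, mask], dim=1)
print(dit_input.shape)  # torch.Size([1, 17, 8, 16, 16])
```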
Hook: Imagine tuning a guitar so it matches the song you're about to play.
The Concept (Adaptive Tuning): A quick, targeted training pass that aligns the model to handle mid-conditions well despite the VAE's quirks.
- How it works: (1) Filter one-shot-like training data, (2) re-encode single condition frames properly, (3) approximate video segments so the first frame is accurate and the rest are sampled, (4) fine-tune lightly.
- Why it matters: Without this, the model misreads mid-frames and drifts. Anchor: Like warming up your instrument before the concert so every note rings true.
Hook: Think of two dancers sharing the same beat so their moves stay synchronized.
The Concept (Shared-RoPE for Super-Resolution): Give the condition tokens the same positional "beat" as the target frames they guide so high-res upsampling stays stable.
- How it works: (1) Append condition tokens to the sequence, (2) assign them the same Rotary Position Embedding (RoPE) as their target time, (3) use only the first frame for video-conditions to keep compute low.
- Why it matters: Without it, SR amplifies tiny mismatches into flicker and color hops. Anchor: Like making sure the drummer and guitarist share the same metronome. (A position-assignment sketch follows below.)
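Below is a position-only sketch of the Shared-RoPE idea: the appended condition tokens simply reuse the temporal position index of the frame they guide, so their rotary angles match. The plain 1-D RoPE and the dimensions are simplifying assumptions; the real model applies full spatiotemporal RoPE inside attention.

```python
import torch

# Minimal Shared-RoPE sketch: appended condition tokens "live at" the same
# temporal position as the frames they guide, instead of getting new
# positions tacked onto the end of the sequence.

def rope_angles(positions, dim=8, base=10000.0):
    # Standard rotary-embedding angles for integer positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]   # (num_tokens, dim/2)

num_target_frames = 8
target_positions = torch.arange(num_target_frames)          # frames 0..7

# Two condition tokens guide frames 0 and 5; they share those positions.
cond_guided_frames = torch.tensor([0, 5])
all_positions = torch.cat([target_positions, cond_guided_frames])

angles = rope_angles(all_positions)
print(all_positions.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 5]
print(angles.shape)             # torch.Size([10, 4])
# Because the condition tokens rotate with the same angles as their target
# frames, attention treats them as belonging to that exact point in time.
```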
Hook: Imagine learning to draw cool action scenes by copying the best examples.
The Concept (Visual Expression SFT): Supervised fine-tuning on a small, hand-picked set of videos with strong motion and cinematic traits.
- How it works: (1) Curate categories like camera work, VFX, sports, transitions, (2) train with random mid-conditions, (3) boost motion expressiveness and instruction following.
- Why it matters: Without it, movement can feel timid or off-style. Anchor: Like practicing with a top-tier playbook so your moves look pro.
Hook: You know how a coach shows two replays, one good and one bad, and tells you which to imitate?
The Concept (Tailored DPO): Preference learning that pushes the model toward smooth transitions and realistic motion using paired examples.
- How it works: (1) Build "smooth vs. abrupt cut" pairs using a trained VLM, (2) build "realistic vs. weird motion" pairs with human help, (3) train to prefer the positives over a reference model.
- Why it matters: Without it, the model may keep bad habits like hard cuts or rubbery limbs. Anchor: Like learning by watching "do this, not that" highlight reels. (A sketch of the preference loss follows below.)
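For reference, the sketch below shows the standard DPO loss behind the "prefer the positives over a reference model" step. How DreaMontage actually scores video samples under the policy and the frozen reference, and which beta it uses, is not spelled out here, so those inputs are stand-ins rather than the paper's tailored formulation.

```python
import torch
import torch.nn.functional as F

# Generic DPO objective as a reference sketch. Inputs are log-scores of the
# preferred ("smooth transition", "natural motion") and rejected samples
# under the trained policy and a frozen reference model; beta is assumed.

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta: float = 0.1):
    # Margin by which the policy prefers the winner more than the reference does.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Dummy log-probabilities for a batch of 4 preference pairs.
pw, pl = torch.randn(4), torch.randn(4)
rw, rl = torch.randn(4), torch.randn(4)
print(dpo_loss(pw, pl, rw, rl))
```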
Hook: Think of writing a super-long story one chapter at a time, always rereading the last page so the new chapter fits.
The Concept (Segment-wise Auto-Regressive, SAR): Generate a long video in overlapping segments that hand off context.
- How it works: (1) Slide a window to choose the next segment, (2) condition on the tail latents from the previous segment plus local guide frames, (3) fuse overlaps smoothly.
- Why it matters: Without it, you run out of memory or lose continuity on very long shots. Anchor: Like weaving long scarves from shorter pieces that overlap perfectly so you can't see the seams. (A sliding-window sketch follows below.)
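Here is a minimal sliding-window sketch of segment-wise generation, assuming a fixed segment length, a small overlap, and a linear cross-fade over the overlapping latents. `generate_segment` is a placeholder for one conditioned run of the base model; DreaMontage's real conditioning and fusion rule are not reproduced here.

```python
import torch

# Sketch of segment-wise auto-regressive (SAR) generation: build a long
# latent sequence in overlapping chunks, conditioning each chunk on the tail
# of the previous one, then blend the overlap so seams are invisible.

def generate_segment(context_tail, length, channels=8):
    # Placeholder for a conditioned diffusion run; real code would feed
    # context_tail and any local guide frames into the DiT.
    return torch.randn(channels, length)

def sar_generate(total_len=60, seg_len=20, overlap=4, channels=8):
    video = torch.zeros(channels, total_len)
    pos = 0
    while pos < total_len:
        length = min(seg_len, total_len - pos + (overlap if pos > 0 else 0))
        start = max(pos - overlap, 0)
        tail = video[:, max(start - overlap, 0):start] if start > 0 else None
        segment = generate_segment(tail, length, channels)
        if start > 0:
            # Linearly cross-fade the overlapping region.
            w = torch.linspace(0, 1, overlap)
            video[:, start:start + overlap] = (1 - w) * video[:, start:start + overlap] + w * segment[:, :overlap]
            video[:, start + overlap:start + length] = segment[:, overlap:]
        else:
            video[:, :length] = segment
        pos = start + length
    return video

print(sar_generate().shape)  # torch.Size([8, 60])
```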
03 Methodology
High-Level Recipe: Inputs (text prompt + arbitrary frames/clips + their times) → VAE encode to latents → Base DiT with intermediate-conditioning (Adaptive Tuning) generates low-res video latents → Super-Resolution DiT with Shared-RoPE refines to high-res → SAR stitches long videos segment by segment → VAE decode to pixels. (An orchestration sketch follows below.)
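The stub-level sketch below only mirrors this recipe's order of operations and what flows between stages; every function name, signature, and tensor shape is a placeholder of ours, not the paper's or any library's API.

```python
import torch

# High-level orchestration sketch of the recipe above. Each stage is a
# trivial stand-in; only the ordering and data flow are the point.

def encode_text(prompt: str) -> torch.Tensor:
    return torch.randn(1, 77, 512)                      # stand-in text embedding

def vae_encode(frame_or_clip: torch.Tensor) -> torch.Tensor:
    return torch.randn(8, 16, 16)                       # stand-in condition latent

def base_dit_sample(text_emb, cond_latents, duration_s) -> torch.Tensor:
    # Would run the Base DiT (with SAR for long durations) at low resolution.
    return torch.randn(8, duration_s * 2, 16, 16)

def sr_dit_upsample(low_res, cond_latents) -> torch.Tensor:
    # Would run the Shared-RoPE super-resolution DiT.
    return torch.randn(8, low_res.shape[1], 32, 32)

def vae_decode(latents) -> torch.Tensor:
    return torch.randn(3, latents.shape[1] * 2, 256, 256)  # stand-in pixel frames

def generate_one_shot(prompt, guides, duration_s):
    """guides: list of (timestamp_seconds, frame_or_clip_tensor) pairs."""
    text_emb = encode_text(prompt)
    cond_latents = [(t, vae_encode(g)) for t, g in guides]
    low_res = base_dit_sample(text_emb, cond_latents, duration_s)
    high_res = sr_dit_upsample(low_res, cond_latents)
    return vae_decode(high_res)

video = generate_one_shot("ride through the window into a neon city",
                          guides=[(0.0, torch.randn(3, 256, 256)),
                                  (10.0, torch.randn(3, 256, 256))],
                          duration_s=15)
print(video.shape)
```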
Step A: Preparing and Encoding Conditions
- What happens: Images/clips and prompts are encoded. The VideoVAE turns frames into compact latents; text goes through a text encoder.
- Why it exists: Latent-space work is faster and keeps memory low; text embedding lets the DiT "understand" instructions.
- Example: The user gives three guides: t=0s (a train interior photo), t=10s (a shattered-window image), t=15s (a cyberpunk street clip). All become latents; the text "ride through the window into a neon city" becomes embeddings.
Step B: Intermediate-Conditioning via Channel Concatenation (with Adaptive Tuning)
- What happens: At each guided timestamp, the condition latent is concatenated with the current noisy latent along channels and fed into the Base DiT.
- Why it exists: This makes the condition unmissable to the model at the right time. Adaptive Tuning fixes the VAE's temporal mismatch by re-encoding single frames and approximating clip segments so training matches inference.
- What breaks without it: The model might treat mid-frames as vague hints, causing mistimed or off-style transitions.
- Example: At 10s, the window-shatter image latent sits right beside the noise latent. The DiT "sees" it and plans a proper break-through shot.
Step C: Base Generation (Low-Res Latents, 480p)
- What happens: The DiT denoises step by step, guided by text and mid-conditions, to produce a coherent low-res latent video.
- Why it exists: Low-res first is cheaper and more stable for long sequences; it sets solid structure and motion.
- What breaks without it: Direct high-res generation can wobble and cost too much memory.
- Example: The model forms a smooth camera push in the train car, aligns the window break near 10s, and flows into a neon city by 15s.
Step D: Super-Resolution with Shared-RoPE
- What happens: A second DiT upsamples to 720p/1080p. Besides channel concatenation, DreaMontage also appends condition tokens to the sequence and assigns them the same RoPE as the frames they guide (first frame only for video-conditions).
- Why it exists: High-res often magnifies tiny mismatches into flicker; Shared-RoPE forces time alignment.
- What breaks without it: Colors may pop-shift between frames; edges shimmer.
- Example: The neon signs glow consistently across frames instead of strobing. The shattered glass sparkles smoothly rather than blinking.
Step E: Visual Expression SFT
- What happens: The base is fine-tuned on a small hand-picked set covering camera moves (like FPV, dolly), visual effects, sports, spatial perception, and advanced transitions.
- Why it exists: To teach cinematic motion and better prompt following.
- What breaks without it: Movement can be timid or drift off-instruction.
- Example: "Zoom into an eye, then fly through the pupil" becomes a bold, steady move rather than a shaky morph.
Step F: Tailored DPO to Avoid Abrupt Cuts and Unnatural Motion
- What happens: Build contrastive pairs. For "cuts", use a VLM trained to rate severity and select the best/worst from the same inputs. For "motion", assemble challenging prompts and use human raters to pick natural vs. weird motion. Train the policy to prefer positives over a reference.
- Why it exists: Preference learning corrects specific bad habits that general training missed.
- What breaks without it: You may still see hard scene jumps or rubbery limbs.
- Example: When a skier transitions to surfing, momentum and body pose carry over realistically, instead of snapping into a new stance.
Step G: Segment-wise Auto-Regressive (SAR) Inference for Long Videos
- What happens: Long targets are split by a sliding window; user conditions become candidate boundaries. Each new segment conditions on the tail of the previous latents plus local guides. Overlaps are fused.
- Why it exists: To generate minutes-long shots within memory limits and keep continuity at joins.
- What breaks without it: Memory overflows or visible seams between chunks.
- Example: A 45-second one-shot is built from 10–15s chunks; the handoff keeps the same character, lighting, and motion direction.
Step H: Decode to Pixels
- What happens: The final latent sequence goes through the VAE decoder to produce the video frames.
- Why it exists: You can now watch the output in full resolution.
- Example: The full "train → shatter → cyberpunk fly-through" plays smoothly, exactly hitting all user-placed moments.
Secret Sauce (Why This Combo Shines):
- The intermediate-conditioning plus Adaptive Tuning aligns mid-frames despite the VAE's causal downsampling.
- Shared-RoPE prevents SR from inventing flicker.
- SFT gives motion style; DPO removes two stubborn artifacts.
- SAR scales to long videos without losing the one-shot illusion.
04 Experiments & Results
The Test: Because few models support true arbitrary mid-conditions (including clips), the team shows qualitative cases spanning single images, multi-keyframes, video-to-video transitions, and mixed image–video storylines. For fair numbers, they reduce the task to two common sub-settings, multi-keyframe and first–last, to compare with state-of-the-art competitors.
How They Measured (GSB Protocol): Hook: Imagine two game replays shown side-by-side and judges vote which one looks better. The Concept (GSB: Good/Same/Bad): Human evaluators compare pairs and label whether model A is better (Good), worse (Bad), or equal (Same) for each category.
- How it works: (1) Show two generated videos with the same inputs, (2) judges rate Visual Quality, Motion Effects, Prompt Following, and Overall Preference as Good/Same/Bad, (3) compute a score: (Wins - Losses) / (Wins + Losses + Ties).
- Why it matters: Subtle qualities like motion realism and transitions are hard for automatic metrics; humans catch them. Anchor: Like judging two dances on TV: one can win on smoothness even if both costumes look great. (A one-line score helper is sketched below.)
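The score itself is simple arithmetic; the tiny helper below mirrors the formula in step (3). The vote counts in the example are hypothetical, chosen only to show the formula in use.

```python
# GSB score as described above: (wins - losses) / (wins + losses + ties).

def gsb_score(wins: int, losses: int, ties: int) -> float:
    return (wins - losses) / (wins + losses + ties)

# Hypothetical vote counts (not from the paper): 22 Good, 16 Bad, 0 Same.
print(f"{gsb_score(22, 16, 0):+.2%}")  # +15.79%
```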
The Competition: For multi-keyframe, they compare to Vidu Q2 and Pixverse V5; for first–last frame, to Kling 2.5. All are strong commercial references.
Scoreboard with Context:
- Multi-Keyframe vs. Vidu Q2: DreaMontage wins Overall Preference by +15.79%. Biggest edge is Prompt Following at +23.68%, showing better obedience to complex instructions. Motion Effects also improve (+7.89%), with a small trade-off in Visual Quality (-2.63%).
- Multi-Keyframe vs. Pixverse V5: DreaMontage wins Overall Preference by +28.95%. Prompt Following again leads with +23.68%. Motion is slightly behind (-2.63%), while Visual Quality ties (0.00%). Result: users prefer DreaMontage thanks to narrative coherence.
- First–Last vs. Kling 2.5: Visual Quality ties (0.00%), but DreaMontage wins in Motion Effects (+4.64%) and Prompt Following (+4.64%), resulting in a higher Overall Preference (+3.97%).
Ablations (What Each Piece Adds):
- Visual Expression SFT vs. Base: Motion Effects jump by +24.58% with Overall Preference +20.34% (Visual Quality unchanged). SFT is the "cinematic motion booster."
- Tailored DPO (Cuts): +12.59% on cut-smoothness; transitions grow more natural.
- Tailored DPO (Subject Motion): +13.44%; fewer anatomical distortions and more plausible movement.
- Shared-RoPE vs. SR Base: Massive +53.55%; it crushes flicker and cross-frame color shifts at high resolution.
Surprising Findings:
- Small, high-quality SFT data swung motion quality more than expected, suggesting "less but better" can beat massive generic sets.
- A simple sequence-position trick (Shared-RoPE) dramatically stabilized SR, a tiny change with huge payoff.
- Automatic VLMs could rank abrupt cuts, but humans were still needed to judge motion realism reliably.
05 Discussion & Limitations
Limitations:
- Real-time generation is still hard at high resolutions and long durations; SAR helps, but decoding and SR remain heavy.
- The VAE's causal downsampling is only approximated for mid-clip conditions; while Adaptive Tuning mitigates the mismatch, it's not mathematically perfect.
- DPO relies on curated preference data; building pairs (especially for subtle motion issues) needs human effort and domain knowledge.
- The model's style and motion biases reflect its curated datasets; rare genres or niche camera tricks may need extra fine-tuning.
Required Resources:
- Strong GPUs with sizable VRAM for SR and long sequences (multi-GPU preferred for training).
- Precomputation storage for VAE latents to speed training.
- Access to VLMs and human annotators for building preference pairs.
When NOT to Use:
- Live interactive scenarios needing sub-second latency at 1080p+.
- Tasks demanding exact physics simulations (e.g., engineering-grade trajectories) rather than cinematic plausibility.
- Inputs with extremely inconsistent or low-quality conditions (e.g., mismatched resolutions, heavy compression) that can confuse high-res alignment.
Open Questions:
- Can we design a non-causal or selectively causal video VAE that natively aligns single-frame conditions without extra tricks?
- Could better automatic judges for motion realism reduce human labeling in DPO?
- How to integrate camera-path control and 3D scene understanding for even richer one-shot cinematography?
- Can audio-aware training co-synchronize sound cues with visual transitions for music videos and trailers?
- Is there a lightweight SR that retains Shared-RoPE stability but cuts compute for near-real-time previews?
06 Conclusion & Future Work
Three-Sentence Summary: DreaMontage is an AI framework that turns scattered images and short clips into one smooth, long, single-shot video by letting you place guides at any time and filling the gaps seamlessly. It solves mid-frame alignment, removes abrupt cuts and weird motion, and scales to long durations using smart training and a segment-wise generation strategy. The result is a controllable, cinematic video maker that matches or beats strong commercial systems on motion and instruction-following.
Main Achievement: A practical, end-to-end recipe for arbitrary frame-guided one-shot video generation that stays coherent, expressive, and efficient (even for long sequences) through intermediate-conditioning, SFT+DPO alignment, Shared-RoPE SR, and SAR.
Future Directions: Build VAEs that align perfectly with single-frame and clip conditions; reduce DPO's reliance on humans with better motion judges; add fine camera-path control and 3D awareness; sync audio and motion; and shrink compute for real-time previews.
Why Remember This: It turns storyboards (images and tiny clips placed on a timeline) into living, breathing one-shot films. For creators, students, and small teams, it's like having a steady-handed camera crew, a motion coach, and a finishing artist all inside one model.
Practical Applications
- Storyboard-to-one-shot conversion for filmmakers and students to preview scenes with precise keyframe control.
- Cinematic trailers that blend concept art, character posters, and short live clips into a single flowing sequence.
- Game cutscenes built from mixed assets (images and gameplay snippets) while preserving style and character consistency.
- Dynamic product ads that morph from static posters into usage footage with smooth brand-aligned transitions.
- Educational explainers where key illustrations are pinned to timestamps and the AI fills in the motion between them.
- Sports highlight reels that transition between stills and clips without hard cuts, preserving momentum and pose realism.
- Music videos that hit specific visual beats on the timeline, with long, continuous camera feels.
- Museum and gallery installations where still artworks are animated into a coherent, continuous visual journey.
- Social media "one-take" vlogs extended to long durations without quality decay using SAR.
- Pre-visualization (pre-viz) for complex camera moves (FPV, dolly in/out) to test ideas before real-world shoots.