
DreamStyle: A Unified Framework for Video Stylization

Intermediate
Mengtian Li, Jinshu Chen, Songtao Zhao et al. · 1/6/2026
arXiv · PDF

Key Summary

  • DreamStyle is a single video-stylization model that can follow text, copy a style image, or continue from a stylized first frame—without switching tools.
  • It is built on a strong Image-to-Video (I2V) generator and adds style “clues” using a clever condition-injection design.
  • A new token-specific LoRA lets the model learn differently from different tokens (video, style image, first frame) so they don’t confuse each other.
  • The team built two paired datasets using a practical pipeline: stylize the first frame with top image models, then animate it with I2V plus ControlNets for motion alignment.
  • This two-stage dataset (big CT + small high-quality SFT) balances style fidelity and structure/motion coherence.
  • Across three tasks, DreamStyle beats popular competitors in style consistency and overall video quality, and holds its structure well.
  • It also enables multi-style fusion (mix text + multiple style images) and longer videos by chaining segments via first-frame guidance.
  • Ablations show token-specific LoRA and the two-stage datasets are key to performance.
  • There are limits: multi-shot scenes and heavy geometric changes can still strain structure preservation, and compute demands are non-trivial.
  • DreamStyle makes video stylization more flexible, controllable, and practical for real creators.

Why This Research Matters

DreamStyle lowers the barrier for creators to restyle videos consistently, so artists and hobbyists can achieve professional looks without juggling multiple tools. Teachers and students can quickly convert the same content into different visual styles, making learning more engaging. Marketers and brands can stay on-message by keeping structure and motion intact while shifting the visual theme for campaigns. Filmmakers and game designers can iterate on art direction faster, exploring diverse looks from the same footage. Everyday users can stylize family videos into fun art styles without dealing with flicker, inconsistency, or steep settings. The framework also shows how to build better video datasets pragmatically, which benefits future research. Lastly, the unified model paves the way for controllable, longer stylized videos and creative style mixing.

Detailed Explanation

01Background & Problem Definition

🍞 Hook: Imagine you’re making a class movie, and you want it to look like watercolor one day, pixel art the next, and origami the day after. Changing the whole movie’s look shouldn’t mean remaking the whole thing from scratch.

🥬 The Concept (Video Stylization): Video stylization is changing a video’s look (colors, textures, and drawing style) while keeping its story and motion. How it works: (1) Keep what’s happening (people, places, actions), (2) change how it looks (e.g., oil paint, crayon), (3) make every frame match the same style, (4) keep motion smooth across frames. Why it matters: Without stylization, you’d be stuck with the original camera look; with poor stylization, styles drift and flicker, making viewers dizzy.

🍞 Anchor: Think of a regular phone video of a dog running. Stylization can make it look like LEGO blocks while the dog still runs the same path.

The World Before: AI could already generate images and short videos. But stylizing videos well was tricky. Most tools listened to just one kind of instruction at a time: a text prompt (like “low-poly style”), a style image (a reference painting), or a stylized first frame (to guide long clips). Each had perks and drawbacks: text was flexible but vague, a style image was accurate but hard to find for new styles, and a first frame helped long videos but needed a very good starting image. Models tended to pick only one of these, so users had to juggle multiple tools and lost control.

The Problem: Two big roadblocks stood in the way. (1) Single-modality thinking: Most systems only accepted one kind of style instruction, limiting creativity and control. (2) Weak training data: Good stylized-video pairs were rare. Borrowing from image datasets often broke either style consistency (style drifts), temporal coherence (frames flicker), or motion realism (action looks stiff).

🍞 Hook: You know how building a domino chain needs well-chosen, evenly spaced dominoes? If some are too light or too heavy, the chain fails.

🥬 The Concept (Image-to-Video, I2V Model): An I2V model takes a still picture and predicts a plausible short video from it. How it works: (1) Encode the image, (2) imagine the next frames, (3) generate a sequence with smooth motion. Why it matters: I2V models are strong motion “animators,” so they can turn a great first frame into a great video.

🍞 Anchor: Give the model a photo of a cat mid-pounce; it creates a few seconds of the cat landing and looking around.
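
To make the I2V idea concrete, here is a minimal, illustrative PyTorch sketch (not the paper's model): it only shows the interface shape of "take one frame, roll out a short sequence of latents." Real I2V models such as Wan14B-I2V denoise the whole latent video jointly; the class and names below are toy stand-ins.

```python
import torch
import torch.nn as nn

class ToyI2V(nn.Module):
    """Toy stand-in for an Image-to-Video generator (illustrative only).

    A real I2V model denoises a full latent video; here we just predict each
    next latent frame from the previous one to show the interface shape.
    """
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.step = nn.Linear(latent_dim, latent_dim)  # "imagine the next frame"

    def forward(self, first_frame_latent: torch.Tensor, num_frames: int) -> torch.Tensor:
        frames = [first_frame_latent]
        for _ in range(num_frames - 1):
            frames.append(torch.tanh(self.step(frames[-1])))  # smooth, bounded update per frame
        return torch.stack(frames)                            # (num_frames, latent_dim)

# Usage: encode a still image into a latent elsewhere, then "animate" it.
latent = torch.randn(64)
video_latents = ToyI2V()(latent, num_frames=16)
print(video_latents.shape)  # torch.Size([16, 64])
```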

Failed Attempts: Some methods stylized a single frame and tried to “push” the style through the rest with feature tricks. Others used only text or only a style image. Many leaned on open image models and hoped video models would catch up. The result? Trade-offs: either good style but flickery motion, or smooth motion but bland style.

🍞 Hook: Think of a dance coach who keeps everyone on beat.

🥬 The Concept (ControlNet): ControlNet is a helper network that nudges generation with structure cues (like depth maps or human poses). How it works: (1) Extract structure (depth or pose) from a source, (2) guide the generator to follow that structure, (3) keep motion aligned across frames. Why it matters: Without it, motion can drift, subjects can morph, and videos won’t match the original action.

🍞 Anchor: Use a pose sequence from a real dancer to guide a cartoon dancer so the moves match.
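
A rough sketch of the ControlNet idea, assuming toy shapes: a small network encodes a structure cue (a depth or pose map) and adds it as a zero-initialized residual to the generator's features, so guidance starts as a no-op and is learned gradually. Names and shapes here are illustrative, not the actual ControlNet code.

```python
import torch
import torch.nn as nn

class ToyControlNet(nn.Module):
    """Illustrative ControlNet-style conditioner: turns a per-frame structure cue
    (e.g., a depth map or pose map) into residual features that nudge the generator."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        # Zero-initialized output conv so guidance starts as a no-op, mirroring the
        # zero-convolution trick used by ControlNet.
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, generator_features: torch.Tensor, structure_map: torch.Tensor) -> torch.Tensor:
        guidance = self.out(torch.relu(self.encode(structure_map)))
        return generator_features + guidance  # structure steers generation without replacing it

# Usage: the same depth/pose maps drive both raw and stylized sequences, keeping motion aligned.
feats = torch.randn(8, 4, 32, 32)   # 8 frames of generator features
depth = torch.rand(8, 1, 32, 32)    # per-frame depth maps
guided = ToyControlNet()(feats, depth)
```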

The Gap: The field needed (1) one model that could listen to text, style images, and first-frame guidance, (2) a reliable way to build high-quality stylized–raw video pairs for training, and (3) a method to prevent different guidance types from confusing each other inside one model.

🍞 Hook: Imagine a backpack add-on instead of buying three whole suitcases.

🥬 The Concept (LoRA): LoRA is a lightweight add-on that teaches a big model new tricks by adjusting tiny low-rank parts. How it works: (1) Freeze most of the big model, (2) add small trainable matrices, (3) train just those to learn the task, (4) combine with the base at run time. Why it matters: It’s efficient and keeps the base model’s skills.

🍞 Anchor: It’s like clip-on lenses for a camera—cheap, fast, and you keep your original camera.
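
Here is a minimal sketch of a standard LoRA adapter around a linear layer, assuming PyTorch; real setups wrap the base model's attention and FFN projections, but the freeze/low-rank/residual pattern is the same.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA adapter: freeze the big weight, learn a small low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # (1) freeze the big model
        self.down = nn.Linear(base.in_features, rank, bias=False)   # (2) small trainable matrices
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                              # start as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))    # (4) combine at run time

# Usage: wrap an existing projection layer and train only the adapter.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))
```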

Real Stakes: • Creators want consistent, cool-looking videos without heavy manual work. • Educators and brands want the same content in multiple art styles. • Filmmakers and game devs want rapid visual experimentation. • Ordinary users want to remix their clips without learning five tools. DreamStyle steps in with a unifying model, a smart data-building pipeline, and a special LoRA to keep signals clear.

02Core Idea

The “Aha!” in one sentence: Use one I2V-based model that can accept text, style images, and stylized first frames as style clues, and teach it cleanly with a token-specific LoRA and a carefully built paired-video dataset.

Multiple Analogies:

  1. Swiss Army Pen: Instead of carrying three pens (text-only, image-only, first-frame-only), you carry one pen with three tips you can flip out anytime.
  2. Recipe + Photo + First Bite: A chef can cook from a written recipe (text), a food photo (style image), or by tasting a starter bite (first frame). This kitchen can do all three.
  3. GPS with Multiple Signals: A navigator can use a street map (text), a satellite image (style image), or your current location (first frame). Combining them is more reliable.

Before vs After:

  • Before: Separate tools for different conditions; painful to mix them; weak datasets made results flicker or drift in style.
  • After: One model that flexibly uses any or several conditions at once, trained on paired stylized–raw videos built by stylizing the first frame then animating with consistent motion cues. Style sticks, motion flows, and creators gain control.

Why It Works (intuitively):

  • The base I2V model already knows how to animate. If you feed it a high-quality stylized first frame, it can keep the look while moving forward.
  • Adding style images and text via a neat “condition injection” lets the model pull in both precise (image) and flexible (text) style hints.
  • Token-specific LoRA acts like three labeled lanes for different token types so their gradients don’t crash into each other during learning.
  • The dataset pipeline avoids classic mismatches by using the same control signals (like depth/pose) to animate both stylized and raw sequences, making training pairs that truly correspond.

🍞 Hook: You know how giving an artist clear instructions avoids mixed-up art?

🥬 The Concept (Condition Injection Mechanism): It’s a tidy way to feed style clues into the model. What it is: A design that plugs style image tokens and first-frame tokens as special frames, text through cross-attention, and raw video through image channels—without redesigning the whole model. How it works: (1) Encode inputs (video, style image, first frame), (2) add them as extra channels/frames in the right spots, (3) keep text in the usual text-attention path, (4) let the base I2V do its animation magic. Why it matters: Without clean injection, the model either ignores a clue or jumbles signals, causing style confusion or motion errors.

🍞 Anchor: It’s like adding a legend to a map: colors (style image), a starting pin (first frame), and road names (text) all go where readers expect.
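
The sketch below shows the injection idea at the tensor level, under assumed shapes (it is not the exact Wan14B-I2V interface): the raw video latents sit in the middle, an optional stylized first frame is attached at the front, an optional style image is attached at the tail, and a role mask records which token is which; text stays on the usual cross-attention path.

```python
import torch

def assemble_conditioned_latents(video_latents, first_frame_latent=None, style_latent=None):
    """Illustrative sketch of the condition-injection idea (shapes and roles are assumptions).

    video_latents:      (T, C, H, W) latents of the raw video (content/motion to keep)
    first_frame_latent: (C, H, W) stylized first frame, attached at the front if given
    style_latent:       (C, H, W) style reference image, attached at the tail if given
    Text is not handled here: it stays on the base model's cross-attention path.
    """
    frames, roles = [], []  # roles: 0 = video token, 1 = first-frame token, 2 = style-image token
    if first_frame_latent is not None:
        frames.append(first_frame_latent.unsqueeze(0))
        roles.append(torch.tensor([1]))
    frames.append(video_latents)
    roles.append(torch.zeros(video_latents.shape[0], dtype=torch.long))
    if style_latent is not None:
        frames.append(style_latent.unsqueeze(0))
        roles.append(torch.tensor([2]))
    return torch.cat(frames), torch.cat(roles)  # the role mask tells the network which token is which

# Usage: 16 raw frames plus a style reference appended as the final "frame".
seq, mask = assemble_conditioned_latents(torch.randn(16, 4, 32, 32), style_latent=torch.randn(4, 32, 32))
print(seq.shape, mask.tolist())
```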

🍞 Hook: Imagine prepping ingredients neatly before cooking a feast.

🥬 The Concept (Data Curation Pipeline): It’s the two-step process for building good training pairs. What it is: First stylize the initial frame using strong image models, then animate both stylized and raw versions with the same motion controls to form a pair. How it works: (1) Make a stylized first frame (text- or style-image-guided), (2) animate stylized and raw using identical ControlNets (depth/pose) so their motions correspond, (3) auto and manual filtering ensure quality, (4) build two datasets: big CT for breadth, small SFT for polish. Why it matters: Without clean pairs, the model can’t learn style consistency and motion together.

🍞 Anchor: It’s like baking two cakes with the same pan and timer—one vanilla (raw) and one chocolate (stylized)—so you can compare slices fairly.
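
A schematic sketch of the two-step recipe, with the actual tools (InstantStyle/Seedream for stylization, depth/pose ControlNets for animation) abstracted into plug-in callables; the point is only the data flow, not the real implementations.

```python
def build_training_pair(raw_video, stylize_first_frame, extract_controls, animate_with_controls):
    """Sketch of the paired-data recipe; the callables are hypothetical plug-ins.

    Step A: stylize only the first frame.
    Step B: animate stylized and raw versions with the *same* control signals so motion matches.
    (Step C, filtering by captions and style consistency, happens afterwards.)
    """
    styled_first = stylize_first_frame(raw_video[0])                # Step A: text- or style-image-guided
    controls = extract_controls(raw_video)                          # depth or pose maps, one per frame
    stylized_video = animate_with_controls(styled_first, controls)
    raw_reanimated = animate_with_controls(raw_video[0], controls)  # identical motion cues
    return raw_reanimated, stylized_video                           # a pair that truly corresponds

# Usage with dummy plug-ins, just to show the data flow:
raw = [f"frame_{i}" for i in range(8)]
pair = build_training_pair(
    raw,
    stylize_first_frame=lambda f: f + "_styled",
    extract_controls=lambda v: [f + "_depth" for f in v],
    animate_with_controls=lambda first, ctrl: [first] * len(ctrl),
)
print(pair)
```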

🍞 Hook: Think of color-coded notebooks for math, science, and art so your notes don’t mix.

🥬 The Concept (Token-specific LoRA): A small, smart add-on that treats different token types with their own “up” matrices. What it is: A LoRA where all tokens share one down matrix, but each token type (video, style image, first frame) has its own up matrix. How it works: (1) Project token features down (shared), (2) route them to the correct up matrix by token type, (3) add back as a residual tweak, (4) train efficiently. Why it matters: Without it, gradients from different conditions blur together, causing style confusion and degradation.

🍞 Anchor: It’s like having three dedicated highlighters so you don’t smear notes by using one marker for everything.
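
A minimal sketch of the token-specific LoRA pattern described above, assuming PyTorch and simple routing by an integer token-type mask; shapes and routing details are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TokenSpecificLoRA(nn.Module):
    """Token-specific LoRA sketch: one shared down-projection, one up-projection per token
    type (0 = video, 1 = first frame, 2 = style image)."""
    def __init__(self, dim: int, rank: int = 64, num_types: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)                          # shared by all tokens
        self.ups = nn.ModuleList(nn.Linear(rank, dim, bias=False) for _ in range(num_types))
        for up in self.ups:
            nn.init.zeros_(up.weight)                                         # start as a no-op residual

    def forward(self, tokens: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        """tokens: (N, dim); token_types: (N,) ints selecting the up matrix per token."""
        low = self.down(tokens)
        delta = torch.zeros_like(tokens)
        for t, up in enumerate(self.ups):
            mask = token_types == t
            if mask.any():
                delta[mask] = up(low[mask])                                   # each type gets its own lane
        return tokens + delta                                                 # residual tweak

# Usage: 20 tokens where the last one is a style-image token.
types = torch.zeros(20, dtype=torch.long)
types[-1] = 2
out = TokenSpecificLoRA(dim=128)(torch.randn(20, 128), types)
```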

Building Blocks:

  • Strong I2V base (Wan14B-I2V) with minimal changes
  • Condition injection: text via standard cross-attention; raw video via image channels; first-frame/style-image as extra frames
  • ControlNets (depth/pose) to keep motion aligned when building data
  • Two-stage training: big CT (InstantStyle/SDXL) then fine SFT (Seedream 4.0)
  • Token-specific LoRA to separate learning lanes

03Methodology

At a high level: Inputs (raw video + one or more style clues) → Encode and inject conditions → Diffusion/flow-matching denoising inside I2V (with LoRA) → Stylized video output.

Step-by-step (the recipe):

  1. Gather Inputs
  • What happens: You provide a raw video and at least one style clue: a text prompt, a style image, or a stylized first frame. You can also combine clues (e.g., text + one or more style images).
  • Why it exists: Clear inputs let the model know what to keep (content/motion) and what to change (style).
  • Example: Raw video of a boy flying a paper plane; style prompt: “colored pencil style, bright tones”; style image: a colored-pencil postcard.
  2. Encode Everything
  • What happens: The model converts frames and images into compact “latents,” and reads the text into tokens using a text encoder. A VAE turns images/video into latents; text goes to cross-attention.
  • Why it exists: Latents are easier and faster to work with; text tokens provide flexible style semantics.
  • Example: The postcard image becomes a latent vector grid; the words “colored pencil” become tokens like [colored],[pencil].
  3. Inject Conditions Cleanly (Condition Injection Mechanism)
  • What happens: The raw video latent is fed through image channels; the stylized first frame (if any) is attached as the very first extra frame; the style image (if any) is attached as an extra final frame; text flows through the base cross-attention path. Mask channels mark which frames are first-frame, video, or style-image tokens.
  • Why it exists: This keeps the base I2V architecture mostly unchanged and avoids heavy compute while giving each clue a clear place.
  • Example: For style-image guidance, the postcard latent is concatenated as the tail frame, so the network sees it as a style anchor to copy from.
  4. Denoise with a Strong Animator (I2V + Flow Matching)
  • What happens: The diffusion-style process starts from a noisy latent and learns to predict cleaner frames step by step (the flow matching objective; a minimal sketch appears after this list). The LoRA modifies attention and FFN layers lightly to learn stylization without retraining the whole model.
  • Why it exists: Diffusion/flow matching is very good at making sharp, coherent images and videos; LoRA is efficient and preserves base skills.
  • Example: After several steps, the noisy “colored pencil” frames grow clearer until the boy and his plane are recognizable with pencil textures.
  5. Keep Signals Separated (Token-specific LoRA)
  • What happens: All tokens pass through a shared down matrix, but each token type (video, style image, first frame) has its own up matrix. Residuals are added back to the model’s features.
  • Why it exists: Prevents the model from mixing up roles—e.g., confusing a style image with a video frame—so style stays stable and structure holds.
  • Example: The style-image tokens get their own adaptation lane, so pencil strokes don’t accidentally warp the boy’s face or pose.
  6. Decode to Video
  • What happens: The cleaned latent sequence is turned back into visible frames by the VAE decoder.
  • Why it exists: Latent space is for fast learning; decoding makes the final stylized video you can watch.
  • Example: You now see the boy and plane moving naturally, with consistent colored-pencil shading.
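
Step 4 above leans on a flow-matching objective; here is a toy, self-contained sketch of that objective and Euler sampling on small vectors (the real model operates on video latents with a diffusion transformer, so this is only the training and sampling logic in miniature).

```python
import torch
import torch.nn as nn

# Tiny velocity network for illustration (input: latent of size 64 plus a scalar time).
velocity_net = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))

def flow_matching_loss(clean_latent: torch.Tensor) -> torch.Tensor:
    """Train the network to predict the straight-line velocity from noise to data."""
    noise = torch.randn_like(clean_latent)
    t = torch.rand(clean_latent.shape[0], 1)             # random time in [0, 1]
    x_t = (1 - t) * noise + t * clean_latent              # point on the noise-to-data path
    target_velocity = clean_latent - noise
    pred = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

@torch.no_grad()
def sample(batch: int = 2, steps: int = 20) -> torch.Tensor:
    """Euler integration from pure noise toward a clean latent."""
    x = torch.randn(batch, 64)
    for i in range(steps):
        t = torch.full((batch, 1), i / steps)
        x = x + velocity_net(torch.cat([x, t], dim=-1)) / steps
    return x

print(flow_matching_loss(torch.randn(4, 64)).item(), sample().shape)
```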

The Training Data Pipeline (how the pairs were built):

  • Step A: Stylize First Frames • Use top image stylizers: InstantStyle (SDXL plugin) for diverse style-image-guided results; Seedream 4.0 for very high-quality text-guided results. • Add helpers: depth ControlNet and an ID plugin for face/structure stability when needed.
  • Step B: Animate with Consistent Motion • Use the same ControlNet signals (depth or human pose) to drive both the stylized and the raw sequences, so their motion matches closely. • Why: If you use mismatched controls, stylized and raw won’t align, and training will confuse content vs. style.
  • Step C: Filter for Quality • Big CT dataset (~40K pairs at 480p, up to 81 frames): automatic filtering by VLM captions and style consistency (CSD); • Small SFT dataset (~5K pairs): manual filtering for top quality and content alignment; includes multiple style images per sample (1–16).

Two-Stage Training (the schedule):

  • Stage 1 (CT): Sample ratio across tasks = text : style image : first frame = 1 : 2 : 1 to strengthen style-image skills (often trickiest). Train ~6000 iters with LoRA rank 64, AdamW lr 4e-5, gradient accumulation for effective batch 16.
  • Stage 2 (SFT): Fine-tune ~3000 iters on the cleaner set to boost style fidelity and aesthetics.
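
The reported schedule can be summarized as two simple config blocks (field names are my own shorthand, not the authors' configuration format):

```python
# Two-stage training schedule as reported in the paper, expressed as plain dicts.
STAGE_1_CT = {
    "dataset": "CT (~40K pairs, 480p, up to 81 frames)",
    "task_sample_ratio": {"text": 1, "style_image": 2, "first_frame": 1},
    "iterations": 6000,
    "lora_rank": 64,
    "optimizer": "AdamW",
    "learning_rate": 4e-5,
    "effective_batch_size": 16,   # reached via gradient accumulation
}
STAGE_2_SFT = {
    "dataset": "SFT (~5K manually filtered pairs)",
    "iterations": 3000,
    "goal": "boost style fidelity and aesthetics",
}
```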

Using DreamStyle (how you run it):

  • Text-guided: Provide raw video + a style-only prompt (e.g., “low poly style, bright facets”).
  • Style-image-guided: Provide raw video + one or more style reference images (e.g., a low-poly fox poster).
  • First-frame-guided: Provide raw video + its stylized first frame. For long videos, chain segments: use the last frame of segment k as the first-frame input for segment k+1 (see the chaining sketch after this list).
  • Multi-style fusion: Mix a style prompt with multiple style images; adjust weights to blend influences.
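
For long videos, the first-frame mode chains naturally across segments; a small sketch of that loop, with the stylization calls abstracted into hypothetical plug-ins:

```python
def stylize_long_video(segments, stylize_segment, stylize_first_frame):
    """Chain first-frame guidance across segments: the last stylized frame of segment k
    becomes the first-frame condition for segment k+1 (callables are hypothetical plug-ins)."""
    guide = stylize_first_frame(segments[0][0])        # stylize the very first frame once
    outputs = []
    for seg in segments:
        styled = stylize_segment(seg, first_frame=guide)
        outputs.append(styled)
        guide = styled[-1]                             # carry the look into the next segment
    return [frame for seg in outputs for frame in seg]

# Usage with dummy plug-ins to show the chaining:
segs = [[f"s{i}_f{j}" for j in range(4)] for i in range(3)]
long_video = stylize_long_video(
    segs,
    stylize_segment=lambda seg, first_frame: [f"{f}|{first_frame}" for f in seg],
    stylize_first_frame=lambda f: f + "_styled",
)
```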

Secret Sauce (what’s especially clever):

  • Unified condition injection that reuses the base I2V paths, so the model stays fast and stable.
  • Token-specific LoRA, which acts like labeled lanes for learning so signals don’t smear together.
  • A practical, scalable data pipeline that makes true stylized–raw pairs by animating both with the same controls, solving a long-standing dataset gap.

Concrete Data Example:

  • Inputs: Raw clip of a swimmer; Style prompt: “pixel art style”; Style image: a pixel-art beach scene.
  • Process: Encode video; attach the pixel-art image as the last frame; feed the prompt through text attention; denoise with token-specific LoRA.
  • Output: The swimmer moves like the original, but the water and character show stable pixel blocks across all frames (high CSD), with body shape preserved (solid DINO score).

04Experiments & Results

The Test: The researchers measured whether videos matched the requested style (style consistency), kept the original content/pose (structure preservation), and looked good over time (dynamic degree, image quality, aesthetic quality, subject/background consistency). They also checked text–video alignment for text-guided tasks.
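
Most of these metrics compare embeddings; for intuition, here is a hedged sketch of a CSD-style style-consistency score as the mean cosine similarity between per-frame style embeddings and the reference embedding (the real CSD and DINO scores use dedicated pretrained encoders, which random tensors stand in for here).

```python
import torch
import torch.nn.functional as F

def style_consistency_score(frame_embeddings: torch.Tensor, reference_embedding: torch.Tensor) -> float:
    """CSD-style score sketch: mean cosine similarity between each output frame's style
    embedding and the style reference's embedding."""
    sims = F.cosine_similarity(frame_embeddings, reference_embedding.unsqueeze(0), dim=-1)
    return sims.mean().item()

# Usage with random stand-in embeddings (a real evaluation would embed decoded frames).
frames = torch.randn(16, 512)    # one style embedding per output frame
reference = torch.randn(512)     # embedding of the style reference image
print(style_consistency_score(frames, reference))
```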

The Competition: For text-guided, DreamStyle was compared with three commercial systems (Luma, Pixverse, Runway). For style-image-guided, it was compared to StyleMaster (T2V). For first-frame-guided, it was compared with VACE and VideoX-Fun.

The Scoreboard (with context):

  • Text-guided: DreamStyle achieved higher text–video alignment (CLIP-T ~0.167 vs Luma 0.132, Pixverse 0.155, Runway 0.154) and better structure preservation (DINO). Think: getting an A- while others get B’s on following the style instructions and keeping the subject’s pose.
  • Style-image-guided: In V2V mode, DreamStyle delivered strong style consistency (CSD ~0.515) and solid video quality metrics; its T2V version (no raw video input) scored even higher on some metrics due to fewer constraints, showing the base model’s strength carries through the LoRA.
  • First-frame-guided: DreamStyle posted the top style consistency (CSD ~0.851) against VACE (0.689) and VideoX-Fun (0.766). That’s like keeping the same outfit perfectly across the whole dance.
  • Quality trade-off note: Dynamic degree (more motion) can lower the raw “image quality” metric due to motion blur—DreamStyle embraced natural motion, so raw IQ sometimes dips while overall realism improves.

User Study Highlights (20 trained annotators; 1–5 scale):

  • Text-guided: DreamStyle ~4.14 (style), ~3.95 (content), ~3.95 (overall), beating others by large margins.
  • Style-image-guided: DreamStyle ~4.36 (style), ~3.87 (content), ~4.20 (overall); StyleMaster lagged strongly.
  • First-frame-guided: DreamStyle ~4.37 (style), ~4.12 (content), ~4.24 (overall); the baselines trailed notably on style stability across frames.

Surprising/Notable Findings:

  • One model can do T2V reasonably well even though it was trained for V2V stylization—thanks to LoRA keeping the base I2V/DiT capabilities.
  • Letting multiple conditions be active together at inference (e.g., text + style image) improves control and creativity—multi-style fusion emerges naturally.
  • Building training pairs by animating both stylized and raw with the same ControlNet cues greatly reduces motion mismatches compared to trying to invert or align after the fact.

Ablations (what really mattered):

  • Token-specific LoRA: Removing it hurt CSD (style consistency dropped from ~0.515 to ~0.413) and slightly reduced DINO (structure). Visuals showed style degradation/confusion, confirming its necessity.
  • Datasets and two-stage training: Only-CT led to weaker style fidelity; Only-SFT improved CSD but hurt structure (limited scale and alignment). The full two-stage scheme balanced both, giving the best practical results.

Bottom line: Across three tasks and multiple metrics, DreamStyle consistently matched or beat strong baselines, especially in the key target—style consistency—while keeping motion and content coherent.

05Discussion & Limitations

Limitations (honest look):

  • Multi-shot or complicated scene cuts: The base I2V model and training data focus on single-shot clips, so cross-shot consistency isn’t guaranteed.
  • Structure tensions with strong styles: Styles with heavy geometric deformation (e.g., flattening faces into origami folds) can fight the raw video’s structure; DINO scores may dip in those cases.
  • Data bias: Even with careful curation, the datasets reflect what the stylizers and sources can do; rare styles or motions may be underrepresented.
  • Compute and memory: Training uses sizable GPUs and careful batching; very long videos still require segment chaining.
  • Control signal limits: Depth/pose cues can’t capture all motion nuances (e.g., cloth, water turbulence), leaving some residual mismatches.

Required Resources:

  • A capable I2V base model (e.g., Wan14B-I2V) and VRAM to run it.
  • Access to strong image stylizers (InstantStyle/SDXL, Seedream 4.0) for building/expanding datasets.
  • ControlNet tools (depth/pose) and optional ID preservation for faces.
  • Time for filtering (automatic + manual) if you expand datasets.

When NOT to Use:

  • Multi-shot, story-level edits that need global scene memory and consistent characters across cuts.
  • Tasks demanding exact geometric preservation while also demanding extreme stylization—those goals can clash.
  • Real-time or on-device scenarios with tight compute/memory budgets.

Open Questions:

  • Can we add scene-level memory to handle multi-shot sequences with consistent style and characters?
  • Can richer controls (optical flow, segmentation, 3D cues) further improve motion/style alignment?
  • How to better balance dynamic degree and per-frame image quality without sacrificing realism?
  • Could a learned router (instead of manual token-type routing) outperform token-specific LoRA while staying stable?
  • How to build even larger, more diverse paired datasets without heavy manual filtering?

06Conclusion & Future Work

3-Sentence Summary: DreamStyle is a unified video stylization framework that accepts text, style images, and stylized first frames in one model. It works by cleanly injecting these conditions into a strong I2V backbone, training with a token-specific LoRA on high-quality paired stylized–raw video data built via a practical two-step pipeline. The result is stable style, solid motion, and flexible control—outperforming specialized baselines across three tasks.

Main Achievement: Showing that a single, efficiently adapted I2V model—equipped with a clean condition-injection design, token-specific LoRA, and a well-curated two-stage dataset—can handle all major video stylization modes with state-of-the-art style consistency.

Future Directions:

  • Extend to multi-shot and longer narratives with scene memory and character tracking.
  • Add richer control signals (segmentation/flow/3D) for harder motions (cloth, water, crowds).
  • Learn adaptive routing for LoRA or mixture-of-experts to scale to more token types and tasks.
  • Broaden datasets to rarer styles and edge cases; explore self-supervised consistency objectives.

Why Remember This: DreamStyle turns scattered, single-mode stylization tools into one practical, controllable system—making creative video restyling more accessible, consistent, and fun for artists, educators, and everyday users.

Practical Applications

  • Turn classroom science demos into watercolor or comic styles to make lessons engaging while keeping the same motions.
  • Create brand-consistent ads by restyling product videos into a chosen art style without reshooting.
  • Produce safer previsualization for films and games by testing multiple art directions on the same footage quickly.
  • Make social media clips into pixel art or low-poly while preserving the action to stand out without manual editing.
  • Generate long stylized travel vlogs by chaining segments with first-frame guidance for consistent looks throughout.
  • Blend multiple references (e.g., a style prompt + two posters) to invent a unique hybrid style for music videos.
  • Rapidly prototype educational animations from real footage (e.g., physics experiments) with stable stylization across frames.
  • Personalize family videos (birthdays, sports) into cartoon styles while keeping faces recognizable via ID preservation.
  • Localize content visually for different regions (e.g., manga-style vs. Western comic) using the same base video.
  • Augment datasets for creative AI apps by generating paired raw/stylized videos for training and evaluation.
#video stylization #image-to-video (I2V) #token-specific LoRA #condition injection #ControlNet depth #ControlNet pose #style consistency CSD #structure preservation DINO #flow matching #multi-style fusion #first-frame guidance #style-image-guided stylization #text-guided stylization #paired video dataset #Wan14B-I2V