
SkyReels-V3 Technical Report

Intermediate
Debang Li, Zhengcong Fei, Tuanhui Li et al. · 1/24/2026
arXiv · PDF

Key Summary

  • SkyReels-V3 is a single AI model that can make videos in three ways: from reference images, by extending an existing video, and by creating talking avatars from audio.
  • It uses a diffusion Transformer (a powerful pattern-finder and fixer) inside a multimodal in-context learning setup so it can understand images, text, video, and audio together.
  • A special data pipeline cleans and prepares training pairs to avoid copy–paste artifacts and to keep characters and scenes consistent over time.
  • Multi-reference conditioning lets you guide a video with up to four images, keeping identities and backgrounds steady while following instructions from text.
  • Hybrid training on both images and videos, plus multi-resolution training, helps the model generalize to many scenes, sizes, and aspect ratios.
  • The video extension system supports both smooth single-shot continuation and pro-style shot switching (like cut-in, cut-out, reverse shot), with strong temporal coherence.
  • For talking avatars, key-frame-constrained generation and audio–visual alignment yield accurate lip-sync and long, stable 720p videos at 24 fps.
  • On benchmarks, SkyReels-V3 reaches state-of-the-art or near state-of-the-art results for visual quality, following instructions, and reference consistency.
  • It approaches closed-source leaders while remaining open-source, making it a strong foundation for research and production.
  • This unified design reduces tool-switching, speeds content creation, and supports use cases from ads and education to live commerce and entertainment.

Why This Research Matters

SkyReels-V3 reduces the need for multiple tools by unifying image-, video-, and audio-guided generation in one system. That speeds up content creation for educators, marketers, and filmmakers while keeping characters, styles, and scenes consistent. Better lip-sync and long-form stability make avatars more trustworthy and engaging for customer support, lectures, and entertainment. Strong instruction following turns prompts into reliable shots, saving time on reshoots and edits. Open-source availability encourages research progress, transparency, and innovation. As video becomes a core language of the internet, tools like this help people tell clearer, more expressive stories. It also sets a foundation for future interactive, real-time, multimodal creativity.

Detailed Explanation


01 Background & Problem Definition

You know how making a movie usually takes a whole team—someone writes the script, another directs, another records sound, and someone else edits? For a long time, AI video tools were like separate little helpers—one that made short clips from text, another that tried to edit style, and another that lip-synced faces—but they didn’t work together well. If you wanted a specific person, in a specific place, moving in a certain way, speaking the right words, you had to juggle many tools and pray they’d match up.

🍞 Top Bread (Hook): Imagine building a LEGO city using pieces from different sets that don’t click together. You can build something, but it’s wobbly and falls apart when you add more floors.

🥬 The Concept: Conditional Video Generation is making videos based on given clues—like images, words, or sounds.

  • How it works:
    1. You give the AI conditions (a prompt, reference images, a starter video, or an audio clip).
    2. The AI plans what should appear in each frame and how frames connect.
    3. It generates a sequence that should follow your conditions.
  • Why it matters: Without conditions, the AI just guesses and often makes pretty but irrelevant clips. 🍞 Anchor: If you say “A girl rides a bike by the beach,” conditional video generation keeps the bike, girl, and beach consistent across frames.

The problem was that past systems could handle one kind of condition at a time. A text-to-video model might ignore exact character identity. A face-animator might nail lip movements but lose background consistency. Many models created nice-looking frames but didn’t keep things coherent over time—faces drifted, clothes morphed, and scenes jumped around like a glitch.

🍞 Top Bread (Hook): You know how you can follow a cooking video better if you see the chef, read the recipe, and hear the instructions? Using just one is harder.

🥬 The Concept: Multimodal In-Context Learning means the AI learns from several kinds of information (text, images, audio, video) at once, in the same example.

  • How it works:
    1. Turn each input (text, audio, images, video) into tokens the AI can read.
    2. Mix them in one context window so the model can compare and connect them.
    3. Train the model to produce video that respects all inputs together.
  • Why it matters: If the AI can’t see all clues at once, it forgets or mismatches parts, like lips that don’t match speech. 🍞 Anchor: Give a portrait image and a speech clip; the model matches mouth shapes to sounds while keeping the same face and background.
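
To make the "one context window" idea concrete, here is a minimal sketch of how different modalities could be projected into a shared space and concatenated into a single sequence. All layer sizes, feature dimensions, and token counts below are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

# Project each modality into one shared embedding space, then concatenate them
# into a single context the model can attend over (shapes are assumptions).
D = 512                                    # shared model width
text_proj  = nn.Linear(300, D)             # e.g. text features  -> D
image_proj = nn.Linear(1024, D)            # e.g. image patches  -> D
audio_proj = nn.Linear(80, D)              # e.g. mel-spectrogram frames -> D

text_feats  = torch.randn(1, 20, 300)      # 20 text tokens (dummy data)
image_feats = torch.randn(1, 4 * 64, 1024) # 4 reference images, 64 patches each
audio_feats = torch.randn(1, 120, 80)      # 120 audio frames

context = torch.cat([
    text_proj(text_feats),
    image_proj(image_feats),
    audio_proj(audio_feats),
], dim=1)                                  # one sequence: (1, 20 + 256 + 120, D)
print(context.shape)                       # torch.Size([1, 396, 512])
```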

Before SkyReels-V3, researchers tried clever tricks. Some pasted a subject from a reference image into frames (leading to copy–paste artifacts and stiff motion). Others trained separate models for identity, motion, and audio, then stitched results—often causing style clashes and timing errors. Even strong diffusion video models struggled to balance identity preservation, motion realism, scene stability, and instruction following in one system.

🍞 Top Bread (Hook): Think of sculpting from a block of marble—start rough, then refine until it looks real.

🥬 The Concept: Diffusion Transformers combine diffusion (turn noise into images/videos step by step) with Transformers (great at modeling sequences and context).

  • How it works:
    1. Start with noisy video latents.
    2. A Transformer learns which parts of noise to remove using context (text, images, audio).
    3. Over many steps, it denoises into a clean, coherent video.
  • Why it matters: Without diffusion Transformers, the model either lacks fine detail (noisy) or loses global story flow (incoherent). 🍞 Anchor: Like un-blurring a foggy movie a little each step until it’s crisp—and still follows the script and soundtrack.
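
Below is a toy sketch of the diffusion training idea (corrupt clean latents with noise, then learn to predict that noise so it can be subtracted later). The linear schedule and tensor shapes are simplifying assumptions for illustration, not the schedule SkyReels-V3 uses:

```python
import torch

def add_noise(x0, t, num_steps=1000):
    """Blend clean video latents x0 with Gaussian noise according to timestep t."""
    alpha = 1.0 - t / num_steps                 # simple linear schedule (assumption)
    noise = torch.randn_like(x0)
    xt = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    return xt, noise

x0 = torch.randn(2, 16, 4, 32, 32)              # (batch, frames, channels, H, W) latents
t = torch.randint(1, 1000, (2, 1, 1, 1, 1)).float()
xt, noise = add_noise(x0, t)
# training target: the denoiser (a Transformer) predicts `noise` from (xt, t, context),
# e.g. loss = F.mse_loss(model(xt, t, context), noise)
```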

The gap this paper fills: a single, unified model that can take multiple kinds of inputs—images, video, audio, and text—and produce high-quality, temporally consistent, identity-preserving, instruction-following videos. It also handles video continuation with professional shot transitions and long, accurate talking avatars.

Real stakes? This affects everyday creativity: influencers want quick, consistent shorts; teachers need clear explainers; stores need product demos; filmmakers test storyboards; anyone making multi-shot scenes wants continuity. When identity, motion, and sound stay aligned, viewers trust and enjoy the video—no uncanny jolts, no off-sync lips, no random scene swaps. SkyReels-V3 moves us from frame-by-frame painting to story-level filmmaking by an AI that can listen, look, and follow directions—all at once.

02 Core Idea

The “Aha!” in one sentence: Put all the clues (text, images, video, audio) into one big brain (a diffusion Transformer) so it can learn to make coherent, controllable videos across tasks—without switching models.

Analogy 1—Movie Director’s Clipboard: A director keeps notes for actors (identity from images), script lines (text), soundtrack beats (audio), and last shot (starter video). SkyReels-V3 is that director-in-a-box, reading everything together to run a smooth scene.

Analogy 2—Cooking With All Ingredients on the Counter: Instead of fetching salt later and forgetting the pepper, the chef lays out every ingredient at once. The model sees all conditions at once, so timing, taste, and texture come out right.

Analogy 3—GPS With Traffic and Weather: A simple GPS (text-only) can route you, but adding live traffic (video), weather (audio cues), and landmarks (images) gives a safer, smoother trip. SkyReels-V3 uses every signal to plan the best video path.

Before vs After:

  • Before: Separate models or fragile pipelines: identity drifting, lip-sync off, scene changes jarring, instructions half-followed. Fine-tuning for each task bloated complexity.
  • After: One architecture supports three major tasks—reference images-to-video, video extension, and talking avatar—with strong temporal coherence, identity preservation, and instruction following. No heavy tool-switching.

Why it works (intuition):

  • Unified conditioning: When the model sees references, text, audio, and prior frames together, it learns the relationships—who the subject is, how they should move, what the scene should look like, and when to align mouth shapes with sounds.
  • Diffusion + Transformer synergy: Diffusion polishes details step by step; Transformers keep the global story and cross-modal connections straight.
  • Training that matches reality: Mixing image and video data teaches appearance and motion; multi-resolution teaches flexibility; curated reference pairs teach identity consistency without copy–paste.

Building Blocks (each with a Sandwich):

🍞 Hook: You know how a photo of your friend helps you spot them in a crowd? 🥬 Reference Images-to-Video Synthesis: It’s turning one or more reference images plus a prompt into a moving, identity-true video.

  • How:
    1. Encode 1–4 reference images.
    2. Mix them with a text prompt in the model’s context.
    3. Generate a video that keeps the subject’s look and follows the instructions.
  • Why: Without it, the person or product could subtly change each frame. 🍞 Anchor: Give a model photo and “walk through a garden at sunrise,” and the same person walks, with consistent clothes and face.

🍞 Hook: Like adding a new chapter that matches the style of the book you just read. 🥬 Video-to-Video Extension: Continue a video clip so the story flows naturally.

  • How:
    1. Encode the input video segment.
    2. Use text to guide what happens next.
    3. Generate new frames that match motion, style, and scene.
  • Why: Without careful extension, cuts feel jarring or the style shifts. 🍞 Anchor: A runner turns a corner; the model continues the run smoothly into the next street.

🍞 Hook: Dancing to a song’s beat helps you move in sync. 🥬 Audio-Guided Video Generation: Create a video (like a talking avatar) that matches an audio track’s timing and content.

  • How:
    1. Encode audio into phoneme and rhythm cues.
    2. Align face and body motions to those cues.
    3. Generate frames with accurate lip-sync and expressions.
  • Why: Without audio guidance, lips and emotions won’t match speech. 🍞 Anchor: A portrait plus speech audio yields a 24 fps video with perfect mouth shapes and expressions.

🍞 Hook: Planning a group photo needs knowing where everyone stands. 🥬 Multi-Reference Conditioning: Guide generation using up to four references (people, objects, scenes).

  • How:
    1. Encode each reference into latents.
    2. Concatenate with video latents.
    3. Let the model compose them into one coherent scene.
  • Why: Without it, multi-subject scenes tangle identities or backgrounds. 🍞 Anchor: Two friends and a café interior as references → both friends chat at that café.

🍞 Hook: Learning from both textbooks (images) and documentaries (videos) gives deeper understanding. 🥬 Hybrid Image-Video Training: Train on images for appearance and videos for motion.

  • How:
    1. Mix large image and video datasets.
    2. Alternate or blend batches in training.
    3. Optimize so the model learns detail and dynamics.
  • Why: Without this mix, the model either lacks crisp details or believable motion. 🍞 Anchor: It can render sharp fabric textures while animating them naturally in wind.

🍞 Hook: A good story keeps places and times consistent. 🥬 Spatiotemporal Consistency Modeling: Keep space (who/where) and time (when/how) steady across frames.

  • How:
    1. Use positional encodings across frames.
    2. Model motion trajectories, not isolated pictures.
    3. Penalize flicker and identity drift.
  • Why: Without it, faces warp, objects jump, and viewers get confused. 🍞 Anchor: A cup stays on the same table as the camera pans.

🍞 Hook: Key scenes in a movie anchor the plot. 🥬 Key-Frame-Constrained Generation: Pick important frames and smoothly connect them.

  • How:
    1. Decide key frames (starts, changes).
    2. Generate them first.
    3. Fill in transitions to keep motion stable.
  • Why: Without anchors, long videos wobble or drift. 🍞 Anchor: In a speech, key frames at sentence starts guide steady expressions throughout.

03 Methodology

At a high level: Inputs (a text prompt, optionally with 1–4 reference images, a starter video, and/or an audio clip) → Encode each modality into tokens/latents → Pack them into one context with positions → Diffusion Transformer denoises a noisy video latent sequence step by step → Video VAE decodes latents into high-resolution frames.
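
As a rough, runnable illustration of that flow, here is a toy version in Python. Every function below is a stand-in stub under assumed shapes; it only shows how the inputs funnel into one shared context before iterative denoising, not the released SkyReels-V3 code:

```python
import numpy as np

D = 64                                                        # toy feature width
encode_text  = lambda s: np.random.randn(len(s.split()), D)   # stub encoders (assumptions)
encode_image = lambda img: np.random.randn(16, D)             # 16 patch tokens per image
encode_audio = lambda n: np.random.randn(n, D)                # n audio frames

def generate_video(prompt, reference_images=(), audio_frames=0, steps=4):
    parts = [encode_text(prompt)] + [encode_image(i) for i in reference_images]
    if audio_frames:
        parts.append(encode_audio(audio_frames))
    context = np.concatenate(parts, axis=0)                   # one multimodal context
    latents = np.random.randn(8, D)                           # 8 noisy latent frames
    for _ in range(steps):                                    # iterative "denoising" (toy)
        latents = latents - 0.25 * (latents - context.mean(0))
    return latents                                            # a VAE would decode to frames

video_latents = generate_video("a girl rides a bike by the beach",
                               reference_images=[None, None], audio_frames=30)
print(video_latents.shape)                                    # (8, 64)
```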

Step-by-step (with Sandwich blocks for the clever pieces):

  1. Data Curation and Reference Pairing 🍞 Hook: If you practice basketball with bent hoops, your shot gets worse. 🥬 Reference-Preserving Data Construction: Build clean, diverse training pairs so the model learns identity and motion without cheating.
  • How:
    1. Filter large in-house videos for high quality and meaningful motion.
    2. Use cross-frame pairing to pick reference frames that are different enough to be useful but still consistent.
    3. Apply image editing to extract subjects and complete backgrounds, then semantic rewriting to avoid trivial copy–paste.
    4. Filter out distorted or inconsistent references.
  • Why: Without careful data, the model memorizes or copies, causing artifacts. 🍞 Anchor: Training on a dancer clip with well-chosen reference frames teaches steady identity across twirls.
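
A tiny sketch of what a cross-frame pairing rule might look like; the similarity thresholds and embedding features are purely illustrative assumptions, not the paper's filtering criteria:

```python
import numpy as np

# Keep a (reference frame, target clip) pair only if the two are similar enough
# to share identity but different enough that simple copying would fail.
def good_pair(ref_embed, tgt_embed, lo=0.55, hi=0.92):
    sim = float(np.dot(ref_embed, tgt_embed) /
                (np.linalg.norm(ref_embed) * np.linalg.norm(tgt_embed) + 1e-8))
    return lo < sim < hi     # too low: different subject; too high: near-duplicate

ref, tgt = np.random.randn(128), np.random.randn(128)
print(good_pair(ref, tgt))
```
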
  2. Visual Encoding and Latent Space 🍞 Hook: Shrinking a big map lets you plan a road trip faster. 🥬 Video VAE (Variational Autoencoder for Video): Compress frames into latents for efficient processing and decode back later.
  • How:
    1. Encode images/videos into compact latents.
    2. Run the diffusion process in latent space (faster, more stable).
    3. Decode latents back into crisp frames at the end.
  • Why: Without a VAE, computation explodes and training becomes unstable. 🍞 Anchor: A 720p clip gets compressed, edited in latent form, then restored with details.
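
Here is a toy stand-in for the latent roundtrip, assuming a simple 8x spatial downsample; the real video VAE is far more capable, but the shapes show why working in latent space is so much cheaper:

```python
import torch
import torch.nn as nn

class TinyFrameAE(nn.Module):
    """Toy per-frame autoencoder (architecture and compression factor are assumptions)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)       # 8x spatial downsample
        self.dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
    def encode(self, frames):                                     # frames: (T, 3, H, W)
        return self.enc(frames)
    def decode(self, latents):
        return self.dec(latents)

ae = TinyFrameAE()
frames = torch.randn(4, 3, 720, 1280)             # 4 frames of a 720p clip
latents = ae.encode(frames)                       # (4, 4, 90, 160): far fewer values
recon = ae.decode(latents)                        # back to (4, 3, 720, 1280)
print(latents.shape, recon.shape)
```
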
  3. Multi-Reference Conditioning 🍞 Hook: Organizing people into clear rows before a photo. 🥬 Multi-Reference Token Fusion: Ingest up to four references and keep them distinct yet composable.
  • How:
    1. Encode each reference image via the video VAE.
    2. Concatenate their latents with the current video latents.
    3. Use attention to let the model pull the right details at the right time.
  • Why: Without proper fusion, identities blend or get dropped. 🍞 Anchor: Two pets plus a living room reference guide a play scene without swapping fur patterns.
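
A minimal sketch of reference-token fusion, assuming a plain self-attention layer over concatenated reference and video tokens; the actual layer layout in SkyReels-V3 may differ:

```python
import torch
import torch.nn as nn

D = 256
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

video_tokens = torch.randn(1, 8 * 64, D)          # 8 latent frames x 64 patches each
ref_tokens   = torch.randn(1, 4 * 64, D)          # up to 4 reference images

# Joint self-attention over references + video lets each frame pull identity
# details from the right reference at the right time.
seq = torch.cat([ref_tokens, video_tokens], dim=1)
fused, _ = attn(seq, seq, seq)
video_out = fused[:, ref_tokens.shape[1]:]        # keep only the video positions
print(video_out.shape)                            # torch.Size([1, 512, 256])
```
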
  4. Text and Audio Conditioning 🍞 Hook: Reading subtitles while listening helps you understand a documentary. 🥬 Cross-Modal Conditioning: Align text and audio with visuals so actions follow instructions and lips match sounds.
  • How:
    1. Encode text (prompt/instructions) into contextual tokens.
    2. Encode audio into phoneme/rhythm features.
    3. Joint attention lets visuals follow meaning and timing.
  • Why: Without this, videos ignore prompts or go off-sync. 🍞 Anchor: “Zoom in on the sneakers” shifts camera framing while the narrator’s words match mouth shapes.
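
The sketch below shows one plausible wiring, assuming standard cross-attention where video tokens query the text and audio tokens; it is illustrative, not the paper's exact conditioning mechanism:

```python
import torch
import torch.nn as nn

D = 256
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

video_tokens = torch.randn(1, 8 * 64, D)          # generated frames (queries)
text_tokens  = torch.randn(1, 20, D)              # prompt tokens
audio_tokens = torch.randn(1, 120, D)             # phoneme/rhythm features over time

# Each video token looks up "what should happen" (text) and "what is being
# said right now" (audio) before deciding what to draw.
conditions = torch.cat([text_tokens, audio_tokens], dim=1)
conditioned, _ = cross_attn(query=video_tokens, key=conditions, value=conditions)
print(conditioned.shape)                          # torch.Size([1, 512, 256])
```
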
  5. Diffusion Transformer Denoising 🍞 Hook: Cleaning a window a little more each wipe until it’s clear. 🥬 Diffusion with a Transformer Backbone: Remove noise step-by-step while keeping story-wide consistency.
  • How:
    1. Start from noisy latent frames.
    2. At each step, the Transformer predicts the noise to subtract, using all inputs.
    3. Repeat until a clean video emerges.
  • Why: Without iterative refinement, you lose either details or the big picture. 🍞 Anchor: A snowy mountain scene sharpens over steps while preserving the skier’s identity and motion.
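
A toy sampling loop that captures the iterative idea; the update rule, step count, and dummy denoiser are simplifications rather than the real scheduler:

```python
import torch

def sample(denoiser, context, shape, num_steps=50):
    """Start from pure noise and subtract a little predicted noise per step."""
    x = torch.randn(shape)                        # noisy latent video
    for i in reversed(range(num_steps)):
        t = torch.full((shape[0],), i, dtype=torch.float32)
        predicted_noise = denoiser(x, t, context)
        x = x - predicted_noise / num_steps       # gradually remove noise
    return x

# dummy denoiser standing in for the diffusion Transformer (assumption)
denoiser = lambda x, t, ctx: 0.1 * x
clean = sample(denoiser, context=None, shape=(1, 16, 4, 32, 32))
print(clean.shape)                                # torch.Size([1, 16, 4, 32, 32])
```
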
  6. Training Strategies for Robustness 🍞 Hook: Training with both drills and scrimmages makes a stronger team. 🥬 Hybrid Image–Video + Multi-Resolution Training: Learn detail and motion across many sizes and aspect ratios.
  • How:
    1. Mix large-scale images (appearance) and videos (dynamics) in training.
    2. Randomize resolutions and aspect ratios.
    3. Jointly optimize so the model can natively output 1:1, 3:4, 16:9, 9:16, etc.
  • Why: Without this, models overfit to one size or get blurry at others. 🍞 Anchor: The same outfit looks sharp in portrait TikTok and widescreen film shots.
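
A minimal sketch of aspect-ratio bucketing, with assumed bucket names and sizes; each training batch is drawn from one bucket so the model sees 1:1, 3:4, 16:9, and 9:16 natively instead of cropping everything to one shape:

```python
import random

BUCKETS = {                      # illustrative sizes (assumptions)
    "1:1":  (768, 768),
    "3:4":  (672, 896),
    "16:9": (1024, 576),
    "9:16": (576, 1024),
}

def sample_batch_shape():
    """Pick one aspect-ratio bucket for the next training batch."""
    name = random.choice(list(BUCKETS))
    width, height = BUCKETS[name]
    return name, width, height

for _ in range(3):
    print(sample_batch_shape())
```
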
  7. Spatiotemporal Consistency and Positional Encoding 🍞 Hook: Page numbers and timestamps keep a documentary organized. 🥬 Unified Multi-Segment Positional Encoding: Label where/when each token belongs—even across multiple shots.
  • How:
    1. Assign positions for frames and segments.
    2. Train on hierarchical, multi-segment data.
    3. Encourage smooth transitions across cut boundaries.
  • Why: Without positions, the model confuses time and shots. 🍞 Anchor: A cut-in from wide to close-up keeps the actor’s pose and lighting consistent.
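
A small sketch of how segment and frame indices might be assigned, under an assumed layout: every token gets a segment index (which shot it belongs to) and a frame index inside that shot, so the model keeps order even across cuts.

```python
import torch

def segment_positions(frames_per_segment):
    """Return (segment_id, frame_id) labels for a multi-shot sequence."""
    seg_ids, frame_ids = [], []
    for seg, n_frames in enumerate(frames_per_segment):
        seg_ids   += [seg] * n_frames
        frame_ids += list(range(n_frames))
    return torch.tensor(seg_ids), torch.tensor(frame_ids)

# e.g. a 48-frame wide shot followed by a 24-frame close-up (cut-in)
seg_ids, frame_ids = segment_positions([48, 24])
print(seg_ids[45:51], frame_ids[45:51])   # segment flips to 1, frame index restarts at 0
```
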
  8. Video Extension with Shot Switching 🍞 Hook: Editors follow rules for cuts (cut-in, cut-out, reverse shot) to guide viewers. 🥬 Shot Switching Detector: Find and label shot types to build better training and inference.
  • How:
    1. Analyze long videos for transitions.
    2. Classify into single shot, cut-in, cut-out, multi-angle, shot/reverse shot, cut-away.
    3. Use labels to train smooth, cinematic switching.
  • Why: Without detecting shot types, transitions feel awkward. 🍞 Anchor: From a medium shot to a close-up, the model maintains eye-lines and lighting.
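
For intuition, here is a classic histogram-difference cut detector; it is a generic baseline technique, not the paper's detector, which goes further and classifies the shot type:

```python
import numpy as np

def detect_cuts(frames, threshold=0.5):
    """Flag a shot boundary when the color distribution changes sharply."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist = np.histogram(frame, bins=32, range=(0, 255))[0] / frame.size
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)                     # likely cut at frame i
        prev_hist = hist
    return cuts

frames = [np.full((64, 64, 3), 40, dtype=np.uint8)] * 10 \
       + [np.full((64, 64, 3), 200, dtype=np.uint8)] * 10   # hard cut at frame 10
print(detect_cuts(frames))                     # [10]
```
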
  9. Talking Avatar: Audio–Visual Alignment and Key Frames 🍞 Hook: Hitting the drum on the beat keeps the band tight. 🥬 Region Masking + Audio Alignment: Focus learning on mouth/face to match phoneme timing.
  • How:
    1. Mask regions (mouth/face) during training to emphasize sync.
    2. Align phonemes to frame times.
    3. Optimize perceptual realism and sync metrics.
  • Why: Without focus, the model spreads effort and lip-sync slips. 🍞 Anchor: Fast rap segments show precise mouth shapes on every syllable.
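
A sketch of a region-weighted reconstruction loss, with an assumed weighting scheme, to show how a mouth mask can concentrate learning exactly where lip-sync happens:

```python
import torch
import torch.nn.functional as F

def masked_recon_loss(pred, target, mouth_mask, face_weight=4.0):
    """Errors inside the mouth/face mask count more than the rest of the frame."""
    per_pixel = F.mse_loss(pred, target, reduction="none")
    weights = 1.0 + (face_weight - 1.0) * mouth_mask    # 1 outside, 4 inside the mask
    return (per_pixel * weights).mean()

pred   = torch.randn(1, 3, 64, 64)
target = torch.randn(1, 3, 64, 64)
mask   = torch.zeros(1, 1, 64, 64)
mask[:, :, 40:56, 24:40] = 1.0                          # toy mouth bounding box
print(masked_recon_loss(pred, target, mask))
```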

🍞 Hook: Sketching key poses before animating in-between frames. 🥬 Key-Frame-Constrained Generation: Stabilize long videos.

  • How:
    1. Generate first/last frames and other key poses.
    2. Inference fills smooth transitions in between.
    3. Maintain identity and motion flow over minutes.
  • Why: Long videos drift without anchors. 🍞 Anchor: A 60-second news monologue keeps the same face, lighting, and steady motions.
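
A toy illustration of anchoring a long video to key frames; linear blending stands in for the diffusion model's real in-betweening, purely to show why anchors prevent drift over long durations:

```python
import numpy as np

def fill_between_keyframes(keyframes, frames_per_gap=12):
    """Interpolate frames between fixed anchors so the ends are always pinned."""
    video = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for k in range(frames_per_gap):
            alpha = k / frames_per_gap
            video.append((1 - alpha) * a + alpha * b)   # each gap ends exactly at b
    video.append(keyframes[-1])
    return np.stack(video)

keys = [np.zeros((8, 8)), np.ones((8, 8)), np.zeros((8, 8))]   # 3 anchor frames
print(fill_between_keyframes(keys).shape)                       # (25, 8, 8)
```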

Secret Sauce:

  • Unified conditioning lets the model learn cross-modal cause-and-effect (words cue actions; audio shapes lips; images lock identity).
  • Hierarchical positional encoding and spatiotemporal modeling preserve narrative structure, not just pretty frames.
  • Data curation breaks copy–paste habits, teaching the model to genuinely synthesize consistent motion and composition.

04 Experiments & Results

The Test: The team evaluated three main abilities—(1) Reference Consistency (do faces, clothes, objects, backgrounds stay the same?), (2) Instruction Following (does the video follow the text directions?), and (3) Visual Quality (is it sharp, smooth, aesthetic, with good motion?). For talking avatars, they also tested Audio–Visual Sync (do lips match sound?) and Character Consistency (does the identity stay stable?).

The Competition: SkyReels-V3 was compared with strong systems like Vidu Q2, Kling, and PixVerse V5 for reference image-to-video. For talking avatars, it was compared to OmniHuman, KlingAvatar, and HunyuanAvatar. These are among the most capable public or widely cited systems.

Scoreboard with Context:

  • Reference Images → Video: SkyReels-V3 achieved top or near-top scores. For example, its Reference Consistency (~0.67) means it’s better at keeping the same person and clothes over time than others in the table (which were around ~0.60–0.65). Its Visual Quality (~0.81) is also strong—think getting an A when many get B or B+.
  • Instruction Following (~27.2) shows the model can parse and obey prompts well—like a student who follows a multi-step lab procedure without skipping parts.
  • Talking Avatars: On Audio–Visual Sync, SkyReels-V3 led or tied with the best, while Visual Quality and Character Consistency were competitive (top-tier or near the top). In plain terms: crisp faces, steady identities, and mouths that match the words—even during fast speech or singing.

Qualitative Results: The figures show complex, multi-subject scenes (e.g., a man playing with a dog; multiple people interacting), product showcases from a single image (e.g., clothing and bottles), and long talking-avatar clips at 720p/24 fps. Shot-switching demos (cut-in, cut-out, multi-angle, shot/reverse shot, cut-away) look cinematic, with stable lighting and geometry across transitions.

Surprising Findings:

  • Multi-reference inputs (up to four images) work well without manual compositing; the model learns to place subjects and backgrounds coherently.
  • The hybrid image–video training and multi-resolution strategy greatly improve out-of-distribution robustness—scenes and aspect ratios not seen often during training still generate cleanly.
  • Key-frame-constrained generation stabilizes minute-long outputs better than naive frame-by-frame methods, reducing identity drift and motion wobble.

What the Numbers Mean for Non-Experts:

  • Reference Consistency around ~0.67 vs. ~0.60 for competitors is like recognizing your friend in a crowd even when the camera moves: fewer mistakes.
  • Visual Quality ~0.81 vs. ~0.79 means fewer blurry frames and more natural motion: the same 720p footage looks sharper and flows more smoothly.
  • Higher sync scores in avatars mean words and lips line up so well you stop noticing the technology and just listen to the person.

Generalization: Tests spanned film/TV, e-commerce, and ads, with references including people, animals, objects, and scenes. The model handled rapid motion, multiple subjects, and various aspect ratios (1:1 to 9:16) with steady performance. This breadth suggests the architecture scales across creative and commercial use cases.

05 Discussion & Limitations

Limitations:

  • Extremely complex scenes with many fast-moving subjects can still cause subtle identity drift or motion jitter.
  • Long, unbroken shots beyond several minutes may slowly accumulate off-by-a-little errors without carefully placed key frames.
  • Highly unusual styles or rare object types (not well represented in training) can reduce instruction following or cause minor artifacts.
  • Audio with extreme tempo changes or heavy background noise can slightly degrade lip precision or expression timing.

Required Resources:

  • Training requires large-scale GPU clusters, curated multi-modal datasets, and careful data filtering/editing tools.
  • Inference at 720p/24 fps is practical on modern GPUs, but longer or higher-resolution outputs demand more memory and time.

When NOT to Use:

  • If you only need a single, static, photorealistic image, a dedicated image model may be faster and cheaper.
  • For precise physics or safety-critical predictions (e.g., medical procedures, autonomous driving decisions), use specialized simulation or validated systems—this is a generative storyteller, not a physics engine.
  • If lip-reading accuracy must be legally certified (e.g., forensic analysis), this model’s outputs are perceptually strong but not for certification.

Open Questions:

  • Can we design even better cross-modal attention that learns causal timing (e.g., breath, emphasis) for richer performances?
  • How to scale to 4K video with minute-long consistency without exploding compute?
  • Can we self-label shot types robustly in the wild to reduce dependency on hand-curated detectors?
  • How to reduce data bias so rare languages, accents, and visual cultures receive equal fidelity?
  • Can on-device or streaming variants keep low latency for live avatars and interactive editing?

06 Conclusion & Future Work

In three sentences: SkyReels-V3 is a unified diffusion-Transformer system that reads text, images, audio, and video together to create coherent, controllable videos. It supports three cornerstone tasks—reference images-to-video, video extension with cinematic shot switching, and long-form talking avatars with accurate lip-sync—while preserving identity and visual quality. Through curated data, hybrid training, multi-resolution learning, and key-frame constraints, it reaches state-of-the-art or near state-of-the-art performance across multiple metrics.

Main Achievement: Proving that one multimodal in-context framework can handle diverse video-generation tasks at high quality, reducing fragmentation and tool-switching in practical workflows.

Future Directions: Push to higher resolutions and longer durations; strengthen cross-modal timing and emotion control; improve robustness in fast, crowded scenes; advance self-supervised data labeling for broader coverage; and optimize for real-time and edge deployment.

Why Remember This: It marks a shift from frame-by-frame trickery to narrative-level, multimodal filmmaking by AI—one model that listens, looks, and follows directions while keeping identities and scenes steady. This unification lowers barriers for creators, researchers, and businesses, paving the way for scalable, controllable, and cinematic-quality video generation.

Practical Applications

  • Create product demo videos from a single product photo and a short prompt for e-commerce.
  • Extend short clips into longer, coherent scenes with smooth, cinematic shot transitions for film pre-visualization.
  • Generate talking-head lectures from slides and recorded audio for rapid educational content.
  • Produce brand avatars that speak scripted lines across multiple languages with accurate lip-sync.
  • Draft storyboards-to-animatics by turning reference stills and text into moving previews.
  • Design social media shorts where multiple characters interact in a referenced setting without identity drift.
  • Localize marketing videos by reusing the same spokesperson avatar with new audio tracks.
  • Create game cutscenes by extending gameplay captures into narrative transitions.
  • Prototype commercials by composing multiple reference images (product, spokesperson, location) into one video.
  • Assist live commerce with fast, style-consistent videos showcasing clothing, gadgets, or food.
#video generation · #diffusion transformer · #multimodal in-context learning · #reference image-to-video · #video extension · #talking avatar · #audio-visual synchronization · #spatiotemporal consistency · #key-frame generation · #multi-reference conditioning · #video VAE · #multi-resolution training · #shot switching detector · #narrative coherence · #open-source video model