SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing Model
Key Summary
- SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.
- It uses two teamwork lanes (one for video, one for audio) that constantly talk to each other so lips, actions, and sounds line up.
- A smart language-and-vision brain (an MMLM) reads mixed instructions—text, images, video clips, masks, and audio examples—to guide both lanes.
- A simple mask-and-channels trick treats many jobs (text-to-video, image-to-video, extension, and editing) as one inpainting problem.
- For speed and quality, it first creates a full low-resolution video plus a few high-resolution keyframes, then a Refiner upsamples and fills the in-between frames.
- Special attention layers let video and audio share timing cues, and RoPE scaling matches their different time scales for better sync.
- The model reaches up to 1080p, 32 FPS, and 15 seconds, supporting multi-shot scenes with synchronized audio.
- It ranked third on a public arena for video-with-audio and scored highest overall on the team’s human evaluation benchmark.
- This approach is practical for creators: it can add or remove objects, change styles, or match a video to a given voice or music sample.
- SkyReels-V4 shows that one model can handle many video and audio tasks together without falling apart.
Why This Research Matters
SkyReels-V4 brings a “one-stop studio” to creators, students, and businesses by generating video and audio together under one roof, so timing and mood stay aligned. It speeds up production by unifying text-to-video, editing, and inpainting behind the same interface, saving time and reducing tool-switching. Teachers can craft richer lessons—like historical reenactments with matching narration—without expert video skills. Global teams benefit from multilingual prompts, voice guidance, and consistent character references across multiple shots. Marketers can iterate quickly on product videos, swapping backgrounds, styles, or voices in minutes. Accessibility improves too: sign language, captions, voiceovers, and described audio can be synced more reliably. Overall, it lowers the barrier to making polished, cinema-like content that follows complex instructions.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine making a short movie for class. You have to film scenes, record voices, add music, and fix mistakes. Doing all that one piece at a time is slow and things can easily fall out of sync.
🥬 The Concept (Video basics): What it is: A video is a flipbook of pictures called frames; audio is a wavy line that represents sound over time. How it works: 1) A camera captures many frames per second (e.g., 32 FPS). 2) A microphone captures a continuous sound wave. 3) A player shows the right picture at the right moment while playing the matching sound. Why it matters: If the frames and sound don’t line up, lips won’t match voices and crashes won’t match bang sounds. 🍞 Anchor: When you watch a cartoon and a character claps, you expect the clap sound to land exactly when the hands meet.
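To make the flipbook-plus-sound-wave idea concrete, here is a quick arithmetic sketch; the 48 kHz audio sample rate is an illustrative assumption, not a number from the paper:

```python
# At 32 FPS, a 15-second clip needs 480 frames, and each frame spans
# 31.25 ms. The 48 kHz sample rate is an assumed, common audio rate.
fps = 32
seconds = 15
sample_rate = 48_000                     # assumed audio rate

frames_in_clip = fps * seconds           # 480 frames to draw
samples_per_frame = sample_rate // fps   # 1500 audio samples per frame
ms_per_frame = 1000 / fps                # 31.25 ms per frame

print(frames_in_clip, samples_per_frame, ms_per_frame)
```

Even a one-frame slip is about 31 ms of drift, which is enough for viewers to notice a late clap.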
🍞 Hook: You know how you learn by doing lots of examples—like practicing math problems again and again?
🥬 The Concept (Machine learning): What it is: Machine learning teaches computers patterns from many examples so they can create or predict new things. How it works: 1) Feed in many labeled examples. 2) The model tries to guess. 3) Compare its guess to the real answer. 4) Nudge the model to be a little less wrong. 5) Repeat until it gets good. Why it matters: Without training on lots of examples, a model can’t make believable videos or sounds. 🍞 Anchor: After seeing thousands of pictures of dogs and cats, a model learns the difference—just like you do.
🍞 Hook: Imagine trying to make a music video by first making the music one day, then filming the video weeks later, with no plan to match them up.
🥬 The Concept (Old video/audio generators): What it is: Older systems often made video and audio separately. How it works: 1) Make a video from text. 2) Then make audio from the same or similar text. 3) Try to glue them together. Why it matters: This leads to mismatched lips, late sound effects, and a sloppy feel. 🍞 Anchor: It’s like dubbing a movie where the voice comes half a second late.
🍞 Hook: Think of giving directions with words plus pointing to a map—much clearer than words alone.
🥬 The Concept (Multimodal inputs): What it is: Multimodal means using different kinds of inputs at once—text, images, videos, masks, and audio. How it works: 1) Text describes the plan. 2) Images or video clips act as visual examples. 3) Masks say which parts to keep or change. 4) Audio snippets show voice or music style. Why it matters: One signal can be vague; together, they give precise guidance. 🍞 Anchor: “Make this dog (image) run like this athlete (video) while using this drumbeat (audio). Don’t touch the background (mask).”
🍞 Hook: Picture cleaning a messy drawing by adding tiny dots until a clean picture appears.
🥬 The Concept (Diffusion models): What it is: A diffusion model learns to turn random noise into a clear picture, video, or sound. How it works: 1) Start with noise. 2) Take small steps to remove noise. 3) Each step uses learned hints to reach the target. Why it matters: This step-by-step cleanup makes high-quality generations possible. 🍞 Anchor: Like focusing a blurry photo until it’s sharp.
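The step-by-step cleanup can be sketched as a toy loop, assuming we already know the clean target; a real diffusion model instead learns each step's direction from data:

```python
# Toy sketch of the diffusion idea: start from pure noise and take many
# small steps toward a clean signal. Here we "cheat" by knowing the
# target; a trained model predicts the step direction instead.
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 16)    # the "clean" signal we want
x = rng.normal(size=16)               # start from random noise

num_steps = 50
for step in range(num_steps):
    # Move part of the remaining way toward the clean signal each step.
    x = x + (target - x) / (num_steps - step)

print(np.allclose(x, target))   # True: the noise has been cleaned away
```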
🍞 Hook: In a group chat, everyone listens, then replies based on what matters most.
🥬 The Concept (Transformers): What it is: Transformers are models that use attention to decide which parts of the input to focus on. How it works: 1) Look at all tokens (pieces). 2) Score how relevant each is. 3) Mix them using those scores. 4) Repeat in layers. Why it matters: Without attention, the model would treat every detail equally and miss what’s important. 🍞 Anchor: When asked “Capital of France?”, it focuses on “capital” and “France,” not filler words.
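The "score and mix" steps above can be written in a few lines of NumPy; this is a minimal single-head self-attention sketch, not the paper's implementation:

```python
# Minimal scaled dot-product attention: every token scores every other
# token, then mixes their values by those scores.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (num_tokens, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # step 2: relevance scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # step 3: weighted mix

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))              # 5 tokens, 8-dim each
out = attention(tokens, tokens, tokens)       # self-attention
print(out.shape)                              # (5, 8)
```

Stacking many such layers (step 4) lets the model repeatedly refocus on what matters.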
The world before this paper had strong text-to-video and some audio-to-video tools, but none truly unified: mixing many input types, making video and matching audio together, and handling generation, inpainting (filling or changing parts), and editing—all inside one model. People tried add-on adapters or simple cross-attention, but audio and video still didn’t sync tightly and editing felt like a separate tool. What was missing was a single “studio” where vision and sound learn together from the start, can be steered by mixed inputs, and can treat many tasks as one simple recipe.
Why we should care: That unified studio saves time and confusion. It helps creators, teachers, and everyday users turn ideas into polished clips where actions and sounds match—like a mini movie studio in your backpack.
02 Core Idea
🍞 Hook: Imagine a two-person bicycle: one rider is in charge of steering (video), the other keeps the rhythm (audio). They share a headset with directions (the instruction model), so they move as one.
🥬 The Concept (Key insight): What it is: Train video and audio together in a dual-stream Diffusion Transformer that shares a powerful multimodal instruction encoder, and treat many video tasks as one inpainting problem with a simple mask-and-channels trick. How it works: 1) A shared multimodal language-vision-audio encoder reads rich instructions. 2) Two synchronized transformer streams—one for video, one for audio—constantly exchange signals. 3) For video tasks, add condition frames and a mask as extra channels so many jobs look the same to the model. 4) For efficiency and quality, generate a low-res full clip plus high-res keyframes, then refine. Why it matters: This turns scattered tools into one dependable studio that follows complex directions and keeps sight and sound in sync. 🍞 Anchor: “Make the boy from this photo skateboard like this clip, keep the city background, and use this breeze sound.” It outputs a 1080p, 32 FPS, 15-second video with matching audio.
🍞 Hook: Think of a dance duo practicing together from day one.
🥬 The Concept (Dual-stream MMDiT): What it is: Two synchronized transformer branches—one for video, one for audio—built with diffusion steps and attention. How it works: 1) Early layers align text with each modality. 2) Later layers share parameters for speed. 3) Cross-attention lets video and audio trade timing cues. 4) Both end up agreeing on “what happens when.” Why it matters: Separate training leads to off-beat moves; joint learning keeps perfect timing. 🍞 Anchor: A door slam in video fires the “slam” sound right then, not three frames late.
🍞 Hook: Imagine a friendly librarian who understands books, photos, and songs—and can explain them all together.
🥬 The Concept (Shared MMLM instruction encoder): What it is: A multimodal large language model that reads mixed inputs—text, images, videos, audio—and turns them into one set of guiding features. How it works: 1) Combine all instructions (e.g., text + references). 2) Encode into embeddings. 3) Feed these to both video and audio streams. Why it matters: Without one shared brain, the two streams might follow different stories. 🍞 Anchor: “Use @image_1’s person, @video_2’s camera angle, and @audio_3’s calm music” becomes one consistent plan.
🍞 Hook: Think of a coloring book where a mask tells you which parts to paint and which to leave alone.
🥬 The Concept (Channel-concatenation inpainting): What it is: A simple way to unify video tasks by stacking (concatenating) noisy video, condition frames, and a mask into one input. How it works: 1) Encode known frames with a VAE. 2) Put them next to the noisy video latents. 3) Add a mask saying where to keep or change. 4) The model fills or edits only where needed. Why it matters: Many different tasks now look like the same job, making training and usage simpler. 🍞 Anchor: Image-to-video is just “first frame fixed; rest to generate.” Editing is “these regions fixed; others to change.”
🍞 Hook: When baking a cake, it helps to have example photos on the counter.
🥬 The Concept (In-context visual references via temporal concatenation): What it is: Put reference image/video latents in front of the generation tokens so the model can look at them during self-attention. How it works: 1) Encode references. 2) Prepend them before the video latents. 3) Use special positional offsets so the model knows they are references. 4) Attend over them while generating. Why it matters: Without direct references, tiny details (like a character’s freckles) may be lost. 🍞 Anchor: The model copies a jacket’s exact pattern from the reference image.
🍞 Hook: A drummer and guitarist need the same beat count so they hit notes together.
🥬 The Concept (RoPE scaling for temporal alignment): What it is: Adjust the audio’s positional encoding so its many tokens line up with the fewer video frames in time. How it works: 1) Use Rotary Positional Embeddings for both. 2) Scale audio’s time frequencies to match video’s slower pace. 3) Now cross-attention lines up moments correctly. Why it matters: Otherwise, one second of audio might attend to the wrong video moment. 🍞 Anchor: A five-syllable word lines up with five mouth shapes.
🍞 Hook: First sketch the full comic in small thumbnails, then draw a few panels in high detail, and finally clean up the whole page.
🥬 The Concept (Low-res full + high-res keyframes + Refiner): What it is: A speed-and-quality trick: generate the whole clip at low-res and a subset of keyframes at high-res, then upscale and interpolate. How it works: 1) Base model outputs low-res all frames + high-res keyframes. 2) Refiner upsamples and fills in-betweens. 3) Sparse attention (VSA) keeps compute low. Why it matters: Going directly to full 1080p, long clips is too slow and memory-heavy. 🍞 Anchor: You quickly preview the entire scene, perfect a few hero shots, then polish everything to cinematic quality.
Before vs. After: Before, you’d juggle many tools and risk audio-video mismatch. After, one model follows complex directions, edits precisely, and keeps sight and sound locked together. Why it works: attention shares what matters across time and modalities; masks turn many jobs into one; scaled positions keep moments aligned; and the two-stream design lets video and audio learn to dance in sync. Building blocks: the MMLM encoder, dual-stream MMDiT, channel-concatenation inpainting, in-context references with offsets, bidirectional cross-attention, RoPE scaling, and the Refiner with VSA.
03 Methodology
At a high level: Inputs (text + images + video clips + masks + audio refs) → Shared multimodal encoder (MMLM) → Prepare latents (VAEs) → Dual-stream MMDiT (video/audio) with cross-attention and RoPE scaling, trained by flow matching → Decode to low-res full video + high-res keyframes + audio → Refiner does super-resolution and frame interpolation → Final 1080p, 32 FPS, up to 15 s video with synchronized audio.
🍞 Hook: Think of a recipe card that lists ingredients from different stores—words, pictures, clips, sounds.
🥬 The Concept (Multimodal instruction encoding): What it is: One encoder (MMLM) turns mixed instructions into guiding features for both streams. How it works: 1) Pack text plus references together. 2) The MMLM produces embeddings. 3) Both video and audio transformers read those embeddings. Why it matters: With one shared plan, both streams stay on the same story. 🍞 Anchor: “A rainy city chase; keep this actor; use this sax riff; don’t change the sky (mask).”
Step A — Latent preparation with VAEs 🍞 Hook: Shrinking a giant poster before mailing it makes shipping easier.
🥬 The Concept (VAEs for video/audio latents): What it is: VAEs compress frames and audio into smaller, learnable latents. How it works: 1) Encode images/frames/audio to latents. 2) The diffusion transformer operates on latents. 3) Decode at the end to pixels/waveforms. Why it matters: Working in latent space is faster and cheaper than on raw pixels or 44.1 kHz audio directly. 🍞 Anchor: A 1080p frame becomes a small latent grid; a 5-second sound becomes a compact sequence.
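A rough back-of-envelope for why latents help, assuming a typical 8x spatial downsample and 16 latent channels (common VAE design choices, not the paper's published numbers):

```python
# Compare raw pixel counts with latent counts for one 1080p frame.
# The 8x downsample and 16 channels are assumed for illustration.
h, w = 1080, 1920
pixel_values = h * w * 3                    # raw RGB values per frame
latent_values = (h // 8) * (w // 8) * 16    # values per latent frame

print(pixel_values, latent_values, pixel_values // latent_values)
```

Roughly 12x fewer values per frame means every attention layer and diffusion step gets correspondingly cheaper.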
Step B — Dual-stream MMDiT blocks (early dual-stream, later single-stream) 🍞 Hook: In early rehearsals, each musician has their own coach; later, they practice as one band.
🥬 The Concept (Hybrid block design): What it is: Early layers keep text separate for strong alignment; later layers merge for efficiency. How it works: 1) Dual-stream early: separate norms and projections, joint self-attention. 2) Single-stream later: shared parameters on concatenated tokens. 3) Extra text cross-attention in video blocks prevents drifting from the prompt. Why it matters: Purely separate wastes time; purely merged can blur meanings. This mix balances both. 🍞 Anchor: The video still obeys “nighttime, neon-lit alley” even in late layers.
Step C — Bidirectional audio-video cross-attention and RoPE scaling 🍞 Hook: Two runners glance at each other’s strides to keep pace.
🥬 The Concept (Bidirectional cross-attention): What it is: Audio attends to video and video attends to audio at each layer. How it works: 1) Video queries audio features. 2) Audio queries video features. 3) Shared latent sizes avoid clumsy adapters. Why it matters: If only one side listened, timing would drift. 🍞 Anchor: A glass drop aligns with a tink sound exactly.
🍞 Hook: Matching calendars in weeks vs. days needs a conversion chart.
🥬 The Concept (RoPE temporal scaling): What it is: Scale audio’s positional timing so it corresponds to video frames. How it works: 1) Apply Rotary Positional Embeddings to both. 2) Multiply audio’s time frequencies by a small factor so segments align. Why it matters: Misaligned positions break lip-sync and sound effects timing. 🍞 Anchor: A 21-frame clip lines up with hundreds of audio tokens without confusion.
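The conversion-chart idea can be sketched directly; the token rates below are illustrative assumptions, not the model's actual latent rates:

```python
# Sketch of RoPE temporal scaling: audio has many more tokens per second
# than video has latent frames, so audio positions are rescaled so both
# streams index the same moments in time.
import numpy as np

video_fps = 8      # latent video frames per second (assumed)
audio_tps = 50     # audio latent tokens per second (assumed)
scale = video_fps / audio_tps

def rope_angles(positions, dim=8, base=10000.0):
    # Rotary embeddings rotate query/key pairs by position-dependent angles.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

video_pos = np.arange(16)             # 2 s of video latent frames
audio_pos = np.arange(100) * scale    # 2 s of audio tokens, rescaled

# After scaling, the audio token at t = 1 s gets the same rotary angles
# as the video frame at t = 1 s, so cross-attention lines up moments.
assert np.allclose(rope_angles(audio_pos[50:51]), rope_angles(video_pos[8:9]))
```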
Step D — Unified video inpainting via channel concatenation 🍞 Hook: Label your canvas with “paint here” and “do not paint here.”
🥬 The Concept (Mask + condition frames): What it is: Stack noisy video latents, condition latents, and a mask along channels. How it works: 1) The mask marks fixed vs. to-be-generated regions. 2) Different tasks become different masks. 3) Training sees all as one inpainting job. Why it matters: A single interface covers text-to-video, image-to-video, extension, interpolation, and editing. 🍞 Anchor: To extend a video, set early frames masked as fixed and generate the rest.
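The mask-plus-condition stacking might look like this in NumPy, with illustrative shapes rather than the model's real latent sizes:

```python
# Sketch of the mask-and-channels trick: stack noisy latents, condition
# latents, and a binary mask along the channel axis so every task
# presents the same input layout to the model.
import numpy as np

frames, channels, h, w = 8, 4, 16, 16
noisy = np.random.randn(frames, channels, h, w)   # latents to denoise
condition = np.zeros((frames, channels, h, w))    # known-frame latents
mask = np.zeros((frames, 1, h, w))                # 1 = keep, 0 = generate

# Image-to-video: the first frame is fixed, the rest are generated.
condition[0] = np.random.randn(channels, h, w)    # encoded first frame
mask[0] = 1.0

# Text-to-video would be an all-zero mask; editing would mark the
# regions to keep. Every task shares this one interface.
model_input = np.concatenate([noisy, condition, mask], axis=1)
print(model_input.shape)   # (8, 9, 16, 16): 4 noisy + 4 condition + 1 mask
```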
Step E — In-context visual/audio references 🍞 Hook: Keep reference photos and a music sample on the desk while you work.
🥬 The Concept (Temporal concatenation with offsets): What it is: Prepend visual refs to attention with negative time indices; supply audio refs similarly for the audio stream. How it works: 1) Encode refs via VAE. 2) Prepend before generation tokens. 3) Use offset RoPE so the model knows they’re references. Why it matters: Fine details and style carry over with fidelity. 🍞 Anchor: The hero’s freckles and the choir’s timbre persist in the output.
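A minimal sketch of prepending references with offset time indices; the negative-index convention and the sizes here are illustrative assumptions:

```python
# Sketch of in-context references via temporal concatenation: reference
# latents sit in front of the generation tokens with offset (here,
# negative) time positions, so self-attention can read them while
# knowing they are references, not frames to produce.
import numpy as np

dim = 8
ref_tokens = np.random.randn(4, dim)      # encoded reference frames
gen_tokens = np.random.randn(16, dim)     # latents to be generated

sequence = np.concatenate([ref_tokens, gen_tokens], axis=0)
positions = np.concatenate([
    -np.arange(len(ref_tokens), 0, -1),   # refs live at t = -4..-1
    np.arange(len(gen_tokens)),           # targets live at t = 0..15
])
print(sequence.shape, positions[:6])      # (20, 8) [-4 -3 -2 -1  0  1]
```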
Step F — Training with flow matching 🍞 Hook: Imagine guiding a paper boat from noisy waters to a calm harbor along the best current.
🥬 The Concept (Flow matching objective): What it is: Learn a velocity field that pushes noise to data for both streams together. How it works: 1) Mix data with noise by a time t. 2) Predict the velocity to move toward clean targets. 3) Train both audio and video jointly to encourage sync. Why it matters: It’s stable and matches diffusion quality with efficient training. 🍞 Anchor: Over steps, random static turns into a sharp, moving scene with clean audio.
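The mix-then-predict-velocity recipe can be sketched as follows, using one common rectified-flow convention (straight paths from noise to data); the zero "prediction" is a placeholder for an untrained model:

```python
# Sketch of the flow-matching objective: blend data with noise at a
# random time t, and train the model to predict the constant velocity
# that carries noise to data along a straight path.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(loc=3.0, size=(256, 2))   # "data" samples
x0 = rng.normal(size=(256, 2))            # pure noise
t = rng.uniform(size=(256, 1))            # random times in [0, 1]

x_t = (1 - t) * x0 + t * x1   # a point along the noise-to-data path
v_target = x1 - x0            # the constant velocity of that path

# Training minimizes the gap between predicted velocity and v_target;
# a zero output stands in for an untrained model here.
loss = np.mean((np.zeros_like(v_target) - v_target) ** 2)
print(loss > 0)   # True: an untrained model still has work to do
```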
Step G — Decoding and efficient refinement 🍞 Hook: Rough cut first, hero shots second, polish last.
🥬 The Concept (Low-res full + high-res keyframes + Refiner with VSA): What it is: Generate full low-res sequences plus selected high-res keyframes; then a Refiner upsamples and interpolates with sparse attention. How it works: 1) Base model outputs low-res all frames and high-res keyframes. 2) Linearly upscale low-res latents; replace keyframe spots with high-res latents. 3) Concatenate with noisy high-res latents and run Refiner. 4) VSA focuses attention only where it matters. Why it matters: Cuts compute ~3× while keeping cinematic quality. 🍞 Anchor: A 15-second 1080p, 32 FPS clip becomes feasible on modern GPUs.
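Assembling the Refiner's input might look like this sketch, with a nearest-neighbor upscale standing in for the paper's linear upscaling, and made-up shapes:

```python
# Sketch of the Refiner's input assembly: upscale the low-res latents,
# then overwrite the keyframe slots with the base model's already-sharp
# high-res latents before the refinement pass.
import numpy as np

frames, c = 16, 4
low_res = np.random.randn(frames, c, 8, 8)
keyframe_idx = [0, 5, 10, 15]
high_res_keys = np.random.randn(len(keyframe_idx), c, 16, 16)

# Nearest-neighbor 2x upscale of every low-res latent frame
# (a simplification of the linear upscaling described above).
upscaled = low_res.repeat(2, axis=2).repeat(2, axis=3)

# Replace keyframe positions with the high-res latents.
for i, f in enumerate(keyframe_idx):
    upscaled[f] = high_res_keys[i]

print(upscaled.shape)   # (16, 4, 16, 16): ready for the Refiner pass
```

The Refiner then only has to sharpen textures and fill in-between motion, rather than invent the whole clip at full resolution.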
Step H — Data and prompts 🍞 Hook: A good cookbook matters.
🥬 The Concept (Structured captions and diverse data): What it is: Mix real and synthetic data; use structured tags like <dialogue>, <sfx>, <bgm> to guide learning. How it works: 1) Collect and clean massive image/video/audio sets. 2) Caption thoroughly (short, long, structured). 3) Filter for sync (e.g., SyncNet for lip-audio alignment). 4) Train progressively from images → videos → audio → joint. Why it matters: Clear labels and variety teach the model to follow complex, multi-shot, multilingual instructions. 🍞 Anchor: “<dialogue>Hello</dialogue> with soft <bgm>jazz</bgm> and <sfx>rain</sfx>” leads to accurate speech timing and mood.
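The structured tags can be pulled out with a few lines of Python; the tag names come from the section above, while the parsing code itself is our own illustration:

```python
# Extract <dialogue>, <sfx>, and <bgm> spans from a training caption.
# The regex with a backreference (\1) matches each tag with its own
# closing tag.
import re

caption = ("<dialogue>Hello there</dialogue> over "
           "<bgm>soft jazz</bgm> with <sfx>rain</sfx>")
tags = dict(re.findall(r"<(dialogue|sfx|bgm)>(.*?)</\1>", caption))
print(tags)   # {'dialogue': 'Hello there', 'bgm': 'soft jazz', 'sfx': 'rain'}
```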
Secret sauce:
- One interface for many tasks (channel-concatenation + masks).
- Constant audio-video communication (bidirectional cross-attention + RoPE scaling).
- Fast cinematic output (low-res full + high-res keyframes + Refiner with VSA).
- Rich, structured instructions (MMLM + tagged captions).
04 Experiments & Results
🍞 Hook: If five teams cook the same dish, which tastes best? You need judges, rules, and a scoreboard.
🥬 The Concept (The test): What it is: Evaluate overall quality of video+audio generations, focusing on instruction following, sync, and aesthetics. How it works: 1) Public arena (Artificial Analysis) uses pairwise user votes to compute Elo scores. 2) A new human benchmark (SkyReels-VABench) rates five dimensions with experts. Why it matters: Numbers alone don’t show if lips match words or if music fits the mood—humans can. 🍞 Anchor: It’s like a science fair with judges checking both how it looks and how it sounds.
The competition: The model was compared to strong systems like Veo 3.1, Kling 3.0, grok-imagine-video, Sora-2, Vidu-Q3, and Wan 2.6 in the public arena; and against Veo 3.1, Kling 2.6, Seedance 1.5 Pro, and Wan 2.6 in detailed human studies.
Scoreboard with context:
- Artificial Analysis Video Arena (text-to-video with audio): SkyReels-V4 ranked 3rd as of 2026-02-24. That’s like getting a high A when the class average is a B.
- SkyReels-VABench (expert human ratings across 2000+ prompts): Highest overall average score among competitors. It especially excelled in Prompt/Instruction Following and Motion Quality; Visual Quality was neck-and-neck with top peers; Audio-Visual Sync and Audio Quality were strong and competitive.
🍞 Hook: When something unexpected pops up in an experiment, scientists pay attention.
🥬 The Concept (Surprising findings): What it is: Where the model performed better or differently than expected. How it works: 1) Multi-shot scenes: the model held cross-shot consistency well. 2) Mixed-language prompts: robust following thanks to multilingual captions and TTS data. 3) Rich editing tasks: unified inpainting made unusual edits consistently stable. Why it matters: Multi-shot and multilingual stability are hard; succeeding shows the model really understands instructions and timing. 🍞 Anchor: A story with three cuts—close-up, wide, and over-the-shoulder—kept the same character look and matching voice tone throughout.
Takeaways per dimension:
- Instruction Following: The shared MMLM and extra text cross-attention in video blocks helped it stick to details (subjects, style, camera moves).
- Audio-Visual Synchronization: Bidirectional cross-attention plus RoPE scaling reduced lip-slip and late sound effects.
- Visual Quality: The keyframe+Refiner pipeline delivered crisp textures at 1080p without excessive flicker.
- Motion Quality: Flow matching and progressive video training improved smooth camera motion and physical plausibility.
- Audio Quality: Broad speech/music/sfx pretraining produced clean, on-theme soundtracks, with strong speaker traits.
In short, public votes and expert ratings both point to a model that’s reliable, synchronized, and good at following complex, multi-shot, multimodal instructions.
05 Discussion & Limitations
🍞 Hook: Even great tools have limits—like a bike that’s amazing on roads but not for mountain rock climbing.
🥬 The Concept (Limitations): What it is: Clear cases where the model isn’t perfect. How it works: 1) Length and resolution: built for up to 15 s and 1080p; longer or 4K+ needs more compute or new tricks. 2) Complex multi-speaker conversations can still stress perfect lip-ID matching in crowds. 3) Extremely precise spatial audio (full 3D localization) is not the main focus. 4) Out-of-distribution styles or edge-case edits may wobble. 5) Strong dependence on caption and mask quality; bad inputs lead to bad outputs. 6) Data bias and safety concerns remain and require governance. Why it matters: Knowing boundaries helps you apply it wisely. 🍞 Anchor: A 5-minute 4K concert with moving audience mics is beyond its current sweet spot.
🥬 The Concept (Required resources): What it is: What you need to run it well. How it works: 1) Modern GPUs with ample VRAM for inference (especially the Refiner). 2) Fast storage and RAM for long sequences. 3) For training or fine-tuning, multi-GPU clusters. 4) Good prompt and mask tooling for best edits. Why it matters: Underpowered machines will be slow or run out of memory. 🍞 Anchor: A gaming laptop can preview; a workstation or cloud GPU is better for full 1080p outputs.
🥬 The Concept (When not to use): What it is: Situations where another tool might be better. How it works: 1) Real-time streaming with ultra-low latency. 2) Forensic or medical precision where any artifact is unacceptable. 3) Exact licensed music replication without rights. 4) Long-form 30+ minute narration where TTS + traditional editing pipelines may be safer. Why it matters: Matching the tool to the task avoids frustration. 🍞 Anchor: Broadcasting live sports highlights with instant generation isn’t its target.
🥬 The Concept (Open questions): What it is: Next puzzles to solve. How it works: 1) Scaling to 4K and minutes-long stories efficiently. 2) Even tighter speaker tracking across multi-person scenes. 3) Richer 3D-aware audio (room acoustics, precise localization). 4) Stronger safety filters and watermarking. 5) Better controllability (storyboards, beat-sheets, shot lists) and editing reversibility. Why it matters: These steps move from “great demo” to “everyday studio standard.” 🍞 Anchor: Think storyboard-to-screen with guaranteed sync, style, and safety—like a full film pipeline in one box.
06 Conclusion & Future Work
Three-sentence summary: SkyReels-V4 is a unified model that makes video and audio together while also handling inpainting and editing, all guided by rich multimodal instructions. It keeps sight and sound aligned with a dual-stream diffusion transformer, shared multimodal encoder, and a simple mask-based interface that turns many tasks into one. An efficient low-res + keyframe + Refiner design delivers cinematic 1080p, 32 FPS, 15-second clips at practical speeds.
Main achievement: Showing that one architecture can reliably accept text, images, videos, masks, and audio references to generate synchronized, editable video+audio with strong instruction following and motion quality.
Future directions: Scale to longer durations and 4K, advance multi-speaker lip-ID tracking, add richer spatial audio, improve safety controls and watermarking, and deepen controllability with storyboards and shot plans.
Why remember this: It’s a turning point from separate, fragile pipelines to a single studio that listens to multimodal instructions, edits precisely, and keeps everything in sync—bringing cinema-quality creation closer to everyday makers.
Practical Applications
- Create product demos from text plus a few reference images and a sample music track.
- Edit existing videos: remove watermarks/logos, change colors or materials, or swap backgrounds with masks.
- Extend a shot: keep the first seconds fixed and smoothly generate the next moments while matching ambient sound.
- Make multilingual announcements: use the same visuals but change speakers or languages via audio references.
- Storyboard-to-video: feed in style frames and a scripted prompt to generate multi-shot sequences with matched audio.
- Motion/style transfer: animate a person from a photo using the motion from a reference clip and a chosen soundtrack.
- Education content: generate lab demos with narrated explanations and synchronized sound effects.
- Social media: quickly produce themed clips (e.g., LEGO style or paper-cutting) with aligned music or voiceover.
- Film previsualization: rough low-res scenes plus keyframe detail, then refine to 1080p for pitch or planning.
- Game trailers or cutscenes: keep character identity consistent while changing camera moves and audio mood.