
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Intermediate
Longbin Ji, Xiaoxiong Liu, Junyuan Shang et al. · 1/9/2026
arXiv · PDF

Key Summary

  • VideoAR is a new way to make videos with AI that writes each frame like a story, one step at a time, while painting details from coarse to fine.
  • It splits space and time: inside each frame it uses a multi-scale recipe (coarse-to-fine), and across frames it predicts the next frame causally, like flipping through a comic book.
  • A special 3D multi-scale tokenizer chops videos into compact, LEGO-like pieces so the model can work faster without losing important motion and detail.
  • To keep long videos steady, it adds two safety nets: Cross-Frame Error Correction (practice fixing inherited mistakes) and Random Frame Mask (don’t over-depend on past frames).
  • Multi-scale Temporal RoPE helps the model know where and when things are happening, so objects stay in the right place as time moves forward.
  • On UCF-101, VideoAR beats prior autoregressive methods, improving gFVD from 99.5 to 88.6, while cutting inference steps by over 10Ɨ for much faster sampling.
  • On the VBench suite, the larger VideoAR reaches 81.74 overall—competitive with diffusion systems that are an order of magnitude larger.
  • It can extend videos (image-to-video and video continuation) with good text alignment and smooth motion, producing 4-second clips at 384Ɨ672 resolution today.
  • The method scales with model size and shows clear gains as the Transformer backbone grows.
  • Overall, VideoAR narrows the gap with diffusion models, offering a faster, more scalable, and temporally consistent path for future video generation.

Why This Research Matters

VideoAR shows that we can generate high-quality, steady videos quickly without relying on heavy diffusion pipelines. That means more creators, teachers, and developers can afford to make and iterate on videos with less hardware and time. Its natural ability to extend clips and follow text closely helps with storyboarding, marketing, education, and game content. The approach plugs into LLM-style infrastructure, opening a path to unified tools that handle text, images, audio, and video together. As the method scales to higher resolution and FPS, it can power practical, controllable video tools. Training tricks that teach the model to fix its own mistakes make long videos more reliable. Altogether, this is a step toward everyday, fast, and flexible video generation.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how making a flipbook takes time if you draw every page at once? But if you draw one page, then the next, you can go faster and keep your story straight.

🄬 Filling (The Actual Concept): Before this paper, most top video AIs used diffusion and flow-matching. They work like very careful erasers: start with noisy frames and clean them up step by step across the whole video chunk. This gives pretty results but is slow and heavy on computers. Meanwhile, autoregressive (AR) models—great for text—were getting better at images by predicting what comes next, especially with Visual Autoregressive (VAR) modeling that paints images from coarse to fine scales. But in video, AR was still clumsy: it tried to predict tokens in a flat order, mixing up space and time, which caused errors to pile up over long clips.

Why it matters: Without a faster, scalable method, high-quality video generation stays expensive, slow, and hard to control for length and motion.

šŸž Bottom Bread (Anchor): Imagine a classroom making a class movie. Diffusion is like the whole class cleaning up the entire movie at once—hard! AR is like each student drawing their page in order—simpler. This paper teaches the students a better way to draw and follow each other, so the movie looks right and finishes faster.

The World Before:

  • Diffusion-led video models (like Sora/Veo-family) delivered stunning fidelity and smooth motion by denoising whole clips. But they were costly to run and awkward to scale for longer or variable-length videos.
  • AR for images was rising: tokenizers turned pictures into discrete tokens; LLM-style models predicted tokens in order; and new VAR methods learned to generate images coarse-to-fine (next-scale prediction) in fewer steps.

The Problem:

  • Spatial vs temporal mismatch: 2D image structure (which benefits from coarse-to-fine painting) isn’t the same as 1D time (which benefits from causal next-step storytelling). A single "next-token in a long line" strategy struggles to respect both.
  • Error propagation: In long videos, tiny mistakes early on snowball into wobble, blur, or drift.
  • Limited control: It’s hard to finely control motion strength, video length, or how much to trust past frames.

Failed Attempts:

  • Flat next-token schemes (rasterizing pixels/tokens) balloon the sequence length, slow generation, and miss 2D spatial relationships.
  • Token-based AR video models improved some aspects but still suffered drift and low resolution due to long sequences and weak spatial correlation modeling.
  • AR-diffusion hybrids improved quality but kept much of diffusion’s heavy inference cost.

The Gap:

  • A video method that respects images’ 2D nature (paint coarse-to-fine) and time’s 1D arrow (predict the next frame) at the same time.
  • A robust tokenizer that compresses both space and time without losing motion cues.
  • Training tricks that prepare the model to face and fix its own mistakes during generation.

Real Stakes:

  • Faster, cheaper video creation for education, marketing, games, and storytelling.
  • Better control over length and motion for practical workflows (e.g., extending a clip, matching camera moves, or continuing a scene).
  • A path that plugs into LLM-style infrastructure—useful as we unify text, images, audio, and video.

New Concepts Introduced with Sandwich Pattern:

  1. Visual Autoregressive (VAR) modeling šŸž Hook: Imagine sketching a drawing lightly first, then adding details in layers. 🄬 Concept: VAR generates visuals by predicting them from coarse layers to finer layers, one ā€œscaleā€ at a time. It works by first making a rough version, then refining it with more details, step by step. Without it, the model wastes effort predicting tiny pixels too soon and can’t capture big shapes well. šŸž Anchor: Like building a sandcastle: shape the big mound first, then carve windows and patterns.

  2. Diffusion models (for context) šŸž Hook: Think of cleaning a foggy window little by little to reveal a scene. 🄬 Concept: Diffusion starts with noisy images and repeatedly denoises them to get a clean result. It runs many iterative steps over whole clips. Without it, we wouldn’t have today’s super-clean images/videos, but it’s slow and resource-heavy. šŸž Anchor: It’s like wiping a whiteboard smudged with marker until the picture is clear.

  3. Error propagation šŸž Hook: If the first pancake is burnt, and you stack more on top, the whole stack is unappetizing. 🄬 Concept: Small mistakes early in generation can carry into later frames and grow. Without addressing it, long videos drift or flicker. šŸž Anchor: In a relay race, one bad handoff messes up the whole team’s time.

02 Core Idea

šŸž Top Bread (Hook): You know how a comic artist first blocks out panels (big shapes), then inks details, and finally flips the pages to check the flow? That’s the heart of this paper.

🄬 Filling (The Actual Concept): The "Aha!" is to combine two strengths at once: inside each frame, generate images coarse-to-fine (VAR), and across frames, predict the next frame causally. This respects how space (2D) and time (1D) really work.

  • What it is: VideoAR is an autoregressive video generator that does next-frame prediction across time while doing next-scale prediction within each frame.
  • How it works: (1) A 3D multi-scale tokenizer turns video into compact, discrete tokens that keep spatial and temporal structure. (2) A Transformer predicts each new frame conditioned on the text and past frames, adding details scale by scale. (3) Multi-scale Temporal RoPE tells the model ā€œwhere and whenā€ tokens belong. (4) Cross-Frame Error Correction and Random Frame Mask train the model to handle and fix its own mistakes. (5) Adaptive guidance and re-encoding enable controllable motion and duration extension. (A code sketch of this frame-by-frame, scale-by-scale loop appears right after this list.)
  • Why it matters: Without splitting space (coarse→fine) and time (next frame) this way, models either get slow (diffusion) or drift and blur (flat AR). This design keeps videos sharp, steady, and fast to sample.
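To make the split between time and space concrete, here is a minimal Python sketch of the sampling order described above. The helper names (`model.predict_scale`, `tokenizer.decode`) are hypothetical stand-ins, not the authors' API; the point is only the schedule: causal over frames, coarse-to-fine over scales.

```python
# Minimal sketch of VideoAR-style sampling: causal across frames (time),
# coarse-to-fine across scales (space). `model` and `tokenizer` are
# hypothetical stand-ins for the paper's actual components.

def generate_video(model, tokenizer, text_tokens, num_frames, num_scales):
    frames = []            # decoded frames, in temporal order
    past_tokens = []       # tokens of every previously generated frame

    for t in range(num_frames):          # time: next-frame prediction
        frame_scales = []                # this frame's residual maps, coarse -> fine
        for k in range(num_scales):      # space: next-scale prediction
            # Condition on the text, all earlier frames (never the future),
            # and the coarser scales already predicted for this frame.
            tokens_k = model.predict_scale(
                text=text_tokens,
                past_frames=past_tokens,
                current_scales=frame_scales,
                frame_index=t,
                scale_index=k,
            )
            frame_scales.append(tokens_k)

        past_tokens.append(frame_scales)
        frames.append(tokenizer.decode(frame_scales))   # tokens -> pixels

    return frames
```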

šŸž Bottom Bread (Anchor): Think of making a nature documentary: you outline each shot (coarse), add feathers and leaves (fine), and then play shots in order (next frame). You also rehearse fixing errors so the final cut flows smoothly.

Multiple Analogies:

  • Movie Studio: The storyboard (coarse) comes first, then detailed scenes (fine), then editing scenes in sequence (next frame). Safety checks fix continuity errors.
  • LEGO Builder: Build the base layer (coarse bricks), stack detailed pieces (fine), then add layers one after another (time). If a piece is off, you learn to spot and correct it.
  • Cooking Course: Prep base flavors (coarse), layer spices (fine), serve courses in order (time). Practice catching mistakes early so the later courses still taste great.

Before vs After:

  • Before: AR video tried to predict a flat line of tokens and got tangled; diffusion was clean but slow. Tokenizers often made sequences too long, hurting resolution.
  • After: VideoAR handles images like images (multi-scale) and time like time (next frame), making shorter sequences, faster sampling, and steadier long videos.

Why It Works (Intuition):

  • Space loves hierarchy: big shapes first, details later. Time loves causality: the next moment depends on the past. Marrying these reduces confusion and makes the model’s job well-posed.
  • A compact 3D tokenizer preserves motion patterns so the model doesn’t have to relearn them.
  • Training with controlled noise and masks is like scrimmage: practice under pressure makes real games easier.

Building Blocks (each with Sandwich):

  1. Next-Frame Prediction šŸž Hook: When you guess what happens in the next scene of a movie, you use what you just watched. 🄬 Concept: The model predicts the next frame using past frames and text. Step-by-step, it rolls forward in time. Without it, time order gets fuzzy and future frames don’t align with the story. šŸž Anchor: Like a weather forecast for tomorrow based on today.

  2. 3D Multi-scale Tokenizer šŸž Hook: Cutting a big cake into neat slices makes it easier to share. 🄬 Concept: It compresses a video into small, discrete tokens across time and space, at multiple scales, while staying causal (only using the past). Without it, sequences would be too long and motion details would get lost. šŸž Anchor: It’s like packing a suitcase smartly so everything fits and is easy to unpack later.

  3. Multi-scale Temporal RoPE šŸž Hook: A map grid tells you where you are; a clock tells you when. 🄬 Concept: It encodes position in height, width, and time, plus which scale the token belongs to, so the model knows ā€œwhat goes where and when.ā€ Without it, objects might jump around or lose alignment across frames. šŸž Anchor: Like labeling shelves (top/middle/bottom) and days (Mon/Tue) so you always find things. (A small code sketch of these multi-axis positions follows this list.)

  4. Cross-Frame Error Correction šŸž Hook: Practice makes perfect—especially practicing how to fix mistakes. 🄬 Concept: During training, it injects controlled bit flips (small errors) that grow over time and carry into the next frame, forcing the model to learn to self-correct. Without it, small mistakes snowball in long videos. šŸž Anchor: A choir practices on a windy day so they can stay on pitch during a real outdoor show.

  5. Random Frame Mask šŸž Hook: Don’t lean on training wheels forever. 🄬 Concept: It randomly hides some past frames in attention, so the model can’t over-memorize and must generalize. Without it, the model might cling to distant frames and become brittle. šŸž Anchor: Like practicing basketball without looking at the floor, so your dribble is robust.

  6. Temporal-Spatial Adaptive Classifier-Free Guidance šŸž Hook: Like a video director adjusting how strictly actors must follow the script. 🄬 Concept: It tunes guidance strength across time and scales: higher for stronger details/dynamics, lower for smoother, stable motion. Without it, you can’t easily balance crisp details against steady movement. šŸž Anchor: Turning a music mixer’s knobs to balance vocals (clarity) and background (smoothness).
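For building block 3, a short PyTorch sketch shows one common way to realize multi-axis rotary positions: each token carries a (t, h, w, scale) index, the head dimension is split into four groups, and each group is rotated by its own axis. This is the generic multi-axis RoPE recipe, assumed here for illustration; the paper's exact parameterization may differ.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for one axis: (num_tokens,) -> (num_tokens, dim/2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def multi_axis_rope(x, positions):
    """
    x:         (num_tokens, head_dim) query or key vectors.
    positions: (num_tokens, 4) integer (t, h, w, scale) index per token.
    Splits head_dim into 4 groups and rotates each group by one axis; applied
    to both queries and keys, attention then depends on relative offsets in
    time, height, width, and scale.
    """
    num_tokens, head_dim = x.shape
    group = head_dim // 4                              # one group per axis
    out = []
    for axis in range(4):
        xa = x[:, axis * group:(axis + 1) * group]
        ang = rope_angles(positions[:, axis], group)   # (num_tokens, group/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = xa[:, 0::2], xa[:, 1::2]
        out.append(torch.cat([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)

# Toy usage: 10 tokens, head_dim 64, random (t, h, w, scale) indices.
q = torch.randn(10, 64)
pos = torch.randint(0, 8, (10, 4))
q_rot = multi_axis_rope(q, pos)
```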

03 Methodology

At a high level: Text Prompt + (optional starting frames) → 3D Multi-scale Tokenizer → Autoregressive Transformer (Time: next frame; Space: next scale) with Multi-scale Temporal RoPE → Training with Cross-Frame Error Correction + Random Frame Mask → Adaptive Guidance at sampling → Decoded video.

Step-by-step, like a recipe:

  1. Input and Tokenization
  • What happens: A video clip (or an empty start for text-to-video) and a text prompt enter a causal 3D tokenizer. The tokenizer compresses T frames into a smaller set of latent tokens arranged over time and over multiple spatial scales (e.g., 5 scales, 8Ɨ8 at the coarsest). Each frame is quantized independently (temporal-independent quantization), staying causal so no future info leaks in.
  • Why this step exists: Raw pixels are huge and redundant; compressing them preserves motion patterns and details while making sequences short enough for efficient AR modeling.
  • Example: Suppose we want 4 seconds at 8 FPS = 32 frames. The tokenizer might turn each frame into 5 residual maps from coarse to fine, each at an 8Ɨ8 grid, so each frame is represented compactly.
  2. Building the Context for Generation
  • What happens: The Transformer sees text tokens plus video tokens. For the very first scale of the very first frame, a special SOS token seeds generation. For later frames, the first scale receives a summary of the previous frame’s features (accumulated features), injecting temporal context before adding details.
  • Why this step exists: It’s how the model remembers what just happened, so the new frame continues the story.
  • Example: Frame 1 starts from SOS; Frame 2’s coarse scale starts from Frame 1’s final scale features.
  3. Multi-scale Temporal RoPE
  • What happens: Every token gets a position tag in time (t), height (h), width (w), and a scale tag, so attention knows how tokens relate across space-time and across coarse→fine.
  • Why this step exists: Without clear ā€œwhere/when/which-scale,ā€ attention can mix unrelated spots, causing jitter or misplaced details.
  • Example: A dog’s ear at (h=5,w=6) in frame t gets embeddings consistent across frames, so the ear doesn’t jump.
  4. Autoregressive Prediction Order
  • What happens: The model loops over time t=1…T. For each frame t, it loops over scales k=1…K from coarse to fine. At each (t,k), it predicts the residual tokens conditioned on text, past frames’ tokens (causally), and already-predicted coarser scales for the current frame. Causal masking ensures it never peeks into the future.
  • Why this step exists: Time wants next-frame causality; space wants coarse→fine detail layering. This schedule respects both realities.
  • Example with data: For frame 10, the model first predicts scale 1 (8Ɨ8 rough sketch), then scale 2, …, up to scale 5 for fine textures.
  5. Temporal-Consistency Enhancements: A) Cross-Frame Error Correction (training time)
  • What happens: The training pipeline randomly flips bits in token labels with a flip rate that grows over time, and carries the end-of-frame noise into the next frame’s start. The targets are self-corrected (re-quantized) so the model learns to recover.
  • Why this step exists: It simulates the real generation setting where mistakes can happen and flow forward.
  • Example: If frame 7 ended with flip rate 0.18, frame 8 starts at least that noisy at its first scale, so the model must stabilize quickly.
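A minimal sketch of the time-ramped corruption in step 5A: the flip rate grows over frames and the next frame inherits at least the level the previous frame ended with. The rate constants are made up for illustration, and the re-quantized, self-corrected targets mentioned above are omitted.

```python
import torch

def corrupt_tokens(tokens, flip_rate, codebook_size):
    """Replace a random fraction `flip_rate` of token ids with random codes."""
    mask = torch.rand_like(tokens, dtype=torch.float) < flip_rate
    random_codes = torch.randint_like(tokens, codebook_size)
    return torch.where(mask, random_codes, tokens)

def cross_frame_error_injection(frame_tokens, codebook_size,
                                base_rate=0.02, growth=0.02):
    """
    frame_tokens: list over frames; each entry is a list over scales of
    LongTensors of token ids. base_rate/growth are illustrative numbers,
    not values from the paper.
    """
    corrupted = []
    inherited = 0.0                      # noise level carried from the previous frame
    for t, scales in enumerate(frame_tokens):
        # Flip rate ramps up over time and never drops below what the
        # previous frame ended with, so errors are inherited across frames.
        rate = max(base_rate + growth * t, inherited)
        corrupted.append([corrupt_tokens(s, rate, codebook_size) for s in scales])
        inherited = rate
    return corrupted

# Toy data: 4 frames, each with 3 scales of 64 token ids from a 4096-code book.
frames = [[torch.randint(0, 4096, (64,)) for _ in range(3)] for _ in range(4)]
noisy = cross_frame_error_injection(frames, codebook_size=4096)
```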

B) Random Frame Mask (training time)

  • What happens: Within a sliding attention window, some past frames are randomly dropped from attention. Text tokens always remain.
  • Why this step exists: It prevents over-dependence on far history and reduces overfitting, encouraging robust short-horizon reasoning.
  • Example: With window size 8, at t=20 the model may see {tāˆ’1, tāˆ’3, tāˆ’4, tāˆ’7} depending on random masking.
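The Random Frame Mask (part B of step 5) can be pictured as a frame-level attention mask. The sketch below uses an illustrative drop probability and window size and leaves out text tokens, which would always remain visible.

```python
import torch

def random_frame_mask(num_frames, window=8, drop_prob=0.3):
    """
    Frame t may attend to itself and to past frames inside a sliding window,
    but each past frame is randomly dropped with probability `drop_prob`
    (illustrative value). Returns a (num_frames, num_frames) boolean matrix;
    True means "may attend".
    """
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for t in range(num_frames):
        mask[t, t] = True                          # always see the current frame
        start = max(0, t - window)
        for s in range(start, t):                  # causal sliding window
            if torch.rand(1).item() > drop_prob:   # keep this past frame?
                mask[t, s] = True
    return mask

# Example: at t = 20 with window 8, the surviving past frames might be
# something like {13, 16, 17, 19}, plus frame 20 itself.
print(random_frame_mask(24)[20].nonzero().flatten().tolist())
```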
  6. Multi-Stage Training Pipeline
  • What happens: Train in stages: (I) joint pretraining on images and low-res videos to learn basic spatial-temporal skills, (II) higher-resolution data for sharp details and smoother motion, (III) fine-tuning on longer clips for long-range dynamics.
  • Why this step exists: It’s easier to learn fundamentals first, then scale up in resolution and duration without instability.
  • Example: Start at 128px, move to 256px, then 384Ɨ672 with longer frame counts.
  7. Sampling with Temporal-Spatial Adaptive CFG
  • What happens: At inference, the model decodes causally with cached states for speed. Guidance strength is tuned over time and scales: higher guidance can boost detail/motion; lower guidance smooths transitions and adds diversity.
  • Why this step exists: Different prompts and scenes need different balances of crispness vs. steadiness.
  • Example: On an action clip, start with stronger CFG at early scales for vivid motion; for a calm landscape, keep CFG lower to avoid flicker. (A guidance-schedule sketch follows this list.)
  8. Duration Extension (Continuation)
  • What happens: To keep generating, the model re-encodes recent frames, uses them as new context, and continues predicting next frames.
  • Why this step exists: It allows variable-length videos beyond the initial window.
  • Example: Given a 4-second clip, extend by another 4 seconds repeatedly to surpass 20 seconds.
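Step 7's guidance combines conditional and unconditional predictions with the standard classifier-free-guidance formula; only the idea of varying the strength over frames and scales is specific to VideoAR. The linear schedule and constants below are an illustrative guess, not the paper's formula.

```python
import torch

def adaptive_cfg_logits(cond_logits, uncond_logits, frame_idx, scale_idx,
                        base=4.0, time_decay=0.05, scale_decay=0.3):
    """
    Classifier-free guidance whose strength depends on where we are in time
    and in the coarse-to-fine scale stack. The decay schedule is a made-up
    example; the combination rule itself is the standard CFG formula.
    """
    strength = base - time_decay * frame_idx - scale_decay * scale_idx
    strength = max(strength, 1.0)      # never go below unguided sampling
    return uncond_logits + strength * (cond_logits - uncond_logits)

# Usage sketch: early frames and coarse scales get punchier guidance (vivid
# motion, strong text adherence); later fine scales get gentler guidance
# (smoother transitions, less flicker).
cond = torch.randn(1, 4096)            # logits with text conditioning
uncond = torch.randn(1, 4096)          # logits with the text dropped
guided = adaptive_cfg_logits(cond, uncond, frame_idx=0, scale_idx=0)
```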

The Secret Sauce:

  • Respect the physics of space and time: paint within a frame coarse→fine; move across frames causally.
  • Train the model to thrive under imperfection (controlled noise + masks) so it’s stable in the wild.
  • Use a compact 3D tokenizer so the Transformer focuses on meaning and motion, not raw pixels.

Sandwich callouts inside the recipe:

  • 3D Multi-scale Tokenizer šŸž Hook: Big puzzles are easier when you sort pieces by color and size. 🄬 Concept: Compresses video into multi-scale tokens without peeking at the future. Without it, training and sampling would be painfully long. šŸž Anchor: Like packing your backpack so the stuff you need first is on top. (A quantization sketch follows these callouts.)

  • Cross-Frame Error Correction šŸž Hook: Train in the rain so game day’s drizzle won’t scare you. 🄬 Concept: Inject time-ramped noise and inherit errors across frames to practice recovery. Without it, long videos go off-track. šŸž Anchor: Practicing piano with a metronome that sometimes speeds up a little, so you learn to adapt.

  • Random Frame Mask šŸž Hook: Sometimes you solve a maze faster when you can’t see the whole map. 🄬 Concept: Hide some past frames to encourage robust local reasoning. Without it, the model memorizes and becomes fragile. šŸž Anchor: Studying with some hints covered so you truly understand, not just copy.
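To make the tokenizer callout more tangible, here is a sketch of VAR-style multi-scale residual quantization for a single frame's latent map. The scale sizes, codebook shape, and pooling/upsampling choices are illustrative assumptions; the causal 3D encoder and the temporal-independent quantization from step 1 are assumed to happen upstream.

```python
import torch
import torch.nn.functional as F

def multiscale_residual_quantize(latent, codebook, scales=(1, 2, 4, 6, 8)):
    """
    latent:   (C, H, W) continuous feature map for one frame.
    codebook: (K, C) learned code vectors.
    scales:   spatial grid sizes from coarse to fine (illustrative values).
    Returns token ids per scale plus the running reconstruction.
    """
    C, H, W = latent.shape
    residual = latent.clone()
    recon = torch.zeros_like(latent)
    token_ids = []
    for s in scales:
        # Look at what is still missing, at a coarse s x s resolution.
        down = F.adaptive_avg_pool2d(residual.unsqueeze(0), (s, s))[0]       # (C, s, s)
        flat = down.permute(1, 2, 0).reshape(-1, C)                          # (s*s, C)
        # Nearest codebook entry per spatial cell (squared distances).
        d2 = (flat.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(dim=-1)  # (s*s, K)
        ids = d2.argmin(dim=-1)
        token_ids.append(ids)
        quant = codebook[ids].reshape(s, s, C).permute(2, 0, 1)              # (C, s, s)
        # Add the coarse approximation back at full resolution; only what
        # remains unexplained is passed on to the finer scales.
        up = F.interpolate(quant.unsqueeze(0), size=(H, W), mode="bilinear",
                           align_corners=False)[0]
        recon = recon + up
        residual = latent - recon
    return token_ids, recon

# Toy usage with a random latent and codebook.
latent = torch.randn(16, 32, 32)       # (C, H, W)
codebook = torch.randn(4096, 16)       # 4096 codes of dimension C
ids_per_scale, recon = multiscale_residual_quantize(latent, codebook)
```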

04 Experiments & Results

The Test: What did they measure and why?

  • Reconstruction quality (rFVD): Measures how well the tokenizer can rebuild original videos. A lower rFVD means the compressed tokens kept important details.
  • Generation quality (gFVD): Measures realism and temporal coherence of generated videos. Lower is better; think of it like a teacher’s stricter grading for overall video believability.
  • VBench: A rich suite of perceptual and temporal metrics, including semantic alignment (does the video match the text?), motion smoothness, background/subject consistency, and aesthetics.

šŸž Sandwich for FVD/VBench šŸž Hook: Imagine a talent show with judges scoring singing, dancing, and stage presence. 🄬 Concept: FVD/gFVD score how believable and steady a video is; VBench scores many dimensions like semantics and aesthetics. Without these, we can’t fairly compare different models. šŸž Anchor: It’s like getting an overall grade plus sub-scores in a report card.

The Competition:

  • Autoregressive baselines: MAGVIT-v2-AR, PAR-4Ɨ, OmniTokenizer/TATS-based systems.
  • Diffusion-based leaders: Step-Video, CogVideo, VideoCrafter, Hunyuan-Video, Kling, Gen-3.

The Scoreboard with Context:

  • Tokenizer (UCF-101, rFVD): VideoAR-L’s tokenizer uses aggressive 16Ɨ spatial compression with a compact 5Ɨ8Ɨ8 token grid per frame, achieving rFVD ā‰ˆ 61, on par with MAGVIT’s 58 but with 4Ɨ fewer tokens. In school terms: same A-level comprehension with a much shorter answer sheet.
  • UCF-101 generation (gFVD): VideoAR-XL hits 88.6 and VideoAR-L 90.3 versus the previous best AR model PAR-4Ɨ at 99.5. That’s like moving from a solid B to an Aāˆ’ while writing 10Ɨ fewer steps—faster and cleaner.
  • Speed: VideoAR-L generates with just 30 decoding steps (over 10Ɨ fewer) and around 0.86 seconds per video in the test setup—about 13Ɨ faster than PAR-4Ɨ. That’s the difference between a quick sketch and a long painting session—yet quality improves.
  • VBench (real-world text-to-video): VideoAR-4B scores 81.74 overall, competitive with much larger models (e.g., ~13–30B parameters). It sets a new high in Semantic Score (77.15), signaling outstanding text-faithfulness, while staying strong on aesthetics and consistency. Think: a smaller orchestra playing as beautifully as a much bigger one—and following the conductor’s (text prompt’s) notes precisely.

Surprising Findings:

  • Scaling works: Enlarging the Transformer backbone clearly boosts quality, confirming AR’s scalability in video.
  • Guidance scheduling matters a lot: Adjusting guidance over time and scales gives an easy control knob for either punchier dynamics or calmer stability depending on the prompt.
  • Continuation is natural: Because the model is next-frame by design, image-to-video and video extension feel like native tasks—no special finetuning needed.

Ablations (what changed what):

  • Multi-scale Temporal RoPE alone improves FVD (e.g., from ~96.0 to ~94.95), showing better spatial-temporal alignment.
  • Time-dependent corruption further improves (to ~93.57), simulating real inference conditions.
  • Adding error inheritance yields the full benefit (~92.50), proving cross-frame correction is key.
  • Random Frame Mask at the 256px stage improves VBench overall (76.22→77.00), indicating better long-term robustness.

Takeaway: VideoAR doesn’t just edge out prior AR methods; it redefines the speed–quality tradeoff and approaches diffusion-level performance with far fewer steps.

05 Discussion & Limitations

Limitations (be specific):

  • Resolution and FPS: Current public setting is about 384Ɨ672 at 8 FPS—fine for demos, not yet for cinema or broadcast (e.g., 720p/1080p at 24–30 FPS). Main causes are sequence length limits and the cost of full VAR attention.
  • High-dynamic motion drift: Fast or complex motion (e.g., intricate human actions) can still drift over long horizons, a classic AR challenge despite new corrections.
  • Compute during training: While inference is fast, training large AR Transformers with a full VAR mask and long contexts remains resource-intensive.
  • Sequence length at training time: Early models were trained with modest contexts, so ultra-long temporal coherence (minutes) is not yet fully explored.

Required Resources:

  • A strong 3D tokenizer pretrained from a solid image tokenizer baseline.
  • Multi-GPU training with mixed precision and memory optimizations (e.g., gradient checkpointing) to handle long sequences and multi-scale tokens.
  • Large, diverse video datasets for progressive staging from low to higher resolution and longer durations.

When NOT to Use:

  • If you need very high resolution and high FPS today (e.g., 4K@60fps), diffusion upscaling pipelines may currently be more practical.
  • If you demand guaranteed minute-long temporal consistency with complex human choreography right now, specialized long-context training or post-processing may be needed.
  • If interactive latency is not an issue and maximum single-sample fidelity is the only goal, diffusion might still be preferred.

Open Questions:

  • How far can sequence length scale before we need new sparse or chunked attention for AR video?
  • Can reinforcement learning or iterative rollouts further tame long-horizon drift?
  • What’s the best joint curriculum for images and videos to reach broadcast-quality resolution and FPS?
  • How to blend AR video with audio, 3D geometry, or physics constraints for even more realism?
  • Can we unify continuation, editing, and multi-shot planning in one AR framework with reliable scene memory?

06 Conclusion & Future Work

Three-Sentence Summary:

  • VideoAR treats time and space the way they want to be treated: predict the next frame in time, and paint each frame from coarse to fine.
  • With a compact 3D tokenizer, Multi-scale Temporal RoPE, and training tricks like Cross-Frame Error Correction and Random Frame Mask, it generates videos that are sharp, steady, and much faster to sample.
  • It reaches new AR state-of-the-art scores on UCF-101 and competitive VBench results with far fewer steps, narrowing the gap with diffusion.

Main Achievement:

  • A practical, scalable AR blueprint for video that couples next-frame and next-scale prediction, delivering strong quality and speed while staying temporally consistent.

Future Directions:

  • Scale sequence length and explore sparse attention to unlock higher resolution and FPS.
  • Add iterative rollouts and reinforcement learning to further reduce drift in complex motion.
  • Strengthen multi-modality (audio, depth, 3D cues) and editing/continuation controls for production workflows.

Why Remember This:

  • VideoAR shows that autoregression—long the hero of language—can power video too, if we respect the different rules of space and time. By training to fix its own mistakes and using the right positional clues, it becomes fast, controllable, and steady. This is a promising foundation for everyday video creation tools that are both efficient and high quality.

Practical Applications

  • Text-to-video storyboarding for films, ads, and educational explainers with controllable motion and length.
  • Image-to-video animation of product shots or illustrations, adding camera moves and subject motion.
  • Video continuation to extend short clips into longer scenes while keeping subjects consistent.
  • Rapid prototyping for game cinematics and cutscenes with fast iteration cycles.
  • Marketing content generation where quick versions are needed for A/B testing across platforms.
  • Science and classroom demos that visualize processes (e.g., plant growth, physics setups) from text prompts.
  • Social media content creation with prompt-driven variations for different audiences.
  • Previsualization of camera trajectories and blocking for live-action shoots.
  • Data augmentation for training downstream vision models with controllable motion patterns.
  • Accessible creative tools that run on modest hardware by leveraging faster AR inference.
Tags: autoregressive video generation, visual autoregression, next-frame prediction, multi-scale tokenizer, temporal RoPE, cross-frame error correction, random frame mask, classifier-free guidance, FVD, VBench, causal attention, spatio-temporal modeling, video continuation, tokenized video representation, efficient video synthesis