Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Beginner
Jinghan Li, Yang Jin, Hao Jiang et al. Ā· 12/24/2025
arXiv Ā· PDF

Key Summary

  • This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.
  • The key idea is to separate (decouple) two jobs: learning good meanings (semantics) and drawing the next frame, so each job can be done better.
  • A context-isolated autoregressive predictor uses past frames to predict a rich summary (z_t) of what the next frame should be like.
  • A conditioned flow-matching decoder then uses that summary to generate the next frame’s latent features with high quality and variety.
  • Special attention masks keep time order and stop information from leaking between frames or noisy targets, which protects semantic learning.
  • An EMA-updated reference encoder provides a stable target so the predicted summary z_t learns the right semantics without being pulled into decoding.
  • NExT-Vid is pretrained on a massive mixed dataset of 2.4M hours of video plus 1.28M images, and evaluated using an attentive probe with the encoder frozen.
  • On K400, IN1K, SSv2, and Diving48, NExT-Vid beats previous generative pretraining methods, with the ViT-G model reaching 83.1% on K400 and 69.5% on SSv2.
  • Ablations show masking is crucial, the small decoder is enough, and flow-matching with spatially aligned conditioning improves both generation quality and representations.
  • Limitations include reliance on masking and a trade-off between great generation and the toughest objectives that yield the strongest representations.

Why This Research Matters

Videos are everywhere: education, sports, safety, entertainment, and daily communication. A model that truly understands what happens over time can better recognize actions, summarize events, and find key moments without needing tons of labeled data. By separating understanding from drawing, NExT-Vid learns cleaner, stronger features that transfer well to many tasks. The flow-matching decoder gives high-quality, diverse generations that make training more effective. This approach can improve video search, content moderation, coaching tools, and robot perception, making these systems smarter and more reliable. In the long run, it lays groundwork for AI that can reason about causes and effects in the physical world.

Detailed Explanation

01 Background & Problem Definition

You know how watching a movie is different from looking at a single photo? In movies, what happens next depends on what just happened, and timing matters a lot. Before this research, many vision models learned from images using a style called masked modeling (inspired by BERT for language). These models covered up parts of an image and asked the AI to fill in the missing pixels. This worked well for images but didn’t fully use the magic of time in videos. When people tried to apply the same idea to videos, the model often treated frames like a stack of separate pictures and missed the motion and cause-and-effect between frames.

In language, something different happened. Autoregressive models like GPT learned to predict the next word after seeing past words, which is perfect for sequences. That idea, "what comes next depends on what came before," helped language models understand stories, logic, and context. Naturally, researchers wondered: can we do something similar for videos, which are also sequences over time?

Some early tries did bring autoregression into vision. iGPT predicted pixels one after another, and later works tried predicting patch tokens or compressed frame tokens. But two big problems showed up. First, the model’s understanding of meaning (semantics) often ended up buried deep inside the model’s middle layers, making it hard to extract clean features for other tasks. Second, when the goal was to directly draw or regress patches, the model tended to average things out in complex scenes, leading to blurry, low-diversity outputs—and weak semantics.

Another tricky thing about videos is redundancy. Neighboring frames can be very similar. A lazy generator can just copy most of the previous frame to make the next one and still look okay. That’s great for making a plausible video but bad for learning rich meaning. So straight next-frame generation isn’t automatically a good lesson—unless we make the task harder in the right way.

The gap this paper targets: we needed a training setup that truly learns time-based meaning from videos while still using the strong generative training signals that teach models a lot about the world. The missing piece was a way to separate the job of understanding (semantics) from the job of drawing (decoding), so that learning meaning doesn’t get overshadowed by the tricks a generator can learn.

The proposed method, NExT-Vid, fills this gap by doing masked next-frame prediction (so it can’t just copy), and by cleanly decoupling the semantic pathway from the decoding pathway. It uses a context-isolated autoregressive predictor to forecast a high-level representation of the next frame (not the pixels), and a conditioned flow-matching decoder to generate the next frame’s latent features from that high-level hint. Flow-matching is like a smart cleaner that turns noisy inputs into clean samples through gradual denoising; it tends to produce high-quality, diverse generations—ideal for a strong learning signal.

Why should anyone care? Strong video representations power things we see and use daily: safer content moderation, better sports highlights, smarter home robots, improved video search, and more accessible educational videos. If a model truly understands what’s happening across time—who passed the ball, which way the car turned, or how someone mixed chemicals—it can help with tasks that people do, faster and more reliably. NExT-Vid shows that when you train on "what happens next" and you keep the brain that understands separated from the hand that draws, you get features that work better across many video and image tasks.

02 Core Idea

Aha! Moment in one sentence: Separate the brain (semantic understanding) from the hand (image drawing) during next-frame training, and use the hand only as a high-quality helper, so the brain learns clearer, stronger video meanings.

Multiple analogies:

  • Librarian and illustrator: The librarian (predictor) decides what the next page should say; the illustrator (decoder) draws the pictures. Don’t let the illustrator rewrite the story—the librarian must stay in charge of meaning.
  • Coach and player: The coach (predictor) plans the next move; the player (decoder) executes it with style. If the player also changes the plan mid-game, strategy gets messy.
  • GPS and driver: The GPS (predictor) gives a route summary; the driver (decoder) follows it smoothly. Keep the route plan separate from steering details.

Before vs After:

  • Before: Models either masked frames without truly using time, or they did autoregression but mixed semantics and drawing. Semantics got buried; generations got blurry; representations were hard to use.
  • After: NExT-Vid masks the past to make prediction challenging, cleanly separates semantic prediction from drawing with a context-isolated predictor, and uses a flow-matching decoder that generates with diversity and quality. Representations bubble up cleanly and help many tasks.

Why it works (intuition):

  • Making the task tough in the right way (masking) forces the model to learn real motion and cause-and-effect, not just copy-paste.
  • Keeping the predictor’s context frozen and separate prevents the decoder’s tricks from reshaping the semantic space, so meanings stay crisp and extractable.
  • Flow-matching’s gradual denoising supplies a rich, stable learning signal, encouraging diverse and realistic next-frame latents.

Building blocks, each explained with the Sandwich pattern:

šŸž Hook: You know how when you tell a story, you think about what comes next based on what you already said? 🄬 The Concept: Autoregressive Generative Pretraining means learning by predicting the next thing in a sequence from the past. How it works: (1) Read past items, (2) Build an internal summary, (3) Predict the next item, (4) Compare with truth and learn. Why it matters: Without it, the model won’t learn how events unfold over time. šŸž Anchor: Predicting the next video frame after watching earlier frames.

šŸž Hook: Imagine a comic where some panels have holes punched out. 🄬 The Concept: Masked Next-Frame Prediction hides pieces of the past and asks the model to guess the next frame anyway. How it works: (1) Mask the same spots across several past frames, (2) Encode what remains, (3) Predict a representation of the next frame, (4) Decode to a target and learn. Why it matters: Without masking, the model could just copy, learning weak meanings. šŸž Anchor: If the person’s hand is covered in prior frames, the model must infer motion from other clues.

šŸž Hook: Think of a student who can use notes but isn’t allowed to rewrite them. 🄬 The Concept: Context-Isolated Predictor is a module that reads the encoder’s frozen context and predicts the next-frame representation without changing the context. How it works: (1) The encoder produces context tokens from masked frames, (2) The predictor uses cross-attention with these tokens as keys/values, (3) It outputs z_t for the next frame, (4) It can’t alter the encoder outputs. Why it matters: If the predictor could modify context, meanings would drift and get tangled with decoding. šŸž Anchor: The predictor points where the ball will go next without editing the game replay.

šŸž Hook: Sorting toys into boxes makes cleanup faster. 🄬 The Concept: Semantic Representation Decoupling separates meaning-making from image drawing. How it works: (1) Predict z_t for meaning, (2) Use a different decoder to generate targets, (3) Align z_t to a stable reference so it stays semantic. Why it matters: If you mix the two jobs, you risk blurry meanings or overfit drawing tricks. šŸž Anchor: The outline (z_t) guides the painting, but the final brushstrokes don’t change the outline.

šŸž Hook: A chef follows a recipe to turn raw ingredients into a dish. 🄬 The Concept: Conditioned Flow-Matching Decoder is a denoising generator that turns noise into a clean next-frame latent, guided by z_t. How it works: (1) Start with noisy latent, (2) Concatenate z_t and noisy target (aligned per patch), (3) Predict a cleaning velocity step-by-step, (4) End with a high-quality latent. Why it matters: Without it, outputs get blurry and the learning signal is weaker. šŸž Anchor: The recipe (z_t) guides cooking; each step cleans up the dish until it’s ready.

šŸž Hook: Wiping fog off a window lets you see the view. 🄬 The Concept: Denoising Models clean noisy data in small steps to recover a real sample. How it works: (1) Add noise to a target, (2) Train a model to remove it piece by piece, (3) Repeat until clear. Why it matters: One-shot guessing is hard and often dull; stepwise cleanup enables quality and diversity. šŸž Anchor: Clearing static from a radio signal to hear the song.

03 Methodology

At a high level: Input video → Encoder with masked, time-aware attention → Autoregressive Predictor creates next-frame representation z_t → Reference Encoder provides stable alignment target → Conditioned Flow-Matching Decoder denoises to the next-frame latent → Output.

Step-by-step, with why each step exists and a concrete example:

  1. Input preparation and masking
  • What happens: Split the video into 3D patches (space and time). Apply temporally consistent masking (same spatial spots masked across nearby frames). Images are treated as 1-frame videos for unification.
  • Why this step exists: Without masking, videos are too redundant; the model might copy the previous frame. Masking raises the difficulty, pushing the model to learn real motion and semantics.
  • Example: In a pouring-water clip, mask the pitcher’s spout area across several past frames so the model must reason about motion from the arm and cup.
  2. Encoder with frame-wise causal attention
  • What happens: A ViT encoder with 3D Rotary Position Embedding reads the masked video. Frame-wise causal attention ensures each token can only attend to its own frame and earlier frames, not the future.
  • Why this step exists: Time must flow forward. If the encoder saw the future, the task would be too easy and unrealistic.
  • Example: Frame 5 tokens can look at frames 1–5, never 6.

šŸž Hook: Imagine reading a diary where you can only read today and previous entries. 🄬 The Concept: Frame-Wise Causal Attention limits attention so tokens see only their frame and the past. How it works: (1) Build attention masks per frame index, (2) Forbid looking ahead, (3) Encode with temporal order respected. Why it matters: Without it, the model could peek at the future and cheat. šŸž Anchor: Reading page 5 without glimpsing page 6.

  3. Reference encoder with EMA for alignment
  • What happens: A momentum (EMA) copy of the encoder reads the full, unmasked video. Its outputs serve as stable targets for semantic alignment.
  • Why this step exists: EMA smooths training noise and gives a reliable representation of "what the next frame should mean."
  • Example: For frame t, the reference encoder processes frames 0..t unmasked and produces c′_t.

šŸž Hook: A slowly-updated teacher keeps the class calm. 🄬 The Concept: EMA Reference Encoder averages weights over time to provide stable targets. How it works: (1) Keep a shadow copy updated by EMA, (2) Feed it the full sequence, (3) Use its features as alignment targets with stop-gradient. Why it matters: Without a stable teacher, alignment can wobble and crash training. šŸž Anchor: A mentor’s steady advice anchors a student’s learning.

  4. Context-isolated autoregressive predictor
  • What happens: Using cross-attention blocks and learnable query tokens, the predictor reads the encoder’s frozen context (as keys/values) under an autoregressive mask to output z_t, the next-frame representation.
  • Why this step exists: This keeps the context unchanged, so the semantic space isn’t distorted by decoding needs. The autoregressive mask ensures the predictor only uses information up to time tāˆ’1.
  • Example: The predictor infers that the hand will tilt more and water will flow faster in the next frame.

šŸž Hook: A student can look at the notes but cannot write on them. 🄬 The Concept: Autoregressive Mask + Cross-Attention Queries let the predictor attend to past context without altering it. How it works: (1) Use learnable queries as placeholders for the missing frame, (2) Attend to frozen context K/V from the encoder, (3) Block access to future frames. Why it matters: Without this, context could be rewritten or future leaked, breaking semantics. šŸž Anchor: Asking questions about the past notes to guess the next line.

  5. Representation alignment loss
  • What happens: Align z_t to c′_t (from the EMA reference) with an MSE loss using stop-gradient on c′_t.
  • Why this step exists: It nudges z_t to carry the right semantics about the next frame, while ensuring the predictor doesn’t pull the encoder toward the decoder’s space.
  • Example: If c′_t encodes ā€œwater now hitting the cup rim,ā€ z_t learns to encode that concept too.

šŸž Hook: Calibrating a compass to a reliable map keeps you on course. 🄬 The Concept: Representation Alignment makes the predicted z_t match a stable semantic reference. How it works: (1) Compute c′_t from the EMA teacher, (2) Minimize distance between z_t and stop-grad(c′_t), (3) Keep the student honest. Why it matters: Without alignment, z_t could drift toward whatever is easiest for the decoder to draw. šŸž Anchor: Matching your sketch to a blueprint.

  6. Conditioned flow-matching decoder (DiT)
  • What happens: The decoder receives concatenated inputs: (a) noisy next-frame latent from a VAE and (b) the condition z_t, aligned per spatial position. It predicts the denoising velocity at many time steps, gradually cleaning the noise to produce the next-frame latent.
  • Why this step exists: Flow-matching supplies a rich, stable training signal and supports high-quality, diverse generations. Spatially aligned concatenation ensures each patch is denoised with the matching local condition.
  • Example: The patch at the cup rim uses the local z_t patch that predicts where the water splash should appear.

šŸž Hook: Cleaning a picture with a soft eraser, guided by a sketch of what it should look like. 🄬 The Concept: VAE Latent Targets + Spatially Aligned Concatenation let the decoder denoise efficiently and precisely. How it works: (1) Compress frames with a VAE to latents, (2) Concatenate z_t and noisy latents channel-wise with spatial match, (3) Predict velocity to remove noise step-by-step. Why it matters: Pixels are heavy and regression is blurry; latents plus denoising are efficient and sharp. šŸž Anchor: Using a coloring outline and cleaning marks until a clear image pops out.

šŸž Hook: Rooms in a hotel shouldn’t hear each other’s conversations. 🄬 The Concept: Frame-Isolated Attention in the decoder prevents attention across frames during generation. How it works: (1) Allow full attention within a frame, (2) Block attention between different frames, (3) Generate each frame independently given z_t. Why it matters: Without isolation, noisy targets could leak information and make the task too easy, hurting semantics. šŸž Anchor: Each room (frame) has its own walls.

Secret sauce:

  • Decoupling semantics (z_t) from decoding prevents the generator’s tricks from polluting the representation space.
  • Masking plus autoregression makes the task meaningfully hard, focusing learning on motion and cause-and-effect.
  • Flow-matching with spatially aligned conditioning produces both better generations and better features.

Putting it together as a recipe: Input → Masked ViT Encoder (causal) → z_t via Context-Isolated AR Predictor → Align z_t to EMA-Reference c′_t → Conditioned Flow-Matching DiT (with VAE latents, spatial concat, frame-isolated attention) → Next-frame latent (training signal).
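
Putting the recipe in code form, one heavily simplified training step might look like the sketch below. Every name here (apply_tube_mask, the encoder/predictor/decoder/vae modules, the loss weight lam) is a hypothetical placeholder for illustration, not the authors’ code, and the real pipeline handles multiple future frames, attention masks, and schedules that are omitted.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, ema_encoder, predictor, flow_decoder, vae, video, lam=1.0):
    """One illustrative NExT-Vid-style step: align z_t, then denoise the next-frame latent.

    Assumes `video` is (B, T, C, H, W) and that the hypothetical modules return
    token tensors of shape (B, N, D); all module names are placeholders.
    """
    # 1) Hide the same spatial spots across past frames, encode with causal attention.
    masked_video, mask = apply_tube_mask(video)              # hypothetical helper
    context = encoder(masked_video, mask)                    # context for frames 0..t-1

    # 2) Predict the next-frame summary z_t without touching the context.
    z_t = predictor(context)

    # 3) Stable semantic target from the EMA teacher on the unmasked clip (stop-grad).
    with torch.no_grad():
        c_ref = ema_encoder(video)
    loss_align = F.mse_loss(z_t, c_ref)

    # 4) Flow-matching on the next frame's VAE latent, conditioned on z_t.
    with torch.no_grad():
        target = vae.encode(video[:, -1])                    # (B, N, latent_dim) tokens
    noise = torch.randn_like(target)
    tau = torch.rand(target.shape[0], 1, 1, device=video.device)
    x_tau = (1.0 - tau) * noise + tau * target
    v_pred = flow_decoder(x_tau, tau, cond=z_t)              # spatial concat happens inside
    loss_flow = F.mse_loss(v_pred, target - noise)

    return loss_align + lam * loss_flow
```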

04 Experiments & Results

The test: The authors evaluate how good the learned representations are for classification. They freeze the encoder and train only a light "attentive probe" (a single cross-attention pooling layer plus a linear classifier) on four benchmarks: ImageNet-1K (objects/scenes), Kinetics-400 (actions + scenes), Something-Something-V2 (fine-grained human actions), and Diving48 (specialized motion). They report top-1 accuracy, which is like asking, "Did the model’s first guess match the correct label?"
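
A minimal sketch of what such an attentive probe could look like (assumed sizes, a single learnable query, and a plain linear head); only this module is trained while the backbone stays frozen.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """A single cross-attention pooling layer plus a linear classifier."""

    def __init__(self, dim=1024, num_heads=16, num_classes=400):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (B, N, dim) frozen features from the pretrained video encoder.
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)    # attentive pooling over all tokens
        return self.head(pooled.squeeze(1))         # (B, num_classes) logits

# Toy usage: 2 clips, 1568 frozen tokens of width 1024 -> Kinetics-400-sized logits.
probe = AttentiveProbe()
print(probe(torch.randn(2, 1568, 1024)).shape)  # torch.Size([2, 400])
```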

The competition: They compare with both generative and discriminative pretraining. Generative baselines include MAE/VideoMAE variants, iGPT, and Toto (an autoregressive video pretrainer). Discriminative baselines include DINOv2, SigLIP2, VJEPA/VJEPA2, InternVideo2, and VideoPrism. This shows where NExT-Vid stands among methods that learn meanings via different objectives.

The scoreboard, with context:

  • NExT-Vid-G (about 1.1B params) achieves: 83.1% on K400, 81.4% on IN1K, 69.5% on SSv2, and 87.2% on Diving48. That’s state-of-the-art among generative pretraining methods in most video-centric settings. Think of 83.1% on K400 as earning an A when earlier generative methods were closer to a B.
  • Compared to VideoMAEv2 (a strong masked autoencoder baseline), NExT-Vid shows consistent gains on K400, IN1K, and SSv2, suggesting that autoregressive next-frame training with flow-matching gives richer temporal semantics than direct regression.
  • Versus Toto (another autoregressive approach), NExT-Vid improves by large margins (e.g., +8.7 on K400, +6.1 on IN1K in the reported table), indicating that next-frame generation with decoupled semantics and flow-matching is more effective than token-by-token schemes used there.
  • While discriminative image pretraining can edge ahead on pure image tasks (e.g., SigLIP2 on IN1K), NExT-Vid shines on video motion tasks like SSv2 and Diving48, showcasing the value of explicit temporal modeling.

Surprising and insightful findings:

  • Masking is essential. Without masking, video redundancy makes prediction too easy, and representation quality drops dramatically (the ablation shows an average accuracy plunge from ~66.9 to ~22.9 with early-stop settings).
  • The small flow decoder (tens of millions of parameters) is enough to produce high-quality samples during representation learning; you don’t need a giant generator to gain strong features.
  • Multi-fold Ļ„ sampling (drawing several Ļ„ values per sample so the flow-matching decoder trains more per clip) can slow early semantic gains but leads to stronger potential later. The authors adopt a staged schedule: exploit multi-Ļ„ early, then switch to single-Ļ„ for stability.
  • A "cool-down" stage with longer clips (64 frames) noticeably boosts action-heavy datasets (SSv2, Diving48), implying that exposure to longer temporal context helps the encoder solidify motion understanding.
  • Generation target choice matters. VAE latents specialized for images (VAVAE) and even pixels do better than a certain video VAE baseline in ablations, likely because of reconstruction fidelity differences.
  • Information isolation helps. Frame-isolated attention in the decoder and avoiding self-attention among noisy targets prevent leakage that would make generation too easy and harm semantic learning.

Scaling trends:

  • More data helps quickly at first, then plateaus, with further gains after the cool-down stage—especially on motion-centric tasks.
  • Bigger encoders help, especially moving from ViT-L (300M) to ViT-H (600M). Gains from ViT-H to ViT-G (1.1B) are smaller unless combined with the cool-down strategy, which then unlocks more improvements.

Takeaway: With the encoder frozen and only a simple attentive probe trained, NExT-Vid’s features outperform prior generative pretraining and are competitive with strong discriminative video pretraining—particularly on tasks that demand temporal understanding.

05 Discussion & Limitations

Limitations:

  • Masking reliance: Training isn’t as simple as GPT-style next-token prediction, because masking and special attention masks are required to prevent trivial copying in videos.
  • Generation vs representation trade-off: The hardest objectives often learn the best semantics but can make training a great generator tougher, and vice versa. Balancing both goals remains tricky.
  • Compute and data: The approach benefits from very large-scale data (millions of hours) and heavy compute (e.g., 96 H100s), which may not be accessible to all practitioners.
  • Target dependence: Using specific VAEs or latent spaces affects results; weaker reconstruction can hurt learning signals.

Required resources:

  • A large, diverse video+image dataset helps the model generalize across scenes and motions.
  • Significant GPU compute is needed to train at the reported scales and to run the multi-stage schedule stably (warm-up, stable phases, cool-down).
  • Engineering for EMA reference encoders, attention masks, and efficient VAE pipelines is necessary.

When NOT to use:

  • If you need a tiny on-device model with minimal compute or power, this training recipe is too heavy.
  • If your downstream task demands precise pixel-level restoration or high-resolution generation quality as the primary goal, a bigger, specialized generator may be better than a compact decoder focused on representations.
  • If your data is mostly static images with little need for motion understanding, a strong image-only pretrainer (e.g., DINOv2, SigLIP2) could be simpler and sufficient.

Open questions:

  • Can we reduce or remove masking while keeping the semantic challenge high (e.g., smarter curriculum or harder temporal corruptions)?
  • How far can we push decoupling—could alternative conditioning or modular training further protect semantics while improving generation?
  • Can we unify image and video training even more tightly so that both benefit without trade-offs, perhaps via multi-task objectives?
  • What are the best latent targets for videos (new VAEs or learned tokenizers) to maximize both fidelity and semantic learning?
  • How to extend beyond next-frame to longer-horizon prediction without information leakage and without making training unstable?

06 Conclusion & Future Work

In three sentences: NExT-Vid teaches a video model by predicting the next frame while parts of the past are hidden, and it cleanly separates the job of understanding meaning from the job of drawing the frame. A context-isolated autoregressive predictor creates a semantic summary of what should happen next, and a conditioned flow-matching decoder turns that summary into a high-quality next-frame latent. This design yields stronger, more accessible representations that outperform prior generative pretraining on major video and image benchmarks under attentive probing.

Main achievement: Proving that masked next-frame autoregression with strict context isolation and flow-matching decoding can deliver state-of-the-art generative pretraining for video understanding—strong semantics without letting the generator’s tricks contaminate the representation space.

Future directions: Explore lighter-weight training that preserves difficulty without heavy masking; design better video-specific latent targets; extend to longer-horizon prediction; and combine this with contrastive or JEPA-style objectives for even richer features. Investigating data-efficient pretraining schedules and improved decoders could further enhance both representation quality and generation fidelity.

Why remember this: It shows that "train on what happens next" truly works for videos when you separate the thinker from the drawer and denoise with care. This simple but powerful separation, plus smart masking and conditioning, can set a new default path for learning robust video representations that help many real-world tasks.

Practical Applications

  • Action recognition for sports analytics (e.g., detecting serves, goals, dives) with minimal task-specific training.
  • Video content moderation and safety filtering that better understands motion-based cues.
  • Smart video search and retrieval that finds moments (e.g., "when the person opens the door") across huge libraries.
  • Surveillance anomaly detection that notices unusual motion patterns without exhaustive labels.
  • Assistive tools for education that segment and summarize key steps in lab demos or how-to videos.
  • Robotics perception that anticipates next motions in human-robot collaboration settings.
  • Video editing aids (e.g., highlight extraction) that understand temporal context for cleaner cuts.
  • Medical or sports form analysis systems that track and classify fine-grained movements.
  • Traffic and driving analytics that infer next-frame motions for safer planning and monitoring.
  • Pretraining a general video backbone to reduce labeled data needs in many downstream tasks.
Tags: autoregressive video pretraining · masked next-frame prediction · context isolation · flow matching decoder · video representation learning · frame-wise causal attention · autoregressive predictor · representation alignment · EMA reference encoder · VAE latent targets · Vision Transformer (ViT) · self-supervised video learning · denoising models · attentive probe · spatially aligned conditioning