Next-Embedding Prediction Makes Strong Vision Learners
Key Summary
- •This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch’s embedding in an image sequence, just like language models predict the next word.
- •Instead of rebuilding pixels or comparing against many negative examples, the model only learns to guess the next embedding using a causal (left-to-right) Transformer and a cosine-similarity loss.
- •A stop-gradient trick keeps training stable so the targets don’t collapse, and a small set of modern stabilizers (RoPE, LayerScale, QK-Norm, SwiGLU) makes scaling easier.
- •Pretrained on ImageNet-1K without labels, NEPA reaches 83.8% top-1 (ViT-B) and 85.3% (ViT-L) after fine-tuning—competitive with popular self-supervised methods.
- •On semantic segmentation (ADE20K), NEPA transfers well, scoring 48.3% mIoU (ViT-B) and 54.0% (ViT-L), despite never doing pixel reconstruction during pretraining.
- •Ablations show three essentials: autoregressive shifting is necessary, causal masking clearly helps, and stop-gradient prevents collapse.
- •Random masking (like in MAE) actually hurts here because autoregressive prediction already makes the task non-trivial.
- •NEPA is architecture-light (no decoder, no contrastive head, no tokenizer) and could be extended to multiple modalities or even to generation with a decoder later.
- •The approach works well but struggles with tricky physical effects (reflections, shadows) and may inherit dataset biases, highlighting future work on data and reasoning.
- •Overall, NEPA suggests vision pretraining can be as simple and scalable as next-token prediction in language—just done in embedding space.
Why This Research Matters
If we can pretrain vision models with a simple, scalable recipe, more people can build strong systems without complicated toolkits. NEPA shows we don’t need decoders, discrete tokenizers, or contrastive negatives to learn powerful features—prediction alone can teach a model rich scene understanding. That means faster iteration, fewer stability issues, and easier transfer to tasks like recognition and segmentation. In real products, simpler training can cut costs and speed up deployment while maintaining top performance. NEPA’s embedding-level prediction could also unify learning across images, text, audio, and video, since all can be framed as “predict the next embedding.” Finally, the approach opens a doorway to generation with minimal extra parts, promising flexible systems that can both understand and create visuals.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re reading a comic strip panel by panel. Even before you turn the page, your brain is already guessing what the next panel will show based on the story so far.
🥬 The Concept (Generative Pretraining): It’s a way to train models by having them predict what comes next, not just describe what they see. How it works: (1) Feed the model a sequence, (2) ask it to guess the next item, (3) compare its guess to the real next item, (4) repeat millions of times. Why it matters: Without this, models might memorize snapshots instead of learning how things unfold.
🍞 Anchor: Language models learn by predicting the next word in a sentence; this paper asks, “What if vision models learned by predicting the next ‘thing’ in an image?”
The World Before: For years, self-supervised vision methods chased a different goal: create excellent representations (features) from pixels without labels. Three big families dominated: (1) contrastive learning (make two views of the same image close, and different images far), (2) self-distillation (train a student model to match a teacher’s features), and (3) masked reconstruction (hide pixels or tokens and rebuild them). These approaches worked well but often needed extra parts: decoders to reconstruct pixels, clever data augmentations, negative pairs, momentum encoders, or discrete visual tokenizers. It was powerful—but complicated and sometimes brittle.
🍞 Hook: You know how building an elaborate Rube Goldberg machine can do a simple task, but it’s fragile and hard to scale?
🥬 The Concept (Predictive Representation Learning): Instead of rebuilding raw pixels, predict more semantic features (embeddings) that summarize what’s important. How it works: (1) Turn patches into embeddings, (2) use context to forecast the next patch’s embedding, (3) train so the forecast and the real embedding match. Why it matters: Predicting “meaning” is often more useful than predicting raw pixels, which can be noisy and heavy.
🍞 Anchor: Rather than repainting every brushstroke of a hidden part of a painting, you guess the painting’s theme and style for the next section.
The Problem: Vision didn’t fully adopt the simple, causal “next-thing” training that made language models soar. Early attempts like iGPT predicted pixels or discrete tokens, which led to long sequences and weak semantics. JEPA moved to latent targets but still kept separate encoders and a complex head—more about learning a representation than training a single predictive model.
🍞 Hook: Think of learning to ride a bike: it’s easier when you learn by moving forward smoothly rather than rebuilding the ground behind you.
🥬 The Concept (Attention Mechanism): It lets models focus on the most relevant parts of what they’ve seen so far. How it works: (1) Compare the current query to all past pieces, (2) give higher weights to important ones, (3) mix information using those weights. Why it matters: Without attention, the model treats every past patch equally, missing key context.
🍞 Anchor: When you look at a face, your eyes focus more on the eyes and mouth than the background.
The Gap: Could we train a vision model as simply as a language model—just predict the next part—while staying in an embedding space (more semantic than pixels) and avoiding extra machinery (decoders, tokenizers, contrastive negatives)? That’s the gap this paper fills.
Real Stakes: Simpler, scalable training means faster research, fewer moving parts to break, and models that can transfer well across tasks like classification and segmentation. In everyday life, that can mean better photo organizing, safer robots that understand scenes, and more accessible tools for people without massive compute budgets.
🍞 Hook: Suppose you pack a suitcase by predicting the next item you’ll need on your trip—socks after shoes, shirt after pants—rather than rebuilding your closet every time.
🥬 The Concept (Vision Transformer, ViT): A model that treats an image as a sequence of patches and processes them with attention, like words in a sentence. How it works: (1) Split image into patches, (2) turn patches into vectors, (3) let attention share information across patches, (4) output features. Why it matters: Without ViT’s sequence view, predicting the “next” patch wouldn’t fit naturally.
🍞 Anchor: It’s like reading a picture book one tile at a time, remembering what came before.
In short, before this paper, vision pretraining mostly aimed to craft strong features with extra scaffolding. The authors ask: what if we ditch that scaffolding and just practice predicting the next meaningful piece, like language models do? The answer—NEPA—shows you can learn strong vision skills by simply predicting the next embedding in a causal sequence.
02 Core Idea
🍞 Hook: You know how when you listen to a song, you can often hum the next note because the melody guides you?
🥬 The Concept (NEPA – Next-Embedding Predictive Autoregression): NEPA trains a vision model to guess the next patch’s embedding from the patches it has already seen. How it works: (1) Split an image into patches, (2) turn each patch into an embedding, (3) use a causal Transformer to predict the next embedding, (4) compare the prediction to the true next embedding with a cosine-similarity loss, (5) stop gradients through the target so training stays stable. Why it matters: Without this next-embedding prediction, we’d need extra tools (decoders, tokenizers, negatives) to force learning; with NEPA, prediction itself supplies the learning signal.
🍞 Anchor: Like predicting the next beat in music, NEPA learns the rhythm of images—what patch likely comes next.
The "Aha!" Moment in one sentence: Treat an image as a sequence of embeddings and train the model, just like a language model, to predict the next embedding causally—simple, semantic, and scalable.
Multiple Analogies:
- Storytelling: Given the beginning of a comic strip, guess what the next panel shows; learn storytelling rules of scenes.
- Jigsaw Strategy: Place pieces left-to-right by predicting what edge shape comes next rather than redrawing the picture.
- Trail Hiking: Follow trail markers (embeddings); from the current marker and memory of earlier ones, predict where the next marker will be.
Before vs. After:
- Before: Vision pretraining often needed reconstructor decoders, momentum teachers, heavy augmentations, or discrete tokenizers; training targeted features that downstream heads would later consume.
- After: NEPA uses one Transformer, a simple next-embedding objective, no pixel decoding or negatives, and still reaches top results after standard fine-tuning.
🍞 Hook: Imagine highlighting the words in a sentence you’ve already read while covering the next word; you try to guess it using only what’s revealed.
🥬 The Concept (Causal Masking): Only allow attention to previous patches, never future ones, so predictions can’t peek. How it works: (1) Mask future positions, (2) attend to all earlier patches, (3) predict the next embedding, (4) shift and repeat. Why it matters: Without causal masking, the model could cheat by looking ahead, turning prediction into reconstruction and weakening the skill of true foresight.
🍞 Anchor: It’s like taking a test where you can’t look at the answer key until after you’ve guessed.
🍞 Hook: Think of tracing a ruler line—if the ruler keeps moving under your pencil, it’s hard to draw a straight line.
🥬 The Concept (Stop-Gradient): Freeze the target embedding so it stays still while the predictor learns to match it. How it works: (1) Compute target embeddings, (2) detach them from gradients, (3) train predictor toward these targets, (4) update the shared embedding layer via other paths but not through the target match. Why it matters: Without stop-grad, both sides could drift together into a trivial collapse (all vectors the same), making the task meaningless.
🍞 Anchor: It’s like practicing to match a steady tone from a tuning fork; if the fork changed pitch with you, you’d never learn.
Why It Works (intuition without equations):
- Predicting the next embedding forces the model to understand object parts and scene context to make a good guess.
- Cosine similarity on unit-length embeddings rewards semantic alignment rather than pixel-perfect copying.
- Causal masking ensures the signal is honest: only use the past to anticipate the future.
- Stop-gradient keeps the target steady enough to learn a real mapping instead of collapsing.
Building Blocks explained with mini-sandwiches:
- 🍞 Hook: You know how a nickname is a short way to capture someone’s identity. 🥬 The Concept (Embedding): A compact vector that summarizes a patch’s meaning. How: pass pixels through a small encoder. Why: Without embeddings, the model would juggle heavy pixel details instead of useful cues. 🍞 Anchor: A sky patch’s embedding clusters with other sky patches.
- 🍞 Hook: Think of tapping out a beat, then predicting the next tap. 🥬 The Concept (Autoregression): Make a sequence guess using only what came before. How: shift one step and predict next. Why: Without it, the model doesn’t learn temporal/spatial ordering. 🍞 Anchor: After tree trunk patches, foliage patches are likely next.
- 🍞 Hook: When comparing two songs, you care about how closely their melodies point, not their loudness. 🥬 The Concept (Cosine Similarity): Compare the angle between two vectors after normalizing length. How: normalize both, take their dot. Why: Without normalization, scale tricks could game the loss. 🍞 Anchor: Two different-sized arrows pointing the same way get a high score.
- 🍞 Hook: Imagine scanning a crowd and paying extra attention to your friend’s shirt color. 🥬 The Concept (Attention in a ViT): Weigh past patches by relevance to the current guess. How: compute attention weights and mix information. Why: Without attention, the model dilutes focus across irrelevant areas. 🍞 Anchor: Predicting a cat’s ear patch focuses on nearby fur and face patches.
In essence, NEPA reframes vision pretraining as the simplest powerful game: guess the next meaningful piece and learn the picture’s logic as you go.
03 Methodology
High-level recipe: Image → Patchify + Embed → Causal Transformer → Shifted Next-Embedding Prediction → Cosine-Similarity Loss (with stop-grad) → Pretrained backbone → Fine-tune for tasks.
Step 1: Patchify and Embed
- What happens: The image is cut into non-overlapping patches (like tiles). Each patch goes through a shared embedding layer to become a D-dimensional vector.
- Why this step exists: Patches make images act like sequences; embeddings compress meaning into manageable vectors.
- Example: A 224×224 image with patch size 14 gives T = 16×16 = 256 patches. Each becomes a 1024-D vector.
- Mini-sandwich (Embedding): 🍞 Hook: A map’s legend turns symbols into meanings. 🥬 The Concept: An embedding is a vector that captures a patch’s meaning. How: one small encoder shared for all patches. Why: Without embeddings, we’d predict heavy pixels instead of semantics. 🍞 Anchor: Patches of blue sky end up near each other in embedding space.
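To make this concrete, here is a minimal PyTorch sketch of patchify-and-embed using the numbers from the example above (the class name and the strided-convolution trick are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Cut an image into non-overlapping patches and project each to a D-dimensional embedding."""
    def __init__(self, img_size=224, patch_size=14, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 16 x 16 = 256 for the example above
        # A convolution with stride == kernel size is the same as "tile the image, then apply one shared linear layer".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        x = self.proj(x)                         # (B, D, 16, 16)
        return x.flatten(2).transpose(1, 2)      # (B, 256, D): a sequence of patch embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                              # torch.Size([2, 256, 1024])
```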
Step 2: Add Positions and Use a Causal Transformer
- What happens: Positional info is added so the model knows which patch comes first. A Transformer with causal masking only looks at earlier patches for each prediction.
- Why this step exists: The model must know order, and must not peek ahead.
- Example: For patch t, attention can use patches 1..t, but not t+1..T.
- Mini-sandwich (Causal Masking): 🍞 Hook: Cover the next comic panel so you can’t peek. 🥬 The Concept: Only attend to past patches when predicting the next. How: apply a mask in attention. Why: Without it, prediction becomes reconstruction, weakening foresight learning. 🍞 Anchor: When at patch 10, the model never sees patch 11 during prediction.
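A sketch of how the causal mask can be built and applied inside standard scaled dot-product attention (illustrative PyTorch; the paper's exact attention code may differ):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """q, k, v: (B, heads, T, d). Position t may attend only to positions 1..t."""
    T = q.size(-2)
    # True above the diagonal marks the "future" positions that must stay hidden.
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5   # (B, heads, T, T)
    scores = scores.masked_fill(future, float("-inf"))       # patch t never sees patches t+1..T
    return F.softmax(scores, dim=-1) @ v

# Equivalent shortcut using PyTorch's fused kernel:
# out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```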
Step 3: Predict the Next Embedding (Autoregressive Shift)
- What happens: The model predicts the embedding for position t+1 from positions ≤ t. We shift predictions by one and align them with the next true embeddings.
- Why this step exists: Predicting the next step creates a non-trivial target; predicting the current input would be a copy.
- Example: Use outputs [1..T−1] to match targets [2..T]. Removing the shift made fine-tuning diverge in ablations.
- Mini-sandwich (Autoregression): 🍞 Hook: After hearing “knock, knock,” you can guess “who’s there?” 🥬 The Concept: Predict the next item using only what came before. How: shift by one position. Why: Without shifting, the model learns identity mapping. 🍞 Anchor: After “wheel” patches on a car, “door” or “window” patches often follow.
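The shift itself is just tensor slicing; a minimal sketch, assuming `h` holds the causal Transformer outputs and `e` the patch embeddings used as targets (both names are illustrative):

```python
import torch

def shift_for_next_prediction(h: torch.Tensor, e: torch.Tensor):
    """h: (B, T, D) causal Transformer outputs; e: (B, T, D) patch embeddings (targets).
    The output at position t is trained to predict the embedding at position t+1,
    so drop the last prediction and the first target to align the pairs."""
    pred = h[:, :-1, :]    # predictions aimed at positions 2..T
    target = e[:, 1:, :]   # true embeddings at positions 2..T
    return pred, target    # without the shift, the task degenerates into copying the input
```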
Step 4: Compute the Loss with Stop-Gradient and Cosine Similarity
- What happens: Normalize predictions and targets; compute negative cosine similarity at each step; sum over positions. Targets are detached (stop-grad) to avoid collapse.
- Why this step exists: Cosine focuses on direction (semantics) over magnitude; stop-grad stabilizes learning so targets don’t drift with the predictor.
- Example: If predicted and true embeddings point similarly, loss is low (good). Without stop-grad, training collapsed (all embeddings identical).
- Mini-sandwich (Cosine Similarity): 🍞 Hook: Two compasses pointing north are aligned, even if one is bigger. 🥬 The Concept: Measure alignment of two vectors after normalizing. How: dot product of unit vectors. Why: Without it, the model could game the loss by scaling. 🍞 Anchor: Predicting a patch on a dog’s fur yields a vector close in angle to the real fur patch.
- Mini-sandwich (Stop-Gradient): 🍞 Hook: Practice singing to a fixed note from a tuning app. 🥬 The Concept: Freeze target embeddings so only the predictor chases them. How: detach targets in the graph. Why: Without it, both could move together and collapse. 🍞 Anchor: Targets stay steady while the predictor adjusts to match.
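Putting the loss together, here is a minimal sketch that assumes the shifted `pred`/`target` pair from the previous step (the function name is illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def nepa_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, T-1, D) after the one-step shift.
    Negative cosine similarity averaged over positions, with stop-gradient on the target."""
    pred = F.normalize(pred, dim=-1)               # unit length: compare direction, not magnitude
    target = F.normalize(target.detach(), dim=-1)  # .detach() is the stop-gradient that prevents collapse
    return -(pred * target).sum(dim=-1).mean()     # perfect alignment drives the loss toward -1
```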
Step 5: Stabilizers in the Transformer
- What happens: Use a few modern components to make optimization smoother.
- Why this step exists: Deep Transformers can be finicky; these parts help scale.
- Examples and mini-sandwiches:
- RoPE (Rotary Position Embedding): 🍞 Hook: Turning a globe lets you keep track of where places are relative to each other. 🥬 The Concept: Encode relative positions with rotations so attention generalizes across lengths. How: apply rotations to queries/keys. Why: Without good positions, order understanding is weaker. 🍞 Anchor: The model better handles different image sizes or patch counts.
- LayerScale: 🍞 Hook: A volume knob saves your ears from sudden loud bursts. 🥬 The Concept: Learn tiny per-channel scales on residual paths. How: start near zero, let them grow. Why: Without it, gradients can wobble or explode in deep nets. 🍞 Anchor: Training curves become smoother and converge faster.
- SwiGLU: 🍞 Hook: A gate lets the right amount of water flow through a canal. 🥬 The Concept: A gated activation in the MLP improves expressiveness. How: split/weight/activate. Why: Without gating, gains can be slightly smaller. 🍞 Anchor: Small bumps in accuracy over GeLU in some settings.
- QK-Norm: 🍞 Hook: Normalizing handwriting size makes letters easier to compare. 🥬 The Concept: Normalize queries/keys to stabilize attention. How: apply LayerNorm-like steps. Why: Without it, attention can spike and training can diverge. 🍞 Anchor: Prevents gradient explosions, especially with SwiGLU.
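Two of these stabilizers are small enough to sketch directly; the snippet below shows QK-Norm and LayerScale in illustrative PyTorch (RoPE and SwiGLU follow their standard published forms and are omitted for brevity; none of this is the authors' exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with QK-Norm: queries and keys are LayerNorm-ed per head before the
    dot product, which bounds attention logits and helps avoid training spikes."""
    def __init__(self, dim: int, num_heads: int = 16):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                              # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, T, self.num_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(B, T, self.num_heads, self.head_dim)).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal, as in pretraining
        return self.proj(out.transpose(1, 2).reshape(B, T, -1))

class LayerScale(nn.Module):
    """Per-channel scaling of a residual branch, initialized near zero so deep branches start quiet."""
    def __init__(self, dim: int, init_value: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x   # applied as: x + LayerScale(block(x))
```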
Step 6: Pretraining Loop
- What happens: For each batch: embed patches → predict next embeddings → compute cosine loss with stop-grad → backprop → update. Optionally track an EMA copy for evaluation stability.
- Why this step exists: Repeating the game teaches the model strong contextual priors.
- Example: Trained on ImageNet-1K without labels; large batches (e.g., 4096) and many epochs.
- Mini-sandwich (EMA): 🍞 Hook: Averaging many photos makes a sharper picture. 🥬 The Concept: Keep an exponential moving average of weights for steadier evals. How: new_avg = α·old + (1−α)·current. Why: Without EMA, evals can bounce around. 🍞 Anchor: Validation accuracy is smoother with high-decay EMA.
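The EMA bookkeeping mentioned above is only a few lines; a sketch, assuming `model` and `ema_model` share the same architecture (the decay value is the generic formula from the mini-sandwich, not a number reported in the paper):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay: float = 0.999):
    """new_avg = decay * old_avg + (1 - decay) * current, applied parameter by parameter."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)   # in-place: ema_p = decay * ema_p + (1 - decay) * p

# Typical usage inside the training loop (ema_model starts as a frozen copy of model):
# ema_model = copy.deepcopy(model).eval()
# ... loss.backward(); optimizer.step(); ema_update(ema_model, model)
```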
Step 7: Fine-tuning for Tasks
- Classification Head:
- What: Use the last token’s hidden state; add a linear layer; train with cross-entropy.
- Why: The last state summarizes the sequence nicely.
- Example: Fine-tune on ImageNet-1K; sometimes bidirectional attention helps during this stage (a minimal head sketch follows after this list).
- Segmentation Head (UPerNet):
- What: Add a standard decoder that fuses multi-scale features; fine-tune with pixel-wise cross-entropy.
- Why: Dense labeling needs full spatial context; use bidirectional attention during fine-tuning.
- Example: ADE20K with 512×512 crops.
- Mini-sandwich (UPerNet): 🍞 Hook: Building a city map requires combining neighborhood maps at different zoom levels. 🥬 The Concept: UPerNet fuses multi-scale features for pixel-wise labels. How: pyramid pooling + FPN-like fusion. Why: Without multi-scale fusion, small and large objects are harder to segment. 🍞 Anchor: Better boundaries and region consistency on ADE20K.
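Here is a sketch of the classification fine-tuning setup described above (the wrapper class and default sizes are illustrative assumptions; the authors' full fine-tuning recipe is more involved):

```python
import torch
import torch.nn as nn

class NEPAClassifier(nn.Module):
    """Wrap a pretrained NEPA backbone: take the last token's hidden state and add a linear head."""
    def __init__(self, backbone: nn.Module, embed_dim: int = 1024, num_classes: int = 1000):
        super().__init__()
        self.backbone = backbone                       # pretrained ViT (causal or bidirectional at fine-tune time)
        self.head = nn.Linear(embed_dim, num_classes)  # newly initialized linear classifier

    def forward(self, images):
        h = self.backbone(images)                      # assumed to return (B, T, D) hidden states
        return self.head(h[:, -1])                     # the last state summarizes the causal sequence

# Fine-tuning step with cross-entropy, as in standard supervised training:
# logits = classifier(images); loss = nn.functional.cross_entropy(logits, labels)
```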
Secret Sauce:
- Predict in embedding space, not pixels—lighter targets, more semantic.
- Keep it causal—no peeking, honest prediction.
- Use stop-grad—stable targets, no collapse.
- No decoders, no negatives, no tokenizers—just a single, scalable predictive Transformer.
04 Experiments & Results
The Test: The authors measure how well NEPA’s pretrained backbones transfer to (1) image classification on ImageNet-1K (top-1 accuracy) and (2) semantic segmentation on ADE20K (mIoU). They also run ablations to test which ingredients truly matter: autoregressive shift, causal masking, stop-grad, and architectural stabilizers (RoPE, LayerScale, QK-Norm, SwiGLU).
The Competition: They compare against well-known self-supervised approaches—contrastive (MoCo v3), self-distillation (DINO/iBOT), and masked prediction (MAE/BEiT), plus JEPA-style latent prediction. Many of these need extra parts (decoders, negative pairs, or momentum teachers). NEPA keeps a single-stream predictive Transformer with no decoder.
The Scoreboard (with context):
- ImageNet-1K classification (fine-tuned):
- ViT-B: 83.8% top-1. That’s like scoring an A when most solid methods also sit in the A– range, but NEPA does it with a simpler training game.
- ViT-L: 85.3% top-1. That’s A+ territory, competitive with strong masked or distillation approaches.
- ADE20K segmentation (fine-tuned with UPerNet):
- ViT-B: 48.3% mIoU; ViT-L: 54.0% mIoU. For a model that never reconstructed pixels during pretraining, these are strong, showing learned embeddings carry dense, spatial meaning.
Key Ablations (what moves the needle):
- Autoregressive Shift (predict t+1, not t): Removing it turns learning into identity mapping; fine-tuning later diverges. Necessary.
- Causal Masking: Without it (bidirectional during pretraining), the task becomes reconstruction-like and top-1 drops notably in short training (e.g., ~73.6% vs 76.8% in early runs). Helpful.
- Stop-Gradient: Without it, representations collapse (loss saturates at −1, all vectors same). Essential.
- Random Masking: Unlike MAE, masking hurts here because autoregressive prediction is already non-trivial; masking corrupts structure and creates a mismatch between training and inference.
Architectural Components:
- RoPE: Notable accuracy gains; strengthens positional reasoning for sequences of patches.
- QK-Norm: Stabilizes attention; prevents gradient spikes, especially with SwiGLU.
- LayerScale: Makes pretraining smoother and faster to converge, though it can slightly slow fine-tuning unless you freeze early layers judiciously.
- SwiGLU: Small gains; kept for compatibility with modern Transformer designs.
Fine-tuning Attention Type:
- For classification, enabling bidirectional attention during fine-tuning can help (the model can use full context when not predicting the future). Still, even with causal attention at fine-tune time, NEPA remains competitive, which hints that its embeddings capture strong global cues.
- For segmentation, bidirectional attention during fine-tuning is standard and used here.
Surprising Findings:
- Masking that is crucial for MAE is counterproductive for NEPA. Because NEPA’s task is already predictive and causal, adding random holes removes useful order and harms learning.
- Linear probing is weak: because the raw output is close to the shallow embeddings, a linear probe doesn’t reveal the full predictor’s power. But end-to-end fine-tuning unlocks strong results.
- Despite the minimalism (no decoder, no negatives), NEPA scales well and keeps improving with more training—similar to language-model scaling behavior.
Takeaway: NEPA reaches top-tier results using a lighter recipe, proving that “next-embedding prediction” can be a strong, scalable alternative to reconstruction and contrastive pipelines.
05 Discussion & Limitations
Limitations:
- Reasoning-intensive visuals: NEPA struggles with reflections, strong shadows, backlighting, and scenes crowded with tiny or overlapping objects. Predicting the next embedding may not fully enforce physical reasoning.
- Linear probing: Shallow probes underperform because the best knowledge lives in the predictive Transformer dynamics, not a frozen early representation.
- Dataset bias: Training on ImageNet-1K can pass along spurious correlations or biases to downstream tasks.
Required Resources:
- Compute: Large-batch training (e.g., 4096) and many epochs benefit results; authors used multi-GPU setups (e.g., 8× H100) and training times of several days.
- Memory: ViT backbones with long patch sequences need careful memory and stability features (RoPE, QK-Norm, LayerScale).
- Data: Broad, diverse image data improves generalization; limited datasets can cap performance.
When NOT to Use:
- If you need pixel-perfect reconstruction (e.g., super-resolution training signals), a reconstruction objective may fit better.
- If compute is tiny and only linear probing is allowed, NEPA’s strengths might not show without fine-tuning.
- If your problem has no natural sequence order or depends on absolute pixel geometry, NEPA’s causal sequence framing might not be ideal without adaptations.
Open Questions:
- Generation Bridge: What’s the best way to attach an image decoder or diffusion head to turn NEPA into a full image generator/editor?
- Modality-Agnostic Training: Can the same next-embedding objective unify audio, video, and text with minimal changes?
- Order and Layout: Does scan order (raster, spirals, space-filling curves) affect learning? Could learned orders help?
- Scaling Laws: How do performance, compute, and data scale for NEPA compared to MAE and contrastive learning?
- Hybrid Losses: Are there sweet-spot combinations—mostly NEPA plus a sprinkle of masked prediction or contrastive anchors—for even stronger transfer?
06 Conclusion & Future Work
Three-Sentence Summary: This paper shows that vision models can be pretrained as simply as language models by predicting the next embedding of image patches, not pixels. With a causal Transformer, stop-gradient, and a cosine loss—no decoders, tokenizers, or negatives—NEPA learns rich, transferable features. The result is competitive accuracy on ImageNet-1K and strong ADE20K segmentation, proving prediction in embedding space is a powerful, scalable alternative to reconstruction and contrastive pipelines.
Main Achievement: Turning vision pretraining into a single, minimal next-embedding prediction task that matches or beats mainstream methods after fine-tuning, while removing architectural complexity.
Future Directions: Attach a decoder or diffusion head to make NEPA generative; extend to video, audio, and multimodal settings; explore learned patch orders; study scaling laws and combinations with light auxiliary losses.
Why Remember This: NEPA reframes “learn good features” as “predict the next meaningful piece,” bringing vision closer to the simple, proven recipe behind language models. It suggests a unifying view: embeddings as the shared currency across modalities, with causally predicting the future as the core learning game.
Practical Applications
- •Pretrain vision backbones for classification with a single, decoder-free objective to simplify pipelines.
- •Build strong segmentation systems by fine-tuning NEPA backbones with a standard UPerNet head.
- •Use NEPA-pretrained features in vision-language models to provide robust visual context without contrastive pretraining.
- •Adopt NEPA as a lightweight pretraining method when compute or engineering time is limited.
- •Prototype robotics perception stacks that need reliable scene understanding but benefit from simple training loops.
- •Warm-start generative pipelines by pairing NEPA with a decoder (or diffusion model) to enable image synthesis or editing.
- •Leverage NEPA for domains where pixel targets are awkward (medical or satellite imagery), focusing on semantics over raw intensity.
- •Scale training to larger ViT models with stabilizers (RoPE, QK-Norm, LayerScale) to improve performance steadily.
- •Reduce reliance on heavy data augmentations and negative pairs in self-supervised setups, easing hyperparameter tuning.
- •Explore multimodal extensions by treating audio or video frames as sequences of embeddings and applying the same next-embedding objective.