
End-to-End Test-Time Training for Long Context

Intermediate
Arnuv Tandon, Karan Dalal, Xinhao Li et al. Ā· 12/29/2025
arXiv Ā· PDF

Key Summary

  • This paper shows how a language model can keep learning while you use it, so it handles very long inputs without slowing down.
  • Instead of remembering every single detail (full attention), the model learns from the current document at test time and stores the important parts in its weights.
  • They use a normal Transformer with sliding-window attention, plus Test-Time Training (TTT) that updates only some MLP layers as it reads.
  • During training, meta-learning prepares the model to be especially good at this kind of test-time learning.
  • With 3B-parameter models, the method scales with context length like full attention but keeps constant latency like RNNs.
  • At 128K tokens, it’s about 2.7Ɨ faster to prefill than full attention while keeping strong accuracy.
  • Ablations show key choices matter: window size k=8K, TTT mini-batch size b=1K, and updating the last 1/4 of blocks.
  • It wins on language modeling over long contexts but loses to full attention on pure recall tests like Needle-in-a-Haystack.
  • Training is slower (because of gradients of gradients), but inference uses standard infrastructure.
  • Overall, the big idea is to treat long-context modeling as continual learning, not just architecture design.

Why This Research Matters

Many real tasks involve very long inputs: legal cases, entire books, long chats, huge codebases, and medical records. This method lets models ā€œlearn while readingā€ so they can use long context effectively without slowing down. It keeps the familiar Transformer structure, making it easier to deploy on today’s infrastructure. It helps language modeling over long contexts while staying fast, which lowers cost and improves user experience. Although pure recall tasks still favor full attention, most real tasks benefit from compression and adaptation. Over time, combining small amounts of exact recall with efficient TTT could bring the best of both worlds.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re reading a huge book. You don’t remember every single word, but you keep learning the main ideas as you go, and that helps you understand the rest faster.

🄬 The Concept (Transformer architecture): A Transformer is a type of AI brain for language that looks at words and figures out which ones matter for predicting the next word. How it works:

  1. It turns words into numbers (embeddings).
  2. It uses attention to decide which earlier words to focus on.
  3. It stacks many layers to refine understanding.

Why it matters: Without Transformers, modern language models wouldn’t be so good at reading, writing, and answering questions.

šŸž Anchor: When an AI answers ā€œParisā€ to ā€œWhat’s the capital of France?ā€, the Transformer helped it focus on the right parts.

šŸž Hook: You know how trying to reread the whole book every time you turn a page would be super slow?

🄬 The Concept (Full attention): Full attention lets the model look at all previous words every time it predicts the next one. How it works:

  1. Save keys and values for all past tokens.
  2. For the next token, compare against all saved tokens.
  3. Weigh them and combine to predict.

Why it matters: It’s very accurate for recall, but its cost grows with input length; long inputs get slow and expensive.

šŸž Anchor: If your context is 128,000 tokens, full attention must scan all of them for each new step—like re-scanning an entire bookshelf to find a quote.
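
To make the cost difference concrete, here is a small back-of-the-envelope Python calculation (our illustration, not the paper's code) counting how many token comparisons prefill needs for full attention versus a fixed window; T and k mirror the paper's 128K context and 8K window.

```python
# Illustrative comparison: full attention compares each new token against every
# cached token, so total prefill work grows quadratically with context length T,
# while a fixed window of k tokens keeps per-token work constant.

def full_attention_comparisons(T: int) -> int:
    # token t attends to all t previous tokens (plus itself)
    return sum(t + 1 for t in range(T))

def windowed_attention_comparisons(T: int, k: int) -> int:
    # token t attends to at most the last k tokens
    return sum(min(t + 1, k) for t in range(T))

T, k = 128_000, 8_000
full = full_attention_comparisons(T)
window = windowed_attention_comparisons(T, k)
print(f"full attention: {full:,} token comparisons")
print(f"window (k={k}): {window:,} token comparisons")
print(f"ratio: {full / window:.1f}x more work for full attention")
```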

šŸž Hook: What if you skim your notes only around the page you’re on, instead of the whole book?

🄬 The Concept (Sliding-window attention): Sliding-window attention looks back only a fixed number of tokens (a window) instead of the whole history. How it works:

  1. Pick a window size k (like 8K tokens).
  2. For each new token, compare against only the last k tokens.
  3. Slide the window forward as you read.

Why it matters: It keeps speed steady even when the text is very long, but it can miss older details outside the window.

šŸž Anchor: Reading with a bookmark: you check only the last few pages, not the entire book from the start.
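
The sketch below shows the idea in toy form: single-head attention where each query may only look back at the last few keys. It is a minimal PyTorch illustration, not the paper's optimized implementation, and the shapes and window size are made up for the example.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Causal attention where each query sees only the last `window` keys.

    q, k, v: tensors of shape (T, d). A toy single-head sketch.
    """
    T, d = q.shape
    scores = q @ k.T / d**0.5                         # (T, T) similarity scores
    pos = torch.arange(T)
    causal = pos[None, :] <= pos[:, None]             # key index <= query index
    in_window = pos[:, None] - pos[None, :] < window  # key within the last `window` tokens
    scores = scores.masked_fill(~(causal & in_window), float("-inf"))
    return F.softmax(scores, dim=-1) @ v              # (T, d)

# Toy usage: 16 tokens, 8-dim features, window of 4.
q, k, v = torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8)
out = sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([16, 8])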

šŸž Hook: Think about learning during a test by reading the exact chapter you’re tested on—it’s like updating your brain on-the-fly.

🄬 The Concept (Test-Time Training, TTT): Test-Time Training means the model continues to learn from the new text you give it right now, while it’s making predictions. How it works:

  1. Read the current text chunk.
  2. Predict the next word and measure the error.
  3. Take a small learning step to improve.

Why it matters: It helps the model adapt to the specific document or topic without retraining from scratch.

šŸž Anchor: If a paper uses unusual terms, TTT lets the model pick them up as it reads that very paper.
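
A minimal sketch of the idea, assuming a toy next-token predictor and a made-up chunk size: the model keeps taking small gradient steps on the very text it is reading. This shows only the loop structure, not the paper's actual model or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

document = torch.randint(0, vocab, (4096,))  # stand-in for the test-time input

chunk_size = 512  # analogous in spirit to the paper's TTT mini-batch
for start in range(0, len(document) - 1, chunk_size):
    chunk = document[start : start + chunk_size + 1]
    inputs, targets = chunk[:-1], chunk[1:]

    logits = model(inputs)                   # predict the next token
    loss = F.cross_entropy(logits, targets)  # measure the error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # small test-time learning step
    # ...the slightly adapted model is then used to continue predicting.
```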

šŸž Hook: Imagine you practiced not just solving problems—but practicing how to get better fast during the test.

🄬 The Concept (Meta-learning): Meta-learning teaches the model to be good at learning quickly during test time. How it works:

  1. During training, pretend each sequence is a test.
  2. Do TTT on it (inner loop), then measure final loss.
  3. Update the starting weights to make future TTT more effective (outer loop).

Why it matters: Without meta-learning, TTT may be weak or unstable; with it, TTT is powerful and ready.

šŸž Anchor: Like training to be a sprinter who accelerates quickly, not just to run long distances.
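
The toy sketch below shows the two loops on a deliberately simple quadratic problem (our illustration, unrelated to the paper's model): the inner loop mimics TTT with one gradient step, and the outer loop differentiates through that step to improve the starting point.

```python
import torch

def task_loss(theta, target):
    return ((theta - target) ** 2).sum()

theta0 = torch.zeros(2, requires_grad=True)   # meta-learned initialization
outer_opt = torch.optim.SGD([theta0], lr=0.1)
inner_lr = 0.25

targets = [torch.tensor([1.0, -1.0]), torch.tensor([2.0, 0.5])]  # toy "tasks"
for step in range(100):
    target = targets[step % len(targets)]
    # Inner loop (analogous to TTT): one gradient step starting from theta0.
    inner_loss = task_loss(theta0, target)
    (grad,) = torch.autograd.grad(inner_loss, theta0, create_graph=True)
    theta_adapted = theta0 - inner_lr * grad
    # Outer loop: loss AFTER adaptation, backpropagated through the inner step.
    outer_loss = task_loss(theta_adapted, target)
    outer_opt.zero_grad()
    outer_loss.backward()                     # "gradients of gradients" flow here
    outer_opt.step()

print(theta0.detach())  # moves toward the average of the task targets
```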

šŸž Hook: You know how people don’t remember every word they read but keep the big ideas in their minds?

🄬 The Concept (Compression into weights): Instead of caching every detail, the model compresses what matters from the context into its weights. How it works:

  1. Predict next tokens on the given context.
  2. Nudge the model’s MLP layers to store useful patterns.
  3. Use this updated model to predict future tokens.

Why it matters: This keeps inference fast while still benefiting from long context.

šŸž Anchor: Skimming a chapter and adding a few sticky notes with the main points, so you can answer questions quickly later.

The world before: Full attention was accurate but slow on very long inputs—its cost grows with context length. Sliding-window attention and RNNs kept cost constant but tended to underuse long context, getting worse as the text got longer. Many new architectures tried to fix this, but most either slowed down or didn’t make great use of longer inputs.

The problem: How can we get the ā€œuses long context wellā€ benefits of full attention while keeping constant speed, like RNNs? And can we do this without inventing a totally new architecture that’s hard to train and deploy?

Failed attempts: Modern recurrent architectures (like Mamba 2 and Gated DeltaNet) had constant cost but lost effectiveness as context grew. Sliding-window-only models stayed fast but couldn’t benefit much from information outside the window. Dynamic evaluation (an earlier form of TTT) helped, but it didn’t train the model to be good at test-time learning itself.

The gap: No approach treated long-context reading as a continual learning problem and then fully aligned training with how the model would actually behave at test time—end-to-end.

What this paper adds: It keeps a standard sliding-window Transformer and adds TTT via next-token prediction at test time, plus meta-learning at training time to make TTT strong and stable. The model updates only selected MLP layers in the last quarter of blocks and adds a ā€œsafeā€ static MLP in those blocks to keep pre-trained knowledge. With a window of 8K and mini-batches of 1K, it scales with context like full attention while keeping constant latency.

Real stakes: Long documents, codebases, medical records, legal contracts, scientific papers, and days-long chat histories are common. We need models that ā€œlearn while readingā€ to sum up, reason, and continue writing—without stalling or forgetting the plot.

02Core Idea

šŸž Hook: Imagine you’re solving a mega jigsaw puzzle. You can’t stare at every piece all the time, but you can keep learning the picture as you go.

🄬 The Concept (Aha!): Treat long-context language modeling as continual learning: keep a normal sliding-window Transformer, but also keep learning at test time via next-token prediction, and train the model to be great at this with meta-learning. How it works:

  1. Use sliding-window attention for short-term memory.
  2. During test time, do mini-batch TTT on the current chunk: predict next tokens, compute loss, take a gradient step.
  3. Only update the MLPs in the last 1/4 of blocks to store compressed context; keep another MLP per block static as safe storage.
  4. During training, meta-learn the initialization so the model adapts quickly and stably at test time.

Why it matters: You get the context-using power of full attention’s scaling, but with constant inference latency like RNNs.

šŸž Anchor: Reading a 128K-token book, the model learns chapter by chapter while staying fast.

Three analogies:

  1. Study notebook: You don’t memorize the whole textbook; you write key notes as you read new chapters (TTT updates). Your notebook format was designed ahead of time to make quick note-taking easy (meta-learning).
  2. Chef tasting: A chef adjusts seasoning as they cook (TTT), having trained to react quickly to tastes and smells (meta-learning), instead of reading every recipe step every time (full attention).
  3. Backpack and window: You look out a small window (sliding-window) but pack the essentials into your backpack as you go (updated MLP weights), so you’re ready for what’s next.

Before vs After:

  • Before: Choose between ā€œgreat recall but slowā€ (full attention) or ā€œfast but forgetfulā€ (sliding-window/RNNs). Dynamic evaluation helped a bit but wasn’t trained for.
  • After: Same standard Transformer backbone, but it learns while reading. It scales with context length like full attention, stays fast like RNNs, and beats other efficient baselines.

Why it works (intuition):

  • Focusing on the present: The model doesn’t need to be perfect for all possible futures at once. It just needs to be good for the current mini-batch, because it will relearn for the next one.
  • Compression not caching: Instead of storing every key/value forever, it compresses important patterns into a fixed, trainable ā€œfast memoryā€ (the updated MLP weights).
  • Prepared to learn: Meta-learning shapes the initial weights so small test-time gradient steps reliably help instead of destabilize.

Building blocks (with mini ā€œsandwichā€ intros):

  • šŸž Hook: Like looking at the last few pages while taking notes. 🄬 Sliding-window attention: Keeps local context available at low cost; window k=8K in the paper. If you drop it, mini-batch TTT has no short-term memory to lean on. šŸž Anchor: Predicting the next sentence mostly depends on the last several pages.

  • šŸž Hook: You highlight the current section as you read it. 🄬 Mini-batch TTT: Do gradient updates after reading batches of b tokens (b=1K). If b is too big, you miss within-batch memory; if too small, training becomes unstable/slow. šŸž Anchor: Every 1,000 words, update your notes.

  • šŸž Hook: Keep some shelves for new notes and some for your core knowledge. 🄬 Update only last 1/4 blocks’ MLPs + add a static MLP: Updated MLPs store fresh context; static MLPs keep pre-trained knowledge safe. Without this, you risk forgetting or waste compute. šŸž Anchor: Bottom shelves are for permanent books; top shelves get sticky notes you can swap quickly.

  • šŸž Hook: Practice to learn fast during the real exam. 🄬 Meta-learning: Train the model’s starting point so small TTT steps help a lot. Without it, test-time learning can be weak or noisy. šŸž Anchor: Speed drills before race day make you accelerate smoothly the moment the whistle blows.

  • šŸž Hook: Skim only recent notes when writing the next paragraph. 🄬 Decoding multiple tokens: Only take TTT steps after enough new decoded tokens fill a mini-batch, keeping decode fast most of the time. šŸž Anchor: Write a page, then revise your notes, then continue.

Net effect: For 3B models trained on 164B tokens, the method matches full attention’s scaling with context length while keeping constant prefill latency and being 2.7Ɨ faster than full attention at 128K.

03Methodology

At a high level: Input tokens → Sliding-window prefill → Mini-batch TTT updates of selected MLPs → Predict next tokens (decode, with occasional updates) → Output.

Step 1: Set up the backbone and memory plan

šŸž Hook: Imagine you read with a window and keep a notebook for key ideas.

🄬 What it is: A standard Transformer with sliding-window attention (window k=8K) and a plan to update only the last quarter of blocks’ MLPs during test time. How it works:

  1. Keep attention local (k=8K) for short-term context.
  2. Choose a TTT mini-batch size b=1K.
  3. Mark the last 1/4 of blocks as ā€œfast memoryā€ (their MLPs get updated during TTT) and add a second, static MLP in those blocks as safe storage.

Why it matters: Without a short-term window, the model has nothing to latch onto between updates; without selective updates, it’s too slow or forgetful.

šŸž Anchor: While reading a 128K-token book, you always see the last 8K tokens and write notes in the top-quarter shelves.
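
A hypothetical configuration sketch that collects the paper's reported choices in one place; the field names, block count, and test-time learning rate are our own placeholders, not values from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class TTTConfig:
    n_blocks: int = 24              # transformer blocks (illustrative count)
    window: int = 8_192             # sliding-window attention span k
    ttt_batch: int = 1_024          # TTT mini-batch size b
    updated_fraction: float = 0.25  # update MLPs only in the last 1/4 of blocks
    inner_lr: float = 1e-3          # test-time learning rate (assumed, not reported here)

    def updated_block_ids(self) -> list[int]:
        start = int(self.n_blocks * (1 - self.updated_fraction))
        return list(range(start, self.n_blocks))

cfg = TTTConfig()
print(cfg.updated_block_ids())  # e.g., blocks 18..23 hold the "fast memory" MLPs
```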

Step 2: Prefill the context with sliding-window attention

šŸž Hook: You skim through the text chunk by chunk, keeping the last few pages fresh in mind.

🄬 What it is: Prefill means passing through the whole input once to compute states and caches. How it works:

  1. Process tokens sequentially with k-limited attention.
  2. Keep computation per token constant regardless of total length.
  3. Store activations needed for TTT updates (with gradient checkpointing through time in training).

Why it matters: Full attention would grow more expensive with longer texts; here, cost per token stays the same.

šŸž Anchor: Reading 128K tokens costs about the same per step as 8K, just repeated more times.
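
As a toy picture of why per-token cost stays flat, the sketch below keeps a cache that never grows beyond the window; the "layer" itself is a simplified stand-in for windowed attention, not the paper's architecture.

```python
from collections import deque
import torch

def windowed_prefill(token_embs: torch.Tensor, k: int) -> torch.Tensor:
    """Process tokens one by one, attending over at most the last k cached states."""
    kv_cache: deque = deque(maxlen=k)        # only the last k states survive
    outputs = []
    for x in token_embs:                     # one token at a time
        if kv_cache:
            cache = torch.stack(list(kv_cache))        # (<=k, d)
            weights = torch.softmax(cache @ x, dim=0)  # attend over <=k entries
            context = weights @ cache
        else:
            context = torch.zeros_like(x)
        h = x + context                      # toy "layer": token plus attended context
        kv_cache.append(h)                   # cache this state, evicting the oldest
        outputs.append(h)
    return torch.stack(outputs)

hs = windowed_prefill(torch.randn(32, 16), k=8)  # 32 tokens, window of 8
print(hs.shape)  # torch.Size([32, 16])
```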

Step 3: Test-Time Training (TTT) on the given context

šŸž Hook: Every so often, you pause and improve your notes based on what you just read.

🄬 What it is: While reading, the model trains itself on next-token prediction for the current mini-batch (b=1K), then uses the updated weights. How it works:

  1. For a mini-batch of b tokens, compute next-token prediction loss (cross-entropy) at each step.
  2. Average gradients across the b steps.
  3. Update only the selected MLPs in the last 1/4 of blocks (freeze embeddings, norms, attention).

Why it matters: This compresses what was read into the weights so the model can use the bigger context, while staying fast.

šŸž Anchor: Every 1,000 words, update your sticky notes and continue.
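
Here is a rough sketch of one such update on a toy model (our stand-in, not the 3B model from the paper): everything is frozen except the MLPs in the last quarter of blocks, and one averaged cross-entropy loss over a b-token mini-batch drives a single optimizer step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # placeholder for (windowed) attention
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        return x + self.mlp(x + self.attn(x))

class ToyLM(nn.Module):
    def __init__(self, vocab=100, dim=32, n_blocks=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, vocab)
    def forward(self, tokens):
        x = self.embed(tokens)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

model = ToyLM()
fast_blocks = model.blocks[len(model.blocks) * 3 // 4 :]       # last 1/4 of blocks
fast_params = [p for blk in fast_blocks for p in blk.mlp.parameters()]

# Freeze everything, then re-enable only the selected MLPs ("fast memory").
for p in model.parameters():
    p.requires_grad_(False)
for p in fast_params:
    p.requires_grad_(True)

ttt_opt = torch.optim.SGD(fast_params, lr=1e-3)

minibatch = torch.randint(0, 100, (1024 + 1,))                 # b = 1K context tokens
logits = model(minibatch[:-1])
loss = F.cross_entropy(logits, minibatch[1:])                  # averaged over the b steps
ttt_opt.zero_grad()
loss.backward()
ttt_opt.step()                                                 # one TTT step; then keep reading
```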

Step 4: Meta-learned initialization (training time)

šŸž Hook: Train to be great at learning quickly when it counts.

🄬 What it is: During training, pretend each sequence is a test; run the same TTT inner loop, then optimize the model’s starting weights to do well after TTT. How it works:

  1. Inner loop: perform TTT on a sequence using next-token prediction loss.
  2. Outer loop: compute ā€œloss after TTT,ā€ then backpropagate through the inner loop (gradients of gradients) to improve the initialization.
  3. Repeat across many sequences.

Why it matters: Without this, TTT can be wobbly; with it, small updates at test time reliably help.

šŸž Anchor: Practicing sprints that include accelerations so you start fast on race day.
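
The sketch below shows the "loss after TTT" structure with a single fast-weight matrix standing in for the TTT-updated MLPs; `create_graph=True` is what lets the gradients of gradients flow to the initialization. All names and sizes are illustrative assumptions, not the paper's training code.

```python
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
embed = torch.nn.Embedding(vocab, dim)
head = torch.nn.Linear(dim, vocab)
W_fast_init = torch.nn.Parameter(torch.zeros(dim, dim))   # meta-learned initialization
slow_params = list(embed.parameters()) + list(head.parameters()) + [W_fast_init]
outer_opt = torch.optim.Adam(slow_params, lr=1e-3)
inner_lr = 0.1

def lm_loss(tokens, W_fast):
    x = embed(tokens[:-1])
    x = x + x @ W_fast                       # "fast memory" applied to hidden states
    return F.cross_entropy(head(x), tokens[1:])

for _ in range(10):                          # outer training steps (toy scale)
    seq = torch.randint(0, vocab, (256,))    # one training sequence, treated as a "test"
    first, second = seq[:128], seq[128:]

    # Inner loop (TTT): adapt the fast weights on the first half of the sequence.
    inner = lm_loss(first, W_fast_init)
    (g,) = torch.autograd.grad(inner, W_fast_init, create_graph=True)
    W_fast = W_fast_init - inner_lr * g

    # Outer loop: loss AFTER TTT, on the rest of the sequence.
    outer = lm_loss(second, W_fast)
    outer_opt.zero_grad()
    outer.backward()                         # backprop through the inner step
    outer_opt.step()
```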

Step 5: Decoding multiple tokens efficiently

šŸž Hook: Write a paragraph, then pause to update your notes, then continue.

🄬 What it is: Generate several tokens; only when they fill a mini-batch do you take a TTT step. How it works:

  1. Decode next tokens normally with sliding-window attention (fast).
  2. Once you have b new decoded tokens, take a TTT step on them.
  3. Continue decoding with the updated weights.

Why it matters: Keeps decode fast most of the time, only occasionally pausing to learn.

šŸž Anchor: Draft a page, then tighten your outline, then draft the next page.
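
A compact sketch of this schedule, using a toy model and made-up sizes: decoding proceeds normally, and only when b freshly generated tokens have accumulated does the model pause for one TTT step on them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, b = 100, 32, 1024
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
ttt_opt = torch.optim.SGD(model.parameters(), lr=1e-3)

generated = [torch.randint(0, vocab, (1,)).item()]   # seed token (stand-in for a prompt)
pending = []                                         # new tokens since the last TTT step

for _ in range(4096):                                # decode 4K tokens
    context = torch.tensor(generated[-8:])           # tiny "window" for this toy model
    with torch.no_grad():
        next_tok = model(context)[-1].argmax().item()  # fast, ordinary decoding
    generated.append(next_tok)
    pending.append(next_tok)

    if len(pending) == b:                            # a mini-batch of new text is ready
        chunk = torch.tensor(pending)
        loss = F.cross_entropy(model(chunk[:-1]), chunk[1:])
        ttt_opt.zero_grad()
        loss.backward()
        ttt_opt.step()                               # one TTT step on the just-written text
        pending = []                                 # continue decoding with updated weights
```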

Implementation details that are the ā€œsecret sauceā€:

  • Update only MLP layers during TTT: This keeps the outer loop stable and reduces compute. If you update attention and norms too, the outer loop training can become unstable.
  • Update only last 1/4 of blocks: Enough capacity to store long-context info but not too much compute; ablations show 1/4 is a sweet spot.
  • Two MLPs per updated block: One stays static to preserve pre-trained knowledge; the other is the fast memory updated by TTT.

Concrete data example:

  • Context length T=128K, window k=8K, TTT batch b=1K.
  • The model reads the first 1K tokens, computes next-token losses across them, and updates the selected MLPs once.
  • It repeats this across the whole 128K context, keeping latency per token constant.
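
For scale, a tiny calculation based on the numbers above (illustrative only):

```python
T, k, b = 128_000, 8_000, 1_000

num_ttt_updates = T // b   # one weight update per 1K-token mini-batch
max_attended = k           # each token compares against at most k cached tokens

print(num_ttt_updates)     # 128 TTT updates over the whole context
print(max_attended)        # 8,000 tokens visible at any step, regardless of T
```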

What breaks without each step:

  • Without sliding-window attention: Within a mini-batch, the model has no short-term context; performance drops.
  • Without mini-batches (or with too large b): Instability (too small) or within-batch forgetting (too large) hurts accuracy.
  • Without selective MLP updates: Either too slow (update too much) or too little capacity (update too little).
  • Without meta-learning: TTT helps much less; can even hurt in places.

Overall pipeline: Input → Sliding-window prefill → For each mini-batch: compute losses → Update last 1/4 blocks’ MLPs → Continue → Decode tokens (occasional updates) → Output.

04Experiments & Results

The test: Measure how well models use longer context while staying fast. Main metrics: test loss (lower is better) and latency (seconds per 1K tokens). Compare methods across context lengths up to 128K and sizes up to 3B parameters.

Competition (baselines):

  • Transformer with full attention (gold standard for recall, but slow).
  • Sliding-Window Attention (SWA) only.
  • Hybrid of SWA and full-attention layers (5:1 ratio).
  • Mamba 2 and Gated DeltaNet (RNN-style hybrids).
  • TTT-KVB (a prior TTT variant with key–value binding).

Scoreboard with context scaling (3B models, up to 128K):

  • Accuracy: TTT-E2E scales with context like full attention (i.e., keeps improving as context grows), while SWA-only and RNN baselines degrade at very long contexts.
  • Speed: Prefill latency stays constant with context length for TTT-E2E, like SWA/RNNs, and is about 2.7Ɨ faster than full attention at 128K on an H100.
  • Where the gains come from: Token-level breakdown shows TTT-E2E’s loss is below full attention across the whole context; most of the average gain comes from earlier tokens. Intuition: TTT-E2E focuses on the present mini-batch, while full attention’s weights must be ready for all possible futures at once.

Ablations (what matters most):

  • Window size k: Larger is better for everyone; set to 8K by default for speed/benefit balance.
  • TTT mini-batch b: Smaller b helps performance; b=1K is the sweet spot balancing accuracy, stability, and hardware.
  • Number of TTT-updated layers: Updating 1 or 3 layers doesn’t scale well with context; updating 6 or 12 layers does. Since updating 6 layers roughly matches 12 in quality at lower cost, they pick the last 1/4 of blocks.
  • Architecture-only changes (without TTT) don’t move the needle; TTT is the main driver.

Scaling with compute:

  • With more model size and more training tokens, TTT-E2E tracks full attention’s scaling trend once past a moderate compute threshold (e.g., around 760M params or ~48B tokens in these runs). Below that, differences can appear due to undertraining effects.
  • Data and tokenizer quality matter: Newer tokenizers (e.g., Llama 3) and higher-quality data (e.g., DCLM) improved margins in anecdotal tests.

Recall tests (Needle-in-a-Haystack, 128K):

  • Full attention wins clearly for pure recall: when the task is to retrieve a specific string buried in a long text, full attention’s near-lossless memory is best.
  • TTT-E2E is competitive with efficient baselines but not with full attention here—consistent with the method’s philosophy: compress what matters, don’t remember every detail.

Decoding long sequences:

  • Using Qwen-8B as an evaluator, TTT-E2E had lower evaluator loss than full attention over 16K-token generations following 8K-token prefills. Both methods showed a spike at the prefill/decode boundary (evaluator adjusting to new text style), which then decayed as more generated text accumulated.

Training efficiency:

  • Inference: TTT-E2E uses standard training infrastructure at test time; no custom kernels needed beyond standard attention.
  • Training: Outer-loop meta-learning requires gradients of gradients and gradient checkpointing through time. At 8K, training latency is ~3.4Ɨ slower than full attention; at 128K it’s ~1.2Ɨ faster (on H200). FLOPs per token are constant, but latency increases with context due to extra checkpointing.
  • Future speed-ups: FlashAttention-like support for second-order gradients and initializing from a pre-trained non-TTT model could cut training overhead substantially.

Bottom line with context: At 128K, TTT-E2E turns the ā€œworst lineā€ (sliding-window only) into the best among efficient methods, matching full attention’s scaling trend while keeping constant latency—and beating it on speed.

05Discussion & Limitations

Limitations:

  • Pure recall tasks: If you must retrieve an exact string anywhere in 128K tokens (like strict Needle-in-a-Haystack), full attention still dominates. TTT-E2E trades exact recall for fast, strong compression.
  • Training overhead: Meta-learning with gradients of gradients and through-time checkpointing makes training slower, especially at short contexts; engineering work is needed to close the gap.
  • Sensitivity to hyperparameters: b (TTT mini-batch size), k (window size), and number of updated layers meaningfully affect results.
  • Risk of forgetting: Although mitigated by updating only some MLPs and adding a static MLP, careless settings could overwrite useful knowledge.
  • Data/tokenizer quality: Results improved with newer tokenizers and higher-quality data; weaker setups may reduce gains.

Required resources:

  • GPUs with good memory bandwidth (e.g., H100/H200/GB200) and frameworks that support second-order gradients (JAX/PyTorch XLA/etc.).
  • Long-sequence datasets for both pretraining and extension fine-tuning.
  • Engineering to manage gradient checkpointing through time and stable meta-learning runs.

When NOT to use it:

  • If your task is exact retrieval across huge contexts (e.g., pinpointing a UUID in 128K reliably), full attention or hybrid models with more full-attn layers may be preferable.
  • If you have very short contexts (e.g., 2K–4K) and care only about quickest training, the extra training complexity may not pay off.
  • Extremely resource-constrained training environments where second-order gradients aren’t feasible.

Open questions:

  • Can we get the same benefits with cheaper training? (e.g., pretrain a standard model, then short meta-learn TTT only at the end.)
  • How far can this scale (256K, 1M context) with the same simple recipe? What fraction of blocks should be updated at 1M?
  • Can we mix in small doses of exact recall (occasional full-attn layers) to improve NIAH without losing speed?
  • What are the best self-generated signals for TTT during long decoding (e.g., reviewer summaries or distilled notes between batches)?
  • Can smarter, learned optimizers or gating improve what gets written into ā€œfast memoryā€ and prevent spurious updates?

06Conclusion & Future Work

Three-sentence summary: This paper reframes long-context language modeling as continual learning: use a standard sliding-window Transformer, but keep training it at test time on the very text you’re reading (TTT) and train the model to be great at that with meta-learning. The result matches full attention’s ability to benefit from longer contexts while keeping constant latency like RNNs, and runs 2.7Ɨ faster than full attention at 128K during prefill. It outperforms efficient baselines on language modeling but concedes pure recall to full attention.

Main achievement: Proving that a standard architecture plus end-to-end TTT (next-token prediction at test time and meta-learning at train time) can deliver full-attention-like context scaling with constant inference latency.

Future directions:

  • Reduce training cost via better kernels for second-order gradients and by initializing from non-TTT pretrains.
  • Explore hybrid models that blend a touch of exact recall for NIAH-like tasks without losing speed.
  • Scale to million-token contexts and refine how much of the network gets updated at test time.
  • Use self-generated summaries or reviews as TTT targets during long decoding.

Why remember this: It shifts the mindset from ā€œinvent a new architectureā€ to ā€œlearn while you read,ā€ offering a simple, practical path to efficient long-context reasoning: compress what matters now into fast, updateable weights—and keep going.

Practical Applications

  • Summarize long documents (books, reports) efficiently by adapting to each document’s style as it reads.
  • Maintain long customer support chats while staying fast, adapting to each customer’s terminology on the fly.
  • Analyze large codebases by learning local project conventions as it parses files, without re-indexing everything.
  • Review long legal contracts, compressing key clauses into fast memory for quick follow-up questions.
  • Process longitudinal medical notes, adapting to patient-specific terms and histories while keeping latency low.
  • Do research over many papers: as it reads new sections, it updates its understanding and improves the next predictions.
  • Generate long-form content (chapters, technical docs) with periodic TTT updates on the just-written text to keep style and facts consistent.
  • Handle streaming logs or transcripts by continuously adapting to new patterns in real time.
  • Power educational tools that adapt to a student’s writing over long assignments, giving tailored feedback quickly.
  • Support enterprise knowledge assistants that read and adapt to internal documents without heavy retraining.
Tags: Test-Time Training, Meta-learning, Long-context language modeling, Sliding-window attention, Continual learning, Fast weights, Dynamic evaluation, Transformer, RNN hybrids, Next-token prediction, End-to-End training, Inference latency, Context scaling, Needle-in-a-Haystack, Gradient of gradients