Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Key Summary
- The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.
- Under the same total number of training updates, more epochs on fewer samples gave much higher scores on tough tests like AIME and GPQA.
- Improvements rose with repetition until the model perfectly remembered the training tokens; after that, gains stopped growing.
- A simple, cheap signal, training token accuracy, can tell you when to stop repeating because benefits have saturated.
- Repeating data also helped the model finish its thoughts (terminate) much more often, which strongly tracked better scores.
- Despite heavy repetition, there was no extra catastrophic forgetting compared to training once on much larger datasets.
- Data quality and teacher-model strength still matter, but the repetition advantage held across different models and datasets.
- Even training on incorrect reasoning traces did not hurt, and sometimes helped on certain benchmarks, though overall less than correct traces.
- Validation loss on held-out SFT data looked worse with more repetition, yet real reasoning performance kept improving, so loss is a bad guide here.
- This gives a practical recipe: use a small, random SFT subset, train many epochs, and stop when training token accuracy reaches 100%.
Why This Research Matters
This work gives teams a cheaper, faster way to improve reasoning models: repeat a small, random SFT subset for many epochs and stop when training token accuracy saturates. That means better math, science, and logic help for students and professionals without needing to collect mountains of expensive data. It also reduces compute and energy costs, making strong reasoning models more accessible to smaller labs and companies. The finding that termination rises with repetition translates to clearer, more complete answers in real tools. Even when typical validation loss looks worse, real reasoning improves, so this changes how we monitor and stop training. Overall, it reshapes best practices for building reliable, step-by-step AI assistants.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're practicing piano for a big recital. You could try to play every song in the world once, or you could pick a small set of hard songs and practice them again and again until your fingers know them by heart.
The Concept: Supervised Fine-Tuning (SFT) is the part of training where we show a language model examples of how we want it to think and answer, especially with detailed step-by-step reasoning called Chain-of-Thought (CoT).
- What it is: SFT uses example questions and the exact desired responses to shape how the model behaves.
- How it works: 1) Pick example prompts and desired responses, 2) Ask the model to predict the next token in those responses, 3) Adjust the model whenever it's wrong, 4) Repeat many times.
- Why it matters: Without SFT, a model that knows a lot of facts might still struggle to show its work or follow the structure we want.
Anchor: Like a music teacher playing a phrase and asking you to mimic it exactly until you get the rhythm and timing right.
The world before: For years, the common wisdom in machine learning was "more unique data is better." Each fresh example was thought to add new information and reduce overfitting. This matches success stories in pretraining: bigger, more diverse corpora helped models learn wide world knowledge. So in post-training (like SFT), the recipe often became: collect massive instruction or CoT datasets and do a light pass (one or two epochs).
The problem: Long, high-quality CoT data is expensive. It either needs careful humans or careful distillation from bigger teacher models. Getting millions of perfect step-by-step examples is slow and pricey. If we have a fixed training budget (we can only do so many updates), is it really best to spread those updates across as many unique samples as possible?
Failed attempts: Teams typically scaled SFT by adding more and more unique samples and limited epochs to just one or a few. This aligns with statistical learning theory under i.i.d. data, but practical results for reasoning were mixed, especially when models failed to finish their long answers (they didn't terminate) or didn't internalize the structure of reasoning.
The gap: We lacked a careful, controlled comparison of "many unique samples, few epochs" versus "few unique samples, many epochs," under the exact same total number of updates for long-CoT SFT.
đ Hook: You know how repeating a tough math problem until you nail every step makes you faster and more confident than doing a hundred different easy problems just once?
The Concept: Data Repetition Advantage says that, for long-CoT SFT, going over the same smaller set many times can work better than using a much larger set only once (when total updates are the same).
- What it is: A surprising flip of the usual rule: repeated practice on a small set helps reasoning more.
- How it works: 1) Choose a modest random subset of SFT data, 2) Train for many epochs, 3) Watch token-level accuracy on the training set rise to near 100%, 4) Stop when gains saturate.
- Why it matters: Without repetition, the model often doesn't absorb the structure of long reasoning or reliably finish with a final answer.
Anchor: Like drilling a short playlist of hard piano pieces until you can play them perfectly, which then makes you better at similar new pieces in a concert.
Real stakes: This matters for how we build and deploy reasoning models in real life. If repetition on small, carefully chosen data can beat huge datasets, we can cut costs and still get better performance. That affects homework helpers, coding assistants, science tutoring, and any job where a model must think in steps and conclude clearly. It also helps labs with smaller budgets make strong models without hoarding massive, expensive datasets.
New twist: Even though repeating a small set until the model fully memorizes it sounds like "overfitting," the paper finds that real-world reasoning performance still rises and then plateaus, rather than collapsing. This suggests SFT for reasoning is more like teaching the model to express capabilities it already has, not just stuffing in new facts.
02 Core Idea
Hook: Think about practicing free throws. If you take 1,000 random shots from all over the court, you'll get some practice. But if you take 300 free throws and repeat them day after day, your free-throw form becomes automatic, and your game-time shots get way better.
The Concept: The key insight is that, for long Chain-of-Thought SFT under a fixed update budget, many epochs on a smaller dataset beat one epoch on a much larger dataset.
- What it is: Given the same number of total training steps, repetition focuses learning on how to structure and complete long reasoning.
- How it works: 1) Hold total updates constant, 2) Trade unique samples for more epochs, 3) The model rapidly increases training token accuracy, 4) Termination and benchmark scores climb together, 5) Gains level off once token accuracy hits ~100%.
- Why it matters: Without repetition, the model often under-learns the rhythm of long reasoning and fails to finish answers, capping real performance.
Anchor: Like perfecting a few hard drills until your body remembers them, which then helps you play better in real games with new plays.
Multiple analogies:
- Music: Repeating a tricky passage until your fingers "know it" gives you flawless control you can reuse in a new song.
- Sports: Shooting the same free throw many times engrains posture, release, and follow-through, useful in any game.
- Cooking: Practicing a base sauce over and over makes your timing and heat control automatic, so you can handle new recipes confidently.
Before vs After:
- Before: Teams often favored giant SFT datasets with few epochs, expecting diversity to generalize best.
- After: This paper shows that, for reasoning with long CoT, repetition builds strong internal habits (finish your thoughts, follow structure) that transfer better than a one-pass tour of a massive dataset.
Hook: You know how teachers sometimes say, "Show your work and box your final answer"?
The Concept: Termination Rate is how often the model actually concludes its reasoning and outputs a final answer (instead of trailing off or getting cut).
- What it is: A simple count of finished generations.
- How it works: 1) Track if the model reaches an end token, 2) More repetition → more consistent endings, 3) More endings → more answers scored.
- Why it matters: If the model doesn't finish, it can't be correct, even if the middle steps were good.
Anchor: Like handing in a test: if you don't write the final answer, you can't get full credit.
Why it works (intuition):
- Repetition encourages the model to internalize the structure of long answers: start, reason in steps, box the final answer, stop. This "procedural fluency" seems more important than seeing more unique problems just once.
- Token Accuracy on the training set acts like a gas gauge: once it hits full (near 100%), extra laps don't add mileage; the generalization gains plateau.
- Even though validation loss on held-out SFT data may get worse (a classic overfitting sign), real-world reasoning gets better because the model is solidifying a useful behavior pattern, not just memorizing trivia.
Building blocks (mini lessons): Hook: Imagine practicing the same poem until you can say it word for word. The Concept: Token Accuracy measures how often the model's next-token guess matches the training token.
- What it is: A per-token correctness score.
- How it works: Compare the modelâs top guess to the true next token for each step; average across tokens.
- Why it matters: When token accuracy on the train set hits ~100%, repetition has "taught the habit," and gains usually stop growing. Anchor: Like reciting every word of the poem exactly right without peeking.
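The per-token accuracy described above can be sketched in a few lines. This is an illustrative NumPy version, not the paper's code; the function name, array shapes, and the `-100` ignore index (a common convention for masking prompt tokens) are assumptions:

```python
import numpy as np

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of positions where the model's top next-token guess
    matches the training token. Positions marked with ignore_index
    (e.g. prompt tokens, since loss is on response tokens only) are
    excluded from the average."""
    preds = logits.argmax(axis=-1)     # greedy guess at each position
    mask = labels != ignore_index      # score response tokens only
    return (preds[mask] == labels[mask]).mean()

# Toy check: vocab of 4 tokens, 3 scored positions, 2 correct guesses.
logits = np.array([[0.1, 2.0, 0.0, 0.0],   # argmax -> 1 (matches label)
                   [3.0, 0.0, 0.0, 0.0],   # argmax -> 0 (label is 2)
                   [0.0, 0.0, 0.0, 4.0],   # argmax -> 3 (matches label)
                   [9.0, 0.0, 0.0, 0.0]])  # masked out (prompt token)
labels = np.array([1, 2, 3, -100])
print(token_accuracy(logits, labels))  # 2 of 3 scored tokens, ~0.667
```

Averaged over the fixed training subset at each epoch end, this is the "gas gauge" the recipe watches.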
Hook: Think of lap counts on a track. The Concept: An Epoch is one full pass through the training dataset.
- What it is: How many times youâve seen the whole set.
- How it works: Repeat the same examples again; each pass refines the modelâs habits.
- Why it matters: More epochs on a smaller set = deeper mastery of structure. Anchor: Running the same track multiple times builds smoother, more controlled strides.
Hook: You have a fixed amount of practice time before dinner. The Concept: Update Budget means the total number of training steps you can spend.
- What it is: A hard cap on effort (epochs × samples at batch size one).
- How it works: Keep this fixed, then swap between more samples/fewer epochs vs fewer samples/more epochs.
- Why it matters: The paper shows the latter wins for long-CoT SFT. Anchor: With only 60 minutes, doing 6 perfect drills 10 times each beats doing 60 different drills once.
03 Methodology
At a high level: Input (a long-CoT SFT dataset) → Choose a fixed update budget → Pick a samples-epochs pair that multiplies to that budget → Train the model only on response tokens → Evaluate on reasoning benchmarks with multiple generations per problem → Record accuracy, Pass@N, and termination → Repeat for many pairs, compare.
đ Hook: Picture a cooking contest where every team gets the same amount of cooking time, but they can choose whether to practice lots of different recipes once, or a small set many times.
The Concept: Fixed Update Budget ensures a fair comparison between "more data, fewer epochs" and "less data, more epochs."
- What it is: The total number of gradient updates is the same for each training run.
- How it works: 1) Set a budget (e.g., 51,200 updates), 2) Create many configs whose epochs × unique samples = budget, 3) Train each from the base checkpoint.
- Why it matters: Without this, we might confuse the benefit of more total practice with the benefit of repetition itself.
Anchor: Every chef gets 4 hours total; some practice 8 recipes once, others practice 2 recipes repeatedly. Now we can compare fairly.
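The epochs × samples = budget constraint can be made concrete with a tiny helper that enumerates the matched configurations. This is a minimal sketch under the paper's batch-size-1 setup; the function and field names are illustrative, not from the authors' code:

```python
# Sketch: enumerate samples-epochs pairs that all consume the same
# update budget, so epochs * unique_samples = budget for every config
# (batch size 1, as in the paper's training settings).
BUDGET = 51_200

def configs(budget, min_samples=200):
    """Subset sizes double (200, 400, 800, ...) to mirror the nested
    subsets; each valid size gets budget // size epochs."""
    out = []
    n = min_samples
    while n <= budget:
        if budget % n == 0:  # only exact divisors keep updates equal
            out.append({"unique_samples": n, "epochs": budget // n})
        n *= 2
    return out

for c in configs(BUDGET):
    assert c["unique_samples"] * c["epochs"] == BUDGET
    print(c)  # e.g. {'unique_samples': 200, 'epochs': 256} ... up to 1 epoch
```

Every run in the sweep then starts from the same base checkpoint, so any score difference comes from the repetition trade itself, not from extra training.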
Recipe steps:
- Models and starting point
- What happens: Use base (pre-instruction) checkpoints: Olmo3-7B, Qwen3-8B, and Qwen3-4B, each with its chat template.
- Why this step exists: Ensures a clean view of SFT dynamics without earlier instruction tuning muddying the results.
- Example: Start Olmo3-7B at its base, then run SFT variants from scratch for each config.
- Training data prep
- What happens: Use long-CoT SFT data (Dolci SFT 7B). Keep only first turns with full CoT (<think>…</think>), remove extra-long samples (>10k tokens), and build nested subsets: 200 → 400 → … → 51,200.
- Why this step exists: Guarantees long, structured reasoning signals and controlled, comparable subsets.
- Example: A 1,600-sample subset is fully contained within the 3,200-sample subset.
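The nested-subset construction in this step amounts to shuffling once and taking prefixes. A minimal sketch, assuming this prefix scheme (the `nested_subsets` helper and its seed handling are illustrative, not the authors' code):

```python
import random

def nested_subsets(dataset, sizes, seed=0):
    """Shuffle the dataset once, then slice prefixes, so every smaller
    subset is fully contained in every larger one (e.g. the 1,600-sample
    subset lies entirely inside the 3,200-sample subset)."""
    rng = random.Random(seed)      # fixed seed keeps subsets reproducible
    order = list(dataset)
    rng.shuffle(order)
    return {k: order[:k] for k in sizes}

# Toy run with integer ids standing in for SFT samples.
subs = nested_subsets(range(51_200), sizes=[200, 400, 800, 1600])
assert set(subs[200]) <= set(subs[400]) <= set(subs[1600])  # nesting holds
```

Nesting matters for the comparison: a larger subset differs from a smaller one only by added samples, never by swapped ones.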
- Training settings
- What happens: bfloat16 weights, Unsloth kernels, 8-bit Adam, cosine LR, 10% warmup, batch size 1, loss on response tokens only, LR tuned on the big single-epoch run then reused.
- Why this step exists: Keep training stable, fast, and comparable across configs.
- Example: With budget 51,200, if you choose 3,200 samples, thatâs 16 epochs; if you choose 51,200 samples, thatâs 1 epoch.
- Evaluation benches and metrics
- What happens: Test on AIME 2024, AIME 2025 (30 math problems each), and GPQA (graduate-level science multiple-choice). For each problem: request a boxed final answer, sample long generations (up to 30k tokens) with multiple tries per question.
- Why this step exists: These tasks require multi-step reasoning, long chains, and clean endingsâperfect to see if repetition helps structure.
- Example: For AIME we do 16 generations per problem; for GPQA we do 4.
Hook: Like grading not just your final exam score but also whether you actually finished the exam!
The Concept: Pass@N and Termination Rate capture different success flavors beyond simple average accuracy.
- What it is: Pass@N = solved at least once across N tries; Termination = fraction of runs that cleanly end.
- How it works: 1) Make several attempts per problem, 2) Track if any reached the correct answer (Pass@N), 3) Track if outputs ended properly (Termination).
- Why it matters: Reasoning can be streaky; finishing cleanly and getting at least one right answer in several tries are both important signals.
Anchor: If you take 16 shots at a puzzle, Pass@16 asks, "Did you solve it at least once?" Termination asks, "Did you actually complete your attempt or leave it hanging?"
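These three metrics can be computed from per-problem attempt logs in a few lines. A minimal sketch; the `score_runs` helper and its (terminated, correct) tuple format are illustrative assumptions, not the paper's evaluation code:

```python
def score_runs(attempts):
    """attempts: one list per problem of (terminated, correct) tuples,
    one tuple per sampled generation. Returns mean accuracy over all
    generations, Pass@N over problems, and termination rate."""
    n_gen = sum(len(a) for a in attempts)
    acc = sum(c for a in attempts for _, c in a) / n_gen
    pass_at_n = sum(any(c for _, c in a) for a in attempts) / len(attempts)
    term = sum(t for a in attempts for t, _ in a) / n_gen
    return acc, pass_at_n, term

# Two problems, two tries each: problem 1 solved once, problem 2 never;
# one generation fails to terminate (truncated before the boxed answer).
runs = [[(True, True), (True, False)],
        [(True, False), (False, False)]]
acc, p_at_n, term = score_runs(runs)
print(acc, p_at_n, term)  # 0.25 0.5 0.75
```

Note how the metrics disagree: average accuracy is low, yet Pass@2 shows one problem was solvable, and termination exposes the truncated run that plain accuracy hides.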
- The secret sauce: token accuracy as a stop sign
- What happens: While training, measure token accuracy on a small, fixed training subset.
- Why this step exists: It rises mainly with more epochs and stops improving around perfect memorization; that's where downstream gains also stop rising.
- Example: Olmo3-7B reaches nearly 100% train token accuracy by ~16–32 epochs, and benchmark improvements level off right there.
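The stop-sign idea can be sketched as a tiny early-stopping check. The 99.5% threshold and two-epoch patience window here are illustrative choices, not values from the paper:

```python
def should_stop(token_acc_history, threshold=0.995, patience=2):
    """Illustrative early-stopping rule: stop once train token accuracy
    has stayed at or above the threshold for `patience` consecutive
    epoch-end measurements, i.e. the training tokens are effectively
    memorized and downstream gains have saturated."""
    recent = token_acc_history[-patience:]
    return len(recent) == patience and all(a >= threshold for a in recent)

# Epoch-end token accuracy on the fixed probe subset (made-up numbers).
history = [0.62, 0.81, 0.93, 0.98, 0.996, 0.999]
print(should_stop(history))  # True: the last two epochs sit above 99.5%
```

Because the signal is computed on training data the model has already seen, it costs almost nothing, unlike running the full benchmark suite after every epoch.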
- Probes for understanding
- Memorization: Track train token accuracy vs benchmark results.
- Termination: See how often models finish long answers.
- Overfitting signals: Compare train/validation loss and validation prediction entropy.
- Forgetting: Check MMLU (broad knowledge) to see if general skills fade.
- Why this step exists: To separate what looks like overfitting from what actually helps reasoning.
- Example: Validation loss gets worse with epochs, yet AIME/GPQA scores improve, so validation loss isn't a reliable north star here.
What breaks without each step:
- No fixed budget: You can't tell if repetition itself helps or if you just trained longer.
- No long-CoT data: You won't test the right behavior (multi-step structure and finishing).
- No Pass@N/Termination: You miss the model's ability to wrap up answers, a big driver of real scores.
- No token accuracy probe: You lack a simple, cheap stopping rule.
The clever bit: The paper turns a tricky, expensive choice ("collect more data?") into a cheap, actionable recipe ("repeat a small random subset, watch token accuracy, stop at saturation"), which also reduces compute and time while improving results.
04 Experiments & Results
The test: The authors held total training updates fixed and compared many-epochs-on-small-sets versus one-epoch-on-large-sets across several models (Olmo3-7B, Qwen3-8B, Qwen3-4B) and hard reasoning benchmarks (AIME 2024/2025 and GPQA). They measured Accuracy@N, Pass@N, and Termination Rate.
The competition: The baseline everyone expects to win is "more unique samples, one epoch." The challenger is "fewer unique samples, many epochs," matched for the same total updates.
The scoreboard (with context):
- Big headline: For Olmo3-7B, training 32 epochs on 1,600 samples got ~39% average accuracy across benchmarks, versus ~17% for 1 epoch on 51,200 samples, like jumping from a D to a strong B.
- Even more extreme: In some settings, 128 epochs on just 400 samples beat 1 epoch on 51,200 samples by 12–26 percentage points across AIME'24/'25 and GPQA, going from average to standout.
- Consistency: The same "repeat small beats sample big" pattern appeared for Qwen3-8B and Qwen3-4B, and across Accuracy@N and Pass@N, not just one metric.
- Saturation: Gains generally leveled off around 32–64 epochs (when train token accuracy neared 100%), signaling a natural stopping point.
Surprising findings: Hook: Think of finishing your homework: if you don't write the final answer, you don't get the points. The Concept: Termination Rate, the share of runs that end cleanly, closely tracked accuracy improvements as epochs increased.
- What it is: A measure of whether the model wraps up its long answer.
- How it works: More repetition → clearer habit of finishing → higher termination → more answers graded.
- Why it matters: Non-terminating outputs can't score, even if the middle steps are solid. Anchor: Like remembering to circle your final answer on a math test: tiny habit, big effect on your grade.
- Overfitting paradox: As epochs rose, train loss went down and validation loss (and lower validation prediction entropy) suggested classic overfitting. Yet real reasoning performance kept improving. Conclusion: validation loss on held-out SFT doesn't reflect reasoning gains here.
- Less forgetting than expected: Comparing training strategies with the same updates, multi-epoch training on a tiny 200-sample set caused less catastrophic forgetting (on MMLU) than the single-epoch, huge-data approach, while also giving much better reasoning.
Data properties matter, but the pattern stays:
- Teacher size: With better teachers (e.g., Qwen3-8B for distillation), absolute performance rose, and repetition still helped. With weaker teachers (e.g., 0.6B), adding more unique samples could even hurt at larger budgets, echoing weak-to-strong generalization issues, yet repetition still brought gains.
- Incorrect traces: Training on negative (incorrect) trajectories did not tank performance. Repetition still helped and sometimes even matched or beat positives on specific benchmarks like AIME'24 and GPQA, though overall positives remained stronger on average.
Takeaway translation: Repetition doesn't just make the model "memorize answers"; it seems to engrain the structure and habit of long reasoning and finishing, which transfers strongly to new problems. Once that habit is fully internalized (token accuracy ≈ 100% on the train set), you've squeezed out the gains; time to stop.
05 Discussion & Limitations
Limitations:
- Picking the subset size: The best small subset size depends on your data and model. Too tiny wastes capacity; too big spreads practice too thin per sample at a fixed budget. There isn't yet a principled formula to choose it ahead of time.
- Mystery mechanism: We know repetition works and correlates with train token accuracy and termination, but we don't yet fully understand the causal story of why memorization boosts generalization in long-CoT SFT.
- Validation misguidance: Standard validation loss and entropy on held-out SFT examples mislead; they look worse while real reasoning gets better. Teams will need alternative early-stopping and selection signals.
- Domain dependence: This paper focuses on long-CoT reasoning. The same repetition edge may not hold for all tasks, especially those where breadth of content exposure is more critical than procedural fluency.
Required resources:
- Compute: Modest. The recipe often uses far fewer unique samples and can be trained on a single high-memory GPU (e.g., H100 94GB per run) within a day.
- Data: A small, random subset of high-quality long-CoT samples, properly filtered to include full reasoning traces and clear endings.
- Tooling: Ability to log train token accuracy on a fixed subset, measure termination, and run multi-try evaluations (Pass@N) on your target benchmarks.
When not to use:
- If your task is fact-retrieval heavy and not about long reasoning structure, more unique data might still be superior.
- If you cannot generate long outputs (tight context or token limits), the structural benefits of repetition may not appear.
- If your SFT data lacks full reasoning traces or consistent ending conventions, repetition may engrain poor habits.
Open questions:
- Why does full memorization of training tokens align with peak generalization in long-CoT SFT? What internal representations change?
- Can we predict the best subset size for a given model and domain without a big sweep?
- Can we design better training signals than next-token cross-entropy, signals that more directly reward structuring and terminating reasoning?
- How do reinforcement learning stages interact with repetition-primed SFT models? Do they amplify or reduce the repetition gains?
06 Conclusion & Future Work
Three-sentence summary: The paper finds that, for long Chain-of-Thought SFT under a fixed update budget, repeating a small random subset for many epochs beats training once on a much larger dataset. Improvements rise with epochs and closely track training token accuracy and termination rate, then saturate when training tokens are fully memorized, without causing extra catastrophic forgetting. This flips the usual "more unique data is always better" intuition and gives a simple, cheaper recipe for better reasoning models.
Main achievement: Establishing and carefully characterizing the Data Repetition Advantage in long-CoT SFT, and offering a practical, low-cost stopping rule based on training token accuracy.
Future directions:
- Theory that explains why memorization aligns with better generalization in long-CoT SFT.
- Methods to pick the best subset size automatically, given a model and domain.
- New objectives or schedules that accelerate structural learning (reasoning and termination) even further.
- Understanding how repetition-primed SFT interacts with downstream RL fine-tuning and evaluation-time scaling.
Why remember this: It changes how we think about teaching models to reason. Instead of chasing ever-larger SFT datasets, we can repeatedly practice a small, high-quality set, watch a simple signal (token accuracy), and stop when the habit is locked in, saving compute, time, and money while getting better real-world reasoning.
Practical Applications
- Fine-tune a reasoning assistant on a 1–3k random CoT subset for many epochs and stop when training token accuracy nears 100%.
- Use termination rate as a diagnostic: if it's low, increase epochs on the same subset before collecting more data.
- When budget-limited, prefer more epochs over chasing more unique SFT data, especially for math and science reasoning.
- If you must choose between noisy new data and repeating a clean subset, repeat the clean subset first.
- Track token accuracy on a fixed mini-train set each epoch to decide stopping automatically.
- For new domains, start with a small, well-filtered CoT set (complete traces, clear endings) and scale epochs before scaling data.
- Use Pass@N with multiple generations to evaluate practical success during development, not just single-shot accuracy.
- When using distilled data, prefer stronger teachers, but still apply repetition to lock in structure and endings.
- If correct traces are scarce, don't fear including some incorrect ones; repetition can still provide benefits.
- Combine this SFT recipe with later RL stages; begin RL only after token accuracy saturates to save compute.