Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Key Summary
- The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.
- Under the same total number of training updates, more epochs on fewer samples gave much higher scores on tough tests like AIME and GPQA.
- Improvements rose with repetition until the model perfectly remembered the training tokens; after that, gains stopped growing.
- A simple, cheap signal, training token accuracy, can tell you when to stop repeating because benefits have saturated.
- Repeating data also helped the model finish its thoughts (terminate) much more often, which strongly tracked better scores.
- Despite heavy repetition, there was no extra catastrophic forgetting compared to training once on much larger datasets.
- Data quality and teacher-model strength still matter, but the repetition advantage held across different models and datasets.
- Even training on incorrect reasoning traces did not hurt, and sometimes helped on certain benchmarks, though overall less than correct traces.
- Validation loss on held-out SFT data looked worse with more repetition, yet real reasoning performance kept improving, so loss is a bad guide here.
- This gives a practical recipe: use a small, random SFT subset, train many epochs, and stop when training token accuracy reaches 100%.
Why This Research Matters
This work gives teams a cheaper, faster way to improve reasoning models: repeat a small, random SFT subset for many epochs and stop when training token accuracy saturates. That means better math, science, and logic help for students and professionals without needing to collect mountains of expensive data. It also reduces compute and energy costs, making strong reasoning models more accessible to smaller labs and companies. The finding that termination rises with repetition translates to clearer, more complete answers in real tools. Even when typical validation loss looks worse, real reasoning improves, so this changes how we monitor and stop training. Overall, it reshapes best practices for building reliable, step-by-step AI assistants.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're practicing piano for a big recital. You could try to play every song in the world once, or you could pick a small set of hard songs and practice them again and again until your fingers know them by heart.
The Concept: Supervised Fine-Tuning (SFT) is the part of training where we show a language model examples of how we want it to think and answer, especially with detailed step-by-step reasoning called Chain-of-Thought (CoT).
- What it is: SFT uses example questions and the exact desired responses to shape how the model behaves.
- How it works: 1) Pick example prompts and desired responses, 2) Ask the model to predict the next token in those responses, 3) Adjust the model whenever it's wrong, 4) Repeat many times.
- Why it matters: Without SFT, a model that knows a lot of facts might still struggle to show its work or follow the structure we want.
Anchor: Like a music teacher playing a phrase and asking you to mimic it exactly until you get the rhythm and timing right.
The world before: For years, the common wisdom in machine learning was "more unique data is better." Each fresh example was thought to add new information and reduce overfitting. This matches success stories in pretraining: bigger, more diverse corpora helped models learn wide world knowledge. So in post-training (like SFT), the recipe often became: collect massive instruction or CoT datasets and do a light pass (one or two epochs).
The problem: Long, high-quality CoT data is expensive. It either needs careful humans or careful distillation from bigger teacher models. Getting millions of perfect step-by-step examples is slow and pricey. If we have a fixed training budget (we can only do so many updates), is it really best to spread those updates across as many unique samples as possible?
Failed attempts: Teams typically scaled SFT by adding more and more unique samples and limited epochs to just one or a few. This aligns with statistical learning theory under i.i.d. data, but practical results for reasoning were mixed, especially when models failed to finish their long answers (they didn't terminate) or didn't internalize the structure of reasoning.
The gap: We lacked a careful, controlled comparison of "many unique samples, few epochs" versus "few unique samples, many epochs," under the exact same total number of updates for long-CoT SFT.
đ Hook: You know how repeating a tough math problem until you nail every step makes you faster and more confident than doing a hundred different easy problems just once?
The Concept: Data Repetition Advantage says that, for long-CoT SFT, going over the same smaller set many times can work better than using a much larger set only once (when total updates are the same).
- What it is: A surprising flip of the usual rule: repeated practice on a small set helps reasoning more.
- How it works: 1) Choose a modest random subset of SFT data, 2) Train for many epochs, 3) Watch token-level accuracy on the training set rise to near 100%, 4) Stop when gains saturate.
- Why it matters: Without repetition, the model often doesn't absorb the structure of long reasoning or reliably finish with a final answer.
Anchor: Like drilling a short playlist of hard piano pieces until you can play them perfectly, which then makes you better at similar new pieces in a concert.
Real stakes: This matters for how we build and deploy reasoning models in real life. If repetition on small, carefully chosen data can beat huge datasets, we can cut costs and still get better performance. That affects homework helpers, coding assistants, science tutoring, and any job where a model must think in steps and conclude clearly. It also helps labs with smaller budgets make strong models without hoarding massive, expensive datasets.
New twist: Even though repeating a small set until the model fully memorizes it sounds like "overfitting," the paper finds that real-world reasoning performance still rises and then plateaus, rather than collapsing. This suggests SFT for reasoning is more like teaching the model to express capabilities it already has, not just stuffing in new facts.
02 Core Idea
Hook: Think about practicing free throws. If you take 1,000 random shots from all over the court, you'll get some practice. But if you take 300 free throws and repeat them day after day, your free-throw form becomes automatic, and your game-time shots get way better.
The Concept: The key insight is that, for long Chain-of-Thought SFT under a fixed update budget, many epochs on a smaller dataset beat one epoch on a much larger dataset.
- What it is: Given the same number of total training steps, repetition focuses learning on how to structure and complete long reasoning.
- How it works: 1) Hold total updates constant, 2) Trade unique samples for more epochs, 3) The model rapidly increases training token accuracy, 4) Termination and benchmark scores climb together, 5) Gains level off once token accuracy hits ~100%.
- Why it matters: Without repetition, the model often under-learns the rhythm of long reasoning and fails to finish answers, capping real performance.
Anchor: Like perfecting a few hard drills until your body remembers them, which then helps you play better in real games with new plays.
Multiple analogies:
- Music: Repeating a tricky passage until your fingers "know it" gives you flawless control you can reuse in a new song.
- Sports: Shooting the same free throw many times engrains posture, release, and follow-through, useful in any game.
- Cooking: Practicing a base sauce over and over makes your timing and heat control automatic, so you can handle new recipes confidently.
Before vs After:
- Before: Teams often favored giant SFT datasets with few epochs, expecting diversity to generalize best.
- After: This paper shows that, for reasoning with long CoT, repetition builds strong internal habits (finish your thoughts, follow structure) that transfer better than a one-pass tour of a massive dataset.
Hook: You know how teachers sometimes say, "Show your work and box your final answer"?
The Concept: Termination Rate is how often the model actually concludes its reasoning and outputs a final answer (instead of trailing off or getting cut).
- What it is: A simple count of finished generations.
- How it works: 1) Track if the model reaches an end token, 2) More repetition → more consistent endings, 3) More endings → more answers scored.
- Why it matters: If the model doesn't finish, it can't be correct, even if the middle steps were good.
Anchor: Like handing in a test: if you don't write the final answer, you can't get full credit.
Why it works (intuition):
- Repetition encourages the model to internalize the structure of long answers: start, reason in steps, box the final answer, stop. This "procedural fluency" seems more important than seeing more unique problems just once.
- Token Accuracy on the training set acts like a gas gauge: once it hits full (near 100%), extra laps don't add mileage; the generalization gains plateau.
- Even though validation loss on held-out SFT data may get worse (a classic overfitting sign), real-world reasoning gets better because the model is solidifying a useful behavior pattern, not just memorizing trivia.
Building blocks (mini lessons): Hook: Imagine practicing the same poem until you can say it word for word. The Concept: Token Accuracy measures how often the model's next-token guess matches the training token.
- What it is: A per-token correctness score.
- How it works: Compare the modelâs top guess to the true next token for each step; average across tokens.
- Why it matters: When token accuracy on the train set hits ~100%, repetition has "taught the habit," and gains usually stop growing. Anchor: Like reciting every word of the poem exactly right without peeking.
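The per-token accuracy described above can be sketched in a few lines. This is an illustrative NumPy version, not the paper's code; the function name, array shapes, and the `-100` ignore index (a common convention for masking prompt tokens) are assumptions:

```python
import numpy as np

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of positions where the model's top next-token guess
    matches the training token. Positions marked with ignore_index
    (e.g. prompt tokens, since loss is on response tokens only) are
    excluded from the average."""
    preds = logits.argmax(axis=-1)     # greedy guess at each position
    mask = labels != ignore_index      # score response tokens only
    return (preds[mask] == labels[mask]).mean()

# Toy check: vocab of 4 tokens, 3 scored positions, 2 correct guesses.
logits = np.array([[0.1, 2.0, 0.0, 0.0],   # argmax -> 1 (matches label)
                   [3.0, 0.0, 0.0, 0.0],   # argmax -> 0 (label is 2)
                   [0.0, 0.0, 0.0, 4.0],   # argmax -> 3 (matches label)
                   [9.0, 0.0, 0.0, 0.0]])  # masked out (prompt token)
labels = np.array([1, 2, 3, -100])
print(token_accuracy(logits, labels))  # 2 of 3 scored tokens, ~0.667
```

Averaged over the fixed training subset at each epoch end, this is the "gas gauge" the recipe watches.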
Hook: Think of lap counts on a track. The Concept: An Epoch is one full pass through the training dataset.
- What it is: How many times youâve seen the whole set.
- How it works: Repeat the same examples again; each pass refines the modelâs habits.
- Why it matters: More epochs on a smaller set = deeper mastery of structure. Anchor: Running the same track multiple times builds smoother, more controlled strides.
Hook: You have a fixed amount of practice time before dinner. The Concept: Update Budget means the total number of training steps you can spend.
- What it is: A hard cap on effort (epochs × samples at batch size one).
- How it works: Keep this fixed, then swap between more samples/fewer epochs vs fewer samples/more epochs.
- Why it matters: The paper shows the latter wins for long-CoT SFT. Anchor: With only 60 minutes, doing 6 perfect drills 10 times each beats doing 60 different drills once.
03 Methodology
At a high level: Input (a long-CoT SFT dataset) → Choose a fixed update budget → Pick a samples-epochs pair that multiplies to that budget → Train the model only on response tokens → Evaluate on reasoning benchmarks with multiple generations per problem → Record accuracy, Pass@N, and termination → Repeat for many pairs, compare.
đ Hook: Picture a cooking contest where every team gets the same amount of cooking time, but they can choose whether to practice lots of different recipes once, or a small set many times.
The Concept: Fixed Update Budget ensures a fair comparison between "more data, fewer epochs" and "less data, more epochs."
- What it is: The total number of gradient updates is the same for each training run.
- How it works: 1) Set a budget (e.g., 51,200 updates), 2) Create many configs whose epochs × unique samples = budget, 3) Train each from the base checkpoint.
- Why it matters: Without this, we might confuse the benefit of more total practice with the benefit of repetition itself.
Anchor: Every chef gets 4 hours total; some practice 8 recipes once, others practice 2 recipes repeatedly. Now we can compare fairly.
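The epochs × samples = budget constraint can be made concrete with a tiny helper that enumerates the matched configurations. This is a minimal sketch under the paper's batch-size-1 setup; the function and field names are illustrative, not from the authors' code:

```python
# Sketch: enumerate samples-epochs pairs that all consume the same
# update budget, so epochs * unique_samples = budget for every config
# (batch size 1, as in the paper's training settings).
BUDGET = 51_200

def configs(budget, min_samples=200):
    """Subset sizes double (200, 400, 800, ...) to mirror the nested
    subsets; each valid size gets budget // size epochs."""
    out = []
    n = min_samples
    while n <= budget:
        if budget % n == 0:  # only exact divisors keep updates equal
            out.append({"unique_samples": n, "epochs": budget // n})
        n *= 2
    return out

for c in configs(BUDGET):
    assert c["unique_samples"] * c["epochs"] == BUDGET
    print(c)  # e.g. {'unique_samples': 200, 'epochs': 256} ... up to 1 epoch
```

Every run in the sweep then starts from the same base checkpoint, so any score difference comes from the repetition trade itself, not from extra training.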
Recipe steps:
- Models and starting point
- What happens: Use base (pre-instruction) checkpoints: Olmo3-7B, Qwen3-8B, and Qwen3-4B, each with its chat template.
- Why this step exists: Ensures a clean view of SFT dynamics without earlier instruction tuning muddying the results.
- Example: Start Olmo3-7B at its base, then run SFT variants from scratch for each config.
- Training data prep
- What happens: Use long-CoT SFT data (Dolci SFT 7B). Keep only first turns with full CoT (<think>…</think>), remove extra-long samples (>10k tokens), and build nested subsets: 200 → 400 → … → 51,200.
- Why this step exists: Guarantees long, structured reasoning signals and controlled, comparable subsets.
- Example: A 1,600-sample subset is fully contained within the 3,200-sample subset.
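The nested-subset construction in this step amounts to shuffling once and taking prefixes. A minimal sketch, assuming this prefix scheme (the `nested_subsets` helper and its seed handling are illustrative, not the authors' code):

```python
import random

def nested_subsets(dataset, sizes, seed=0):
    """Shuffle the dataset once, then slice prefixes, so every smaller
    subset is fully contained in every larger one (e.g. the 1,600-sample
    subset lies entirely inside the 3,200-sample subset)."""
    rng = random.Random(seed)      # fixed seed keeps subsets reproducible
    order = list(dataset)
    rng.shuffle(order)
    return {k: order[:k] for k in sizes}

# Toy run with integer ids standing in for SFT samples.
subs = nested_subsets(range(51_200), sizes=[200, 400, 800, 1600])
assert set(subs[200]) <= set(subs[400]) <= set(subs[1600])  # nesting holds
```

Nesting matters for the comparison: a larger subset differs from a smaller one only by added samples, never by swapped ones.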
- Training settings
- What happens: bfloat16 weights, Unsloth kernels, 8-bit Adam, cosine LR, 10% warmup, batch size 1, loss on response tokens only, LR tuned on the big single-epoch run then reused.
- Why this step exists: Keep training stable, fast, and comparable across configs.
- Example: With budget 51,200, if you choose 3,200 samples, thatâs 16 epochs; if you choose 51,200 samples, thatâs 1 epoch.
- Evaluation benches and metrics
- What happens: Test on AIME 2024, AIME 2025 (30 math problems each), and GPQA (graduate-level science multiple-choice). For each problem: request a boxed final answer, sample long generations (up to 30k tokens) with multiple tries per question.
- Why this step exists: These tasks require multi-step reasoning, long chains, and clean endingsâperfect to see if repetition helps structure.
- Example: For AIME we do 16 generations per problem; for GPQA we do 4.
Hook: Like grading not just your final exam score but also whether you actually finished the exam!
The Concept: Pass@N and Termination Rate capture different success flavors beyond simple average accuracy.
- What it is: Pass@N = solved at least once across N tries; Termination = fraction of runs that cleanly end.
- How it works: 1) Make several attempts per problem, 2) Track if any reached the correct answer (Pass@N), 3) Track if outputs ended properly (Termination).
- Why it matters: Reasoning can be streaky; finishing cleanly and getting at least one right answer in several tries are both important signals.
Anchor: If you take 16 shots at a puzzle, Pass@16 asks, "Did you solve it at least once?" Termination asks, "Did you actually complete your attempt or leave it hanging?"
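These three metrics can be computed from per-problem attempt logs in a few lines. A minimal sketch; the `score_runs` helper and its (terminated, correct) tuple format are illustrative assumptions, not the paper's evaluation code:

```python
def score_runs(attempts):
    """attempts: one list per problem of (terminated, correct) tuples,
    one tuple per sampled generation. Returns mean accuracy over all
    generations, Pass@N over problems, and termination rate."""
    n_gen = sum(len(a) for a in attempts)
    acc = sum(c for a in attempts for _, c in a) / n_gen
    pass_at_n = sum(any(c for _, c in a) for a in attempts) / len(attempts)
    term = sum(t for a in attempts for t, _ in a) / n_gen
    return acc, pass_at_n, term

# Two problems, two tries each: problem 1 solved once, problem 2 never;
# one generation fails to terminate (truncated before the boxed answer).
runs = [[(True, True), (True, False)],
        [(True, False), (False, False)]]
acc, p_at_n, term = score_runs(runs)
print(acc, p_at_n, term)  # 0.25 0.5 0.75
```

Note how the metrics disagree: average accuracy is low, yet Pass@2 shows one problem was solvable, and termination exposes the truncated run that plain accuracy hides.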
- The secret sauce: token accuracy as a stop sign
- What happens: While training, measure token accuracy on a small, fixed training subset.
- Why this step exists: It rises mainly with more epochs and stops improving around perfect memorization; that's where downstream gains also stop rising.
- Example: Olmo3-7B reaches nearly 100% train token accuracy by ~16–32 epochs, and benchmark improvements level off right there.
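The stop-sign idea can be sketched as a tiny early-stopping check. The 99.5% threshold and two-epoch patience window here are illustrative choices, not values from the paper:

```python
def should_stop(token_acc_history, threshold=0.995, patience=2):
    """Illustrative early-stopping rule: stop once train token accuracy
    has stayed at or above the threshold for `patience` consecutive
    epoch-end measurements, i.e. the training tokens are effectively
    memorized and downstream gains have saturated."""
    recent = token_acc_history[-patience:]
    return len(recent) == patience and all(a >= threshold for a in recent)

# Epoch-end token accuracy on the fixed probe subset (made-up numbers).
history = [0.62, 0.81, 0.93, 0.98, 0.996, 0.999]
print(should_stop(history))  # True: the last two epochs sit above 99.5%
```

Because the signal is computed on training data the model has already seen, it costs almost nothing, unlike running the full benchmark suite after every epoch.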
- Probes for understanding
- Memorization: Track train token accuracy vs benchmark results.
- Termination: See how often models finish long answers.
- Overfitting signals: Compare train/validation loss and validation prediction entropy.
- Forgetting: Check MMLU (broad knowledge) to see if general skills fade.
- Why this step exists: To separate what looks like overfitting from what actually helps reasoning.
- Example: Validation loss gets worse with epochs, yet AIME/GPQA scores improve, so validation loss isn't a reliable north star here.
What breaks without each step:
- No fixed budget: You can't tell if repetition itself helps or if you just trained longer.
- No long-CoT data: You won't test the right behavior (multi-step structure and finishing).
- No Pass@N/Termination: You miss the model's ability to wrap up answers, a big driver of real scores.
- No token accuracy probe: You lack a simple, cheap stopping rule.
The clever bit: The paper turns a tricky, expensive choice ("collect more data?") into a cheap, actionable recipe ("repeat a small random subset, watch token accuracy, stop at saturation"), which also reduces compute and time while improving results.
04 Experiments & Results
The test: The authors held total training updates fixed and compared many-epochs-on-small-sets versus one-epoch-on-large-sets across several models (Olmo3-7B, Qwen3-8B, Qwen3-4B) and hard reasoning benchmarks (AIME 2024/2025 and GPQA). They measured Accuracy@N, Pass@N, and Termination Rate.
The competition: The baseline everyone expects to win is "more unique samples, one epoch." The challenger is "fewer unique samples, many epochs," matched for the same total updates.
The scoreboard (with context):
- Big headline: For Olmo3-7B, training 32 epochs on 1,600 samples got ~39% average accuracy across benchmarks, versus ~17% for 1 epoch on 51,200 samples, like jumping from a D to a strong B.
- Even more extreme: In some settings, 128 epochs on just 400 samples beat 1 epoch on 51,200 samples by 12–26 percentage points across AIME'24/'25 and GPQA, going from average to standout.
- Consistency: The same "repeat small beats sample big" pattern appeared for Qwen3-8B and Qwen3-4B, and across Accuracy@N and Pass@N, not just one metric.
- Saturation: Gains generally leveled off around 32–64 epochs (when train token accuracy neared 100%), signaling a natural stopping point.
Surprising findings: Hook: Think of finishing your homework: if you don't write the final answer, you don't get the points. The Concept: Termination Rate, the share of runs that end cleanly, closely tracked accuracy improvements as epochs increased.
- What it is: A measure of whether the model wraps up its long answer.
- How it works: More repetition → clearer habit of finishing → higher termination → more answers graded.
- Why it matters: Non-terminating outputs can't score, even if the middle steps are solid. Anchor: Like remembering to circle your final answer on a math test: tiny habit, big effect on your grade.
- Overfitting paradox: As epochs rose, train loss went down and validation loss (and lower validation prediction entropy) suggested classic overfitting. Yet real reasoning performance kept improving. Conclusion: validation loss on held-out SFT doesn't reflect reasoning gains here.
- Less forgetting than expected: Comparing training strategies with the same updates, multi-epoch training on a tiny 200-sample set caused less catastrophic forgetting (on MMLU) than the single-epoch, huge-data approach, while also giving much better reasoning.
Data properties matter, but the pattern stays:
- Teacher size: With better teachers (e.g., Qwen3-8B for distillation), absolute performance rose, and repetition still helped. With weaker teachers (e.g., 0.6B), adding more unique samples could even hurt at larger budgets, echoing weak-to-strong generalization issues, yet repetition still brought gains.
- Incorrect traces: Training on negative (incorrect) trajectories did not tank performance. Repetition still helped and sometimes even matched or beat positives on specific benchmarks like AIME'24 and GPQA, though overall positives remained stronger on average.
Takeaway translation: Repetition doesn't just make the model "memorize answers"; it seems to engrain the structure and habit of long reasoning and finishing, which transfers strongly to new problems. Once that habit is fully internalized (token accuracy ≈ 100% on the train set), you've squeezed out the gains; time to stop.
05 Discussion & Limitations
Limitations:
- Picking the subset size: The best small subset size depends on your data and model. Too tiny wastes capacity; too big spreads practice too thin per sample at a fixed budget. There isn't yet a principled formula to choose it ahead of time.
- Mystery mechanism: We know repetition works and correlates with train token accuracy and termination, but we don't yet fully understand the causal story of why memorization boosts generalization in long-CoT SFT.
- Validation misguidance: Standard validation loss and entropy on held-out SFT examples mislead; they look worse while real reasoning gets better. Teams will need alternative early-stopping and selection signals.
- Domain dependence: This paper focuses on long-CoT reasoning. The same repetition edge may not hold for all tasks, especially those where breadth of content exposure is more critical than procedural fluency.
Required resources:
- Compute: Modest. The recipe often uses far fewer unique samples and can be trained on a single high-memory GPU (e.g., H100 94GB per run) within a day.
- Data: A small, random subset of high-quality long-CoT samples, properly filtered to include full reasoning traces and clear endings.
- Tooling: Ability to log train token accuracy on a fixed subset, measure termination, and run multi-try evaluations (Pass@N) on your target benchmarks.
When not to use:
- If your task is fact-retrieval heavy and not about long reasoning structure, more unique data might still be superior.
- If you cannot generate long outputs (tight context or token limits), the structural benefits of repetition may not appear.
- If your SFT data lacks full reasoning traces or consistent ending conventions, repetition may engrain poor habits.
Open questions:
- Why does full memorization of training tokens align with peak generalization in long-CoT SFT? What internal representations change?
- Can we predict the best subset size for a given model and domain without a big sweep?
- Can we design better training signals than next-token cross-entropy, signals that more directly reward structuring and terminating reasoning?
- How do reinforcement learning stages interact with repetition-primed SFT models? Do they amplify or reduce the repetition gains?
06 Conclusion & Future Work
Three-sentence summary: The paper finds that, for long Chain-of-Thought SFT under a fixed update budget, repeating a small random subset for many epochs beats training once on a much larger dataset. Improvements rise with epochs and closely track training token accuracy and termination rate, then saturate when training tokens are fully memorized, without causing extra catastrophic forgetting. This flips the usual "more unique data is always better" intuition and gives a simple, cheaper recipe for better reasoning models.
Main achievement: Establishing and carefully characterizing the Data Repetition Advantage in long-CoT SFT, and offering a practical, low-cost stopping rule based on training token accuracy.
Future directions:
- Theory that explains why memorization aligns with better generalization in long-CoT SFT.
- Methods to pick the best subset size automatically, given a model and domain.
- New objectives or schedules that accelerate structural learning (reasoning and termination) even further.
- Understanding how repetition-primed SFT interacts with downstream RL fine-tuning and evaluation-time scaling.
Why remember this: It changes how we think about teaching models to reason. Instead of chasing ever-larger SFT datasets, we can repeatedly practice a small, high-quality set, watch a simple signal (token accuracy), and stop when the habit is locked in, saving compute, time, and money while getting better real-world reasoning.
Practical Applications
- Fine-tune a reasoning assistant on a 1–3k random CoT subset for many epochs and stop when training token accuracy nears 100%.
- Use termination rate as a diagnostic: if it's low, increase epochs on the same subset before collecting more data.
- When budget-limited, prefer more epochs over chasing more unique SFT data, especially for math and science reasoning.
- If you must choose between noisy new data and repeating a clean subset, repeat the clean subset first.
- Track token accuracy on a fixed mini-train set each epoch to decide stopping automatically.
- For new domains, start with a small, well-filtered CoT set (complete traces, clear endings) and scale epochs before scaling data.
- Use Pass@N with multiple generations to evaluate practical success during development, not just single-shot accuracy.
- When using distilled data, prefer stronger teachers, but still apply repetition to lock in structure and endings.
- If correct traces are scarce, don't fear including some incorrect ones; repetition can still provide benefits.
- Combine this SFT recipe with later RL stages; begin RL only after token accuracy saturates to save compute.