
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Intermediate
Shaotian Yan, Kaiyuan Liu, Chen Shen et al. · 1/14/2026
arXiv · PDF

Key Summary

  • The paper introduces DASD-4B-Thinking, a small (4B) open-source reasoning model that scores like much larger models on hard math, science, and coding tests.
  • It fixes three hidden problems in common distillation (copy-the-teacher) training: poor coverage of teacher answers, student–teacher mismatch, and exposure bias at inference time.
  • Temperature-scheduled learning first trains on easy, low-temperature samples, then on diverse, high-temperature samples to cover more of the teacher’s behaviors.
  • Divergence-aware sampling (DAS) picks training examples where the teacher is confident and the student is not, which steers learning in the most helpful direction.
  • A lightweight mixed-policy distillation step reduces exposure bias by letting the student write part of the answer and the teacher finish it, then training on those blends.
  • With only 448K examples (far fewer than many peers), the model reaches 88.5 on AIME24, 83.3 on AIME25, 69.3 on LiveCodeBench v5, and 68.4 on GPQA-Diamond.
  • Across multiple domains, the method outperforms other open 4B–8B models and even beats some 32B models, showing excellent efficiency.
  • The approach needs only sequence responses and token probabilities for teacher-generated tokens, so it works even when teacher and student tokenizers differ.
  • Small additions of mixed-policy data (just a few thousand examples) still bring measurable gains, making the pipeline practical and cost-effective.
  • The authors open-source the models and data to help others build compact, strong reasoning systems.

Why This Research Matters

Stronger small models mean everyone can run capable reasoning assistants on everyday devices, not just in big data centers. That lowers costs and expands access to high-quality help in math, science, and coding. Classrooms, startups, and researchers benefit from open, reproducible methods that do not require secret logits or matching tokenizers. The approach is data-efficient, making advanced reasoning more eco-friendly and practical. By reducing exposure bias, answers become steadier during long explanations, which users notice and trust. The open release accelerates community progress and sparks new research on compact, reliable reasoning systems.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how a good tutor doesn’t just hand you the answer sheet—they show many ways to solve problems, from the common tricks to the rare ones, and they let you try on your own too.

🥬 The Concept (Supervised Fine-Tuning, SFT): What it is: SFT is when we teach a model by showing it questions paired with correct answers and asking it to imitate them. How it works:

  1. Collect question–answer pairs.
  2. Feed them to the student model.
  3. Nudge the model to predict the teacher’s answer next time it sees a similar question.

Why it matters: Without SFT, the model has no guided examples and learns slowly or badly.

🍞 Anchor: Like practicing with a math workbook that has solutions in the back—copy the steps to learn the pattern.

🍞 Hook: Imagine a master chef writes many full recipes (start to finish). A junior chef learns best by following entire recipes, not just tasting single spoonfuls.

🥬 The Concept (Sequence-Level Distillation): What it is: A way to teach a model by copying entire teacher responses (full sequences), not just individual tokens. How it works:

  1. Ask a powerful teacher model to solve questions and produce full solutions.
  2. Train the student to increase the likelihood of those exact full solutions.
  3. Do this many times so the student’s overall output distribution resembles the teacher’s.

Why it matters: Without full-sequence focus, the student may miss the flow and structure of strong reasoning.

🍞 Anchor: Instead of memorizing single cooking steps, the junior chef practices whole recipes to get timing and structure right.
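
To make this concrete, here is a minimal sketch of the sequence-level objective, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; it illustrates the idea rather than reproducing the paper’s training code. The student simply maximizes the likelihood of the teacher’s full response given the question, with the prompt tokens masked out of the loss.

```python
import torch
import torch.nn.functional as F

def sequence_distillation_loss(student, question_ids, teacher_response_ids):
    """Negative log-likelihood of the teacher's full response under the student.

    question_ids, teacher_response_ids: 1-D LongTensors of token ids.
    student: a causal LM whose forward pass returns logits of shape [1, seq_len, vocab].
    """
    # One training sequence = question followed by the teacher's full answer.
    input_ids = torch.cat([question_ids, teacher_response_ids]).unsqueeze(0)
    logits = student(input_ids).logits

    # Predict token t from tokens < t; only the response tokens count toward the loss.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[:, : question_ids.numel() - 1] = -100  # mask the prompt positions

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```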

🍞 Hook: If you always do worksheets while the teacher is whispering each next step, test day (when you’re alone) feels very different.

🥬 The Concept (Exposure Bias): What it is: A mismatch where, during training, the student sees perfect teacher prefixes, but during testing it must continue from its own imperfect guesses. How it works:

  1. Train with teacher forcing (feed the true previous tokens).
  2. Test with autoregressive generation (feed the student’s own tokens).
  3. Small mistakes compound because the student never practiced recovering from its own errors.

Why it matters: Without addressing exposure bias, long answers drift off course.

🍞 Anchor: Practicing piano with your teacher tapping the rhythm vs. performing solo on stage—the solo feels harder if you never practiced without tapping.
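
The mismatch is easy to see in code. Below is a hedged sketch (again assuming a Hugging Face-style causal LM): during training, every position conditions on the gold prefix, while at inference the model must condition on its own earlier guesses.

```python
import torch

def teacher_forced_logits(model, prompt_ids, gold_ids):
    """Training mode: every position sees the GOLD prefix, never the student's
    own mistakes. The loss would then be computed against gold_ids."""
    input_ids = torch.cat([prompt_ids, gold_ids]).unsqueeze(0)
    return model(input_ids).logits

@torch.no_grad()
def generate_free_running(model, prompt_ids, max_new_tokens=64):
    """Inference mode: each new token conditions on the model's OWN previous
    outputs, so early errors can compound; that gap is exposure bias."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids.unsqueeze(0)).logits[0, -1]
        next_id = logits.argmax()            # greedy decoding for simplicity
        ids = torch.cat([ids, next_id.view(1)])
    return ids
```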

The world before: Researchers found that simply fine-tuning small students on teacher-written solutions (sequence-level distillation via SFT) works surprisingly well for reasoning. This sparked many open projects that gathered huge sets of teacher answers and filtered them for quality. Small models got much better at math, science, and code just by copying good reasoning traces.

The problem: Three hidden cracks showed up.

  1. Poor teacher coverage: Randomly sampled teacher answers don’t cover all the teacher’s “modes” (varied styles/paths). Students learn a narrow slice and miss useful alternatives.
  2. Student–teacher mismatch: SFT pushes the student to raise probabilities on every shown token, even when the teacher would disagree. This can send gradients the wrong way.
  3. Exposure bias: Training uses teacher-forced prefixes; inference uses the student’s own prefixes. The shift causes drift and long-answer errors.

🍞 Hook: Imagine judging a choir by one singer and assuming you know the whole chorus—easy to miss harmonies.

🥬 The Concept (Logit Distillation): What it is: Matching the teacher’s softer probability hints (how likely each token is), not just the one correct token. How it works:

  1. Get teacher probabilities over many token choices.
  2. Train the student to align its probabilities to the teacher’s.
  3. Learn “dark knowledge” about near-miss options.

Why it matters: Without soft hints, students overfit to one path and ignore helpful alternatives.

🍞 Anchor: A coach says, “Shot A is 70% good, Shot B is 25%, Shot C is 5%”—you learn more than from just “Do Shot A.”
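
For contrast with the sequence-level approach used in the paper, here is a minimal sketch of classic logit distillation as a KL divergence between teacher and student next-token distributions. Note that it needs full teacher logits and a shared vocabulary, which is exactly the requirement the paper’s recipe avoids.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, tau=1.0):
    """KL(teacher || student) over the vocabulary at every position.

    Both tensors have shape [batch, seq_len, vocab] and must share a tokenizer.
    """
    teacher_logprobs = F.log_softmax(teacher_logits / tau, dim=-1)
    student_logprobs = F.log_softmax(student_logits / tau, dim=-1)
    # Classic knowledge-distillation scaling by tau^2.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    log_target=True, reduction="batchmean") * tau ** 2
```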

Failed attempts: Earlier sequence-level methods mostly improved data filtering (removing wrong, overly short, or repetitive answers), but left the core distillation interaction weak: no explicit guidance to cover teacher diversity, no fix for misleading gradients, and no practice recovering from the student’s own mistakes.

The gap: We needed a way to 1) plan which teacher samples to show (to cover more modes without overwhelming the student), 2) select examples that best correct the student’s weaknesses, and 3) let the student practice with its own prefixes while still benefiting from teacher guidance.

Real stakes: In daily life, we want small, fast AI helpers—on laptops or phones—that reason reliably: check math steps, explain science carefully, and write runnable code. If they only copy easy cases, learn wrong signals, or fall apart in long answers, they’ll waste our time. Better distillation means better tutors, coders, and lab partners that fit on everyday devices.

02Core Idea

🍞 Hook: Think of learning to ride a bike: start on a smooth path, try small bumps later, and finally practice steering yourself while a grown-up jogs beside you.

🥬 The Concept (Temperature-Scheduled Learning): What it is: A curriculum that first learns from low-temperature (confident, consistent) teacher answers, then from high-temperature (diverse) answers. How it works:

  1. Stage 1: Train on low-temperature samples to grasp solid patterns.
  2. Stage 2: Train on higher-temperature samples to cover rarer teacher modes.
  3. Keep training steady so the student absorbs diversity without wobbling.

Why it matters: Without this schedule, high-diversity data can overwhelm small students; with only low-temp data, they miss useful alternatives.

🍞 Anchor: Start biking on flat ground (easy), then try gentle hills (diverse), so you don’t crash or get stuck only on flat roads.
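
Operationally, the schedule is just a two-stage data ordering, sketched below; the `train_one_epoch` helper and the epoch counts are hypothetical placeholders, not values from the paper.

```python
def temperature_scheduled_training(student, low_temp_data, high_temp_data,
                                   train_one_epoch, epochs_stage1=2, epochs_stage2=2):
    """Stage 1: confident low-temperature teacher samples build a stable base.
    Stage 2: diverse high-temperature samples broaden mode coverage."""
    for _ in range(epochs_stage1):
        train_one_epoch(student, low_temp_data)    # e.g. teacher sampled at T=0.6
    for _ in range(epochs_stage2):
        train_one_epoch(student, high_temp_data)   # e.g. teacher sampled at T=1.0
    return student
```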

🍞 Hook: A smart teacher spends extra time where you struggle, not where you’re already strong.

🥬 The Concept (Divergence-Aware Sampling): What it is: A way to pick training examples where the teacher is confident but the student is not (big teacher–student gap). How it works:

  1. For each teacher-generated token, record the teacher’s probability and the student’s probability.
  2. Score examples by how much the teacher’s confidence exceeds the student’s.
  3. Prefer examples rich in those high-gap tokens/sentences.

Why it matters: Without DAS, we waste time on examples the student already knows or, worse, push it in wrong directions.

🍞 Anchor: A coach rewatches plays where you missed an easy shot the expert always makes—that’s where you grow fastest.
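
A minimal sketch of the selection signal, assuming we already have per-token probabilities from both models on the teacher-generated tokens; the exact scoring rule in the paper may differ from this illustration.

```python
def divergence_score(teacher_probs, student_probs, min_gap=0.0):
    """Score one teacher response by how much the teacher's confidence exceeds
    the student's on the same tokens. Higher means more to learn from it."""
    gaps = [t - s for t, s in zip(teacher_probs, student_probs) if t - s > min_gap]
    return sum(gaps) / max(len(teacher_probs), 1)

def select_top_examples(examples, k):
    """examples: list of dicts with 'teacher_probs' and 'student_probs' lists."""
    ranked = sorted(
        examples,
        key=lambda ex: divergence_score(ex["teacher_probs"], ex["student_probs"]),
        reverse=True,
    )
    return ranked[:k]
```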

🍞 Hook: Training wheels help, but you must also practice balancing by yourself.

🥬 The Concept (Mixed-Policy Distillation): What it is: Blend student-generated prefixes with teacher completions, then train on the mix to reduce exposure bias. How it works:

  1. Let the student answer; identify when it drifts or gets cut off.
  2. Chop the student’s answer mid-way and ask the teacher to finish.
  3. Train on this stitched response so the student learns to recover from its own states.

Why it matters: Without mixed-policy practice, long answers derail at test time because the student never learned to self-correct.

🍞 Anchor: You start the paragraph; your tutor helps finish it. Next time, you can finish better on your own.

The “Aha!” moment in one sentence: Don’t just copy teacher answers—align the student to the teacher’s sequence distribution with a smart curriculum (temperature scheduling), target the biggest learning gaps (divergence-aware sampling), and practice from the student’s own prefixes (mixed policy).

Three analogies:

  1. Music class: learn the main melody first (low temp), then jazz variations (high temp), while practicing solo with a coach ready to jump in (mixed policy).
  2. Sports: drill fundamentals (low temp), add tricky plays (high temp), then scrimmage with a mentor calling timeouts (mixed policy).
  3. Cooking: master staple recipes (low temp), explore exotic dishes (high temp), and do cook-alongs where you start and the chef steps in when you stall (mixed policy).

Before vs. After:

  • Before: Random teacher samples, flat SFT, big exposure bias.
  • After: Temperature-scheduled coverage, gap-focused sampling, and student-prefix training—results jump with fewer samples.

Why it works (intuition):

  • Low temp builds a stable base (clear signals); high temp broadens coverage (diversity). Together, they approximate more of the teacher’s modes without chaos.
  • DAS picks the most educational friction—places where the teacher knows and the student doesn’t—so gradients point the right way.
  • Mixed-policy closes the train–test gap by exposing the student to itself and showing how to recover.

Building blocks:

  • Data: teacher responses at low/high temperatures.
  • Signals: token probabilities from teacher and student on teacher-generated tokens.
  • Selection: DAS to emphasize teacher-strong/student-weak parts.
  • Training stages: low-temp SFT → high-temp SFT → small mixed-policy finetune.
  • Quality filters: remove too-long, repetitive, or tool-calling outputs to keep reasoning clean.

03Methodology

High-level flow: Input (hard questions) → Teacher sampling (low temp, then high temp) → Divergence-aware selection + quality filtering → Stage 1 training (low temp) → Stage 2 training (high temp) → Mixed-policy data construction → Final finetune → Output (DASD-4B-Thinking).

Step 1: Collect questions across domains

  • What happens: Gather challenging math, coding, science, and instruction-following prompts from open datasets.
  • Why it exists: Rich, varied questions elicit long, step-by-step reasoning the student can learn.
  • Example: “AIME-style algebra,” “Write a function to parse logs,” “Which molecule is more stable and why?”

Step 2: Sample teacher responses at two temperatures

🍞 Hook: Imagine a camera with zoom (sharp focus) and wide angle (show more scene).

🥬 The Concept (Sampling Temperature): What it is: A knob that controls how adventurous the teacher’s choices are—low is precise, high is diverse. How it works: Lower temperature concentrates probability on top tokens; higher spreads it out to explore alternatives. Why it matters: With only low-temperature samples the student misses variety; with only high-temperature samples training gets noisy. We need both.

🍞 Anchor: Study the clearest example first, then look at creative variations.

  • What happens: For each question, ask the teacher to produce multiple answers at T=0.6 (precise) and T=1.0 (diverse).
  • Why it exists: Combine clarity (learnable patterns) with coverage (rare but valuable solution paths).
  • Example: A math proof may have a standard path (low T) and a clever trick path (high T).
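
To see what the temperature knob actually does to the next-token distribution, here is a generic softmax-with-temperature sketch, not tied to any particular model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.6))  # peaked: strongly favors the top token
print(softmax_with_temperature(logits, 1.0))  # flatter: explores alternative tokens
```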

Step 3: Divergence-aware sampling (DAS)

  • What happens: For teacher-generated tokens, record teacher probabilities and student probabilities; prefer samples where the teacher is confident and the student is not.
  • Why it exists: This aligns gradients with what the student most needs to learn, avoiding misleading pushes.
  • Example: If the teacher strongly favors “factor here” while the student is unsure, that step gets prioritized.

Step 4: Quality filtering

  • What happens: Remove too-long answers (beyond context), tool-calls (not the focus here), and repetitive outputs.
  • Why it exists: Keeps training clean so the student doesn’t learn bad habits like endless repetition.
  • Example: Discard a solution that repeats the same paragraph three times.
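
A hedged sketch of the kind of filters described above; the token limit, the tool-call marker, and the repetition heuristic are illustrative choices, not the paper’s exact criteria.

```python
def passes_quality_filters(response: str, tokenizer, max_tokens: int = 32768) -> bool:
    """Drop responses that are too long, call tools, or repeat themselves."""
    if len(tokenizer.encode(response)) > max_tokens:   # would overflow the training context
        return False
    if "<tool_call>" in response:                      # tool use is out of scope (illustrative tag)
        return False
    # Crude repetition check: any paragraph appearing three or more times is suspicious.
    paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
    if any(paragraphs.count(p) >= 3 for p in paragraphs):
        return False
    return True
```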

Step 5: Stage 1 training (low-temperature SFT)

  • What happens: Train the student on the low-temp, DAS-selected set first.
  • Why it exists: Establish a strong foundation with consistent, high-confidence reasoning patterns.
  • Example: Learn the standard way to solve a quadratic before exploring unusual substitutions.

Step 6: Stage 2 training (high-temperature SFT)

  • What happens: Resume training on the high-temp, DAS-selected set.
  • Why it exists: Expand mode coverage—learn rarer but useful answer paths without destabilizing early progress.
  • Example: Add alternative proof ideas or coding refactors that solve edge cases.

Step 7: Mixed-policy data construction

  • What happens: Let the student generate full answers for some training questions. Identify places where it drifts or gets cut off; randomly truncate the student’s prefix and have the teacher complete the rest—keep only good completions.
  • Why it exists: Reduce exposure bias by letting the student practice from its own partial answers while guided by the teacher’s continuation.
  • Example: The student writes the first 60% of a proof and stalls; the teacher completes the final argument. That stitched answer becomes training data.
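
A minimal sketch of the stitching step; the `student_generate`, `teacher_generate`, and `is_good_completion` helpers are hypothetical stand-ins, and the truncation rule is simplified relative to the paper.

```python
import random

def build_mixed_policy_example(question, student_generate, teacher_generate,
                               is_good_completion):
    """Student starts the answer, the teacher finishes it; keep only good stitches."""
    student_answer = student_generate(question)

    # Cut the student's own prefix at a random point (illustratively 40-80% of the way in).
    cut = random.randint(int(0.4 * len(student_answer)), int(0.8 * len(student_answer)))
    student_prefix = student_answer[:cut]

    # The teacher continues from the student's (possibly imperfect) prefix.
    teacher_suffix = teacher_generate(question + "\n" + student_prefix)
    stitched = student_prefix + teacher_suffix

    # Keep the example only if the completed answer checks out.
    return stitched if is_good_completion(question, stitched) else None
```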

Step 8: Final mixed-policy finetune

  • What happens: Finetune briefly on a small mix of stitched (student+teacher) data and a balanced slice of off-policy data.
  • Why it exists: Even a few thousand mixed examples measurably improve stability and conciseness.
  • Example: Answers get less rambly and stay on track during long chains of thought.

Secret sauce (why the combo matters):

  • Temperature schedule = curriculum: easy-to-hard without overwhelm.
  • DAS = right battles: spend compute where the teacher knows and the student wobbles.
  • Mixed policy = practice recovery: learn to continue well from your own prefixes.

Concrete mini-walkthrough:

  • Input: “Compute the value of x if 2x + 3 = 11.”
  • Teacher low T: Straight solve x = 4 with neat steps.
  • Teacher high T: Alternative path (rearrange, check) still yields x = 4.
  • DAS sees that the student underestimates a key step (“subtract 3 from both sides”). That example is kept.
  • Train Stage 1: Student learns crisp algebra steps.
  • Train Stage 2: Student absorbs alternate wording and checks.
  • Mixed policy: Student starts, forgets to check; teacher completes with a validation step. Next time, student remembers to check.

Training details (practical notes, simplified):

  • Use long context support, pack sequences efficiently, and decay the learning rate over epochs.
  • Keep domain mix balanced (math, code, science, instructions) to encourage transfer.
  • Maintain strong filtering to avoid verbosity or repetition creep.

04Experiments & Results

The test: We measure accuracy on four tough benchmarks that demand long chain-of-thought (CoT):

  • AIME24 and AIME25: 30 very hard math problems each (final-answer grading).
  • LiveCodeBench v5/v6: coding tasks with execution-based checks, updated over time to avoid contamination.
  • GPQA-Diamond: doctoral-level multiple-choice science questions designed to resist simple lookup.

The competition: We compare to strong open models—Qwen3 family (4B–32B), DeepSeek-distilled 8B, GLM-Z1 (9B/32B), Mistral 3 (3B/8B), AM-thinking-v1 (32B), OpenThoughts3-7B, NVIDIA OpenReasoning-Nemotron-7B, and others.

The scoreboard (with context):

  • AIME24: 88.5. That’s like scoring an A+ when many compact models get a B. It exceeds some 32B models.
  • AIME25: 83.3. Again top-tier for its size, beating prominent 32B baselines.
  • LiveCodeBench v5: 69.3. Stronger than many 8B–32B open baselines—showing excellent code transfer.
  • LiveCodeBench v6: 67.5. Outperforms popular 4B and even some larger models on the newer split.
  • GPQA-Diamond: 68.4. Very competitive for a 4B model on a graduate-level test, nearing larger models’ scores.

Why these results matter: The model uses only 448K training samples—much fewer than many peers—yet it matches or surpasses models 8–60× larger. That’s serious efficiency: fewer tokens, fewer dollars, strong reasoning.

Surprising findings:

  • High temperature helps even if the training loss looks worse: covering more teacher modes beats simply converging faster on narrow data.
  • Divergence-aware sampling (DAS) reliably tops random sampling at the same data budget, sometimes rivaling or beating doubled random data.
  • A tiny mixed-policy set (about 7–13K examples) still gives measurable gains across benchmarks—small but mighty.
  • Outputs become more concise after mixed-policy finetuning, hinting that practicing from student prefixes reduces rambling.

Ablations (what each stage adds):

  • Base student alone: modest scores.
  • + Low-temp DAS training: big jump—establishes fundamentals.
  • + High-temp DAS training: another big jump—adds diversity and coverage.
  • + Mixed-policy: consistent small boosts everywhere—better stability and final polish.

Takeaway: Smart data selection plus a gentle curriculum and on-policy practice can let a compact 4B student learn like a much bigger model, without needing logits alignment or matching tokenizers.

05Discussion & Limitations

Limitations (honest look):

  • Exposure bias is reduced but not erased: mixed-policy is small and lightweight by design; even more on-policy practice might help further.
  • Teacher probability access: DAS relies on token probabilities for teacher-generated tokens. Some closed APIs provide them; if not, workarounds are needed.
  • Teacher quality: If the teacher has systematic mistakes or odd styles, the student may inherit them—filters help, but can’t fix everything.
  • Domain knowledge vs. reasoning: Benchmarks like GPQA mix knowledge and reasoning. Small models have limited parametric knowledge; retrieval or tools might still be needed.
  • Long-context cost: Training at long context (64K) requires memory-optimized setups, which not everyone has.

Required resources:

  • A capable teacher model with probability outputs on sampled tokens.
  • GPUs with enough memory for long-context SFT and efficient packing.
  • Data curation pipeline: temperature sampling, DAS scoring, and filtering scripts.

When NOT to use this:

  • If you need strict token-level logit alignment across mismatched tokenizers—use logit distillation only when tokenizers match and full logits are available.
  • If the task is mostly tool-calling or retrieval-heavy workflows; this pipeline focuses on long CoT, not tools.
  • If you cannot access teacher token probabilities at all (even for teacher-generated tokens), DAS loses its edge.

Open questions:

  • Can sequence-level, distribution-aware reweighting (using teacher sequence probabilities) further improve data efficiency?
  • How big should the mixed-policy portion be for the best trade-off of cost vs. bias reduction?
  • What is the best way to combine this with retrieval/tool use so small models compensate for limited knowledge?
  • How robust is DAS across languages and specialized domains (law, medicine, finance)?
  • Can we automatically detect and down-weight teacher quirks that don’t help students generalize?

06Conclusion & Future Work

Three-sentence summary: This paper shows how to turn a small student into a strong reasoner by aligning it to the teacher’s whole sequence distribution. The trick is a three-part recipe: temperature-scheduled learning for coverage, divergence-aware sampling for targeted learning, and mixed-policy distillation to reduce exposure bias. With only 448K examples, the 4B model reaches or beats much larger models on hard math, code, and science benchmarks.

Main achievement: A practical, open, data-efficient distillation pipeline that does not require matching tokenizers or full teacher logits, yet yields state-of-the-art small-model reasoning performance through smart sample selection and a gentle curriculum.

Future directions: Add sequence-level reweighting using teacher probabilities, scale and refine mixed-policy data, and integrate retrieval/tool use to boost knowledge-heavy tasks. Explore multilingual and domain-specific extensions. Investigate automatic down-weighting of unhelpful teacher patterns.

Why remember this: It reframes copy-the-teacher SFT as true sequence-level distribution alignment—teaching small models not just “the answer,” but the shape of good reasoning. That shift unlocks big gains with small data and makes powerful, portable reasoning assistants more attainable.

Practical Applications

  • Create an on-device math tutor that explains steps clearly without needing the cloud.
  • Build lightweight coding assistants that generate and fix code with strong test performance.
  • Offer science study helpers that reason through complex concepts concisely and accurately.
  • Power customer-support bots that follow multi-step troubleshooting procedures reliably.
  • Enable small research assistants to draft proofs or analyses with fewer logical detours.
  • Train domain mini-experts (finance, healthcare guidelines) when large compute is unavailable.
  • Improve auto-grading systems that assess reasoning chains, not just final answers.
  • Enhance multi-step planning tools (itineraries, recipes, DIY guides) with less rambling.
  • Deploy compact models in bandwidth-limited settings (remote classrooms, field labs).
  • Fine-tune small enterprise models to adopt internal reasoning styles while preserving privacy.
#sequence-level distillation · #divergence-aware sampling · #temperature-scheduled learning · #mixed-policy distillation · #exposure bias · #chain-of-thought (CoT) · #reasoning models · #knowledge distillation · #teacher–student alignment · #long-context training · #data-efficient training · #sampling temperature · #mode coverage · #autoregressive generation · #open-source LLMs