Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Key Summary
- The paper asks a simple question: which step-by-step explanations from a teacher model actually help a student model learn to reason better?
- It shows that using the strongest teacher does not always make the best student; what matters is how well the teacher's steps fit the student.
- Existing pick-the-data methods mostly trust what the student already finds likely, but that often skips the most educational examples.
- The authors introduce the Rank-Surprisal Ratio (RSR), a tiny metric that balances two things at once: alignment (easy enough to follow) and informativeness (new enough to teach).
- RSR is the ratio of average token rank (relative familiarity) to average surprisal (absolute unfamiliarity); lower RSR means better training trajectories.
- Across five student models and 11 teacher models, RSR's score strongly predicts post-training performance (average Spearman 0.86), beating many popular alternatives.
- RSR helps pick the best trajectory for each problem and the best teacher for a student, even with only 200 sample trajectories per teacher.
- Key design choices (surprisal-weighted averaging and clipping very large ranks) make RSR stable and robust.
- The metric is simple to compute (one forward pass of the student) and needs no special verifiers or test sets.
- Limitations include dependence on the quality/diversity of available candidates and a focus on math tasks; extending to other domains is future work.
Why This Research Matters
Picking the right learning examples can make small models reason much better without huge compute budgets. RSR helps teams choose data that is just hard enough to teach, instead of just familiar or just flashy. That means faster training cycles, lower costs, and stronger gains for student models across tasks. It also supports fairer access: smaller organizations can improve models by smarter data, not only by bigger hardware. In apps like tutoring, coding help, and scientific Q&A, RSR-guided training can produce clearer, more reliable step-by-step reasoning. Finally, the method is simple to implement and robust with limited samples, so it fits real-world, resource-constrained pipelines.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how some homework examples really help you learn, while others either feel too easy or way too hard? The best examples are just challenging enough to stretch you without snapping your confidence.
🥬 Filling (The Actual Concept)
- What it is: This paper studies how to choose the right step-by-step explanations (called reasoning trajectories) from a big teacher AI to teach a smaller student AI to reason better.
- How it works (story of the field):
- Chain-of-Thought (CoT) became popular because it shows the steps to solve hard problems, not just the answer.
- Teachers generate long CoT solutions for many problems; students learn by copying these steps (supervised fine-tuning, or SFT).
- People noticed a surprise: bigger, smarter teachers don't always make better students. The student's improvement depends on which teacher steps it studies.
- Earlier filters mostly picked data the student already liked (high probability). That makes the student feel safe but doesn't teach much new.
- The paper argues for a balance: choose data the student can almost follow (aligned) but still finds a bit surprising (informative).
- Why it matters: Without picking the right examples, we waste training time, get smaller gains, and sometimes even make students worse at reasoning.
🍞 Bottom Bread (Anchor) Imagine you're practicing basketball. If the coach only makes you shoot from a spot you already mastered, you stop improving. If they only ask you to dunk from the free-throw line, you can't do it and learn nothing. The best drills are hard but doable, and those are the drills this paper learns to pick.
New Concept 1 – Chain-of-Thought (CoT) 🍞 Hook: Imagine showing all your steps while solving a long division problem so your teacher can see your thinking. 🥬 The Concept: CoT is a step-by-step explanation an AI writes to solve a problem.
- How it works: (1) Read the problem. (2) Break it into smaller steps. (3) Solve each step clearly. (4) Box the final answer.
- Why it matters: Without CoT, the student copies answers without learning how to think. 🍞 Anchor: In math class, writing "Because 12×3=36, then 36−4=32, so the mean is 32" is a CoT.
New Concept 2 – Likelihood Estimation 🍞 Hook: You know how you can guess the next word in a sentence, like after "peanut butter and ..." you expect "jelly"? 🥬 The Concept: Likelihood estimation is how sure a model is about the next word it will write.
- How it works: (1) Look at the sentence so far. (2) Score all possible next words. (3) Pick the most likely one to continue.
- Why it matters: Without likelihoods, the model can't judge what seems normal versus surprising. 🍞 Anchor: After "According to this formula, the mean is 32," words like "Therefore" are likelier than "Admittedly."
New Concept 3 – Surprisal 🍞 Hook: The more shocking a magic trick, the more surprised you feel. 🥬 The Concept: Surprisal measures how unexpected a word is to the student model; low likelihood means high surprisal.
- How it works: (1) The model assigns a probability to a word. (2) Convert that to "surprisal" (unexpectedness). (3) Use higher surprisal to spot new learning signals.
- Why it matters: If everything is unsurprising, nothing new is learned. 🍞 Anchor: If a student never uses "Alternatively," then seeing it in a good place is surprising and could teach a new pattern. (A one-line code sketch of surprisal follows.)
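To make the surprisal idea concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that turns a next-word probability into a surprisal value; using the natural logarithm (nats) is an assumption, since the summary above does not fix the unit.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal of a token: -log(probability).
    A token the student expects (prob near 1) has surprisal near 0;
    a token it almost never predicts (prob near 0) has large surprisal."""
    return -math.log(prob)

print(round(surprisal(0.9), 3))    # familiar continuation, e.g. "Therefore"    -> 0.105
print(round(surprisal(0.001), 3))  # rare continuation, e.g. "Alternatively"    -> 6.908
```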
New Concept 4 – Behavioral Alignment 🍞 Hook: When you learn a new dance, it helps if the moves are similar to ones you already know. 🥬 The Concept: Alignment means the teacher's steps feel familiar enough that the student can follow the pattern.
- How it works: (1) Compare the teacher's tokens to the student's preferences. (2) Higher-ranked tokens mean they're among the student's top candidates. (3) Keep it in the student's comfort zone, but not too comfy.
- Why it matters: If it's too alien, the student can't absorb it. 🍞 Anchor: A student who often writes "Therefore" will find "Thus" familiar too; that's aligned.
The World Before
- AI could copy long explanations, but picking which explanations to copy was guesswork. People used proxy signals like "the teacher is huge" or "the student already deems this likely."
- That led to two failures: too-easy examples (no learning) or too-weird examples (no comprehension).
The Problem
- We need a way to measure if a teacher's reasoning is both aligned (followable) and informative (teaches something new) for a specific student.
Failed Attempts
- Pure probability filters: Pick high-likelihood data. Result: safe but not educational.
- Pure difficulty: Pick super-surprising data. Result: too hard to learn from.
The Gap
- A single, student-specific metric that balances both forces (alignment and informativeness) was missing.
Real Stakes
- Better tutoring bots that explain at your level.
- Coding helpers that teach small models new tricks without overwhelming them.
- Faster, cheaper training by keeping only the most educational examples.
- Fairer access: smaller labs can train capable reasoners by choosing smarter data, not just bigger compute.
02 Core Idea
🍞 Top Bread (Hook) Imagine picking a book for a friend: books they've already read are boring; books way above their reading level are frustrating. You want the "just right" book that stretches them a bit.
🥬 Filling (The Actual Concept)
- What it is (one sentence): The Rank-Surprisal Ratio (RSR) is a small score that says how well a step-by-step explanation will teach a specific student model by balancing familiarity (rank) and surprise (surprisal).
Multiple Analogies (3 ways)
- Sports Drills: A drill too easy doesn't build skill; a drill too advanced can't be practiced. RSR finds hard-but-doable drills.
- Hiking Trails: An easy sidewalk teaches nothing; a cliff is unsafe. RSR picks a trail with some slope but clear markers.
- Music Practice: Scales you've mastered are dull; a concerto is impossible. RSR chooses a piece just beyond your comfort zone so you learn fastest.
Before vs After
- Before: People picked teacher steps the student already liked (high likelihood) or trusted big-name teachers. Results were inconsistent across students.
- After: With RSR, we pick trajectories where tokens are still among the student's top candidates (aligned), yet overall feel uncommon (surprising). This predicts learning gains much better.
Why It Works (intuition)
- Tokens with relatively high rank (the student already considers them strong contenders) provide a familiar path.
- A trajectory with higher overall surprisal ensures the student encounters new patterns and ideas.
- Dividing rank by surprisal ties these two together: we prefer small ratios, meaning "highly teachable novelty."
Building Blocks (with mini Sandwich explanations) New Concept 5 – Token Rank 🍞 Hook: In a class vote for the next activity, choices get ranked from most to least popular. 🥬 The Concept: Token rank is a token's position among all possible next tokens by the student's scores (1 = top choice).
- How it works: (1) The student scores every next-token option. (2) Sort by score. (3) The target token's place is its rank.
- Why it matters: High-ranked means the student already finds it reasonable. 🍞 Anchor: After "Therefore," the token "we" might be rank 3, while "alternatively" might be rank 12.
New Concept 6 – RSR 🍞 Hook: When choosing a puzzle, you want it near your skill but still with some twists. 🥬 The Concept: RSR is the ratio of average token rank (relative familiarity) to average surprisal (absolute unfamiliarity) across a trajectory; lower is better.
- How it works: (1) For each token, get rank and surprisal from the student. (2) Clip super-large ranks to avoid noisy extremes. (3) Sum ranks and sum surprisals. (4) Divide sums to get RSR. (5) Optionally weigh by surprisal to focus on the truly educational bits.
- Why it matters: Without RSR, you either pick comfy-but-dull or shocking-but-unlearnable examples. 🍞 Anchor: Between two math explanations, the one with tokens the student often considers yet overall feels new gets a smaller RSR and teaches better. (A tiny numeric sketch follows.)
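The recipe above maps directly to a few lines of Python. This is a minimal sketch of scoring a single trajectory from per-token ranks and surprisals; the function name, the rank cap of 100, and the toy numbers are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Sequence

def trajectory_rsr(ranks: Sequence[int], surprisals: Sequence[float],
                   rank_cap: int = 100) -> float:
    """Trajectory-level RSR: sum of clipped token ranks divided by sum of
    token surprisals. Lower means more teachable for this student."""
    clipped = [min(r, rank_cap) for r in ranks]   # tame rare, extreme ranks
    return sum(clipped) / sum(surprisals)

# Familiar-but-fresh trajectory: modest ranks, decent surprisal -> low RSR
print(trajectory_rsr(ranks=[2, 3, 1, 4], surprisals=[1.2, 2.0, 0.9, 1.5]))      # ~1.79
# Off-pattern trajectory: one huge rank dominates even after clipping -> high RSR
print(trajectory_rsr(ranks=[2, 30000, 1, 4], surprisals=[1.2, 9.0, 0.9, 1.5]))  # ~8.49
```

Because the cap is applied before dividing, a single bizarre token raises the score but cannot blow it up without bound.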
The Aha! Moment (one sentence)
- Balance absolute unfamiliarity (surprisal) with relative familiarity (rank): choose explanations that are surprising overall but filled with tokens the student already half-recognizes.
Why It's Different
- Prior metrics looked at likelihood only. RSR explicitly encodes the sweet spotāaligned and informativeāso it generalizes across different students.
What Changes Because of This
- Data selection becomes student-specific and reliable. It predicts who learns most from which teacher, and even picks the best teacher using very little data.
🍞 Bottom Bread (Anchor) Suppose a student always writes "Therefore," "So," "Thus," but rarely "Alternatively." The best training example might mostly use those familiar transitions while adding a few new, well-placed steps. RSR sees that mix and gives it a low score, meaning "great for learning."
03 Methodology
🍞 Top Bread (Hook) Imagine sorting a big pile of math solutions to find the ones that will help your friend improve fastest. You'd skim each one and ask: Can they almost follow it? Will they learn something new from it?
🥬 Filling (The Actual Concept)
- What it is: A recipe to compute RSR for each trajectory and then use it to pick data and teachers.
- High-level flow: Teacher trajectories → compute per-token rank and surprisal using the student → clip extremes and aggregate → trajectory-level RSR → dataset-level RSR → select what to train on.
Step-by-step (like a recipe)
- Input: Teacher trajectories
- What happens: Gather long, step-by-step solutions (CoT) from several teacher models for the same 5,000 math problems.
- Why it exists: We need multiple candidate explanations to choose from.
- Example: For "According to this formula, the mean is 32," teachers might continue with "Therefore," "So," or "Alternatively."
- Student Scoring: Token probabilities and surprisals
- What happens: For each trajectory, run the student model forward once to get the probability for each token; turn probabilities into surprisals (unexpectedness).
- Why it exists: Surprisal tells us how new this trajectory feels to the student; higher surprisal = stronger learning signal.
- What breaks without it: We'd only chase familiarity and miss the "teachable" novelty.
- Example: If the student rarely uses "Alternatively," it gets higher surprisal than "Therefore."
- Student Scoring: Token ranks
- What happens: For each token, compute its rank among all possible next tokens from the student's scores.
- Why it exists: Rank measures relative familiarity; we want tokens the student can almost predict.
- What breaks without it: We might pick super-random tokens the student can't learn from.
- Example: After "According to this formula…," "Therefore" might be rank 2, "Alternatively" rank 11. (A short code sketch of this scoring step follows.)
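To show what "one forward pass of the student" looks like in practice, here is a hedged sketch using the Hugging Face transformers API; the checkpoint name is just a placeholder for whatever student model you are training, and the paper may batch, mask, or normalize things differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder student checkpoint
tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT)
student.eval()

def score_trajectory(text: str):
    """One forward pass of the student over a teacher trajectory, returning
    the rank and surprisal of every token (after the first one)."""
    ids = tok(text, return_tensors="pt").input_ids              # [1, T]
    with torch.no_grad():
        logits = student(ids).logits                            # [1, T, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)       # prefix t predicts token t+1
    targets = ids[0, 1:]                                        # tokens the teacher actually wrote
    target_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    surprisals = -target_lp                                     # higher = more unexpected
    # Rank = 1 + number of vocabulary items the student scores above the target token.
    ranks = (log_probs > target_lp.unsqueeze(1)).sum(dim=-1) + 1
    return ranks.tolist(), surprisals.tolist()
```

Feeding these per-token lists into a scorer like the `trajectory_rsr` sketch above gives the trajectory's RSR.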
- Stabilize: Clip very large ranks
- What happens: Replace any extremely large rank with a cap (like 100).
- Why it exists: Rare, ultra-low-probability tokens can explode the rank and make the average unstable.
- What breaks without it: A few extreme tokens dominate the metric and mislead selection.
- Example: A bizarre, never-seen symbol could get rank 30,000; clipping treats it as just "very unfamiliar."
- Aggregate within a trajectory (surprisal-weighted)
- What happens: Sum clipped ranks across tokens; sum surprisals across tokens; divide sums to get trajectory-level RSR. This acts like a surprisal-weighted average of per-token ratios.
- Why it exists: Emphasizes the parts that carry real learning signal (higher surprisal) without being derailed by easy tokens.
- What breaks without it: A few trivial tokens could dilute the signal from informative parts.
- Example: A 10-sentence solution where the most surprising 3 sentences carry most of the teaching value; weighting highlights them (the short identity below makes this precise).
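Why does the ratio of sums act like a surprisal-weighted average? With my own symbols (not the paper's), writing the clipped rank and the surprisal of token i as r̃_i and s_i, a one-line identity shows it:

```latex
\mathrm{RSR} \;=\; \frac{\sum_i \tilde r_i}{\sum_i s_i}
\;=\; \sum_i \frac{s_i}{\sum_j s_j} \cdot \frac{\tilde r_i}{s_i}
```

Each per-token ratio (clipped rank divided by surprisal) is weighted by that token's share of the total surprisal, so easy, low-surprisal tokens barely move the score while the genuinely surprising parts dominate it.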
- Aggregate to dataset-level (optional)
- What happens: To compare teachers, compute a dataset-level RSR by weighting trajectories (again by surprisal) and taking the ratio of summed ranks to summed surprisals.
- Why it exists: Gives a single score per teacher for fast teacher selection.
- What breaks without it: You can't compare teachers fairly with limited samples. (A small code sketch of this dataset-level score follows.)
- Example: With only 200 sampled trajectories per teacher, you still get a stable, comparable score.
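A dataset-level score can be sketched the same way: pool the clipped ranks and surprisals of all sampled trajectories for a teacher and take one ratio. The pooling below is my reading of "weighting trajectories by surprisal"; treat details such as the cap value as assumptions.

```python
def dataset_rsr(trajectories, rank_cap: int = 100) -> float:
    """Dataset-level RSR for one teacher. `trajectories` is a list of
    (ranks, surprisals) pairs, e.g. ~200 sampled trajectories."""
    total_rank = sum(min(r, rank_cap) for ranks, _ in trajectories for r in ranks)
    total_surp = sum(s for _, surps in trajectories for s in surps)
    return total_rank / total_surp
```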
- Use cases: Selection
- Trajectory selection: For each problem, choose the candidate trajectory with the lowest RSR for that student.
- Teacher selection: With a small sample (e.g., 200 trajectories) from each teacher, pick the teacher whose dataset gets the lowest RSR.
- Why it exists: Turns the metric into action to improve training.
- Example: Across 11 teachers × 3 generations each, pick 1 best trajectory per problem using RSR, then fine-tune (a selection sketch in code follows).
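Putting the pieces together, selection reduces to two argmin operations. The sketch below reuses `trajectory_rsr` and `dataset_rsr` from the earlier sketches; `candidate_pool` and `teacher_samples` are hypothetical data structures you would build from your own scored trajectories, not objects defined in the paper.

```python
# candidate_pool:  {problem_id: [{"ranks": [...], "surprisals": [...], "text": "..."}]}
# teacher_samples: {teacher_name: [(ranks, surprisals), ...]}   # e.g. ~200 samples each

def select_trajectories(candidate_pool):
    """Per problem, keep the candidate trajectory with the lowest RSR."""
    return {pid: min(cands, key=lambda c: trajectory_rsr(c["ranks"], c["surprisals"]))
            for pid, cands in candidate_pool.items()}

def select_teacher(teacher_samples):
    """Pick the teacher whose sampled dataset gets the lowest dataset-level RSR."""
    return min(teacher_samples, key=lambda name: dataset_rsr(teacher_samples[name]))
```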
The Secret Sauce
- Two knobs make RSR work well: (a) surprisal-weighted averaging so informative tokens count more, and (b) rank clipping so rare noise doesn't hijack the score.
Concrete Mini Walkthrough
- Suppose after "According to this formula, the mean is 32," the student's top next tokens are: "Therefore" (rank 2), "So" (rank 3), "Actually" (rank 8), while "Alternatively" is rank 11. If a good teacher solution uses mostly rank 2–4 tokens but overall feels fresh (moderately high surprisal), its RSR stays low. Another solution with very easy, boring tokens (low surprisal) or very off-pattern tokens (huge ranks) gets higher RSR.
🍞 Bottom Bread (Anchor) Think of RSR like a Goldilocks meter for study guides: when it reads low, you've found an explanation that's not too easy, not too hard, just right for learning quickly.
04 Experiments & Results
🍞 Top Bread (Hook) If you ran a school, you'd test whether your new way of picking practice problems actually boosts scores across different classes, not just one.
🥬 Filling (The Actual Concept)
- The Test: Do lower-RSR trajectories really lead to better reasoning after training?
- What they measured: Correlation between each metric's score on a teacher's data and the student's final Acc@4 on four math benchmarks (AIME'24, AIME'25, AMC'23, MATH500). They also tested practical selection: picking the best trajectory per problem and the best teacher per student.
The Competition (baselines)
- Teacher size or teacher performance
- Average token length
- Verified correctness and LLM-judged quality
- Probability-only metrics (Avg-Surprisal, local surprisal)
- Rank-only metrics (Avg-Rank)
- Gradient-based, student-specific metrics (G-Norm, GRACE) and influence scores
The Scoreboard (with context)
- Main finding: RSR's dataset-level score has an average Spearman correlation of 0.86 with post-training performance across five student models, higher than all compared metrics (most alternatives ≤ 0.59). A small sketch of how such a correlation is computed follows this list.
- Meaning: 0.86 is like getting an A+ while others are getting Bs. It means RSR rank-orders teacher datasets very similarly to how the students will actually perform after training.
- Trajectory selection: Using RSR to pick one trajectory per problem (from 33 candidates) led to the best average post-training scores across all five students, often matching or beating the best single teacher's dataset.
- Teacher selection under low data: With just 200 sampled trajectories per teacher, picking the teacher with the lowest RSR nearly matched the oracle (true best) teacher and beat other selection methods.
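For context on how a number like 0.86 is obtained, here is a small sketch of a Spearman rank correlation between a selection metric and post-training accuracy. The eleven values are made-up placeholders, not results from the paper; note also that raw RSR correlates negatively with performance (lower is better), so in practice you would negate it or compare the magnitude of rho.

```python
from scipy.stats import spearmanr

# Hypothetical numbers for 11 teacher datasets: a metric score per dataset
# (already oriented so that higher = "predicted better") and the student's
# Acc@4 after fine-tuning on each dataset.
metric_scores     = [0.81, 0.64, 0.72, 0.55, 0.90, 0.60, 0.77, 0.50, 0.68, 0.73, 0.58]
post_training_acc = [0.42, 0.31, 0.38, 0.25, 0.47, 0.29, 0.40, 0.22, 0.33, 0.37, 0.27]

rho, p_value = spearmanr(metric_scores, post_training_acc)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.4f})")
```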
Surprising Findings
- Bigger/stronger teachers weren't always best for every student; fit matters more than fame.
- Probability-only filters sometimes chose too-familiar data that didn't teach much; very-high-surprisal data was also bad. Middle-ground surprisal plus high relative rank was the sweet spot, just as RSR encodes.
- Stability tricks mattered: clipping ranks and using surprisal-weighted averages improved correlation dramatically versus naĆÆve token-level ratios.
Robustness Checks
- Ablations showed removing rank clipping or the weighting dropped correlation a lot.
- Changing the clipping threshold or using only 200 samples per teacher still kept results strong, showing RSR isn't brittle.
- On GPQA-Diamond (science multiple choice), models trained with RSR-selected math data still generalized better than baselines, suggesting broader reasoning gains.
🍞 Bottom Bread (Anchor) It's like ranking practice worksheets for five different classes and predicting which class will learn the most from which set. RSR's rankings lined up with the classes' final test scores far better than other ways of sorting the worksheets.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even great recipes need the right ingredients in the pantry.
🥬 Filling (The Actual Concept) Limitations
- Candidate quality bounds results: If every available trajectory is either too easy or too alien for a student, selection (even with RSR) can't create great data from thin air.
- Domain focus: The study centers on math reasoning; while early signs on GPQA are positive, code or other domains need more testing.
- Theory gap: RSR is intuitive and works well empirically, but a deeper theoretical derivation is still open.
Required Resources
- You need the student model to score tokens (one forward pass per trajectory) and a pool of teacher trajectories. That's much cheaper than running full fine-tunes to compare options.
When NOT to Use
- If you have no student access (can't run forward passes), you can't compute ranks/surprisals.
- If your data must be strictly verified-correct for safety-critical use, RSR alone (which prefers teachability) shouldn't replace correctness checks; combine them.
- If you already have perfectly curated, student-matched data, RSR adds less.
Open Questions
- Can RSR guide rewriting or synthesis, turning a too-hard trajectory into a just-right one?
- How does RSR behave in code, multimodal reasoning, or extremely long contexts?
- Can we connect RSR to a learning-theory optimum for sample efficiency?
🍞 Bottom Bread (Anchor) Think of RSR as a smart librarian. If the library only owns baby books or graduate textbooks, the librarian can't hand you the perfect middle-school book, but can still find the closest match fast.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Picture a coach who instantly knows which drill will help you improve next.
🥬 Filling (The Actual Concept) 3-Sentence Summary
- This paper introduces the Rank-Surprisal Ratio, a simple, student-specific metric that picks explanations which are both familiar enough to follow and surprising enough to teach.
- RSR strongly predicts who learns most from whom, across many teacher–student pairs, and reliably improves both trajectory and teacher selection.
- It's cheap to compute, robust with little data, and beats many popular alternatives in correlating with real post-training gains.
Main Achievement
- Turning the fuzzy idea of "just-right difficulty" into a single, actionable number that works across diverse models.
Future Directions
- Use RSR not just to select but to rewrite or synthesize better trajectories; test beyond math (e.g., code, science, multimodal); explore deeper theory linking RSR to optimal learning.
Why Remember This
- Because picking the right examples matters as much as picking the right teacher. RSR makes that choice simple, fast, and effective.
🍞 Bottom Bread (Anchor) It's the study-guide sorter that consistently hands you the practice page that will help you grow the most, today.
Practical Applications
- Filter teacher-generated solutions before training by keeping the lowest-RSR trajectories per problem.
- Rapidly compare multiple candidate teachers using only 200 sampled trajectories per teacher and pick the lowest-RSR one.
- Build a mixed dataset tailored to a specific student by selecting, per problem, the teacher trajectory with minimal RSR.
- Schedule curriculum difficulty by sorting trajectories from higher to lower RSR (hard-to-easy) or vice versa (easy-to-hard).
- Combine RSR with correctness checks in safety-critical settings: first verify answers, then choose among correct ones by lowest RSR (see the sketch after this list).
- Use RSR to monitor data quality drift over time and refresh training sets when RSR distributions worsen.
- Guide synthetic data generation: iteratively sample/adjust trajectories to reduce RSR for the target student.
- Allocate training budget by prioritizing problems whose pools contain at least one low-RSR trajectory.
- Diagnose teacher–student mismatch early by comparing dataset-level RSR across candidate teachers.
- Evaluate updates to a student model: if RSR drops on held-out teacher samples, the model is better aligned for learning.
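As a final illustration of the safety-minded recipe above (verify first, then pick by RSR), here is a small sketch; `is_correct` stands in for whatever answer verifier you already use, and `trajectory_rsr` is the toy scorer sketched earlier, so the details are assumptions rather than the paper's pipeline.

```python
def select_correct_then_teachable(candidates, is_correct):
    """Keep only verified-correct candidate trajectories, then pick the
    lowest-RSR one among them; returns None if nothing passes verification."""
    correct = [c for c in candidates if is_correct(c)]
    if not correct:
        return None
    return min(correct, key=lambda c: trajectory_rsr(c["ranks"], c["surprisals"]))
```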