The paper shows a fast, training-free way to boost an LLM’s step-by-step reasoning by reusing the token probabilities the model already computes as it generates.
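One way to picture "reusing the model's own probabilities" without any training is confidence-weighted voting over sampled reasoning chains. The sketch below is only an illustration of that general idea, not necessarily the paper's method; `sample_chain` is a hypothetical helper standing in for any API that returns a final answer plus per-token log-probabilities.

```python
import math
from collections import defaultdict

def sample_chain(model, prompt):
    """Hypothetical helper: return (final_answer, token_logprobs) for one
    sampled chain of thought; any API exposing per-token log-probs works."""
    raise NotImplementedError

def confidence_weighted_answer(model, prompt, n_samples=8):
    """Training-free aggregation: sample several reasoning chains and weight
    each candidate answer by the model's own confidence in the chain
    (exponential of the mean token log-probability)."""
    scores = defaultdict(float)
    for _ in range(n_samples):
        answer, logprobs = sample_chain(model, prompt)
        confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
        scores[answer] += confidence  # higher-confidence chains count more
    return max(scores, key=scores.get)
```

Weighting by the mean log-probability rather than the raw sequence probability keeps longer chains from being penalized just for their length.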
LLMs are usually trained by treating every question the same and giving each one the same number of tries, which wastes compute on easy problems and neglects hard ones.
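For contrast with that uniform budget, here is a minimal sketch of what a difficulty-aware allocation could look like; it is an illustration of the general idea, not necessarily what the paper proposes, and `solve_once` is a hypothetical single-attempt helper.

```python
def solve_once(model, question):
    """Hypothetical helper: run one attempt and return True if it is correct."""
    raise NotImplementedError

def allocate_attempts(model, questions, probe=4, extra_budget=64):
    """Spend a small probe budget on every question, then hand the remaining
    attempts to questions with intermediate pass rates, where extra tries are
    most informative (easy ones are already solved, hopeless ones rarely pay off)."""
    pass_rate = {q: sum(solve_once(model, q) for _ in range(probe)) / probe
                 for q in questions}
    # p * (1 - p) peaks at p = 0.5: neither trivially easy nor near-impossible.
    weight = {q: p * (1.0 - p) for q, p in pass_rate.items()}
    total = sum(weight.values()) or 1.0
    return {q: probe + round(extra_budget * w / total) for q, w in weight.items()}
```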
This paper teaches a model to be its own teacher so it can climb out of a learning plateau on very hard math problems.
The paper introduces Intervention Training (InT), a simple way for a language model to find and fix the first wrong step in its own reasoning using a short, targeted correction.
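Taken at face value, the description suggests a locate-and-patch loop. The sketch below shows one way such a loop could be wired up; the `first_incorrect_step`, `short_correction`, and `continue_from` helpers are hypothetical stand-ins, not the paper's actual components.

```python
def first_incorrect_step(steps):
    """Hypothetical verifier: index of the first wrong step, or None if all pass."""
    raise NotImplementedError

def short_correction(problem, prefix, wrong_step):
    """Hypothetical helper: write a brief, targeted correction for the wrong step."""
    raise NotImplementedError

def continue_from(model, problem, prefix, correction):
    """Hypothetical helper: regenerate the remaining steps after the correction."""
    raise NotImplementedError

def intervene(model, problem, steps, max_rounds=3):
    """Find the first wrong step, splice in a short correction, and let the
    model redo the reasoning from that point; repeat until the chain verifies."""
    for _ in range(max_rounds):
        bad = first_incorrect_step(steps)
        if bad is None:
            return steps  # every step now checks out
        prefix = steps[:bad]
        fix = short_correction(problem, prefix, steps[bad])
        steps = prefix + [fix] + continue_from(model, problem, prefix, fix)
    return steps
```

The loop keeps the verified prefix intact, so each round only pays for regenerating the part after the first mistake.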
DARC teaches big language models to get smarter by splitting training into two separate, more stable stages instead of one chaotic loop.
The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.
The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.
This paper teaches large language models (LLMs) to explore smarter by listening to their own gradients (the directions in which their parameters would update) rather than chasing random variety.
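One reading of "listening to their own gradients" is to rank candidate rollouts by how strongly they would update the policy, rather than by an entropy bonus. The PyTorch sketch below scores each candidate by the norm of its policy-gradient contribution and keeps the most informative ones; this is an illustrative interpretation, not the paper's algorithm, and `logprob_fn` is an assumed function returning a differentiable log-probability.

```python
import torch

def gradient_signal(policy, logprob_fn, candidate, advantage):
    """Size of the update this candidate would cause: the norm of the gradient
    of -(advantage * log pi(candidate)) with respect to the policy parameters."""
    policy.zero_grad()
    loss = -advantage * logprob_fn(policy, candidate)
    loss.backward()
    total = 0.0
    for p in policy.parameters():
        if p.grad is not None:
            total += float((p.grad ** 2).sum())
    return total ** 0.5

def pick_informative(policy, logprob_fn, candidates, advantages, k=4):
    """Keep the k candidates whose gradients would move the policy the most,
    instead of adding variety through an entropy bonus."""
    scored = [(gradient_signal(policy, logprob_fn, c, a), i)
              for i, (c, a) in enumerate(zip(candidates, advantages))]
    scored.sort(reverse=True)
    return [candidates[i] for _, i in scored[:k]]
```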
Reasoning tokens (the words a model writes before its final answer) help the model think better, but they are not a trustworthy diary of how it really thought.