Large language models (LLMs) don’t act as a single brain; inside, each layer and module makes its own small decisions, which the paper calls internal policies.

#Bottom-up Policy Optimization #internal layer policy #internal modular policy

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Beginner

Charlie Zhang, Graham Neubig et al. · Dec 8 · arXiv

This paper asks when reinforcement learning (RL) actually improves a language model's reasoning beyond what it learned in pre-training.

#edge of competence #process-verified evaluation #process-level rewards

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Beginner

Tong Wu, Yang Liu et al. · Dec 8 · arXiv

This paper teaches a language model to think along several paths at the same time instead of one step after another.

#parallel reasoning #reinforcement learning for LLMs #self-distillation