Papers8

#mathematical reasoning

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.

#code agents#multi-agent systems#mathematical reasoning

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Intermediate

Guobin Shen, Chenxiao Zhao et al.Feb 11arXiv

VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.

#VESPO#off-policy reinforcement learning#importance sampling

Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

Intermediate

Weihao Liu, Dehai Min et al.Feb 10arXiv

The paper introduces LT-Tuning, a way for AI models to “think silently” using special hidden tokens instead of writing every step out loud.

#latent tokens#chain-of-thought#context-prediction fusion

Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

Intermediate

Zehao Chen, Gongxun Li et al.Feb 9arXiv

Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.

#weak-driven learning#logit mixing#weak agents

Beyond Imitation: Reinforcement Learning for Active Latent Planning

Intermediate

Zhi Zheng, Wee Sun LeeJan 29arXiv

The paper shows how to make AI think faster and smarter by planning in a hidden space instead of writing long step-by-step sentences.

#latent reasoning#chain-of-thought#variational autoencoder

Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Intermediate

Wei Du, Shubham Toshniwal et al.Dec 17arXiv

Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions created in three thinking styles and with or without Python help.

#mathematical reasoning#long-context fine-tuning#multi-mode supervision

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Intermediate

Zhenpeng Su, Leiyu Pan et al.Dec 5arXiv

Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.

#reinforcement learning#PPO-clip#KL penalty

Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Intermediate

Monishwaran Maheswaran, Rishabh Tiwari et al.Dec 4arXiv

ARBITRAGE makes AI solve step-by-step problems faster by only using the big, slow model when it is predicted to truly help.

#speculative decoding#step-level speculative decoding#advantage-aware routing