How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (15)

#policy optimization

Experiential Reinforcement Learning

Intermediate
Taiwei Shi, Sihao Chen et al. · Feb 15 · arXiv

This paper teaches AI models to learn like good students: try, think about what went wrong, fix it, and remember the fix.

#Experiential Reinforcement Learning · #self-reflection · #distillation

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Intermediate
Shuo He, Lang Feng et al. · Feb 11 · arXiv

Training big language models with reinforcement learning can wobble because the per-token importance-sampling (IS) ratios swing wildly.

#Kalman filter · #importance sampling ratio · #policy optimization
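A tiny sketch (my own illustration, not the paper's code) of the instability it targets: the per-token importance-sampling ratio is pi_new(token) / pi_old(token), and a token that was rare under the old policy but became likely under the new one produces a huge ratio. The log-prob values below are made up for illustration.

```python
# Per-token importance-sampling (IS) ratios from per-token log-probs.
# Ratios far from 1 mean the policy moved a lot on that token, which
# makes naive RL updates swing wildly.
import math

def per_token_is_ratios(logp_new, logp_old):
    """r_t = pi_new(t) / pi_old(t), computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

old = [-0.1, -2.0, -8.0]  # last token was very rare under the old policy
new = [-0.1, -1.9, -3.0]  # ...and became much likelier after an update

print(per_token_is_ratios(new, old))  # the rare token's ratio explodes
```

The Kalman-filter idea in the paper is about smoothing exactly these noisy per-token signals over training.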

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

Intermediate
Zhiqi Yu, Zhangquan Chen et al. · Feb 5 · arXiv

The paper finds a hidden symmetry inside GRPO’s advantage calculation that accidentally keeps models from exploring new good answers and from shifting their focus between easy and hard problems at the right time.

#GRPO · #GRAE · #A-GRAE

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Intermediate
Fanfan Liu, Youyang Yin et al. · Feb 5 · arXiv

The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.

#LUSPO · #RLVR · #GRPO

On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

Intermediate
Shumin Wang, Yuexiang Xie et al. · Feb 3 · arXiv

The paper builds a simple, math-light rule to predict whether training makes a language model more open-minded (higher entropy) or more sure of itself (lower entropy).

#reinforcement fine-tuning · #entropy dynamics · #GRPO

Self-Hinting Language Models Enhance Reinforcement Learning

Intermediate
Baohao Liao, Hanze Dong et al. · Feb 3 · arXiv

When rewards are rare, a popular training method for language models (GRPO) often stops learning because every try in a group gets the same score, so there is nothing to compare.

#reinforcement learning · #GRPO · #self-hinting
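A minimal sketch (my own toy, not the paper's code) of why sparse rewards stall GRPO: advantages are computed relative to the group, so when every sample in a group gets the same reward, all advantages are zero and there is no gradient signal.

```python
# GRPO-style group-relative advantages: mean-center each group's
# rewards and scale by the group's standard deviation.
import statistics

def group_relative_advantages(rewards):
    """Return per-sample advantages within one group.

    If every reward in the group is identical (e.g. all failures
    under a sparse reward), the advantages all collapse to zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # degenerate group: nothing to compare
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([0, 0, 0, 0]))  # all failures -> no signal
print(group_relative_advantages([1, 0, 0, 0]))  # one success -> useful signal
```

Self-hinting, as the summary describes it, is a way to make groups like the first one rarer.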

LatentMem: Customizing Latent Memory for Multi-Agent Systems

Intermediate
Muxin Fu, Guibin Zhang et al. · Feb 3 · arXiv

LatentMem is a new memory system that helps teams of AI agents remember the right things for their specific jobs without overloading them with text.

#multi-agent systems · #latent memory · #role-aware memory

Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Intermediate
Xiao Liang, Zhong-Zhi Li et al. · Feb 2 · arXiv

The paper trains language models to solve hard problems by first breaking them into smaller parts and then solving those parts, instead of only thinking in one long chain.

#divide-and-conquer reasoning · #chain-of-thought · #reinforcement learning

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Intermediate
Yu Wang, Yi Wang et al. · Jan 15 · arXiv

Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.

#socio-semantic segmentation · #vision-language model · #reinforcement learning

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Intermediate
Zhiyuan Hu, Yucheng Wang et al. · Jan 13 · arXiv

The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.

#Uniqueness-Aware Reinforcement Learning · #LLM reasoning · #strategy clustering

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Intermediate
Shih-Yang Liu, Xin Dong et al. · Jan 8 · arXiv

When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.

#GDPO · #GRPO · #multi-reward reinforcement learning
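A small sketch (my own toy example, not the paper's algorithm) of the squashing problem: if you sum the reward channels first and then do one group normalization, two samples with different reward mixes but the same total become indistinguishable, while normalizing each channel separately (the "decoupled" idea in the title) keeps them apart. The channel names below are made up for illustration.

```python
# Coupled vs. decoupled normalization of two reward channels.
import statistics

def normalize(xs):
    """GRPO-style group normalization: mean-center and scale by std."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [0.0 if std == 0 else (x - mean) / std for x in xs]

# Two hypothetical reward channels for four samples in one group:
fmt = [1, 0, 1, 0]   # e.g. "followed the format"
corr = [0, 1, 0, 0]  # e.g. "answer was correct"

# Coupled: sum the channels, normalize once. Samples 0 and 1 have
# different reward mixes but the same total, so they get identical
# advantages -- the mixes are squashed together.
coupled = normalize([f + c for f, c in zip(fmt, corr)])

# Decoupled: normalize each channel separately, then combine.
# Samples 0 and 1 now receive different signals.
decoupled = [a + b for a, b in zip(normalize(fmt), normalize(corr))]

print(coupled[0] == coupled[1])      # True: training signal confused
print(decoupled[0] == decoupled[1])  # False: mixes stay distinct
```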

ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

Intermediate
Muyang Zhao, Qi Qi et al. · Jan 7 · arXiv

The paper teaches AI models to plan their thinking time like a smart test-taker who has to finish several questions before the bell rings.

#meta-cognition · #budgeted reasoning · #token budget