This paper tackles a common problem in reasoning models called Lazy Reasoning, where the model rambles through its chain of thought instead of committing to a clear plan.
When a model is trained on several rewards at once, the popular GRPO method can accidentally collapse different mixes of reward components into one and the same learning signal, which muddles training.
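To make that collapse concrete, here is a minimal numerical sketch. It assumes the usual GRPO recipe of summing reward components into a single scalar and then normalizing within a sampled group; the component names (correctness, plan quality) and the numbers are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical per-completion reward components for one GRPO group of 4 samples.
# Column 0 = correctness reward, column 1 = plan-quality reward (illustrative only).
rewards = np.array([
    [1.0, 0.0],   # correct answer, no real plan
    [0.0, 1.0],   # wrong answer, but a well-structured plan
    [0.5, 0.5],   # partially correct, partial plan
    [0.0, 0.0],   # neither
])

# Scalarize by summing the components, then normalize within the group
# (the group-relative advantage used by GRPO).
total = rewards.sum(axis=1)                        # -> [1.0, 1.0, 1.0, 0.0]
advantage = (total - total.mean()) / (total.std() + 1e-8)

print(total)      # the first three completions collapse to the same total reward...
print(advantage)  # ...so they all receive the identical advantage, i.e. the same learning signal
```

In this toy group, three completions with very different behavior end up indistinguishable once the rewards are summed and normalized, which is the kind of squashing the sentence above describes.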