Papers18

#policy optimization

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Zhiyuan Hu, Yucheng Wang et al.Jan 13arXiv

The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.

#Uniqueness-Aware Reinforcement Learning#LLM reasoning#strategy clustering

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Intermediate

Shih-Yang Liu, Xin Dong et al.Jan 8arXiv

When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.

#GDPO#GRPO#multi-reward reinforcement learning

ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

Intermediate

Muyang Zhao, Qi Qi et al.Jan 7arXiv

The paper teaches AI models to plan their thinking time like a smart test-taker who has to finish several questions before the bell rings.

#meta-cognition#budgeted reasoning#token budget

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Intermediate

Jing Tan, Zhaoyang Zhang et al.Jan 5arXiv

Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.

#text-guided image editing#object-level transformation#reinforcement learning

VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Intermediate

Xinyao Liao, Qiyuan He et al.Dec 22arXiv

Autoregressive (AR) image models make pictures by choosing tokens one-by-one, but they were judged only on picking likely tokens, not on how good the final picture looks in pixels.

#autoregressive image generation#tokenizer–generator alignment#pixel-space reconstruction

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Intermediate

Changpeng Yang, Jinyang Wu et al.Dec 2arXiv

This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.

#Curriculum Advantage Policy Optimization#advantage-based RL#imitation learning

1 2