How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (160)

#reinforcement learning

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

Intermediate
Weizhou Shen, Ziyi Yang et al. · Dec 15 · arXiv

QwenLong-L1.5 is a training recipe that helps AI read and reason over very long documents by improving the data it learns from, the way it is trained, and how it remembers important information.

#long-context reasoning · #reinforcement learning · #GRPO

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Intermediate
Zhenyang Cai, Jiaming Zhang et al. · Dec 12 · arXiv

DentalGPT is a special AI that looks at dental images and text together and explains what it sees like a junior dentist.

#DentalGPT · #multimodal large language model · #dentistry AI

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Intermediate
Songyang Gao, Yuzhe Gu et al. · Dec 11 · arXiv

This paper builds a math problem–solving agent, Intern-S1-MO, that thinks in multiple rounds and remembers proven mini-results called lemmas so it can solve very long, Olympiad-level problems.

#long-horizon reasoning · #lemma-based memory · #multi-agent reasoning
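The lemma memory described above can be sketched in a few lines. This is a hypothetical illustration of the general idea, not the paper's implementation: names like `LemmaMemory` and `solve_round` are my own, and the "proof" is a placeholder string.

```python
# Hypothetical sketch of lemma-based memory: the agent proves small
# intermediate results ("lemmas") once and reuses them in later rounds.
# Class and function names are illustrative, not from the paper.

class LemmaMemory:
    """Stores proven lemmas so later reasoning rounds can reuse them."""

    def __init__(self):
        self._proven = {}  # statement -> proof sketch

    def add(self, statement, proof):
        self._proven[statement] = proof

    def lookup(self, statement):
        return self._proven.get(statement)

def solve_round(goal, memory):
    """One reasoning round: reuse a cached lemma if available,
    otherwise 'prove' it and cache the result."""
    cached = memory.lookup(goal)
    if cached is not None:
        return cached, True           # reused: no new proving work
    proof = f"proof of {goal}"        # placeholder for a real prover call
    memory.add(goal, proof)
    return proof, False

memory = LemmaMemory()
_, reused_first = solve_round("sum of angles = 180", memory)
_, reused_second = solve_round("sum of angles = 180", memory)
print(reused_first, reused_second)  # False True
```

The point is that over a very long problem, later rounds get cheaper because proven sub-results never have to be re-derived.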

Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Intermediate
Haiteng Zhao, Junhao Shen et al. · Dec 11 · arXiv

This paper builds InternGeometry, a large language model agent that solves Olympiad-level geometry by talking to a math engine, remembering what worked, and trying smart new ideas.

#InternGeometry · #geometry theorem proving · #auxiliary constructions

MOA: Multi-Objective Alignment for Role-Playing Agents

Intermediate
Chonghua Liao, Ke Wang et al. · Dec 10 · arXiv

Role-playing agents need to juggle several goals at once, like staying in character, following instructions, and using the right tone; MOA trains them to balance these objectives together.

#multi-objective alignment · #role-playing agents · #reinforcement learning

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

Intermediate
Mengxi Xiao, Kailai Yang et al. · Dec 10 · arXiv

MentraSuite is a complete toolkit that teaches large language models (LLMs) to reason about mental health step by step, not just sound caring.

#mental health reasoning · #LLM post-training · #supervised fine-tuning

Rethinking Chain-of-Thought Reasoning for Videos

Intermediate
Yiwu Zhong, Zi-Yuan Hu et al. · Dec 10 · arXiv

The paper shows that video AIs do not need long, human-like chains of thought to reason well.

#video reasoning · #chain-of-thought · #concise reasoning

Learning Unmasking Policies for Diffusion Language Models

Intermediate
Metod Jazbec, Theo X. Olausson et al. · Dec 9 · arXiv

Diffusion language models write by gradually unmasking hidden words, so deciding which blanks to reveal next is a big deal for both speed and accuracy.

#diffusion language models · #masked diffusion · #unmasking policy
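To make "deciding which blanks to reveal next" concrete, here is a minimal sketch of one common baseline policy: reveal the masked position whose most likely token has the highest probability. This is an assumed illustration of the setting, not the learned policy the paper trains.

```python
# Minimal sketch of a confidence-based unmasking step for a masked
# diffusion language model (illustrative; not the paper's learned policy).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def confidence_unmask(tokens, logits_per_pos, vocab):
    """tokens: list with None at masked positions.
    logits_per_pos: {position: logits over vocab} for each masked position.
    Reveals the single most confident masked position in place."""
    best_pos, best_prob, best_tok = None, -1.0, None
    for pos, logits in logits_per_pos.items():
        probs = softmax(logits)
        p = max(probs)
        if p > best_prob:
            best_pos, best_prob = pos, p
            best_tok = vocab[probs.index(p)]
    tokens[best_pos] = best_tok
    return best_pos

vocab = ["cat", "sat", "mat"]
tokens = ["the", None, "on", "the", None]
logits = {1: [0.1, 2.0, 0.1],   # model is confident this blank is "sat"
          4: [0.5, 0.4, 0.6]}   # model is still uncertain here
pos = confidence_unmask(tokens, logits, vocab)
print(pos, tokens)  # reveals position 1 first, leaves position 4 masked
```

The paper's contribution is learning this decision instead of hard-coding it, since a fixed heuristic like the one above can reveal tokens in an order that hurts later predictions.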

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Intermediate
Zheng Ding, Weirui Ye · Dec 9 · arXiv

TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.

#TreeGRPO · #reinforcement learning · #diffusion models
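The "many learning signals from one run" idea can be sketched as follows, assuming rollouts that share a prefix form a tree and each branch is scored against its siblings. All names here are mine; this is a toy reading of the branching idea, not the paper's algorithm.

```python
# Toy sketch: rollouts sharing a prefix form a tree, and each branch
# gets an advantage relative to its siblings, so one tree produces a
# learning signal at every branch point (illustrative names only).

def branch_advantages(tree):
    """tree: {prefix: {branch_name: reward}}.
    Returns per-branch advantages (reward minus sibling mean)."""
    advantages = {}
    for prefix, branches in tree.items():
        mean_r = sum(branches.values()) / len(branches)
        for branch, r in branches.items():
            advantages[(prefix, branch)] = r - mean_r
    return advantages

tree = {
    "root": {"a": 1.0, "b": 0.0},       # two branches from the start
    "root/a": {"a1": 2.0, "a2": 0.0},   # branch "a" splits again
}
adv = branch_advantages(tree)
print(adv[("root", "a")])       # 0.5
print(adv[("root/a", "a2")])    # -1.0
```

A flat GRPO-style run would yield one advantage per full trajectory; the tree version above yields one per branch, which is why each training run carries more signal.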

Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

Intermediate
Ming Chen, Sheng Tang et al. · Dec 6 · arXiv

The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.

#decoding-based regression · #sequence-level reward · #reinforcement learning
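A toy contrast makes the summary concrete: grade each digit separately, or decode the whole number and grade it once. The reward shapes below are my own illustrative choices, not the paper's exact functions.

```python
# Toy contrast of the two grading schemes for a model that emits a
# number digit by digit (reward definitions are illustrative).

def per_digit_reward(pred_digits, target_digits):
    """Token-level supervision: fraction of exactly matching digits."""
    hits = sum(p == t for p, t in zip(pred_digits, target_digits))
    return hits / len(target_digits)

def sequence_reward(pred_digits, target_digits):
    """Sequence-level reward: score the whole decoded number at once."""
    pred = int("".join(pred_digits))
    target = int("".join(target_digits))
    return 1.0 / (1.0 + abs(pred - target))

# Predicting "199" for target "200": numerically almost perfect,
# yet not a single digit matches.
print(per_digit_reward(list("199"), list("200")))  # 0.0
print(sequence_reward(list("199"), list("200")))   # 0.5
```

This is the failure mode the paper targets: per-digit grading calls a near-perfect answer completely wrong, while whole-number grading rewards it appropriately.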

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Intermediate
Hongyu Li, Manyuan Zhang et al. · Dec 5 · arXiv

EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.

#instruction-based image editing · #iterative reasoning · #multimodal large language model

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Intermediate
Zhenpeng Su, Leiyu Pan et al. · Dec 5 · arXiv

Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the "safe zone," causing unstable learning; entropy ratio clipping adds a soft global constraint that keeps updates near that zone.

#reinforcement learning · #PPO-clip · #KL penalty
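One way to read the summary is as a trust band on how much the policy's entropy may drift in a single update. The sketch below is a hedged toy version under that assumption: the band limits and the all-or-nothing damping rule are my own illustrative choices, not the paper's method.

```python
# Hedged toy sketch: treat the ratio of new-policy entropy to old-policy
# entropy as a global health signal, and suppress the update when it
# drifts outside a trust band (band and damping rule are illustrative).
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def clipped_step(update, old_probs, new_probs, low=0.8, high=1.2):
    """Apply the update only if the entropy ratio stays in [low, high]."""
    ratio = entropy(new_probs) / entropy(old_probs)
    if low <= ratio <= high:
        return update          # inside the safe zone: apply as-is
    return 0.0                 # outside: suppress the off-policy step

old = [0.25, 0.25, 0.25, 0.25]          # high-entropy old policy
collapsed = [0.97, 0.01, 0.01, 0.01]    # sudden entropy collapse
print(clipped_step(1.0, old, old))        # 1.0
print(clipped_step(1.0, old, collapsed))  # 0.0
```

Unlike PPO's per-token ratio clip, a constraint like this acts on one global statistic of the whole policy, which is what "soft global constraint" suggests.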