Papers (22)

#GRPO

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Chris Samarinas, Haw-Shiuan Chang et al. · Feb 26 · arXiv

SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.

#retrieval-augmented reasoning #reinforcement learning #GRPO

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Beginner

Emre Can Acikgoz, Cheng Qian et al. · Feb 24 · arXiv

Tool-R0 teaches a language model to use software tools (like APIs) with zero human-made training data.

#self-play reinforcement learning #tool calling #function calling

iGRPO: Self-Feedback-Driven LLM Reasoning

Beginner

Ali Hatamizadeh, Shrimai Prabhumoye et al. · Feb 9 · arXiv

This paper teaches a language model to improve its own math answers by first writing several drafts and then learning to beat its best draft.

#iGRPO #GRPO #Reinforcement Learning

SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Beginner

Jinyang Wu, Changpeng Yang et al. · Jan 30 · arXiv

Most reinforcement learning agents only get a simple pass/fail reward, which hides how good or bad their attempts really were.

#Sweet Spot Learning #tiered rewards #reinforcement learning with verifiable rewards

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Beginner

Yiju Guo, Tianyi Hu et al. · Jan 29 · arXiv

This paper shows that many reasoning failures in AI are caused by just a few distracting words in the prompt, not because the problems are too hard.

#LENS #Interference Tokens #Reinforcement Learning with Verifiable Rewards

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Beginner

Zhuoran Yang, Ed Li et al. · Jan 28 · arXiv

This paper introduces Foundation-Sec-8B-Reasoning, a small (8-billion-parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.

#native reasoning #cybersecurity LLM #chain-of-thought

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Beginner

Zhihan Liu, Lin Guan et al. · Jan 26 · arXiv

LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.

#cross-domain generalization #state information richness #planning complexity

Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Beginner

Kunat Pipatanakul, Pittawat Taveekitworachai · Jan 26 · arXiv

Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on small budgets.

#Typhoon-S #on-policy distillation #full-logits distillation

Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Beginner

Zhitao He, Zongwei Lyu et al. · Jan 22 · arXiv

Academic rebuttals are not just about being polite; they are about smart, strategic persuasion under hidden information.

#academic rebuttal #theory of mind #strategic persuasion

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

Beginner

Zhiwei Zhang, Fei Zhao et al. · Jan 22 · arXiv

Small AI models often stumble when a tool call fails and then get stuck repeating bad calls instead of fixing the mistake.

#FISSION-GRPO #error recovery #tool use

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Beginner

Zanlin Ni, Shenzhi Wang et al. · Jan 21 · arXiv

Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.

#diffusion language model #arbitrary order generation #autoregressive training

Think3D: Thinking with Space for Spatial Reasoning

Beginner

Zaibin Zhang, Yuhan Wu et al. · Jan 19 · arXiv

Think3D lets AI models stop guessing from flat pictures and start exploring real 3D space, like walking around a room in a video game.

#Think3D #spatial reasoning #3D reconstruction
