Papers (44)

#Reinforcement Learning

Efficient RLVR Training via Weighted Mutual Information Data Selection

Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.

#Reinforcement Learning #RLVR #Data Selection

Learn Hard Problems During RL with Reference Guided Fine-tuning

Intermediate

Yangzhen Wu, Shanda Li et al. · Mar 1 · arXiv

ReGFT is a simple pre-RL step that shows the model partial human hints, then makes it solve problems in its own words, creating correct, model-style solutions for hard questions.

#Reference-Guided Fine-Tuning #ReGFT #ReFT

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Intermediate

Hongrui Jia, Chaoya Jiang et al. · Feb 26 · arXiv

Large multimodal models (LMMs) can look at pictures and read text, but they still miss tricky cases, like tiny chart labels or multi-step math.

#Large Multimodal Models #Diagnostic-driven Progressive Evolution #Reinforcement Learning

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

Intermediate

GigaBrain Team, Boyuan Wang et al. · Feb 12 · arXiv

GigaBrain-0.5M* is a robot brain that sees, reads, and acts, and it gets smarter by imagining the future before moving.

#Vision-Language-Action #World Model #Reinforcement Learning

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Intermediate

Futing Wang, Jianhao Yan et al. · Feb 12 · arXiv

The paper teaches language models to explore more ideas while thinking, so they can solve harder problems.

#In-Context Exploration #Test-Time Scaling #Chain-of-Thought

RISE: Self-Improving Robot Policy with Compositional World Model

Intermediate

Jiazhi Yang, Kunyang Lin et al. · Feb 11 · arXiv

RISE lets a robot learn safely and cheaply by practicing in its imagination instead of always in the real world.

#Reinforcement Learning #World Models #Compositional World Model

QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search

Intermediate

Jianzhao Huang, Xiaorui Huang et al. · Feb 10 · arXiv

Search engines on social apps used to rely on many separate mini-models that often misunderstood slang and were hard to keep updated.

#Query Processing #Unified Generative Model #Named Entity Recognition

iGRPO: Self-Feedback-Driven LLM Reasoning

Beginner

Ali Hatamizadeh, Shrimai Prabhumoye et al. · Feb 9 · arXiv

This paper teaches a language model to improve its own math answers by first writing several drafts and then learning to beat its best draft.

#iGRPO #GRPO #Reinforcement Learning

Rethinking the Trust Region in LLM Reinforcement Learning

Intermediate

Penghui Qi, Xiangxin Zhou et al. · Feb 4 · arXiv

The paper shows that the popular PPO method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.

#Reinforcement Learning #Proximal Policy Optimization #Trust Region

Privileged Information Distillation for Language Models

Intermediate

Emiliano Penaloza, Dheeraj Vattikonda et al. · Feb 4 · arXiv

The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.

#Privileged Information #Knowledge Distillation #π-Distill

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Intermediate

Zhiyuan Yao, Yi-Kai Zhang et al. · Feb 3 · arXiv

Large language models learn better when we spend more practice time on the right questions at the right moments.

#Reinforcement Learning #RLVR #GRPO

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Intermediate

Dylan Zhang, Yufeng Xu et al. · Feb 1 · arXiv

The paper shows that a model that looks great after supervised fine-tuning (SFT) can end up worse after the same reinforcement learning (RL) training than a model that looked weaker after SFT.

#Supervised Fine-Tuning #Reinforcement Learning #Distribution Mismatch
