How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)


Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Intermediate
Shyam Sundhar Ramesh, Xiaotong Ji et al. · Feb 5 · arXiv

Large language models are usually trained to get good at one kind of reasoning, but real life needs them to be good at many things at once.

#Multi-Task Learning · #GRPO · #Reinforcement Learning Post-Training

Your Group-Relative Advantage Is Biased

Intermediate
Fengkai Yang, Zherui Chen et al. · Jan 13 · arXiv

Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.

#Reinforcement Learning from Verifier Rewards · #GRPO · #GSPO
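
To make the baseline claim concrete, here is a small NumPy sketch (an illustration written for this summary, not code from the paper) of the group-relative advantage computed against the group mean, next to a leave-one-out baseline for comparison. GRPO implementations usually also divide by the group's reward standard deviation; that step is left out here so the effect of the baseline is easy to see.

```python
import numpy as np

def group_mean_advantages(rewards):
    # GRPO-style: subtract the mean reward of the whole group.
    # The baseline includes each sample's own reward, so baseline and reward
    # are correlated -- the source of the bias the paper points out.
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def leave_one_out_advantages(rewards):
    # Comparison baseline: for each sample, average only the *other* samples
    # in the group (shown purely for contrast, not as the paper's fix).
    r = np.asarray(rewards, dtype=float)
    loo_mean = (r.sum() - r) / (len(r) - 1)
    return r - loo_mean

# Toy group: 4 completions for one prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_mean_advantages(rewards))     # [ 0.5 -0.5 -0.5  0.5]
print(leave_one_out_advantages(rewards))  # roughly [ 0.67 -0.67 -0.67  0.67]
```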

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Intermediate
Jiangshan Duo, Hanyu Li et al. · Jan 13 · arXiv

JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.

#RLVR · #judge-then-generate · #discriminative supervision
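
As a rough illustration of the "judge first" idea, the sketch below turns verifier-checked question/answer pairs into discriminative training examples, where the model must label a candidate answer instead of producing one. The prompt template, labels, and toy verifier are invented for this sketch; JudgeRLVR's actual formats and objectives may differ.

```python
def toy_verifier(question: str, answer: str) -> bool:
    # Toy verifier for arithmetic questions like "What is 2+3?".
    expression = question.removeprefix("What is ").removesuffix("?")
    return str(eval(expression)) == answer.strip()

def make_judge_example(question: str, candidate: str) -> dict:
    # Discriminative supervision: the training target is a verdict, not an answer.
    label = "correct" if toy_verifier(question, candidate) else "incorrect"
    prompt = (f"Question: {question}\n"
              f"Candidate answer: {candidate}\n"
              f"Is the candidate correct?")
    return {"prompt": prompt, "target": label}

print(make_judge_example("What is 2+3?", "5"))  # target: "correct"
print(make_judge_example("What is 2+3?", "6"))  # target: "incorrect"
```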

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

Intermediate
Sunzhu Li, Jiale Zhao et al. · Jan 13 · arXiv

RubricHub is a huge collection of about 110,000 detailed grading guides (rubrics) covering many kinds of questions, like health, science, writing, and chat.

#RubricHub · #coarse-to-fine rubric generation · #multi-model aggregation
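
To show what one of these "detailed grading guides" can look like in practice, here is a small hypothetical rubric entry and scorer. The field names and weights are invented for illustration; RubricHub's actual schema is described in the paper.

```python
# Hypothetical rubric entry -- not RubricHub's real format.
rubric = {
    "question": "Explain why antibiotics do not work against viral infections.",
    "criteria": [
        {"description": "Says antibiotics target bacterial structures or processes", "weight": 0.4},
        {"description": "Says viruses lack those targets and replicate inside host cells", "weight": 0.4},
        {"description": "Mentions a consequence of misuse, e.g. antibiotic resistance", "weight": 0.2},
    ],
}

def rubric_score(criterion_hits: list[bool], rubric: dict) -> float:
    # Weighted share of criteria the answer satisfies
    # (the hits would come from a human or an LLM grader).
    return sum(c["weight"] for c, hit in zip(rubric["criteria"], criterion_hits) if hit)

print(rubric_score([True, True, False], rubric))  # 0.8
```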

Evaluating Parameter Efficient Methods for RLVR

Intermediate
Qingyu Yin, Yulun Wu et al. · Dec 29 · arXiv

The paper asks which small, add-on training tricks (PEFT) work best when we teach language models with yes/no rewards we can check (RLVR).

#RLVR · #parameter-efficient fine-tuning · #LoRA
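
Since LoRA is the best-known of these add-on methods, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer, written as a generic illustration rather than the exact configuration the paper evaluates: the base weights are frozen and only a small low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8*512 + 512*8 = 8192 trainable parameters vs. ~262k in the frozen base layer
```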

Rethinking Expert Trajectory Utilization in LLM Post-training

Intermediate
Bowen Ding, Yuhan Chen et al. · Dec 12 · arXiv

The paper asks how to best use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.

#Supervised Fine-Tuning · #Reinforcement Learning · #Expert Trajectories

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Intermediate
Yiwen Tang, Zoey Guo et al. · Dec 11 · arXiv

This paper asks whether reinforcement learning (RL) can improve turning text into 3D models, and shows that the answer is yes if the training and rewards are designed carefully.

#Reinforcement Learning · #Text-to-3D Generation · #Hi-GRPO

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Beginner
Tong Wu, Yang Liu et al. · Dec 8 · arXiv

This paper teaches a language model to think along several paths at the same time instead of one step after another.

#parallel reasoning · #reinforcement learning for LLMs · #self-distillation
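
For contrast, here is a generic sketch of the simpler baseline idea of running several reasoning paths side by side and voting on the answer; the sampler is a stub, and the paper goes further by training the model itself to reason in parallel rather than orchestrating parallelism from outside.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sample_reasoning_path(question: str, seed: int) -> str:
    # Stub standing in for one sampled chain of thought plus its final answer.
    return "41" if seed % 4 == 3 else "42"   # most paths agree, a couple go astray

def parallel_answer(question: str, n_paths: int = 8) -> str:
    # Run several independent reasoning paths at the same time, then majority-vote.
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        answers = list(pool.map(lambda s: sample_reasoning_path(question, s), range(n_paths)))
    return Counter(answers).most_common(1)[0][0]

print(parallel_answer("What is 6 * 7?"))  # "42" -- six paths say 42, two say 41
```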

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Intermediate
Zhenpeng Su, Leiyu Pan et al. · Dec 5 · arXiv

Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.

#reinforcement learning · #PPO-clip · #KL penalty
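
The "safe zone" here is usually enforced per token with the PPO clipped surrogate, sketched below in NumPy as a generic illustration; the paper's entropy ratio clipping is a different mechanism (per its title, a soft global constraint) and is not reproduced here.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps: float = 0.2):
    # Standard PPO-clip surrogate: cap how far the new/old probability ratio
    # can move from 1 before its gradient contribution is cut off.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    adv = np.asarray(advantages, dtype=float)
    return float(np.minimum(ratio * adv, clipped * adv).mean())

# The first token has drifted far off-policy (ratio ~ 2.7), so its positive
# advantage is capped at 1.2x; the second token stays inside the clip range.
print(ppo_clip_objective(logp_new=[-0.5, -2.0], logp_old=[-1.5, -2.1], advantages=[1.0, -1.0]))
```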