How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (14)


CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Intermediate
Jinpeng Chen, Cheng Gong et al. · Mar 2 · arXiv

CoVe is a way to create training conversations for AI agents that use tools, while guaranteeing the conversations are both challenging and correct.

#constraint-guided verification · #multi-turn tool use · #user simulator

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Intermediate
Zhongwei Wan, Yun Shen et al. · Feb 23 · arXiv

LLMs trained with simple rewards often latch onto just a few ways of solving problems and stop exploring, which hurts their ability to find other correct answers.

#DSDR · #dual-scale diversity · #RLVR

Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

Intermediate
Xiaotong Ji, Rasul Tutunov et al. · Jan 29 · arXiv

The paper shows a fast, training-free way to boost an LLM's step-by-step reasoning by smartly reusing the model's own probabilities.

#power distribution sampling · #distribution sharpening · #low-temperature sampling
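The blurb is terse, so here is a minimal sketch of the core identity behind power sampling (names like `power_sample` and `alpha` are illustrative, not the paper's API): for a softmax next-token policy, raising probabilities to a power α and renormalizing is exactly low-temperature sampling with T = 1/α, which sharpens the distribution toward high-probability tokens.

```python
import numpy as np

def power_sample(logits, alpha=2.0, rng=None):
    """Sample from p(x)^alpha, renormalized. Since softmax(z)^alpha
    renormalizes to softmax(alpha * z), this is low-temperature
    sampling with T = 1/alpha."""
    rng = rng or np.random.default_rng(0)
    z = alpha * np.asarray(logits, dtype=np.float64)
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax(alpha * logits)
    return rng.choice(len(p), p=p), p

# Sharpening concentrates probability mass on the top token.
_, p1 = power_sample([2.0, 1.0, 0.5], alpha=1.0)
_, p4 = power_sample([2.0, 1.0, 0.5], alpha=4.0)
print(p4[0] > p1[0])  # True: higher alpha -> more mass on the argmax
```

The appeal is that this reuses the model's own next-token probabilities, so no retraining is needed.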

Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

Intermediate
Jiecong Wang, Hao Peng et al. · Jan 29 · arXiv

This paper introduces PLaT, a way for AI to think silently in a hidden space (the brain) and only speak when needed (the mouth).

#latent chain-of-thought · #planning in latent space · #planner-decoder architecture

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Intermediate
Minwu Kim, Safal Shrestha et al. · Jan 28 · arXiv

When language models are trained with RL using right-or-wrong rewards, learning can stall on 'saturated' problems that the model almost always solves.

#failure-prefix conditioning · #RLVR · #GRPO
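One plausible reading of the idea, as a toy data-construction sketch (the helper name and prefix rule are hypothetical, not the paper's exact recipe): on a saturated problem, collect the rare failed rollouts, keep a prefix of each failed trace, and append it to the prompt, so training rollouts start from states where the model can still go wrong.

```python
def make_failure_prefix_examples(prompt, rollouts, is_correct, prefix_frac=0.5):
    """Build conditioned training prompts from rare failures on a
    'saturated' problem. Each failed rollout contributes
    prompt + a prefix of its trace. (Hypothetical sketch.)"""
    examples = []
    for sol, ok in zip(rollouts, is_correct):
        if ok:
            continue  # most rollouts succeed on saturated problems; skip them
        steps = sol.split("\n")
        prefix = "\n".join(steps[: max(1, int(len(steps) * prefix_frac))])
        examples.append(prompt + "\n" + prefix)
    return examples

ex = make_failure_prefix_examples(
    "Q: 2+2*3?",
    ["2+2=4\n4*3=12\nAnswer: 12", "2*3=6\n2+6=8\nAnswer: 8"],
    is_correct=[False, True],
)
print(ex)  # only the failed rollout yields a conditioned prompt
```

Conditioning on failure prefixes concentrates the learning signal on the few states where the model still makes mistakes, instead of re-rewarding already-mastered behavior.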

Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Intermediate
Kishan Panaganti, Zhenwen Liang et al. · Jan 27 · arXiv

LLMs are usually trained by treating every question the same and giving each one the same number of tries, which wastes compute on easy problems and neglects hard ones.

#LLM reasoning · #Reinforcement Learning (RL) · #GRPO
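The contrast with uniform allocation can be sketched with a toy heuristic (not the paper's method; `allocate_rollouts` and the failure-rate proxy are assumptions): give each question a rollout budget that grows with its estimated failure rate, so hard questions get more tries out of a fixed compute budget.

```python
def allocate_rollouts(fail_rates, total_budget, min_per_q=1):
    """Split a fixed rollout budget across questions in proportion to
    estimated failure rate (a toy proxy for hardness), instead of
    giving every question the same number of tries."""
    weights = [max(f, 1e-6) for f in fail_rates]  # avoid zero weight
    s = sum(weights)
    return [max(min_per_q, round(total_budget * w / s)) for w in weights]

# Easy question (10% failures) gets few tries; hard one (90%) gets many.
print(allocate_rollouts([0.1, 0.5, 0.9], total_budget=30))  # [2, 10, 18]
```

Uniform sampling would give each question 10 rollouts here; the skewed split moves compute toward the question the model fails most often.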

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Intermediate
Shobhita Sundaram, John Quan et al. · Jan 26 · arXiv

This paper teaches a model to be its own teacher so it can climb out of a learning plateau on very hard math problems.

#meta-reinforcement learning · #teacher-student self-play · #grounded rewards

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Intermediate
Matthew Y. R. Yang, Hao Bai et al. · Jan 20 · arXiv

The paper introduces Intervention Training (InT), a simple way for a language model to find and fix the first wrong step in its own reasoning using a short, targeted correction.

#Intervention Training · #credit assignment · #LLM reasoning

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Intermediate
Zhiyuan Hu, Yucheng Wang et al. · Jan 13 · arXiv

The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.

#Uniqueness-Aware Reinforcement Learning · #LLM reasoning · #strategy clustering
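A minimal sketch of a uniqueness-aware reward (assumptions: strategy-cluster labels come from some upstream clustering step, and the 1/count scaling is illustrative rather than the paper's exact rule): among a group of sampled rollouts, a correct answer earns more when its solution strategy is rare within the group.

```python
from collections import Counter

def uniqueness_rewards(correct, clusters):
    """Reward correct rollouts, scaled up when their strategy cluster
    is rare within the sampled group. Incorrect rollouts get 0."""
    counts = Counter(c for c, ok in zip(clusters, correct) if ok)
    return [(1.0 / counts[c]) if ok else 0.0
            for c, ok in zip(clusters, correct)]

# Three correct rollouts: two share strategy "a", one uses unique "b".
r = uniqueness_rewards([True, True, True, False], ["a", "a", "b", "a"])
print(r)  # [0.5, 0.5, 1.0, 0.0]
```

Because the common strategy splits its reward, the gradient pushes the policy to keep producing the rarer (but still correct) solution style instead of collapsing onto one favorite.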

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Intermediate
Jie Wu, Haoling Li et al. · Jan 11 · arXiv

X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic: no real contest problems needed.

#competitive programming · #synthetic data generation · #feature-based synthesis

Diversity or Precision? A Deep Dive into Next Token Prediction

Intermediate
Haoyuan Wu, Hai Wang et al. · Dec 28 · arXiv

The paper shows that teaching a language model with a special "reward-shaped" next-token objective can make later reinforcement learning (RL) work much better.

#next-token prediction · #cross-entropy as policy gradient · #reward shaping

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

Intermediate
Rujiao Long, Yang Li et al. · Dec 19 · arXiv

Reasoning Palette gives a language or vision-language model a tiny hidden "mood" (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.

#Reasoning Palette · #latent contextualization · #VAE