How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (33)

#Reinforcement Learning

Rethinking the Trust Region in LLM Reinforcement Learning

Intermediate
Penghui Qi, Xiangxin Zhou et al. · Feb 4 · arXiv

The paper shows that the popular PPO method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.

#Reinforcement Learning · #Proximal Policy Optimization · #Trust Region
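The claim about rare words follows from how PPO's fixed clip range works. A minimal numpy sketch (function names are mine, not the paper's) of the standard clipped surrogate shows that the same multiplicative bound lets a common token's probability move far more in absolute terms than a rare token's:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate: the ratio pi_new/pi_old is
    clipped to [1 - eps, 1 + eps] before multiplying the advantage."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def prob_update_range(p_old, eps=0.2):
    """The same multiplicative clip bounds the new token probability
    to [p_old * (1 - eps), p_old * (1 + eps)]."""
    return p_old * (1 - eps), p_old * (1 + eps)

# A rare token (p = 0.001) can gain at most 0.0002 probability mass
# per update, while a common token (p = 0.5) can gain up to 0.1:
rare_lo, rare_hi = prob_update_range(0.001)
common_lo, common_hi = prob_update_range(0.5)
```

This is background on vanilla PPO, not the paper's proposed fix; the paper rethinks how this trust region should be shaped per token.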

Privileged Information Distillation for Language Models

Intermediate
Emiliano Penaloza, Dheeraj Vattikonda et al. · Feb 4 · arXiv

The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.

#Privileged Information · #Knowledge Distillation · #π-Distill
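One common way to "practice with hints, perform without them" is a distillation loss that pulls the hint-free student toward the hint-conditioned teacher. A toy sketch under that assumption (the distributions and the KL direction here are illustrative, not taken from the π-Distill paper):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# teacher_probs: the model's output when conditioned on (input + hint);
# student_probs: the same model conditioned on the input alone. The
# distillation loss pushes the student toward the hint-informed output.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.4, 0.4, 0.2]
loss = kl(teacher_probs, student_probs)  # zero only when they match
```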

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Intermediate
Zhiyuan Yao, Yi-Kai Zhang et al. · Feb 3 · arXiv

Large language models learn better when we spend more practice time on the right questions at the right moments.

#Reinforcement Learning · #RLVR · #GRPO
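For readers unfamiliar with the GRPO tag: GRPO scores each sampled response relative to the other responses drawn for the same prompt. A minimal numpy sketch of the standard group-relative advantage (background only; CoBA-RL's budget-allocation scheme is built on top of this, not shown here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within the
    group of responses sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four responses to one prompt, two correct (reward 1) and two not:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# correct responses get positive advantage, incorrect get negative
```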

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Intermediate
Dylan Zhang, Yufeng Xu et al. · Feb 1 · arXiv

The paper shows that a model that looks great after supervised fine-tuning (SFT) can actually do worse after the same reinforcement learning (RL) than a model that looked weaker at SFT time.

#Supervised Fine-Tuning · #Reinforcement Learning · #Distribution Mismatch

Language-based Trial and Error Falls Behind in the Era of Experience

Intermediate
Haoyu Wang, Guozheng Ma et al. · Jan 29 · arXiv

Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubik’s Cube.

#SCOUT · #Reinforcement Learning · #Supervised Fine-Tuning

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Intermediate
Yufeng Zhong, Lei Chen et al. · Jan 29 · arXiv

OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.

#Holistic OCR · #Vision-Language Model · #Supervised Fine-Tuning

Agentic Reasoning for Large Language Models

Intermediate
Tianxin Wei, Ting-Wei Li et al. · Jan 18 · arXiv

This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.

#Agentic Reasoning · #LLM Agents · #In-Context Learning

Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

Intermediate
Pingzhi Tang, Yiding Wang et al. · Jan 16 · arXiv

Big language models can learn new facts with simple tutoring (SFT), but that doesn’t automatically teach them how to use those facts well.

#Parametric Skill Transfer · #Skill Vector · #Task Arithmetic

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Intermediate
Changle Qu, Sunhao Dai et al. · Jan 15 · arXiv

MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.

#Tool-Integrated Reasoning · #Credit Assignment · #Bipartite Matching
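The bipartite-matching idea here is to pair each tool call in a trajectory with the reference step it best corresponds to, then reward each call by its own match quality. A toy brute-force sketch of minimum-cost matching (the cost values and function are illustrative; MatchTIR's actual matching objective and any large-n solver such as the Hungarian algorithm are not shown):

```python
from itertools import permutations

def best_matching(cost):
    """Brute-force minimum-cost bipartite matching of n tool calls
    to n reference steps; cost[i][j] is the cost of pairing call i
    with step j. Fine for small n; Hungarian scales better."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# Two tool calls vs. two reference steps; low cost = good match.
match, total = best_matching([[0.1, 0.9],
                              [0.8, 0.2]])
# match[i] gives the reference step assigned to tool call i
```

With a per-call match in hand, each step can receive its own reward instead of the trajectory's shared one, which is the credit-assignment point the summary makes.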

SmartSearch: Process Reward-Guided Query Refinement for Search Agents

Intermediate
Tongyu Wen, Guanting Dong et al. · Jan 8 · arXiv

SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.

#Search agents · #Process rewards · #Query refinement

Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Intermediate
Mizanur Rahman, Mohammed Saidul Islam et al. · Jan 8 · arXiv

This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.

#Text-to-Visualization · #Reinforcement Learning · #GRPO

MiMo-V2-Flash Technical Report

Intermediate
Xiaomi LLM-Core Team et al. · Jan 6 · arXiv

MiMo-V2-Flash is a giant but efficient language model that uses a team-of-experts design to think well while staying fast.

#Mixture-of-Experts · #Sliding Window Attention · #Global Attention