🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
Search
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers134

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#GRPO

DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

Intermediate
Shengda Fan, Xuyan Ye et al.Jan 20arXiv

DARC teaches big language models to get smarter by splitting training into two calm, well-organized steps instead of one chaotic loop.

#DARC#self-play#curriculum learning

Think3D: Thinking with Space for Spatial Reasoning

Beginner
Zaibin Zhang, Yuhan Wu et al.Jan 19arXiv

Think3D lets AI models stop guessing from flat pictures and start exploring real 3D space, like walking around a room in a video game.

#Think3D#spatial reasoning#3D reconstruction

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Intermediate
Dawei Li, Yuguang Yao et al.Jan 18arXiv

ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.

#process reward model#tool-using agents#offline sampling

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Intermediate
Qiyuan Zhang, Biao Gong et al.Jan 16arXiv

This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.

#physics-aware video generation#rigid body motion#reinforcement learning

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Intermediate
Changle Qu, Sunhao Dai et al.Jan 15arXiv

MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.

#Tool-Integrated Reasoning#Credit Assignment#Bipartite Matching

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Intermediate
Jiarui Yao, Ruida Wang et al.Jan 15arXiv

Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.

#Process Reward Learning#PRL#Reasoning LLMs

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Intermediate
Yutao Mou, Zhangchi Xue et al.Jan 15arXiv

ToolSafe is a new way to keep AI agents safe when they use external tools, by checking each action before it runs.

#step-level safety#tool invocation#LLM agents

M^4olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints

Intermediate
Yizhan Li, Florence Cloutier et al.Jan 15arXiv

The paper introduces M^4olGen, a two-stage system that designs new molecules to match exact numbers for several properties (like QED, LogP, MW, HOMO, LUMO) at the same time.

#molecular generation#multi-property optimization#fragment-level editing

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Intermediate
Chi-Pin Huang, Yunze Man et al.Jan 14arXiv

Fast-ThinkAct teaches a robot to plan with a few tiny hidden "thought tokens" instead of long paragraphs, making it much faster while staying smart.

#Vision-Language-Action#latent reasoning#verbalizable planning

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Intermediate
Yao Tang, Li Dong et al.Jan 13arXiv

The paper introduces Multiplex Thinking, a new way for AI to think by sampling several likely next words at once and blending them into a single super-token.

#Multiplex Thinking#chain-of-thought#continuous token

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Intermediate
Zhiyuan Hu, Yucheng Wang et al.Jan 13arXiv

The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.

#Uniqueness-Aware Reinforcement Learning#LLM reasoning#strategy clustering

Your Group-Relative Advantage Is Biased

Intermediate
Fengkai Yang, Zherui Chen et al.Jan 13arXiv

Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.

#Reinforcement Learning from Verifier Rewards#GRPO#GSPO
34567