How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (17)

#verifiable rewards

Specificity-aware reinforcement learning for fine-grained open-world classification

Intermediate
Samuele Angheben, Davide Berasi et al. · Mar 3 · arXiv

This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.

#open-world classification #fine-grained recognition #large multimodal models

Heterogeneous Agent Collaborative Reinforcement Learning

Intermediate
Zhixia Zhang, Zixuan Huang et al. · Mar 3 · arXiv

This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.

#HACRL #HACPO #heterogeneous agents

Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Beginner
Emre Can Acikgoz, Cheng Qian et al. · Feb 24 · arXiv

Tool-R0 teaches a language model to use software tools (like APIs) with zero human-made training data.

#self-play reinforcement learning #tool calling #function calling
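With no human-made training data, a tool-learning agent needs reward signals it can check on its own. A toy sketch of one such check (the registry, tool name, and argument names here are my own illustration, not from Tool-R0): does an emitted tool call parse as JSON and match a registered tool's required arguments?

```python
import json

# Toy registry: tool name -> set of required argument names (hypothetical).
TOOLS = {"get_weather": {"city"}}

def tool_call_reward(call_text: str) -> float:
    """Return 1.0 if the call parses as JSON, names a registered tool,
    and supplies exactly the required arguments; otherwise 0.0."""
    try:
        call = json.loads(call_text)
    except json.JSONDecodeError:
        return 0.0
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS or not isinstance(args, dict):
        return 0.0
    return 1.0 if set(args) == TOOLS[name] else 0.0
```

Because the check is purely mechanical, the model can generate its own practice calls and grade them without any labeled examples.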

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Intermediate
Haoxiang Sun, Lizhen Xu et al. · Feb 18 · arXiv

DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to help AI think better using rewards that can be checked automatically.

#DeepVision-103K #multimodal reasoning #RLVR
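"Rewards that can be checked automatically" is the heart of RLVR. A minimal sketch of such a verifiable reward, assuming final answers arrive in a \boxed{...} format (my own illustration, not the dataset's actual checker):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the response's final boxed answer matches the
    reference exactly, else 0.0 -- no learned reward model needed."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Real checkers are usually looser (normalizing fractions, units, etc.), but the key property is the same: correctness is computed by a program, not judged by a human or another model.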

POINTS-GUI-G: GUI-Grounding Journey

Intermediate
Zhongyin Zhao, Yuan Liu et al. · Feb 6 · arXiv

This paper teaches a computer to find buttons, text, and icons on screens so it can click and type in the right places, a skill called GUI grounding.

#GUI grounding #reinforcement learning #verifiable rewards

Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training

Intermediate
Junxiao Liu, Zhijun Wang et al. · Feb 5 · arXiv

TRIT is a new training method that teaches AI to translate and think at the same time so it can solve hard problems in many languages without extra helper models.

#multilingual reasoning #translation-reasoning integration #self-translation

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Intermediate
Fanfan Liu, Youyang Yin et al. · Feb 5 · arXiv

The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.

#LUSPO #RLVR #GRPO
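Several cards here mention GRPO. A minimal sketch (my own illustration, not code from the LUSPO paper) of the group-normalized advantage GRPO uses, plus the token-weighting choice where a length preference can creep in:

```python
import statistics

def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: each sampled response's
    reward is standardized against the other responses to the same prompt."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def token_loss_weight(response_len: int, scheme: str = "per_response") -> float:
    """Two common ways to spread a response-level advantage over its tokens.
    'per_token' divides by each response's own length, so short correct
    answers get larger per-token updates -- one way a length bias arises."""
    if scheme == "per_token":
        return 1.0 / response_len
    return 1.0  # per-response: every token weighted equally
```

The advantage itself is length-blind; the bias enters through how that advantage is distributed over tokens of responses with different lengths.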

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Beginner
Zhuoran Yang, Ed Li et al. · Jan 28 · arXiv

This paper introduces Foundation-Sec-8B-Reasoning, a small (8 billion parameter) AI model that is trained to “think out loud” before answering cybersecurity questions.

#native reasoning #cybersecurity LLM #chain-of-thought

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Intermediate
Minwu Kim, Safal Shrestha et al. · Jan 28 · arXiv

When training language models with RL that uses right-or-wrong rewards, learning can stall on 'saturated' problems that the model almost always solves.

#failure-prefix conditioning #RLVR #GRPO

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Beginner
Zanlin Ni, Shenzhi Wang et al. · Jan 21 · arXiv

Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.

#diffusion language model #arbitrary order generation #autoregressive training

STEP3-VL-10B Technical Report

Beginner
Ailin Huang, Chengyuan Yao et al. · Jan 14 · arXiv

STEP3-VL-10B is a small (10 billion parameters) open multimodal model that sees images and reads text, yet scores like much larger models.

#multimodal foundation model #unified pre-training #perception encoder

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Intermediate
Jiangshan Duo, Hanyu Li et al. · Jan 13 · arXiv

JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.

#RLVR #judge-then-generate #discriminative supervision