Papers (36)


#Chain-of-Thought

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Intermediate

Qiyuan Zhang, Yufei Wang et al. · Mar 2 · arXiv

Longer explanations are not always better; the shape of thinking matters.

#Generative Reward Models #Chain-of-Thought #Breadth-CoT

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Intermediate

Qiyuan Zhang, Junyi Zhou et al. · Mar 2 · arXiv

RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.

#RubricBench #rubric-guided evaluation #reward models

Learn Hard Problems During RL with Reference Guided Fine-tuning

Intermediate

Yangzhen Wu, Shanda Li et al. · Mar 1 · arXiv

ReGFT is a simple pre-RL step that shows the model partial human hints, then makes it solve problems in its own words, creating correct, model-style solutions for hard questions.

#Reference-Guided Fine-Tuning #ReGFT #ReFT

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Intermediate

Qihua Dong, Kuo Yang et al. · Feb 27 · arXiv

This paper builds a new test called Ref-Adv to check whether AI can truly match tricky sentences to the right object in a picture.

#Referring Expression Comprehension #Visual Grounding #Multimodal Large Language Models

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Intermediate

Yuanda Xu, Hejian Sang et al. · Feb 24 · arXiv

The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.

#ACE #Reinforcement Learning with Verifiable Rewards #GRPO

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Intermediate

Futing Wang, Jianhao Yan et al. · Feb 12 · arXiv

The paper teaches language models to explore more ideas while thinking, so they can solve harder problems.

#In-Context Exploration #Test-Time Scaling #Chain-of-Thought

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Intermediate

Dawid J. Kopiczko, Sagar Vaze et al. · Feb 11 · arXiv

The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.

#Supervised Fine-Tuning #Chain-of-Thought #Data Repetition

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Intermediate

Xinwu Ye, Yicheng Mao et al. · Feb 6 · arXiv

LatentChem lets AI do chemistry thinking quietly inside continuous vectors instead of writing long step-by-step sentences.

#Latent reasoning #Chain-of-Thought #Chemical LLM

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Intermediate

Jintao Tong, Shilin Yan et al. · Feb 5 · arXiv

SwimBird is a multimodal AI that can switch how it thinks: only in text, only in vision (with hidden picture-like thoughts), or a mix of both.

#SwimBird #switchable reasoning #hybrid autoregressive

BABE: Biology Arena BEnchmark

Intermediate

Junting Zhou, Jin Chen et al. · Feb 5 · arXiv

BABE is a new benchmark that tests whether AI can read real biology papers and reason from experiments like a scientist, not just recall facts.

#BABE Benchmark #Experimental Reasoning #Causal Reasoning

Reinforced Attention Learning

Intermediate

Bangzheng Li, Jianmo Ni et al. · Feb 4 · arXiv

This paper teaches AI to pay attention better by training its focus, not just its words.

#Reinforced Attention Learning #attention policy #multimodal LLM

Privileged Information Distillation for Language Models

Intermediate

Emiliano Penaloza, Dheeraj Vattikonda et al. · Feb 4 · arXiv

The paper shows how to train a language model with special extra hints (privileged information) during practice so it can still do well later without any hints.

#Privileged Information #Knowledge Distillation #π-Distill
