The paper shows that when reasoning models are trained with reinforcement learning, giving every wrong answer the same penalty makes the model overconfident in certain bad reasoning paths and reduces the overall diversity of its outputs.
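To make the uniform-penalty idea concrete, here is a minimal illustrative sketch (not the paper's code; the reward functions and the string-matching "partial credit" rule are invented for illustration). With a binary reward, a near-miss and a nonsensical answer receive identical penalties, so the training signal cannot tell them apart; a graded reward would distinguish them.

```python
def binary_reward(answer: str, target: str) -> float:
    # Uniform scheme: every wrong answer gets the same penalty.
    return 1.0 if answer == target else -1.0

def graded_reward(answer: str, target: str) -> float:
    # Hypothetical alternative: partial credit based on matching
    # characters, so "close" mistakes are penalized less harshly.
    if answer == target:
        return 1.0
    shared = sum(1 for a, b in zip(answer, target) if a == b)
    return -1.0 + shared / max(len(target), 1)

answers = ["42", "41", "banana"]
print([binary_reward(a, "42") for a in answers])   # near-miss and nonsense tied
print([graded_reward(a, "42") for a in answers])   # near-miss penalized less
```

Under the binary scheme the model gets no signal about which wrong paths were almost right, which is one way the uniform penalty can flatten useful distinctions during training.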
Large language models are usually trained to excel at one kind of reasoning, but real-world use requires them to be good at many skills at once.
Chain-of-Thought (CoT) prompting makes a model reason step by step, but it is slow because the model must generate its many intermediate tokens one at a time.
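The token-by-token cost can be sketched with a toy decoding loop (purely illustrative; `fake_forward_pass` is a stand-in for a real model call). Because autoregressive decoding makes one model call per generated token, a long chain of thought costs proportionally more forward passes than a short direct answer.

```python
def fake_forward_pass(context: list[str]) -> str:
    # Stand-in for an expensive model call; returns a dummy next token.
    return f"step{len(context)}"

def decode(prompt: list[str], num_tokens: int) -> tuple[list[str], int]:
    tokens = list(prompt)
    passes = 0
    for _ in range(num_tokens):  # one forward pass per new token
        tokens.append(fake_forward_pass(tokens))
        passes += 1
    return tokens, passes

_, short_cost = decode(["q"], 5)    # short answer: 5 forward passes
_, long_cost = decode(["q"], 200)   # long chain of thought: 200 passes
print(short_cost, long_cost)
```

This is why CoT trades latency for accuracy: the extra reasoning tokens each require their own sequential model call and cannot be generated in parallel.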