How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (12)


Heterogeneous Agent Collaborative Reinforcement Learning

Intermediate
Zhixia Zhang, Zixuan Huang et al. · Mar 3 · arXiv

This paper introduces HACRL, a framework that lets different kinds of AI agents learn together during training while still acting independently at deployment.

#HACRL #HACPO #heterogeneous agents

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Intermediate
Haoxiang Sun, Lizhen Xu et al. · Feb 18 · arXiv

DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to help AI think better using rewards that can be checked automatically.

#DeepVision-103K #multimodal reasoning #RLVR

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

Intermediate
Futing Wang, Jianhao Yan et al. · Feb 12 · arXiv

The paper trains language models to explore more candidate ideas while they think, so they can solve harder problems.

#In-Context Exploration #Test-Time Scaling #Chain-of-Thought

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Intermediate
Zixuan Huang, Xin Xia et al. · Feb 9 · arXiv

Big AI reasoning models often keep thinking long after they already found the right answer, wasting time and tokens.

#SAGE #efficient reasoning #chain of thought

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Intermediate
Fanfan Liu, Youyang Yin et al. · Feb 5 · arXiv

The paper shows that popular RLVR methods for training language and vision-language models implicitly favor certain response lengths, which can hurt learning.

#LUSPO #RLVR #GRPO
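One well-known way such a length preference can arise (our own illustration under assumptions, not the paper's code) is per-sequence length normalization: summing each response's token losses and dividing by that response's own length gives every response equal total weight, so tokens in long responses count for less than tokens in short ones. A shared normalizer removes the imbalance.

```python
import numpy as np

# Illustrative sketch (assumption, not the paper's method): the effective
# weight a single token carries in the objective, for responses of
# different lengths, under two normalization schemes.

def per_token_weight_per_sequence(lengths):
    # Per-sequence normalization: each response's summed token loss is
    # divided by its own length, so one token weighs 1/len_i.
    return 1.0 / np.asarray(lengths, dtype=float)

def per_token_weight_shared(lengths):
    # Shared normalizer: divide by the total token count, so every token
    # weighs the same no matter which response it belongs to.
    total = float(sum(lengths))
    return np.full(len(lengths), 1.0 / total)

lengths = [10, 100]  # a short and a long response in the same batch
biased = per_token_weight_per_sequence(lengths)
unbiased = per_token_weight_shared(lengths)

# Under per-sequence normalization, a short-response token weighs 10x
# more than a long-response token; under the shared normalizer they match.
assert biased[0] == 10 * biased[1]
assert unbiased[0] == unbiased[1]
```

This is only a toy view of the weighting; the hypothetical function names and the exact correction used by LUSPO are not taken from the paper.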

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

Intermediate
Zichen Wen, Boxue Yang et al. · Jan 27 · arXiv

Innovator-VL is a new multimodal AI model that understands both pictures and text to help solve science problems without needing mountains of special data.

#Innovator-VL #multimodal large language model #scientific reasoning

Towards Pixel-Level VLM Perception via Simple Points Prediction

Intermediate
Tianhui Song, Haoyu Lu et al. · Jan 27 · arXiv

SimpleSeg teaches a multimodal language model to outline objects by writing down a list of points, like connecting the dots, instead of using a special segmentation decoder.

#SimpleSeg #multimodal large language model #decoder-free segmentation
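The "connect the dots" idea above can be sketched in a few lines (our own toy illustration, with hypothetical helper names; not the paper's code): if an object outline is just a list of vertices, the model can emit it as ordinary text tokens and no separate segmentation decoder is needed.

```python
# Toy sketch (assumption, not SimpleSeg's actual format): serialize a
# polygon outline as plain text a language model could generate, then
# parse it back into coordinates.

def points_to_text(points):
    # Encode polygon vertices as "(x,y)" tokens separated by spaces.
    return " ".join(f"({x},{y})" for x, y in points)

def text_to_points(text):
    # Parse the generated string back into integer (x, y) vertices.
    pairs = text.replace("(", "").replace(")", "").split()
    return [tuple(int(v) for v in p.split(",")) for p in pairs]

outline = [(12, 40), (55, 38), (60, 90), (10, 88)]
encoded = points_to_text(outline)        # "(12,40) (55,38) (60,90) (10,88)"
assert text_to_points(encoded) == outline
```

The appeal of the decoder-free framing is that segmentation becomes ordinary next-token prediction over this kind of string.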

Qwen3-TTS Technical Report

Intermediate
Hangrui Hu, Xinfa Zhu et al. · Jan 22 · arXiv

Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.

#Qwen3-TTS #text-to-speech #voice cloning

Your Group-Relative Advantage Is Biased

Intermediate
Fengkai Yang, Zherui Chen et al. · Jan 13 · arXiv

Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.

#Reinforcement Learning from Verifier Rewards #GRPO #GSPO
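The bias in the summary above can be seen in a tiny numeric sketch (our own illustration, not the paper's analysis): because each rollout's reward is included in the group mean it is compared against, its advantage is shrunk by a factor of (G-1)/G and the baseline is correlated with the sample itself. A leave-one-out baseline, as in RLOO-style estimators, removes that coupling.

```python
import numpy as np

# Illustrative sketch (assumption, not the paper's code): group-relative
# advantages with a full-group-mean baseline vs. a leave-one-out baseline.

def group_mean_advantage(rewards):
    # GRPO-style: baseline is the mean over the whole group, which
    # includes sample i's own reward.
    return rewards - rewards.mean()

def leave_one_out_advantage(rewards):
    # Baseline for sample i is the mean of the OTHER G-1 rewards.
    # Algebraically this equals the group-mean advantage scaled by G/(G-1).
    g = len(rewards)
    return (rewards - rewards.mean()) * g / (g - 1)

# Four rollouts of one prompt; only the first one is verified correct.
rewards = np.array([1.0, 0.0, 0.0, 0.0])

a_group = group_mean_advantage(rewards)      # shrunk by (G-1)/G = 3/4
a_loo = leave_one_out_advantage(rewards)     # decoupled baseline

assert np.allclose(a_group, [0.75, -0.25, -0.25, -0.25])
assert np.allclose(a_loo, [1.0, -1 / 3, -1 / 3, -1 / 3])
```

Whether the paper proposes this particular correction or a different one is not stated in the summary; the sketch only shows why a full-group-mean baseline is coupled to each sample's own reward.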

Solar Open Technical Report

Intermediate
Sungrae Park, Sanghoon Kim et al. · Jan 11 · arXiv

Solar Open is a 102-billion-parameter bilingual LLM that focuses on helping underserved languages like Korean catch up with English-level AI quality.

#Solar Open #Mixture-of-Experts #bilingual LLM

TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

Intermediate
Yinuo Wang, Mining Tan et al. · Jan 8 · arXiv

TourPlanner is a travel-planning system that first gathers suitable places, then lets multiple expert 'voices' debate candidate plans, and finally polishes the winner with a learning method that enforces hard constraints before style preferences.

#travel planning #multi-agent reasoning #chain-of-thought

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Intermediate
Yiwen Tang, Zoey Guo et al. · Dec 11 · arXiv

This paper asks whether reinforcement learning (RL) can improve generating 3D models from text, and shows that the answer is yes if the training and rewards are designed carefully.

#Reinforcement Learning #Text-to-3D Generation #Hi-GRPO