Papers (131)

#reinforcement learning

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.

#pairwise self-verification #test-time scaling #parallel reasoning

Specificity-aware reinforcement learning for fine-grained open-world classification

Intermediate

Samuele Angheben, Davide Berasi et al. · Mar 3 · arXiv

This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.

#open-world classification #fine-grained recognition #large multimodal models

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning

Intermediate

Jiejun Tan, Zhicheng Dou et al. · Mar 3 · arXiv

MemSifter is a small proxy model that picks the right memories for a large LLM, so the LLM doesn’t have to read everything.

#long-term memory #LLM retrieval #proxy model

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Intermediate

Jinpeng Chen, Cheng Gong et al. · Mar 2 · arXiv

CoVe is a way to create training conversations for AI agents that use tools, while guaranteeing the conversations are both challenging and correct.

#constraint-guided verification #multi-turn tool use #user simulator

FireRed-OCR Technical Report

Intermediate

Hao Wu, Haoran Lou et al. · Mar 2 · arXiv

FireRed-OCR turns a general vision-language model into a careful document reader that follows strict rules, so its outputs are usable in the real world.

#FireRed-OCR #structural hallucination #document parsing

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Intermediate

Ahmadreza Jeddi, Kimia Shaban et al. · Mar 1 · arXiv

This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or just help them pick better from answers they already know?

#medical vision-language models #reinforcement learning #supervised fine-tuning

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Intermediate

Ibragim Badertdinov, Maksim Nekrashevich et al. · Feb 27 · arXiv

SWE-rebench V2 is a large, language-agnostic automated pipeline that turns real GitHub pull requests into safe, runnable software tasks for training AI coding agents.

#SWE-rebench V2 #software engineering agents #reinforcement learning

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Intermediate

Wenjia Wang, Liang Pan et al. · Feb 26 · arXiv

EmbodMocap is a low-cost, portable way to capture people moving inside real places using just two iPhones, so computers and robots can learn from real life instead of studios.

#Embodied AI #4D human-scene reconstruction #dual-view RGB-D

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Intermediate

Qianben Chen, Tianrui Qin et al. · Feb 26 · arXiv

This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.

#agentic search #parallel evidence acquisition #plan refinement

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Intermediate

Rui Yang, Qianhui Wu et al. · Feb 25 · arXiv

GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.

#GUI agent #visual grounding #long-horizon navigation

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Intermediate

Jihao Qiu, Lingxi Xie et al. · Feb 24 · arXiv

LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.

#long video understanding #video navigation #multimodal large language model

PyVision-RL: Forging Open Agentic Vision Models via RL

Intermediate

Shitian Zhao, Shaoheng Lin et al. · Feb 24 · arXiv

PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.

#agentic multimodal models #reinforcement learning #dynamic tooling
