Papers (26)

#test-time scaling

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners

The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.

#pairwise self-verification #test-time scaling #parallel reasoning

PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference

Intermediate

Rituraj Sharma, Weiyuan Chen et al. · Mar 3 · arXiv

PRISM is a new way to help AI think through hard problems by checking each step, not just the final answer.

#DEEPTHINK #Process Reward Model #step-level verification

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Intermediate

Yinghao Ma, Haiwen Xia et al. · Feb 28 · arXiv

Modern music AIs can follow text, lyrics, and even example audio, but the reward models that score their output have not kept up.

#music reward model #compositional multimodal instruction #text-to-music evaluation

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Intermediate

Zhaochen Su, Jincheng Gao et al. · Feb 26 · arXiv

AgentVista is a new benchmark that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.

#AgentVista #multimodal agents #visual grounding

dVoting: Fast Voting for dLLMs

Intermediate

Sicheng Feng, Zigeng Chen et al. · Feb 12 · arXiv

Diffusion Large Language Models (dLLMs) can write many parts of an answer at once, not just left to right like usual chatbots.

#diffusion large language models #remasking #test-time scaling

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Intermediate

Pinyi Zhang, Ting-En Lin et al. · Feb 12 · arXiv

This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.

#personalized reward modeling #generative reward model #evaluation chain

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Intermediate

Chenlong Deng, Mengjie Deng et al. · Feb 11 · arXiv

Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.

#context-aware image retrieval #multimodal agents #visual history exploration

Pathwise Test-Time Correction for Autoregressive Long Video Generation

Intermediate

Xunzhi Xiang, Zixuan Duan et al. · Feb 5 · arXiv

This paper tackles a major problem in long video generation: tiny errors that snowball over time and make the video drift and flicker.

#test-time correction #autoregressive video diffusion #distilled diffusion

Reinforced Attention Learning

Intermediate

Bangzheng Li, Jianmo Ni et al. · Feb 4 · arXiv

This paper teaches AI to pay attention better by training its focus, not just its words.

#Reinforced Attention Learning #attention policy #multimodal LLM

Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

Intermediate

Tong Zheng, Chengsong Huang et al. · Feb 3 · arXiv

Parallel-Probe is a simple add-on that lets many AI “thought paths” run in parallel but stop early once they already agree.

#parallel thinking #2D probing #consensus-based early stopping

SWE-World: Building Software Engineering Agents in Docker-Free Environments

Intermediate

Shuang Sun, Huatong Song et al. · Feb 3 · arXiv

SWE-World lets code-fixing AI agents practice and learn without heavy Docker containers by using learned models that stand in for the execution environment and the tests.

#SWE-World #software engineering agents #Docker-free training

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Intermediate

Jialiang Zhu, Gongrui Zhang et al. · Feb 2 · arXiv

Re-TRAC is a new way for AI search agents to learn from each try, write a clean summary of what happened, and then use that summary to do better on the next try.

#Re-TRAC #trajectory compression #deep research agents
