Papers (791)

Agentic Reasoning for Large Language Models

Tianxin Wei, Ting-Wei Li et al. · Jan 18 · arXiv

This paper explains how to turn large language models (LLMs) from quiet students that only answer questions into active agents that can plan, act, and learn over time.

#Agentic Reasoning #LLM Agents #In-Context Learning

MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Intermediate

Peizhou Huang, Zixuan Zhong et al. · Jan 18 · arXiv

This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI “deep research agents” write long, citation-rich reports using both text and images.

#Multimodal Deep Research #Benchmark #Citation Grounding

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Intermediate

Dawei Li, Yuguang Yao et al. · Jan 18 · arXiv

ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.

#process reward model #tool-using agents #offline sampling

Agentic-R: Learning to Retrieve for Agentic Search

Intermediate

Wenhan Liu, Xinyu Ma et al. · Jan 17 · arXiv

Agentic-R is a new way to teach a search retriever to find not just similar text, but the text that truly helps an AI get the final answer right.

#agentic search #retriever training #passage utility modeling

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench #command line interface #Docker containers

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Intermediate

Ruiheng Zhang, Jingfeng Yao et al. · Jan 16 · arXiv

UniX is a new medical AI that both understands chest X-rays (writes accurate reports) and generates chest X-ray images (high visual quality) without making the two jobs fight each other.

#UniX #autoregressive branch #diffusion branch

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Intermediate

Yawar Siddiqui, Duncan Frost et al. · Jan 16 · arXiv

ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.

#ShapeR #3D reconstruction #object-centric

The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Intermediate

Eilam Shapira, Roi Reichart et al. · Jan 16 · arXiv

The paper shows that simply adding a new AI model to the menu—without anyone actually using it—can push a fairness-focused regulator to change the market rules, shifting money from one side to the other.

#Poisoned Apple effect #AI agents #meta-game

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Intermediate

Linqing Zhong, Yi Liu et al. · Jan 16 · arXiv

Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.

#Vision-Language-Action #Action Chain-of-Thought #Explicit Action Reasoner

Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

Intermediate

Pingzhi Tang, Yiding Wang et al. · Jan 16 · arXiv

Big language models can learn new facts with simple tutoring (SFT), but that doesn’t automatically teach them how to use those facts well.

#Parametric Skill Transfer #Skill Vector #Task Arithmetic

Language of Thought Shapes Output Diversity in Large Language Models

Intermediate

Shaoyang Xu, Wenxuan Zhang · Jan 16 · arXiv

The paper shows that changing the language a model 'thinks in' (its language of thought) can make its English answers more varied without making them much worse in quality.

#language of thought #output diversity #multilingual reasoning

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Intermediate

Tanyu Chen, Tairan Chen et al. · Jan 16 · arXiv

Chroma 1.0 is a real-time, end-to-end speech-to-speech system that can talk back in your own cloned voice with sub-second delay.

#end-to-end speech-to-speech #personalized voice cloning #streaming TTS
