Papers (1262)

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.

#autonomous agents #long-horizon evaluation #agent benchmarking

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Intermediate

Shiyu Liu, Yongjing Yin et al. · Jan 16 · arXiv

RL-trained search agents often sound confident even when they don’t know, which can mislead people.

#agentic search #reinforcement learning #boundary awareness

NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

Intermediate

Jiayu Liu, Rui Wang et al. · Jan 16 · arXiv

The paper studies why large language models (LLMs) sound too sure of themselves when using retrieval-augmented generation (RAG) and how to fix it.

#Retrieval-Augmented Generation #Confidence Calibration #Expected Calibration Error

When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs

Intermediate

Zhongxiang Sun, Yi Zhan et al. · Jan 16 · arXiv

Personalized AI helpers can accidentally copy a user’s past opinions instead of telling objective facts, which the authors call personalization-induced hallucinations.

#personalized large language models #hallucination #factuality

FrankenMotion: Part-level Human Motion Generation and Composition

Beginner

Chuqiao Li, Xianghui Xie et al. · Jan 15 · arXiv

FrankenMotion is a new AI that makes human motion by controlling each body part over time, like a careful puppeteer.

#Human motion generation #Part-level control #Hierarchical conditioning

Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

Intermediate

Chongcong Jiang, Tianxingjian Ding et al. · Jan 15 · arXiv

Medical SAM3 is a text-prompted medical image segmentation model that was fully fine-tuned on 33 diverse datasets to work across many imaging types like ultrasound, X-ray, endoscopy, and pathology.

#Medical image segmentation #Prompt-based segmentation #Foundation models

Reasoning Models Generate Societies of Thought

Intermediate

Junsol Kim, Shiyang Lai et al. · Jan 15 · arXiv

The paper shows that top reasoning AIs don’t just think longer; they act like a tiny team inside their heads, with different voices that ask questions, disagree, and then come to agreement.

#society of thought #reasoning reinforcement learning #conversational behaviors

Alterbute: Editing Intrinsic Attributes of Objects in Images

Intermediate

Tal Reiss, Daniel Winter et al. · Jan 15 · arXiv

Alterbute is a diffusion-based method that changes an object's intrinsic attributes (color, texture, material, shape) in a photo while keeping the object's identity and the scene intact.

#intrinsic attribute editing #visual named entities #identity preservation

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Intermediate

Changle Qu, Sunhao Dai et al. · Jan 15 · arXiv

MatchTIR teaches AI agents to judge each tool call step-by-step instead of giving the same reward to every step.

#Tool-Integrated Reasoning #Credit Assignment #Bipartite Matching

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Beginner

Caihua Li, Lianghong Guo et al. · Jan 15 · arXiv

This paper is the first big map of how AI can fix real software problems, not just write short code snippets.

#SWE-bench #issue resolution #AI coding agents

LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Intermediate

Gilat Toker, Nitay Calderon et al. · Jan 15 · arXiv

This paper builds LIBERTy, a new way to fairly judge how well AI explains its decisions about big, human ideas like age, race, or experience.

#concept-based explanations #structural counterfactuals #structural causal models

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate

Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.

#optical flow #future optical flow prediction #vision-language model
