Papers (924)

MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Zecheng Tang, Baibei Ji et al. | Jan 17 | arXiv

This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.

#reward models#long-term memory#long-context reasoning

Agentic-R: Learning to Retrieve for Agentic Search

Intermediate

Wenhan Liu, Xinyu Ma et al. | Jan 17 | arXiv

Agentic-R is a new way to teach a search retriever to find not just similar text, but the text that truly helps an AI get the final answer right.

#agentic search#retriever training#passage utility modeling

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al. | Jan 17 | arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench#command line interface#Docker containers

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Intermediate

Ruiheng Zhang, Jingfeng Yao et al. | Jan 16 | arXiv

UniX is a new medical AI that both understands chest X-rays (writes accurate reports) and generates chest X-ray images (high visual quality) without making the two jobs fight each other.

#UniX#autoregressive branch#diffusion branch

Building Production-Ready Probes For Gemini

Beginner

János Kramár, Joshua Engels et al. | Jan 16 | arXiv

The paper shows how to build tiny, fast safety checkers (called probes) that look inside a big AI’s brain activity to spot dangerous cyber-attack requests.

#activation probes#misuse mitigation#long-context robustness

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Intermediate

Yawar Siddiqui, Duncan Frost et al. | Jan 16 | arXiv

ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.

#ShapeR#3D reconstruction#object-centric

The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Intermediate

Eilam Shapira, Roi Reichart et al. | Jan 16 | arXiv

The paper shows that simply adding a new AI model to the menu—without anyone actually using it—can push a fairness-focused regulator to change the market rules, shifting money from one side to the other.

#Poisoned Apple effect#AI agents#meta-game

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Intermediate

Linqing Zhong, Yi Liu et al. | Jan 16 | arXiv

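Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing. ACoT-VLA targets this gap by having the model reason in terms of actions (an Action Chain-of-Thought).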

#Vision-Language-Action#Action Chain-of-Thought#Explicit Action Reasoner

Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

Intermediate

Pingzhi Tang, Yiding Wang et al. | Jan 16 | arXiv

Big language models can learn new facts with simple tutoring (supervised fine-tuning, or SFT), but that doesn’t automatically teach them how to use those facts well.

#Parametric Skill Transfer#Skill Vector#Task Arithmetic

Language of Thought Shapes Output Diversity in Large Language Models

Intermediate

Shaoyang Xu, Wenxuan Zhang | Jan 16 | arXiv

The paper shows that changing the language a model 'thinks in' (its language of thought) can make its English answers more varied without making them much worse in quality.

#language of thought#output diversity#multilingual reasoning

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Intermediate

Tanyu Chen, Tairan Chen et al. | Jan 16 | arXiv

Chroma 1.0 is a real-time, end-to-end speech-to-speech system that can talk back in your own cloned voice with sub-second delay.

#end-to-end speech-to-speech#personalized voice cloning#streaming TTS

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Intermediate

Shuai Tan, Biao Gong et al. | Jan 16 | arXiv

CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.

#multi-subject animation#pose-guided video generation#Unbind–Rebind paradigm
