How I Study AI - Learn AI Papers & Lectures the Easy Way

Search: "AI agents" · 20 results · keyword match

Introducing OpenAI Frontier

OpenAI Blog · Feb 5 · OpenAI

OmniGAIA: Towards Native Omni-Modal AI Agents

Intermediate
Xiaoxi Li, Wenxiang Jiao et al. · Feb 26 · arXiv

OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.

#OmniGAIA · #OmniAtlas · #Tool-Integrated Reasoning

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Beginner
Jinyang Wu, Guocheng Zhai et al. · Jan 7 · arXiv

ATLAS is a system that picks the best mix of AI models and helper tools for each question, instead of using just one model or a fixed tool plan.

#ATLAS · #LLM routing · #tool augmentation

Products

Intermediate
Anthropic · Feb 19 · Anthropic

AI agents are computer helpers that can use tools and act on their own, sometimes a little and sometimes a lot; this paper measures how that autonomy actually plays out in real life.

#AI agent autonomy · #human-in-the-loop oversight · #post-deployment monitoring

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Intermediate
Aniketh Garikaparthi, Manasi Patwardhan et al. · Feb 16 · arXiv

ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.

#ResearchGym · #closed-loop research · #objective evaluation

Learning Personalized Agents from Human Feedback

Beginner
Kaiqu Liang, Julia Kruk et al. · Feb 18 · arXiv

AI helpers often don't know new users' tastes and can't keep up when those tastes change.

#personalization · #human feedback · #pre-action clarification

Agentic Confidence Calibration

Beginner
Jiaxin Zhang, Caiming Xiong et al. · Jan 22 · arXiv

AI agents often act very sure of themselves even when they are wrong, especially on long, multi-step tasks.

#agentic confidence calibration · #holistic trajectory calibration · #general agent calibrator

Agentic Uncertainty Quantification

Intermediate
Jiaxin Zhang, Prafulla Kumar Choubey et al. · Jan 22 · arXiv

Long AI tasks can go wrong early and keep getting worse, like a snowball of mistakes the authors call the Spiral of Hallucination.

#Agentic Uncertainty Quantification · #Spiral of Hallucination · #Dual-Process Architecture

SkillNet: Create, Evaluate, and Connect AI Skills

Intermediate
Yuan Liang, Ruobin Zhong et al. · Feb 26 · arXiv

Before SkillNet, AI agents kept solving the same kinds of problems over and over without saving what they learned in a clean, reusable way.

#AI skills · #Skill ontology · #Skill taxonomy

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Intermediate
Kaican Li, Lewei Yao et al. · Dec 21 · arXiv

This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.

#multimodal reasoning · #generalized visual search · #reinforcement learning

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Intermediate
Haoyu Dong, Pengkun Zhang et al. · Dec 15 · arXiv

FINCH is a new test that checks whether AI can handle real finance and accounting work using messy, real spreadsheets, emails, PDFs, charts, and more.

#FINCH benchmark · #finance and accounting AI · #spreadsheet agents

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Intermediate
Li Puyin, Tiange Xiang et al. · Dec 22 · arXiv

QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.

#QuantiPhy · #Vision-Language Models · #Physical reasoning

Monadic Context Engineering

Intermediate
Yifan Zhang, Yang Yuan et al. · Dec 27 · arXiv

Monadic Context Engineering (MCE) is a way to build AI agents using math-inspired Lego blocks called Functors, Applicatives, and Monads so state, errors, and side effects are handled automatically.

#Monadic Context Engineering · #AgentMonad · #Functor
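To make the monad idea concrete, here is a minimal sketch (not code from the paper; the step names are invented for illustration): each agent step returns a Result carrying either a value or an error, and `bind` chains steps so a failure anywhere short-circuits the rest automatically.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

@dataclass
class Result(Generic[T]):
    """Monadic wrapper: holds either a value or an error, never both."""
    value: Optional[T] = None
    error: Optional[str] = None

    @staticmethod
    def ok(value: T) -> "Result[T]":
        return Result(value=value)

    @staticmethod
    def fail(error: str) -> "Result[T]":
        return Result(error=error)

    def bind(self, step: Callable[[T], "Result[U]"]) -> "Result[U]":
        # If an earlier step failed, skip this step and carry the error forward.
        if self.error is not None:
            return Result(error=self.error)
        return step(self.value)

# Hypothetical agent steps (illustrative only).
def parse_query(text: str) -> Result[str]:
    return Result.ok(text.strip()) if text.strip() else Result.fail("empty query")

def call_tool(query: str) -> Result[str]:
    return Result.ok(f"answer for {query!r}")

# Chaining: no explicit error-handling if/else between steps.
out = Result.ok("  what is MCE?  ").bind(parse_query).bind(call_tool)
print(out.value)  # answer for 'what is MCE?'
```

A failed step (e.g. an empty query) produces `Result.fail(...)`, and every later `bind` just passes the error through, which is the "errors handled automatically" part of the summary above.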

OpenAI and Amazon announce strategic partnership

Beginner
OpenAI Blog · Feb 27 · OpenAI

OpenAI and Amazon announced a multi-year partnership to make building and running AI apps easier, safer, and faster for businesses of all sizes.

#OpenAI Frontier · #Stateful Runtime Environment · #Amazon Bedrock

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Beginner
Kaiyuan Chen, Qimin Wu et al. · Jan 28 · arXiv

This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.

#AgentIF-OneDay · #instruction following · #AI agents

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate
Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench · #command line interface · #Docker containers

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Intermediate
Dongrui Liu, Qihan Ren et al. · Jan 26 · arXiv

AgentDoG is a new "diagnostic guardrail" that watches AI agents step by step and explains exactly why a risky action happened.

#AgentDoG · #AI agent safety · #diagnostic guardrail

The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Intermediate
Eilam Shapira, Roi Reichart et al. · Jan 16 · arXiv

The paper shows that simply adding a new AI model to the menu, even if no one actually uses it, can push a fairness-focused regulator to change the market rules, shifting money from one side to the other.

#Poisoned Apple effect · #AI agents · #meta-game

Memory in the Age of AI Agents

Intermediate
Yuyang Hu, Shichun Liu et al. · Dec 15 · arXiv

This survey explains how AI agents remember things and organizes the whole topic into three clear parts: forms, functions, and dynamics.

#Agent memory · #LLM memory · #Retrieval-augmented generation

Towards a Science of AI Agent Reliability

Intermediate
Stephan Rabanser, Sayash Kapoor et al. · Feb 18 · arXiv

Accuracy alone can make AI agents look good on paper while still failing in real life; this paper shows how to measure reliability properly.

#AI agent reliability · #consistency · #robustness