AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
MAXS is a new way for AI agents to think a few steps ahead while using tools like search and code, so they make smarter choices.
This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, others help them be honest.