Papers3

#GAIA benchmark

Accuracy alone can make AI agents look good on paper while still failing in real life; this paper shows how to measure reliability properly.

Not triaged yet

AOrchestra is like a smart conductor that builds the right mini-helpers (sub-agents) on demand to solve big, multi-step tasks.

Not triaged yet

DeepVerifier is a plug-in checker that helps Deep Research Agents catch and fix their own mistakes while they are working, without retraining.

Not triaged yet