This paper shows how to fairly test "general-purpose" AI agents, which are meant to work in many different settings without special tweaks.
This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
Numina-Lean-Agent is a new open system that uses a general coding agent to write and check formal math proofs in Lean without special training.