LOCA-bench is a benchmark that tests whether AI agents keep working correctly as their to-do lists and background information grow very long.
Most reinforcement-learning agents get only a simple pass/fail reward, which hides how good or bad each attempt actually was.