Big idea: Make image-making AIs stop, think, check, and fix their own work so they get better at both creating pictures and understanding them.
LongCLI-Bench is a new test that checks how well AI coding agents can handle long, realistic software projects on the command line, not just tiny coding puzzles.
AgentArk teaches one language model to think like a whole team of models that debate, so it can solve tough problems quickly without running a long, expensive debate at answer time.
Big reasoning AIs think in many steps, which is slow and costly.