This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
BeyondSWE is a new benchmark that tests code agents on tougher, more real-life tasks than single-repo bug fixing.
Youtu-LLM is a small (1.96B) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.