This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
BeyondSWE is a new benchmark that tests code agents on harder, more realistic tasks than single-repo bug fixing.
MemGovern teaches code agents to learn from past human fixes on GitHub by turning messy discussions into clean, reusable 'experience cards.'
Youtu-LLM is a small (1.96B-parameter) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
DeepCode is an AI coding system that turns long, complicated papers into full, working code repositories.