This paper builds a big, reusable library of computer skills so an AI can use Windows apps more like a careful human, not a clumsy robot.
This paper introduces Foundation-Sec-8B-Reasoning, a small (8-billion-parameter) AI model trained to “think out loud” before answering cybersecurity questions.
This paper shows that generating short videos can help an AI plan and reason in pictures better than writing out steps in text.
DeepSearchQA is a new test with 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just one fact.
Language models store ideas along straight-line directions inside their brains (representations), like sliders for “truth” or “ethics.”
Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.
When language models are trained with RL that uses right-or-wrong rewards, learning can stall on 'saturated' problems that the model almost always solves.
The paper teaches large language models to learn from detailed feedback (like error messages) instead of only a simple pass/fail score.
SERA is a new, low-cost way to train coding helpers (agents) that learn the style and secrets of your own codebase.
AgentLongBench is a new test that checks how well AI agents think over very long stories made of their own actions and the world's replies, not just over static documents.
This paper says that to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.