ContextBench is a new benchmark that checks not just whether a coding AI fixes a bug, but whether it found and used the right pieces of code along the way.
This paper teaches a language model to write fast GPU kernels (small, performance-critical programs) in Triton, using reinforcement learning that rewards meaningful speedups and not just correctness.
This paper fixes a big problem in long video generation: tiny mistakes that snowball over time and make the video drift and flicker.
BABE is a new benchmark that tests if AI can read real biology papers and reason from experiments like a scientist, not just recall facts.
Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
Robots usually need very detailed, step-by-step directions, but real life often gives only short, simple goals like ‘find the red bench.’
FastVMT is a faster way to copy motion from one video to another without training a new model for each video.
The paper finds a hidden symmetry inside GRPO’s advantage calculation that accidentally keeps models from exploring new good answers and from giving easy and hard problems the right amount of attention at the right times.
Large language models are usually trained to get good at one kind of reasoning, but real life needs them to be good at many things at once.
Big idea: use a small, already-trained model to help a bigger model learn good habits early, so the big one trains faster and ends up smarter.
Before this work, AI agents often stopped to run safety checks at every single step, which made them slow and still easy to trick in sneaky ways.
ProAct teaches AI agents to think ahead accurately without needing expensive search every time they act.