SurgWorld teaches surgical robots using videos plus text, then guesses the missing robot moves so we can train good policies without collecting tons of real robot-action data.
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
The paper shows that teaching a language model with a special “reward-shaped” next-token objective can make later reinforcement learning (RL) work much better.
JavisGPT is a single AI that can both understand sounding videos (audio + video together) and also create new ones that stay in sync.
The paper asks what a truly good diffusion-based language model should look like and lists five must-have properties.
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
DreamOmni3 lets people edit and create images by combining text, example images, and quick hand-drawn scribbles.
Monadic Context Engineering (MCE) is a way to build AI agents using math-inspired Lego blocks called Functors, Applicatives, and Monads so state, errors, and side effects are handled automatically.
This paper introduces Self-E, a text-to-image model that learns from scratch and can generate good pictures in any number of steps, from just a few to many.
Real life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right picture parts without needing extra tools during answering time.
ProEdit is a training-free, plug-and-play method that fixes a common problem in image and video editing: the model clings too hard to the original picture and refuses to change what you asked for.