This paper teaches text-to-video models to follow real-world physics, so people, balls, water, glass, and fire act the way they should.
SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
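For readers who want a concrete picture, here is a minimal Python sketch of that kind of tool-using reasoning loop; the tool names, the action format, and the `model_step` callable are illustrative assumptions, not SenseNova-MARS's actual interface.

```python
# Hypothetical sketch of a tool-augmented reasoning loop: the model alternates
# free-form thinking with calls to three tools until it emits a final answer.
from PIL import Image

def text_search(query: str) -> str:
    """Stub: return text snippets retrieved for `query`."""
    return f"[retrieved passages for: {query}]"

def image_search(query: str) -> list[str]:
    """Stub: return paths/URLs of images retrieved for `query`."""
    return [f"hit_for_{query}.jpg"]

def crop_image(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop a region of interest so the model can inspect fine detail."""
    return image.crop(box)

TOOLS = {"text_search": text_search, "image_search": image_search, "crop_image": crop_image}

def reasoning_loop(model_step, question, image, max_turns=8):
    """Alternate reasoning steps with tool calls until a final answer appears."""
    context = [("question", question), ("image", image)]
    for _ in range(max_turns):
        action = model_step(context)   # e.g. {"tool": "crop_image", "args": {...}}
        if action.get("final_answer") is not None:
            return action["final_answer"]
        result = TOOLS[action["tool"]](**action["args"])
        context.append((action["tool"], result))   # feed the observation back in
    return None
```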
FIGR is a new way for AI to ‘think by drawing,’ using code to build clean, editable diagrams while it reasons.
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
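To give a rough idea of the two-student setup, here is a hedged PyTorch sketch; the backbones, the confidence threshold, and the loss weight `lam` are simplified assumptions rather than Co2S's exact recipe.

```python
# Minimal co-training sketch: each student is trained on the few labeled images
# plus confident pseudo-labels produced by the other student on unlabeled images.
import torch
import torch.nn.functional as F

def co_training_step(student_a, student_b, labeled, unlabeled, opt_a, opt_b,
                     lam=1.0, conf_thresh=0.9):
    x_l, y_l = labeled            # small labeled batch: images, pixel labels
    x_u = unlabeled               # larger unlabeled batch: images only

    logits_a_l, logits_b_l = student_a(x_l), student_b(x_l)
    logits_a_u, logits_b_u = student_a(x_u), student_b(x_u)

    # Supervised loss on the few labeled examples.
    sup = F.cross_entropy(logits_a_l, y_l) + F.cross_entropy(logits_b_l, y_l)

    # Each student learns from the other's confident pseudo-labels
    # (assumes at least some pixels pass the confidence threshold).
    with torch.no_grad():
        conf_a, pseudo_a = logits_a_u.softmax(1).max(1)
        conf_b, pseudo_b = logits_b_u.softmax(1).max(1)
    cross = (F.cross_entropy(logits_a_u, pseudo_b, reduction="none")[conf_b > conf_thresh].mean()
             + F.cross_entropy(logits_b_u, pseudo_a, reduction="none")[conf_a > conf_thresh].mean())

    loss = sup + lam * cross
    opt_a.zero_grad()
    opt_b.zero_grad()
    loss.backward()
    opt_a.step()
    opt_b.step()
    return loss.item()
```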
Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.
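To make "diffusion language model" concrete, here is a generic sketch of masked-diffusion decoding, where the whole answer starts masked and is filled in over a few parallel refinement steps instead of left-to-right; the `model` callable, the mask token, and the commit schedule are assumptions, not Dream-VL's implementation.

```python
# Generic masked-diffusion decoding sketch: predict all masked positions at once,
# commit only the confident ones, and repeat until everything is filled in.
import torch

MASK_ID = 0  # assumed id of the [MASK] token

def diffusion_decode(model, prompt_ids, answer_len=16, steps=4, thresh=0.9):
    answer = torch.full((1, answer_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, answer], dim=1))[:, -answer_len:]
        conf, pred = logits.softmax(-1).max(-1)
        commit = answer.eq(MASK_ID)            # only still-masked positions
        if step < steps - 1:
            commit &= conf > thresh            # early rounds: confident tokens only
        answer[commit] = pred[commit]          # final round: commit everything left
    return answer
```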
Robots learn best from the first-person (egocentric) view they actually see, but most AI models are trained on third-person videos and struggle when the viewpoint changes.
RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
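Purely for illustration, a spatial trace could be written down as an ordered list of 3D waypoints; the fields below (including the gripper flag and the units) are assumptions, not RoboTracer's actual output format.

```python
# Illustrative data structure for a "spatial trace": a sequence of 3D waypoints.
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float            # metres in the robot base frame (assumed)
    y: float
    z: float
    gripper_open: bool  # assumed per-step gripper state

# "Put the mug on the shelf without hitting the lamp" might become:
trace = [
    Waypoint(0.40, 0.10, 0.20, gripper_open=True),   # approach the mug
    Waypoint(0.40, 0.10, 0.05, gripper_open=False),  # grasp it
    Waypoint(0.20, 0.35, 0.45, gripper_open=False),  # arc around the lamp
    Waypoint(0.05, 0.50, 0.40, gripper_open=True),   # release on the shelf
]
```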
InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
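Here is a hedged sketch of what such a hybrid block might look like; the gated linear recurrence below is a generic stand-in for Gated DeltaNet (not its actual update rule), and the layer sizes and interleaving pattern are assumptions.

```python
# Hybrid layout sketch: a Sliding Window Attention layer (local focus) followed
# by a gated linear-recurrence layer (long-range memory).
import torch
import torch.nn as nn

class SlidingWindowAttention(nn.Module):
    def __init__(self, dim, heads=8, window=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):                       # x: (batch, seq, dim)
        T = x.size(1)
        i = torch.arange(T, device=x.device)
        dist = i[:, None] - i[None, :]
        # Block keys outside the causal window [t - window + 1, t].
        mask = (dist < 0) | (dist >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class GatedLinearMemory(nn.Module):
    """Stand-in for Gated DeltaNet: a gated outer-product key-value memory."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.to_gate = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, seq, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        g = torch.sigmoid(self.to_gate(x))      # forget gate in (0, 1)
        B, T, D = x.shape
        S = x.new_zeros(B, D, D)                # running memory state
        outs = []
        for t in range(T):                      # sequential loop kept for clarity
            S = g[:, t, :, None] * S + k[:, t, :, None] * v[:, t, None, :]
            outs.append(torch.einsum("bd,bde->be", q[:, t], S))
        return torch.stack(outs, dim=1)

class HybridBlock(nn.Module):
    """One local-attention sublayer followed by one linear-memory sublayer."""
    def __init__(self, dim):
        super().__init__()
        self.local = SlidingWindowAttention(dim)
        self.memory = GatedLinearMemory(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.local(self.norm1(x))
        return x + self.memory(self.norm2(x))
```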
Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.