This paper turns messy chains of thought from language models into clear, named steps so we can see how they really think through math problems.
This paper asks a simple question: do video AI models trained only on 2D videos secretly learn about 3D worlds?
The paper proposes the Prism Hypothesis: meaning (semantics) lives mainly in low frequencies, while fine picture details live in high frequencies.
GenEnv is a training system where a student AI and a teacher simulator grow together by exchanging tasks and feedback.
Autoregressive (AR) image models make pictures by choosing tokens one by one, but they have been judged only on picking likely tokens, not on how good the final picture looks in pixels.
WorldWarp is a new method that turns a single photo plus a planned camera path into a long, steady, 3D-consistent video.
Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.
Over++ is a video AI that adds realistic effects like shadows, splashes, dust, and smoke between a foreground and a background without changing the original footage.
StoryMem is a new way to make minute-long, multi-shot videos that keep the same characters, places, and style across many clips.
CASA is a new way to mix images and text inside a language model that keeps speed and memory low while keeping accuracy high.
QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.
Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.