Videos turn into very long lists of tokens, and regular attention compares every pair of tokens, which is slow and expensive.
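A minimal NumPy sketch of plain scaled dot-product attention (not the paper's method) just to show where the cost comes from: the score matrix holds one entry per token pair, so it grows with the square of the sequence length.

```python
import numpy as np

def full_attention(Q, K, V):
    """Q, K, V: (n, d) arrays of n tokens with d features each."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n): one score per token pair
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # row-wise softmax
    return weights @ V                             # (n, d)

n, d = 512, 64
Q = K = V = np.random.randn(n, d).astype(np.float32)
print(full_attention(Q, K, V).shape)               # fine at 512 tokens...

for n in (1_000, 10_000, 100_000):                 # ...but longer videos mean more tokens
    print(f"{n:>7} tokens -> {n * n:>15,} pairwise scores")
# 10x more tokens means 100x more scores, which is why long videos get expensive.
```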
Endless Terminals is an automatic factory that builds thousands of realistic, checkable computer-terminal tasks so AI agents can practice and improve with reinforcement learning.
Memory-V2V teaches video editing AIs to remember what they already changed so new edits stay consistent with old ones.
Large language models usually get judged one message at a time, but many real tasks need smart planning across a whole conversation.
This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.
Before this work, most text-to-image models used VAEs (which squish images into small codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
This paper shows how to turn any normal photo or video into a seamless 360° panorama without needing the camera’s settings like field of view or tilt.
This paper shows how to keep training a language model while it is solving one hard, real problem, so it can discover a single, truly great answer instead of many average ones.
Cosmos Policy teaches robots to act by fine-tuning a powerful video model in just one training stage, without changing the model’s architecture.
ActionMesh is a fast, feed-forward AI that turns videos, images + text, text alone, or a given 3D model into an animated 3D mesh.
This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text describing the change you want.
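A tiny, hypothetical sketch of what one CIR query looks like (the field names are mine, not EDIR's): a starting image, a short modification text, and the target image the retriever should rank first.

```python
from dataclasses import dataclass

@dataclass
class CIRQuery:
    reference_image: str   # the starting image you already have
    modification: str      # short text describing the desired change
    target_image: str      # ground-truth image the system should retrieve

example = CIRQuery(
    reference_image="dog_on_grass.jpg",
    modification="same dog, but wearing a red collar",
    target_image="dog_red_collar.jpg",
)
```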
SAMTok turns any object’s mask in an image into just two special “words” so language models can handle pixels like they handle text.