Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.
InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.
DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.
NextFlow is a single, decoder-only Transformer that can read and write both text and images in one continuous sequence.
This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.
Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.
MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.
This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.