MiMo-V2-Flash is a giant but efficient language model that uses a mixture-of-experts design to think well while staying fast.
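The "mixture-of-experts" trick is easy to sketch: a small gate scores every expert, only the top few actually run, and their outputs are blended. The toy below is a generic illustration of that routing idea, not MiMo-V2-Flash's actual architecture; all sizes and weights are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, gate_w, experts, top_k=2):
    """Route input x to the top_k experts picked by the gate,
    then mix their outputs weighted by the gate scores."""
    logits = gate_w @ x                       # one score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the best experts
    weights = softmax(logits[top])            # renormalize over the chosen few
    # Only the selected experts run -- this is why MoE stays fast
    # even when the total parameter count is huge.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
x = rng.normal(size=8)
gate_w = rng.normal(size=(4, 8))              # 4 experts, toy sizes
experts = [lambda v, W=rng.normal(size=(8, 8)): W @ v for _ in range(4)]
y = moe_forward(x, gate_w, experts)
```

Because only 2 of the 4 experts execute per input, compute scales with `top_k`, not with the total number of experts.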
AnyDepth is a new, simple way for a computer to tell how far away things are using just one picture (monocular depth estimation).
SimpleMem is a new memory system that helps AI remember long conversations without wasting space or tokens.
Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.
InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.
DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
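"Whole grids at once across multiple scales" means the model emits a 1x1 token grid, then a 2x2, then a 4x4, each conditioned on the coarser ones. This toy loop shows only that generation pattern; the `dummy_predict` stand-in is hypothetical (a real VAR model would be a transformer) and nothing here reflects the paper's RL fix.

```python
import numpy as np

def var_generate(predict_scale, scales=(1, 2, 4)):
    """Coarse-to-fine generation: at each scale, an entire grid of
    tokens is produced in one step, conditioned on all prior scales."""
    context = []
    for s in scales:
        grid = predict_scale(context, s)   # (s, s) token grid, all at once
        context.append(grid)
    return context

# Hypothetical stand-in predictor: just samples random token ids.
rng = np.random.default_rng(0)
def dummy_predict(context, s):
    return rng.integers(0, 1024, size=(s, s))

grids = var_generate(dummy_predict)
```

The RL difficulty hinted at above comes from this structure: each "action" is a whole grid of tokens rather than a single next token.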
VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.
NextFlow is a single, decoder-only Transformer that can read and write both text and images in one continuous sequence.
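"One continuous sequence" means text tokens and image tokens live side by side in a single stream, separated by boundary markers, so one decoder can model everything left to right. The sketch below illustrates only that packing idea; the marker ids and helper names are invented for illustration and are not NextFlow's tokenizer.

```python
# Hypothetical marker ids; a real system would use learned special tokens.
BOI, EOI = -1, -2          # begin-of-image / end-of-image markers

def interleave(text_ids, image_ids):
    """Pack text and image tokens into one flat sequence that a
    decoder-only transformer can read and write left to right."""
    return text_ids + [BOI] + image_ids + [EOI]

def split(seq):
    """Recover the text span and the image span from the flat sequence."""
    i, j = seq.index(BOI), seq.index(EOI)
    return seq[:i], seq[i + 1:j]

seq = interleave([5, 9, 2], [101, 102, 103, 104])
text, image = split(seq)
```

With this layout, "reading" and "writing" an image are the same operation as for text: predicting the next token in the stream.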
This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.
Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.
MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.