Large Vision-Language Models (LVLMs) are great with one picture but get confused when you give them several, often mixing details from different images.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
FantasyVLN teaches a robot to follow language instructions while looking around: it uses step-by-step (chain-of-thought) reasoning during training but drops it at test time.
FOCUSUI makes computer-using AI faster while staying accurate by looking only at the important parts of the screen.
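To make that concrete, here is a minimal sketch of the general idea (not the authors' code): score coarse tiles of the screenshot, keep only the top few, and send just those crops to the model. The tile scorer here is a crude pixel-variance stand-in for whatever relevance model the paper actually uses, and `focus_crops` is a hypothetical name.

```python
from PIL import Image, ImageStat

def content_score(crop: Image.Image) -> float:
    # Crude stand-in for a learned relevance model: pixel variance as a
    # proxy for "this tile contains UI content worth looking at".
    return sum(ImageStat.Stat(crop.convert("L")).var)

def focus_crops(screenshot: Image.Image, k: int = 3, tile: int = 256):
    """Return the k highest-scoring tiles; only these go to the VLM,
    so it processes far fewer pixels/tokens than the full screen."""
    w, h = screenshot.size
    scored = []
    for x in range(0, w, tile):
        for y in range(0, h, tile):
            crop = screenshot.crop((x, y, min(x + tile, w), min(y + tile, h)))
            scored.append((content_score(crop), crop))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [crop for _, crop in scored[:k]]
```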
CPPO is a new way to fine-tune vision-language models so they see pictures more accurately before they start to reason.
SemanticGen is a new way to make videos: it first plans in a small, high-level 'idea space' (a semantic space) and then fills in the fine visual details.
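A toy two-stage pipeline in that spirit (my illustration, with made-up module names and dimensions, not the paper's architecture): stage 1 turns the text into a short sequence of compact semantic latents for the whole clip, and stage 2 decodes each latent into a detailed frame.

```python
import torch
import torch.nn as nn

class SemanticPlanner(nn.Module):
    """Stage 1: map a text embedding to T compact 'idea space' latents."""
    def __init__(self, text_dim=512, sem_dim=64, frames=16):
        super().__init__()
        self.frames, self.sem_dim = frames, sem_dim
        self.net = nn.Linear(text_dim, frames * sem_dim)

    def forward(self, text_emb):                       # (B, text_dim)
        z = self.net(text_emb)                         # (B, T * sem_dim)
        return z.view(-1, self.frames, self.sem_dim)   # (B, T, sem_dim)

class DetailDecoder(nn.Module):
    """Stage 2: expand each semantic latent into a (tiny) RGB frame."""
    def __init__(self, sem_dim=64, size=32):
        super().__init__()
        self.size = size
        self.net = nn.Linear(sem_dim, 3 * size * size)

    def forward(self, sem):                            # (B, T, sem_dim)
        B, T, _ = sem.shape
        return self.net(sem).view(B, T, 3, self.size, self.size)

text_emb = torch.randn(1, 512)           # stand-in for a text encoder's output
video = DetailDecoder()(SemanticPlanner()(text_emb))
print(video.shape)                       # torch.Size([1, 16, 3, 32, 32])
```

The point of the design is that the planner works in a space far smaller than pixels, so long-range "what happens when" decisions are cheap, and only the decoder pays the cost of visual detail.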
This paper shows a simple way to turn any strong autoregressive (step-by-step) model into a diffusion vision-language model that decodes in parallel, block by block, without changing the architecture.
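For intuition, here is a toy block-by-block parallel decoder in the masked-diffusion style the summary gestures at; the model, the mask id, and the "commit the most confident half each pass" schedule are all simplified assumptions, not the paper's recipe.

```python
import torch

MASK_ID = 0  # stand-in id for the [MASK] token

def decode_block(model, prefix, block_len=8, steps=4):
    """Fill `block_len` masked tokens in a few parallel passes (toy schedule)."""
    block = torch.full((block_len,), MASK_ID, dtype=torch.long)
    for _ in range(steps):
        logits = model(torch.cat([prefix, block]))[-block_len:]  # (block_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        masked = block == MASK_ID
        if not masked.any():
            break
        k = max(1, int(masked.sum()) // 2)          # commit the most confident half
        pick = conf.masked_fill(~masked, -1.0).topk(k).indices
        block[pick] = pred[pick]
    block[block == MASK_ID] = pred[block == MASK_ID]  # finalize any leftovers
    return torch.cat([prefix, block])

toy_model = lambda ids: torch.randn(ids.numel(), 100)  # random-logits stand-in
print(decode_block(toy_model, prefix=torch.tensor([5, 7, 9])))
```

Compared with autoregressive decoding (one token per forward pass), each pass here writes several tokens of the block at once, which is where the speedup comes from.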
Zoom-Zero helps AI answer questions about videos by first finding the right moment and then zooming in to double-check tiny details.
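That two-step, coarse-to-fine loop looks roughly like this. This is a sketch under my own assumptions: `vlm`, the prompts, and the frame and crop handling are placeholders, not Zoom-Zero's actual interface.

```python
def answer_about_video(frames, question, vlm, stride=10):
    # Step 1: localize. A cheap pass over subsampled frames asks the
    # model which moment matters (prompt wording is illustrative).
    i = int(vlm(images=frames[::stride],
                text=f"Index of the frame that best answers: {question}"))
    center = i * stride                               # map back to the full frame list
    moment = frames[max(0, center - 2): center + 3]   # short window around that moment

    # Step 2: zoom. Crop to the relevant region and re-ask the question
    # on the enlarged view to double-check fine details.
    box = vlm(images=moment, text=f"Box (l, t, r, b) around what matters for: {question}")
    zoomed = [f.crop(box) for f in moment]            # PIL-style crop
    return vlm(images=zoomed, text=question)
```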
This paper asks whether reinforcement learning (RL) can improve text-to-3D generation, and shows that the answer is yes if the training setup and rewards are designed carefully.
This paper shows that video AIs do not need long, human-like chains of thought to reason well.
LongCat-Image is a small (6B-parameter) but mighty bilingual image generator that turns text into high-quality, realistic pictures and also edits images well.