MIBURI is a system that makes a digital character move its body and face expressively, in real time, as it speaks.
This paper recasts a popular trick for steering image generators (Classifier-Free Guidance) as a feedback-control problem, much like keeping a car steady in its lane.
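The core move can be sketched in a few lines: instead of treating the guidance weight as a fixed knob, the sampler measures how hard the conditioning is pushing at each step and adjusts the weight like a controller. This is a minimal illustrative sketch, assuming a simple proportional rule on the norm of the conditional-minus-unconditional correction; the stub denoiser, setpoint, and gain are hypothetical stand-ins, not the paper's actual controller.

```python
# Sketch: Classifier-Free Guidance with a feedback loop on the guidance scale.
# Hypothetical: we track the norm of the (cond - uncond) correction and nudge
# the guidance weight w toward a setpoint, like a lane-keeping controller.
import numpy as np

rng = np.random.default_rng(0)

def denoise(x, conditional):
    """Stub denoiser standing in for a diffusion model's noise prediction."""
    drift = 0.1 * np.tanh(x) if conditional else 0.0
    return 0.5 * x + drift + 0.01 * rng.standard_normal(x.shape)

def guided_step(x, w):
    """Classic CFG combine: uncond + w * (cond - uncond)."""
    eps_u = denoise(x, conditional=False)
    eps_c = denoise(x, conditional=True)
    correction = eps_c - eps_u
    return eps_u + w * correction, np.linalg.norm(correction)

w, target, gain = 7.5, 0.5, 2.0   # initial scale, setpoint, proportional gain
x = rng.standard_normal(16)
for step in range(50):
    eps, strength = guided_step(x, w)
    # Proportional feedback: if the correction overshoots the setpoint,
    # lower w; if it undershoots, raise it (clamped to a sane range).
    w = float(np.clip(w + gain * (target - strength), 1.0, 15.0))
    x = x - 0.1 * eps               # toy "denoising" update
print(f"final guidance weight: {w:.2f}")
```

The payoff of the control framing is that the guidance weight becomes a regulated state the sampler adjusts online, rather than a constant someone has to hand-tune.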
The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.
This paper builds UniG2U-Bench, a large benchmark for finding out when making pictures (generation) actually helps models understand pictures and text together.
Agentic AIs don’t just chat; they plan, use tools, and take many steps, so one wrong click can cause real harm.
This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.
Robots learn better by reasoning about how things move over time than by redrawing every pixel of a video.
BeyondSWE is a new benchmark that tests code agents on tougher, more real-life tasks than single-repo bug fixing.
NOVA is a new video editor that lets you change a few key frames (sparse control) while it carefully keeps the original motion and background details (dense synthesis).
NE-Dreamer is a model-based reinforcement learning agent that skips redrawing pixels and instead learns by predicting the next step's hidden features.
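To make "predict hidden features, not pixels" concrete, here is a minimal sketch of a reconstruction-free latent dynamics loss in the BYOL/SPR style: an encoder plus a dynamics network predicts the next step's features, which come from the same encoder under stop-gradient. The module sizes, cosine loss, and stop-gradient target are illustrative assumptions, not NE-Dreamer's exact architecture.

```python
# Sketch: learning a world model by predicting next-step latent features
# instead of reconstructing pixels. Shapes and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDynamics(nn.Module):
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Predicts the next latent from the current latent and the action.
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 128),
                                      nn.ReLU(), nn.Linear(128, latent_dim))

    def loss(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_pred = self.dynamics(torch.cat([z, action], dim=-1))
        with torch.no_grad():                # stop-gradient target features
            z_target = self.encoder(next_obs)
        # Cosine distance between predicted and actual next-step features;
        # note there is no pixel reconstruction term anywhere.
        return 1 - F.cosine_similarity(z_pred, z_target, dim=-1).mean()

model = LatentDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, act, nxt = torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64)
loss = model.loss(obs, act, nxt)
opt.zero_grad(); loss.backward(); opt.step()
print(f"latent prediction loss: {loss.item():.3f}")
```

In practice, objectives like this usually use a slowly updated (EMA) copy of the encoder as the target to avoid representational collapse, and the agent's policy and value heads are then trained on top of the learned latents.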
DREAM is one model that both understands images (like CLIP) and makes images from text (like top text-to-image models).