NOVA is a new video editor that lets you change a few key frames (sparse control) while it carefully keeps the original motion and background details (dense synthesis).
STMI is a new way to recognize the same object across different kinds of cameras (color, night-vision, and thermal) without throwing away useful details.
Masked Image Generation Models (MIGMs) make pictures by filling in many blank spots step by step, but each step is slow and repeats a lot of work.
WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.
SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.
SAM 3D Body (3DB) is a model that turns a single photo of a person into a full 3D mesh of the body, feet, and hands with state-of-the-art accuracy.
This paper introduces Nexus Adapters, tiny helper networks that let a diffusion model follow both a text prompt and a structure map (like edges or depth) at the same time.
DreamID-Omni is one model that can create, edit, and animate human-centered videos with matching voices, all in sync.
Stroke3D lets you draw simple 2D stick-figure strokes and add a short text description, then builds a ready-to-animate 3D model with a skeleton and textures.
OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.
This paper introduces 3DiMo, a new way to control how people move in generated videos while leaving the camera movement free to be directed through text.
This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.