How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (22)


NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing

Intermediate
Tianlin Pan, Jiayi Dai et al. · Mar 3 · arXiv

NOVA is a new video editor that lets you change a few key frames (sparse control) while it carefully keeps the original motion and background details (dense synthesis).

#video editing · #pair-free training · #sparse control

STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification

Intermediate
Xingguo Xu, Zhanyu Liu et al. · Feb 28 · arXiv

STMI is a new way to recognize the same object across different kinds of cameras (color, night-vision, and thermal) without throwing away useful details.

#multi-modal re-identification · #RGB-NIR-TIR fusion · #segmentation-guided attention

Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Intermediate
Kaiwen Zhu, Quansheng Zeng et al. · Feb 27 · arXiv

Masked Image Generation Models (MIGMs) make pictures by filling in many blank spots step by step, but each step is slow and repeats a lot of work; this paper speeds that up by learning latent controlled dynamics.

#masked image generation · #MIGM-Shortcut · #latent controlled dynamics

World Guidance: World Modeling in Condition Space for Action Generation

Intermediate
Yue Su, Sijin Chen et al. · Feb 25 · arXiv

WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.

#Vision-Language-Action · #world modeling · #condition space

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Intermediate
Guibin Chen, Dixuan Lin et al. · Feb 25 · arXiv

SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.

#multimodal diffusion transformer · #video-audio generation · #inpainting

SAM 3D Body: Robust Full-Body Human Mesh Recovery

Intermediate
Xitong Yang, Devansh Kukreja et al. · Feb 17 · arXiv

SAM 3D Body (3DB) is a model that turns a single photo of a person into a full 3D mesh of the body, feet, and hands with state-of-the-art accuracy.

#human mesh recovery · #3D human pose · #Momentum Human Rig

Efficient Text-Guided Convolutional Adapter for the Diffusion Model

Intermediate
Aryan Das, Koushik Biswas et al. · Feb 16 · arXiv

This paper introduces Nexus Adapters, tiny helper networks that let a diffusion model follow both a text prompt and a structure map (like edges or depth) at the same time.

#Nexus Adapter · #text-guided adapter · #cross-attention

DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Intermediate
Xu Guo, Fulong Ye et al. · Feb 12 · arXiv

DreamID-Omni is one model that can create, edit, and animate human-centered videos with matching voices, all in sync.

#audio-video generation · #diffusion transformer · #identity preservation

Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

Intermediate
Ruisi Zhao, Haoren Zheng et al. · Feb 10 · arXiv

Stroke3D lets you draw simple 2D stick-figure strokes plus a short text prompt, and it builds a ready-to-animate 3D model with a skeleton and textures.

#Stroke3D · #rigged 3D generation · #skeleton-first pipeline

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Intermediate
Yue Ding, Yiyan Ji et al. · Feb 4 · arXiv

OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.

#Omni-LLM · #token compression · #modality-asymmetric

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Intermediate
Zhixue Fang, Xu He et al. · Feb 3 · arXiv

This paper introduces 3DiMo, a new way to control how people move in generated videos while keeping camera movement flexibly controllable through text.

#3D-aware motion · #implicit motion encoder · #motion tokens

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Intermediate
Minh-Quan Le, Gaurav Mittal et al. · Feb 2 · arXiv

This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.

#text-to-video · #optimal transport · #annotation-free