How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (31)


Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Beginner
Youliang Zhang, Zhengguang Zhou et al. · Feb 2 · arXiv

This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.

#grounded human-object interaction · #talking avatars · #diffusion transformer

SkyReels-V3 Technique Report

Intermediate
Debang Li, Zhengcong Fei et al. · Jan 24 · arXiv

SkyReels-V3 is a single AI model that can make videos in three ways: generating from reference images, extending an existing video, and creating talking avatars from audio.

#video generation · #diffusion transformer · #multimodal in-context learning

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Intermediate
Ziyi Wu, Daniel Watson et al. · Jan 22 · arXiv

This paper shows how to turn any ordinary photo or video into a seamless 360° panorama without needing the camera's settings, such as field of view or tilt.

#360 panorama generation · #equirectangular projection · #diffusion transformer

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate
Shijie Lian, Bin Yu et al. · Jan 21 · arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking at the scene and ignore the words you tell them.

#Vision-Language-Action · #Bayesian decomposition · #Latent Action Queries

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Intermediate
Pengze Zhang, Yanze Wu et al. · Jan 20 · arXiv

OmniTransfer is a single system that learns from a whole reference video, not just one image, so it can copy how things look (identity and style) and how they move (motion, camera, effects).

#spatio-temporal video transfer · #identity transfer · #style transfer

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Intermediate
Bin Yu, Shijie Lian et al. · Jan 20 · arXiv

TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge intact, and a trainable specialist that learns to move precisely.

#Vision-Language-Action · #catastrophic forgetting · #Asymmetric Mixture-of-Transformers

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Intermediate
Shuai Tan, Biao Gong et al. · Jan 16 · arXiv

CoDance is a new way to animate many characters in one picture using a single pose video, even when the picture and the video do not line up perfectly.

#multi-subject animation · #pose-guided video generation · #Unbind-Rebind paradigm

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate
Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI model that reads one or two images plus a short instruction like "move the bottle left to right," and then predicts how every pixel will move over the next moments.

#optical flow · #future optical flow prediction · #vision-language model

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Intermediate
Siqi Kou, Jiachun Jin et al. · Jan 15 · arXiv

Most text-to-image models act like word-to-pixel copy machines and miss the hidden meaning in our prompts.

#think-then-generate · #reasoning-aware text-to-image · #LLM encoder

NitroGen: An Open Foundation Model for Generalist Gaming Agents

Intermediate
Loïc Magne, Anas Awadalla et al. · Jan 4 · arXiv

NitroGen is a vision-to-action AI that learns to play many video games by watching 40,000 hours of gameplay video from over 1,000 titles, annotated with on-screen controller overlays.

#NitroGen · #generalist gaming agent · #behavior cloning

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Intermediate
Xu Guo, Fulong Ye et al. · Jan 4 · arXiv

DreamID-V is a new AI method that swaps faces in videos while keeping body movements, expressions, lighting, and background steady and natural.

#video face swapping · #image face swapping · #diffusion transformer

GR-Dexter Technical Report

Intermediate
Ruoshi Wen, Guangzeng Chen et al. · Dec 30 · arXiv

GR-Dexter is a full package: new robot hands, a capable AI brain, and a carefully mixed dataset that together let a two-handed robot follow language instructions through long, tricky tasks.

#vision-language-action · #dexterous manipulation · #bimanual robotics