This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.
This paper teaches image models to copy a change shown in one image pair and apply it to a new image, like saying 'hat added here, add a similar hat there.'
DreamZero is a robot brain that learns actions by predicting short videos of the future and the matching moves at the same time.
Video generators are slow because attention compares every token with every other token, so the cost grows quadratically with video length.
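As a generic illustration (not this paper's method), here is a minimal NumPy sketch of full attention: the score matrix has one entry per query-key pair, so its size, and the work to fill it, grows as N² in the number of tokens N.

```python
import numpy as np

def full_attention(q, k, v):
    # Every query attends to every key: the score matrix is N x N,
    # so compute and memory grow quadratically with sequence length N.
    scores = q @ k.T / np.sqrt(q.shape[-1])              # shape (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ v

n, d = 1024, 64                      # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = full_attention(q, k, v)
print(out.shape)                     # (1024, 64)
print(n * n)                         # score-matrix entries: over a million at just 1k tokens
```

For video, N covers every patch in every frame, so doubling the clip length roughly quadruples the attention cost, which is why speeding up or sparsifying attention is a common target.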
ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.
MOVA is an open-source AI that generates video and audio at the same time, so mouths, actions, and sounds stay in sync.
The paper tackles a big problem in fast image generators: they became quick, but they lost variety and kept producing similar-looking pictures.
PixelGen is a new image generator that works directly with pixels and uses what-looks-good-to-people guidance (perceptual loss) to improve quality.
FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.
This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.