How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (65)

#flow matching

SARAH: Spatially Aware Real-time Agentic Humans

Intermediate
Evonne Ng, Siwei Zhang et al. · Feb 20 · arXiv

SARAH is a real-time system that makes virtual characters move their whole bodies naturally during a conversation while knowing where the user is.

#spatially aware motion, #real-time avatars, #causal transformer

VLANeXt: Recipes for Building Strong VLA Models

Intermediate
Xiao-Ming Wu, Bin Fan et al. · Feb 20 · arXiv

This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.

#Vision-Language-Action, #robot manipulation, #flow matching

Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Intermediate
Hila Manor, Rinon Gal et al. · Feb 17 · arXiv

This paper teaches image models to copy a change shown in one image pair and apply it to a new image, like saying 'hat added here, add a similar hat there.'

#visual analogy learning, #LoRA, #LoRA basis

World Action Models are Zero-shot Policies

Intermediate
Seonghyeon Ye, Yunhao Ge et al. · Feb 17 · arXiv

DreamZero is a robot brain that learns actions by predicting short videos of the future and the matching moves at the same time.

#World Action Models, #DreamZero, #video diffusion

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Intermediate
Jintao Zhang, Kai Jiang et al. · Feb 13 · arXiv

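Video generators are slow because full attention compares every token with every other token. SpargeAttention2 makes attention sparse and trainable, combining Top-k and Top-p masking and fine-tuning the model with distillation.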

#sparse attention, #Top-k masking, #Top-p masking

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Intermediate
Zihan Yang, Shuyuan Tu et al. · Feb 9 · arXiv

ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.

#ArcFlow, #few-step distillation, #non-linear flow

MOVA: Towards Scalable and Synchronized Video-Audio Generation

Intermediate
SII-OpenMOSS Team, Donghua Yu et al. · Feb 9 · arXiv

MOVA is an open-source model that generates video and audio together, so that lip movements, actions, and sounds stay synchronized.

#video-audio generation, #lip synchronization, #dual-tower architecture

Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Intermediate
Tianhe Wu, Ruibin Li et al. · Feb 3 · arXiv

The paper tackles a major problem in distilled, fast image generators: they became quick but lost variety and kept producing similar pictures (mode collapse).

#diffusion distillation, #distribution matching distillation, #mode collapse

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Intermediate
Zehong Ma, Ruihan Xu et al. · Feb 2 · arXiv

PixelGen is a new image generator that works directly on pixels and uses a perceptual loss, which scores images by how they look to people, to improve quality.

#pixel diffusion, #perceptual loss, #LPIPS

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Intermediate
FSVideo Team, Qingyu Chen et al. · Feb 2 · arXiv

FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.

#FSVideo, #image-to-video, #video diffusion transformer

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Intermediate
Fu-Yun Wang, Han Zhang et al. · Feb 1 · arXiv

PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.

#PromptRL, #flow matching, #reinforcement learning

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Intermediate
Anthony Chen, Naomi Ken Korem et al. · Jan 29 · arXiv

This paper shows a simple, one-model way to dub videos that makes the new voice and the lips move together naturally.

#video dubbing, #audio-visual diffusion, #joint generation
