How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (19)

#VBench

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

Intermediate
Dahye Kim, Deepti Ghadiyaram et al. · Feb 19 · arXiv

This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each step.

#Diffusion Transformer #Dynamic Tokenization #Patch Scheduling

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Intermediate
Jintao Zhang, Kai Jiang et al. · Feb 13 · arXiv

Video generators are slow because attention looks at everything, which takes a lot of time.

#sparse attention #Top-k masking #Top-p masking

PISCO: Precise Video Instance Insertion with Sparse Control

Beginner
Xiangbo Gao, Renjie Li et al. · Feb 9 · arXiv

PISCO is a video AI that lets you place a specific object into a real video exactly where and when you want, using just a few keyframes instead of editing every frame.

#video instance insertion #sparse keyframe control #video diffusion

Optimizing Few-Step Generation with Adaptive Matching Distillation

Intermediate
Lichen Bai, Zikai Zhou et al. · Feb 7 · arXiv

Diffusion models make great images and videos but are slow because they usually need many tiny steps.

#diffusion distillation #few-step generation #distribution matching distillation

Pathwise Test-Time Correction for Autoregressive Long Video Generation

Intermediate
Xunzhi Xiang, Zixuan Duan et al. · Feb 5 · arXiv

This paper fixes a big problem in long video generation: tiny mistakes that snowball over time and make the video drift and flicker.

#test-time correction #autoregressive video diffusion #distilled diffusion

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Intermediate
FSVideo Team, Qingyu Chen et al. · Feb 2 · arXiv

FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.

#FSVideo #image-to-video #video diffusion transformer

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Intermediate
Dvir Samuel, Issar Tzachor et al. · Feb 2 · arXiv

The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.

#autoregressive video diffusion #KV cache compression #sparse attention

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Intermediate
Minh-Quan Le, Gaurav Mittal et al. · Feb 2 · arXiv

This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.

#text-to-video #optimal transport #annotation-free

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Intermediate
Tongcheng Fang, Hanling Zhang et al. · Jan 23 · arXiv

Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.

#SALAD #sparse attention #linear attention

A Mechanistic View on Video Generation as World Models: State and Dynamics

Intermediate
Luozhou Wang, Zhifei Chen et al. · Jan 22 · arXiv

This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.

#world models #video generation #state representation

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Intermediate
Qiyuan Zhang, Biao Gong et al. · Jan 16 · arXiv

This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.

#physics-aware video generation #rigid body motion #reinforcement learning

Transition Matching Distillation for Fast Video Generation

Intermediate
Weili Nie, Julius Berner et al. · Jan 14 · arXiv

Big video makers (diffusion models) create great videos but are too slow because they use hundreds of tiny clean-up steps.

#video diffusion #distillation #transition matching