🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
Search
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers14

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#VBench

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Intermediate
FSVideo Team, Qingyu Chen et al.Feb 2arXiv

FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.

#FSVideo#image-to-video#video diffusion transformer

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Intermediate
Dvir Samuel, Issar Tzachor et al.Feb 2arXiv

The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.

#autoregressive video diffusion#KV cache compression#sparse attention

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Intermediate
Minh-Quan Le, Gaurav Mittal et al.Feb 2arXiv

This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.

#text-to-video#optimal transport#annotation-free

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Intermediate
Tongcheng Fang, Hanling Zhang et al.Jan 23arXiv

Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.

#SALAD#sparse attention#linear attention

A Mechanistic View on Video Generation as World Models: State and Dynamics

Intermediate
Luozhou Wang, Zhifei Chen et al.Jan 22arXiv

This paper says modern video generators are starting to act like tiny "world simulators," not just pretty video painters.

#world models#video generation#state representation

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Intermediate
Qiyuan Zhang, Biao Gong et al.Jan 16arXiv

This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.

#physics-aware video generation#rigid body motion#reinforcement learning

Transition Matching Distillation for Fast Video Generation

Intermediate
Weili Nie, Julius Berner et al.Jan 14arXiv

Big video makers (diffusion models) create great videos but are too slow because they use hundreds of tiny clean-up steps.

#video diffusion#distillation#transition matching

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Intermediate
Longbin Ji, Xiaoxiong Liu et al.Jan 9arXiv

VideoAR is a new way to make videos with AI that writes each frame like a story, one step at a time, while painting details from coarse to fine.

#autoregressive video generation#visual autoregression#next-frame prediction

SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling

Intermediate
Yufan He, Pengfei Guo et al.Dec 29arXiv

SurgWorld teaches surgical robots using videos plus text, then guesses the missing robot moves so we can train good policies without collecting tons of real robot-action data.

#surgical world model#SATA dataset#inverse dynamics model

Yume-1.5: A Text-Controlled Interactive World Generation Model

Intermediate
Xiaofeng Mao, Zhen Li et al.Dec 26arXiv

Yume1.5 is a model that turns text or a single image into a living, explorable video world you can move through with keyboard keys.

#interactive world generation#video diffusion#temporal-spatial-channel modeling

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Intermediate
Haonan Qiu, Shikun Liu et al.Dec 24arXiv

HiStream makes 1080p video generation much faster by removing repeated work across space, time, and steps.

#high-resolution video generation#diffusion transformer (DiT)#dual-resolution caching

SemanticGen: Video Generation in Semantic Space

Intermediate
Jianhong Bai, Xiaoshi Wu et al.Dec 23arXiv

SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.

#Video generation#Diffusion model#Semantic representation
12