How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (24)

Filtered by tag: #temporal consistency

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Intermediate
Yiqi Lin, Guoqiang Liang et al. · Mar 2 · arXiv

Kiwi-Edit is a new video editor that follows your words and also copies looks from a picture you give it.

#reference-guided video editing #instruction-based editing #multimodal large language model

The Trinity of Consistency as a Defining Principle for General World Models

Intermediate
Jingxuan Wei, Siyuan Li et al. · Feb 26 · arXiv

The paper argues that to build an AI that truly understands and simulates the real world, it must be consistent in three ways at once: across different senses (modal), across 3D space (spatial), and across time (temporal).

#world model #trinity of consistency #modal consistency

MIND: Benchmarking Memory Consistency and Action Control in World Models

Intermediate
Yixuan Ye, Xuanyu Lu et al. · Feb 8 · arXiv

MIND is a new benchmark that fairly tests two core skills of world models: remembering the world over time (memory consistency) and following controls exactly (action control).

#world models #memory consistency #action control

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Intermediate
Shuo Chen, Cong Wei et al. · Feb 5 · arXiv

The paper tackles a central problem in long video generation: models either forget what happened earlier or slowly drift off-topic over time.

#autoregressive video generation #long-context modeling #distribution matching distillation

RISE-Video: Can Video Generators Decode Implicit World Rules?

Intermediate
Mingxin Liu, Shuran Ma et al. · Feb 5 · arXiv

RISE-Video is a new benchmark that checks whether video-generation AIs follow implicit world rules, not just produce pretty pictures.

#Text-Image-to-Video #video generation benchmark #reasoning alignment

FastVMT: Eliminating Redundancy in Video Motion Transfer

Intermediate
Yue Ma, Zhikai Wang et al. · Feb 5 · arXiv

FastVMT is a faster way to copy motion from one video to another without training a new model for each video.

#FastVMT #video motion transfer #Diffusion Transformer

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Intermediate
Sangbeom Lim, Seoung Wug Oh et al. · Jan 20 · arXiv

VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.

#video matting #alpha matte #binary segmentation mask

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Intermediate
Hongyuan Chen, Xingyu Chen et al. · Jan 20 · arXiv

Motion 3-to-4 turns a single ordinary video into a 3D object that moves over time (a 4D asset) by first recovering the object's shape and then estimating how every part moves.

#4D synthesis #monocular video #motion reconstruction

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Intermediate
Shuai Tan, Biao Gong et al. · Jan 16 · arXiv

CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.

#multi-subject animation #pose-guided video generation #Unbind–Rebind paradigm

FlowAct-R1: Towards Interactive Humanoid Video Generation

Intermediate
Lizhen Wang, Yongming Zhu et al. · Jan 15 · arXiv

FlowAct-R1 is a new system that generates lifelike human videos in real time, so the on-screen person can react promptly as you talk to them.

#interactive humanoid video #real-time streaming generation #temporal consistency

Motion Attribution for Video Generation

Intermediate
Xindi Wu, Despoina Paschalidou et al. · Jan 13 · arXiv

Motive is a new method for identifying which training videos teach an AI to move things realistically, not just how they should look.

#motion attribution #video diffusion #optical flow

MoCha: End-to-End Video Character Replacement without Structural Guidance

Intermediate
Zhengbo Xu, Jie Ma et al. · Jan 13 · arXiv

MoCha is a new AI that swaps a person in a video with a new character using only one mask on one frame and a few reference photos.

#video diffusion #character replacement #in-context learning