How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (67)


CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Intermediate
Lingen Li, Guangzhi Wang et al. · Mar 4 · arXiv

CubeComposer is a new AI method that turns a normal forward-facing video into a full 360° VR video at true 4K quality without using super-resolution upscaling.

#360° video generation · #cubemap · #spatio-temporal autoregression

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Intermediate
Shengbang Tong, David Fan et al. · Mar 3 · arXiv

The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.

#multimodal pretraining · #representation autoencoder · #RAE

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Intermediate
Yichen Liu, Donghao Zhou et al. · Mar 2 · arXiv

HiFi-Inpaint is a new AI method that fills a missing area in a photo of a person by inserting a specific product, while keeping tiny details like logos, textures, and small text crisp (see the high-frequency map sketch below the tags).

#reference-based inpainting · #high-frequency map · #Shared Enhancement Attention
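The "high-frequency map" tag above refers to a standard idea: isolating the fine detail that a low-pass blur removes. Below is a minimal, generic sketch of such a map in Python; it only illustrates the concept, not HiFi-Inpaint's actual construction, and the function and parameter names are made up for this example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency_map(image: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Per-pixel map of fine detail (edges, small text, texture).

    image: float array in [0, 1], shape (H, W) or (H, W, C)
    sigma: blur strength; larger sigma counts more structure as low frequency
    """
    # Low-pass: blur away fine detail, keep smooth shading.
    per_axis_sigma = (sigma, sigma) + (0,) * (image.ndim - 2)
    low = gaussian_filter(image, sigma=per_axis_sigma)
    # High-pass residual: what the blur removed (logos, small text, texture).
    high = image - low
    # Collapse channels, if any, into a single magnitude map.
    return np.abs(high).mean(axis=-1) if image.ndim == 3 else np.abs(high)
```

Regions that score high on such a map, like logos, printed text, and fabric texture, are exactly the details the paper says must stay crisp after inpainting.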

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Intermediate
Yiqi Lin, Guoqiang Liang et al. · Mar 2 · arXiv

Kiwi-Edit is a new video editor that follows your written instructions and can also copy the look of a reference picture you give it.

#reference-guided video editing · #instruction-based editing · #multimodal large language model

DreamWorld: Unified World Modeling in Video Generation

Intermediate
Boming Tan, Xiangdong Zhang et al. · Feb 28 · arXiv

DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.

#video diffusion transformer · #world model · #optical flow

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Intermediate
Shengqu Cai, Weili Nie et al. · Feb 27 · arXiv

AI can already make short videos that look sharp and lively, but long videos need story structure and memory, and there is very little training data for them.

#long video generation · #flow matching · #distribution matching

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Intermediate
Yasaman Haghighi, Alexandre Alahi · Feb 27 · arXiv

SenCache speeds up video diffusion models by reusing past answers only when the model is predicted to change very little (see the caching sketch below the tags).

#diffusion models · #video generation · #caching
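As a rough picture of the caching idea described above, here is a toy denoising loop that reuses the previous model output whenever the prediction has recently changed very little. The threshold, the change measure, and all names here are assumptions for illustration; SenCache's actual sensitivity predictor is more involved than this.

```python
import torch

def sample_with_caching(model, x, timesteps, step_fn, threshold=1e-3):
    """Denoising loop that skips model calls when outputs barely change.

    model:     denoiser, model(x, t) -> noise prediction
    x:         current noisy latent (torch.Tensor)
    timesteps: iterable of timesteps, from high noise to low
    step_fn:   scheduler update, step_fn(x, eps, t) -> next latent
    threshold: reuse the cache while the last observed change is below this
    """
    cached_eps, last_change = None, float("inf")
    for t in timesteps:
        if cached_eps is not None and last_change < threshold:
            eps = cached_eps                     # cheap: reuse cached prediction
        else:
            eps = model(x, t)                    # expensive: full forward pass
            if cached_eps is not None:
                # Crude "sensitivity" signal: how much the prediction moved.
                last_change = (eps - cached_eps).abs().mean().item()
            cached_eps = eps
        x = step_fn(x, eps, t)
    return x
```

The hard part in the paper is predicting that change cheaply before paying for the forward pass; this sketch only measures it after the fact.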

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Intermediate
Guibin Chen, Dixuan Lin et al. · Feb 25 · arXiv

SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.

#multimodal diffusion transformer · #video-audio generation · #inpainting

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Intermediate
Euisoo Jung, Byunghyun Kim et al. · Feb 25 · arXiv

Diffusion models make great images, but they are slow because they remove noise step by step over many sequential passes (see the generic sampling loop sketched below the tags).

#diffusion inference · #multi-GPU acceleration · #data parallelism
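To make the "step by step" bottleneck above concrete, here is a generic classifier-free-guidance sampling loop: every step waits for the previous one, and with guidance each step costs two model evaluations. This is ordinary diffusion sampling shown for context only, not the paper's hybrid data-pipeline schedule; all names are illustrative.

```python
import torch

def cfg_sample(model, x, timesteps, cond, step_fn, guidance_scale=7.5):
    """Standard guided sampling: many sequential, mutually dependent steps."""
    for t in timesteps:                        # tens to hundreds of steps
        eps_cond = model(x, t, cond)           # conditional forward pass
        eps_uncond = model(x, t, None)         # unconditional forward pass
        # Classifier-free guidance: push toward the conditional prediction.
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = step_fn(x, eps, t)                 # step t+1 needs step t's output
    return x
```

Because each step depends on the last, speedups usually come from doing less work per step or spreading the work across GPUs, which is the territory this paper explores.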

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Intermediate
Abdelrahman Shaker, Ahmed Heakl et al. · Feb 23 · arXiv

Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.

#Mobile-O · #unified multimodal model · #on-device AI

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Intermediate
Kai Liu, Yanhao Zheng et al. · Feb 22 · arXiv

JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.

#joint audio-video generation · #multimodal diffusion transformer · #modality-specific mixture-of-experts

SARAH: Spatially Aware Real-time Agentic Humans

Intermediate
Evonne Ng, Siwei Zhang et al. · Feb 20 · arXiv

SARAH is a real-time system that makes virtual characters move their whole bodies naturally during a conversation while knowing where the user is.

#spatially aware motion · #real-time avatars · #causal transformer