Papers (65)

#flow matching

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Lingen Li, Guangzhi Wang et al. · Mar 4 · arXiv

CubeComposer is a new AI method that turns a normal forward-facing video into a full 360° VR video at true 4K quality without using super-resolution upscaling.

#360° video generation, #cubemap, #spatio-temporal autoregression

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Intermediate

Shengbang Tong, David Fan et al. · Mar 3 · arXiv

The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.

#multimodal pretraining, #representation autoencoder, #RAE

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images

Intermediate

Yichen Liu, Donghao Zhou et al. · Mar 2 · arXiv

HiFi-Inpaint is a new AI method that fills a missing area in a photo of a person by inserting a specific product, while keeping tiny details like logos, textures, and small text crisp.

#reference-based inpainting, #high-frequency map, #Shared Enhancement Attention

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Intermediate

Yiqi Lin, Guoqiang Liang et al. · Mar 2 · arXiv

Kiwi-Edit is a new video editor that follows your words and also copies looks from a picture you give it.

#reference-guided video editing, #instruction-based editing, #multimodal large language model

$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs

Intermediate

Siting Wang, Xiaofeng Wang et al. · Mar 2 · arXiv

Robots that read images and instructions (VLAs) get stuck following a narrow, fragile path after normal training.

#vision-language-action, #flow matching, #stochastic differential equations

DreamWorld: Unified World Modeling in Video Generation

Intermediate

Boming Tan, Xiangdong Zhang et al. · Feb 28 · arXiv

DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.

#video diffusion transformer, #world model, #optical flow

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Intermediate

Shengqu Cai, Weili Nie et al. · Feb 27 · arXiv

Short videos are easy for AI to make sharp and lively, but long videos need stories and memory, and there isn’t much training data for that.

#long video generation, #flow matching, #distribution matching

SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Intermediate

Yasaman Haghighi, Alexandre Alahi · Feb 27 · arXiv

SenCache speeds up video diffusion models by reusing past answers only when the model is predicted to change very little.

#diffusion models, #video generation, #caching

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Intermediate

Guibin Chen, Dixuan Lin et al. · Feb 25 · arXiv

SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.

#multimodal diffusion transformer, #video-audio generation, #inpainting

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Intermediate

Euisoo Jung, Byunghyun Kim et al. · Feb 25 · arXiv

Diffusion models produce high-quality images but are slow, because they remove noise gradually over many sequential steps.

#diffusion inference, #multi-GPU acceleration, #data parallelism

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Intermediate

Abdelrahman Shaker, Ahmed Heakl et al. · Feb 23 · arXiv

Mobile-O is a small but smart AI that can both understand pictures and make new images, and it runs right on your phone.

#Mobile-O, #unified multimodal model, #on-device AI

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Intermediate

Kai Liu, Yanhao Zheng et al. · Feb 22 · arXiv

JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.

#joint audio-video generation, #multimodal diffusion transformer, #modality-specific mixture-of-experts
