How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (25)


Transition Matching Distillation for Fast Video Generation

Intermediate
Weili Nie, Julius Berner et al. · Jan 14 · arXiv

Big video makers (diffusion models) create great videos but are too slow because they use hundreds of tiny clean-up steps.

#video diffusion · #distillation · #transition matching

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Intermediate
Longbin Ji, Xiaoxiong Liu et al. · Jan 9 · arXiv

VideoAR is a new way to make videos with AI that writes each frame like a story, one step at a time, while painting details from coarse to fine.

#autoregressive video generation · #visual autoregression · #next-frame prediction

Boosting Latent Diffusion Models via Disentangled Representation Alignment

Intermediate
John Page, Xuesong Niu et al. · Jan 9 · arXiv

This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes from one another, a property called semantic disentanglement.

#Send-VAE · #semantic disentanglement · #latent diffusion

LTX-2: Efficient Joint Audio-Visual Foundation Model

Intermediate
Yoav HaCohen, Benny Brazowski et al. · Jan 6 · arXiv

LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.

#text-to-video · #text-to-audio · #audiovisual generation

Over++: Generative Video Compositing for Layer Interaction Effects

Intermediate
Luchao Qi, Jiaye Wu et al. · Dec 22 · arXiv

Over++ is a video AI that adds realistic interaction effects such as shadows, splashes, dust, and smoke between a foreground layer and a background, without altering the original footage.

#augmented compositing · #video diffusion · #video inpainting

EasyV2V: A High-quality Instruction-based Video Editing Framework

Intermediate
Jinjie Mai, Chaoyang Wang et al. · Dec 18 · arXiv

EasyV2V is a simple but powerful system that edits videos by following plain-language instructions like “make the shirt blue starting at 2 seconds.”

#instruction-based video editing · #spatiotemporal mask · #text-to-video fine-tuning

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Intermediate
Giorgos Petsangourakis, Christos Sgouropoulos et al. · Dec 18 · arXiv

Latent diffusion models are great at making images but are slow to learn the meaning of scenes, because their training objective mostly teaches them to clean up noise rather than to understand objects and layouts.

#latent diffusion · #REGLUE · #representation entanglement

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate
Ziqi Ma, Hongqiao Chen et al. · Dec 15 · arXiv

Steer3D lets you edit a 3D object just by typing what you want, like "add a roof rack," and applies the change in a single fast feedforward pass.

#3D editing · #image-to-3D · #ControlNet

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Intermediate
Yifan Pu, Yizeng Han et al. · Dec 15 · arXiv

Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.

#text-to-image · #diffusion models · #few-step generation

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Intermediate
Peiying Zhang, Nanxuan Zhao et al. · Dec 11 · arXiv

DuetSVG is a new AI that learns to make SVG graphics by generating an image and the matching SVG code together, like sketching first and then tracing neatly.

#DuetSVG · #multimodal generation · #SVG generation

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Intermediate
Tong Zhang, Carlos Hinojosa et al. · Dec 11 · arXiv

Diffusion models sometimes reproduce training images too closely, which raises privacy and copyright concerns.

#diffusion models · #memorization mitigation · #latent feature injection

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Intermediate
Ruihang Chu, Yefei He et al. · Dec 9 · arXiv

Wan-Move is a new way to control how things move in AI-generated videos by guiding motion directly inside the model’s hidden features.

#motion-controllable video generation · #latent trajectory guidance · #point trajectories