Papers (4)

#channel-wise concatenation

VIBE: Visual Instruction Based Editor

Grigorii Alekseenko, Aleksandr Gordeev et al. · Jan 5 · arXiv

VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.

#instruction-based image editing #vision-language model #diffusion model

Not triaged yet

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Intermediate

Jiawei Liu, Junqiao Li et al. · Dec 24 · arXiv

DreaMontage is a new AI method that makes long, single-shot videos that feel smooth and connected, even when you give it scattered images or short clips in the middle.

#arbitrary frame conditioning #one-shot video generation #Diffusion Transformer

Not triaged yet

EgoX: Egocentric Video Generation from a Single Exocentric Video

Intermediate

Taewoong Kang, Kinam Kim et al. · Dec 9 · arXiv

EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.

#egocentric video generation #exocentric to egocentric #video diffusion models

Not triaged yet

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Intermediate

Xin He, Longhui Wei et al. · Dec 4 · arXiv

EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images.

#EMMA #unified multimodal architecture #32x autoencoder

Not triaged yet