How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)

#VAE latents

DREAM: Where Visual Understanding Meets Text-to-Image Generation

Chao Li, Tianhong Li et al., Mar 3, arXiv

DREAM is one model that both understands images (like CLIP) and makes images from text (like top text-to-image models).

#DREAM #contrastive learning #masked autoregressive modeling

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen, Dixuan Lin et al., Feb 25, arXiv

SkyReels-V4 is a single, unified model that makes videos and matching sounds together, while also letting you fix or change parts of a video.

#multimodal diffusion transformer #video-audio generation #inpainting

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker, Ahmed Heakl et al., Feb 23, arXiv

Mobile-O is a compact AI model that can both understand pictures and make new images, and it runs directly on your phone.

#Mobile-O #unified multimodal model #on-device AI

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Kai Liu, Yanhao Zheng et al., Feb 22, arXiv

JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.

#joint audio-video generation #multimodal diffusion transformer #modality-specific mixture-of-experts

CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

Shuai Tan, Biao Gong et al., Jan 16, arXiv

CoDance is a new way to animate many characters in one picture using just one pose video, even if the picture and the video do not line up perfectly.

#multi-subject animation #pose-guided video generation #Unbind–Rebind paradigm

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Junyi Chen, Tong He et al., Jan 5, arXiv

VINO is a single AI model that can make and edit both images and videos by taking in text together with reference pictures and clips at the same time.

#VINO #unified visual generator #multimodal diffusion transformer

SemanticGen: Video Generation in Semantic Space

Jianhong Bai, Xiaoshi Wu et al., Dec 23, arXiv

SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.

#Video generation #Diffusion model #Semantic representation

REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion

Giorgos Petsangourakis, Christos Sgouropoulos et al., Dec 18, arXiv

Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.

#latent diffusion #REGLUE #representation entanglement

EgoX: Egocentric Video Generation from a Single Exocentric Video

Taewoong Kang, Kinam Kim et al., Dec 9, arXiv

EgoX turns a regular third-person video into a first-person video that looks as if it were filmed through the actor’s eyes.

#egocentric video generation #exocentric to egocentric #video diffusion models
