How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (64)

#flow matching

Spatia: Video Generation with Updatable Spatial Memory

Intermediate
Jinjing Zhao, Fangyun Wei et al. | Dec 17 | arXiv

Spatia is a video generator that keeps a live 3D map of the scene (a point cloud) as its memory while making videos.

#video generation, #spatial memory, #3D point cloud

IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

Intermediate
Yuanhang Li, Yiren Song et al. | Dec 17 | arXiv

IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.

#video editing, #visual effects, #diffusion transformer

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Intermediate
Shengming Yin, Zekai Zhang et al. | Dec 17 | arXiv

The paper decomposes one flat picture into a stack of semi-transparent (RGBA) layers, so you can edit one element without disturbing the rest.

#image decomposition, #RGBA layers, #alpha blending

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate
Ziqi Ma, Hongqiao Chen et al. | Dec 15 | arXiv

Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.

#3D editing, #image-to-3D, #ControlNet

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Intermediate
Yicheng Feng, Wanpeng Zhang et al. | Dec 15 | arXiv

Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.

#Vision-Language-Action, #3D spatial grounding, #visual-physical alignment

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Intermediate
Yifan Pu, Yizeng Han et al. | Dec 15 | arXiv

Big text-to-image models make amazing pictures but are slow because they take many small denoising steps (often dozens to hundreds) to turn noise into an image.

#text-to-image, #diffusion models, #few-step generation

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate
Minglei Shi, Haolin Wang et al. | Dec 12 | arXiv

This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.

#text-to-image, #diffusion transformer, #flow matching

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Intermediate
Tsai-Shien Chen, Aliaksandr Siarohin et al. | Dec 11 | arXiv

Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.

#open-vocabulary attribute encoder, #attribute disentanglement, #visual concept personalization

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Intermediate
Hao Lu, Ziyang Liu et al. | Dec 10 | arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP, #vision-language-action, #world model

OmniPSD: Layered PSD Generation with Diffusion Transformer

Intermediate
Cheng Liu, Yiren Song et al. | Dec 10 | arXiv

OmniPSD is a new model that can both generate layered Photoshop (PSD) files from text prompts and decompose a flat image into clean, editable layers.

#OmniPSD, #layered PSD generation, #RGBA-VAE

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Intermediate
Zheng Ding, Weirui Ye | Dec 9 | arXiv

TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.

#TreeGRPO, #reinforcement learning, #diffusion models

Scaling Zero-Shot Reference-to-Video Generation

Intermediate
Zijian Zhou, Shikun Liu et al. | Dec 7 | arXiv

Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.

#reference-to-video generation, #zero-shot video synthesis, #masked training