🎓How I Study AIHISA
📖Read
📄Papers📰Blogs🎬Courses
💡Learn
🛤️Paths📚Topics💡Concepts🎴Shorts
🎯Practice
🧩Problems🎯Prompts🧠Review
Search
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers49

AllBeginnerIntermediateAdvanced
All SourcesarXiv
#flow matching

Feedforward 3D Editing via Text-Steerable Image-to-3D

Intermediate
Ziqi Ma, Hongqiao Chen et al.Dec 15arXiv

Steer3D lets you change a 3D object just by typing what you want, like “add a roof rack,” and it does it in one quick pass.

#3D editing#image-to-3D#ControlNet

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Intermediate
Yicheng Feng, Wanpeng Zhang et al.Dec 15arXiv

Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.

#Vision-Language-Action#3D spatial grounding#visual-physical alignment

Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Intermediate
Yifan Pu, Yizeng Han et al.Dec 15arXiv

Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.

#text-to-image#diffusion models#few-step generation

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate
Minglei Shi, Haolin Wang et al.Dec 12arXiv

This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.

#text-to-image#diffusion transformer#flow matching

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Intermediate
Tsai-Shien Chen, Aliaksandr Siarohin et al.Dec 11arXiv

Omni-Attribute is a new image encoder that learns just the parts of a picture you ask for (like hairstyle or lighting) and ignores the rest.

#open-vocabulary attribute encoder#attribute disentanglement#visual concept personalization

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Intermediate
Hao Lu, Ziyang Liu et al.Dec 10arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP#vision-language-action#world model

OmniPSD: Layered PSD Generation with Diffusion Transformer

Intermediate
Cheng Liu, Yiren Song et al.Dec 10arXiv

OmniPSD is a new AI that can both make layered Photoshop (PSD) files from words and take apart a flat image into clean, editable layers.

#OmniPSD#layered PSD generation#RGBA-VAE

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Intermediate
Zheng Ding, Weirui YeDec 9arXiv

TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.

#TreeGRPO#reinforcement learning#diffusion models

Scaling Zero-Shot Reference-to-Video Generation

Intermediate
Zijian Zhou, Shikun Liu et al.Dec 7arXiv

Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.

#reference-to-video generation#zero-shot video synthesis#masked training

Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Intermediate
Yanran Zhang, Ziyi Wang et al.Dec 4arXiv

This paper teaches a computer to turn one single picture into a moving 3D scene that stays consistent from every camera angle.

#4D scene generation#single-image to 4D#joint geometry and motion

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Intermediate
Xin He, Longhui Wei et al.Dec 4arXiv

EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.

#EMMA#unified multimodal architecture#32x autoencoder

TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Intermediate
Zhenglin Cheng, Peng Sun et al.Dec 3arXiv

TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.

#TwinFlow#one-step generation#twin trajectories
12345