How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Beginner
Yuran Wang, Bohan Zeng et al. · Dec 14 · arXiv

Scone is a new AI method that makes images from instructions while correctly picking out the intended subject, even when several candidate subjects look similar.

#subject-driven image generation · #multi-subject composition · #subject distinction

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Intermediate
Jingdi Lei, Di Zhang et al. · Dec 14 · arXiv

Standard attention is slow for long texts because it compares every word with every other word, which takes quadratic time. This paper derives an exact, error-free linear attention from continuous-time dynamics, cutting that cost without approximating the result.

#error-free linear attention · #rank-1 matrix exponential · #continuous-time dynamics
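
To see why "quadratic time" hurts, here is a tiny NumPy sketch. It is a generic illustration, not the paper's error-free construction: the ELU+1 feature map, the shapes, and the function names are assumptions chosen only to contrast the O(n²) softmax form with a reordered linear-attention form.

```python
# Toy sketch (not the paper's method): why softmax attention is quadratic in
# sequence length n, and how a kernelized "linear attention" reorders the
# computation to avoid building the n x n matrix.
import numpy as np

def softmax_attention(Q, K, V):
    # Builds an explicit (n, n) score matrix: O(n^2 * d) time, O(n^2) memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Replaces softmax with a feature map phi, then reassociates:
    # phi(Q) @ (phi(K).T @ V) costs O(n * d^2) -- no (n, n) matrix is formed.
    phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))  # ELU(x) + 1, assumed
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # (d, d_v)
    normalizer = Qp @ Kp.sum(axis=0)   # (n,)
    return (Qp @ KV) / (normalizer[:, None] + eps)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```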

AutoMV: An Automatic Multi-Agent System for Music Video Generation

Intermediate
Xiaoxuan Tang, Xinping Lei et al. · Dec 13 · arXiv

AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.

#music-to-video generation · #multi-agent system · #music information retrieval

VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Intermediate
Avinash Amballa, Yashas Malur Saidutta et al. · Dec 12 · arXiv

VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.

#VOYAGER · #determinantal point process · #dataset diversity

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Intermediate
Chenrui Fan, Yijun Liang et al. · Dec 12 · arXiv

This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.

#V-REX · #Chain-of-Questions · #Exploratory visual reasoning

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Intermediate
Ye Fang, Tong Wu et al. · Dec 12 · arXiv

V-RGBX is a new video editing system that lets you change the true building blocks of a scene—like base color, surface bumps, material, and lighting—rather than just painting over pixels.

#intrinsic video editing · #inverse rendering · #forward rendering

Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Intermediate
Yang Fei, George Stoica et al. · Dec 12 · arXiv

The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.

#video diffusion · #structure-preserving motion · #SAM2

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate
Minglei Shi, Haolin Wang et al. · Dec 12 · arXiv

This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.

#text-to-image · #diffusion transformer · #flow matching
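
For readers who want the flow-matching idea in code, here is a minimal training-step sketch on precomputed encoder features. It is not SVG-T2I itself: the feature size, the tiny velocity network, and the linear interpolation path are illustrative assumptions, standing in for "diffusion directly on foundation-model features, no VAE".

```python
# Toy sketch (not SVG-T2I): one flow-matching training step on precomputed
# image features from a frozen vision encoder.
import torch
import torch.nn as nn

feat_dim = 768                      # assumed size of the frozen encoder's features
velocity_net = nn.Sequential(       # stand-in for a diffusion transformer
    nn.Linear(feat_dim + 1, 512), nn.SiLU(), nn.Linear(512, feat_dim)
)
opt = torch.optim.AdamW(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(x1):
    """x1: (batch, feat_dim) features from a frozen vision encoder."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # linear interpolation path
    target_v = x1 - x0                             # constant velocity of that path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    loss = (pred_v - target_v).pow(2).mean()       # regress the velocity field
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

fake_features = torch.randn(16, feat_dim)          # pretend DINO-style features
print(flow_matching_step(fake_features))
```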

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Intermediate
Zhenyang Cai, Jiaming Zhang et al. · Dec 12 · arXiv

DentalGPT is a special AI that looks at dental images and text together and explains what it sees like a junior dentist.

#DentalGPT · #multimodal large language model · #dentistry AI

Rethinking Expert Trajectory Utilization in LLM Post-training

Intermediate
Bowen Ding, Yuhan Chen et al. · Dec 12 · arXiv

The paper asks how to best use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.

#Supervised Fine-Tuning · #Reinforcement Learning · #Expert Trajectories

Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Intermediate
Han Lin, Xichen Pan et al. · Dec 12 · arXiv

MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.

#MetaCanvas · #MLLM · #Diffusion Transformer

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

Intermediate
Chao Xu, Suyu Zhang et al. · Dec 12 · arXiv

Vision-Language-Action (VLA) models are robots’ “see–think–do” brains that connect cameras (vision), words (language), and motors (action).

#Vision-Language-Action · #Embodied AI · #Multimodal Alignment
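
As a rough mental model of the "see-think-do" loop the survey describes, here is a toy forward pass that fuses an image embedding and an instruction embedding into an action. Every module size and name is an assumption made for illustration, not any specific model covered in the paper.

```python
# Toy sketch of the generic VLA pattern (see-think-do): encode an image and an
# instruction, fuse them, and decode an action. All sizes are assumptions.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, emb=256, action_dim=7):      # e.g. a 7-DoF arm command (assumed)
        super().__init__()
        self.vision = nn.Sequential(                 # "see": image -> embedding
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb)
        )
        self.language = nn.Embedding(1000, emb)      # "think": token ids -> embeddings
        self.action_head = nn.Sequential(            # "do": fused state -> motor command
            nn.Linear(2 * emb, emb), nn.ReLU(), nn.Linear(emb, action_dim)
        )

    def forward(self, image, token_ids):
        v = self.vision(image)                       # (batch, emb)
        l = self.language(token_ids).mean(dim=1)     # (batch, emb), mean-pooled prompt
        return self.action_head(torch.cat([v, l], dim=-1))

model = TinyVLA()
action = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 8)))
print(action.shape)  # torch.Size([2, 7])
```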