How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)


Learning Unmasking Policies for Diffusion Language Models

Intermediate
Metod Jazbec, Theo X. Olausson et al. · Dec 9 · arXiv

Diffusion language models write by gradually unmasking hidden words, so deciding which blanks to reveal next is a big deal for both speed and accuracy; a toy unmasking step is sketched after this card.

#diffusion language models · #masked diffusion · #unmasking policy
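To make the idea of an unmasking policy concrete, here is a minimal, hypothetical sketch (not the paper's learned policy): at each decoding step, a simple heuristic reveals the masked positions where the model's top prediction is most confident. The logits below are random stand-ins for what a real diffusion LM would produce.

```python
import torch

def confidence_unmask_step(logits, is_masked, k=4):
    """Toy unmasking policy for a masked diffusion LM: reveal the k masked
    positions whose top prediction has the highest softmax probability.
    logits:    (seq_len, vocab) scores from the denoiser (stand-in here)
    is_masked: (seq_len,) bool, True where a token is still hidden
    """
    probs = logits.softmax(dim=-1)             # per-position distributions
    conf, tokens = probs.max(dim=-1)           # best guess and its confidence
    conf = conf.masked_fill(~is_masked, -1.0)  # never re-reveal visible tokens
    k = min(k, int(is_masked.sum()))           # cannot reveal more than remain
    reveal = conf.topk(k).indices              # most confident masked slots
    return reveal, tokens[reveal]

# Usage with random stand-in logits; a real diffusion LM would supply these.
logits = torch.randn(16, 1000)
is_masked = torch.ones(16, dtype=torch.bool)
positions, new_tokens = confidence_unmask_step(logits, is_masked)
```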

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Intermediate
Chuhan Zhang, Guillaume Le Moing et al. · Dec 9 · arXiv

D4RT is a new AI model that turns regular videos into moving 3D scenes (4D) quickly and accurately.

#D4RT · #dynamic 4D reconstruction · #query-based decoding

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Intermediate
Hongyuan Tao, Bencheng Liao et al. · Dec 9 · arXiv

InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.

#InfiniteVL · #linear attention · #Gated DeltaNet

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Intermediate
Songqiao Hu, Zeyi Liu et al. · Dec 9 · arXiv

Robots that follow pictures and words (VLA models) can do many tasks, but they often bump into things because safety isn't guaranteed; a toy safety-filter sketch follows this card.

#Vision-Language-Action · #Safety Constraint Layer · #Control Barrier Function
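The tags mention a Control Barrier Function; as a generic, textbook-style illustration of how such a constraint can override an unsafe command (not the VLSA layer itself), here is a 1-D safety filter where the barrier h(x) is only allowed to shrink at a bounded rate. All dynamics and names are hypothetical.

```python
def cbf_safety_filter(x, u_nominal, wall, alpha=1.0):
    """Toy Control-Barrier-Function-style filter for a 1-D robot x' = u.

    Barrier h(x) = wall - x (stay left of the wall). Enforcing
    dh/dt >= -alpha * h(x) gives u <= alpha * h(x), so the filter
    clamps the nominal command into that safe range.
    """
    h = wall - x                  # current safety margin
    u_max = alpha * h             # fastest approach speed the barrier allows
    return min(u_nominal, u_max)  # minimally alter the nominal action

# The policy asks to rush forward; the filter slows it near the wall.
x, wall = 0.9, 1.0
u_policy = 2.0                                 # unsafe nominal command
u_safe = cbf_safety_filter(x, u_policy, wall)  # -> 0.1
```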

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Intermediate
Ruihang Chu, Yefei He et al. · Dec 9 · arXiv

Wan-Move is a new way to control how things move in AI-generated videos by guiding motion directly inside the model’s hidden features.

#motion-controllable video generation · #latent trajectory guidance · #point trajectories

Thinking with Images via Self-Calling Agent

Intermediate
Wenxi Yang, Yuzhong Zhao et al. · Dec 9 · arXiv

This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.

#Self-Calling Chain-of-Thought · #sCoT · #interleaved multimodal chain-of-thought

Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform

Beginner
Yuning Gong, Yifei Liu et al. · Dec 9 · arXiv

Visionary is a web-based platform that lets you view and interact with advanced 3D scenes, right in your browser, with just a click.

#WebGPU · #3D Gaussian Splatting · #ONNX Runtime Web

TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

Intermediate
Jiahao Lu, Weitao Xiong et al. · Dec 9 · arXiv

TrackingWorld turns a regular single-camera video into a map of where almost every pixel moves in 3D space over time.

#monocular 3D tracking · #world-centric coordinates · #camera pose estimation

Towards a Science of Scaling Agent Systems

Beginner
Yubin Kim, Ken Gu et al. · Dec 9 · arXiv

Multi-agent AI teams are not automatically better; their success depends on matching the team’s coordination style to the job’s structure.

#multi-agent systems · #agentic evaluation · #scaling laws

OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Intermediate
Yexin Liu, Manyuan Zhang et al. · Dec 9 · arXiv

OpenSubject is a giant video-based dataset (2.5M samples, 4.35M images) built to help AI make pictures that keep each person or object looking like themselves, even in busy scenes.

#subject-driven generation · #identity fidelity · #video-derived dataset

EgoX: Egocentric Video Generation from a Single Exocentric Video

Intermediate
Taewoong Kang, Kinam Kim et al. · Dec 9 · arXiv

EgoX turns a regular third-person video into a first-person video that looks like it was filmed from the actor’s eyes.

#egocentric video generation · #exocentric to egocentric · #video diffusion models

Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Intermediate
Meng Wei, Chenyang Wan et al. · Dec 9 · arXiv

Robots that follow spoken instructions used to be slow and jerky because one big model tried to think and move at the same time.

#vision-and-language navigation · #VLM planner · #dual-system architecture