How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (1262)

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Intermediate
Yulu Gan, Ligeng Zhu et al. · Dec 11 · arXiv

FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.

#motion understanding, #spatio-temporal reasoning, #video question answering

DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Intermediate
Peiying Zhang, Nanxuan Zhao et al. · Dec 11 · arXiv

DuetSVG is a model that learns to produce SVG graphics by generating an image and its matching SVG code together, like sketching first and then tracing neatly.

#DuetSVG, #multimodal generation, #SVG generation

MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Intermediate
Kehong Gong, Zhengyu Wen et al. · Dec 11 · arXiv

MoCapAnything is a system that turns a single regular video into a 3D animation that can drive any rigged character, not just humans or one animal type.

#motion capture, #category-agnostic mocap, #monocular video

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Intermediate
Zongzhao Li, Xiangzhe Kong et al. · Dec 11 · arXiv

The paper defines Microscopic Spatial Intelligence (MiSI) as the skill AI needs to understand tiny 3D things like molecules from 2D pictures and text, just like scientists do.

#microscopic spatial intelligence, #vision-language models, #orthographic projection

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Intermediate
Jingli Lin, Runsen Xu et al. · Dec 11 · arXiv

This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.

#video-based spatial intelligence, #multimodal large language models, #spatial construction

Scaling Behavior of Discrete Diffusion Language Models

Intermediate
Dimitri von Rütte, Janis Fluri et al. · Dec 11 · arXiv

This paper studies how a newer kind of language model, called a discrete diffusion language model (DLM), gets better as we give it more data, bigger models, and more compute.

#discrete diffusion, #language models, #scaling laws

What matters for Representation Alignment: Global Information or Spatial Structure?

Intermediate
Jaskirat Singh, Xingjian Leng et al. · Dec 11 · arXiv

This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).

#representation alignment, #REPA, #iREPA

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Beginner
Aileen Cheng, Alon Jacovi et al. · Dec 11 · arXiv

The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.

#LLM factuality, #benchmarking, #multimodal evaluation

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Intermediate
Zijian Wu, Lingkai Kong et al. · Dec 11 · arXiv

Big AI models often write very long step-by-step solutions, but usual checkers either only check the final answer or get lost in the long steps.

#Outcome-based Process Verifier, #Chain-of-Thought, #Process Verification

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Intermediate
Songyang Gao, Yuzhe Gu et al. · Dec 11 · arXiv

This paper builds a math problem–solving agent, Intern-S1-MO, that thinks in multiple rounds and remembers proven mini-results called lemmas so it can solve very long, Olympiad-level problems.

#long-horizon reasoning, #lemma-based memory, #multi-agent reasoning

Sharp Monocular View Synthesis in Less Than a Second

Beginner
Lars Mescheder, Wei Dong et al. · Dec 11 · arXiv

SHARP turns a single photo into a 3D scene you can look around in, and it does this in under one second on a single GPU.

#monocular view synthesis, #3D Gaussians, #real-time neural rendering

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Intermediate
Tong Zhang, Carlos Hinojosa et al. · Dec 11 · arXiv

Diffusion models sometimes copy training images too closely, which can be a privacy and copyright problem.

#diffusion models, #memorization mitigation, #latent feature injection
