How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (943)


Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Intermediate
Ming Li, Chenrui Fan et al. · Dec 23 · arXiv

This paper turns messy chains of thought from language models into clear, named steps so we can see how they really think through math problems.

#Schoenfeld’s Episode Theory #Cognitive Episodes #ThinkARM

How Much 3D Do Video Foundation Models Encode?

Intermediate
Zixuan Huang, Xiang Li et al. · Dec 23 · arXiv

This paper asks a simple question: do video AI models trained only on 2D videos secretly learn about 3D worlds?

#video foundation models #3D awareness #temporal reasoning

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Intermediate
Weichen Fan, Haiwen Diao et al. · Dec 22 · arXiv

The paper proposes the Prism Hypothesis: meanings (semantics) mainly live in low frequencies, while fine picture details live in high frequencies.

#Prism Hypothesis #Unified Autoencoding #Frequency-Band Modulator

GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Intermediate
Jiacheng Guo, Ling Yang et al. · Dec 22 · arXiv

GenEnv is a training system where a student AI and a teacher simulator grow together by exchanging tasks and feedback.

#GenEnv #co-evolutionary learning #difficulty-aligned curriculum

VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Intermediate
Xinyao Liao, Qiyuan He et al. · Dec 22 · arXiv

Autoregressive (AR) image models make pictures by choosing tokens one-by-one, but they were judged only on picking likely tokens, not on how good the final picture looks in pixels.

#autoregressive image generation #tokenizer–generator alignment #pixel-space reconstruction

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Beginner
Hanyang Kong, Xingyi Yang et al. · Dec 22 · arXiv

WorldWarp is a new method that turns a single photo plus a planned camera path into a long, steady, 3D-consistent video.

#Novel View Synthesis #3D Gaussian Splatting #Spatio-Temporal Diffusion

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Beginner
Yuqiao Tan, Minzheng Wang et al. · Dec 22 · arXiv

Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.

#Bottom-up Policy Optimization #internal layer policy #internal modular policy

Over++: Generative Video Compositing for Layer Interaction Effects

Intermediate
Luchao Qi, Jiaye Wu et al. · Dec 22 · arXiv

Over++ is a video AI that adds realistic effects like shadows, splashes, dust, and smoke between a foreground and a background without changing the original footage.

#augmented compositing #video diffusion #video inpainting

StoryMem: Multi-shot Long Video Storytelling with Memory

Intermediate
Kaiwen Zhang, Liming Jiang et al. · Dec 22 · arXiv

StoryMem is a new way to make minute‑long, multi‑shot videos that keep the same characters, places, and style across many clips.

#StoryMem #Memory-to-Video #multi-shot video generation

CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Intermediate
Moritz Böhle, Amélie Royer et al. · Dec 22 · arXiv

CASA is a new way to mix images and text inside a language model that keeps speed and memory low while keeping accuracy high.

#CASA #cross-attention #self-attention

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Intermediate
Li Puyin, Tiange Xiang et al. · Dec 22 · arXiv

QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.

#QuantiPhy #Vision-Language Models #Physical reasoning

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Beginner
Yujie Zhao, Hongwei Fan et al. · Dec 22 · arXiv

Robots learn better when they see many examples, but collecting lots of real videos is slow and expensive.

#robotic demonstration generation #depth-controlled video generation #metric-scale 3D reconstruction