This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
TSRBench is a giant test that checks if AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.
The paper finds almost 300 accepted NLP papers (mostly from 2025) that include at least one fake or non-existent reference, a mistake the authors call a HalluCitation.
LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.
VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
Typhoon-S is a simple, open recipe that turns a basic language model into a helpful assistant and then teaches it important local skills, all on a small budget.
Transformers slow down on very long inputs because standard attention looks at every token pair, which is expensive.
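The quadratic cost mentioned above can be seen directly in the shape of the attention score matrix. A minimal sketch (not from the paper; the function and sizes here are illustrative assumptions): for n tokens, standard attention computes one score per token pair, so the score matrix has n * n entries and doubling the input length quadruples the work.

```python
# Sketch of why standard attention is quadratic in sequence length:
# the score matrix Q @ K^T holds one entry per token pair.
import numpy as np

def attention_scores(n_tokens: int, d: int = 64) -> np.ndarray:
    """Random queries/keys of length n_tokens; returns pairwise scores."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n_tokens, d))
    K = rng.standard_normal((n_tokens, d))
    return Q @ K.T  # shape (n_tokens, n_tokens): one score per pair

small = attention_scores(512)
large = attention_scores(1024)
print(small.size)  # 262144 pairwise scores
print(large.size)  # 1048576 -- 2x the tokens, 4x the scores
```

This is why long-context methods try to avoid scoring every pair, e.g. by attending only to a sparse or local subset of tokens.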
LongCat-Flash-Thinking-2601 is a huge 560-billion-parameter Mixture-of-Experts model built to act like a careful helper that can use tools, browse, code, and solve multi-step tasks.
DSGym is a unified 'gym' where AI data science agents are tested and trained by actually running code on real datasets, not just chatting about them.
IVRA is a simple, training-free add-on that helps robot brains preserve the 2D layout of the images they see while following language instructions.
This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
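The core idea of a sandbox can be sketched in a few lines. This is an illustrative assumption, not the paper's actual system: run model-generated code in a separate child interpreter with a timeout, so a crash or infinite loop never takes down the agent itself.

```python
# Minimal sandbox sketch (hypothetical helper, not the paper's implementation):
# execute untrusted code in a child Python process and capture its output.
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: float = 5.0) -> str:
    """Run `code` in a fresh interpreter; return stdout or an error message."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if result.returncode != 0:
        return f"error: {result.stderr.strip()}"
    return result.stdout.strip()

print(run_in_sandbox("print(2 + 2)"))  # -> 4
```

A real sandbox adds stronger isolation (no network access, filesystem and memory limits); the subprocess-with-timeout pattern is just the smallest version of the idea.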