How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (196)


Recursive Language Models

Beginner
Alex L. Zhang, Tim Kraska et al. · Dec 31 · arXiv

Recursive Language Models (RLMs) let an AI read and work with prompts that are much longer than its normal memory by treating the prompt like a big external document it can open, search, and study with code.

#Recursive Language Models #RLM #Long-context reasoning
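The blurb above describes treating the prompt as an external document the model can open, search, and study with code. A toy Python sketch of that idea follows; every class and function name here is hypothetical, illustrating the concept rather than the paper's actual system:

```python
# Toy illustration of the recursive-language-model idea: instead of stuffing a
# huge prompt into a context window, keep it as an external string the model
# can page through and search with code. Names are hypothetical.

class PromptEnvironment:
    """Holds a prompt too long to fit in context; exposes code-level access."""

    def __init__(self, text: str, chunk_size: int = 100):
        self.text = text
        self.chunk_size = chunk_size

    def peek(self, start: int = 0) -> str:
        """Read one chunk, the way a model would page through the document."""
        return self.text[start:start + self.chunk_size]

    def grep(self, needle: str) -> list[int]:
        """Find every offset where a keyword occurs, without reading it all."""
        hits, pos = [], self.text.find(needle)
        while pos != -1:
            hits.append(pos)
            pos = self.text.find(needle, pos + 1)
        return hits

def answer_where_is(env: PromptEnvironment, keyword: str) -> str:
    """Stand-in for a recursive query: search first, then read only the hit."""
    hits = env.grep(keyword)
    if not hits:
        return "not found"
    return env.peek(max(0, hits[0] - 20))

env = PromptEnvironment("filler " * 5000 + "the secret code is 42. " + "filler " * 5000)
print(answer_where_is(env, "secret"))
```

The point of the sketch: the answer is recovered by searching and slicing, so only a 100-character window ever needs to enter the "model", no matter how long the prompt is.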

Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Beginner
Song Wang, Lingdong Kong et al. · Dec 30 · arXiv

Robots like cars and drones see the world with many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper shows a clear roadmap for teaching them to understand space by learning from all of these together.

#Spatial Intelligence #Multi-Modal Pre-Training #Self-Supervised Learning

Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Beginner
Xingyu Zhou, Qifan Li et al. · Dec 30 · arXiv

This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.

#Internal Guidance #Diffusion Transformer #Intermediate Supervision

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Beginner
Zefeng He, Xiaoye Qu et al. · Dec 30 · arXiv

DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-generation task.

#DiffThinker #Generative Multimodal Reasoning #Diffusion Models

Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Beginner
Zhenyu Zhang, Shujian Zhang et al. · Dec 30 · arXiv

This paper shows a new way (called RISE) to find and control how AI models think without needing any human-made labels.

#RISE #sparse auto-encoder #reasoning vectors
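The tags above mention a sparse auto-encoder, a generic tool for discovering interpretable directions in model activations. A minimal NumPy sketch of that generic idea (randomly initialized and untrained here; this is not RISE's actual architecture):

```python
import numpy as np

# Generic sparse auto-encoder forward pass (illustrative, not RISE itself).
# An overcomplete dictionary plus a ReLU gives sparse codes; the active
# latents can serve as candidate "reasoning vectors" in the blurb's terms.

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 32             # overcomplete: more latents than inputs
W_enc = rng.normal(0, 0.1, (d_in, d_hidden))
W_dec = rng.normal(0, 0.1, (d_hidden, d_in))

def encode(x):
    # ReLU keeps only positively-activated latents -> sparse code
    return np.maximum(0.0, x @ W_enc)

def decode(z):
    return z @ W_dec

x = rng.normal(size=(4, d_in))      # stand-in for model activations
z = encode(x)
x_hat = decode(z)

sparsity = (z > 0).mean()           # fraction of active latents
loss = np.square(x - x_hat).mean() + 0.01 * np.abs(z).mean()  # recon + L1
print(round(float(sparsity), 2), round(float(loss), 3))
```

Training would minimize `loss` by gradient descent; the L1 term is what pushes most latents to zero, leaving a few human-inspectable directions per input.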

Nested Browser-Use Learning for Agentic Information Seeking

Beginner
Baixuan Li, Jialong Wu et al. · Dec 29 · arXiv

This paper teaches AI helpers to browse the web more like people do, not just by grabbing static snippets.

#information-seeking agents #browser-use #ReAct function-calling

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Beginner
Ethan Chern, Zhulin Hu et al. · Dec 29 · arXiv

LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.

#real-time video diffusion #on-policy distillation #multimodal conditioning

Video-Browser: Towards Agentic Open-web Video Browsing

Beginner
Zhengyang Liang, Yan Shu et al. · Dec 28 · arXiv

The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.

#agentic video browsing #pyramidal perception #video understanding

SVBench: Evaluation of Video Generation Models on Social Reasoning

Beginner
Wenshuo Peng, Gongxuan Wang et al. · Dec 25 · arXiv

SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.

#social reasoning #video generation #benchmark

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Beginner
Li-Zhong Szu-Tu, Ting-Lin Wu et al. · Dec 24 · arXiv

The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS, and popularity (page views).

#YearGuessr #building age estimation #ordinal regression

C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Beginner
Jin Qin, Zihan Liao et al. · Dec 24 · arXiv

C2LLM is a new family of code embedding models that helps computers find the right code faster and more accurately.

#code retrieval #embedding model #cross-attention pooling
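The tags above mention cross-attention pooling, a general way to collapse per-token vectors into one embedding by letting a learned query attend over them instead of averaging. A minimal NumPy sketch of that generic mechanism (illustrative only, not C2LLM's actual architecture):

```python
import numpy as np

# Cross-attention pooling sketch: a learned query scores every token vector,
# and the softmax-weighted sum becomes the single sequence embedding, so
# informative tokens get more weight than boilerplate. Not C2LLM's model.

rng = np.random.default_rng(0)
seq_len, d = 6, 16
tokens = rng.normal(size=(seq_len, d))   # per-token hidden states
query = rng.normal(size=(d,))            # learned pooling query (fixed here)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = tokens @ query / np.sqrt(d)     # scaled dot-product attention scores
weights = softmax(scores)                # one weight per token, sums to 1
embedding = weights @ tokens             # weighted pool -> single d-dim vector
print(embedding.shape)
```

Compared with mean pooling (`tokens.mean(axis=0)`), the weights adapt to the content, which is the usual motivation for attention-based pooling in embedding models.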

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation #multimodal evaluation #cross-modal alignment