Papers (1262)

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

DrivePI is a single, small (0.5B) multimodal language model that sees with cameras and LiDAR, talks in natural language, and plans driving actions all at once.

#DrivePI #Vision-Language-Action #3D occupancy

State over Tokens: Characterizing the Role of Reasoning Tokens

Intermediate

Mosh Levy, Zohar Elyoseph et al. · Dec 14 · arXiv

Reasoning tokens (the words a model writes before its final answer) help the model think better, but they are not a trustworthy diary of how it really thought.

#State over Tokens #reasoning tokens #chain-of-thought

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Intermediate

Jingzhe Ding, Shengda Long et al. · Dec 14 · arXiv

NL2Repo-Bench is a new benchmark that tests if coding agents can build a whole Python library from just one long natural-language document and an empty folder.

#NL2Repo-Bench #autonomous coding agents #long-horizon reasoning

WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment

Intermediate

Mahir Labib Dihan, Tanzima Hashem et al. · Dec 14 · arXiv

WebOperator is a smart way for AI to use a map of choices (a search tree) to navigate websites safely and reach goals.

#web agent #tree search #best-first search

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Beginner

Yuran Wang, Bohan Zeng et al. · Dec 14 · arXiv

Scone is a new AI method that makes images from instructions while picking the right subject even when many candidates look similar.

#subject-driven image generation #multi-subject composition #subject distinction

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Intermediate

Jingdi Lei, Di Zhang et al. · Dec 14 · arXiv

Standard attention is slow for long texts because it compares every word with every other word, which takes quadratic time.

#error-free linear attention #rank-1 matrix exponential #continuous-time dynamics
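
To make the quadratic cost mentioned in this summary concrete, here is a minimal sketch of standard softmax attention scores in plain Python. It illustrates only the baseline being described, not the paper's method, and the function names `attention_scores` and `pairwise_comparisons` are hypothetical helpers chosen for this example.

```python
import math


def attention_scores(queries, keys):
    """Softmax-normalized dot-product scores: one row per query, one column per key."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        # Every query is compared with every key, which is the quadratic part.
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def pairwise_comparisons(n):
    """Number of query-key comparisons for a sequence of n tokens."""
    return n * n


# Toy example: three 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens, tokens)
```

Doubling the sequence length quadruples the number of comparisons, which is why long texts become expensive under this scheme.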

AutoMV: An Automatic Multi-Agent System for Music Video Generation

Intermediate

Xiaoxuan Tang, Xinping Lei et al. · Dec 13 · arXiv

AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.

#music-to-video generation #multi-agent system #music information retrieval

VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Intermediate

Avinash Amballa, Yashas Malur Saidutta et al. · Dec 12 · arXiv

VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.

#VOYAGER #determinantal point process #dataset diversity

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Intermediate

Chenrui Fan, Yijun Liang et al. · Dec 12 · arXiv

This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.

#V-REX #Chain-of-Questions #Exploratory visual reasoning

V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Intermediate

Ye Fang, Tong Wu et al. · Dec 12 · arXiv

V-RGBX is a new video editing system that lets you change the true building blocks of a scene—like base color, surface bumps, material, and lighting—rather than just painting over pixels.

#intrinsic video editing #inverse rendering #forward rendering

Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Intermediate

Yang Fei, George Stoica et al. · Dec 12 · arXiv

The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.

#video diffusion #structure-preserving motion #SAM2

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate

Minglei Shi, Haolin Wang et al. · Dec 12 · arXiv

This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.

#text-to-image #diffusion transformer #flow matching
