How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (9)

Tag: #long-horizon planning

ProAct: Agentic Lookahead in Interactive Environments

Intermediate
Yangbin Yu, Mingyu Yang et al. · Feb 5 · arXiv

ProAct teaches AI agents to think ahead accurately without needing expensive search every time they act (see the sketch below).

#ProAct #GLAD #MC-Critic
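
The summary points at lookahead planning without per-step search. The snippet below is a small, hypothetical sketch of that general idea, not ProAct's actual method: a critic trained offline to predict rollout values scores candidate actions in one cheap call, so no new search is run at decision time. All names (`propose_actions`, `critic_score`) are invented for illustration.

```python
# Hypothetical sketch: lookahead via a learned critic instead of online search.
# Nothing here comes from the ProAct paper's code; names are made up for illustration.

def propose_actions(state):
    """Placeholder action proposer (an LLM policy would go here)."""
    return [f"{state}->a{i}" for i in range(3)]

def critic_score(state, action):
    """Placeholder learned critic.
    Imagine it was trained offline to predict Monte-Carlo rollout returns,
    so scoring is one cheap call rather than many simulated rollouts."""
    return hash((state, action)) % 100 / 100.0

def act(state):
    """Pick the action the critic likes best -- no tree search at decision time."""
    candidates = propose_actions(state)
    return max(candidates, key=lambda a: critic_score(state, a))

print(act("kitchen"))
```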

Steering LLMs via Scalable Interactive Oversight

Intermediate
Enyu Zhou, Zhiheng Xi et al. · Feb 4 · arXiv

The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can't always explain exactly what they want or check the results well. It proposes scalable interactive oversight, so the model elicits the user's requirements and can be steered while it works.

#scalable oversight #interactive alignment #requirement elicitation

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

Intermediate
Yinger Zhang, Shutong Jiang et al. · Jan 26 · arXiv

DeepPlanning is a new benchmark that tests whether AI can make long, realistic plans that fit time and money limits (see the sketch below).

#long-horizon planning #agentic tool use #global constrained optimization
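
"Fit time and money limits" means a plan can be checked mechanically against global constraints. The toy checker below illustrates that idea only; it is not DeepPlanning's actual verifier, and the `Step` fields are invented for the example.

```python
# Toy constraint checker in the spirit of "verifiable constraints":
# a plan passes only if its total cost and total time stay within limits.
# Illustrative sketch, not DeepPlanning's evaluation code.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    cost: float      # e.g. dollars
    duration: float  # e.g. hours

def plan_is_valid(steps, budget, deadline):
    total_cost = sum(s.cost for s in steps)
    total_time = sum(s.duration for s in steps)
    return total_cost <= budget and total_time <= deadline

plan = [Step("book flight", 300, 2), Step("hotel", 450, 1), Step("museum tour", 60, 3)]
print(plan_is_valid(plan, budget=1000, deadline=8))   # True
print(plan_is_valid(plan, budget=500, deadline=8))    # False
```

Real benchmark constraints would be richer than two sums, but the pass/fail structure is the point of "verifiable".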

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Intermediate
Chi-Pin Huang, Yunze Man et al. · Jan 14 · arXiv

Fast-ThinkAct teaches a robot to plan with a few tiny hidden "thought tokens" instead of long paragraphs, making it much faster while staying smart (see the sketch below).

#Vision-Language-Action #latent reasoning #verbalizable planning
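
"A few tiny hidden thought tokens" suggests the plan is a short sequence of latent vectors rather than a long written chain of thought. The toy below only illustrates that shape of computation, with invented dimensions and random weights; it is not the paper's model: a planner maps the observation to K latent vectors, and a small action head decodes them directly into an action.

```python
# Toy illustration of latent planning: K small "thought" vectors instead of
# paragraphs of generated text. Shapes and names are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
K, D, N_ACTIONS = 4, 16, 6                    # 4 latent thought tokens of dim 16, 6 actions

W_plan = rng.normal(size=(D, K * D))          # stand-in for a learned planner
W_act = rng.normal(size=(K * D, N_ACTIONS))   # stand-in for the action head

def plan_and_act(observation_embedding):
    """observation_embedding: (D,) vector summarizing the visual input."""
    latent_thoughts = np.tanh(observation_embedding @ W_plan)   # (K*D,) latent plan
    action_logits = latent_thoughts @ W_act                     # decode plan -> action
    return int(action_logits.argmax())

obs = rng.normal(size=D)
print(plan_and_act(obs))
```

Because the plan here is K vectors rather than hundreds of generated tokens, the planning step is a single matrix multiply, which is where the speed-up the summary mentions would come from.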

User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale

Intermediate
Jungho Cho, Minbyul Jeong et al. · Jan 13 · arXiv

The paper builds a new way to create realistic, long multi-turn conversations between simulated users and an AI assistant that calls tools such as databases (see the sketch below).

#multi-turn dialogue generation #tool use #user simulation
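
The tags (user simulation, tool use) suggest a generation loop where a simulated user converses with an assistant that can call tools. The skeleton below is a guess at that general shape, with a fake database tool; it is not the paper's pipeline, and all names are placeholders.

```python
# Skeleton of a simulated user <-> assistant loop with a fake "database" tool.
# Purely illustrative; the paper's actual generation pipeline is not shown here.
FAKE_DB = {"order 42": "shipped", "order 7": "processing"}

def simulated_user(turn):
    """Stand-in for a user simulator (would normally be an LLM persona)."""
    script = ["Where is order 42?", "And order 7?", "Thanks, that's all."]
    return script[turn] if turn < len(script) else None

def assistant(message):
    """Stand-in assistant that decides whether to call the database tool."""
    for key, status in FAKE_DB.items():
        if key.split()[-1] in message:
            return f"[tool:db_lookup('{key}')] Your {key} is {status}."
    return "Happy to help!"

dialogue = []
for turn in range(10):
    user_msg = simulated_user(turn)
    if user_msg is None:
        break
    dialogue.append(("user", user_msg))
    dialogue.append(("assistant", assistant(user_msg)))

for role, text in dialogue:
    print(f"{role}: {text}")
```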

Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Intermediate
Jiacheng Ye, Shansan Gong et al. · Dec 27 · arXiv

Dream-VL and Dream-VLA use a diffusion language model backbone to understand images, talk about them, and plan actions better than many regular (autoregressive) models.

#diffusion language model #vision-language model #vision-language-action

Active Intelligence in Video Avatars via Closed-loop World Modeling

Intermediate
Xuanhua He, Tianyu Yang et al. · Dec 23 · arXiv

The paper turns video avatars from passive puppets into active doers that can plan, act, check their own work, and fix mistakes over many steps (see the sketch below).

#ORCA #L-IVA #Internal World Model
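
"Plan, act, check their own work, and fix mistakes" describes a closed-loop pattern built around an internal world model. The toy loop below shows only that pattern; it is not the ORCA / L-IVA system, and every function is a stand-in.

```python
# Toy closed-loop agent: plan -> act -> check -> retry the step that failed.
# Illustrative only; not the ORCA / L-IVA implementation.
def plan(goal):
    return [f"step {i} toward {goal}" for i in range(3)]

attempts = {}

def act(step):
    """Pretend the second step fails on its first attempt and succeeds on retry."""
    attempts[step] = attempts.get(step, 0) + 1
    return "ok" if attempts[step] > 1 or not step.startswith("step 1") else "fail"

def check(result):
    """Stand-in for comparing the outcome against the internal world model."""
    return result == "ok"

for step in plan("pour a cup of tea"):
    result = act(step)
    if not check(result):       # self-check caught a mistake
        result = act(step)      # fix it by retrying the step
    print(step, "->", result)
```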

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Intermediate
Minh V. T. Thai, Tue Le et al. · Dec 20 · arXiv

SWE-EVO is a new test (benchmark) that checks whether AI coding agents can upgrade real software projects over many steps, not just fix one small bug (see the sketch below).

#SWE-EVO #software evolution #coding agents
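
"Upgrade real software projects over many steps" implies a harness that feeds the agent a sequence of evolution goals on one repository and verifies each stage, for example with tests. The skeleton below is a hypothetical illustration of such a harness; it is not SWE-EVO's actual interface, and the agent, test command, and milestones are placeholders.

```python
# Hypothetical harness skeleton: run an agent through successive evolution
# milestones on one repo and verify each stage with a test command.
import subprocess

class DummyAgent:
    """Placeholder agent; a real coding agent would edit files in the repo."""
    def apply_changes(self, repo_path, spec):
        print(f"(agent would now implement: {spec} in {repo_path})")

def run_tests(repo_path):
    """True if the repo's test suite passes at this stage (placeholder command)."""
    proc = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_path,
                          capture_output=True)
    return proc.returncode == 0

def evaluate(agent, repo_path, milestones):
    passed = 0
    for spec in milestones:                      # successive evolution goals
        agent.apply_changes(repo_path, spec)
        if not run_tests(repo_path):             # verify before moving on
            break
        passed += 1
    return passed / len(milestones)

score = evaluate(DummyAgent(), ".", ["add async API", "migrate config format"])
print(f"milestones completed: {score:.0%}")
```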

MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Intermediate
Ruicheng Zhang, Mingyang Zhang et al. · Dec 7 · arXiv

Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive, so MIND-V generates them instead: a hierarchical video generator for long-horizon manipulation, kept physically plausible with RL-based alignment.

#hierarchical video generation #robotic manipulation #long-horizon planning