Robots often learn good hand motions during training but get confused when the scene or the instructions change even a little at test time.
AgentArk teaches one language model to think like a whole team of models that debate, so it can solve tough problems quickly without running a long, expensive debate at answer time.
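The paper's exact recipe isn't reproduced here, but the general idea of distilling a debate into one model can be sketched in a few lines. Everything below (the `generate` stub, the round count, the prompt wording) is a hypothetical illustration under assumed helpers, not AgentArk's actual method.

```python
# Minimal sketch of distilling multi-agent debate into one model.
# `generate` is a hypothetical stand-in for any LLM call; the round
# count and prompt wording are invented for this example.

def generate(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call (an API or a local model)."""
    raise NotImplementedError

def run_debate(question: str, agents: list[str], rounds: int = 2) -> list[str]:
    """Collect a transcript where each agent answers, then revises
    after reading what the other agents said."""
    answers = [generate(a, question) for a in agents]
    transcript = list(answers)
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (f"Question: {question}\n"
                      f"Other agents said:\n{others}\n"
                      "Revise your answer and point out any mistakes you see.")
            revised.append(generate(agent, prompt))
        answers = revised
        transcript.extend(answers)
    return transcript

def build_training_pair(question: str, transcript: list[str]) -> dict:
    """Turn one debate transcript into one supervised example, so a
    single student model learns to produce debate-style reasoning
    in a single forward pass, with no debate at answer time."""
    return {"input": question, "target": "\n".join(transcript)}
```

The key point is the last function: the expensive multi-model debate happens only during data generation, and the student is fine-tuned to imitate its outcome directly.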
Parallel-Probe is a simple add-on that runs many AI “thought paths” at once and stops them early once they already agree.
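To make “stop early when they agree” concrete, here is a rough sketch of that pattern. The `step_chain` and `extract_answer` stubs, the quorum threshold, and the step budget are all assumptions made for this illustration, not Parallel-Probe's real design.

```python
from collections import Counter

def step_chain(chain: str) -> str:
    """Placeholder: extend one reasoning chain by a small step."""
    raise NotImplementedError

def extract_answer(chain: str) -> str | None:
    """Placeholder: pull a candidate answer out of a chain, if one exists yet."""
    raise NotImplementedError

def parallel_probe(question: str, n_paths: int = 8,
                   max_steps: int = 50, quorum: float = 0.75) -> str:
    """Run n_paths reasoning chains in lockstep and stop as soon as a
    large enough fraction of them already give the same answer."""
    chains = [question] * n_paths
    for _ in range(max_steps):
        chains = [step_chain(c) for c in chains]
        answers = [a for a in (extract_answer(c) for c in chains)
                   if a is not None]
        if answers:
            best, count = Counter(answers).most_common(1)[0]
            if count / n_paths >= quorum:  # early agreement: stop here
                return best
    # no early consensus: fall back to a plain majority vote
    final = [extract_answer(c) or "" for c in chains]
    return Counter(final).most_common(1)[0][0]
```

The saving comes from the early return: easy questions where the paths converge quickly never pay for the full step budget.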
AutoFigure is an AI system that reads long scientific texts and then thinks, plans, and draws clear, good-looking figures—like a careful student who makes a neat, accurate poster from a long chapter.
This paper builds an AI team that can make real full-stack websites (frontend, backend, and database) from plain English instructions.
This paper introduces 3DiMo, a new way to control how people move in generated videos while still letting text flexibly steer the camera movement.
SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.
Reasoning Cache (RC) is a new way for AI to think in steps: it writes some thoughts, makes a short summary, throws away the long thoughts, and then keeps going using only the summary.
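The summarize-and-discard loop in that sentence is easy to picture in code. Below is a minimal sketch of that loop; `think`, `summarize`, and `is_done` are hypothetical stand-ins for LLM calls, and the round limit is invented for the example.

```python
def think(context: str) -> str:
    """Placeholder: generate the next chunk of step-by-step reasoning."""
    raise NotImplementedError

def summarize(thoughts: str) -> str:
    """Placeholder: compress a chunk of reasoning into a short summary."""
    raise NotImplementedError

def is_done(summary: str) -> bool:
    """Placeholder: check whether the summary already contains an answer."""
    raise NotImplementedError

def reasoning_cache(question: str, max_rounds: int = 10) -> str:
    """Think in chunks, but carry forward only a running summary:
    write thoughts -> summarize -> discard the long thoughts -> continue."""
    summary = ""
    for _ in range(max_rounds):
        context = f"Question: {question}\nNotes so far: {summary}"
        thoughts = think(context)  # long, detailed reasoning for this round
        summary = summarize(summary + "\n" + thoughts)  # short replacement
        # the full `thoughts` string is dropped here; only `summary` persists
        if is_done(summary):
            break
    return summary
```

The point of the design is that the context passed to `think` stays short no matter how long the overall reasoning runs, because only the summary survives each round.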
LIVE is a new way to train video-making AIs so their mistakes don’t snowball over long videos.
MemGUI-Bench is a new test that checks how well phone-controlling AI agents can remember important information both during a task and across different tries.
This paper builds ID-MoCQA, a new two-step (multi-hop) quiz set about Indonesian culture that makes AI connect clues before answering.
The paper asks a simple question: when an AI sees a picture and some text but the instructions say 'only trust the picture,' how does it decide which one to follow?