Papers1055

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Jing Tan, Zhaoyang Zhang et al.Jan 5arXiv

Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.

#text-guided image editing#object-level transformation#reinforcement learning

InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Intermediate

Shuai Yuan, Yantai Yang et al.Jan 5arXiv

InfiniteVGGT is a streaming 3D vision system that can keep working forever on live video without running out of memory.

#InfiniteVGGT#rolling memory#causal attention

DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

Intermediate

Renke Wang, Zhenyu Zhang et al.Jan 5arXiv

DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.

#human mesh recovery#SMPL-X#dense correspondence

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

Intermediate

Shikun Sun, Liao Qu et al.Jan 5arXiv

Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.

#Visual Autoregressive (VAR)#Reinforcement Learning#GRPO

VIBE: Visual Instruction Based Editor

Intermediate

Grigorii Alekseenko, Aleksandr Gordeev et al.Jan 5arXiv

VIBE is a tiny but mighty image editor that listens to your words and changes pictures while keeping the original photo intact unless you ask otherwise.

#instruction-based image editing#vision-language model#diffusion model

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Intermediate

Huichao Zhang, Liao Qu et al.Jan 5arXiv

NextFlow is a single, decoder-only Transformer that can read and write both text and images in one continuous sequence.

#Next-Scale Prediction#Autoregressive Transformer#Dual-Codebook Tokenization

Confidence Estimation for LLMs in Multi-turn Interactions

Intermediate

Caiqi Zhang, Ruihan Yang et al.Jan 5arXiv

This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.

#multi-turn confidence estimation#LLM calibration#InfoECE

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

Intermediate

Muxi Diao, Lele Yang et al.Jan 5arXiv

Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.

#Entropy-Adaptive Fine-Tuning#confident conflicts#token-level entropy

MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics

Intermediate

Zhuofan Shi, Hubao A et al.Jan 5arXiv

MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.

#Molecular Dynamics#LAMMPS#Code Generation

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Intermediate

Hao Bai, Alexey Taymanov et al.Jan 5arXiv

WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.

#WebGym#visual web agents#vision-language models

CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Intermediate

Shuhang Chen, Yunqiu Xu et al.Jan 5arXiv

This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).

#visual mathematical reasoning#multimodal large language models#perception-reasoning alignment

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Intermediate

Dasol Choi, DongGeon Lee et al.Jan 5arXiv

COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.

#policy alignment#allowlist denylist#enterprise AI safety

59 60 61 62 63