Papers (1262)

VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.

#visual grounding #referring expression comprehension #tool-integrated visual reasoning

EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Intermediate

Hongyu Li, Manyuan Zhang et al. · Dec 5 · arXiv

EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.

#instruction-based image editing #iterative reasoning #multimodal large language model

World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

Intermediate

Zhiting Mei, Tenny Yin et al. · Dec 5 · arXiv

This paper teaches video-making AI models to say how sure they are about each tiny part of every frame they create.

#controllable video generation #uncertainty quantification #calibration

SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Intermediate

Wenhao Yan, Sheng Ye et al. · Dec 5 · arXiv

SCAIL is a new AI system that turns a single character image into a studio-quality animation by following the moves in a driving video.

#character animation #3D pose representation #occlusion-aware pose

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Beginner

Ziyang Wang, Honglu Zhou et al. · Dec 5 · arXiv

Long Video Understanding (LVU) is hard because the important clues are tiny, far apart in time, and buried in hours of mostly unimportant footage.

#Active Video Perception #Long Video Understanding #Plan-Observe-Reflect

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Intermediate

Zhenpeng Su, Leiyu Pan et al. · Dec 5 · arXiv

Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.

#reinforcement learning #PPO-clip #KL penalty

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Intermediate

Zijun Wang, Panwen Hu et al. · Dec 5 · arXiv

ProPhy is a new two-step method that helps video AIs follow real-world physics, not just make pretty pictures.

#physics-aware video generation #mixture-of-experts #token-level routing

BEAVER: An Efficient Deterministic LLM Verifier

Intermediate

Tarun Suresh, Nalin Wadhwa et al. · Dec 5 · arXiv

BEAVER is a new way to check, with provable guarantees, how likely a language model is to give answers that obey important rules.

#BEAVER #deterministic verification #large language models

AI & Human Co-Improvement for Safer Co-Superintelligence

Beginner

Jason Weston, Jakob Foerster · Dec 5 · arXiv

This paper argues that the fastest and safest path to super-smart AI is for humans and AIs to improve together, not for AI to improve alone.

#Co-improvement #Human-AI collaboration #Co-superintelligence

SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Intermediate

Elisabetta Fedele, Francis Engelmann et al. · Dec 5 · arXiv

SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.

#3D generative modeling #test-time guidance #latent space intervention

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Intermediate

Kevin Cannons, Saeed Ranjbar Alvar et al. · Dec 4 · arXiv

This paper builds TAD, a brand-new test that checks if AI can understand what happens over time in real driving videos.

#Temporal understanding #Autonomous driving #Vision-language models

Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Intermediate

Yanran Zhang, Ziyi Wang et al. · Dec 4 · arXiv

This paper teaches a computer to turn one single picture into a moving 3D scene that stays consistent from every camera angle.

#4D scene generation #single-image to 4D #joint geometry and motion
