VideoCoF is a new way to edit videos that first figures out WHERE to edit and then performs the edit, like thinking before acting.
This paper teaches a language model to think along several paths at the same time instead of one step after another.
Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.
LLM multi-agent systems often fail quietly (no crash) and leave long, twisty logs that are hard to debug by hand.
Robots need lots of realistic, long videos to learn, but collecting them is slow and expensive.
OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop those attacks.
The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.
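To make the contrast concrete, here is a minimal sketch of the two grading schemes; the function names and scoring rules are illustrative assumptions, not taken from the paper. Per-digit grading hands out partial credit even when the assembled number is wrong, while whole-number grading only rewards a fully correct answer:

```python
def per_digit_reward(pred: str, target: str) -> float:
    """Grade each digit position independently (dense but myopic)."""
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(target), 1)

def whole_number_reward(pred: str, target: str) -> float:
    """Grade only the final assembled number (sparse but aligned)."""
    return 1.0 if pred == target else 0.0

# A prediction that is off in one digit still earns high per-digit
# credit, even though the number itself is wrong:
print(per_digit_reward("1234", "1239"))     # 0.75
print(whole_number_reward("1234", "1239"))  # 0.0
```

The point of the whole-number scheme is that the reward matches what we actually care about (the final number being right), rather than encouraging digit-by-digit near-misses.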
This paper fixes two big problems in image-making AI that builds pictures step by step: it often practices with perfect answers (teacher forcing) but must perform using its own imperfect guesses later, and the earliest coarse steps are much harder than the later fine steps.
VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.
EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.
This paper teaches video-making AI models to say how sure they are about each tiny part of every frame they create.
SCAIL is a new AI system that turns a single character image into a studio-quality animation by following the moves in a driving video.