🎓 How I Study AI
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (22)


ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Beginner
Hengjia Li, Liming Jiang et al. · Jan 6 · arXiv

ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.

#reasoning-centric image editing · #reinforcement learning · #chain-of-thought

Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate
Sashuai Zhou, Qiang Zhou et al. · Jan 6 · arXiv

Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.

#reasoning-aware image generation · #structured planning · #edit-only prompt

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Beginner
Junyi Chen, Tong He et al. · Jan 5 · arXiv

VINO is a single AI model that can make and edit both images and videos by listening to text and looking at reference pictures and clips at the same time.

#VINO · #unified visual generator · #multimodal diffusion transformer

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Beginner
Zhe Cao, Tao Wang et al. · Dec 24 · arXiv

T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.

#Text-to-Audio-Video generation · #multimodal evaluation · #cross-modal alignment

IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

Intermediate
Yuanhang Li, Yiren Song et al. · Dec 17 · arXiv

IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.

#video editing · #visual effects · #diffusion transformer

Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Intermediate
Jooyeol Yun, Jaegul Choo · Dec 16 · arXiv

Vector Prism helps computers animate SVG images by first discovering which tiny shapes belong together as meaningful parts.

#SVG animation · #semantic restructuring · #vision–language models

UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving

Intermediate
Hao Lu, Ziyang Liu et al. · Dec 10 · arXiv

UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.

#UniUGP · #vision-language-action · #world model

MOA: Multi-Objective Alignment for Role-Playing Agents

Intermediate
Chonghua Liao, Ke Wang et al. · Dec 10 · arXiv

MOA trains role-playing agents to juggle several goals at once, like staying in character, following instructions, and using the right tone.

#multi-objective alignment · #role-playing agents · #reinforcement learning

Position: Universal Aesthetic Alignment Narrows Artistic Expression

Intermediate
Wenqi Marshall Guo, Qingyun Qian et al. · Dec 9 · arXiv

The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.

#universal aesthetic alignment · #aesthetic pluralism · #reward models

Unified Video Editing with Temporal Reasoner

Intermediate
Xiangpeng Yang, Ji Xie et al. · Dec 8 · arXiv

VideoCoF is a new way to edit videos that first figures out WHERE to edit and then does the edit, like thinking before acting.

#video editing · #diffusion transformer · #chain-of-frames