How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (131)

#reinforcement learning

Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate
Sashuai Zhou, Qiang Zhou et al. · Jan 6 · arXiv

Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.

#reasoning-aware image generation · #structured planning · #edit-only prompt

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Intermediate
Jing Tan, Zhaoyang Zhang et al. · Jan 5 · arXiv

Talk2Move is a training recipe that lets an image editor move, rotate, and resize the exact object you mention using plain text, while keeping the rest of the picture stable.

#text-guided image editing · #object-level transformation · #reinforcement learning

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Intermediate
Hao Bai, Alexey Taymanov et al. · Jan 5 · arXiv

WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.

#WebGym · #visual web agents · #vision-language models
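The core idea of an agent training environment like this is a reset/step interaction loop. Below is a minimal sketch of that loop; `ToyWebEnv` and its observation format are illustrative stand-ins and not WebGym's actual interface.

```python
class ToyWebEnv:
    """Toy stand-in for a web-agent environment: the agent must
    'click' the correct element id to complete the task."""

    def __init__(self, target_element):
        self.target = target_element

    def reset(self):
        # A real environment would return a page screenshot plus metadata.
        return {"screenshot": "<pixels>", "elements": [0, 1, 2]}

    def step(self, action):
        # Reward 1.0 only when the agent clicks the target element.
        done = action == self.target
        reward = 1.0 if done else 0.0
        return self.reset(), reward, done

env = ToyWebEnv(target_element=2)
obs = env.reset()
obs, reward, done = env.step(2)
print(reward, done)  # → 1.0 True
```

Scaling this pattern to ~300,000 tasks is what lets an RL agent practice on diverse, realistic websites rather than a handful of hand-built pages.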

CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

Intermediate
Shuhang Chen, Yunqiu Xu et al. · Jan 5 · arXiv

This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).

#visual mathematical reasoning · #multimodal large language models · #perception-reasoning alignment

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Intermediate
Xu Guo, Fulong Ye et al. · Jan 4 · arXiv

DreamID-V is a new AI method that swaps faces in videos while keeping the body movements, expressions, lighting, and background steady and natural.

#video face swapping · #image face swapping · #diffusion transformer

Scaling Open-Ended Reasoning to Predict the Future

Intermediate
Nikhil Chandak, Shashwat Goel et al. · Dec 31 · arXiv

The paper teaches small language models to predict open-ended future events by turning daily news into thousands of safe, graded practice questions.

#open-ended forecasting · #calibrated prediction · #Brier score
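The Brier score mentioned in the tags is a standard way to grade probabilistic forecasts: the mean squared difference between predicted probabilities and binary outcomes. A minimal implementation:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities (0..1)
    and binary outcomes (0 or 1). Lower is better."""
    assert len(probs) == len(outcomes), "one probability per outcome"
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A confident correct forecast approaches 0; always guessing 0.5 scores 0.25.
print(round(brier_score([0.9, 0.2, 0.7], [1, 0, 1]), 4))  # → 0.0467
```

Because the score rewards well-calibrated probabilities rather than just right/wrong answers, it is a natural grading signal for the paper's auto-generated forecasting questions.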

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Intermediate
Weixun Wang, XiaoXiao Xu et al. · Dec 31 · arXiv

This paper builds an open, end-to-end ecosystem (ALE) that lets AI agents plan, act, and fix their own mistakes across many steps in real computer environments.

#agentic LLMs · #reinforcement learning · #IPA

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

Intermediate
Karthik Dharmarajan, Wenlong Huang et al. · Dec 31 · arXiv

Dream2Flow lets a robot watch a short, AI-generated video of a task and then do that task in real life by following object motion in 3D.

#3D object flow · #video generation for robotics · #open-world manipulation

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Intermediate
Junru Lu, Jiarui Qin et al. · Dec 31 · arXiv

Youtu-LLM is a small (1.96B) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.

#lightweight LLM · #agentic mid-training · #trajectory data

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

Intermediate
Yuchen Shi, Yuzheng Cai et al. · Dec 31 · arXiv

Youtu-Agent is a build-and-grow factory for AI agents that cuts manual setup and keeps agents improving over time.

#LLM agents · #automated agent generation · #modular architecture

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Intermediate
Yong Xien Chng, Tao Hu et al. · Dec 30 · arXiv

SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.

#multimodal agent · #vision-language model · #reinforcement learning
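An agent that interleaves reasoning with tool calls typically follows a dispatch loop: the model proposes an action, the matching tool runs, and the result is fed back into the context. The sketch below is illustrative; the tool names (`text_search`, etc.) and action format are assumptions, not SenseNova-MARS's actual interface.

```python
def run_agent(model_step, tools, max_turns=5):
    """Alternate model reasoning with tool calls until the model
    emits a final answer or the turn budget runs out."""
    context = []
    for _ in range(max_turns):
        action = model_step(context)            # model proposes the next step
        if action["type"] == "answer":
            return action["content"]            # done: return the final answer
        tool = tools[action["type"]]            # e.g. "text_search"
        context.append(tool(action["content"]))  # feed tool output back in
    return None                                  # budget exhausted

# Stub model: search once, then answer based on what came back.
def model_step(context):
    if not context:
        return {"type": "text_search", "content": "capital of France"}
    return {"type": "answer", "content": context[-1]}

tools = {"text_search": lambda query: f"result for: {query}"}
print(run_agent(model_step, tools))  # → result for: capital of France
```

The same loop extends to image search and cropping by registering more entries in `tools`; reinforcement learning then shapes *when* the model chooses each tool.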

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Intermediate
Zhe Huang, Hao Wen et al. · Dec 30 · arXiv

Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show; this paper curbs that habit by training on counterfactual generated videos that contradict those priors.

#multimodal large language model#video understanding#visual hallucination
678910