Papers (1055)

Region-Constraint In-Context Generation for Instructional Video Editing

Intermediate

Zhongwei Zhang, Fuchen Long et al. · Dec 19 · arXiv

ReCo is a new way to edit videos using only text instructions that describe what to change, with no extra masks needed.

#instruction-based video editing #in-context generation #region constraint

InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

Intermediate

Hoiyeong Jin, Hyojin Jang et al. · Dec 19 · arXiv

InsertAnywhere is a two-stage system that lets you add a new object into any video so it looks like it was always there.

#video object insertion #4D scene geometry #diffusion video generation

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate

Rang Li, Lei Li et al. · Dec 19 · arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding #multimodal large language models #benchmark

3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework

Intermediate

Tobias Sautter, Jan-Niklas Dihlmann et al. · Dec 19 · arXiv

3D-RE-GEN turns a single photo of a room into a full 3D scene with separate, textured objects and a usable background.

#single-image 3D reconstruction #scene composition #context-aware inpainting

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Intermediate

Jiajun Wu, Jian Yang et al. · Dec 19 · arXiv

The paper introduces UCoder, a way to teach a code-generating AI to get better without using any outside datasets, not even unlabeled code.

#unsupervised code generation #self-training #internal probing

Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

Intermediate

Zeyuan Allen-Zhu · Dec 19 · arXiv

The paper introduces Canon layers, tiny add-ons that let nearby words share information directly, like passing notes along a row of desks.

#Canon layers #horizontal information flow #transformer architecture

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

Intermediate

Yuqing Li, Jiangnan Li et al. · Dec 19 · arXiv

Humans keep a big-picture memory (a “mindscape”) when reading long texts; this paper teaches AI to do the same.

#Retrieval-Augmented Generation #Mindscape #Hierarchical Summarization

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

Intermediate

Rujiao Long, Yang Li et al. · Dec 19 · arXiv

Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.

#Reasoning Palette #latent contextualization #VAE

Reinforcement Learning for Self-Improving Agent with Skill Library

Intermediate

Jiongxiao Wang, Qiaojing Yan et al. · Dec 18 · arXiv

This paper teaches AI agents to learn new reusable skills and get better over time by using reinforcement learning, not just prompts.

#Reinforcement Learning #Skill Library #Sequential Rollout

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Intermediate

Chiao-An Yang, Ryo Hachiuma et al. · Dec 18 · arXiv

This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.

#4D perception #multimodal large language models #perceptual distillation

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Intermediate

Junbo Li, Peng Zhou et al. · Dec 18 · arXiv

Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.

#Turn-PPO #multi-turn reinforcement learning #agentic LLMs

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Intermediate

Hanlin Wang, Hao Ouyang et al. · Dec 18 · arXiv

WorldCanvas lets you make videos where things happen exactly how you ask by combining three inputs: text (what happens), drawn paths called trajectories (when and where it happens), and reference images (who it is).

#WorldCanvas #promptable world events #trajectory-controlled video generation
